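I recently reopened a book I bought a while ago, "케글 우승작으로 배우는 머신러닝" (roughly, "Learning Machine Learning from Kaggle-Winning Solutions").
I worked through a little of it right after buying it, then dropped it for one reason or another; it suddenly came to mind, so I picked it up again.
The competition is an old one, but I logged in to Kaggle for the first time in a while, downloaded the data, and started on the first project.
The first one I tried comes from Spain's Santander Bank: building a model that recommends products to customers who visit the bank.
The training data has about 13.65 million rows (13,647,309) and 48 variables.
So far I have only gone as far as taking an overall look at the data, and after such a long break it was not easy.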
In [1]:
import pandas as pd
import numpy as np

trn = pd.read_csv('train_ver2.csv')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (5,8,11,15) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
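The `DtypeWarning` above says that columns 5, 8, 11 and 15 hold mixed types. Going by the column order that `trn.info()` prints, those would be `age`, `antiguedad`, `indrel_1mes` and `conyuemp`. The warning itself suggests passing `dtype` or `low_memory=False`. Here is a minimal sketch of the `dtype` route, assuming it is acceptable to read those four columns as plain strings and clean them afterwards. The tiny in-memory CSV is made-up stand-in data, not the real `train_ver2.csv`.

```python
import io

import pandas as pd

# Hypothetical miniature of train_ver2.csv: same column names, invented rows.
csv_text = (
    "ncodpers,age,antiguedad,indrel_1mes,conyuemp\n"
    "1375586, 35,      6,1.0,\n"
    "1050611, NA,     NA,P,N\n"
)

# Read the columns named in the warning as strings so pandas does not have
# to guess a type chunk by chunk.
mixed_cols = ["age", "antiguedad", "indrel_1mes", "conyuemp"]
sample = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in mixed_cols})

print(sample.dtypes)
```

With the real file the call would presumably look like `pd.read_csv('train_ver2.csv', dtype={...})` with the same dictionary.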
In [2]:
trn.shape
Out[2]:
(13647309, 48)
In [3]:
trn.head()
Out[3]:
| | fecha_dato | ncodpers | ind_empleado | pais_residencia | sexo | age | fecha_alta | ind_nuevo | antiguedad | indrel | ... | ind_hip_fin_ult1 | ind_plan_fin_ult1 | ind_pres_fin_ult1 | ind_reca_fin_ult1 | ind_tjcr_fin_ult1 | ind_valo_fin_ult1 | ind_viv_fin_ult1 | ind_nomina_ult1 | ind_nom_pens_ult1 | ind_recibo_ult1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-28 | 1375586 | N | ES | H | 35 | 2015-01-12 | 0.0 | 6 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 1 | 2015-01-28 | 1050611 | N | ES | V | 23 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 2 | 2015-01-28 | 1050612 | N | ES | V | 23 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 3 | 2015-01-28 | 1050613 | N | ES | H | 22 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 4 | 2015-01-28 | 1050614 | N | ES | V | 23 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |

5 rows × 48 columns
In [4]:
for col in trn.columns:
    print('{}\n'.format(trn[col].head()))
0 2015-01-28 1 2015-01-28 2 2015-01-28 3 2015-01-28 4 2015-01-28 Name: fecha_dato, dtype: object
0 1375586 1 1050611 2 1050612 3 1050613 4 1050614 Name: ncodpers, dtype: int64
0 N 1 N 2 N 3 N 4 N Name: ind_empleado, dtype: object
0 ES 1 ES 2 ES 3 ES 4 ES Name: pais_residencia, dtype: object
0 H 1 V 2 V 3 H 4 V Name: sexo, dtype: object
0 35 1 23 2 23 3 22 4 23 Name: age, dtype: object
0 2015-01-12 1 2012-08-10 2 2012-08-10 3 2012-08-10 4 2012-08-10 Name: fecha_alta, dtype: object
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: ind_nuevo, dtype: float64
0 6 1 35 2 35 3 35 4 35 Name: antiguedad, dtype: object
0 1.0 1 1.0 2 1.0 3 1.0 4 1.0 Name: indrel, dtype: float64
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN Name: ult_fec_cli_1t, dtype: object
0 1 1 1 2 1 3 1 4 1 Name: indrel_1mes, dtype: object
0 A 1 I 2 I 3 I 4 A Name: tiprel_1mes, dtype: object
0 S 1 S 2 S 3 S 4 S Name: indresi, dtype: object
0 N 1 S 2 N 3 N 4 N Name: indext, dtype: object
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN Name: conyuemp, dtype: object
0 KHL 1 KHE 2 KHE 3 KHD 4 KHE Name: canal_entrada, dtype: object
0 N 1 N 2 N 3 N 4 N Name: indfall, dtype: object
0 1.0 1 1.0 2 1.0 3 1.0 4 1.0 Name: tipodom, dtype: float64
0 29.0 1 13.0 2 13.0 3 50.0 4 50.0 Name: cod_prov, dtype: float64
0 MALAGA 1 CIUDAD REAL 2 CIUDAD REAL 3 ZARAGOZA 4 ZARAGOZA Name: nomprov, dtype: object
0 1.0 1 0.0 2 0.0 3 0.0 4 1.0 Name: ind_actividad_cliente, dtype: float64
0 87218.10 1 35548.74 2 122179.11 3 119775.54 4 NaN Name: renta, dtype: float64
0 02 - PARTICULARES 1 03 - UNIVERSITARIO 2 03 - UNIVERSITARIO 3 03 - UNIVERSITARIO 4 03 - UNIVERSITARIO Name: segmento, dtype: object
0 0 1 0 2 0 3 0 4 0 Name: ind_ahor_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_aval_fin_ult1, dtype: int64
0 1 1 1 2 1 3 0 4 1 Name: ind_cco_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_cder_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_cno_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctju_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctma_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctop_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctpp_fin_ult1, dtype: int64
0 0 1 0 2 0 3 1 4 0 Name: ind_deco_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_deme_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_dela_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ecue_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_fond_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_hip_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_plan_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_pres_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_reca_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_tjcr_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_valo_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_viv_fin_ult1, dtype: int64
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: ind_nomina_ult1, dtype: float64
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: ind_nom_pens_ult1, dtype: float64
0 0 1 0 2 0 3 0 4 0 Name: ind_recibo_ult1, dtype: int64
In [5]:
trn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647309 entries, 0 to 13647308
Data columns (total 48 columns):
fecha_dato               object
ncodpers                 int64
ind_empleado             object
pais_residencia          object
sexo                     object
age                      object
fecha_alta               object
ind_nuevo                float64
antiguedad               object
indrel                   float64
ult_fec_cli_1t           object
indrel_1mes              object
tiprel_1mes              object
indresi                  object
indext                   object
conyuemp                 object
canal_entrada            object
indfall                  object
tipodom                  float64
cod_prov                 float64
nomprov                  object
ind_actividad_cliente    float64
renta                    float64
segmento                 object
ind_ahor_fin_ult1        int64
ind_aval_fin_ult1        int64
ind_cco_fin_ult1         int64
ind_cder_fin_ult1        int64
ind_cno_fin_ult1         int64
ind_ctju_fin_ult1        int64
ind_ctma_fin_ult1        int64
ind_ctop_fin_ult1        int64
ind_ctpp_fin_ult1        int64
ind_deco_fin_ult1        int64
ind_deme_fin_ult1        int64
ind_dela_fin_ult1        int64
ind_ecue_fin_ult1        int64
ind_fond_fin_ult1        int64
ind_hip_fin_ult1         int64
ind_plan_fin_ult1        int64
ind_pres_fin_ult1        int64
ind_reca_fin_ult1        int64
ind_tjcr_fin_ult1        int64
ind_valo_fin_ult1        int64
ind_viv_fin_ult1         int64
ind_nomina_ult1          float64
ind_nom_pens_ult1        float64
ind_recibo_ult1          int64
dtypes: float64(8), int64(23), object(17)
memory usage: 4.9+ GB
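`trn.info()` reports 4.9+ GB of memory, with 23 `int64` columns that are mostly 0/1 product flags. The sketch below is one possible way to shrink such a frame, not something the book prescribes. It assumes the flags really only hold 0 and 1 and that low-cardinality text columns such as `sexo` can be stored as categories. The small frame `df` is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the training frame.
df = pd.DataFrame({
    "sexo": ["H", "V", "V", "H"],
    "ind_cco_fin_ult1": [1, 1, 1, 0],
    "ind_recibo_ult1": [0, 0, 0, 0],
})

before = df.memory_usage(deep=True).sum()

# Product flags with no missing values fit in int8.
# (Columns containing NaN cannot be cast to int8 directly and are skipped.)
flag_cols = [c for c in df.columns if c.endswith("_ult1") and df[c].notna().all()]
df = df.astype({c: "int8" for c in flag_cols})

# Assumption: 'sexo' has only a handful of distinct values, so a category fits.
df["sexo"] = df["sexo"].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, "->", after)
```

On a frame this small the saving is negligible; the point is only the pattern of casting column groups.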
In [6]:
num_cols = [col for col in trn.columns[:24] if trn[col].dtype in ['int64', 'float64']]
trn[num_cols].describe()
Out[6]:
| | ncodpers | ind_nuevo | indrel | tipodom | cod_prov | ind_actividad_cliente | renta |
|---|---|---|---|---|---|---|---|
| count | 1.364731e+07 | 1.361958e+07 | 1.361958e+07 | 13619574.0 | 1.355372e+07 | 1.361958e+07 | 1.085293e+07 |
| mean | 8.349042e+05 | 5.956184e-02 | 1.178399e+00 | 1.0 | 2.657147e+01 | 4.578105e-01 | 1.342543e+05 |
| std | 4.315650e+05 | 2.366733e-01 | 4.177469e+00 | 0.0 | 1.278402e+01 | 4.982169e-01 | 2.306202e+05 |
| min | 1.588900e+04 | 0.000000e+00 | 1.000000e+00 | 1.0 | 1.000000e+00 | 0.000000e+00 | 1.202730e+03 |
| 25% | 4.528130e+05 | 0.000000e+00 | 1.000000e+00 | 1.0 | 1.500000e+01 | 0.000000e+00 | 6.871098e+04 |
| 50% | 9.318930e+05 | 0.000000e+00 | 1.000000e+00 | 1.0 | 2.800000e+01 | 0.000000e+00 | 1.018500e+05 |
| 75% | 1.199286e+06 | 0.000000e+00 | 1.000000e+00 | 1.0 | 3.500000e+01 | 1.000000e+00 | 1.559560e+05 |
| max | 1.553689e+06 | 1.000000e+00 | 9.900000e+01 | 1.0 | 5.200000e+01 | 1.000000e+00 | 2.889440e+07 |
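The cell above picks the numeric columns by comparing `dtype` against a list of names. As an alternative sketch (my own suggestion, not taken from the book), `DataFrame.select_dtypes` can do the same split. The small frame `df` below is a hypothetical stand-in that reuses values from the `head()` output.

```python
import pandas as pd

# Hypothetical miniature frame with two numeric and two text columns.
df = pd.DataFrame({
    "ncodpers": [1375586, 1050611],
    "sexo": ["H", "V"],
    "renta": [87218.10, 35548.74],
    "segmento": ["02 - PARTICULARES", "03 - UNIVERSITARIO"],
})

# Numeric columns (ints and floats) versus everything else.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```

On the real data this would be applied to `trn[trn.columns[:24]]` to keep the same restriction to the first 24 columns.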
In [9]:
cat_cols = [col for col in trn.columns[:24] if trn[col].dtype in ['O']]
trn[cat_cols].describe()
Out[9]:
| | fecha_dato | ind_empleado | pais_residencia | sexo | age | fecha_alta | antiguedad | ult_fec_cli_1t | indrel_1mes | tiprel_1mes | indresi | indext | conyuemp | canal_entrada | indfall | nomprov | segmento |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13647309 | 13619575 | 13619575 | 13619505 | 13647309 | 13619575 | 13647309 | 24793 | 13497528.0 | 13497528 | 13619575 | 13619575 | 1808 | 13461183 | 13619575 | 13553718 | 13457941 |
| unique | 17 | 5 | 118 | 2 | 235 | 6756 | 507 | 223 | 13.0 | 5 | 2 | 2 | 2 | 162 | 2 | 52 | 3 |
| top | 2016-05-28 | N | ES | V | 23 | 2014-07-28 | 0 | 2015-12-24 | 1.0 | I | S | N | N | KHE | N | MADRID | 02 - PARTICULARES |
| freq | 931453 | 13610977 | 13553710 | 7424252 | 542682 | 57389 | 134335 | 763 | 7277607.0 | 7304875 | 13553711 | 12974839 | 1791 | 4055270 | 13584813 | 4409600 | 7960220 |
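The `count` row above hints at how sparse some columns are: `conyuemp` has only 1,808 non-null values and `ult_fec_cli_1t` only 24,793, out of 13,647,309 rows. A small sketch of turning that into a per-column missing ratio follows. The frame `df` is made up for illustration; on the real data the same expression would run on `trn`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with mostly-missing columns.
df = pd.DataFrame({
    "sexo": ["H", "V", "V", "H"],
    "conyuemp": [np.nan, np.nan, np.nan, "N"],
    "ult_fec_cli_1t": [np.nan, "2015-12-24", np.nan, np.nan],
})

# isna() gives a boolean frame; the column mean is the fraction of missing rows.
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```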
In [10]:
for col in cat_cols:
    uniq = np.unique(trn[col].astype(str))
    print('-'*50)
    print('# col {}, n_uniq {}, uniq {}'.format(col, len(uniq), uniq))
-------------------------------------------------- # col fecha_dato, n_uniq 17, uniq ['2015-01-28' '2015-02-28' '2015-03-28' '2015-04-28' '2015-05-28' '2015-06-28' '2015-07-28' '2015-08-28' '2015-09-28' '2015-10-28' '2015-11-28' '2015-12-28' '2016-01-28' '2016-02-28' '2016-03-28' '2016-04-28' '2016-05-28'] -------------------------------------------------- # col ind_empleado, n_uniq 6, uniq ['A' 'B' 'F' 'N' 'S' 'nan'] -------------------------------------------------- # col pais_residencia, n_uniq 119, uniq ['AD' 'AE' 'AL' 'AO' 'AR' 'AT' 'AU' 'BA' 'BE' 'BG' 'BM' 'BO' 'BR' 'BY' 'BZ' 'CA' 'CD' 'CF' 'CG' 'CH' 'CI' 'CL' 'CM' 'CN' 'CO' 'CR' 'CU' 'CZ' 'DE' 'DJ' 'DK' 'DO' 'DZ' 'EC' 'EE' 'EG' 'ES' 'ET' 'FI' 'FR' 'GA' 'GB' 'GE' 'GH' 'GI' 'GM' 'GN' 'GQ' 'GR' 'GT' 'GW' 'HK' 'HN' 'HR' 'HU' 'IE' 'IL' 'IN' 'IS' 'IT' 'JM' 'JP' 'KE' 'KH' 'KR' 'KW' 'KZ' 'LB' 'LT' 'LU' 'LV' 'LY' 'MA' 'MD' 'MK' 'ML' 'MM' 'MR' 'MT' 'MX' 'MZ' 'NG' 'NI' 'NL' 'NO' 'NZ' 'OM' 'PA' 'PE' 'PH' 'PK' 'PL' 'PR' 'PT' 'PY' 'QA' 'RO' 'RS' 'RU' 'SA' 'SE' 'SG' 'SK' 'SL' 'SN' 'SV' 'TG' 'TH' 'TN' 'TR' 'TW' 'UA' 'US' 'UY' 'VE' 'VN' 'ZA' 'ZW' 'nan'] -------------------------------------------------- # col sexo, n_uniq 3, uniq ['H' 'V' 'nan'] -------------------------------------------------- # col age, n_uniq 219, uniq [' 2' ' 3' ' 4' ' 5' ' 6' ' 7' ' 8' ' 9' ' 10' ' 11' ' 12' ' 13' ' 14' ' 15' ' 16' ' 17' ' 18' ' 19' ' 20' ' 21' ' 22' ' 23' ' 24' ' 25' ' 26' ' 27' ' 28' ' 29' ' 30' ' 31' ' 32' ' 33' ' 34' ' 35' ' 36' ' 37' ' 38' ' 39' ' 40' ' 41' ' 42' ' 43' ' 44' ' 45' ' 46' ' 47' ' 48' ' 49' ' 50' ' 51' ' 52' ' 53' ' 54' ' 55' ' 56' ' 57' ' 58' ' 59' ' 60' ' 61' ' 62' ' 63' ' 64' ' 65' ' 66' ' 67' ' 68' ' 69' ' 70' ' 71' ' 72' ' 73' ' 74' ' 75' ' 76' ' 77' ' 78' ' 79' ' 80' ' 81' ' 82' ' 83' ' 84' ' 85' ' 86' ' 87' ' 88' ' 89' ' 90' ' 91' ' 92' ' 93' ' 94' ' 95' ' 96' ' 97' ' 98' ' 99' ' NA' '10' '100' '101' '102' '103' '104' '105' '106' '107' '108' '109' '11' '110' '111' '112' '113' '114' '115' '116' '117' '12' '126' 
'127' '13' '14' '15' '16' '163' '164' '17' '18' '19' '2' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '3' '30' '31' '32' '33' '34' '35' '36' '37' '38' '39' '4' '40' '41' '42' '43' '44' '45' '46' '47' '48' '49' '5' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '6' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '7' '70' '71' '72' '73' '74' '75' '76' '77' '78' '79' '8' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '9' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99'] -------------------------------------------------- # col fecha_alta, n_uniq 6757, uniq ['1995-01-16' '1995-01-17' '1995-01-23' ... '2016-05-30' '2016-05-31' 'nan'] -------------------------------------------------- # col antiguedad, n_uniq 506, uniq [' 0' ' 1' ' 2' ' 3' ' 4' ' 5' ' 6' ' 7' ' 8' ' 9' ' 10' ' 11' ' 12' ' 13' ' 14' ' 15' ' 16' ' 17' ' 18' ' 19' ' 20' ' 21' ' 22' ' 23' ' 24' ' 25' ' 26' ' 27' ' 28' ' 29' ' 30' ' 31' ' 32' ' 33' ' 34' ' 35' ' 36' ' 37' ' 38' ' 39' ' 40' ' 41' ' 42' ' 43' ' 44' ' 45' ' 46' ' 47' ' 48' ' 49' ' 50' ' 51' ' 52' ' 53' ' 54' ' 55' ' 56' ' 57' ' 58' ' 59' ' 60' ' 61' ' 62' ' 63' ' 64' ' 65' ' 66' ' 67' ' 68' ' 69' ' 70' ' 71' ' 72' ' 73' ' 74' ' 75' ' 76' ' 77' ' 78' ' 79' ' 80' ' 81' ' 82' ' 83' ' 84' ' 85' ' 86' ' 87' ' 88' ' 89' ' 90' ' 91' ' 92' ' 93' ' 94' ' 95' ' 96' ' 97' ' 98' ' 99' ' NA' ' 100' ' 101' ' 102' ' 103' ' 104' ' 105' ' 106' ' 107' ' 108' ' 109' ' 110' ' 111' ' 112' ' 113' ' 114' ' 115' ' 116' ' 117' ' 118' ' 119' ' 120' ' 121' ' 122' ' 123' ' 124' ' 125' ' 126' ' 127' ' 128' ' 129' ' 130' ' 131' ' 132' ' 133' ' 134' ' 135' ' 136' ' 137' ' 138' ' 139' ' 140' ' 141' ' 142' ' 143' ' 144' ' 145' ' 146' ' 147' ' 148' ' 149' ' 150' ' 151' ' 152' ' 153' ' 154' ' 155' ' 156' ' 157' ' 158' ' 159' ' 160' ' 161' ' 162' ' 163' ' 164' ' 165' ' 166' ' 167' ' 168' ' 169' ' 170' ' 171' ' 172' ' 173' ' 174' ' 175' ' 176' ' 177' ' 178' ' 179' ' 180' ' 181' ' 182' ' 183' ' 184' ' 185' ' 186' ' 187' ' 188' ' 189' ' 190' ' 191' ' 192' ' 193' ' 194' ' 
195' ' 196' ' 197' ' 198' ' 199' ' 200' ' 201' ' 202' ' 203' ' 204' ' 205' ' 206' ' 207' ' 208' ' 209' ' 210' ' 211' ' 212' ' 213' ' 214' ' 215' ' 216' ' 217' ' 218' ' 219' ' 220' ' 221' ' 222' ' 223' ' 224' ' 225' ' 226' ' 227' ' 228' ' 229' ' 230' ' 231' ' 232' ' 233' ' 234' ' 235' ' 236' ' 237' ' 238' ' 239' ' 240' ' 241' ' 242' ' 243' ' 244' ' 245' ' 246' '-999999' '0' '1' '10' '100' '101' '102' '103' '104' '105' '106' '107' '108' '109' '11' '110' '111' '112' '113' '114' '115' '116' '117' '118' '119' '12' '120' '121' '122' '123' '124' '125' '126' '127' '128' '129' '13' '130' '131' '132' '133' '134' '135' '136' '137' '138' '139' '14' '140' '141' '142' '143' '144' '145' '146' '147' '148' '149' '15' '150' '151' '152' '153' '154' '155' '156' '157' '158' '159' '16' '160' '161' '162' '163' '164' '165' '166' '167' '168' '169' '17' '170' '171' '172' '173' '174' '175' '176' '177' '178' '179' '18' '180' '181' '182' '183' '184' '185' '186' '187' '188' '189' '19' '190' '191' '192' '193' '194' '195' '196' '197' '198' '199' '2' '20' '200' '201' '202' '203' '204' '205' '206' '207' '208' '209' '21' '210' '211' '212' '213' '214' '215' '216' '217' '218' '219' '22' '220' '221' '222' '223' '224' '225' '226' '227' '228' '229' '23' '230' '231' '232' '233' '234' '235' '236' '237' '238' '239' '24' '240' '241' '242' '243' '244' '245' '246' '247' '248' '249' '25' '250' '251' '252' '253' '254' '255' '256' '26' '27' '28' '29' '3' '30' '31' '32' '33' '34' '35' '36' '37' '38' '39' '4' '40' '41' '42' '43' '44' '45' '46' '47' '48' '49' '5' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '6' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '7' '70' '71' '72' '73' '74' '75' '76' '77' '78' '79' '8' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '9' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99'] -------------------------------------------------- # col ult_fec_cli_1t, n_uniq 224, uniq ['2015-07-01' '2015-07-02' '2015-07-03' '2015-07-06' '2015-07-07' '2015-07-08' '2015-07-09' '2015-07-10' 
'2015-07-13' '2015-07-14' '2015-07-15' '2015-07-16' '2015-07-17' '2015-07-20' '2015-07-21' '2015-07-22' '2015-07-23' '2015-07-24' '2015-07-27' '2015-07-28' '2015-07-29' '2015-07-30' '2015-08-03' '2015-08-04' '2015-08-05' '2015-08-06' '2015-08-07' '2015-08-10' '2015-08-11' '2015-08-12' '2015-08-13' '2015-08-14' '2015-08-17' '2015-08-18' '2015-08-19' '2015-08-20' '2015-08-21' '2015-08-24' '2015-08-25' '2015-08-26' '2015-08-27' '2015-08-28' '2015-09-01' '2015-09-02' '2015-09-03' '2015-09-04' '2015-09-07' '2015-09-08' '2015-09-09' '2015-09-10' '2015-09-11' '2015-09-14' '2015-09-15' '2015-09-16' '2015-09-17' '2015-09-18' '2015-09-21' '2015-09-22' '2015-09-23' '2015-09-24' '2015-09-25' '2015-09-28' '2015-09-29' '2015-10-01' '2015-10-02' '2015-10-05' '2015-10-06' '2015-10-07' '2015-10-08' '2015-10-09' '2015-10-13' '2015-10-14' '2015-10-15' '2015-10-16' '2015-10-19' '2015-10-20' '2015-10-21' '2015-10-22' '2015-10-23' '2015-10-26' '2015-10-27' '2015-10-28' '2015-10-29' '2015-11-02' '2015-11-03' '2015-11-04' '2015-11-05' '2015-11-06' '2015-11-09' '2015-11-10' '2015-11-11' '2015-11-12' '2015-11-13' '2015-11-16' '2015-11-17' '2015-11-18' '2015-11-19' '2015-11-20' '2015-11-23' '2015-11-24' '2015-11-25' '2015-11-26' '2015-11-27' '2015-12-01' '2015-12-02' '2015-12-03' '2015-12-04' '2015-12-07' '2015-12-09' '2015-12-10' '2015-12-11' '2015-12-14' '2015-12-15' '2015-12-16' '2015-12-17' '2015-12-18' '2015-12-21' '2015-12-22' '2015-12-23' '2015-12-24' '2015-12-28' '2015-12-29' '2015-12-30' '2016-01-04' '2016-01-05' '2016-01-07' '2016-01-08' '2016-01-11' '2016-01-12' '2016-01-13' '2016-01-14' '2016-01-15' '2016-01-18' '2016-01-19' '2016-01-20' '2016-01-21' '2016-01-22' '2016-01-25' '2016-01-26' '2016-01-27' '2016-01-28' '2016-02-01' '2016-02-02' '2016-02-03' '2016-02-04' '2016-02-05' '2016-02-08' '2016-02-09' '2016-02-10' '2016-02-11' '2016-02-12' '2016-02-15' '2016-02-16' '2016-02-17' '2016-02-18' '2016-02-19' '2016-02-22' '2016-02-23' '2016-02-24' '2016-02-25' '2016-02-26' 
'2016-03-01' '2016-03-02' '2016-03-03' '2016-03-04' '2016-03-07' '2016-03-08' '2016-03-09' '2016-03-10' '2016-03-11' '2016-03-14' '2016-03-15' '2016-03-16' '2016-03-17' '2016-03-18' '2016-03-21' '2016-03-22' '2016-03-23' '2016-03-24' '2016-03-28' '2016-03-29' '2016-03-30' '2016-04-01' '2016-04-04' '2016-04-05' '2016-04-06' '2016-04-07' '2016-04-08' '2016-04-11' '2016-04-12' '2016-04-13' '2016-04-14' '2016-04-15' '2016-04-18' '2016-04-19' '2016-04-20' '2016-04-21' '2016-04-22' '2016-04-25' '2016-04-26' '2016-04-27' '2016-04-28' '2016-05-02' '2016-05-03' '2016-05-04' '2016-05-05' '2016-05-06' '2016-05-09' '2016-05-10' '2016-05-11' '2016-05-12' '2016-05-13' '2016-05-16' '2016-05-17' '2016-05-18' '2016-05-19' '2016-05-20' '2016-05-23' '2016-05-24' '2016-05-25' '2016-05-26' '2016-05-27' '2016-05-30' 'nan'] -------------------------------------------------- # col indrel_1mes, n_uniq 10, uniq ['1' '1.0' '2' '2.0' '3' '3.0' '4' '4.0' 'P' 'nan'] -------------------------------------------------- # col tiprel_1mes, n_uniq 6, uniq ['A' 'I' 'N' 'P' 'R' 'nan'] -------------------------------------------------- # col indresi, n_uniq 3, uniq ['N' 'S' 'nan'] -------------------------------------------------- # col indext, n_uniq 3, uniq ['N' 'S' 'nan'] -------------------------------------------------- # col conyuemp, n_uniq 3, uniq ['N' 'S' 'nan'] -------------------------------------------------- # col canal_entrada, n_uniq 163, uniq ['004' '007' '013' '025' 'K00' 'KAA' 'KAB' 'KAC' 'KAD' 'KAE' 'KAF' 'KAG' 'KAH' 'KAI' 'KAJ' 'KAK' 'KAL' 'KAM' 'KAN' 'KAO' 'KAP' 'KAQ' 'KAR' 'KAS' 'KAT' 'KAU' 'KAV' 'KAW' 'KAY' 'KAZ' 'KBB' 'KBD' 'KBE' 'KBF' 'KBG' 'KBH' 'KBJ' 'KBL' 'KBM' 'KBN' 'KBO' 'KBP' 'KBQ' 'KBR' 'KBS' 'KBU' 'KBV' 'KBW' 'KBX' 'KBY' 'KBZ' 'KCA' 'KCB' 'KCC' 'KCD' 'KCE' 'KCF' 'KCG' 'KCH' 'KCI' 'KCJ' 'KCK' 'KCL' 'KCM' 'KCN' 'KCO' 'KCP' 'KCQ' 'KCR' 'KCS' 'KCT' 'KCU' 'KCV' 'KCX' 'KDA' 'KDB' 'KDC' 'KDD' 'KDE' 'KDF' 'KDG' 'KDH' 'KDI' 'KDL' 'KDM' 'KDN' 'KDO' 'KDP' 'KDQ' 'KDR' 'KDS' 'KDT' 
'KDU' 'KDV' 'KDW' 'KDX' 'KDY' 'KDZ' 'KEA' 'KEB' 'KEC' 'KED' 'KEE' 'KEF' 'KEG' 'KEH' 'KEI' 'KEJ' 'KEK' 'KEL' 'KEM' 'KEN' 'KEO' 'KEQ' 'KES' 'KEU' 'KEV' 'KEW' 'KEY' 'KEZ' 'KFA' 'KFB' 'KFC' 'KFD' 'KFE' 'KFF' 'KFG' 'KFH' 'KFI' 'KFJ' 'KFK' 'KFL' 'KFM' 'KFN' 'KFP' 'KFR' 'KFS' 'KFT' 'KFU' 'KFV' 'KGC' 'KGN' 'KGU' 'KGV' 'KGW' 'KGX' 'KGY' 'KHA' 'KHC' 'KHD' 'KHE' 'KHF' 'KHK' 'KHL' 'KHM' 'KHN' 'KHO' 'KHP' 'KHQ' 'KHR' 'KHS' 'RED' 'nan'] -------------------------------------------------- # col indfall, n_uniq 3, uniq ['N' 'S' 'nan'] -------------------------------------------------- # col nomprov, n_uniq 53, uniq ['ALAVA' 'ALBACETE' 'ALICANTE' 'ALMERIA' 'ASTURIAS' 'AVILA' 'BADAJOZ' 'BALEARS, ILLES' 'BARCELONA' 'BIZKAIA' 'BURGOS' 'CACERES' 'CADIZ' 'CANTABRIA' 'CASTELLON' 'CEUTA' 'CIUDAD REAL' 'CORDOBA' 'CORUÑA, A' 'CUENCA' 'GIPUZKOA' 'GIRONA' 'GRANADA' 'GUADALAJARA' 'HUELVA' 'HUESCA' 'JAEN' 'LEON' 'LERIDA' 'LUGO' 'MADRID' 'MALAGA' 'MELILLA' 'MURCIA' 'NAVARRA' 'OURENSE' 'PALENCIA' 'PALMAS, LAS' 'PONTEVEDRA' 'RIOJA, LA' 'SALAMANCA' 'SANTA CRUZ DE TENERIFE' 'SEGOVIA' 'SEVILLA' 'SORIA' 'TARRAGONA' 'TERUEL' 'TOLEDO' 'VALENCIA' 'VALLADOLID' 'ZAMORA' 'ZARAGOZA' 'nan'] -------------------------------------------------- # col segmento, n_uniq 4, uniq ['01 - TOP' '02 - PARTICULARES' '03 - UNIVERSITARIO' 'nan']
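The unique values above show a few things that look like cleanup candidates: `age` and `antiguedad` mix padded strings such as `' 35'` and `' NA'` with plain ones, `antiguedad` contains `-999999`, and `indrel_1mes` has both `'1'` and `'1.0'`. Below is a minimal cleaning sketch under my own assumptions (that `-999999` is a sentinel for "unknown", and that `'1'` and `'1.0'` mean the same thing). The `raw` frame and the helper `to_number` are hypothetical names made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample imitating the messy values seen in the unique lists.
raw = pd.DataFrame({
    "age": [" 35", " NA", "23", "100"],
    "antiguedad": ["      6", "     NA", "-999999", "35"],
    "indrel_1mes": ["1", "1.0", "P", np.nan],
})


def to_number(series):
    """Cast to str, strip padding, and parse; text like 'NA' becomes NaN."""
    return pd.to_numeric(series.astype(str).str.strip(), errors="coerce")


clean = pd.DataFrame({
    "age": to_number(raw["age"]),
    # Assumption: -999999 marks an unknown value, so treat it as missing.
    "antiguedad": to_number(raw["antiguedad"]).replace(-999999, np.nan),
    # Assumption: '1.0' and '1' are the same label; 'P' stays as text.
    "indrel_1mes": raw["indrel_1mes"].replace(
        {"1.0": "1", "2.0": "2", "3.0": "3", "4.0": "4"}
    ),
})

print(clean)
```

The `astype(str)` step matters on the real column because a mixed-type column can hold actual integers next to strings.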
In [15]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [17]:
skip_cols = ['ncodpers', 'renta']
for col in trn.columns:
    if col in skip_cols:
        continue
    print('-'*50)
    print('col : ', col)
    f, ax = plt.subplots(figsize=(20, 15))
    sns.countplot(x=col, data=trn, alpha=0.5)
    plt.show()
-------------------------------------------------- col : fecha_dato
-------------------------------------------------- col : ind_empleado
-------------------------------------------------- col : pais_residencia
-------------------------------------------------- col : sexo
-------------------------------------------------- col : age
-------------------------------------------------- col : fecha_alta
-------------------------------------------------- col : ind_nuevo
-------------------------------------------------- col : antiguedad
-------------------------------------------------- col : indrel
-------------------------------------------------- col : ult_fec_cli_1t
-------------------------------------------------- col : indrel_1mes
-------------------------------------------------- col : tiprel_1mes
-------------------------------------------------- col : indresi
-------------------------------------------------- col : indext
-------------------------------------------------- col : conyuemp
-------------------------------------------------- col : canal_entrada
-------------------------------------------------- col : indfall
-------------------------------------------------- col : tipodom
-------------------------------------------------- col : cod_prov
-------------------------------------------------- col : nomprov
-------------------------------------------------- col : ind_actividad_cliente
-------------------------------------------------- col : segmento
-------------------------------------------------- col : ind_ahor_fin_ult1
-------------------------------------------------- col : ind_aval_fin_ult1
-------------------------------------------------- col : ind_cco_fin_ult1
-------------------------------------------------- col : ind_cder_fin_ult1
-------------------------------------------------- col : ind_cno_fin_ult1
-------------------------------------------------- col : ind_ctju_fin_ult1
-------------------------------------------------- col : ind_ctma_fin_ult1
-------------------------------------------------- col : ind_ctop_fin_ult1
-------------------------------------------------- col : ind_ctpp_fin_ult1
-------------------------------------------------- col : ind_deco_fin_ult1
-------------------------------------------------- col : ind_deme_fin_ult1
-------------------------------------------------- col : ind_dela_fin_ult1
-------------------------------------------------- col : ind_ecue_fin_ult1
-------------------------------------------------- col : ind_fond_fin_ult1
-------------------------------------------------- col : ind_hip_fin_ult1
-------------------------------------------------- col : ind_plan_fin_ult1
-------------------------------------------------- col : ind_pres_fin_ult1
-------------------------------------------------- col : ind_reca_fin_ult1
-------------------------------------------------- col : ind_tjcr_fin_ult1
-------------------------------------------------- col : ind_valo_fin_ult1
-------------------------------------------------- col : ind_viv_fin_ult1
-------------------------------------------------- col : ind_nomina_ult1
-------------------------------------------------- col : ind_nom_pens_ult1
-------------------------------------------------- col : ind_recibo_ult1
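Drawing a count plot per column over every row is likely to be slow and hard to read for columns such as `fecha_alta`, which has 6,756 distinct values. One alternative sketch, assuming a bar chart of only the most frequent values is enough for a first look, is to aggregate with `value_counts` first and plot the small result. `top_counts` and `plot_top_counts` are hypothetical helper names, and the `nomprov` sample is invented.

```python
import pandas as pd


def top_counts(series, n=20):
    """Count values (NaN included) and keep the n most frequent."""
    return series.value_counts(dropna=False).head(n)


def plot_top_counts(series, n=20):
    """Bar chart of the n most frequent values; requires matplotlib."""
    import matplotlib.pyplot as plt

    counts = top_counts(series, n)
    ax = counts.plot(kind="bar", figsize=(12, 6), alpha=0.5)
    ax.set_title(series.name)
    plt.show()


# Hypothetical sample standing in for trn['nomprov'].
nomprov = pd.Series(
    ["MADRID", "MADRID", "MALAGA", "ZARAGOZA", "MADRID", "MALAGA"],
    name="nomprov",
)
counts = top_counts(nomprov, n=2)
print(counts)
```

In the notebook, `plot_top_counts(trn[col])` could stand in for the `sns.countplot` call inside the loop.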
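(It seems charts can't be pasted into a Naver blog; copying them just doesn't work.
I'll have to look for a way to upload charts to the blog. Just for the feel of it, I captured one chart and put it in.)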