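I recently picked up a book I bought a while ago, one that teaches machine learning through winning Kaggle solutions.
I had bought it, worked through a little, and then let it slide for one reason or another, until it suddenly came to mind and I decided to try again.
The competitions in it are old, but I logged in to Kaggle for the first time in ages, downloaded the data, and started on one.
The first one I tried is a project from Spain's Santander Bank: building a model that recommends products to customers who visit the bank.
The training data has about 13.65 million rows (13,647,309) and 48 columns.
I only got as far as a general look through the data, but after such a long break it was not easy.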
In [1]:
import pandas as pd
import numpy as np

trn = pd.read_csv('train_ver2.csv')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (5,8,11,15) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
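The DtypeWarning points at columns 5, 8, 11, and 15, which by position are age, antiguedad, indrel_1mes, and conyuemp. One way to avoid the mixed-type guessing is to tell `read_csv` to read those columns as strings up front (the other option the warning mentions is `low_memory=False`). Below is a minimal sketch on a tiny made-up CSV; the sample rows are invented and only imitate the layout of the real file.

```python
import io

import pandas as pd

# Invented rows that imitate the padded, mixed-type values in train_ver2.csv.
sample_csv = io.StringIO(
    "ncodpers,age,antiguedad,indrel_1mes,conyuemp\n"
    "1375586, 35,      6,1.0,\n"
    "1050611, 23,     35,1,\n"
    "1050612, NA,     NA,P,N\n"
)

# Read the four columns named in the warning as plain strings,
# so pandas does not have to guess a type chunk by chunk.
mixed_cols = {"age": str, "antiguedad": str, "indrel_1mes": str, "conyuemp": str}
df = pd.read_csv(sample_csv, dtype=mixed_cols)

print(df["age"].tolist())
print(df["indrel_1mes"].tolist())
```

With the real file the same `dtype` mapping would be passed to `pd.read_csv('train_ver2.csv', dtype=mixed_cols)`; the strings still need cleaning afterwards.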
In [2]:
trn.shape
Out[2]:
(13647309, 48)
In [3]:
trn.head()
Out[3]:
| | fecha_dato | ncodpers | ind_empleado | pais_residencia | sexo | age | fecha_alta | ind_nuevo | antiguedad | indrel | ... | ind_hip_fin_ult1 | ind_plan_fin_ult1 | ind_pres_fin_ult1 | ind_reca_fin_ult1 | ind_tjcr_fin_ult1 | ind_valo_fin_ult1 | ind_viv_fin_ult1 | ind_nomina_ult1 | ind_nom_pens_ult1 | ind_recibo_ult1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-28 | 1375586 | N | ES | H | 35 | 2015-01-12 | 0.0 | 6 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 1 | 2015-01-28 | 1050611 | N | ES | V | 23 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 2 | 2015-01-28 | 1050612 | N | ES | V | 23 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 3 | 2015-01-28 | 1050613 | N | ES | H | 22 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 4 | 2015-01-28 | 1050614 | N | ES | V | 23 | 2012-08-10 | 0.0 | 35 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |

5 rows × 48 columns
In [4]:
for col in trn.columns:
    print('{}\n'.format(trn[col].head()))
0 2015-01-28 1 2015-01-28 2 2015-01-28 3 2015-01-28 4 2015-01-28 Name: fecha_dato, dtype: object
0 1375586 1 1050611 2 1050612 3 1050613 4 1050614 Name: ncodpers, dtype: int64
0 N 1 N 2 N 3 N 4 N Name: ind_empleado, dtype: object
0 ES 1 ES 2 ES 3 ES 4 ES Name: pais_residencia, dtype: object
0 H 1 V 2 V 3 H 4 V Name: sexo, dtype: object
0 35 1 23 2 23 3 22 4 23 Name: age, dtype: object
0 2015-01-12 1 2012-08-10 2 2012-08-10 3 2012-08-10 4 2012-08-10 Name: fecha_alta, dtype: object
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: ind_nuevo, dtype: float64
0 6 1 35 2 35 3 35 4 35 Name: antiguedad, dtype: object
0 1.0 1 1.0 2 1.0 3 1.0 4 1.0 Name: indrel, dtype: float64
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN Name: ult_fec_cli_1t, dtype: object
0 1 1 1 2 1 3 1 4 1 Name: indrel_1mes, dtype: object
0 A 1 I 2 I 3 I 4 A Name: tiprel_1mes, dtype: object
0 S 1 S 2 S 3 S 4 S Name: indresi, dtype: object
0 N 1 S 2 N 3 N 4 N Name: indext, dtype: object
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN Name: conyuemp, dtype: object
0 KHL 1 KHE 2 KHE 3 KHD 4 KHE Name: canal_entrada, dtype: object
0 N 1 N 2 N 3 N 4 N Name: indfall, dtype: object
0 1.0 1 1.0 2 1.0 3 1.0 4 1.0 Name: tipodom, dtype: float64
0 29.0 1 13.0 2 13.0 3 50.0 4 50.0 Name: cod_prov, dtype: float64
0 MALAGA 1 CIUDAD REAL 2 CIUDAD REAL 3 ZARAGOZA 4 ZARAGOZA Name: nomprov, dtype: object
0 1.0 1 0.0 2 0.0 3 0.0 4 1.0 Name: ind_actividad_cliente, dtype: float64
0 87218.10 1 35548.74 2 122179.11 3 119775.54 4 NaN Name: renta, dtype: float64
0 02 - PARTICULARES 1 03 - UNIVERSITARIO 2 03 - UNIVERSITARIO 3 03 - UNIVERSITARIO 4 03 - UNIVERSITARIO Name: segmento, dtype: object
0 0 1 0 2 0 3 0 4 0 Name: ind_ahor_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_aval_fin_ult1, dtype: int64
0 1 1 1 2 1 3 0 4 1 Name: ind_cco_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_cder_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_cno_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctju_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctma_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctop_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ctpp_fin_ult1, dtype: int64
0 0 1 0 2 0 3 1 4 0 Name: ind_deco_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_deme_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_dela_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_ecue_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_fond_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_hip_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_plan_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_pres_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_reca_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_tjcr_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_valo_fin_ult1, dtype: int64
0 0 1 0 2 0 3 0 4 0 Name: ind_viv_fin_ult1, dtype: int64
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: ind_nomina_ult1, dtype: float64
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: ind_nom_pens_ult1, dtype: float64
0 0 1 0 2 0 3 0 4 0 Name: ind_recibo_ult1, dtype: int64
In [5]:
trn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647309 entries, 0 to 13647308
Data columns (total 48 columns):
fecha_dato               object
ncodpers                 int64
ind_empleado             object
pais_residencia          object
sexo                     object
age                      object
fecha_alta               object
ind_nuevo                float64
antiguedad               object
indrel                   float64
ult_fec_cli_1t           object
indrel_1mes              object
tiprel_1mes              object
indresi                  object
indext                   object
conyuemp                 object
canal_entrada            object
indfall                  object
tipodom                  float64
cod_prov                 float64
nomprov                  object
ind_actividad_cliente    float64
renta                    float64
segmento                 object
ind_ahor_fin_ult1        int64
ind_aval_fin_ult1        int64
ind_cco_fin_ult1         int64
ind_cder_fin_ult1        int64
ind_cno_fin_ult1         int64
ind_ctju_fin_ult1        int64
ind_ctma_fin_ult1        int64
ind_ctop_fin_ult1        int64
ind_ctpp_fin_ult1        int64
ind_deco_fin_ult1        int64
ind_deme_fin_ult1        int64
ind_dela_fin_ult1        int64
ind_ecue_fin_ult1        int64
ind_fond_fin_ult1        int64
ind_hip_fin_ult1         int64
ind_plan_fin_ult1        int64
ind_pres_fin_ult1        int64
ind_reca_fin_ult1        int64
ind_tjcr_fin_ult1        int64
ind_valo_fin_ult1        int64
ind_viv_fin_ult1         int64
ind_nomina_ult1          float64
ind_nom_pens_ult1        float64
ind_recibo_ult1          int64
dtypes: float64(8), int64(23), object(17)
memory usage: 4.9+ GB
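At 4.9+ GB the frame is heavy for a laptop, and 17 of the 48 columns are stored as generic objects. A rough sketch of one way to shrink it: downcast the integer flags and turn low-cardinality text columns into `category`. The `shrink` helper, its `max_unique` threshold, and the random sample data are my own assumptions for illustration, not something from the book.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few columns of train_ver2.csv.
rng = np.random.default_rng(0)
n_rows = 10_000
sample = pd.DataFrame({
    "sexo": rng.choice(["H", "V"], size=n_rows),
    "segmento": rng.choice(
        ["01 - TOP", "02 - PARTICULARES", "03 - UNIVERSITARIO"], size=n_rows
    ),
    "ind_cco_fin_ult1": rng.integers(0, 2, size=n_rows),
})


def shrink(df, max_unique=50):
    """Hypothetical helper: downcast integers, categorize small text columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # 0/1 flags fit comfortably in int8 instead of int64.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif (not pd.api.types.is_numeric_dtype(out[col])
              and out[col].nunique() <= max_unique):
            # Few distinct strings: store codes plus a small lookup table.
            out[col] = out[col].astype("category")
    return out


small = shrink(sample)
before = sample.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print("before: {} bytes, after: {} bytes".format(before, after))
```

How much this saves on the real data depends on the columns; high-cardinality columns such as fecha_alta would be left alone by the threshold.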
In [6]:
num_cols = [col for col in trn.columns[:24] if trn[col].dtype in ['int64', 'float64']]
trn[num_cols].describe()
Out[6]:
| | ncodpers | ind_nuevo | indrel | tipodom | cod_prov | ind_actividad_cliente | renta |
|---|---|---|---|---|---|---|---|
| count | 1.364731e+07 | 1.361958e+07 | 1.361958e+07 | 13619574.0 | 1.355372e+07 | 1.361958e+07 | 1.085293e+07 |
| mean | 8.349042e+05 | 5.956184e-02 | 1.178399e+00 | 1.0 | 2.657147e+01 | 4.578105e-01 | 1.342543e+05 |
| std | 4.315650e+05 | 2.366733e-01 | 4.177469e+00 | 0.0 | 1.278402e+01 | 4.982169e-01 | 2.306202e+05 |
| min | 1.588900e+04 | 0.000000e+00 | 1.000000e+00 | 1.0 | 1.000000e+00 | 0.000000e+00 | 1.202730e+03 |
| 25% | 4.528130e+05 | 0.000000e+00 | 1.000000e+00 | 1.0 | 1.500000e+01 | 0.000000e+00 | 6.871098e+04 |
| 50% | 9.318930e+05 | 0.000000e+00 | 1.000000e+00 | 1.0 | 2.800000e+01 | 0.000000e+00 | 1.018500e+05 |
| 75% | 1.199286e+06 | 0.000000e+00 | 1.000000e+00 | 1.0 | 3.500000e+01 | 1.000000e+00 | 1.559560e+05 |
| max | 1.553689e+06 | 1.000000e+00 | 9.900000e+01 | 1.0 | 5.200000e+01 | 1.000000e+00 | 2.889440e+07 |
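One thing the summary makes visible: tipodom has min = max = 1.0 and a standard deviation of 0.0, so apart from its missing rows it carries no information. A small sketch of how such constant columns could be detected automatically; the `constant_columns` helper and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows; tipodom is constant apart from one missing value.
sample = pd.DataFrame({
    "tipodom": [1.0, 1.0, np.nan, 1.0],
    "cod_prov": [29.0, 13.0, 13.0, 50.0],
    "ind_nuevo": [0.0, 0.0, 1.0, 0.0],
})


def constant_columns(df):
    """Return names of columns with at most one distinct non-missing value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]


print(constant_columns(sample))
```

Whether to actually drop such a column is a modeling decision; the sketch only flags candidates.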
In [9]:
cat_cols = [col for col in trn.columns[:24] if trn[col].dtype in ['O']]
trn[cat_cols].describe()
Out[9]:
| | fecha_dato | ind_empleado | pais_residencia | sexo | age | fecha_alta | antiguedad | ult_fec_cli_1t | indrel_1mes | tiprel_1mes | indresi | indext | conyuemp | canal_entrada | indfall | nomprov | segmento |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13647309 | 13619575 | 13619575 | 13619505 | 13647309 | 13619575 | 13647309 | 24793 | 13497528.0 | 13497528 | 13619575 | 13619575 | 1808 | 13461183 | 13619575 | 13553718 | 13457941 |
| unique | 17 | 5 | 118 | 2 | 235 | 6756 | 507 | 223 | 13.0 | 5 | 2 | 2 | 2 | 162 | 2 | 52 | 3 |
| top | 2016-05-28 | N | ES | V | 23 | 2014-07-28 | 0 | 2015-12-24 | 1.0 | I | S | N | N | KHE | N | MADRID | 02 - PARTICULARES |
| freq | 931453 | 13610977 | 13553710 | 7424252 | 542682 | 57389 | 134335 | 763 | 7277607.0 | 7304875 | 13553711 | 12974839 | 1791 | 4055270 | 13584813 | 4409600 | 7960220 |
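The count row hints at how uneven the missing data is: ult_fec_cli_1t has only 24,793 non-missing values and conyuemp only 1,808 out of 13,647,309 rows. Rather than reading that off the `describe()` table, a per-column missing-value report can be computed directly. A minimal sketch on invented rows; `missing_report` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Invented rows with gaps in the same columns that are sparse in the real data.
sample = pd.DataFrame({
    "sexo": ["H", "V", None, "V"],
    "conyuemp": [None, None, None, "N"],
    "renta": [87218.10, 35548.74, np.nan, 119775.54],
    "ncodpers": [1375586, 1050611, 1050612, 1050613],
})


def missing_report(df):
    """Count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return report.sort_values("pct_missing", ascending=False)


report = missing_report(sample)
print(report)
```

On the full frame the call would simply be `missing_report(trn)`.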
In [10]:
for col in cat_cols:
    uniq = np.unique(trn[col].astype(str))
    print('-'*50)
    print('# col {}, n_uniq {}, uniq {}'.format(col, len(uniq), uniq))
--------------------------------------------------
# col fecha_dato, n_uniq 17, uniq ['2015-01-28' '2015-02-28' '2015-03-28' '2015-04-28' '2015-05-28' '2015-06-28' '2015-07-28' '2015-08-28' '2015-09-28' '2015-10-28' '2015-11-28' '2015-12-28' '2016-01-28' '2016-02-28' '2016-03-28' '2016-04-28' '2016-05-28']
--------------------------------------------------
# col ind_empleado, n_uniq 6, uniq ['A' 'B' 'F' 'N' 'S' 'nan']
--------------------------------------------------
# col pais_residencia, n_uniq 119, uniq ['AD' 'AE' 'AL' 'AO' 'AR' 'AT' 'AU' 'BA' 'BE' 'BG' 'BM' 'BO' 'BR' 'BY' 'BZ' 'CA' 'CD' 'CF' 'CG' 'CH' 'CI' 'CL' 'CM' 'CN' 'CO' 'CR' 'CU' 'CZ' 'DE' 'DJ' 'DK' 'DO' 'DZ' 'EC' 'EE' 'EG' 'ES' 'ET' 'FI' 'FR' 'GA' 'GB' 'GE' 'GH' 'GI' 'GM' 'GN' 'GQ' 'GR' 'GT' 'GW' 'HK' 'HN' 'HR' 'HU' 'IE' 'IL' 'IN' 'IS' 'IT' 'JM' 'JP' 'KE' 'KH' 'KR' 'KW' 'KZ' 'LB' 'LT' 'LU' 'LV' 'LY' 'MA' 'MD' 'MK' 'ML' 'MM' 'MR' 'MT' 'MX' 'MZ' 'NG' 'NI' 'NL' 'NO' 'NZ' 'OM' 'PA' 'PE' 'PH' 'PK' 'PL' 'PR' 'PT' 'PY' 'QA' 'RO' 'RS' 'RU' 'SA' 'SE' 'SG' 'SK' 'SL' 'SN' 'SV' 'TG' 'TH' 'TN' 'TR' 'TW' 'UA' 'US' 'UY' 'VE' 'VN' 'ZA' 'ZW' 'nan']
--------------------------------------------------
# col sexo, n_uniq 3, uniq ['H' 'V' 'nan']
--------------------------------------------------
# col age, n_uniq 219, uniq [' 2' ' 3' ' 4' ' 5' ' 6' ' 7' ' 8' ' 9' ' 10' ' 11' ' 12' ' 13' ' 14' ' 15' ' 16' ' 17' ' 18' ' 19' ' 20' ' 21' ' 22' ' 23' ' 24' ' 25' ' 26' ' 27' ' 28' ' 29' ' 30' ' 31' ' 32' ' 33' ' 34' ' 35' ' 36' ' 37' ' 38' ' 39' ' 40' ' 41' ' 42' ' 43' ' 44' ' 45' ' 46' ' 47' ' 48' ' 49' ' 50' ' 51' ' 52' ' 53' ' 54' ' 55' ' 56' ' 57' ' 58' ' 59' ' 60' ' 61' ' 62' ' 63' ' 64' ' 65' ' 66' ' 67' ' 68' ' 69' ' 70' ' 71' ' 72' ' 73' ' 74' ' 75' ' 76' ' 77' ' 78' ' 79' ' 80' ' 81' ' 82' ' 83' ' 84' ' 85' ' 86' ' 87' ' 88' ' 89' ' 90' ' 91' ' 92' ' 93' ' 94' ' 95' ' 96' ' 97' ' 98' ' 99' ' NA' '10' '100' '101' '102' '103' '104' '105' '106' '107' '108' '109' '11' '110' '111' '112' '113' '114' '115' '116' '117' '12' '126' '127' '13' '14' '15' '16' '163' '164' '17' '18' '19' '2' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '3' '30' '31' '32' '33' '34' '35' '36' '37' '38' '39' '4' '40' '41' '42' '43' '44' '45' '46' '47' '48' '49' '5' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '6' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '7' '70' '71' '72' '73' '74' '75' '76' '77' '78' '79' '8' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '9' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99']
--------------------------------------------------
# col fecha_alta, n_uniq 6757, uniq ['1995-01-16' '1995-01-17' '1995-01-23' ... '2016-05-30' '2016-05-31' 'nan']
--------------------------------------------------
# col antiguedad, n_uniq 506, uniq [' 0' ' 1' ' 2' ' 3' ' 4' ' 5' ' 6' ' 7' ' 8' ' 9' ' 10' ' 11' ' 12' ' 13' ' 14' ' 15' ' 16' ' 17' ' 18' ' 19' ' 20' ' 21' ' 22' ' 23' ' 24' ' 25' ' 26' ' 27' ' 28' ' 29' ' 30' ' 31' ' 32' ' 33' ' 34' ' 35' ' 36' ' 37' ' 38' ' 39' ' 40' ' 41' ' 42' ' 43' ' 44' ' 45' ' 46' ' 47' ' 48' ' 49' ' 50' ' 51' ' 52' ' 53' ' 54' ' 55' ' 56' ' 57' ' 58' ' 59' ' 60' ' 61' ' 62' ' 63' ' 64' ' 65' ' 66' ' 67' ' 68' ' 69' ' 70' ' 71' ' 72' ' 73' ' 74' ' 75' ' 76' ' 77' ' 78' ' 79' ' 80' ' 81' ' 82' ' 83' ' 84' ' 85' ' 86' ' 87' ' 88' ' 89' ' 90' ' 91' ' 92' ' 93' ' 94' ' 95' ' 96' ' 97' ' 98' ' 99' ' NA' ' 100' ' 101' ' 102' ' 103' ' 104' ' 105' ' 106' ' 107' ' 108' ' 109' ' 110' ' 111' ' 112' ' 113' ' 114' ' 115' ' 116' ' 117' ' 118' ' 119' ' 120' ' 121' ' 122' ' 123' ' 124' ' 125' ' 126' ' 127' ' 128' ' 129' ' 130' ' 131' ' 132' ' 133' ' 134' ' 135' ' 136' ' 137' ' 138' ' 139' ' 140' ' 141' ' 142' ' 143' ' 144' ' 145' ' 146' ' 147' ' 148' ' 149' ' 150' ' 151' ' 152' ' 153' ' 154' ' 155' ' 156' ' 157' ' 158' ' 159' ' 160' ' 161' ' 162' ' 163' ' 164' ' 165' ' 166' ' 167' ' 168' ' 169' ' 170' ' 171' ' 172' ' 173' ' 174' ' 175' ' 176' ' 177' ' 178' ' 179' ' 180' ' 181' ' 182' ' 183' ' 184' ' 185' ' 186' ' 187' ' 188' ' 189' ' 190' ' 191' ' 192' ' 193' ' 194' ' 195' ' 196' ' 197' ' 198' ' 199' ' 200' ' 201' ' 202' ' 203' ' 204' ' 205' ' 206' ' 207' ' 208' ' 209' ' 210' ' 211' ' 212' ' 213' ' 214' ' 215' ' 216' ' 217' ' 218' ' 219' ' 220' ' 221' ' 222' ' 223' ' 224' ' 225' ' 226' ' 227' ' 228' ' 229' ' 230' ' 231' ' 232' ' 233' ' 234' ' 235' ' 236' ' 237' ' 238' ' 239' ' 240' ' 241' ' 242' ' 243' ' 244' ' 245' ' 246' '-999999' '0' '1' '10' '100' '101' '102' '103' '104' '105' '106' '107' '108' '109' '11' '110' '111' '112' '113' '114' '115' '116' '117' '118' '119' '12' '120' '121' '122' '123' '124' '125' '126' '127' '128' '129' '13' '130' '131' '132' '133' '134' '135' '136' '137' '138' '139' '14' '140' '141' '142' '143' '144' '145' '146' '147' '148' '149' '15' '150' '151' '152' '153' '154' '155' '156' '157' '158' '159' '16' '160' '161' '162' '163' '164' '165' '166' '167' '168' '169' '17' '170' '171' '172' '173' '174' '175' '176' '177' '178' '179' '18' '180' '181' '182' '183' '184' '185' '186' '187' '188' '189' '19' '190' '191' '192' '193' '194' '195' '196' '197' '198' '199' '2' '20' '200' '201' '202' '203' '204' '205' '206' '207' '208' '209' '21' '210' '211' '212' '213' '214' '215' '216' '217' '218' '219' '22' '220' '221' '222' '223' '224' '225' '226' '227' '228' '229' '23' '230' '231' '232' '233' '234' '235' '236' '237' '238' '239' '24' '240' '241' '242' '243' '244' '245' '246' '247' '248' '249' '25' '250' '251' '252' '253' '254' '255' '256' '26' '27' '28' '29' '3' '30' '31' '32' '33' '34' '35' '36' '37' '38' '39' '4' '40' '41' '42' '43' '44' '45' '46' '47' '48' '49' '5' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '6' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '7' '70' '71' '72' '73' '74' '75' '76' '77' '78' '79' '8' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '9' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99']
--------------------------------------------------
# col ult_fec_cli_1t, n_uniq 224, uniq ['2015-07-01' '2015-07-02' '2015-07-03' '2015-07-06' '2015-07-07' '2015-07-08' '2015-07-09' '2015-07-10' '2015-07-13' '2015-07-14' '2015-07-15' '2015-07-16' '2015-07-17' '2015-07-20' '2015-07-21' '2015-07-22' '2015-07-23' '2015-07-24' '2015-07-27' '2015-07-28' '2015-07-29' '2015-07-30' '2015-08-03' '2015-08-04' '2015-08-05' '2015-08-06' '2015-08-07' '2015-08-10' '2015-08-11' '2015-08-12' '2015-08-13' '2015-08-14' '2015-08-17' '2015-08-18' '2015-08-19' '2015-08-20' '2015-08-21' '2015-08-24' '2015-08-25' '2015-08-26' '2015-08-27' '2015-08-28' '2015-09-01' '2015-09-02' '2015-09-03' '2015-09-04' '2015-09-07' '2015-09-08' '2015-09-09' '2015-09-10' '2015-09-11' '2015-09-14' '2015-09-15' '2015-09-16' '2015-09-17' '2015-09-18' '2015-09-21' '2015-09-22' '2015-09-23' '2015-09-24' '2015-09-25' '2015-09-28' '2015-09-29' '2015-10-01' '2015-10-02' '2015-10-05' '2015-10-06' '2015-10-07' '2015-10-08' '2015-10-09' '2015-10-13' '2015-10-14' '2015-10-15' '2015-10-16' '2015-10-19' '2015-10-20' '2015-10-21' '2015-10-22' '2015-10-23' '2015-10-26' '2015-10-27' '2015-10-28' '2015-10-29' '2015-11-02' '2015-11-03' '2015-11-04' '2015-11-05' '2015-11-06' '2015-11-09' '2015-11-10' '2015-11-11' '2015-11-12' '2015-11-13' '2015-11-16' '2015-11-17' '2015-11-18' '2015-11-19' '2015-11-20' '2015-11-23' '2015-11-24' '2015-11-25' '2015-11-26' '2015-11-27' '2015-12-01' '2015-12-02' '2015-12-03' '2015-12-04' '2015-12-07' '2015-12-09' '2015-12-10' '2015-12-11' '2015-12-14' '2015-12-15' '2015-12-16' '2015-12-17' '2015-12-18' '2015-12-21' '2015-12-22' '2015-12-23' '2015-12-24' '2015-12-28' '2015-12-29' '2015-12-30' '2016-01-04' '2016-01-05' '2016-01-07' '2016-01-08' '2016-01-11' '2016-01-12' '2016-01-13' '2016-01-14' '2016-01-15' '2016-01-18' '2016-01-19' '2016-01-20' '2016-01-21' '2016-01-22' '2016-01-25' '2016-01-26' '2016-01-27' '2016-01-28' '2016-02-01' '2016-02-02' '2016-02-03' '2016-02-04' '2016-02-05' '2016-02-08' '2016-02-09' '2016-02-10' '2016-02-11' '2016-02-12' '2016-02-15' '2016-02-16' '2016-02-17' '2016-02-18' '2016-02-19' '2016-02-22' '2016-02-23' '2016-02-24' '2016-02-25' '2016-02-26' '2016-03-01' '2016-03-02' '2016-03-03' '2016-03-04' '2016-03-07' '2016-03-08' '2016-03-09' '2016-03-10' '2016-03-11' '2016-03-14' '2016-03-15' '2016-03-16' '2016-03-17' '2016-03-18' '2016-03-21' '2016-03-22' '2016-03-23' '2016-03-24' '2016-03-28' '2016-03-29' '2016-03-30' '2016-04-01' '2016-04-04' '2016-04-05' '2016-04-06' '2016-04-07' '2016-04-08' '2016-04-11' '2016-04-12' '2016-04-13' '2016-04-14' '2016-04-15' '2016-04-18' '2016-04-19' '2016-04-20' '2016-04-21' '2016-04-22' '2016-04-25' '2016-04-26' '2016-04-27' '2016-04-28' '2016-05-02' '2016-05-03' '2016-05-04' '2016-05-05' '2016-05-06' '2016-05-09' '2016-05-10' '2016-05-11' '2016-05-12' '2016-05-13' '2016-05-16' '2016-05-17' '2016-05-18' '2016-05-19' '2016-05-20' '2016-05-23' '2016-05-24' '2016-05-25' '2016-05-26' '2016-05-27' '2016-05-30' 'nan']
--------------------------------------------------
# col indrel_1mes, n_uniq 10, uniq ['1' '1.0' '2' '2.0' '3' '3.0' '4' '4.0' 'P' 'nan']
--------------------------------------------------
# col tiprel_1mes, n_uniq 6, uniq ['A' 'I' 'N' 'P' 'R' 'nan']
--------------------------------------------------
# col indresi, n_uniq 3, uniq ['N' 'S' 'nan']
--------------------------------------------------
# col indext, n_uniq 3, uniq ['N' 'S' 'nan']
--------------------------------------------------
# col conyuemp, n_uniq 3, uniq ['N' 'S' 'nan']
--------------------------------------------------
# col canal_entrada, n_uniq 163, uniqnan']
--------------------------------------------------
# col indfall, n_uniq 3, uniq ['N' 'S' 'nan']
--------------------------------------------------
# col nomprov, n_uniq 53, uniq ['ALAVA' 'ALBACETE' 'ALICANTE' 'ALMERIA' 'ASTURIAS' 'AVILA' 'BADAJOZ' 'BALEARS, ILLES' 'BARCELONA' 'BIZKAIA' 'BURGOS' 'CACERES' 'CADIZ' 'CANTABRIA' 'CASTELLON' 'CEUTA' 'CIUDAD REAL' 'CORDOBA' 'CORUÑA, A' 'CUENCA' 'GIPUZKOA' 'GIRONA' 'GRANADA' 'GUADALAJARA' 'HUELVA' 'HUESCA' 'JAEN' 'LEON' 'LERIDA' 'LUGO' 'MADRID' 'MALAGA' 'MELILLA' 'MURCIA' 'NAVARRA' 'OURENSE' 'PALENCIA' 'PALMAS, LAS' 'PONTEVEDRA' 'RIOJA, LA' 'SALAMANCA' 'SANTA CRUZ DE TENERIFE' 'SEGOVIA' 'SEVILLA' 'SORIA' 'TARRAGONA' 'TERUEL' 'TOLEDO' 'VALENCIA' 'VALLADOLID' 'ZAMORA' 'ZARAGOZA' 'nan']
--------------------------------------------------
# col segmento, n_uniq 4, uniq ['01 - TOP' '02 - PARTICULARES' '03 - UNIVERSITARIO' 'nan']
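The unique values show a few things that will need cleaning before any modeling: age and antiguedad are stored as padded strings with ' NA' mixed in, antiguedad contains a -999999 value, and indrel_1mes mixes '1' with '1.0'. Here is a rough sketch of one way to tidy these, on made-up rows. Treating -999999 as missing is my own assumption, and `to_number` and `normalize_code` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Invented rows that reproduce the quirks seen in the unique-value listing.
raw = pd.DataFrame({
    "age": [" 35", " NA", "23", "116"],
    "antiguedad": ["      6", "     NA", "-999999", "35"],
    "indrel_1mes": ["1", "1.0", "P", np.nan],
})


def to_number(series):
    """Strip padding and convert to float; anything unparseable becomes NaN."""
    return pd.to_numeric(series.astype(str).str.strip(), errors="coerce")


def normalize_code(value):
    """Map '1.0' and '1' to the same label while keeping codes such as 'P'."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    try:
        return str(int(float(text)))
    except ValueError:
        return text


clean = pd.DataFrame({
    "age": to_number(raw["age"]),
    # Assumption: -999999 is a placeholder for "unknown", so treat it as missing.
    "antiguedad": to_number(raw["antiguedad"]).replace(-999999, np.nan),
    "indrel_1mes": raw["indrel_1mes"].map(normalize_code),
})
print(clean)
```

After a step like this, age and antiguedad would show up under the numeric `describe()` instead of the categorical one.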
In [15]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [17]:
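# Skip columns with too many distinct values for a count plot.
skip_cols = ['ncodpers', 'renta']
for col in trn.columns:
    if col in skip_cols:
        continue
    print('-'*50)
    print('col : ', col)
    # One large figure per column, showing how often each value occurs.
    f, ax = plt.subplots(figsize=(20, 15))
    sns.countplot(x=col, data=trn, alpha=0.5)
    plt.show()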
-------------------------------------------------- col : fecha_dato
-------------------------------------------------- col : ind_empleado
-------------------------------------------------- col : pais_residencia
-------------------------------------------------- col : sexo
-------------------------------------------------- col : age
-------------------------------------------------- col : fecha_alta
-------------------------------------------------- col : ind_nuevo
-------------------------------------------------- col : antiguedad
-------------------------------------------------- col : indrel
-------------------------------------------------- col : ult_fec_cli_1t
-------------------------------------------------- col : indrel_1mes
-------------------------------------------------- col : tiprel_1mes
-------------------------------------------------- col : indresi
-------------------------------------------------- col : indext
-------------------------------------------------- col : conyuemp
-------------------------------------------------- col : canal_entrada
-------------------------------------------------- col : indfall
-------------------------------------------------- col : tipodom
-------------------------------------------------- col : cod_prov
-------------------------------------------------- col : nomprov
-------------------------------------------------- col : ind_actividad_cliente
-------------------------------------------------- col : segmento
-------------------------------------------------- col : ind_ahor_fin_ult1
-------------------------------------------------- col : ind_aval_fin_ult1
-------------------------------------------------- col : ind_cco_fin_ult1
-------------------------------------------------- col : ind_cder_fin_ult1
-------------------------------------------------- col : ind_cno_fin_ult1
-------------------------------------------------- col : ind_ctju_fin_ult1
-------------------------------------------------- col : ind_ctma_fin_ult1
-------------------------------------------------- col : ind_ctop_fin_ult1
-------------------------------------------------- col : ind_ctpp_fin_ult1
-------------------------------------------------- col : ind_deco_fin_ult1
-------------------------------------------------- col : ind_deme_fin_ult1
-------------------------------------------------- col : ind_dela_fin_ult1
-------------------------------------------------- col : ind_ecue_fin_ult1
-------------------------------------------------- col : ind_fond_fin_ult1
-------------------------------------------------- col : ind_hip_fin_ult1
-------------------------------------------------- col : ind_plan_fin_ult1
-------------------------------------------------- col : ind_pres_fin_ult1
-------------------------------------------------- col : ind_reca_fin_ult1
-------------------------------------------------- col : ind_tjcr_fin_ult1
-------------------------------------------------- col : ind_valo_fin_ult1
-------------------------------------------------- col : ind_viv_fin_ult1
-------------------------------------------------- col : ind_nomina_ult1
-------------------------------------------------- col : ind_nom_pens_ult1
-------------------------------------------------- col : ind_recibo_ult1
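(It seems the charts cannot be scraped into a Naver blog post; they simply do not copy over.
I will have to look for a way to get charts onto the blog. Just for the sake of it, I am capturing a single chart and inserting it here.)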
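One possible way around the copy problem is to write each chart to a PNG file and upload the images by hand. The sketch below is only that, a sketch: it counts the values first with `value_counts` and draws plain matplotlib bars instead of calling `sns.countplot`, and `save_count_charts`, the temporary output folder, and the sample rows are all made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to files; in a notebook, skip this to keep inline output
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for two categorical columns of the training data.
sample = pd.DataFrame({
    "sexo": ["H", "V", "V", "H", "V"],
    "segmento": [
        "02 - PARTICULARES", "03 - UNIVERSITARIO", "03 - UNIVERSITARIO",
        "03 - UNIVERSITARIO", "01 - TOP",
    ],
})


def save_count_charts(df, columns, out_dir):
    """Save one bar chart of value counts per column and return the file paths."""
    paths = []
    for col in columns:
        # Aggregate first, so only a handful of bars are drawn.
        counts = df[col].value_counts(dropna=False)
        fig, ax = plt.subplots(figsize=(8, 5))
        ax.bar(list(counts.index.astype(str)), counts.values, alpha=0.5)
        ax.set_title(col)
        ax.set_ylabel("count")
        path = os.path.join(out_dir, "{}.png".format(col))
        fig.savefig(path, bbox_inches="tight")
        plt.close(fig)  # free the figure so a long loop does not pile up memory
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp()
paths = save_count_charts(sample, ["sexo", "segmento"], out_dir)
print(paths)
```

Counting before plotting should also be lighter than handing all 13.6 million rows to the plotting call for every column, though I have not timed it on the full data.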