Risk Analysis Study
lanquetuit.cyril@gmail.com, Université de Cergy Pontoise
Table of Contents
1 Data management and visualisation
  1.1 Week 1 - Code Book
  1.2 Week 2 - First Python programme
  1.3 Week 3 - Quantiles
  1.4 Week 4 - First plot
2 Data Analysis Tool
  2.1 Week 1 - Ordinary Least Squares
  2.2 Week 2 - Chi-square analysis
  2.3 Week 3 - Correlation coefficient
  2.4 Week 4 - Statistical interaction
3 Machine Learning
  3.1 Week 1 - Classification Tree
  3.2 Week 2 - Classification Random Forest
Abstract
We use a confidential data set that we are not authorised to publish, so we will only highlight our results, not the data themselves.
We study how to explain individuals' risk aversion as a function of several variables, such as socio-professional category (csp), capital, or date (the number of months elapsed between January 2000 and the moment a respondent answered the questionnaire).
Our data were collected on a representative sample of the French population, but confidentiality restrictions do not allow us to publish the full experimental procedure set up to collect them.
Our research question can be summarised as:
"How and why are people risk averse?"
About the study...
http://cyrillanquetuit.free.fr/PDF/EconometricsInShort.pdf
Figure 1: Feynman & NASA [1]
1 Data management and visualisation
1.1 Week 1 - Code Book
Variable: label

academic
agec, age2c, age3c
retireage: retirement age
reg_dom: overseas departments (DOM)
reg11: Ile-de-France (IDF)
reg21: Champagne-Ardenne
reg22: Picardie
reg23: Haute-Normandie
reg24: Centre
reg25: Basse-Normandie
reg26: Bourgogne
reg31: Nord-Pas-de-Calais
reg41: Lorraine
reg42: Alsace
reg43: Franche-Comte
reg52: Pays-de-la-Loire
reg53: Bretagne
reg54: Poitou-Charentes
reg72: Aquitaine
reg73: Midi-Pyrenees
reg74: Limousin
reg82: Rhone-Alpes
reg83: Auvergne
reg91: Languedoc-Roussillon
reg93: PACA
reg94: Corse
reg_miss
codetuu0
codetuu1: fewer than 5,000 inhabitants
codetuu2: 5,000 to 9,999 inhabitants
codetuu3: 10,000 to 19,999 inhabitants
codetuu4: 20,000 to 49,999 inhabitants
codetuu5: 50,000 to 99,999 inhabitants
codetuu6: 100,000 to 199,999 inhabitants
codetuu7: 200,000 to 1,999,999 inhabitants
codetuu8: Paris urban unit
codetuu_miss
date: number of months since January 2000
date_fromaug2008
date_fromnov2010: min(0, date - date of the Fed's QE)
date_fromjanv2015: min(0, date - date of the Grexit threat)
educ1: no diploma
educ2: Brevet des colleges
educ7: vocational diploma (CAP or BEP)
educ3: Baccalaureat
educ4: Bac +1 to +3 (e.g. Licence)
educ5_6: Bac +4 and above (master, business school, engineering school, doctorate)
educ_miss
matri1: married with contract
matri2: married without contract
matri3: PACS (civil union)
matri4: cohabiting
matri5: divorced
matri6: widowed
matri1_2_3_4: in a couple, unspecified
matri5_6_8: separated, unspecified
matri7: single
matri_miss
gender2: female
dependente: number of dependents
csp1: student
csp2: job seeker
csp3: farmer
csp4: manual worker
csp5: employee
csp6: technician
csp7: supervisor (agent de maitrise)
csp8: intermediate profession
csp9: craftsman
csp10: senior executive
csp11: liberal profession
csp12: company head
csp13: retired
csp14: without activity
csp_miss
fpublic: civil servant
product1: knowledge of stocks
product2: knowledge of bonds
product3: knowledge of savings accounts (livrets)
product4: knowledge of structured products
product5: knowledge of FCPI/FCPR
product6: knowledge of none
product7: knowledge of OPCVM (mutual funds)
product8: knowledge of life insurance
product9: knowledge of PEA
productown1: ownership of stocks
productown2: ownership of bonds
productown3: ownership of savings accounts (livrets)
productown4: ownership of structured products
productown5: ownership of FCPI/FCPR
productown6: ownership of none
productown7: ownership of OPCVM
productown8: ownership of life insurance
productown9: ownership of PEA
logCreditEstate: ln(mortgage credit)
CreditEstate_miss
logAssetEstate: ln(total real-estate wealth)
AssetEstate_miss
logAssetFinancial: ln(total financial wealth)
AssetFinancial_miss
logamount: ln(amount invested)
logincome
logincome_miss
logcapital: ln(lottery amount)
duration: horizon
duration2: horizon2
duration3: horizon3
experobj1: experience < 1 year
experobj2_3_4: experience 1 to 5 years
experobj5_6: experience 4 to 5 years
experobj7: experience 6-10 years (5-10 years for AGORA)
experobj8: experience 11-15 years (10-15 for AGORA)
experobj9: experience 16-20 years (15-20 for AGORA)
experobj10: experience 21-25 years (20-25 for AGORA)
experobj11: experience > 25 years
experobj_miss
expersubj1: novice
expersubj2: somewhat experienced
expersubj3: experienced
expersubj4: very experienced
expersubj_miss
ecoclim1: markedly better
ecoclim2: somewhat better
ecoclim3: about the same
ecoclim4: worse
ecoclim5: markedly worse
ecoclim_miss
fluctu1: intolerant of fluctuations
fluctu2: tolerant of small fluctuations
fluctu3: tolerant of moderate fluctuations
fluctu4: tolerant of large fluctuations
gainloss1: liquidates immediately
gainloss2: liquidates only after a significant loss
gainloss3: does not liquidate, holds
gainloss4: reinvests
motivmain1: bequest
motivmain2: retirement
motivmain3: education expenses
motivmain4: safety cushion
motivmain5: extra income
motivmain6: regular income
motivmain7: buying or renovating a house
motivmain8: buying durable goods
motivmain9: growing my wealth
motivmain10: other
motivmain_miss
remove1: liquidates before T/2
remove2: liquidates, but after T/2
remove3: does not liquidate
remove_miss
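Most questionnaire items above are stored as one 0/1 dummy column per answer, plus a _miss column for non-response. Where a single categorical column is more convenient, the dummies can be folded back; a minimal sketch on toy data (not the study's code):

import pandas

# toy example: three respondents, one region dummy set to 1 per row
df = pandas.DataFrame({'reg11': [1, 0, 0],   # Ile-de-France
                       'reg21': [0, 1, 0],   # Champagne-Ardenne
                       'reg93': [0, 0, 1]})  # PACA
# idxmax returns, per row, the name of the column holding the 1
df['region'] = df[['reg11', 'reg21', 'reg93']].idxmax(axis=1)
print(df['region'].tolist())  # ['reg11', 'reg21', 'reg93']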
1.2 Week 2 - First Python programme
Exploring data
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 02 19:24:43 2018
@author: lanquetuit
"""
import pandas
import numpy
# any additional libraries would be imported here

data = pandas.read_csv('estim20171220.csv', low_memory=False)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# set the variables we will be working with to numeric
# (pandas.to_numeric replaces the deprecated convert_objects)
for col in ['choice', 'date', 'logamount', 'capital', 'age']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

# counts and percentages (i.e. frequency distributions) for each variable
print('counts for choice: answer from safe=1 to risky=8')
c1 = data['choice'].value_counts(sort=False)
print(c1)
print('percentages for choice')
p1 = data['choice'].value_counts(sort=False, normalize=True)
print(p1)
print('counts for date: number of months after 1/1/2000 when people answered the questionnaire')
c2 = data['date'].value_counts(sort=False)
print(c2)
print('percentages for date')
p2 = data['date'].value_counts(sort=False, normalize=True)
print(p2)
print('counts for logamount: log of the amount of money')
c3 = data['logamount'].value_counts(sort=False)
print(c3)
print('percentages for logamount')
p3 = data['logamount'].value_counts(sort=False, normalize=True)
print(p3)
print('counts for capital')
c4 = data['capital'].value_counts(sort=False)
print(c4)
print('percentages for capital')
p4 = data['capital'].value_counts(sort=False, normalize=True)
print(p4)

# frequency distributions using the groupby function
ct1 = data.groupby('choice').size()
print(ct1)
pt1 = data.groupby('choice').size() * 100 / len(data)
print(pt1)
ct2 = data.groupby('date').size()
print(ct2)
pt2 = data.groupby('date').size() * 100 / len(data)
print(pt2)
ct3 = data.groupby('logamount').size()
print(ct3)
pt3 = data.groupby('logamount').size() * 100 / len(data)
print(pt3)
ct4 = data.groupby('capital').size()
print(ct4)
pt4 = data.groupby('capital').size() * 100 / len(data)
print(pt4)

# subset data to young adults aged 18 to 25
sub1 = data[(data['age'] >= 18) & (data['age'] <= 25) & (data['date'] > 1)]
# make a copy of the new subsetted data
sub2 = sub1.copy()

# frequency distributions on the new sub2 data frame
print('counts for age')
c5 = sub2['age'].value_counts(sort=False)
print(c5)
print('percentages for age')
p5 = sub2['age'].value_counts(sort=False, normalize=True)
print(p5)
print('counts for date')
c6 = sub2['date'].value_counts(sort=False)
print(c6)
print('percentages for date')
p6 = sub2['date'].value_counts(sort=False, normalize=True)
print(p6)

# upper-case all DataFrame column names: place after the data-loading code above
data.columns = data.columns.str.upper()
# display-format fix to avoid runtime errors: put after the data-loading code above
pandas.set_option('display.float_format', lambda x: '%f' % x)
Variables description
23233
287
counts for choice: answer from safe=1 to risky=8
1    3471
2    2054
3    2001
4    2372
5    2305
6    2041
7    2601
8    6388
dtype: int64
percentages for choice
1    0.149400
2    0.088409
3    0.086127
4    0.102096
5    0.099212
6    0.087849
7    0.111953
8    0.274954
dtype: float64
counts for date: number of months after 1/1/2000 when people answered the questionnaire
70     329
71     140
72      42
73     189
74      84
...
202     56
203    805
204    119
205     21
206    105
Length: 71, dtype: int64
percentages for date
70     0.014161
71     0.006026
72     0.001808
73     0.008135
74     0.003616
...
202    0.002410
203    0.034649
204    0.005122
205    0.000904
206    0.004519
Length: 71, dtype: float64
counts for logamount: log of the amount of money
0.000000     630
10.126671    259
13.081543      7
10.085851      7
13.892473      7
17.216708     35
...
19.673444      7
14.220976     56
10.373522      7
13.017005     28
9.741027      14
Length: 123, dtype: int64
percentages for logamount
0.000000     0.027117
10.126671    0.011148
13.081543    0.000301
10.085851    0.000301
13.892473    0.000301
17.216708    0.001506
...
19.673444    0.000301
14.220976    0.002410
10.373522    0.000301
13.017005    0.001205
9.741027     0.000603
Length: 123, dtype: float64
counts for capital
25000 1371
500 191
50000 859
5000 1143
300000 311
1000 813
75000 331
30000 1379
26000 48
1500 795
22000 59
18000 175
100000 731
14000 193
10000 1126
6000 973
2000 1211
35000 838
2500 592
150000 792
15000 889
7000 558
3000 1543
40000 198
200000 468
28000 44
3500 352
24000 51
450000 290
20000 1771
16000 186
12000 199
8000 795
4000 1604
45000 354
dtype: int64
percentages for capital
25000     0.059011
500       0.008221
50000     0.036973
5000      0.049197
300000    0.013386
1000      0.034993
75000     0.014247
30000     0.059355
26000     0.002066
1500      0.034219
22000     0.002539
18000     0.007532
100000    0.031464
14000     0.008307
10000     0.048466
6000      0.041880
2000      0.052124
35000     0.036069
2500      0.025481
150000    0.034089
15000     0.038265
7000      0.024018
3000      0.066414
40000     0.008522
200000    0.020144
28000     0.001894
3500      0.015151
24000     0.002195
450000    0.012482
20000     0.076228
16000     0.008006
12000     0.008565
8000      0.034219
4000      0.069040
45000     0.015237
dtype: float64
counts for age
12 14
16 7
17 7
18 63
19 364
20 3325
21 6860
22 5362
23 2093
24 1050
25 504
26 329
27 273
28 224
29 154
30 77
...
68 14
87 7
91 7
107 7
111 7
Length: 71, dtype: int64
percentages for age
age
12      0.060259
16      0.030130
17      0.030130
18      0.271166
19      1.566737
20     14.311540
21     29.526966
22     23.079241
23      9.008738
24      4.519434
25      2.169328
26      1.416089
27      1.175053
28      0.964146
29      0.662850
30      0.331425
...
68      0.060259
87      0.030130
91      0.030130
107     0.030130
111     0.030130
Length: 55, dtype: float64
counts for date
date
70 329
71 140
72 42
73 189
74 84
...
202 56
203 805
204 119
205 21
206 105
Length: 71, dtype: int64
percentages for date
date
70     1.416089
71     0.602591
72     0.180777
73     0.813498
74     0.361555
...
202    0.241036
203    3.464899
204    0.512202
205    0.090389
206    0.451943
Length: 71, dtype: float64
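A related check for these frequency tables is to make missing answers visible: value_counts hides NaN by default, and dropna=False keeps it as its own row. A minimal sketch on toy data (not the study's code):

import numpy
import pandas

s = pandas.Series([1, 8, numpy.nan, 4, numpy.nan])
print(s.value_counts(sort=False, dropna=False))  # NaN counted as its own row
print(s.isnull().mean())  # share of missing answers: 0.4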
1.3 Week 3 - Quantiles
Collapsing the age variable into 3 categories
# quantile split (qcut with 3 groups gives a tertile split)
print('AGE - categories - quantiles')
data['AGEGROUP'] = pandas.qcut(data['AGE'], 3, labels=False)
q = data['AGEGROUP'].value_counts(sort=False)
print(q)
# crosstab checking which ages were put into which AGEGROUP
print(pandas.crosstab(data['AGEGROUP'], data['AGE']))
# categorise the quantitative variable with customised splits using the
# cut function: 3 groups (18-20, 21, 22-99)
data['AGEGROUP3'] = pandas.cut(data['AGE'], [18, 20, 21, 99])
q3 = data['AGEGROUP3'].value_counts(sort=False)
print(q3)
Groups description
AGE - categories - quantiles
0    10612
1     5362
2     7217
dtype: int64
age       18   19    20    21    22    23    24   25   26   27   28   29  30  31   32   33  34  35  36  37 ...
AGEGROUP
0         63  364  3325  6860     0     0     0    0    0    0    0    0   0   0    0    0   0   0   0   0 ...
1          0    0     0     0  5362     0     0    0    0    0    0    0   0   0    0    0   0   0   0   0 ...
2          0    0     0     0     0  2093  1050  504  329  273  224  154  77  91  119  133  84  91  42  35 ...
[3 rows x 50 columns]
(18, 20]     3689
(20, 21]     6860
(21, 99]    12579
dtype: int64
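Note the difference between the two splits: qcut chooses breakpoints from the data so the groups have roughly equal sizes, while cut uses the fixed breakpoints supplied, with open left edges. That is why (18, 20] holds 3689 = 364 + 3325 answers: the 63 respondents aged exactly 18 (and the few outside 18-99) fall outside the bins. A minimal sketch on toy data (not the study's code):

import pandas

ages = pandas.Series([18, 19, 20, 21, 22, 23, 30, 40, 55])
# qcut: data-driven breakpoints, roughly equal group sizes (tertiles here)
print(pandas.qcut(ages, 3, labels=False).tolist())
# cut: fixed breakpoints; age 18 falls outside (18, 20] and becomes NaN
print(pandas.cut(ages, [18, 20, 21, 99]).value_counts(sort=False))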
1.4 Week 4 - First plot
Trying to see the influence of age on choice
import seaborn
import matplotlib.pyplot as plt

# basic scatterplot: Q -> Q
scat = seaborn.regplot(x=data['AGE'], y=data['CHOICE'], fit_reg=False)
plt.xlabel('Age')
plt.ylabel('Choice')
plt.title('Scatterplot for the association between age and choice')
plt.show()

# bivariate bar graph: C -> Q
# (factorplot was renamed catplot in seaborn >= 0.9)
seaborn.factorplot(x='AGEGROUP', y='CHOICE', data=data, kind='bar', ci=None)
plt.xlabel('Age group')
plt.ylabel('Choice')
plt.title('Association between age and choice')
plt.show()
Graphical output
Figure 2: Regression of choice on age
Figure 3: Collapsing the age variable into 3 groups shows that younger respondents make riskier choices
2 Data Analysis Tool
2.1 Week 1 - Ordinary Least Squares
We test whether the mean of choice, that is to say the level of risk taken, depends significantly on the age group [18, 21, 22, 99]:
import statsmodels.formula.api as smf

sub3 = data[['CHOICE', 'AGEGROUP']].dropna()
model2 = smf.ols(formula='CHOICE ~ C(AGEGROUP)', data=sub3).fit()
print(model2.summary())
print('means for CHOICE by AGE status')
m2 = sub3.groupby('AGEGROUP').mean()
print(m2)
print('standard deviations for choice by age status')
sd2 = sub3.groupby('AGEGROUP').std()
print(sd2)
The p-value is below 0.05, so the null hypothesis should be rejected: the mean of choice depends on the age group.
                            OLS Regression Results
==============================================================================
Dep. Variable:                 CHOICE   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     128.8
Date:                Fri, 23 Feb 2018   Prob (F-statistic):           2.40e-56
Time:                        18:23:32   Log-Likelihood:                -54835.
No. Observations:               23233   AIC:                         1.097e+05
Df Residuals:                   23230   BIC:                         1.097e+05
Df Model:                           2
====================================================================================
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            5.2154      0.025    209.868      0.000         5.167     5.264
C(AGEGROUP)[T.1]    -0.1100      0.043     -2.563      0.010        -0.194    -0.026
C(AGEGROUP)[T.2]    -0.6123      0.039    -15.673      0.000        -0.689    -0.536
==============================================================================
Omnibus:                      174.414   Durbin-Watson:                   1.360
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2026.222
Skew:                          -0.215   Prob(JB):                         0.00
Kurtosis:                       1.618   Cond. No.                         3.33
==============================================================================
means for CHOICE by AGE status
            CHOICE
AGEGROUP
0         5.215414
1         5.105371
2         4.603098
[3 rows x 1 columns]
standard deviations for choice by age status
            CHOICE
AGEGROUP
0         2.399484
1         2.558091
2         2.790690
[3 rows x 1 columns]
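The F-test only says that at least one group mean differs. With three age groups, a post hoc test is the usual next step to see which pairs differ while controlling the type-I error; a minimal sketch with statsmodels' Tukey HSD (an assumption on our part, not shown in the study):

import statsmodels.stats.multicomp as multi

# pairwise Tukey HSD comparisons of CHOICE across the three age groups
mc = multi.MultiComparison(sub3['CHOICE'], sub3['AGEGROUP'])
print(mc.tukeyhsd().summary())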
2.2 Week 2 - Chi-square analysis
A chi-square test of independence is performed to learn whether there is a dependency between choice and age category. If the chi-square value is too high (small p-value), the null hypothesis should be rejected and we can consider that there is a relationship between choice and age: the distribution of percentages per column is too far from what would be expected if the two variables were independent.
import scipy.stats

# contingency table of observed counts
ct1 = pandas.crosstab(sub3['CHOICE'], sub3['AGEGROUP'])
print(ct1)
# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
# chi-square test of independence
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

Below, the results:
AGEGROUP     0     1     2
CHOICE
1.000000  1009   711  1751
2.000000   953   490   611
3.000000   997   463   541
4.000000  1137   546   689
5.000000  1212   536   557
6.000000  1152   449   440
7.000000  1374   599   628
8.000000  2806  1568  2014
[8 rows x 3 columns]
AGEGROUP         0         1         2
CHOICE
1.000000  0.094831  0.132600  0.242152
2.000000  0.089568  0.091384  0.084497
3.000000  0.093703  0.086348  0.074817
4.000000  0.106861  0.101828  0.095284
5.000000  0.113910  0.099963  0.077029
6.000000  0.108271  0.083737  0.060849
7.000000  0.129135  0.111712  0.086848
8.000000  0.263722  0.292428  0.278523
[8 rows x 3 columns]
chi-square value, p value, expected counts
(914.57765057415486, 3.2472883109757998e-186, 14,
 array([[1589.61132871,  801.08044592, 1080.30822537],
        [ 940.66887617,  474.0476047 ,  639.28351913],
        [ 916.39650497,  461.81560711,  622.78788792],
        [1086.30310334,  547.43959024,  738.25730642],
        [1055.6191624 ,  531.97649895,  717.40433866],
        [ 934.71527569,  471.0473034 ,  635.23742091],
        [1191.17806568,  600.29105152,  809.5308828 ],
        [2925.50768304, 1474.30189816, 1988.1904188 ]]))
So we can conclude that there is a significant dependency between choice and age: according to the chi-square test, these two variables are not independent.
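Because the explanatory variable has three levels, the omnibus chi-square does not say which age groups differ. A Bonferroni-adjusted post hoc comparison can be sketched as follows (illustrative, not the study's code; ct1 is the contingency table above):

import itertools
import scipy.stats

alpha = 0.05 / 3  # Bonferroni adjustment for the 3 pairwise comparisons
for g1, g2 in itertools.combinations(ct1.columns, 2):
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct1[[g1, g2]])
    print(g1, 'vs', g2, ': chi2 =', round(chi2, 1), ', p =', p,
          '(reject)' if p < alpha else '(keep)')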
2.3 Week 3 - Correlation coefficient
Correlation between age and choice
scat2 = seaborn.regplot(x='AGE', y='CHOICE', fit_reg=True, data=data)
plt.xlabel('Age')
plt.ylabel('Choice (increasing risk)')
plt.title('Scatterplot for the Association Between Age and Choice (risk decision level)')
plt.show()
print('association between age and choice')
print(scipy.stats.pearsonr(data['AGE'], data['CHOICE']))

r coefficient and p-value:
Figure 4: Regression of choice on age
association between age and choice
(-0.13200566683109677, 8.2912338219779939e-91)
The p-value is low and the r coefficient is significant, showing a weak negative relationship between age and choice: the age variable explains about r² = 1.7% of the variance of choice.
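The r² figure follows directly from the Pearson coefficient reported above; a quick check:

r = -0.13200566683109677  # Pearson r between AGE and CHOICE, from the output above
print(r ** 2)             # about 0.0174, i.e. roughly 1.7% of the variance of choice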
2.4 Week 4 - Statistical interaction
Relationship between gender and age in explaining choice
data['GENDER2'] = pandas.to_numeric(data['GENDER2'], errors='coerce')
sub1 = data[(data['GENDER2'] == 0)]
sub2 = data[(data['GENDER2'] == 1)]
print('association between age and choice for men')
print(scipy.stats.pearsonr(sub1['AGE'], sub1['CHOICE']))
print(' ')
print('association between age and choice for women')
print(scipy.stats.pearsonr(sub2['AGE'], sub2['CHOICE']))
scat1 = seaborn.regplot(x='AGE', y='CHOICE', fit_reg=True, data=sub1)
plt.xlabel('AGE')
plt.ylabel('CHOICE')
plt.title('Scatterplot for the Association Between age and choice for men')
plt.show()
scat2 = seaborn.regplot(x='AGE', y='CHOICE', fit_reg=True, data=sub2)
plt.xlabel('AGE')
plt.ylabel('CHOICE')
plt.title('Scatterplot for the Association Between age and choice for women')
plt.show()
sub4 = data[['CHOICE', 'GENDER2']].dropna()
ct2 = pandas.crosstab(sub4['CHOICE'], sub4['GENDER2'])
print(ct2)
colsum2 = ct2.sum(axis=0)
colpct2 = ct2 / colsum2
print(colpct2)
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)
Correlation coefficient between age and choice, for men and for women:
association between age and choice for men
(-0.098725407928702102, 2.8739936968443809e-36)
association between age and choice for women
(-0.14075296105345217, 1.1294684721244681e-32)
GENDER2   0.000000  1.000000
CHOICE
1.000000      1967      1504
2.000000      1336       718
3.000000      1240       761
4.000000      1639       733
5.000000      1733       572
6.000000      1543       498
7.000000      1996       605
8.000000      4695      1693
[8 rows x 2 columns]
GENDER2   0.000000  1.000000
CHOICE
1.000000  0.121803  0.212309
2.000000  0.082730  0.101355
3.000000  0.076785  0.107425
4.000000  0.101492  0.103473
5.000000  0.107313  0.080745
6.000000  0.095548  0.070299
7.000000  0.123599  0.085404
8.000000  0.290730  0.238989
[8 rows x 2 columns]
chi-square value, p value, expected counts
(526.03337708698655, 2.022169027878708e-109, 7,
 array([[2412.65351009, 1058.34648991],
        [1427.71256403,  626.28743597],
        [1390.87285327,  610.12714673],
        [1648.75082856,  723.24917144],
        [1602.17987346,  702.82012654],
        [1418.67640856,  622.32359144],
        [1807.92618259,  793.07381741],
        [4440.22777945, 1947.77222055]]))
Causal relation or not?
Figure 5: Regression of choice on age, for men and for women
The correlations between age and choice for men and for women are quite similar, so we cannot conclude that there is a statistical interaction between age and gender in explaining choice. There is still a significant association between gender and choice: women choose the less risky options more often than men.
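A more direct test of the interaction would add an age-by-gender term to the regression; a minimal sketch (an assumption on our part, not run in the study), where a significant AGE:C(GENDER2) coefficient would mean the age effect differs between men and women:

import statsmodels.formula.api as smf

# OLS of choice on age, gender, and their interaction
model3 = smf.ols('CHOICE ~ AGE * C(GENDER2)', data=data).fit()
print(model3.summary())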
3 Machine Learning
3.1 Week 1 - Classification Tree
A classification tree was built on a training sample (60% of 1000 answers). The 8 categories of the variable "choice" are then predicted on a test sample of 400 answers.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
data = pd.read_csv('estim20171220.csv', low_memory=False)

# Predictors: all socio-demographic and financial variables from the code book
predictor_names = [
    'retireage', 'date', 'logamount', 'age', 'logAssetEstate',
    'logAssetFinancial', 'logCreditEstate', 'logincome',
    'motivmain1', 'motivmain2', 'motivmain3', 'motivmain4', 'motivmain5',
    'motivmain6', 'motivmain7', 'motivmain8', 'motivmain9', 'motivmain10',
    'matri1', 'matri2', 'matri3', 'matri4', 'matri5', 'matri6', 'matri7',
    'gender1', 'gender2',
    'prof1', 'prof2', 'prof3', 'prof4', 'prof5', 'prof6', 'prof7', 'prof8',
    'prof9', 'prof10',
    'educ1', 'educ2', 'educ3', 'educ4', 'educ5', 'educ6', 'educ7',
    'expersubj1', 'expersubj2', 'expersubj3', 'expersubj4',
    'ecoclim1', 'ecoclim2', 'ecoclim3', 'ecoclim4', 'ecoclim5',
    'experobj1', 'experobj2', 'experobj3', 'experobj4', 'experobj5',
    'experobj6', 'experobj7', 'experobj8', 'experobj9', 'experobj10',
    'experobj11',
    'fluctu1', 'fluctu2', 'fluctu3', 'fluctu4',
    'gainloss1', 'gainloss2', 'gainloss3',
    'lossgain1', 'lossgain2', 'lossgain3',
    'dependente',
    'productown1', 'productown2', 'productown5', 'productown7',
    'productown8', 'productown9', 'productown_spec',
    'fpublic',
    'csp1', 'csp2', 'csp3', 'csp4', 'csp5', 'csp6', 'csp7', 'csp8', 'csp9',
    'csp10', 'csp11', 'csp12', 'csp13', 'csp14',
    'productown3', 'productown4', 'productown6',
    'remove1', 'remove2', 'remove3', 'gainloss4',
    'codetuu0', 'codetuu1', 'codetuu2', 'codetuu3', 'codetuu4', 'codetuu5',
    'codetuu6', 'codetuu7', 'codetuu8',
]
predic = data[predictor_names].head(1000)
targets = data['choice'].head(1000)

# Split into training (60%) and testing (40%) sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predic, targets, test_size=.4)

# Build the model on the training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
Below is the confusion matrix: 28 answers with choice value "1" and 44 with value "8" were correctly predicted. The total accuracy of the prediction is 27.25%, meaning that 109 of the 400 test answers were correctly classified among the 8 categories (well above the 12.5% expected by chance). If we only ask on which side a respondent falls, risk averse (choice 1 to 4) or risk loving (choice 5 to 8), the matrix gives about 65% correct: knowing all their characteristics, we can tell a risk lover from a risk-averse person roughly two times out of three.
[[28 6 3 4 1 1 3 10]
[10 8 2 1 0 0 1 2]
[14 7 3 3 2 0 3 11]
[12 4 4 3 2 5 4 5]
[12 1 6 1 10 3 4 11]
[ 2 5 2 2 2 2 3 8]
[ 4 4 3 5 8 3 11 16]
[16 9 8 10 10 2 11 44]]
0.2725
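The two-sided reading can be checked directly from the confusion matrix above; a small sketch (the matrix is copied from the output):

import numpy as np

cm = np.array([[28, 6, 3, 4, 1, 1, 3, 10],
               [10, 8, 2, 1, 0, 0, 1, 2],
               [14, 7, 3, 3, 2, 0, 3, 11],
               [12, 4, 4, 3, 2, 5, 4, 5],
               [12, 1, 6, 1, 10, 3, 4, 11],
               [2, 5, 2, 2, 2, 2, 3, 8],
               [4, 4, 3, 5, 8, 3, 11, 16],
               [16, 9, 8, 10, 10, 2, 11, 44]])
# collapse into risk averse (choice 1-4) vs risk loving (choice 5-8)
correct = cm[:4, :4].sum() + cm[4:, 4:].sum()
print(correct / cm.sum())  # 0.65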
3.2 Week 2 - Classification Random Forest
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

"""
Run a different number of trees and see the effect on the accuracy
of the prediction
"""
trees = range(25)
accuracy = np.zeros(25)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.xlabel('Number of trees')
plt.ylabel('Accuracy')
plt.title('Classifier accuracy')
plt.show()
The random forest classifier is not much better than a single classification tree: the confusion matrix and accuracy below do not show a clear improvement.
[[17 3 1 2 5 0 4 13]
[ 8 5 4 0 2 2 2 4]
[ 7 6 2 5 2 2 3 18]
[10 2 3 3 1 4 2 6]
[ 6 2 4 1 4 4 4 11]
[ 2 1 2 6 3 2 4 12]
[ 8 2 5 10 0 8 6 30]
[15 11 2 13 5 10 10 49]]
0.22
[0.01127044 0.08322846 0.12768593 0.12410391 0.05945459 0.10912257
 0.         0.01460772 0.         0.00652329 0.02024125 0.01633897
 0.02443224 0.03289355 0.02465689 0.0154227  0.03441944 0.00441721
 0.00627706 0.00585599 0.00852943 0.00823017 0.00158778 0.01039395
 0.02814369 0.01828384 0.02470821 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.01408595
 0.01213265 0.         0.02016427 0.01463633 0.00868712 0.0008746
 0.0129996  0.00874455 0.00502364 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.00267585
 0.00784494 0.01413203 0.00573719 0.00770269 0.01465239 0.01278192
 0.00718586 0.00910914 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.        ]
The model's feature importances show that "logamount", "age" and "logAssetFinancial" are the most important variables for predicting "choice": age and wealth are linked with risk aversion.
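To read the importance array, each value must be matched with its column name; a small sketch, assuming predic and model from the code above:

# rank predictors by importance; predic.columns is aligned with the array
ranked = sorted(zip(predic.columns, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(name, round(importance, 3))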
The graph below shows that, in our case, increasing the number of trees in the forest does not clearly improve the accuracy of the classifier.
Figure 6: Classifier accuracy as a function of forest size
References
[1] Ottaviani & Myrick. Feynman. Vuibert, 2017.