Risk Analysis Study
lanquetuit.cyril@gmail.com, Université de Cergy Pontoise
Table of Contents
1 Data management and visualisation
  1.1 Week 1 - Code Book
  1.2 Week 2 - First Python programme
  1.3 Week 3 - Quantiles
  1.4 Week 4 - First plot
2 Data Analysis Tool
  2.1 Week 1 - Ordinary Least Squares
  2.2 Week 2 - Chi-square analysis
  2.3 Week 3 - Correlation coefficient
  2.4 Week 4 - Statistical interaction
3 Machine Learning
  3.1 Week 1 - Classification Tree
  3.2 Week 2 - Classification Random Forest
Abstract
We use a confidential data set that we are not authorised to publish, so we will only highlight our results, not the data themselves.
We study how to explain individuals' risk aversion as a function of several variables, such as socio-professional category (csp), capital, or date (the number of months elapsed between January 2000 and the moment a respondent answered the questionnaire).
Our data were collected on a representative sample of the French population, but confidentiality restrictions do not allow us to publish the full experimental procedure set up to collect them.
Our research question can be summarised as:
"How and why are people risk averse?"
About the study...
http://cyrillanquetuit.free.fr/PDF/EconometricsInShort.pdf
Figure 1: Feynman & NASA [1]
1 Data management and visualisation
1.1 Week 1 - Code Book
Variable: label

academic
agec, age2c, age3c
retireage: retirement age
reg_dom: overseas departments (DOM)
reg11: Ile-de-France (IDF)
reg21: Champagne-Ardenne
reg22: Picardie
reg23: Haute-Normandie
reg24: Centre
reg25: Basse-Normandie
reg26: Bourgogne
reg31: Nord-Pas-de-Calais
reg41: Lorraine
reg42: Alsace
reg43: Franche-Comte
reg52: Pays-de-la-Loire
reg53: Bretagne
reg54: Poitou-Charentes
reg72: Aquitaine
reg73: Midi-Pyrenees
reg74: Limousin
reg82: Rhone-Alpes
reg83: Auvergne
reg91: Languedoc-Roussillon
reg93: PACA
reg94: Corse
reg_miss
codetuu0
codetuu1: fewer than 5,000 inhabitants
codetuu2: 5,000 to 9,999 inhabitants
codetuu3: 10,000 to 19,999 inhabitants
codetuu4: 20,000 to 49,999 inhabitants
codetuu5: 50,000 to 99,999 inhabitants
codetuu6: 100,000 to 199,999 inhabitants
codetuu7: 200,000 to 1,999,999 inhabitants
codetuu8: Paris urban unit
codetuu_miss
date: number of months since January 2000
date_fromaug2008
date_fromnov2010: min(0, date - date of the Fed's QE)
date_fromjanv2015: min(0, date - date of the Grexit threat)
educ1: no diploma
educ2: Brevet des colleges
educ7: vocational diploma (CAP or BEP)
educ3: Baccalaureat
educ4: Bac +1 to +3 (e.g. Licence)
educ5_6: Bac +4 and above (master, business school, engineering school, doctorate)
educ_miss
matri1: married with contract
matri2: married without contract
matri3: PACS (civil union)
matri4: cohabiting
matri5: divorced
matri6: widowed
matri1_2_3_4: in a couple, unspecified
matri5_6_8: separated, unspecified
matri7: single
matri_miss
gender2: female
dependente: number of dependents
csp1: student
csp2: job seeker
csp3: farmer
csp4: manual worker
csp5: employee
csp6: technician
csp7: supervisor (agent de maitrise)
csp8: intermediate profession
csp9: craftsman
csp10: senior executive
csp11: liberal profession
csp12: company head
csp13: retired
csp14: without activity
csp_miss
fpublic: civil servant
product1: knowledge of stocks
product2: knowledge of bonds
product3: knowledge of savings accounts (livrets)
product4: knowledge of structured products
product5: knowledge of FCPI/FCPR
product6: knowledge of none
product7: knowledge of OPCVM (mutual funds)
product8: knowledge of life insurance
product9: knowledge of PEA
productown1: ownership of stocks
productown2: ownership of bonds
productown3: ownership of savings accounts (livrets)
productown4: ownership of structured products
productown5: ownership of FCPI/FCPR
productown6: ownership of none
productown7: ownership of OPCVM
productown8: ownership of life insurance
productown9: ownership of PEA
logCreditEstate: ln(mortgage credit)
CreditEstate_miss
logAssetEstate: ln(total real-estate wealth)
AssetEstate_miss
logAssetFinancial: ln(total financial wealth)
AssetFinancial_miss
logamount: ln(amount invested)
logincome
logincome_miss
logcapital: ln(lottery amount)
duration: horizon
duration2: horizon2
duration3: horizon3
experobj1: experience < 1 year
experobj2_3_4: experience 1 to 5 years
experobj5_6: experience 4 to 5 years
experobj7: experience 6-10 years (5-10 years for AGORA)
experobj8: experience 11-15 years (10-15 for AGORA)
experobj9: experience 16-20 years (15-20 for AGORA)
experobj10: experience 21-25 years (20-25 for AGORA)
experobj11: experience > 25 years
experobj_miss
expersubj1: novice
expersubj2: somewhat experienced
expersubj3: experienced
expersubj4: very experienced
expersubj_miss
ecoclim1: markedly better
ecoclim2: somewhat better
ecoclim3: about the same
ecoclim4: worse
ecoclim5: markedly worse
ecoclim_miss
fluctu1: intolerant of fluctuations
fluctu2: tolerant of small fluctuations
fluctu3: tolerant of moderate fluctuations
fluctu4: tolerant of large fluctuations
gainloss1: liquidates immediately
gainloss2: liquidates only after a significant loss
gainloss3: does not liquidate, holds
gainloss4: reinvests
motivmain1: bequest
motivmain2: retirement
motivmain3: education expenses
motivmain4: safety cushion
motivmain5: extra income
motivmain6: regular income
motivmain7: buying or renovating a house
motivmain8: buying durable goods
motivmain9: growing my wealth
motivmain10: other
motivmain_miss
remove1: liquidates before T/2
remove2: liquidates, but after T/2
remove3: does not liquidate
remove_miss
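Most questionnaire items above are stored as one 0/1 dummy column per answer, plus a _miss column for non-response. Where a single categorical column is more convenient, the dummies can be folded back; a minimal sketch on toy data (not the study's code):

import pandas

# toy example: three respondents, one region dummy set to 1 per row
df = pandas.DataFrame({'reg11': [1, 0, 0],   # Ile-de-France
                       'reg21': [0, 1, 0],   # Champagne-Ardenne
                       'reg93': [0, 0, 1]})  # PACA
# idxmax returns, per row, the name of the column holding the 1
df['region'] = df[['reg11', 'reg21', 'reg93']].idxmax(axis=1)
print(df['region'].tolist())  # ['reg11', 'reg21', 'reg93']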
1.2 Week 2 - First Python programme
Exploring data
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 02 19:24:43 2018
@author: lanquetuit
"""
import pandas
import numpy
# any additional libraries would be imported here

data = pandas.read_csv('estim20171220.csv', low_memory=False)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# set the variables we will be working with to numeric
# (pandas.to_numeric replaces the deprecated convert_objects)
for col in ['choice', 'date', 'logamount', 'capital', 'age']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

# counts and percentages (i.e. frequency distributions) for each variable
print('counts for choice: answer from safe=1 to risky=8')
c1 = data['choice'].value_counts(sort=False)
print(c1)
print('percentages for choice')
p1 = data['choice'].value_counts(sort=False, normalize=True)
print(p1)
print('counts for date: number of months after 1/1/2000 when people answered the questionnaire')
c2 = data['date'].value_counts(sort=False)
print(c2)
print('percentages for date')
p2 = data['date'].value_counts(sort=False, normalize=True)
print(p2)
print('counts for logamount: log of the amount of money')
c3 = data['logamount'].value_counts(sort=False)
print(c3)
print('percentages for logamount')
p3 = data['logamount'].value_counts(sort=False, normalize=True)
print(p3)
print('counts for capital')
c4 = data['capital'].value_counts(sort=False)
print(c4)
print('percentages for capital')
p4 = data['capital'].value_counts(sort=False, normalize=True)
print(p4)

# frequency distributions using the groupby function
ct1 = data.groupby('choice').size()
print(ct1)
pt1 = data.groupby('choice').size() * 100 / len(data)
print(pt1)
ct2 = data.groupby('date').size()
print(ct2)
pt2 = data.groupby('date').size() * 100 / len(data)
print(pt2)
ct3 = data.groupby('logamount').size()
print(ct3)
pt3 = data.groupby('logamount').size() * 100 / len(data)
print(pt3)
ct4 = data.groupby('capital').size()
print(ct4)
pt4 = data.groupby('capital').size() * 100 / len(data)
print(pt4)

# subset data to young adults aged 18 to 25
sub1 = data[(data['age'] >= 18) & (data['age'] <= 25) & (data['date'] > 1)]
# make a copy of the new subsetted data
sub2 = sub1.copy()

# frequency distributions on the new sub2 data frame
print('counts for age')
c5 = sub2['age'].value_counts(sort=False)
print(c5)
print('percentages for age')
p5 = sub2['age'].value_counts(sort=False, normalize=True)
print(p5)
print('counts for date')
c6 = sub2['date'].value_counts(sort=False)
print(c6)
print('percentages for date')
p6 = sub2['date'].value_counts(sort=False, normalize=True)
print(p6)

# upper-case all DataFrame column names: place after the data-loading code above
data.columns = data.columns.str.upper()
# display-format fix to avoid runtime errors: put after the data-loading code above
pandas.set_option('display.float_format', lambda x: '%f' % x)
Variables description
23233
287
counts for choice: answer from safe=1 to risky=8
1    3471
2    2054
3    2001
4    2372
5    2305
6    2041
7    2601
8    6388
dtype: int64
percentages for choice
1    0.149400
2    0.088409
3    0.086127
4    0.102096
5    0.099212
6    0.087849
7    0.111953
8    0.274954
dtype: float64
counts for date: number of months after 1/1/2000 when people answered the questionnaire
70     329
71     140
72      42
73     189
74      84
...
202     56
203    805
204    119
205     21
206    105
Length: 71, dtype: int64
percentages for date
70     0.014161
71     0.006026
72     0.001808
73     0.008135
74     0.003616
...
202    0.002410
203    0.034649
204    0.005122
205    0.000904
206    0.004519
Length: 71, dtype: float64
counts for logamount: log of the amount of money
0.000000     630
10.126671    259
13.081543      7
10.085851      7
13.892473      7
17.216708     35
...
19.673444      7
14.220976     56
10.373522      7
13.017005     28
9.741027      14
Length: 123, dtype: int64
percentages for logamount
0.000000     0.027117
10.126671    0.011148
13.081543    0.000301
10.085851    0.000301
13.892473    0.000301
17.216708    0.001506
...
19.673444    0.000301
14.220976    0.002410
10.373522    0.000301
13.017005    0.001205
9.741027     0.000603
Length: 123, dtype: float64
counts for capital
25000 1371
500 191
50000 859
5000 1143
300000 311
1000 813
75000 331
30000 1379
26000 48
1500 795
22000 59
18000 175
100000 731
14000 193
10000 1126
6000 973
2000 1211
35000 838
2500 592
150000 792
15000 889
7000 558
3000 1543
40000 198
200000 468
28000 44
3500 352
24000 51
450000 290
20000 1771
16000 186
12000 199
8000 795
4000 1604
45000 354
dtype: int64
percentages for capital
25000     0.059011
500       0.008221
50000     0.036973
5000      0.049197
300000    0.013386
1000      0.034993
75000     0.014247
30000     0.059355
26000     0.002066
1500      0.034219
22000     0.002539
18000     0.007532
100000    0.031464
14000     0.008307
10000     0.048466
6000      0.041880
2000      0.052124
35000     0.036069
2500      0.025481
150000    0.034089
15000     0.038265
7000      0.024018
3000      0.066414
40000     0.008522
200000    0.020144
28000     0.001894
3500      0.015151
24000     0.002195
450000    0.012482
20000     0.076228
16000     0.008006
12000     0.008565
8000      0.034219
4000      0.069040
45000     0.015237
dtype: float64
counts for age
12 14
16 7
17 7
18 63
19 364
20 3325
21 6860
22 5362
23 2093
24 1050
25 504
26 329
27 273
28 224
29 154
30 77
...
68 14
87 7
91 7
107 7
111 7
Length: 71, dtype: int64
percentages for age
age
12      0.060259
16      0.030130
17      0.030130
18      0.271166
19      1.566737
20     14.311540
21     29.526966
22     23.079241
23      9.008738
24      4.519434
25      2.169328
26      1.416089
27      1.175053
28      0.964146
29      0.662850
30      0.331425
...
68      0.060259
87      0.030130
91      0.030130
107     0.030130
111     0.030130
Length: 55, dtype: float64
counts for date
date
70 329
71 140
72 42
73 189
74 84
...
202 56
203 805
204 119
205 21
206 105
Length: 71, dtype: int64
percentages for date
date
70     1.416089
71     0.602591
72     0.180777
73     0.813498
74     0.361555
...
202    0.241036
203    3.464899
204    0.512202
205    0.090389
206    0.451943
Length: 71, dtype: float64
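A related check for these frequency tables is to make missing answers visible: value_counts hides NaN by default, and dropna=False keeps it as its own row. A minimal sketch on toy data (not the study's code):

import numpy
import pandas

s = pandas.Series([1, 8, numpy.nan, 4, numpy.nan])
print(s.value_counts(sort=False, dropna=False))  # NaN counted as its own row
print(s.isnull().mean())  # share of missing answers: 0.4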
1.3 Week 3 - Quantiles
Collapsing the age variable into 3 categories
# quantile split (qcut with 3 groups gives a tertile split)
print('AGE - categories - quantiles')
data['AGEGROUP'] = pandas.qcut(data['AGE'], 3, labels=False)
q = data['AGEGROUP'].value_counts(sort=False)
print(q)
# crosstab checking which ages were put into which AGEGROUP
print(pandas.crosstab(data['AGEGROUP'], data['AGE']))
# categorise the quantitative variable with customised splits using the
# cut function: 3 groups (18-20, 21, 22-99)
data['AGEGROUP3'] = pandas.cut(data['AGE'], [18, 20, 21, 99])
q3 = data['AGEGROUP3'].value_counts(sort=False)
print(q3)
Groups description
AGE - categories - quantiles
0    10612
1     5362
2     7217
dtype: int64
age       18   19    20    21    22    23    24   25   26   27   28   29  30  31   32   33  34  35  36  37 ...
AGEGROUP
0         63  364  3325  6860     0     0     0    0    0    0    0    0   0   0    0    0   0   0   0   0 ...
1          0    0     0     0  5362     0     0    0    0    0    0    0   0   0    0    0   0   0   0   0 ...
2          0    0     0     0     0  2093  1050  504  329  273  224  154  77  91  119  133  84  91  42  35 ...
[3 rows x 50 columns]
(18, 20]     3689
(20, 21]     6860
(21, 99]    12579
dtype: int64
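Note the difference between the two splits: qcut chooses breakpoints from the data so the groups have roughly equal sizes, while cut uses the fixed breakpoints supplied, with open left edges. That is why (18, 20] holds 3689 = 364 + 3325 answers: the 63 respondents aged exactly 18 (and the few outside 18-99) fall outside the bins. A minimal sketch on toy data (not the study's code):

import pandas

ages = pandas.Series([18, 19, 20, 21, 22, 23, 30, 40, 55])
# qcut: data-driven breakpoints, roughly equal group sizes (tertiles here)
print(pandas.qcut(ages, 3, labels=False).tolist())
# cut: fixed breakpoints; age 18 falls outside (18, 20] and becomes NaN
print(pandas.cut(ages, [18, 20, 21, 99]).value_counts(sort=False))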
1.4 Week 4 - First plot
Trying to see the influence of age on choice
import seaborn
import matplotlib.pyplot as plt

# basic scatterplot: Q -> Q
scat = seaborn.regplot(x=data['AGE'], y=data['CHOICE'], fit_reg=False)
plt.xlabel('Age')
plt.ylabel('Choice')
plt.title('Scatterplot for the association between age and choice')
plt.show()

# bivariate bar graph: C -> Q
# (factorplot was renamed catplot in seaborn >= 0.9)
seaborn.factorplot(x='AGEGROUP', y='CHOICE', data=data, kind='bar', ci=None)
plt.xlabel('Age group')
plt.ylabel('Choice')
plt.title('Association between age and choice')
plt.show()
Graphical output
Figure 2: Regression of choice on age
Figure 3: Collapsing the age variable into 3 groups shows that younger respondents make riskier choices
2 Data Analysis Tool
2.1 Week 1 - Ordinary Least Squares
We test whether the mean of choice, that is to say the level of risk taken, depends significantly on the age group [18, 21, 22, 99]:
import statsmodels.formula.api as smf

sub3 = data[['CHOICE', 'AGEGROUP']].dropna()
model2 = smf.ols(formula='CHOICE ~ C(AGEGROUP)', data=sub3).fit()
print(model2.summary())
print('means for CHOICE by AGE status')
m2 = sub3.groupby('AGEGROUP').mean()
print(m2)
print('standard deviations for choice by age status')
sd2 = sub3.groupby('AGEGROUP').std()
print(sd2)
The p-value is below 0.05, so the null hypothesis should be rejected: the mean of choice depends on the age group.
                            OLS Regression Results
==============================================================================
Dep. Variable:                 CHOICE   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     128.8
Date:                Fri, 23 Feb 2018   Prob (F-statistic):           2.40e-56
Time:                        18:23:32   Log-Likelihood:                -54835.
No. Observations:               23233   AIC:                         1.097e+05
Df Residuals:                   23230   BIC:                         1.097e+05
Df Model:                           2
====================================================================================
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            5.2154      0.025    209.868      0.000         5.167     5.264
C(AGEGROUP)[T.1]    -0.1100      0.043     -2.563      0.010        -0.194    -0.026
C(AGEGROUP)[T.2]    -0.6123      0.039    -15.673      0.000        -0.689    -0.536
==============================================================================
Omnibus:                      174.414   Durbin-Watson:                   1.360
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2026.222
Skew:                          -0.215   Prob(JB):                         0.00
Kurtosis:                       1.618   Cond. No.                         3.33
==============================================================================
means for CHOICE by AGE status
            CHOICE
AGEGROUP
0         5.215414
1         5.105371
2         4.603098
[3 rows x 1 columns]
standard deviations for choice by age status
            CHOICE
AGEGROUP
0         2.399484
1         2.558091
2         2.790690
[3 rows x 1 columns]
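The F-test only says that at least one group mean differs. With three age groups, a post hoc test is the usual next step to see which pairs differ while controlling the type-I error; a minimal sketch with statsmodels' Tukey HSD (an assumption on our part, not shown in the study):

import statsmodels.stats.multicomp as multi

# pairwise Tukey HSD comparisons of CHOICE across the three age groups
mc = multi.MultiComparison(sub3['CHOICE'], sub3['AGEGROUP'])
print(mc.tukeyhsd().summary())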
2.2 Week 2 - Chi-square analysis
A chi-square test of independence is performed to learn whether there is a dependency between choice and age category. If the chi-square value is too high (small p-value), the null hypothesis should be rejected and we can consider that there is a relationship between choice and age: the distribution of percentages per column is too far from what would be expected if the two variables were independent.
import scipy.stats

# contingency table of observed counts
ct1 = pandas.crosstab(sub3['CHOICE'], sub3['AGEGROUP'])
print(ct1)
# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
# chi-square test of independence
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

Below, the results:
AGEGROUP     0     1     2
CHOICE
1.000000  1009   711  1751
2.000000   953   490   611
3.000000   997   463   541
4.000000  1137   546   689
5.000000  1212   536   557
6.000000  1152   449   440
7.000000  1374   599   628
8.000000  2806  1568  2014
[8 rows x 3 columns]
AGEGROUP         0         1         2
CHOICE
1.000000  0.094831  0.132600  0.242152
2.000000  0.089568  0.091384  0.084497
3.000000  0.093703  0.086348  0.074817
4.000000  0.106861  0.101828  0.095284
5.000000  0.113910  0.099963  0.077029
6.000000  0.108271  0.083737  0.060849
7.000000  0.129135  0.111712  0.086848
8.000000  0.263722  0.292428  0.278523
[8 rows x 3 columns]
chi-square value, p value, expected counts
(914.57765057415486, 3.2472883109757998e-186, 14,
 array([[1589.61132871,  801.08044592, 1080.30822537],
        [ 940.66887617,  474.0476047 ,  639.28351913],
        [ 916.39650497,  461.81560711,  622.78788792],
        [1086.30310334,  547.43959024,  738.25730642],
        [1055.6191624 ,  531.97649895,  717.40433866],
        [ 934.71527569,  471.0473034 ,  635.23742091],
        [1191.17806568,  600.29105152,  809.5308828 ],
        [2925.50768304, 1474.30189816, 1988.1904188 ]]))
So we can conclude that there is a significant dependency between choice and age: according to the chi-square test, these two variables are not independent.
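Because the explanatory variable has three levels, the omnibus chi-square does not say which age groups differ. A Bonferroni-adjusted post hoc comparison can be sketched as follows (illustrative, not the study's code; ct1 is the contingency table above):

import itertools
import scipy.stats

alpha = 0.05 / 3  # Bonferroni adjustment for the 3 pairwise comparisons
for g1, g2 in itertools.combinations(ct1.columns, 2):
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct1[[g1, g2]])
    print(g1, 'vs', g2, ': chi2 =', round(chi2, 1), ', p =', p,
          '(reject)' if p < alpha else '(keep)')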
2.3 Week 3 - Correlation coefficient
Correlation between age and choice
scat2 = seaborn.regplot(x='AGE', y='CHOICE', fit_reg=True, data=data)
plt.xlabel('Age')
plt.ylabel('Choice (increasing risk)')
plt.title('Scatterplot for the Association Between Age and Choice (risk decision level)')
plt.show()
print('association between age and choice')
print(scipy.stats.pearsonr(data['AGE'], data['CHOICE']))

r coefficient and p-value:
Figure 4: Regression of choice on age
association between age and choice
(-0.13200566683109677, 8.2912338219779939e-91)
The p-value is low and the r coefficient is significant, showing a weak negative relationship between age and choice: the age variable explains about r² = 1.7% of the variance of choice.
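The r² figure follows directly from the Pearson coefficient reported above; a quick check:

r = -0.13200566683109677  # Pearson r between AGE and CHOICE, from the output above
print(r ** 2)             # about 0.0174, i.e. roughly 1.7% of the variance of choice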
2.4 Week 4 - Statistical interaction
Relationship between gender and age in explaining choice
data['GENDER2'] = pandas.to_numeric(data['GENDER2'], errors='coerce')
sub1 = data[(data['GENDER2'] == 0)]
sub2 = data[(data['GENDER2'] == 1)]
print('association between age and choice for men')
print(scipy.stats.pearsonr(sub1['AGE'], sub1['CHOICE']))
print(' ')
print('association between age and choice for women')
print(scipy.stats.pearsonr(sub2['AGE'], sub2['CHOICE']))
scat1 = seaborn.regplot(x='AGE', y='CHOICE', fit_reg=True, data=sub1)
plt.xlabel('AGE')
plt.ylabel('CHOICE')
plt.title('Scatterplot for the Association Between age and choice for men')
plt.show()
scat2 = seaborn.regplot(x='AGE', y='CHOICE', fit_reg=True, data=sub2)
plt.xlabel('AGE')
plt.ylabel('CHOICE')
plt.title('Scatterplot for the Association Between age and choice for women')
plt.show()
sub4 = data[['CHOICE', 'GENDER2']].dropna()
ct2 = pandas.crosstab(sub4['CHOICE'], sub4['GENDER2'])
print(ct2)
colsum2 = ct2.sum(axis=0)
colpct2 = ct2 / colsum2
print(colpct2)
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)
Correlation coefficient between age and choice, for men and for women:
association between age and choice for men
(-0.098725407928702102, 2.8739936968443809e-36)
association between age and choice for women
(-0.14075296105345217, 1.1294684721244681e-32)
GENDER2   0.000000  1.000000
CHOICE
1.000000      1967      1504
2.000000      1336       718
3.000000      1240       761
4.000000      1639       733
5.000000      1733       572
6.000000      1543       498
7.000000      1996       605
8.000000      4695      1693
[8 rows x 2 columns]
GENDER2   0.000000  1.000000
CHOICE
1.000000  0.121803  0.212309
2.000000  0.082730  0.101355
3.000000  0.076785  0.107425
4.000000  0.101492  0.103473
5.000000  0.107313  0.080745
6.000000  0.095548  0.070299
7.000000  0.123599  0.085404
8.000000  0.290730  0.238989
[8 rows x 2 columns]
chi-square value, p value, expected counts
(526.03337708698655, 2.022169027878708e-109, 7,
 array([[2412.65351009, 1058.34648991],
        [1427.71256403,  626.28743597],
        [1390.87285327,  610.12714673],
        [1648.75082856,  723.24917144],
        [1602.17987346,  702.82012654],
        [1418.67640856,  622.32359144],
        [1807.92618259,  793.07381741],
        [4440.22777945, 1947.77222055]]))
Causal relation or not?
Figure 5: Regression of choice on age, for men and for women
The correlations between age and choice for men and for women are quite similar, so we cannot conclude that there is a statistical interaction between age and gender in explaining choice. There is still a significant association between gender and choice: women choose the less risky options more often than men.
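A more direct test of the interaction would add an age-by-gender term to the regression; a minimal sketch (an assumption on our part, not run in the study), where a significant AGE:C(GENDER2) coefficient would mean the age effect differs between men and women:

import statsmodels.formula.api as smf

# OLS of choice on age, gender, and their interaction
model3 = smf.ols('CHOICE ~ AGE * C(GENDER2)', data=data).fit()
print(model3.summary())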
3 Machine Learning
3.1 Week 1 - Classification Tree
A classification tree was built on a training sample (60% of 1000 answers). The 8 categories of the variable "choice" are then predicted on a test sample of 400 answers.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
data = pd.read_csv('estim20171220.csv', low_memory=False)

# Predictors: all socio-demographic and financial variables from the code book
predictor_names = [
    'retireage', 'date', 'logamount', 'age', 'logAssetEstate',
    'logAssetFinancial', 'logCreditEstate', 'logincome',
    'motivmain1', 'motivmain2', 'motivmain3', 'motivmain4', 'motivmain5',
    'motivmain6', 'motivmain7', 'motivmain8', 'motivmain9', 'motivmain10',
    'matri1', 'matri2', 'matri3', 'matri4', 'matri5', 'matri6', 'matri7',
    'gender1', 'gender2',
    'prof1', 'prof2', 'prof3', 'prof4', 'prof5', 'prof6', 'prof7', 'prof8',
    'prof9', 'prof10',
    'educ1', 'educ2', 'educ3', 'educ4', 'educ5', 'educ6', 'educ7',
    'expersubj1', 'expersubj2', 'expersubj3', 'expersubj4',
    'ecoclim1', 'ecoclim2', 'ecoclim3', 'ecoclim4', 'ecoclim5',
    'experobj1', 'experobj2', 'experobj3', 'experobj4', 'experobj5',
    'experobj6', 'experobj7', 'experobj8', 'experobj9', 'experobj10',
    'experobj11',
    'fluctu1', 'fluctu2', 'fluctu3', 'fluctu4',
    'gainloss1', 'gainloss2', 'gainloss3',
    'lossgain1', 'lossgain2', 'lossgain3',
    'dependente',
    'productown1', 'productown2', 'productown5', 'productown7',
    'productown8', 'productown9', 'productown_spec',
    'fpublic',
    'csp1', 'csp2', 'csp3', 'csp4', 'csp5', 'csp6', 'csp7', 'csp8', 'csp9',
    'csp10', 'csp11', 'csp12', 'csp13', 'csp14',
    'productown3', 'productown4', 'productown6',
    'remove1', 'remove2', 'remove3', 'gainloss4',
    'codetuu0', 'codetuu1', 'codetuu2', 'codetuu3', 'codetuu4', 'codetuu5',
    'codetuu6', 'codetuu7', 'codetuu8',
]
predic = data[predictor_names].head(1000)
targets = data['choice'].head(1000)

# Split into training (60%) and testing (40%) sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predic, targets, test_size=.4)

# Build the model on the training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
Below is the confusion matrix: 28 answers with choice value "1" and 44 with value "8" were correctly predicted. The total accuracy of the prediction is 27.25%, meaning that 109 of the 400 test answers were correctly classified among the 8 categories (well above the 12.5% expected by chance). If we only ask on which side a respondent falls, risk averse (choice 1 to 4) or risk loving (choice 5 to 8), the matrix gives about 65% correct: knowing all their characteristics, we can tell a risk lover from a risk-averse person roughly two times out of three.
[[28 6 3 4 1 1 3 10]
[10 8 2 1 0 0 1 2]
[14 7 3 3 2 0 3 11]
[12 4 4 3 2 5 4 5]
[12 1 6 1 10 3 4 11]
[ 2 5 2 2 2 2 3 8]
[ 4 4 3 5 8 3 11 16]
[16 9 8 10 10 2 11 44]]
0.2725
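The two-sided reading can be checked directly from the confusion matrix above; a small sketch (the matrix is copied from the output):

import numpy as np

cm = np.array([[28, 6, 3, 4, 1, 1, 3, 10],
               [10, 8, 2, 1, 0, 0, 1, 2],
               [14, 7, 3, 3, 2, 0, 3, 11],
               [12, 4, 4, 3, 2, 5, 4, 5],
               [12, 1, 6, 1, 10, 3, 4, 11],
               [2, 5, 2, 2, 2, 2, 3, 8],
               [4, 4, 3, 5, 8, 3, 11, 16],
               [16, 9, 8, 10, 10, 2, 11, 44]])
# collapse into risk averse (choice 1-4) vs risk loving (choice 5-8)
correct = cm[:4, :4].sum() + cm[4:, 4:].sum()
print(correct / cm.sum())  # 0.65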
3.2 Week 2 - Classification Random Forest
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

"""
Run a different number of trees and see the effect on the accuracy
of the prediction
"""
trees = range(25)
accuracy = np.zeros(25)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.xlabel('Number of trees')
plt.ylabel('Accuracy')
plt.title('Classifier accuracy')
plt.show()
The random forest classifier is not much better than a single classification tree: the confusion matrix and accuracy below do not show a clear improvement.
[[17 3 1 2 5 0 4 13]
[ 8 5 4 0 2 2 2 4]
[ 7 6 2 5 2 2 3 18]
[10 2 3 3 1 4 2 6]
[ 6 2 4 1 4 4 4 11]
[ 2 1 2 6 3 2 4 12]
[ 8 2 5 10 0 8 6 30]
[15 11 2 13 5 10 10 49]]
0.22
[0.01127044 0.08322846 0.12768593 0.12410391 0.05945459 0.10912257
 0.         0.01460772 0.         0.00652329 0.02024125 0.01633897
 0.02443224 0.03289355 0.02465689 0.0154227  0.03441944 0.00441721
 0.00627706 0.00585599 0.00852943 0.00823017 0.00158778 0.01039395
 0.02814369 0.01828384 0.02470821 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.01408595
 0.01213265 0.         0.02016427 0.01463633 0.00868712 0.0008746
 0.0129996  0.00874455 0.00502364 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.00267585
 0.00784494 0.01413203 0.00573719 0.00770269 0.01465239 0.01278192
 0.00718586 0.00910914 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.        ]
The model's feature importances show that "logamount", "age" and "logAssetFinancial" are the most important variables for predicting "choice": age and wealth are linked with risk aversion.
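To read the importance array, each value must be matched with its column name; a small sketch, assuming predic and model from the code above:

# rank predictors by importance; predic.columns is aligned with the array
ranked = sorted(zip(predic.columns, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(name, round(importance, 3))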
The graph below shows that, in our case, increasing the number of trees in the forest does not clearly improve the accuracy of the classifier.
Figure 6: Classifier accuracy as a function of forest size
References
[1] Ottaviani & Myrick. Feynman. Vuibert, 2017.