Machine learning and statistical learning
Ewen Gallic ewen.gallic@gmail.com

MASTER in Economics - Track EBDS - 2nd Year 2020-2021


This part presents some concepts of statistical learning, through the prism of regression.

1. Some context

Model specification

In a regression problem, the aim is to understand how a response variable $y$ varies, conditionally on the available information on some predictors $x$.

Let us take an example: that of the salaries of Professors in the US in 2008-09.

The salary of a professor may be linked, among other things, to the number of years since they obtained their Ph.D.

[Figure: Scatter plot of the nine-month salary in 2008-09 (in dollars) against years since Ph.D, showing the observations, the conditional mean, a linear regression fit, and a loess fit.]


Salary as a function of years since Ph.D

Here, the linear regression suggests that, on average, the salary increases with the number of years since Ph.D:

• the slope of 985.3 indicates that each additional year since Ph.D leads to an increase of about 985 dollars in nine-month salary.

But the relationship does not seem to be linear...


It should be noted here that:

• the regression analysis does not depend on a generative model here (a model explaining how the data are generated)

• there are no causal claims regarding the way mean salary would change if the number of years since Ph.D were altered

• there is no statistical inference

We could add some predictors to the model to get a better story on what is going on with salary:

• some omitted variables may play an important role in explaining the variations.


Salary as a function of years since Ph.D

We can also perform some regression analysis if the response variable is categorical.

Let us look at the salary in a different way: let us split it into two categories, either < $100k or ≥ $100k.

For each decile of years since Ph.D, we can plot the conditional proportions.

[Figure: Conditional proportions of salary type (< $100k vs. ≥ $100k) within each decile of years since Ph.D.]


Levels of regression analysis

Berk (2008) mentions three levels of regression analysis:

• Level I regression analysis:
  – aims at describing the data
  – assumption free
  – should not be neglected

• Level II regression analysis:
  – based on statistical inference
  – uses results from level I regression analysis
  – use with real data may be challenging
  – allows one to make predictions

• Level III regression analysis:
  – based on causal inference
  – uses level I analysis, sometimes coupled with level II
  – relies more on algorithmic methods than on model-based methods.

2. The linear regression

Some references

• Berk (2008). Statistical learning from a regression perspective, volume 14. Springer.

• Cornillon and Matzner-Løber (2007). Régression: théorie et applications. Springer.

• James et al. (2013). An introduction to statistical learning, volume 112. Springer.


Linear regression combines level I and level II perspectives.

It is useful when one wants to predict a quantitative response.

A lot of newer statistical learning approaches can be seen as generalizations or extensions of linear regression, as noted in James et al. (2013).


2.1 Simple linear regression


Let us first consider the case of simple linear regression.

We aim at predicting a quantitative response variable $y$ using a single predictor $x$ (or regressor).

$y$ is an $n \times 1$ numerical response variable, where $n$ represents the number of observations;

$x$ is an $n \times 1$ predictor.

We assume there exists a linear relationship between $y$ and $x$ such that:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $\varepsilon_i$ is an error term normally distributed with zero mean and variance $\sigma^2$, i.e., $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
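To make the setup concrete, here is a minimal sketch in Python that simulates data from the model in Eq. 1; the parameter values are hypothetical, not estimated from the Professors data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true parameters (for illustration only)
beta_0, beta_1, sigma = 9.0, 0.1, 1.5
n = 100

# A single predictor x and the response y = beta_0 + beta_1 * x + eps
x = rng.uniform(0, 40, size=n)          # e.g., years since Ph.D
eps = rng.normal(0, sigma, size=n)      # eps_i ~ N(0, sigma^2)
y = beta_0 + beta_1 * x + eps
```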


Principle

In Eq. 1, the coefficients (or parameters) $\beta_0$ (i.e., the constant) and $\beta_1$ (i.e., the slope) are unknown parameters to be estimated.

These coefficients are estimated using a training sample.

The estimates of $\beta_0$ and $\beta_1$ are, respectively, $\hat{\beta}_0$ and $\hat{\beta}_1$.

Once they are estimated using a learning procedure (in this case, linear regression), they can be used to predict the value of $y$ for some value $x_0$:

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \qquad (2)$$

2.1.1 Estimating the coefficients

To estimate $\beta_0$ and $\beta_1$, we rely on a set of training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$.

For example, let us go back to our data describing the nine-month salary of professors (the response variable) and look at the relationship between the salary and years since Ph.D ($x$).


Figure 1: Varying the intercept (intercept 8, slope 0.1). Figure 2: Varying the slope (intercept 9.17, slope −0.05). [Scatter plots of salary (in $10,000) against years since Ph.D.]

There is an infinity of possible values that one can pick for $\hat{\beta}_0$ and $\hat{\beta}_1$.

However, we want to find an estimation that leads to a line being as close as possible to the points: but what does “close” mean?



The most common metric we want to minimize is known as the least squares criterion.

The predictions $\hat{y}_i$ for each of the $x_i$, $i = 1, \ldots, n$, are given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

Let $e_i = y_i - \hat{y}_i$ be the $i$th residual, i.e., the difference between the observed value and its prediction by the linear model.

The residual sum of squares is defined as:

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2. \qquad (3)$$

We aim at minimizing this metric.

It can easily be shown that the minimization of the RSS leads to:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad (4)$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
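As an illustration, here is a minimal sketch of Eq. 4; the helper name ols_simple is ours, and the commented usage reuses the simulated x and y from the earlier sketch:

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form least squares estimates for y = b0 + b1 * x (Eq. 4)."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    beta_1_hat = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
    beta_0_hat = y_bar - beta_1_hat * x_bar
    return beta_0_hat, beta_1_hat

# beta_0_hat, beta_1_hat = ols_simple(x, y)   # x, y from the simulation sketch
# y_0_hat = beta_0_hat + beta_1_hat * 10.0    # prediction at x_0 = 10 (Eq. 2)
```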

Least squares coefficient estimates

Here, the least squares coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are, respectively, 9.1719 and 0.0985.

Figure 3: Fit of the least squares regression of the nine-month salary of Professors (in $10,000) onto years since Ph.D (intercept: 9.1719, slope: 0.0985).

We can have a look at the RSS when we vary the values of $\beta_0$ and $\beta_1$:

Figure 4: Surface plot of the RSS depending on the values of $\hat{\beta}_0$ and $\hat{\beta}_1$.

Figure 5: Contour plot of the RSS depending on the values of $\hat{\beta}_0$ and $\hat{\beta}_1$.

2.1.2 Accuracy of the coefficient estimates


The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are point estimates.

When they are estimated by least squares, they are:

• unbiased: $\mathbb{E}(\hat{\beta}_0) = \beta_0$ and $\mathbb{E}(\hat{\beta}_1) = \beta_1$
• efficient: $\mathbb{V}(\hat{\beta}_0)$ and $\mathbb{V}(\hat{\beta}_1)$ are minimal
• convergent: $\lim_{n \to +\infty} \mathbb{V}(\hat{\beta}_0) = 0$ and $\lim_{n \to +\infty} \mathbb{V}(\hat{\beta}_1) = 0$

They are called BLUE (Best Linear Unbiased Estimator).
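These properties can be illustrated numerically, in the spirit of Figure 7 below. A minimal Monte Carlo sketch, with hypothetical true values chosen by us:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_0, beta_1, sigma, n = 2.0, 1.0, 3.0, 50  # hypothetical true values

estimates = []
for _ in range(1000):
    x = rng.uniform(-5, 5, size=n)
    y = beta_0 + beta_1 * x + rng.normal(0, sigma, size=n)
    # closed-form least squares estimates (Eq. 4)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    estimates.append((b0, b1))

# Unbiasedness: the averages should be close to (2.0, 1.0).
# Convergence: rerunning with a larger n shrinks the spread of the estimates.
print(np.mean(estimates, axis=0))
```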


It is easy to show that:

$$\mathbb{V}(\hat{\beta}_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathbb{V}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad (5)$$

where $\sigma^2$ can be estimated by:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2}.$$
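A minimal sketch of these standard error formulas (Eq. 5), assuming x, y and the coefficient estimates come from the earlier sketches:

```python
import numpy as np

def std_errors(x, y, beta_0_hat, beta_1_hat):
    """Estimated standard errors of the OLS coefficients (Eq. 5)."""
    n = len(x)
    residuals = y - (beta_0_hat + beta_1_hat * x)
    sigma2_hat = np.sum(residuals**2) / (n - 2)   # estimate of sigma^2
    s_xx = np.sum((x - x.mean())**2)
    se_beta_0 = np.sqrt(sigma2_hat * (1 / n + x.mean()**2 / s_xx))
    se_beta_1 = np.sqrt(sigma2_hat / s_xx)
    return se_beta_0, se_beta_1
```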


Figure 6: A: True relationship (in red), observed values of $y$ (points), and least squares line (in blue). B: True relationship (in red), current least squares line (in blue), previous least squares lines (in gray).


Figure 7: Mean and standard deviation of the estimates of $\beta_0$ and $\beta_1$ depending on the number of replications.


We wish to test whether a coefficient $\theta$, $\theta \in \{\beta_0, \beta_1\}$, is equal to a specific value $\theta_0$:

$$\begin{cases} H_0: \theta = \theta_0 \\ H_1: \theta \neq \theta_0 \end{cases}$$

We know that $\hat{\theta} \sim \mathcal{N}\left(\theta, \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)$, so:

$$\frac{\hat{\theta} - \theta}{\sigma \big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \sim \mathcal{N}(0, 1).$$

Hypothesis tests

As $\frac{\sum_{i=1}^{n} \varepsilon_i^2}{\sigma^2} \sim \chi^2_{n-2}$, we can define a variable $T$ as:

$$T = \frac{\dfrac{\hat{\theta} - \theta}{\sigma \big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}{\sqrt{\dfrac{\sum_{i=1}^{n} \varepsilon_i^2}{\sigma^2} \Big/ (n-2)}} \sim \mathcal{S}t(n-2)$$

We can show that the expression of $T$ simplifies to:

$$T = \frac{\hat{\theta} - \theta}{\hat{\sigma}_{\hat{\theta}}}$$

It is thus possible to perform the following test:

$$\begin{cases} H_0: \theta = \theta_0 \\ H_1: \theta \neq \theta_0 \end{cases}$$

knowing that $\dfrac{\hat{\theta} - \theta}{\hat{\sigma}_{\hat{\theta}}} \sim \mathcal{S}t(n-2)$.


And we need to find the following probability:

$$P\left(-t_{\alpha/2} < \frac{\hat{\theta} - \theta}{\hat{\sigma}_{\hat{\theta}}} < t_{\alpha/2}\right)$$

We therefore need to compute a t-statistic, which measures the number of standard deviations that $\hat{\theta}$ is away from $\theta_0$:

$$t_{obs.} = \frac{\hat{\theta} - \theta_0}{\hat{\sigma}_{\hat{\theta}}}$$

• if $t_{obs.} \in \left[-t_{\alpha/2}, t_{\alpha/2}\right]$: we do not reject the null hypothesis ($H_0$) with a first-order risk of $\alpha\%$
• if $t_{obs.} \notin \left[-t_{\alpha/2}, t_{\alpha/2}\right]$: we reject the null hypothesis ($H_0$) with a first-order risk of $\alpha\%$
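A minimal sketch of this decision rule, using Student-t quantiles from scipy; the function and its arguments are ours, with theta_hat and se_theta_hat assumed to come from the earlier sketches:

```python
from scipy import stats

def t_test(theta_hat, se_theta_hat, n, theta_0=0.0, alpha=0.05):
    """Two-sided t-test of H0: theta = theta_0 against H1: theta != theta_0."""
    t_obs = (theta_hat - theta_0) / se_theta_hat
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{alpha/2} quantile
    p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)  # P(|T| >= |t_obs|) under H0
    return t_obs, t_crit, p_value  # reject H0 when |t_obs| > t_crit
```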

Most of the time, we are interested in a specific case:

$$\begin{cases} H_0: \theta = 0 \\ H_1: \theta \neq 0, \end{cases}$$

In such a case, the t-statistic becomes:

$$T = \frac{\hat{\theta} - 0}{\hat{\sigma}_{\hat{\theta}}} = \frac{\hat{\theta}}{\hat{\sigma}_{\hat{\theta}}}$$

The observed value is $t_{obs.} = \dfrac{\hat{\theta}}{\hat{\sigma}_{\hat{\theta}}}$.

Hypothesis tests: confidence interval

We can also use the standard error of the coefficient estimates to construct a confidence interval:

$$\widehat{I.C.}_{\theta}(1-\alpha) = \left[\hat{\theta} \pm t_{\alpha/2} \times \hat{\sigma}_{\hat{\theta}}\right]. \qquad (6)$$

If the interval contains 0, then we can conclude that the coefficient $\theta$ is not statistically different from zero (at the $\alpha\%$ level of significance).

We can also compute the probability of observing a value as extreme as $|t_{obs.}|$ or larger while assuming $\theta = 0$ (this probability is known as the p-value).
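A minimal sketch of the interval in Eq. 6, under the same assumptions as the t-test sketch:

```python
from scipy import stats

def confidence_interval(theta_hat, se_theta_hat, n, alpha=0.05):
    """(1 - alpha) confidence interval for a coefficient (Eq. 6)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return theta_hat - t_crit * se_theta_hat, theta_hat + t_crit * se_theta_hat
```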

Table 1: Statistical models

                 Least squares
(Intercept)      9.17***
                 (0.28)
yrs.since.phd    0.10***
                 (0.01)
R²               0.18
Adj. R²          0.17
Num. obs.        397

*** p < 0.001; ** p < 0.01; * p < 0.05

2.1.3 Accuracy of the model


Recall that the linear regression is a supervised learning method. Hence, we can compare the predictions we obtain with the observed values of the output variable.

We want to have an idea of the quality of the estimation, to know how well the model fits the data.

To that end, we usually use several metrics, among which:

• the root mean squared error (RMSE)

• the residual standard error (RSE)

• the R² statistic.

Accuracy of the model: RMSE

The mean squared error (MSE) is an estimate of the average of the squares of the errors:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (7)$$

The root mean squared error is the square root of the MSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\frac{\mathrm{RSS}}{n}}, \qquad (8)$$

where $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

The value of the RMSE is always non-negative. A value of 0 indicates a perfect fit to the data.
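A minimal sketch of Eq. 8, where y are the observed values and y_hat the model predictions:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error (Eq. 8)."""
    return np.sqrt(np.mean((y - y_hat)**2))
```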

Accuracy of the model: RSE

Recall that the linear model contains an error term ($\varepsilon$). Hence, we will not be able to perfectly predict the response variable.

The Residual Standard Error is the average amount by which the response deviates from the true regression line. It is an estimate of the standard deviation of $\varepsilon$:

$$\mathrm{RSE} = \sqrt{\frac{1}{n-2} \mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \qquad (9)$$

In our example of the regression of salaries onto years since Ph.D, the value of the RSE is 2.7534.

This means that the actual salary can deviate from the true regression line by approximately 2.7534 units (salary being expressed in tens of thousands of dollars, i.e., about $27,534), on average.

The mean salary in the data is 11.37065 (about $113,707). Hence, the percentage error for any prediction using our estimation would be 2.7534/11.37065 ≈ 24%.
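A minimal sketch of Eq. 9, including the percentage error used in the salary example:

```python
import numpy as np

def rse(y, y_hat):
    """Residual standard error for simple linear regression (Eq. 9)."""
    rss = np.sum((y - y_hat)**2)
    return np.sqrt(rss / (len(y) - 2))

# Percentage error, as in the salary example:
# rse(y, y_hat) / np.mean(y)
```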

Accuracy of the model: R²

Now, let us turn to the $R^2$ statistic, which provides another method to assess the quality of fit.

The $R^2$ measures the proportion of variance explained. It takes a value between 0 and 1.

Let us illustrate this.


The variations of $y$ are only partially explained by those of $x$.

Figure 8: Variation from $y_2$ to $y_1$.


As shown in Figure 8, the variation from $y_1$ to $y_2$ is partially explained by the variation from $x_1$ to $x_2$.

The quality of fit at each point, as measured by the total variation, can therefore be broken down into two parts:

• the explained variation
• the residual variation

using the average point $(\bar{x}, \bar{y})$ as reference, i.e.:

$$\underbrace{y_i - \bar{y}}_{\text{total variation}} = \underbrace{\hat{y}_i - \bar{y}}_{\text{explained variation}} + \underbrace{y_i - \hat{y}_i}_{\text{residual variation}}.$$

The closer $\hat{A}$ is to $A$, the stronger the explained variation is, relatively.

Figure 9: Decomposition of the variation into explained variance and residual variance.

Thus, one way to assess the quality of the adjustment is to measure the following ratio:

$$\frac{\text{explained variance}}{\text{total variance}}$$

Or, for all observations:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\text{explained sum of squares}}{\text{total sum of squares}} \qquad (10)$$

We can write the $R^2$ differently, as we know that:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

Thus:

$$R^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (11)$$


The value of the $R^2$ lies between 0 and 1:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \Rightarrow 0 \leq R^2 \leq 1.$$

• When economic theory suggests that the relationship between the response and its predictor should be linear, we expect the value of the $R^2$ to be very close to one; otherwise, it suggests there might be something wrong with how the data were generated.

• In other situations, when the linear relationship can be at best a rough approximation of the real form, we expect to find low values of the $R^2$.
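A minimal sketch of Eq. 11:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, R^2 = 1 - RSS/TSS (Eq. 11)."""
    rss = np.sum((y - y_hat)**2)
    tss = np.sum((y - np.mean(y))**2)
    return 1 - rss / tss
```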

It can be noted that, in the case of simple linear regression, the $R^2$ is equal to the squared correlation coefficient.

Indeed:

$$\begin{aligned} y_i - \hat{y}_i &= y_i - \bar{y} + \bar{y} - \hat{y}_i \\ &= (y_i - \bar{y}) - (\hat{y}_i - \bar{y}) \\ &= (y_i - \bar{y}) - \left(\hat{\beta}_1 x_i + \hat{\beta}_0 - \hat{\beta}_1 \bar{x} - \hat{\beta}_0\right) \\ &= (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}). \end{aligned}$$

Taking the squared value:

$$(y_i - \hat{y}_i)^2 = (y_i - \bar{y})^2 + \hat{\beta}_1^2 (x_i - \bar{x})^2 - 2\hat{\beta}_1 (y_i - \bar{y})(x_i - \bar{x})$$

R² and correlation

Which leads to:

$$(y_i - \hat{y}_i)^2 = (y_i - \bar{y})^2 + \hat{\beta}_1^2 (x_i - \bar{x})^2 - 2\hat{\beta}_1 (y_i - \bar{y})(x_i - \bar{x})$$

Summing over all individuals:

$$\begin{aligned} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 &= \sum_{i=1}^{n} (y_i - \bar{y})^2 + \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2\hat{\beta}_1 \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \\ &= \sum_{i=1}^{n} (y_i - \bar{y})^2 + \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2\hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\ &= \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \end{aligned}$$

It can indeed be shown that:

$$2\hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2\hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \times \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 2\hat{\beta}_1 \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}).$$

We also have:

$$\hat{y}_i - \bar{y} = \hat{\beta}_1 x_i + \hat{\beta}_0 - \hat{\beta}_1 \bar{x} - \hat{\beta}_0 = \hat{\beta}_1 (x_i - \bar{x}).$$

By taking the squared value and summing over all individuals:

$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (12)$$
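As a quick numerical check of this identity, a minimal sketch (the helper name is ours; x, y and y_hat can be taken from the earlier simulation and estimation sketches):

```python
import numpy as np

def check_r2_equals_squared_corr(x, y, y_hat):
    """In simple linear regression, R^2 equals the squared correlation of x and y."""
    rss = np.sum((y - y_hat)**2)
    tss = np.sum((y - np.mean(y))**2)
    r2 = 1 - rss / tss
    corr2 = np.corrcoef(x, y)[0, 1] ** 2
    return np.isclose(r2, corr2)   # expected to be True
```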
