Machine learning and statistical learning
Ewen Gallic ewen.gallic@gmail.com

MASTER in Economics - Track EBDS - 2nd Year 2020-2021


This part presents some concepts of statistical learning, through the prism of regression.

1. Some context

Model specification

In a regression problem, the aim is to understand how a response variable $y$ varies, conditionally on the available information on some predictors $x$.

Let us take an example: that of the salaries of Professors in the US in 2008-09.

The salary of a professor may be linked, among other things, to the number of years since they obtained their Ph.D.

[Figure: Scatter plot of the nine-month salary in 2008-09 (in dollars) against years since Ph.D, showing the observations, the conditional mean, a linear regression fit, and a loess fit.]


Salary as a function of years since Ph.D

Here, the linear regression suggests that, on average, the salary increases with the number of years since Ph.D:

• the slope of 985.3 indicates that each additional year since Ph.D leads to an increase of about 985 dollars in nine-month salary.

But the relationship does not seem to be linear...


It should be noted here that:

• the regression analysis does not depend on a generative model here (a model explaining how the data are generated)

• there are no causal claims regarding the way mean salary would change if the number of years since Ph.D were altered

• there is no statistical inference

We could add some predictors to the model to get a better story on what is going on with salary:

• some omitted variables may play an important role in explaining the variations.


Salary as a function of years since Ph.D

We can also perform some regression analysis if the response variable is categorical.

Let us look at the salary in a different way: let us split it into two categories, either < $100k or ≥ $100k.

For each decile of years since Ph.D, we can plot the conditional proportions.

[Figure: Conditional proportions of salary type (< $100k vs. ≥ $100k) within each decile of years since Ph.D.]


Levels of regression analysis

Berk (2008) mentions three levels of regression analysis:

• Level I regression analysis:
  – aims at describing the data
  – assumption free
  – should not be neglected

• Level II regression analysis:
  – based on statistical inference
  – uses results from level I regression analysis
  – use with real data may be challenging
  – allows one to make predictions

• Level III regression analysis:
  – based on causal inference
  – uses level I analysis, sometimes coupled with level II
  – relies more on algorithmic methods than on model-based methods.

2. The linear regression

Some references

• Berk (2008). Statistical learning from a regression perspective, volume 14. Springer.

• Cornillon and Matzner-Løber (2007). Régression: théorie et applications. Springer.

• James et al. (2013). An introduction to statistical learning, volume 112. Springer.


Linear regression combines level I and level II perspectives.

It is useful when one wants to predict a quantitative response.

A lot of newer statistical learning approaches can be seen as generalizations or extensions of linear regression, as noted in James et al. (2013).


2.1 Simple linear regression


Let us first consider the case of simple linear regression.

We aim at predicting a quantitative response variable $y$ using a single predictor $x$ (or regressor).

$y$ is an $n \times 1$ numerical response variable, where $n$ represents the number of observations;

$x$ is an $n \times 1$ predictor.

We assume there exists a linear relationship between $y$ and $x$ such that:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $\varepsilon_i$ is an error term normally distributed with zero mean and variance $\sigma^2$, i.e., $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
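To make the setup concrete, here is a minimal sketch in Python that simulates data from the model in Eq. 1; the parameter values are hypothetical, not estimated from the Professors data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true parameters (for illustration only)
beta_0, beta_1, sigma = 9.0, 0.1, 1.5
n = 100

# A single predictor x and the response y = beta_0 + beta_1 * x + eps
x = rng.uniform(0, 40, size=n)          # e.g., years since Ph.D
eps = rng.normal(0, sigma, size=n)      # eps_i ~ N(0, sigma^2)
y = beta_0 + beta_1 * x + eps
```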


Principle

In Eq. 1, the coefficients (or parameters) $\beta_0$ (i.e., the constant) and $\beta_1$ (i.e., the slope) are unknown parameters to be estimated.

These coefficients are estimated using a training sample.

The estimates of $\beta_0$ and $\beta_1$ are, respectively, $\hat{\beta}_0$ and $\hat{\beta}_1$.

Once they are estimated using a learning procedure (in this case, linear regression), they can be used to predict the value of $y$ for some value $x_0$:

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \qquad (2)$$

2.1.1 Estimating the coefficients

To estimate $\beta_0$ and $\beta_1$, we rely on a set of training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$.

For example, let us go back to our data describing the nine-month salary of professors (the response variable) and look at the relationship between the salary and years since Ph.D ($x$).


Figure 1: Varying the intercept (intercept 8, slope 0.1). Figure 2: Varying the slope (intercept 9.17, slope −0.05). [Scatter plots of salary (in $10,000) against years since Ph.D.]

There is an infinity of possible values that one can pick for $\hat{\beta}_0$ and $\hat{\beta}_1$.

However, we want to find an estimation that leads to a line being as close as possible to the points: but what does “close” mean?



The most common metric we want to minimize is known as the least squares criterion.

The predictions $\hat{y}_i$ for each of the $x_i$, $i = 1, \ldots, n$, are given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

Let $e_i = y_i - \hat{y}_i$ be the $i$th residual, i.e., the difference between the observed value and its prediction by the linear model.

The residual sum of squares is defined as:

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2. \qquad (3)$$

We aim at minimizing this metric.

It can easily be shown that the minimization of the RSS leads to:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad (4)$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
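As an illustration, here is a minimal sketch of Eq. 4; the helper name ols_simple is ours, and the commented usage reuses the simulated x and y from the earlier sketch:

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form least squares estimates for y = b0 + b1 * x (Eq. 4)."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    beta_1_hat = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
    beta_0_hat = y_bar - beta_1_hat * x_bar
    return beta_0_hat, beta_1_hat

# beta_0_hat, beta_1_hat = ols_simple(x, y)   # x, y from the simulation sketch
# y_0_hat = beta_0_hat + beta_1_hat * 10.0    # prediction at x_0 = 10 (Eq. 2)
```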

Least squares coefficient estimates

Here, the least squares coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are, respectively, 9.1719 and 0.0985.

Figure 3: Fit of the least squares regression of the nine-month salary of Professors (in $10,000) onto years since Ph.D (intercept: 9.1719, slope: 0.0985).

We can have a look at the RSS when we vary the values of $\beta_0$ and $\beta_1$:

Figure 4: Surface plot of the RSS depending on the values of $\hat{\beta}_0$ and $\hat{\beta}_1$.

Figure 5: Contour plot of the RSS depending on the values of $\hat{\beta}_0$ and $\hat{\beta}_1$.

2.1.2 Accuracy of the coefficient estimates


The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are point estimates.

When they are estimated by least squares, they are:

• unbiased: $\mathbb{E}(\hat{\beta}_0) = \beta_0$ and $\mathbb{E}(\hat{\beta}_1) = \beta_1$
• efficient: $\mathbb{V}(\hat{\beta}_0)$ and $\mathbb{V}(\hat{\beta}_1)$ are minimal
• convergent: $\lim_{n \to +\infty} \mathbb{V}(\hat{\beta}_0) = 0$ and $\lim_{n \to +\infty} \mathbb{V}(\hat{\beta}_1) = 0$

They are called BLUE (Best Linear Unbiased Estimator).
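These properties can be illustrated numerically, in the spirit of Figure 7 below. A minimal Monte Carlo sketch, with hypothetical true values chosen by us:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_0, beta_1, sigma, n = 2.0, 1.0, 3.0, 50  # hypothetical true values

estimates = []
for _ in range(1000):
    x = rng.uniform(-5, 5, size=n)
    y = beta_0 + beta_1 * x + rng.normal(0, sigma, size=n)
    # closed-form least squares estimates (Eq. 4)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    estimates.append((b0, b1))

# Unbiasedness: the averages should be close to (2.0, 1.0).
# Convergence: rerunning with a larger n shrinks the spread of the estimates.
print(np.mean(estimates, axis=0))
```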


It is easy to show that:

$$\mathbb{V}(\hat{\beta}_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathbb{V}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad (5)$$

where $\sigma^2$ can be estimated by:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2}.$$
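A minimal sketch of these standard error formulas (Eq. 5), assuming x, y and the coefficient estimates come from the earlier sketches:

```python
import numpy as np

def std_errors(x, y, beta_0_hat, beta_1_hat):
    """Estimated standard errors of the OLS coefficients (Eq. 5)."""
    n = len(x)
    residuals = y - (beta_0_hat + beta_1_hat * x)
    sigma2_hat = np.sum(residuals**2) / (n - 2)   # estimate of sigma^2
    s_xx = np.sum((x - x.mean())**2)
    se_beta_0 = np.sqrt(sigma2_hat * (1 / n + x.mean()**2 / s_xx))
    se_beta_1 = np.sqrt(sigma2_hat / s_xx)
    return se_beta_0, se_beta_1
```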


Figure 6: A: True relationship (in red), observed values of $y$ (points), and least squares line (in blue). B: True relationship (in red), current least squares line (in blue), previous least squares lines (in gray).


Figure 7: Mean and standard deviation of the estimates of $\beta_0$ and $\beta_1$ depending on the number of replications.


We wish to test whether a coefficient $\theta$, $\theta \in \{\beta_0, \beta_1\}$, is equal to a specific value $\theta_0$:

$$\begin{cases} H_0: \theta = \theta_0 \\ H_1: \theta \neq \theta_0 \end{cases}$$

We know that $\hat{\theta} \sim \mathcal{N}\left(\theta, \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)$, so:

$$\frac{\hat{\theta} - \theta}{\sigma \big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \sim \mathcal{N}(0, 1).$$

Hypothesis tests

As $\frac{\sum_{i=1}^{n} \varepsilon_i^2}{\sigma^2} \sim \chi^2_{n-2}$, we can define a variable $T$ as:

$$T = \frac{\dfrac{\hat{\theta} - \theta}{\sigma \big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}{\sqrt{\dfrac{\sum_{i=1}^{n} \varepsilon_i^2}{\sigma^2} \Big/ (n-2)}} \sim \mathcal{S}t(n-2)$$

We can show that the expression of $T$ simplifies to:

$$T = \frac{\hat{\theta} - \theta}{\hat{\sigma}_{\hat{\theta}}}$$

It is thus possible to perform the following test:

$$\begin{cases} H_0: \theta = \theta_0 \\ H_1: \theta \neq \theta_0 \end{cases}$$

knowing that $\dfrac{\hat{\theta} - \theta}{\hat{\sigma}_{\hat{\theta}}} \sim \mathcal{S}t(n-2)$.


And we need to find the following probability:

$$P\left(-t_{\alpha/2} < \frac{\hat{\theta} - \theta}{\hat{\sigma}_{\hat{\theta}}} < t_{\alpha/2}\right)$$

We therefore need to compute a t-statistic, which measures the number of standard deviations that $\hat{\theta}$ is away from $\theta_0$:

$$t_{obs.} = \frac{\hat{\theta} - \theta_0}{\hat{\sigma}_{\hat{\theta}}}$$

• if $t_{obs.} \in \left[-t_{\alpha/2}, t_{\alpha/2}\right]$: we do not reject the null hypothesis ($H_0$) with a first-order risk of $\alpha\%$
• if $t_{obs.} \notin \left[-t_{\alpha/2}, t_{\alpha/2}\right]$: we reject the null hypothesis ($H_0$) with a first-order risk of $\alpha\%$
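A minimal sketch of this decision rule, using Student-t quantiles from scipy; the function and its arguments are ours, with theta_hat and se_theta_hat assumed to come from the earlier sketches:

```python
from scipy import stats

def t_test(theta_hat, se_theta_hat, n, theta_0=0.0, alpha=0.05):
    """Two-sided t-test of H0: theta = theta_0 against H1: theta != theta_0."""
    t_obs = (theta_hat - theta_0) / se_theta_hat
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{alpha/2} quantile
    p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)  # P(|T| >= |t_obs|) under H0
    return t_obs, t_crit, p_value  # reject H0 when |t_obs| > t_crit
```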

Most of the time, we are interested in a specific case:

$$\begin{cases} H_0: \theta = 0 \\ H_1: \theta \neq 0, \end{cases}$$

In such a case, the t-statistic becomes:

$$T = \frac{\hat{\theta} - 0}{\hat{\sigma}_{\hat{\theta}}} = \frac{\hat{\theta}}{\hat{\sigma}_{\hat{\theta}}}$$

The observed value is $t_{obs.} = \dfrac{\hat{\theta}}{\hat{\sigma}_{\hat{\theta}}}$.

Hypothesis tests: confidence interval

We can also use the standard error of the coefficient estimates to construct a confidence interval:

$$\widehat{I.C.}_{\theta}(1-\alpha) = \left[\hat{\theta} \pm t_{\alpha/2} \times \hat{\sigma}_{\hat{\theta}}\right]. \qquad (6)$$

If the interval contains 0, then we can conclude that the coefficient $\theta$ is not statistically different from zero (at the $\alpha\%$ level of significance).

We can also compute the probability of observing a value as extreme as $|t_{obs.}|$ or larger while assuming $\theta = 0$ (this probability is known as the p-value).
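A minimal sketch of the interval in Eq. 6, under the same assumptions as the t-test sketch:

```python
from scipy import stats

def confidence_interval(theta_hat, se_theta_hat, n, alpha=0.05):
    """(1 - alpha) confidence interval for a coefficient (Eq. 6)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return theta_hat - t_crit * se_theta_hat, theta_hat + t_crit * se_theta_hat
```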

Table 1: Statistical models

                 Least squares
(Intercept)      9.17***
                 (0.28)
yrs.since.phd    0.10***
                 (0.01)
R²               0.18
Adj. R²          0.17
Num. obs.        397

*** p < 0.001; ** p < 0.01; * p < 0.05

2.1.3 Accuracy of the model


Recall that the linear regression is a supervised learning method. Hence, we can compare the predictions we obtain with the observed values of the output variable.

We want to have an idea of the quality of the estimation, to know how well the model fits the data.

To that end, we usually use several metrics, among which:

• the root mean squared error (RMSE)

• the residual standard error (RSE)

• the R² statistic.

Accuracy of the model: RMSE

The mean squared error (MSE) is an estimate of the average of the squares of the errors:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (7)$$

The root mean squared error is the square root of the MSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\frac{\mathrm{RSS}}{n}}, \qquad (8)$$

where $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

The value of the RMSE is always non-negative. A value of 0 indicates a perfect fit to the data.
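A minimal sketch of Eq. 8, where y are the observed values and y_hat the model predictions:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error (Eq. 8)."""
    return np.sqrt(np.mean((y - y_hat)**2))
```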

Accuracy of the model: RSE

Recall that the linear model contains an error term ($\varepsilon$). Hence, we will not be able to perfectly predict the response variable.

The Residual Standard Error is the average amount by which the response deviates from the true regression line. It is an estimate of the standard deviation of $\varepsilon$:

$$\mathrm{RSE} = \sqrt{\frac{1}{n-2} \mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \qquad (9)$$

In our example of the regression of salaries onto years since Ph.D, the value of the RSE is 2.7534.

This means that the actual salary can deviate from the true regression line by approximately 2.7534 units (salary being expressed in tens of thousands of dollars, i.e., about $27,534), on average.

The mean salary in the data is 11.37065 (about $113,707). Hence, the percentage error for any prediction using our estimation would be 2.7534/11.37065 ≈ 24%.
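A minimal sketch of Eq. 9, including the percentage error used in the salary example:

```python
import numpy as np

def rse(y, y_hat):
    """Residual standard error for simple linear regression (Eq. 9)."""
    rss = np.sum((y - y_hat)**2)
    return np.sqrt(rss / (len(y) - 2))

# Percentage error, as in the salary example:
# rse(y, y_hat) / np.mean(y)
```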

Accuracy of the model: R²

Now, let us turn to the $R^2$ statistic, which provides another method to assess the quality of fit.

The $R^2$ measures the proportion of variance explained. It takes a value between 0 and 1.

Let us illustrate this.


The variations of $y$ are only partially explained by those of $x$.

Figure 8: Variation from $y_2$ to $y_1$.


As shown in Figure 8, the variation from $y_1$ to $y_2$ is partially explained by the variation from $x_1$ to $x_2$.

The quality of fit at each point, as measured by the total variation, can therefore be broken down into two parts:

• the explained variation
• the residual variation

using the average point $(\bar{x}, \bar{y})$ as reference, i.e.:

$$\underbrace{y_i - \bar{y}}_{\text{total variation}} = \underbrace{\hat{y}_i - \bar{y}}_{\text{explained variation}} + \underbrace{y_i - \hat{y}_i}_{\text{residual variation}}.$$

The closer $\hat{A}$ is to $A$, the stronger the explained variation is, relatively.

Figure 9: Decomposition of the variation into explained variance and residual variance.

Thus, one way to assess the quality of the adjustment is to measure the following ratio:

$$\frac{\text{explained variance}}{\text{total variance}}$$

Or, for all observations:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\text{explained sum of squares}}{\text{total sum of squares}} \qquad (10)$$

We can write the $R^2$ differently, as we know that:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

Thus:

$$R^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (11)$$


The value of the $R^2$ lies between 0 and 1:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \Rightarrow 0 \leq R^2 \leq 1.$$

• When economic theory suggests that the relationship between the response and its predictor should be linear, we expect the value of the $R^2$ to be very close to one; otherwise, it suggests there might be something wrong with how the data were generated.

• In other situations, when the linear relationship can be at best a rough approximation of the real form, we expect to find low values of the $R^2$.
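A minimal sketch of Eq. 11:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, R^2 = 1 - RSS/TSS (Eq. 11)."""
    rss = np.sum((y - y_hat)**2)
    tss = np.sum((y - np.mean(y))**2)
    return 1 - rss / tss
```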

It can be noted that, in the case of simple linear regression, the $R^2$ is equal to the squared correlation coefficient.

Indeed:

$$\begin{aligned} y_i - \hat{y}_i &= y_i - \bar{y} + \bar{y} - \hat{y}_i \\ &= (y_i - \bar{y}) - (\hat{y}_i - \bar{y}) \\ &= (y_i - \bar{y}) - \left(\hat{\beta}_1 x_i + \hat{\beta}_0 - \hat{\beta}_1 \bar{x} - \hat{\beta}_0\right) \\ &= (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}). \end{aligned}$$

Taking the squared value:

$$(y_i - \hat{y}_i)^2 = (y_i - \bar{y})^2 + \hat{\beta}_1^2 (x_i - \bar{x})^2 - 2\hat{\beta}_1 (y_i - \bar{y})(x_i - \bar{x})$$

R² and correlation

Which leads to:

$$(y_i - \hat{y}_i)^2 = (y_i - \bar{y})^2 + \hat{\beta}_1^2 (x_i - \bar{x})^2 - 2\hat{\beta}_1 (y_i - \bar{y})(x_i - \bar{x})$$

Summing over all individuals:

$$\begin{aligned} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 &= \sum_{i=1}^{n} (y_i - \bar{y})^2 + \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2\hat{\beta}_1 \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \\ &= \sum_{i=1}^{n} (y_i - \bar{y})^2 + \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2\hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\ &= \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \end{aligned}$$

It can indeed be shown that:

$$2\hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2\hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \times \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 2\hat{\beta}_1 \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}).$$

We also have:

$$\hat{y}_i - \bar{y} = \hat{\beta}_1 x_i + \hat{\beta}_0 - \hat{\beta}_1 \bar{x} - \hat{\beta}_0 = \hat{\beta}_1 (x_i - \bar{x}).$$

By taking the squared value and summing over all individuals:

$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (12)$$
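As a quick numerical check of this identity, a minimal sketch (the helper name is ours; x, y and y_hat can be taken from the earlier simulation and estimation sketches):

```python
import numpy as np

def check_r2_equals_squared_corr(x, y, y_hat):
    """In simple linear regression, R^2 equals the squared correlation of x and y."""
    rss = np.sum((y - y_hat)**2)
    tss = np.sum((y - np.mean(y))**2)
    r2 = 1 - rss / tss
    corr2 = np.corrcoef(x, y)[0, 1] ** 2
    return np.isclose(r2, corr2)   # expected to be True
```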
