Regularization Methods for Linear Regression

Mathilde Mougeot

ENSIIE

2017-2018

(2)

Lessons

• 3 plenary lessons (MRR)

• 3 practical work sessions using R (mrr)

• 10 project sessions (ipR)

Before the next session, install on your computer:

1. R software, https://www.r-project.org/

2. RStudio, https://www.rstudio.com/

Documents are available (at this stage) at:

• https://sites.google.com/site/MougeotMathilde/teaching

(3)

Plan

A word on data and predictive models

Data are everywhere

• Industry (temperature, IR sensors...)

• Finance: transactions

• Marketing: consumer data

• On your phone (GPS, mail, music...)

→ Databases are available everywhere: from small data sets to Big Data

Nowadays, predictive models are crucial for monitoring and for diagnosis

• Industry: health monitoring, energy...

• Finance: forecasting the evolution of the market

• Marketing: scoring

• Health

→ Machine learning models are used to mine and to exploit the data.

(5)

Regularization Methods for Linear Regression

- Linear regression and regularized linear regression belong to the family of predictive models.

- Linear regression is an old model but still very useful! (Gauss, 1795; Legendre, 1805)

Outline of the lesson

• Motivations

• Ordinary Least Squares (OLS): geometrical approach

• The linear model (probabilistic approach)

• Using R software for modeling

(6)

Evolution of the average age of the French population

(8)

Modeling the average age of the French population

(9)

Application : Social Networks

Facebook users :

Date        Users (millions)
31/12/04    0
31/12/05    5
31/12/06    10
31/3/07     20
30/12/07    60
30/6/08     100
30/12/08    150
30/3/09     170
30/6/09     200
30/9/09     250
30/12/09    300
30/3/10     400
30/6/10     500
30/6/11     750

(10)

Evolution of the number of Facebook users

(11)

Modeling the evolution of the number of Facebook users

(12)

• (Y, X): a pair of variables

  - Y: quantitative target variable

  - $X = (X_1, X_2, \dots, X_p)$: covariates, quantitative variables

• The goal is to propose a regression model to explain Y given X.

  The parameters of the model are computed from a data set: $Y = F_{data}(X) = F_{data}(X_1, \dots, X_p)$. Here, F is a linear function.

• Questions:

  - How well does this model perform?

  - What are the main explanatory variables?

  - Can the model be used to predict new values? To forecast?

  - Can we improve the model?

(13)

Boston Housing Data

The original data are n = 506 observations on p = 14 variables, medv (median value) being the target variable.

crim      per capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq.ft
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox       nitric oxides concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted distances to five Boston employment centres
rad       index of accessibility to radial highways
tax       full-value property-tax rate per USD 10,000
ptratio   pupil-teacher ratio by town
b         $1000(B - 0.63)^2$ where B is the proportion of blacks by town
lstat     percentage of lower status of the population
medv      median value of owner-occupied homes in USD 1000's

(14)

The data :

n°   crim   zn  indus  chas  nox   rm    age   dis   rad  tax  ptratio  b      lstat  medv
1    0.006  18  2.3    0     0.53  6.57  65.2  4.09  1    296  15.3     396.9  4.9    24.0
2    0.027  0   7.0    0     0.46  6.42  78.9  4.96  2    242  17.8     396.9  9.1    21.6
3    0.027  0   7.0    0     0.46  7.18  61.1  4.96  2    242  17.8     392.8  4.0    34.7
4    0.032  0   2.1    0     0.45  6.99  45.8  6.06  3    222  18.7     394.6  2.9    33.4
5    0.069  0   2.1    0     0.45  7.14  54.2  6.06  3    222  18.7     396.9  5.3    36.2
...  ...    ... ...    ...   ...   ...   ...   ...   ...  ...  ...      ...    ...    ...

(15)

Boston Housing Data

• Model: $Y = F_{data}(X)$

• Evaluate the performance of the model

• What are the most important variables? (variable selection)

  → sparse models, less complex, better performance

• Inference and simulation

  → Point estimation for new values of the covariates

  → Confidence interval computation
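The three steps listed above (model, evaluation, inference) can be sketched in R. This is only an illustrative sketch: it assumes the `MASS` package, whose `Boston` data frame holds these 506 observations and names the variable b `black`; the "new" observation is simply the first row of the data, reused for illustration.

```r
# Sketch only: assumes the MASS package is installed; its Boston data frame
# stores the variable described as "b" under the name `black`.
library(MASS)

# Model: regress medv on all the other variables
mod <- lm(medv ~ ., data = Boston)

# Performance: R-squared of the fit
r2 <- summary(mod)$r.squared

# Variable importance (rough view): coefficient table with t-tests
coef_table <- summary(mod)$coefficients

# Inference: point prediction and 95% confidence interval for a new observation
# (hypothetical new point: the covariates of the first row, reused here)
new_obs <- Boston[1, names(Boston) != "medv"]
pred <- predict(mod, newdata = new_obs, interval = "confidence", level = 0.95)

print(round(r2, 3))
print(round(coef_table, 4))
print(pred)   # columns: fit, lwr, upr
```

The coefficient table only gives a first impression of which variables matter; variable selection proper is the subject of the regularization methods announced in the outline.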

(16)

Outline

• Applications

• Ordinary Least Squares (OLS) / Moindres Carrés Ordinaires (MCO)

• Linear Model

• Regularization methods : ridge, lasso

(17)

Ordinary Least Squares (OLS)

(20)

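Ordinary Least Squares (OLS)

• Values/Variables:

  - $Y \in \mathbb{R}$: value / target variable

  - $X = (X_1, \dots, X_p)$, $X \in \mathbb{R}^p$: values / covariates

• Data: $S = \{(x_i, y_i),\ i = 1, \dots, n,\ y_i \in \mathbb{R},\ x_i \in \mathbb{R}^p\}$

• Goal: model Y linearly with X, up to a "small" term $\epsilon$

  $Y = \beta_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$

  $Y = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$

  $Y = \sum_{j=1}^{p} \beta_j X_j + \epsilon$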

(21)

Ordinary Least Squares (OLS)

Simple Linear Regression model

(22)

Simple Linear Regression : example

We have only one covariate (X) to explain the target variable (Y).

The scatter plot is shown below:

[Figure: scatter plot of y (axis from 12 to 24) against x (axis from 5 to 10)]

For all observation pairs $(X_i, Y_i)$, $1 \le i \le n$, the goal here is to minimize the squared deviations $(Y_i - (\beta_1 + \beta_2 X_i))^2$.

[Figure: the same scatter plot of y against x, with the fitted regression line overlaid]

(24)

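Ordinary Least Squares: simple linear model

Formalism:

• Distance to a single point: $(y_i - x_i \beta_2 - \beta_1)^2$

• Distance to the whole sample: $\sum_{i=1}^{n} (y_i - x_i \beta_2 - \beta_1)^2$

→ Best line: intercept $\hat\beta_1$ and slope $\hat\beta_2$ such that $\sum_{i=1}^{n} (y_i - x_i \beta_2 - \beta_1)^2$ is minimum among all possible values of $\beta_1$ and $\beta_2$.

OLS estimator:

The values estimated by OLS (the estimates) for $\beta_1$ and $\beta_2$ satisfy:

$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \arg\min_{(\beta_1, \beta_2) \in \mathbb{R}^2} \left\{ \sum_{i=1}^{n} (y_i - x_i \beta_2 - \beta_1)^2 \right\}$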

(25)

Ordinary Least Squares: simple linear model

The values estimated by OLS (the estimates) for $\beta_1$ and $\beta_2$ can be written in three equivalent ways.

OLS estimator (observations):

$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \arg\min_{(\beta_1, \beta_2) \in \mathbb{R}^2} \left\{ \sum_{i=1}^{n} (y_i - x_i \beta_2 - \beta_1)^2 \right\}$

OLS estimator (vector notation: $y$, $x$, $\mathbf{1}_n$):

$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \arg\min_{(\beta_1, \beta_2) \in \mathbb{R}^2} \| y - x \beta_2 - \mathbf{1}_n \beta_1 \|_2^2$

OLS estimator (matrix notation: $Y$ of size $(n, 1)$; $X$ of size $(n, 2)$):

$(\hat\beta_1^{ols}, \hat\beta_2^{ols}) = \arg\min_{(\beta_1, \beta_2) \in \mathbb{R}^2} \| Y - X \beta \|_2^2$
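The observation form and the matrix form of the objective can be compared numerically. The sketch below uses simulated data (the true values 4 and 1.5, the noise and the seed are arbitrary assumptions) and minimizes the criterion with the general-purpose optimizer `optim`, which is not how OLS is computed in practice; it only illustrates the arg min definition.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(7)
n <- 40
x <- runif(n, 5, 10)
y <- 4 + 1.5 * x + rnorm(n)

# Observation form: sum of squared deviations, b = (beta1, beta2)
sse_obs <- function(b) sum((y - x * b[2] - b[1])^2)

# Matrix form: ||Y - X beta||_2^2 with X = [1_n, x]
X <- cbind(1, x)
sse_mat <- function(b) sum((y - X %*% b)^2)

# Gradient of the matrix form: -2 X^T (Y - X beta)
grad <- function(b) -2 * drop(crossprod(X, y - X %*% b))

b0 <- c(0, 0)
print(c(sse_obs(b0), sse_mat(b0)))   # the two forms give the same value

# Numerical arg min (BFGS), compared with the lm() estimates
opt <- optim(b0, sse_mat, gr = grad, method = "BFGS")
ref <- unname(coef(lm(y ~ x)))
print(opt$par)
print(ref)
```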

(26)

Ordinary Least Squares: simple linear model

Theorem:

The OLS estimators have the following expressions:

$\hat\beta_1 = \bar{y} - \hat\beta_2 \bar{x}$

$\hat\beta_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{Cov(x, y)}{Var(x)}$

Proof:

By setting to zero the derivatives of the objective function, which is convex.
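As a quick check of the theorem, the closed-form expressions can be compared with the estimates returned by `lm()`. The data below are simulated (the intercept 3, the slope 2 and the seed are arbitrary assumptions).

```r
# Simulated data (hypothetical values, chosen only for illustration)
set.seed(1)
n <- 50
x <- runif(n, 5, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1)

# Closed-form OLS estimates from the theorem
beta2_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta1_hat <- mean(y) - beta2_hat * mean(x)

# Same slope written as Cov(x, y) / Var(x)
beta2_cov <- cov(x, y) / var(x)

# Reference fit with lm()
fit <- lm(y ~ x)
print(c(beta1_hat, beta2_hat))
print(coef(fit))
```

`cov()` and `var()` both divide by n - 1, so the normalization cancels in the ratio and the two expressions of the slope coincide.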

(28)

Ordinary Least Squares: simple linear model

For the simple linear model, the correlation coefficient may be very useful:

• r(x, y): correlation coefficient (French: coefficient de corrélation linéaire)

  $r(x, y) = \frac{cov(x, y)}{\sqrt{var(x)} \sqrt{var(y)}}$

• $|r(x, y)| = 1$ if and only if $Y = aX + b$ with $a \neq 0$ (linear relation between Y and X); the sign of r is the sign of a

R-square, used in multiple regression:

• $R^2 = \frac{Var(\hat{Y})}{Var(Y)}$

• $R^2 \in [0, 1]$

• Simple regression: $R^2 = r^2$
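The identity between R-square and the squared correlation in simple regression can be verified numerically. This is a small sketch on simulated data (coefficients and seed are arbitrary assumptions).

```r
# Simulated data (hypothetical values)
set.seed(2)
x <- rnorm(100)
y <- 1 - 0.5 * x + rnorm(100, sd = 0.3)

r <- cor(x, y)                         # correlation coefficient
fit <- lm(y ~ x)
r2_var <- var(fitted(fit)) / var(y)    # R2 = Var(Y_hat) / Var(Y)
r2_lm <- summary(fit)$r.squared        # R2 as reported by lm()

print(c(r^2, r2_var, r2_lm))           # three equal values

# An exact linear relation gives |r| = 1, with the sign of the slope
r_pos <- cor(x, 2 * x + 1)
r_neg <- cor(x, -2 * x + 1)
print(c(r_pos, r_neg))
```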

(30)

Best Practices

The correlation coefficient is close to 1 in these two cases:

[Figure: two scatter plots of y against x, each with n = 500 points; first panel: correlation 0.9998; second panel: correlation 0.9990]

Always look at the data!

(31)

Ordinary Least Squares (OLS)

Multiple Linear Regression model

(34)

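Ordinary Least Squares: multiple linear model

• We suppose $Y = \sum_{j=1}^{p} \beta_j X_j + \epsilon$ and $S = \{(x_i, y_i),\ i = 1, \dots, n,\ y_i \in \mathbb{R},\ x_i \in \mathbb{R}^p\}$

• The quadratic error is defined by:

  $E(\beta) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2$

  with matrix notation:

  $E(\beta) = (Y - X\beta)^T (Y - X\beta)$

• Goal: minimize the error $E(\beta)$ on the data set S, that is, compute $\hat\beta \in \mathbb{R}^p$:

  $\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} E(\beta)$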

(35)

Ordinary Least Squares: multiple regression model

• We aim to compute the $\beta$ which minimizes:

  $E(\beta) = \|Y - X\beta\|_2^2 = (Y - X\beta)^T (Y - X\beta)$

• Assumption: $X^T X$ invertible ($n \ge p$).

Theorem:

$\hat\beta^{MCO} = (X^T X)^{-1} X^T Y$
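The theorem can be obtained by setting the gradient of $E$ to zero. A short derivation, under the stated assumption that $X^T X$ is invertible:

```latex
\begin{aligned}
E(\beta) &= (Y - X\beta)^T (Y - X\beta)
          = Y^T Y - 2\,\beta^T X^T Y + \beta^T X^T X \beta \\
\nabla_\beta E(\beta) &= -2\, X^T Y + 2\, X^T X \beta \\
\nabla_\beta E(\hat\beta) = 0
  &\iff X^T X \hat\beta = X^T Y
   \iff \hat\beta = (X^T X)^{-1} X^T Y
\end{aligned}
```

The Hessian $2 X^T X$ is positive definite when $X^T X$ is invertible, so $E$ is strictly convex and this stationary point is the unique minimum.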

(39)

MCO (OLS)

• Estimation:

  $\hat\beta = (X^T X)^{-1} X^T Y$

• Prediction: knowing $\hat\beta$ and given $X_1, \dots, X_p$, the prediction of the target can be computed:

  $\hat{Y} = \sum_{j} \hat\beta_j X_j$

  $\hat{Y} = X \hat\beta = X (X^T X)^{-1} X^T Y$

• P: projection matrix onto the hyperplane (hat matrix)

  $P = X (X^T X)^{-1} X^T$

  $P^2 = P$

• Residuals:

  $\hat\epsilon = Y - \hat{Y}$

Remark: no assumption is made on the law or distribution of $\epsilon$.
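These formulas translate almost literally into R matrix operations. The sketch below is illustrative only: the design matrix, the true coefficients (1, 2, -3) and the noise level are simulated assumptions. In practice `lm()` relies on a QR decomposition rather than on an explicit inverse.

```r
# Simulated design (hypothetical values); first column of ones = intercept
set.seed(3)
n <- 30
X <- cbind(1, runif(n), runif(n))
beta <- c(1, 2, -3)
Y <- drop(X %*% beta) + rnorm(n, sd = 0.1)

# Estimation: solve the normal equations (X^T X) beta = X^T Y
beta_hat <- drop(solve(crossprod(X), crossprod(X, Y)))

# Hat matrix P = X (X^T X)^{-1} X^T and prediction Y_hat = P Y
P <- X %*% solve(crossprod(X)) %*% t(X)
Y_hat <- drop(P %*% Y)

# Residuals
eps_hat <- Y - Y_hat

print(beta_hat)
print(max(abs(P %*% P - P)))            # P^2 = P up to rounding error
print(max(abs(X %*% beta_hat - Y_hat))) # X beta_hat equals P Y
```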

(40)

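Ordinary Least Squares: geometrical interpretation

$\underbrace{Y}_{\in \mathbb{R}^{n}} = \underbrace{\sum_{j=1}^{p} X_j \beta_j}_{\in \mathbb{R}^{p}} + \underbrace{\epsilon}_{\in \mathbb{R}^{(n-p)}}$

Y lives in $\mathbb{R}^n$; the fitted part lies in the subspace of dimension p spanned by the covariates, and the residual part in its orthogonal complement, of dimension n - p.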

(43)

Ordinary Least Squares: properties

• Orthogonality:

  - $\hat{Y} \perp \hat\epsilon$

  - $X_j \perp \hat\epsilon$, $\forall j \in [1, \dots, p]$: $\langle X_j, \hat\epsilon \rangle = 0$

• Residual average:

  - $\sum_i \hat\epsilon_i = 0$ if there is an intercept in the model, $X_1 = (1, 1, \dots, 1)$

    → the average point belongs to the hyperplane

  - $\bar{\hat{Y}} = \bar{Y}$

• Analysis of variance (ANAVAR, Pythagoras):

  - $var(Y) = var(\hat{Y}) + var(\hat\epsilon)$
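Each of these properties can be checked numerically on a fitted model. A minimal sketch on simulated data (the coefficients and the seed are arbitrary assumptions); the model includes an intercept, which the last two properties require.

```r
# Simulated data (hypothetical values)
set.seed(4)
n <- 200
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 1 + 2 * x2 - x3 + rnorm(n)

fit <- lm(y ~ x2 + x3)        # lm() adds the intercept column X1 = (1, ..., 1)
y_hat <- fitted(fit)
eps_hat <- residuals(fit)

# Orthogonality: inner products with the residuals vanish (up to rounding)
print(c(sum(y_hat * eps_hat), sum(x2 * eps_hat), sum(x3 * eps_hat)))

# Residual average: the residuals sum to zero, so mean(y_hat) equals mean(y)
print(c(sum(eps_hat), mean(y_hat) - mean(y)))

# Analysis of variance: var(Y) = var(Y_hat) + var(eps_hat)
print(c(var(y), var(y_hat) + var(eps_hat)))
```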

(44)

Multiple Linear model : example with R

> head(mydata, 3)
      y   x1   x2   x3
1 -2.20 0.38 0.98 0.46
2 -1.75 0.11 0.62 0.37
3 -0.24 0.80 0.59 0.87
...

> modlm = lm(y ~ x1 + x2 + x3, data = mydata)
> modlm
Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)

Coefficients:
(Intercept)           x1           x2           x3
    0.02754      1.98163     -3.03612      0.01903

(45)

Multiple Linear model: example with R

> modlm = lm(y ~ x1 + x2 + x3, data = mydata)
> summary(modlm)
Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)

Residuals:
    Min      1Q   Median      3Q     Max
  -0.29  -0.075  -0.0035   0.073   0.281

Coefficients:
            Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)  0.02754     0.01503     1.833    0.0674 .
x1           1.98163     0.01577   125.652    <2e-16 ***
x2          -3.03612     0.01621  -187.286    <2e-16 ***
x3           0.01903     0.01576     1.208    0.2277
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1009 on 496 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
F-statistic: 1.707e+04 on 3 and 496 DF, p-value: <2.2e-16
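The data set `mydata` itself is not provided here. To experiment with this kind of output, one can simulate a data set of the same shape. The generating process below (y depending on x1 and x2 but not on x3, noise of standard deviation 0.1, n = 500) is an assumption chosen to resemble the summary above; it is not the course data and will not reproduce the same numbers.

```r
# Hypothetical data-generating process (an assumption, not the course data):
# y = 2 * x1 - 3 * x2 + noise, x3 has no effect
set.seed(5)
n <- 500
mydata <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
mydata$y <- 2 * mydata$x1 - 3 * mydata$x2 + rnorm(n, sd = 0.1)

modlm <- lm(y ~ x1 + x2 + x3, data = mydata)
s <- summary(modlm)

print(round(coef(s), 4))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
print(s$sigma)             # residual standard error
print(s$df[2])             # residual degrees of freedom: n - 4
print(c(s$r.squared, s$adj.r.squared))
```

Reading the table: the rows for x1 and x2 should show large t values, while the row for x3, which plays no role in the simulation, should typically not be significant.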

(48)

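Ordinary Least Squares: quality of the adjustment

• $R^2$, R-square (French: coefficient de détermination)

  $R^2 = \frac{var(\hat{Y})}{var(Y)}$ (remark: no unit)

  $\cos^2 \omega = R^2 = \frac{\|\hat{Y} - \bar{\hat{Y}} \mathbf{1}_n\|^2}{\|Y - \bar{Y} \mathbf{1}_n\|^2}$

  $\omega$: angle between the centered vector $(Y - \bar{Y} \mathbf{1}_n)$ and its centered prediction $(\hat{Y} - \bar{\hat{Y}} \mathbf{1}_n)$

• $var(\hat\epsilon) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (1 - R^2)\, var(Y)$ (unit: that of $Y^2$)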

(49)

Ordinary Least Squares: quality of the adjustment

• R-square (for few variables)

  $R^2 = \cos^2 \omega = \frac{Var(\hat{Y})}{Var(Y)}$

  $R^2 \in [0, 1]$

  $R^2$ increases mechanically with the number of variables

• Adjusted R-squared is sometimes preferred (penalization with the number of variables)

  $R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p}$

  $R^2_{adj}$ may be negative

• Residual study:

  $\hat\epsilon_i = y_i - \hat{y}_i$, $\forall i \in 1..n$

  Visualization of the residuals $\hat\epsilon_i$, $\forall i \in 1..n$ (for instance against the fitted values $\hat{y}_i$): homoscedastic vs. heteroscedastic residuals

• Prediction: visualization of

  $(\hat{y}_i, y_i)$, $\forall i \in 1..n$

  and comparison with the first bisector.
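These quality measures and diagnostic plots can be computed by hand and compared with what `summary()` reports. The sketch below uses simulated data (an assumption), takes p as the number of coefficients including the intercept, and draws the plots off-screen so that it also runs in a non-interactive session.

```r
# Simulated data (hypothetical values); x2 is pure noise
set.seed(6)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
p <- length(coef(fit))               # number of coefficients, intercept included

r2 <- var(fitted(fit)) / var(y)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p)
print(c(r2, r2_adj))
print(c(summary(fit)$r.squared, summary(fit)$adj.r.squared))

# Adding one more pure-noise covariate cannot decrease R2
fit_more <- lm(y ~ x1 + x2 + rnorm(n))
r2_more <- summary(fit_more)$r.squared
print(r2_more >= r2)

pdf(NULL)                            # off-screen device; drop this line interactively
# Residual study: residuals against fitted values
plot(fitted(fit), residuals(fit)); abline(h = 0)
# Prediction: observations against predictions, with the first bisector
plot(fitted(fit), y); abline(0, 1)
invisible(dev.off())
```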
