(1)

Introduction to Statistical Learning

Jean-Philippe Vert

Jean-Philippe.Vert@ensmp.fr

Mines ParisTech and Institut Curie

Master Course, 2011.

(2)

Outline

1 Introduction

2 Linear methods for regression

3 Linear methods for classification

4 Nonlinear methods with positive definite kernels

(7)

Motivations

Predict the risk of a second heart attack from demographic, diet and clinical measurements

Predict the future price of a stock from company performance measures

Recognize a ZIP code from an image

Identify the risk factors for prostate cancer

and many more applications in many areas of science, finance and industry where a lot of data are collected.

(8)

Learning from data

Supervised learning

An outcome measurement (target or response variable)

which can be quantitative (regression) or categorical (classification), and which we want to predict based on a set of features (or descriptors, or predictors)

We have a training set with features and outcome

We build a prediction model, or learner, to predict the outcome from the features for new, unseen objects

Unsupervised learning

No outcome

Describe how data are organized or clustered

Examples: Fig 1.1-1.3

(11)

Machine learning / data mining vs statistics

They share many concepts and tools, but in ML:

Prediction is more important than modelling (understanding, causality)

There is no settled philosophy or theoretical framework

We are ready to use ad hoc methods if they seem to work on real data

We often have many features, and sometimes large training sets.

We focus on efficient algorithms, with little or no human intervention.

We often use complex nonlinear models

(12)

Organization

Focus on supervised learning (regression and classification)

Reference: "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (HTF)

Available online at http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Practical sessions using R

(13)

Notations

$Y \in \mathcal{Y}$: the response (usually $\mathcal{Y} = \{-1, 1\}$ or $\mathbb{R}$)

$X \in \mathcal{X}$: the input (usually $\mathcal{X} = \mathbb{R}^p$)

$x_1, \dots, x_N$: the observed inputs, stored in the $N \times p$ matrix $\mathbf{X}$

$y_1, \dots, y_N$: the observed outputs, stored in the vector $\mathbf{Y} \in \mathcal{Y}^N$

(14)

Simple method 1: Linear least squares

Parametric model for $\beta \in \mathbb{R}^{p+1}$:

$$f_\beta(X) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i = X^\top \beta$$

Estimate $\hat{\beta}$ from the training data to minimize

$$RSS(\beta) = \sum_{i=1}^{N} \left( y_i - f_\beta(x_i) \right)^2$$

See Fig 2.1

Good if model is correct...
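As an illustration (my own sketch, not from the slides), a minimal R example fitting a linear least-squares model on simulated data; the variables and coefficients are invented for the example.

```r
# Simulate a small regression problem: y = 1 + 2*x1 - x2 + noise
set.seed(1)
N <- 100
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 2 * x1 - x2 + rnorm(N, sd = 0.5)

# Fit f_beta(X) = beta0 + beta1*x1 + beta2*x2 by minimizing the RSS
fit <- lm(y ~ x1 + x2)
coef(fit)              # estimated beta_hat
sum(residuals(fit)^2)  # RSS at the minimum
```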

(15)

Simple method 2: Nearest neighbor methods (k-NN)

Prediction based on the $k$ nearest neighbors:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

Depends on $k$

Fewer assumptions than linear regression, but more risk of overfitting

Fig 2.2-2.4
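A small base-R illustration of $k$-NN regression as the average of the $k$ nearest training responses; `knn_predict` is a hypothetical helper written for this sketch, and the simulated data are arbitrary.

```r
# k-NN regression: average the responses of the k nearest training points
knn_predict <- function(x0, X, y, k = 5) {
  d <- sqrt(rowSums((X - matrix(x0, nrow(X), ncol(X), byrow = TRUE))^2))
  mean(y[order(d)[1:k]])
}

set.seed(2)
X <- matrix(rnorm(200), ncol = 2)          # 100 training points in R^2
y <- sin(X[, 1]) + 0.1 * rnorm(100)        # smooth target + noise
knn_predict(c(0, 0), X, y, k = 5)          # prediction at the origin
```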

(16)

Statistical decision theory

Joint distribution $\Pr(X, Y)$

Loss function $L(Y, f(X))$, e.g. the squared error loss $L(Y, f(X)) = (Y - f(X))^2$

Expected prediction error (EPE):

$$EPE(f) = \mathbb{E}_{(X,Y) \sim \Pr(X,Y)} \, L(Y, f(X))$$

The minimizer is $f(X) = \mathbb{E}(Y \mid X)$ (the regression function)

Bayes classifier for the 0/1 loss in classification (Fig 2.5)

(17)

Least squares and k -NN

Least squares assumes $f(x)$ is linear, and pools over values of $X$ to estimate the best parameters. Stable but biased.

$k$-NN assumes $f(x)$ is well approximated by a locally constant function, and pools over local sample data to approximate the conditional expectation. Less stable but less biased.

(18)

Local methods in high dimension

If $N$ is large enough, $k$-NN seems always optimal (universally consistent)

But when $p$ is large, curse of dimensionality:

No method can be "local" (Fig 2.6)

Training samples sparsely populate the input space, which can lead to large bias or variance (eq. 2.25 and Fig 2.7-2.8)

If structure is known (eg, linear regression function), we can reduce both variance and bias (Fig. 2.9)

(19)

Bias-variance trade-off

Assume $Y = f(X) + \epsilon$, on a fixed design. $Y(x)$ is random because of $\epsilon$, $\hat{f}(X)$ is random because of variations in the training set $\mathcal{T}$. Then

$$\mathbb{E}_{\epsilon, \mathcal{T}} \left( Y - \hat{f}(X) \right)^2 = \mathbb{E}Y^2 + \mathbb{E}\hat{f}(X)^2 - 2\,\mathbb{E}\!\left[ Y \hat{f}(X) \right]$$

$$= \mathrm{Var}(Y) + \mathrm{Var}(\hat{f}(X)) + \left( \mathbb{E}Y - \mathbb{E}\hat{f}(X) \right)^2$$

$$= \text{noise} + \text{bias}(\hat{f})^2 + \text{variance}(\hat{f})$$
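The decomposition can be checked numerically. The sketch below (my own illustration) repeatedly redraws a training set and estimates the squared bias and the variance of a $k$-NN prediction at a fixed point, for a small and a large $k$; the target function, noise level and point $x_0$ are arbitrary choices.

```r
# Repeatedly redraw a training set and estimate bias^2 / variance of k-NN at x0
set.seed(3)
f <- function(x) sin(2 * pi * x)            # "true" regression function (arbitrary)
x0 <- 0.25; sigma <- 0.3; N <- 50
estimate_at_x0 <- function(k) {
  replicate(2000, {
    x <- runif(N); y <- f(x) + rnorm(N, sd = sigma)
    mean(y[order(abs(x - x0))[1:k]])        # k-NN prediction at x0
  })
}
for (k in c(1, 20)) {
  fhat <- estimate_at_x0(k)
  cat("k =", k,
      " bias^2 =", round((mean(fhat) - f(x0))^2, 4),
      " variance =", round(var(fhat), 4), "\n")
}
```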

(20)

Structured regression and model selection

Define a family of function classes $\mathcal{F}_\lambda$, where $\lambda$ controls the "complexity", eg:

Ball of radius $\lambda$ in a metric function space

Bandwidth of the kernel in a kernel estimator

Number of basis functions

For each $\lambda$, define

$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{F}_\lambda} EPE(f)$$

Select $\hat{f} = \hat{f}_{\hat{\lambda}}$ to minimize the bias-variance trade-off (Fig. 2.11).

(21)

Cross-validation

A simple and systematic procedure to estimate the risk (and to optimize the model’s parameters)

1 Randomly divide the training set (of size $N$) into $K$ (almost) equal portions, each of size $N/K$

2 For each portion, fit the model with different parameters on the $K - 1$ other groups and test its performance on the left-out group

3 Average the performance over the $K$ groups, and take the parameter with the smallest average error.

Taking $K = 5$ or $10$ is recommended as a good default choice.
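A base-R sketch of $K$-fold cross-validation, here used to select the degree of a polynomial regression as an illustrative stand-in for any complexity parameter; `cv_error` is a helper written for this example and the data are simulated.

```r
# 5-fold cross-validation to choose the polynomial degree of a linear model
set.seed(4)
N <- 100
x <- runif(N); y <- sin(2 * pi * x) + rnorm(N, sd = 0.3)
K <- 5
folds <- sample(rep(1:K, length.out = N))    # random fold assignment
cv_error <- function(degree) {
  mean(sapply(1:K, function(fold) {
    fit  <- lm(y ~ poly(x, degree), subset = folds != fold)   # fit on K-1 folds
    pred <- predict(fit, newdata = data.frame(x = x[folds == fold]))
    mean((y[folds == fold] - pred)^2)                         # error on left-out fold
  }))
}
sapply(1:8, cv_error)   # pick the degree with the smallest CV error
```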

(22)

Summary

To learn complex functions in high dimension from limited training sets, we need to optimize a bias-variance trade-off. We will typically do this by:

1 Define a family of learners of various complexities (eg, dimension of a linear predictor)

2 Define an estimation procedure for each learner (eg, least-squares or empirical risk minimization)

3 Define a procedure to tune the complexity of the learner (eg, cross-validation)

(23)

Outline

1 Introduction

2 Linear methods for regression

3 Linear methods for classification

4 Nonlinear methods with positive definite kernels

(24)

Linear least squares

Parametric model for $\beta \in \mathbb{R}^{p+1}$:

$$f_\beta(X) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i = X^\top \beta$$

Estimate $\hat{\beta}$ from the training data to minimize

$$RSS(\beta) = \sum_{i=1}^{N} \left( y_i - f_\beta(x_i) \right)^2$$

Solution if $\mathbf{X}^\top \mathbf{X}$ is non-singular:

$$\hat{\beta} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{Y}$$
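A quick base-R check (illustrative) that the closed-form estimator matches what `lm` computes when $\mathbf{X}^\top \mathbf{X}$ is non-singular; the simulated design and coefficients are arbitrary.

```r
# Closed-form least squares: beta_hat = (X'X)^{-1} X'y
set.seed(5)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))        # design with an intercept column
beta <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta + rnorm(N))

beta_hat <- solve(crossprod(X), crossprod(X, y)) # solves (X'X) b = X'y
cbind(closed_form = drop(beta_hat),
      lm = coef(lm(y ~ X - 1)))                  # same values
```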

(25)

Fitted values

Fitted values on the training set:

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\beta} = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{Y} = \mathbf{H}\mathbf{Y} \quad \text{with} \quad \mathbf{H} = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top$$

Geometrically: $\mathbf{H}$ projects $\mathbf{Y}$ onto the span of $\mathbf{X}$ (Fig. 3.2)

If $\mathbf{X}$ is singular, $\hat{\beta}$ is not uniquely defined, but $\hat{\mathbf{Y}}$ is

(26)

Inference on coefficients

Assume $\mathbf{Y} = \mathbf{X}\beta + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$

Then $\hat{\beta} \sim \mathcal{N}\left( \beta, \sigma^2 \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \right)$

Estimating the variance: $\hat{\sigma}^2 = \|\mathbf{Y} - \hat{\mathbf{Y}}\|^2 / (N - p - 1)$

Statistics on the coefficients:

$$\frac{\hat{\beta}_j - \beta_j}{\hat{\sigma} \sqrt{v_j}} \sim t_{N-p-1}$$

where $v_j$ is the $j$-th diagonal element of $\left( \mathbf{X}^\top \mathbf{X} \right)^{-1}$. This allows us to test the hypothesis $H_0 : \beta_j = 0$, and gives confidence intervals

$$\hat{\beta}_j \pm t_{\alpha/2, N-p-1} \, \hat{\sigma} \sqrt{v_j}$$
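An illustrative base-R computation of these t-statistics and 95% confidence intervals on simulated data; `summary(lm)` reports the same quantities, so this is only a hand check.

```r
# t-statistics and confidence intervals for OLS coefficients
set.seed(5)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))    # design with an intercept column
y <- drop(X %*% c(1, 2, -1, 0.5) + rnorm(N))
fit <- lm(y ~ X - 1)

sigma_hat <- sqrt(sum(residuals(fit)^2) / (N - p - 1))
v <- diag(solve(crossprod(X)))               # v_j = [(X'X)^{-1}]_{jj}
t_stat  <- coef(fit) / (sigma_hat * sqrt(v))
ci_low  <- coef(fit) - qt(0.975, N - p - 1) * sigma_hat * sqrt(v)
ci_high <- coef(fit) + qt(0.975, N - p - 1) * sigma_hat * sqrt(v)
cbind(t_stat, ci_low, ci_high)               # compare with summary(fit)
```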

(27)

Inference on the model

Compare a large model with $p_1$ features to a smaller model with $p_0$ features:

$$F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)}$$

follows the Fisher distribution $F_{p_1 - p_0,\, N - p_1 - 1}$ under the hypothesis that the smaller model is correct.
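A hand computation of the F statistic on simulated data (illustrative); in practice `anova()` on the two nested `lm` fits reports the same value.

```r
# F-test for comparing nested linear models (compare with anova())
set.seed(5)
N <- 100
X <- matrix(rnorm(N * 3), N, 3)
y <- drop(1 + 2 * X[, 1] + rnorm(N))         # only the first feature matters
small <- lm(y ~ X[, 1])                      # p0 = 1 feature
large <- lm(y ~ X)                           # p1 = 3 features

RSS0 <- sum(residuals(small)^2); RSS1 <- sum(residuals(large)^2)
p0 <- 1; p1 <- 3
F_stat <- ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))
c(F_stat, anova(small, large)$F[2])          # same value
```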

(28)

Gauss-Markov theorem

Assume $\mathbf{Y} = \mathbf{X}\beta + \epsilon$, where $\mathbb{E}\epsilon = 0$ and $\mathbb{E}\epsilon\epsilon^\top = \sigma^2 I$.

Then the least squares estimator $\hat{\beta}$ is BLUE (best linear unbiased estimator), i.e., for any other estimator $\tilde{\beta} = \mathbf{C}\mathbf{Y}$ with $\mathbb{E}\tilde{\beta} = \beta$,

$$\mathrm{Var}(\hat{\beta}) \leq \mathrm{Var}(\tilde{\beta})$$

Nevertheless, we may achieve a smaller total risk by increasing the bias to decrease the variance, in particular in the high-dimensional setting.

(30)

Decreasing the complexity of linear models

1 Feature subset selection

2 Penalized criterion

3 Feature construction

(31)

Feature subset selection

Best subset selection

Usually NP-hard; the "leaps and bounds" procedure works for up to p = 40

Best k selected by cross-validation or various criteria (Fig 3.5)

Greedy selection: forward, backward, hybrid

(32)

Ridge regression

Minimize

$$RSS(\beta) + \lambda \sum_{i=1}^{p} \beta_i^2$$

Solution:

$$\hat{\beta}_\lambda = \left( \mathbf{X}^\top \mathbf{X} + \lambda I \right)^{-1} \mathbf{X}^\top \mathbf{Y}$$

If $\mathbf{X}^\top \mathbf{X} = I$ (orthogonal design), then $\hat{\beta}_\lambda = \hat{\beta}/(1 + \lambda)$; otherwise the solution path is nonlinear (Fig 3.8)

Equivalent to shrinking along the small principal components (Fig 3.9)
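A base-R sketch of the closed-form ridge solution; `ridge` is a hypothetical helper that standardizes the features and centers the response (so the intercept is left unpenalized and handled implicitly), and the grid of $\lambda$ values is arbitrary.

```r
# Ridge regression: beta_hat(lambda) = (X'X + lambda I)^{-1} X'y
ridge <- function(X, y, lambda) {
  Xs <- scale(X); ys <- y - mean(y)                  # standardize X, center y
  solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, ys))
}

set.seed(6)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(X %*% c(2, -1, 0, 0, 1) + rnorm(100))
sapply(c(0, 1, 10, 100), function(l) drop(ridge(X, y, l)))  # coefficients shrink with lambda
```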

(33)

Lasso

Minimize

$$RSS(\beta) + \lambda \sum_{i=1}^{p} |\beta_i|$$

No explicit solution, but a convex quadratic program with an efficient algorithm for the whole solution path (LARS, Fig. 3.10)

Performs feature selection because the $\ell_1$ ball has singularities (Fig 3.11)
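Since there is no explicit solution, here is a minimal coordinate-descent sketch (my own illustration; `lasso_cd` and `soft` are hypothetical helpers, features are standardized, and in practice one would rather use a dedicated package such as glmnet or the LARS path algorithm).

```r
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)    # soft-thresholding operator

# Coordinate descent for RSS(beta) + lambda * sum(|beta_j|) on standardized features
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  Xs <- scale(X); ys <- y - mean(y)
  beta <- rep(0, ncol(Xs))
  for (it in 1:n_iter) {
    for (j in 1:ncol(Xs)) {
      r_j <- ys - Xs[, -j, drop = FALSE] %*% beta[-j]   # partial residual
      beta[j] <- soft(sum(Xs[, j] * r_j), lambda / 2) / sum(Xs[, j]^2)
    }
  }
  beta
}

set.seed(6)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(X %*% c(2, -1, 0, 0, 1) + rnorm(100))
round(lasso_cd(X, y, lambda = 50), 3)   # some coefficients are exactly zero
```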

(34)

Discussion

In the orthogonal design case, best subset selection, ridge regression and the lasso correspond to 3 different ways of shrinking the $\hat{\beta}$ coefficients (Fig 3.10)

They minimize $RSS(\beta)$ over respectively the $\ell_0$, $\ell_2$ and $\ell_1$ balls

Generalization: penalize by $\|\beta\|_q$, but:

the problem is convex only for $q \geq 1$

feature selection occurs only for $q \leq 1$

Generalization: group lasso, fused lasso, elastic net...

(35)

Using derived input space

PCR

OLS on the top $M$ principal components

Similar to ridge regression, but truncates instead of shrinking

PLS

Similar to PCR, but uses $\mathbf{Y}$ to construct the directions:

$$\max_\alpha \; \mathrm{Corr}^2(\mathbf{Y}, \mathbf{X}\alpha) \, \mathrm{Var}(\mathbf{X}\alpha)$$
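A base-R sketch of PCR: regress $y$ on the scores of the top $M$ principal components; `pcr_fit` is a hypothetical helper and $M = 2$ is an arbitrary choice for the example.

```r
# Principal component regression: OLS on the top M principal component scores
set.seed(6)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(X %*% c(2, -1, 0, 0, 1) + rnorm(100))

pcr_fit <- function(X, y, M) {
  pc <- prcomp(X, center = TRUE, scale. = TRUE)
  Z  <- pc$x[, 1:M, drop = FALSE]           # scores on the top M components
  lm(y ~ Z)
}
coef(pcr_fit(X, y, M = 2))                  # compare with the full OLS fit coef(lm(y ~ X))
```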

(36)

Outline

1 Introduction

2 Linear methods for regression

3 Linear methods for classification

4 Nonlinear methods with positive definite kernels

(37)

Supervised classification

$\mathcal{Y} = \{-1, 1\}$ (can be generalized to $K$ classes)

Goal: estimate $P(Y = k \mid X = x)$, or (easier) $\hat{Y}(x) = \arg\max_k P(Y = k \mid X = x)$

Approach: estimate a function $f : \mathcal{X} \to \mathbb{R}$ and predict according to

$$\hat{Y}(x) = \begin{cases} 1 & \text{if } f(x) \geq 0, \\ -1 & \text{if } f(x) < 0. \end{cases}$$

3 strategies:

1 Model $P(X, Y)$ (LDA)

2 Model $P(Y \mid X)$ (logistic regression)

3 Model the decision function $f$ directly (SVM)

(38)

Linear discriminant analysis (LDA)

Model $P(Y = k) = \pi_k$ and $P(X \mid Y = k) \sim \mathcal{N}(\mu_k, \Sigma)$

Estimation:

$$\hat{\pi}_k = \frac{N_k}{N}, \qquad \hat{\mu}_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i$$

$$\hat{\Sigma} = \frac{1}{N - 1} \sum_{k \in \{-1, 1\}} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$$

Prediction:

$$\ln \frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)} = x^\top \hat{\Sigma}^{-1} (\hat{\mu}_1 - \hat{\mu}_{-1}) - \frac{1}{2} \hat{\mu}_1^\top \hat{\Sigma}^{-1} \hat{\mu}_1 + \frac{1}{2} \hat{\mu}_{-1}^\top \hat{\Sigma}^{-1} \hat{\mu}_{-1} + \ln \frac{N_1}{N_{-1}}$$
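A from-scratch base-R sketch of these LDA estimates and of the resulting linear score, on simulated Gaussian classes (illustrative; the pooled-covariance normalization is a minor implementation choice, and MASS::lda implements the same decision rule).

```r
# LDA with classes y in {-1, 1} and a shared covariance matrix
set.seed(7)
N <- 200; p <- 2
y <- sample(c(-1, 1), N, replace = TRUE)
X <- matrix(rnorm(N * p), N, p) + outer(y, c(1, 0.5))   # class means at +/-(1, 0.5)

pi_hat <- table(y) / N
mu_m <- colMeans(X[y == -1, ]); mu_p <- colMeans(X[y == 1, ])
S <- (crossprod(sweep(X[y == -1, ], 2, mu_m)) +
      crossprod(sweep(X[y == 1, ], 2, mu_p))) / (N - 2) # pooled within-class covariance

# Linear score: sign of the log posterior ratio ln P(Y=1|x) / P(Y=-1|x)
w <- solve(S, mu_p - mu_m)
b <- -0.5 * sum((mu_p + mu_m) * w) + log(pi_hat["1"] / pi_hat["-1"])
y_hat <- sign(X %*% w + b)
mean(y_hat == y)            # training accuracy
```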

(39)

Remarks on LDA

If a $\hat{\Sigma}_k$ is estimated separately on each class, we obtain a quadratic function: quadratic discriminant analysis (QDA)

LDA performs linear discrimination $f(X) = \beta^\top X + b$. $\beta$ can also be found by OLS, taking $Y_i = N_i / N$

Good baseline method, even if the data are not Gaussian

(40)

Quadratic discriminant analysis (QDA)

Model $P(Y = k) = \pi_k$ and $P(X \mid Y = k) \sim \mathcal{N}(\mu_k, \Sigma_k)$

Estimation: same as LDA except

$$\hat{\Sigma}_k = \frac{1}{N_k} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$$

Prediction:

$$\ln \frac{P(Y = k \mid X = x)}{P(Y = l \mid X = x)} = \delta_k(x) - \delta_l(x)$$

with

$$\delta_k(x) = -\frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln \pi_k$$

(41)

Logistic regression

Model:

$$P(Y = 1 \mid X = x) = \frac{e^{\beta^\top x}}{1 + e^{\beta^\top x}}, \qquad P(Y = -1 \mid X = x) = \frac{1}{1 + e^{\beta^\top x}}$$

Equivalently,

$$P(Y = y \mid X = x) = \frac{1}{1 + e^{-y \beta^\top x}}$$

Equivalently,

$$\ln \frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)} = \beta^\top x$$

(42)

Logistic regression: parameter estimation

Log-likelihood:

$$\ell(\beta) = -\sum_{i=1}^{N} \ln\left( 1 + e^{-y_i \beta^\top x_i} \right)$$

$$\frac{\partial \ell}{\partial \beta}(\beta) = \sum_{i=1}^{N} \frac{y_i x_i}{1 + e^{y_i \beta^\top x_i}} = \sum_{i=1}^{N} y_i \, p(-y_i \mid x_i) \, x_i$$

$$\frac{\partial^2 \ell}{\partial \beta \partial \beta^\top}(\beta) = -\sum_{i=1}^{N} \frac{x_i x_i^\top e^{\beta^\top x_i}}{\left( 1 + e^{\beta^\top x_i} \right)^2} = -\sum_{i=1}^{N} p(1 \mid x_i) \left( 1 - p(1 \mid x_i) \right) x_i x_i^\top$$

Optimization by Newton-Raphson amounts to iteratively reweighted least squares (IRLS)

Problem if the data are linearly separable ⇒ regularization
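A base-R Newton-Raphson sketch using the gradient and Hessian above, with labels in $\{-1, 1\}$ (illustrative; `logistic_newton` is a hypothetical helper, the data are simulated and non-separable, and `glm(..., family = binomial)` is the standard route).

```r
# Newton-Raphson (IRLS) for logistic regression with labels y in {-1, 1}
set.seed(7)
N <- 200
y <- sample(c(-1, 1), N, replace = TRUE)
X <- cbind(1, matrix(rnorm(N * 2), N, 2) + outer(y, c(1, 0.5)))  # intercept + 2 features

logistic_newton <- function(X, y, n_iter = 20) {
  beta <- rep(0, ncol(X))
  for (it in 1:n_iter) {
    p1 <- 1 / (1 + exp(-drop(X %*% beta)))               # P(Y = 1 | x_i)
    grad <- colSums(y * X / (1 + exp(y * drop(X %*% beta))))  # gradient of l(beta)
    H <- -t(X) %*% (X * (p1 * (1 - p1)))                 # Hessian of l(beta)
    beta <- beta - solve(H, grad)                        # Newton step
  }
  beta
}
round(logistic_newton(X, y), 3)   # compare with glm((y + 1)/2 ~ X - 1, family = binomial)
```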

(43)

Regularized logistic regression

Problem if the data are linearly separable: an infinite likelihood is possible

Classical $\ell_2$ regularization:

$$\min_\beta \; \sum_{i=1}^{N} \ln\left( 1 + e^{-y_i \beta^\top x_i} \right) + \lambda \sum_{i=1}^{p} \beta_i^2$$

$\ell_1$ regularization (feature selection):

$$\min_\beta \; \sum_{i=1}^{N} \ln\left( 1 + e^{-y_i \beta^\top x_i} \right) + \lambda \sum_{i=1}^{p} |\beta_i|$$

(44)

LDA vs Logistic regression

Both methods are linear

Estimation is different: model $P(X, Y)$ (likelihood) or $P(Y \mid X)$ (conditional likelihood)

LDA works better if the data are Gaussian, but it is more sensitive to outliers

(45)

Hard-margin SVM

If the data are linearly separable, separate them with the largest margin

Equivalently, $\min_\beta \|\beta\|^2 / 2$ such that $y_i \beta^\top x_i \geq 1$ for all $i$

Dual problem:

$$\max_{\alpha \geq 0} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^\top x_j$$

and

$$\hat{\beta} = \sum_{i=1}^{N} y_i \alpha_i x_i$$

(46)

Soft-margin SVM

If the data are not linearly separable, add slack variables:

$$\min_\beta \; \frac{\|\beta\|^2}{2} + C \sum_{i=1}^{N} \zeta_i \quad \text{such that} \quad y_i \beta^\top x_i \geq 1 - \zeta_i, \;\; \zeta_i \geq 0$$

Dual problem: same as the hard-margin case with the additional constraint $0 \leq \alpha \leq C$

Equivalently,

$$\min_\beta \; \sum_{i=1}^{N} \max(0, 1 - y_i \beta^\top x_i) + \lambda \|\beta\|^2$$
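A base-R subgradient-descent sketch of the hinge-loss formulation above (illustrative and unoptimized; `svm_hinge` is a hypothetical helper, the step size and iteration count are arbitrary, and a real implementation would solve the dual QP, e.g. via the e1071 or kernlab packages).

```r
# Subgradient descent on sum_i max(0, 1 - y_i * x_i' beta) + lambda * ||beta||^2
set.seed(7)
N <- 200
y <- sample(c(-1, 1), N, replace = TRUE)
X <- cbind(1, matrix(rnorm(N * 2), N, 2) + outer(y, c(1, 0.5)))  # intercept + 2 features

svm_hinge <- function(X, y, lambda = 0.1, n_iter = 2000, step = 0.001) {
  beta <- rep(0, ncol(X))
  for (it in 1:n_iter) {
    margin <- y * drop(X %*% beta)
    active <- margin < 1                               # points violating the margin
    g <- -colSums((y * X)[active, , drop = FALSE]) + 2 * lambda * beta  # subgradient
    beta <- beta - step * g
  }
  beta
}
beta_svm <- svm_hinge(X, y)
mean(sign(drop(X %*% beta_svm)) == y)    # training accuracy of the linear SVM
```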

(47)

Large-margin classifiers

The margin is $y f(x)$

LDA, logistic regression and the SVM all try to ensure a large margin:

$$\min_\beta \; \sum_{i=1}^{N} \phi\left( y_i f(x_i) \right) + \lambda \Omega(\beta)$$

where

$$\phi(u) = \begin{cases} (1 - u)^2 & \text{for LDA} \\ \ln\left( 1 + e^{-u} \right) & \text{for logistic regression} \\ \max(0, 1 - u) & \text{for the SVM} \end{cases}$$

(48)

Outline

1 Introduction

2 Linear methods for regression

3 Linear methods for classification

4 Nonlinear methods with positive definite kernels

(49)

Feature expansion

We have seen many linear methods for regression and classification, of the form

$$\min_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^{N} L\left( y_i, \beta^\top x_i \right) + \lambda \|\beta\|_2^2$$

To be nonlinear in $x$, we can apply them after some transformation $x \mapsto \Phi(x) \in \mathbb{R}^q$, where $q$ may be larger than $p$

Example: nonlinear functions of $x$, polynomials, ...

Notation: we define the kernel corresponding to $\Phi$ by

$$K(x, x') = \Phi(x)^\top \Phi(x')$$

(50)

Representer theorem

For any solution of

$$\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^q} \; \sum_{i=1}^{N} L\left( y_i, \beta^\top \Phi(x_i) \right) + \lambda \|\beta\|_2^2$$

there exists $\hat{\alpha} \in \mathbb{R}^N$ such that

$$\hat{\beta} = \sum_{i=1}^{N} \hat{\alpha}_i \Phi(x_i).$$

Consequence:

$$\hat{f}(x) = \hat{\beta}^\top \Phi(x) = \sum_{i=1}^{N} \hat{\alpha}_i K(x_i, x)$$

(51)

Solving in α

$\left( f(x_i) \right)_{i=1,\dots,N} = K\alpha$ and $\|\beta\|_2^2 = \alpha^\top K \alpha$, so we can plug $\alpha$ into the optimization problem instead of $\beta$, and only $K$ is needed

Example: kernel ridge regression:

$$\hat{\alpha} = (K + \lambda I)^{-1} \mathbf{Y}$$

Example: kernel SVM:

$$\max_{0 \leq \alpha \leq C} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
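A base-R sketch of kernel ridge regression with a Gaussian kernel, directly implementing $\hat{\alpha} = (K + \lambda I)^{-1} \mathbf{Y}$; `rbf_kernel` is a hypothetical helper, and the bandwidth and $\lambda$ are arbitrary choices for this illustration.

```r
# Kernel ridge regression with a Gaussian (RBF) kernel
rbf_kernel <- function(A, B, sigma = 1) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)  # squared distances
  exp(-d2 / (2 * sigma^2))
}

set.seed(8)
x <- matrix(seq(0, 1, length.out = 50))            # 1-D inputs
y <- sin(4 * pi * x) + 0.2 * rnorm(50)

K <- rbf_kernel(x, x, sigma = 0.1)
alpha <- solve(K + 0.1 * diag(50), y)              # alpha_hat = (K + lambda I)^{-1} Y

x_new <- matrix(seq(0, 1, length.out = 200))
f_hat <- rbf_kernel(x_new, x, sigma = 0.1) %*% alpha   # f_hat(x) = sum_i alpha_i K(x_i, x)
```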

(52)

Positive definite kernel

Theorem (Aronszajn)

There exists $\Phi : \mathbb{R}^p \to \mathbb{R}^q$ for some $q$ (possibly infinite) if and only if $K$ is positive definite, i.e., $K(x, x') = K(x', x)$ for any $x, x'$, and

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j K(x_i, x_j) \geq 0$$

for any $n$, $a$ and $x$.

Examples:

Linear: $K(x, x') = x^\top x'$

Polynomial: $K(x, x') = (x^\top x')^d$

Gaussian: $K(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$
