Introduction to Statistical Learning
Jean-Philippe Vert
Jean-Philippe.Vert@ensmp.fr
Mines ParisTech and Institut Curie
Master Course, 2011.
Outline
1 Introduction
2 Linear methods for regression
3 Linear methods for classification
4 Nonlinear methods with positive definite kernels
Motivations
Predict the risk of a second heart attack from demographic, diet and clinical measurements
Predict the future price of a stock from company performance measures
Recognize a ZIP code from an image
Identify the risk factors for prostate cancer
and many more applications in many areas of science, finance and industry where a lot of data are collected.
Learning from data
Supervised learning
An outcome measurement (target or response variable),
which can be quantitative (regression) or categorical (classification),
which we want to predict based on a set of features (or descriptors, or predictors)
We have a training set with features and outcome
We build a prediction model, or learner, to predict the outcome from the features for new, unseen objects
Unsupervised learning
No outcome
Describe how data are organized or clustered
Examples - Fig 1.1-1.3
Machine learning / data mining vs statistics
They share many concepts and tools, but in ML:
Prediction is more important than modelling (understanding, causality)
There is no settled philosophy or theoretical framework
We are ready to use ad hoc methods if they seem to work on real data
We often have many features, and sometimes large training sets
We focus on efficient algorithms, with little or no human intervention
We often use complex nonlinear models
Organization
Focus on supervised learning (regression and classification)
Reference: "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (HTF)
Available online at http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Practical sessions using R
Notations
$Y \in \mathcal{Y}$ the response (usually $\mathcal{Y} = \{-1, 1\}$ or $\mathbb{R}$)
$X \in \mathcal{X}$ the input (usually $\mathcal{X} = \mathbb{R}^p$)
$x_1, \ldots, x_N$ observed inputs, stored in the $N \times p$ matrix $X$
$y_1, \ldots, y_N$ observed responses, stored in the vector $Y \in \mathcal{Y}^N$
Simple method 1: Linear least squares
Parametric model for $\beta \in \mathbb{R}^{p+1}$:
$$f_\beta(X) = \beta_0 + \sum_{i=1}^p \beta_i X_i = X^\top \beta$$
Estimate $\hat\beta$ from training data to minimize
$$\mathrm{RSS}(\beta) = \sum_{i=1}^N \left(y_i - f_\beta(x_i)\right)^2$$
See Fig 2.1
Good if model is correct...
Simple method 2: Nearest neighbor methods (k-NN)
Prediction based on the $k$ nearest neighbors:
$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
Depends on $k$
Fewer assumptions than linear regression, but more risk of overfitting
Fig 2.2-2.4
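A minimal sketch of the k-NN average above in R (the function name, and the assumption that x is a numeric matrix of training inputs, y a response vector and x0 a query point, are illustrative):

# k-NN regression estimate: average the responses of the k nearest training points
knn_predict <- function(x, y, x0, k = 5) {
  # Euclidean distances from the query point x0 to every training input
  d <- sqrt(rowSums((x - matrix(x0, nrow(x), ncol(x), byrow = TRUE))^2))
  nn <- order(d)[1:k]        # indices of the k nearest neighbors
  mean(y[nn])                # local average
}
# Example use: knn_predict(as.matrix(trainX), trainY, newx, k = 15)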
Statistical decision theory
Joint distribution $\Pr(X, Y)$
Loss function $L(Y, f(X))$, e.g. squared error loss $L(Y, f(X)) = (Y - f(X))^2$
Expected prediction error (EPE):
$$\mathrm{EPE}(f) = E_{(X,Y) \sim \Pr(X,Y)}\, L(Y, f(X))$$
Minimizer is $f(X) = E(Y \mid X)$ (the regression function)
Bayes classifier for 0/1 loss in classification (Fig 2.5)
Least squares and k-NN
Least squares assumes $f(x)$ is linear, and pools over all values of $X$ to estimate the best parameters. Stable but biased.
$k$-NN assumes $f(x)$ is well approximated by a locally constant function, and pools over local sample data to approximate the conditional expectation. Less stable but less biased.
Local methods in high dimension
If $N$ is large enough, $k$-NN seems always optimal (universally consistent)
But when $p$ is large, curse of dimensionality:
No method can be "local" (Fig 2.6)
Training samples sparsely populate the input space, which can lead to large bias or variance (eq. 2.25 and Fig 2.7-2.8)
If structure is known (e.g., a linear regression function), we can reduce both variance and bias (Fig. 2.9)
Bias-variance trade-off
Assume $Y = f(X) + \epsilon$, on a fixed design. $Y(x)$ is random because of $\epsilon$, $\hat f(X)$ is random because of variations in the training set $\mathcal{T}$. Then
$$E_{\epsilon, \mathcal{T}} \left(Y - \hat f(X)\right)^2 = E Y^2 + E \hat f(X)^2 - 2\, E Y \hat f(X) = \mathrm{Var}(Y) + \mathrm{Var}(\hat f(X)) + \left(E Y - E \hat f(X)\right)^2 = \text{noise} + \text{bias}(\hat f)^2 + \text{variance}(\hat f)$$
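A small simulation sketch of this decomposition in R (the true function, noise level, test point and the choice of a degree-3 polynomial fit are all illustrative assumptions):

# Empirical check of noise + bias^2 + variance at a fixed test point x0
set.seed(1)
f <- function(x) sin(2 * pi * x)           # assumed true regression function
sigma <- 0.3; N <- 30; x0 <- 0.5; B <- 2000
x <- seq(0, 1, length.out = N)             # fixed design
pred <- err <- numeric(B)
for (b in 1:B) {
  y <- f(x) + rnorm(N, sd = sigma)         # new training responses
  fit <- lm(y ~ poly(x, 3))                # a learner of fixed complexity
  pred[b] <- predict(fit, data.frame(x = x0))             # \hat f(x0)
  err[b]  <- (f(x0) + rnorm(1, sd = sigma) - pred[b])^2   # (Y - \hat f(x0))^2
}
c(mean_error     = mean(err),
  noise_bias_var = sigma^2 + (mean(pred) - f(x0))^2 + var(pred))  # should match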
Structured regression and model selection
Define a family of function classes $\mathcal{F}_\lambda$, where $\lambda$ controls the "complexity", e.g.:
Ball of radius $\lambda$ in a metric function space
Bandwidth of the kernel in a kernel estimator
Number of basis functions
For each $\lambda$, define
$$\hat f_\lambda = \arg\min_{f \in \mathcal{F}_\lambda} \mathrm{EPE}(f)$$
Select $\hat f = \hat f_{\hat\lambda}$ to minimize the bias-variance trade-off (Fig. 2.11).
Cross-validation
A simple and systematic procedure to estimate the risk (and to optimize the model's parameters)
1 Randomly divide the training set (of size $N$) into $K$ (almost) equal portions, each of size $N/K$
2 For each portion, fit the model with different parameters on the $K-1$ other groups and test its performance on the left-out group
3 Average the performance over the $K$ groups, and select the parameter with the smallest average error
Taking $K = 5$ or 10 is recommended as a good default choice.
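A minimal K-fold cross-validation sketch in R, here used to choose the degree of a polynomial regression (the function name, the use of lm with poly, and the commented example data are illustrative assumptions):

# K-fold cross-validation error of a polynomial regression of a given degree
cv_error <- function(x, y, degree, K = 10) {
  N <- length(y)
  folds <- sample(rep(1:K, length.out = N))      # random partition into K groups
  errs <- sapply(1:K, function(k) {
    d_tr <- data.frame(x = x[folds != k], y = y[folds != k])   # K-1 groups
    d_te <- data.frame(x = x[folds == k], y = y[folds == k])   # left-out group
    fit <- lm(y ~ poly(x, degree), data = d_tr)
    mean((d_te$y - predict(fit, newdata = d_te))^2)            # test error
  })
  mean(errs)                                     # average over the K folds
}
# Pick the degree with the smallest average CV error (x, y: some training data)
# degrees <- 1:8
# best <- degrees[which.min(sapply(degrees, function(d) cv_error(x, y, d)))]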
Summary
To learn complex functions in high dimension from limited training sets, we need to optimize a bias-variance trade-off. We will typically do that as follows:
1 Define a family of learners of various complexities (e.g., the dimension of a linear predictor)
2 Define an estimation procedure for each learner (e.g., least squares or empirical risk minimization)
3 Define a procedure to tune the complexity of the learner (e.g., cross-validation)
Outline
1 Introduction
2 Linear methods for regression
3 Linear methods for classification
4 Nonlinear methods with positive definite kernels
Linear least squares
Parametric model for $\beta \in \mathbb{R}^{p+1}$:
$$f_\beta(X) = \beta_0 + \sum_{i=1}^p \beta_i X_i = X^\top \beta$$
Estimate $\hat\beta$ from training data to minimize
$$\mathrm{RSS}(\beta) = \sum_{i=1}^N \left(y_i - f_\beta(x_i)\right)^2$$
Solution if $X^\top X$ is non-singular:
$$\hat\beta = \left(X^\top X\right)^{-1} X^\top Y$$
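In R, the closed form can be checked directly against lm() (a brief sketch on simulated data; all names and values are illustrative):

# Ordinary least squares: closed form vs. lm()
set.seed(1)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))     # design matrix with intercept column
beta <- c(1, 2, 0, -1)
y <- drop(X %*% beta + rnorm(N))
beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y
fit <- lm(y ~ X[, -1])                        # same fit via lm()
cbind(closed_form = beta_hat, lm = coef(fit))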
Fitted values
Fitted values on the training set:
$$\hat{Y} = X \hat\beta = X \left(X^\top X\right)^{-1} X^\top Y = H Y \quad\text{with}\quad H = X \left(X^\top X\right)^{-1} X^\top$$
Geometrically: $H$ projects $Y$ onto the span of $X$ (Fig. 3.2)
If $X$ is singular, $\hat\beta$ is not uniquely defined, but $\hat{Y}$ is
Inference on coefficients
Assume $Y = X\beta + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Then $\hat\beta \sim \mathcal{N}\left(\beta, \sigma^2 \left(X^\top X\right)^{-1}\right)$
Estimating the variance: $\hat\sigma^2 = \|Y - \hat{Y}\|^2 / (N - p - 1)$
Statistics on the coefficients:
$$\frac{\hat\beta_j - \beta_j}{\hat\sigma \sqrt{v_j}} \sim t_{N-p-1}\,, \quad\text{where } v_j \text{ is the } j\text{-th diagonal entry of } \left(X^\top X\right)^{-1}$$
This allows to test the hypothesis $H_0: \beta_j = 0$, and gives confidence intervals
$$\hat\beta_j \pm t_{\alpha/2,\, N-p-1}\, \hat\sigma \sqrt{v_j}$$
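These t-statistics and intervals are what summary() and confint() report for an lm fit in R (a sketch; the simulated data are only for illustration):

# t-tests and confidence intervals for OLS coefficients
set.seed(2)
N <- 50
x1 <- rnorm(N); x2 <- rnorm(N)
y <- 1 + 2 * x1 + rnorm(N)             # x2 has no true effect
fit <- lm(y ~ x1 + x2)
summary(fit)$coefficients              # estimates, std. errors, t values, p-values
confint(fit, level = 0.95)             # intervals beta_j +/- t * sigma_hat * sqrt(v_j)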
Inference on the model
Compare a large model with $p_1$ features to a smaller model with $p_0$ features:
$$F = \frac{\left(\mathrm{RSS}_0 - \mathrm{RSS}_1\right) / (p_1 - p_0)}{\mathrm{RSS}_1 / (N - p_1 - 1)}$$
follows the Fisher distribution $F_{p_1 - p_0,\, N - p_1 - 1}$ under the hypothesis that the small model is correct.
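In R this F-test is what anova() computes for two nested lm fits (a short, self-contained sketch on simulated data):

# F-test for nested linear models via anova()
set.seed(2)
x1 <- rnorm(50); x2 <- rnorm(50); y <- 1 + 2 * x1 + rnorm(50)
anova(lm(y ~ x1), lm(y ~ x1 + x2))     # F statistic and p-value for adding x2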
Gauss-Markov theorem
Assume $Y = X\beta + \epsilon$, where $E\epsilon = 0$ and $E\epsilon\epsilon^\top = \sigma^2 I$.
Then the least squares estimator $\hat\beta$ is BLUE (best linear unbiased estimator), i.e., for any other linear estimator $\tilde\beta = CY$ with $E\tilde\beta = \beta$,
$$\mathrm{Var}(\hat\beta) \le \mathrm{Var}(\tilde\beta)$$
Nevertheless, we may have smaller total risk by increasing bias to decrease variance, in particular in the high-dimensional setting.
Decreasing the complexity of linear models
1 Feature subset selection
2 Penalized criterion
3 Feature construction
Feature subset selection
Best subset selection
Usually NP-hard; the "leaps and bounds" procedure works for up to $p = 40$
Best $k$ selected by cross-validation of various criteria (Fig 3.5)
Greedy selection: forward, backward, hybrid
Ridge regression
Minimize
$$\mathrm{RSS}(\beta) + \lambda \sum_{i=1}^p \beta_i^2$$
Solution:
$$\hat\beta_\lambda = \left(X^\top X + \lambda I\right)^{-1} X^\top Y$$
If $X^\top X = I$ (orthogonal design), then $\hat\beta_\lambda = \hat\beta / (1 + \lambda)$; otherwise the solution path is nonlinear (Fig 3.8)
Equivalent to shrinking on the small principal components (Fig 3.9)
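A direct implementation of the closed form in R (a sketch; centering/scaling handles the intercept crudely, and the lambda values are arbitrary):

# Ridge regression: (X'X + lambda I)^{-1} X'y on centered, standardized data
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                       # standardize features
  yc <- y - mean(y)                    # center the response (intercept kept aside)
  solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
}
# Coefficients shrink toward 0 as lambda grows (X, y: some training data)
# sapply(c(0, 1, 10, 100), function(l) ridge_coef(X, y, l))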
Lasso
Minimize
$$\mathrm{RSS}(\beta) + \lambda \sum_{i=1}^p |\beta_i|$$
No explicit solution, but a convex quadratic program with efficient algorithms for the solution path (LARS, Fig. 3.10)
Performs feature selection because the $\ell_1$ ball has singularities (Fig 3.11)
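In practice the lasso path can be computed with the glmnet package in R (a sketch, assuming glmnet is installed; the simulated data are illustrative):

# Lasso with glmnet: coefficients become exactly zero as lambda increases
library(glmnet)
set.seed(3)
N <- 100; p <- 20
X <- matrix(rnorm(N * p), N, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(N)    # only the first two features matter
fit <- glmnet(X, y, alpha = 1)         # alpha = 1 gives the lasso penalty
cvfit <- cv.glmnet(X, y)               # choose lambda by cross-validation
coef(cvfit, s = "lambda.min")          # sparse coefficient vector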
Discussion
In the orthogonal design, best subset selection, ridge regression and the lasso correspond to 3 different ways to shrink the $\hat\beta$ coefficients (Fig 3.10)
They minimize $\mathrm{RSS}(\beta)$ over respectively the $\ell_0$, $\ell_2$ and $\ell_1$ balls
Generalization: penalize by $\|\beta\|_q$, but:
convex problem only for $q \ge 1$; feature selection only for $q \le 1$
Generalizations: group lasso, fused lasso, elastic net...
Using derived input space
PCR (principal component regression):
OLS on the top $M$ principal components
Similar to ridge regression, but truncates instead of shrinking
PLS (partial least squares):
Similar to PCR but uses $Y$ to construct the directions:
$$\max_\alpha \; \mathrm{Corr}^2(Y, X\alpha)\, \mathrm{Var}(X\alpha)$$
Outline
1 Introduction
2 Linear methods for regression
3 Linear methods for classification
4 Nonlinear methods with positive definite kernels
Supervised classification
$\mathcal{Y} = \{-1, 1\}$ (can be generalized to $K$ classes)
Goal: estimate $P(Y = k \mid X = x)$, or (easier) $\hat{Y}(x) = \arg\max_k P(Y = k \mid X = x)$
Approach: estimate a function $f: \mathcal{X} \to \mathbb{R}$ and predict according to
$$\hat{Y}(x) = \begin{cases} 1 & \text{if } f(x) \ge 0, \\ -1 & \text{if } f(x) < 0. \end{cases}$$
3 strategies:
1 Model $P(X, Y)$ (LDA)
2 Model $P(Y \mid X)$ (logistic regression)
3 Model $f$ directly (SVM)
Linear discriminant analysis (LDA)
Model $P(Y = k) = \pi_k$ and $P(X \mid Y = k) \sim \mathcal{N}(\mu_k, \Sigma)$
Estimation:
$$\hat\pi_k = \frac{N_k}{N}\,, \qquad \hat\mu_k = \frac{1}{N_k} \sum_{i: y_i = k} x_i\,, \qquad \hat\Sigma = \frac{1}{N - 1} \sum_{k \in \{-1,1\}} \sum_{i: y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top$$
Prediction:
$$\ln \frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)} = x^\top \hat\Sigma^{-1} (\hat\mu_1 - \hat\mu_{-1}) - \frac{1}{2} \hat\mu_1^\top \hat\Sigma^{-1} \hat\mu_1 + \frac{1}{2} \hat\mu_{-1}^\top \hat\Sigma^{-1} \hat\mu_{-1} + \ln \frac{N_1}{N_{-1}}$$
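A minimal LDA sketch in R following the estimates above (the function name and simulated-data assumptions are illustrative; MASS::lda yields the same decision rule):

# LDA for two classes coded -1 / +1
lda_fit <- function(X, y) {
  X1 <- X[y == 1, , drop = FALSE]; X0 <- X[y == -1, , drop = FALSE]
  mu1 <- colMeans(X1); mu0 <- colMeans(X0)
  # pooled within-class covariance
  S <- (crossprod(sweep(X1, 2, mu1)) + crossprod(sweep(X0, 2, mu0))) / (nrow(X) - 1)
  w <- solve(S, mu1 - mu0)                                      # discriminant direction
  b <- -0.5 * sum((mu1 + mu0) * w) + log(nrow(X1) / nrow(X0))   # intercept
  list(w = w, b = b)                                            # predict +1 if x'w + b >= 0
}
# Example use: m <- lda_fit(X, y); pred <- sign(X %*% m$w + m$b)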
Remarks on LDA
If a separate $\hat\Sigma_k$ is estimated on each class, we obtain a quadratic decision function: quadratic discriminant analysis (QDA)
LDA performs linear discrimination $f(X) = \beta^\top X + b$; $\beta$ can also be found by OLS, taking $Y_i = N_i / N$
Good baseline method, even if the data are not Gaussian
Quadratic discriminant analysis (QDA)
Model $P(Y = k) = \pi_k$ and $P(X \mid Y = k) \sim \mathcal{N}(\mu_k, \Sigma_k)$
Estimation: same as LDA except
$$\hat\Sigma_k = \frac{1}{N_k} \sum_{i: y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top$$
Prediction:
$$\ln \frac{P(Y = k \mid X = x)}{P(Y = l \mid X = x)} = \delta_k(x) - \delta_l(x) \quad\text{with}\quad \delta_k(x) = -\frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln \pi_k$$
Logistic regression
Model:
$$P(Y = 1 \mid X = x) = \frac{e^{\beta^\top x}}{1 + e^{\beta^\top x}}\,, \qquad P(Y = -1 \mid X = x) = \frac{1}{1 + e^{\beta^\top x}}$$
Equivalently,
$$P(Y = y \mid X = x) = \frac{1}{1 + e^{-y \beta^\top x}}$$
Equivalently,
$$\ln \frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)} = \beta^\top x$$
Logistic regression: parameter estimation
Log-likelihood:
$$\ell(\beta) = -\sum_{i=1}^N \ln\left(1 + e^{-y_i \beta^\top x_i}\right)$$
Gradient:
$$\frac{\partial \ell}{\partial \beta}(\beta) = \sum_{i=1}^N \frac{y_i x_i}{1 + e^{y_i \beta^\top x_i}} = \sum_{i=1}^N y_i\, p(-y_i \mid x_i)\, x_i$$
Hessian:
$$\frac{\partial^2 \ell}{\partial \beta \partial \beta^\top}(\beta) = -\sum_{i=1}^N \frac{x_i x_i^\top e^{\beta^\top x_i}}{\left(1 + e^{\beta^\top x_i}\right)^2} = -\sum_{i=1}^N p(1 \mid x_i)\left(1 - p(1 \mid x_i)\right) x_i x_i^\top$$
Optimization by Newton-Raphson is iteratively reweighted least squares (IRLS)
Problem if the data are linearly separable ⇒ regularization
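In R, maximum-likelihood logistic regression is fit by IRLS via glm() (a sketch; glm expects labels in {0, 1}, so the -1/+1 coding is converted, and the simulated data are illustrative):

# Logistic regression by IRLS via glm(); labels converted from {-1,+1} to {0,1}
set.seed(4)
N <- 200
X <- matrix(rnorm(N * 2), N, 2)
y <- ifelse(X[, 1] + 0.5 * X[, 2] + rnorm(N) > 0, 1, -1)
fit <- glm((y + 1) / 2 ~ X, family = binomial)   # P(Y = 1 | x) = 1 / (1 + exp(-beta'x))
coef(fit)                                        # intercept and weights
predict(fit, type = "response")[1:5]             # fitted probabilities P(Y = 1 | x_i)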
Regularized logistic regression
Problem if the data are linearly separable: the likelihood is maximized only as $\|\beta\| \to \infty$
Classical $\ell_2$ regularization:
$$\min_\beta \; \sum_{i=1}^N \ln\left(1 + e^{-y_i \beta^\top x_i}\right) + \lambda \sum_{i=1}^p \beta_i^2$$
$\ell_1$ regularization (feature selection):
$$\min_\beta \; \sum_{i=1}^N \ln\left(1 + e^{-y_i \beta^\top x_i}\right) + \lambda \sum_{i=1}^p |\beta_i|$$
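Both penalized versions are available in the glmnet package (a sketch, assuming glmnet is installed; alpha = 0 gives the ℓ2 penalty, alpha = 1 the ℓ1 penalty, and the data are simulated for illustration):

# Penalized logistic regression with glmnet
library(glmnet)
set.seed(7)
N <- 200
X <- matrix(rnorm(N * 10), N, 10)
y <- factor(ifelse(X[, 1] - X[, 2] + rnorm(N) > 0, 1, -1))
fit_l2 <- glmnet(X, y, family = "binomial", alpha = 0)   # l2-penalized logistic
fit_l1 <- glmnet(X, y, family = "binomial", alpha = 1)   # l1-penalized (sparse) logistic
coef(fit_l1, s = 0.05)                                   # coefficients at lambda = 0.05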
LDA vs Logistic regression
Both methods are linear
Estimation is different: model $P(X, Y)$ (likelihood) or $P(Y \mid X)$ (conditional likelihood)
LDA works better if the data are Gaussian, but is more sensitive to outliers
Hard-margin SVM
If the data are linearly separable, separate them with the largest margin
Equivalently, $\min_\beta \|\beta\|^2$ such that $y_i \beta^\top x_i \ge 1$
Dual problem:
$$\max_{\alpha \ge 0} \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i^\top x_j$$
and
$$\hat\beta = \sum_{i=1}^N y_i \alpha_i x_i$$
Soft-margin SVM
If the data are not linearly separable, add slack variables:
$$\min_\beta \; \|\beta\|^2 / 2 + C \sum_{i=1}^N \zeta_i \quad\text{such that}\quad y_i \beta^\top x_i \ge 1 - \zeta_i\,,\ \zeta_i \ge 0$$
Dual problem: same as hard-margin with the additional constraint $0 \le \alpha \le C$
Equivalently,
$$\min_\beta \; \sum_{i=1}^N \max\left(0, 1 - y_i \beta^\top x_i\right) + \lambda \|\beta\|^2$$
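A linear soft-margin SVM can be fit in R with the e1071 package (a sketch, assuming e1071 is installed; the cost argument plays the role of C, and the simulated data are illustrative):

# Linear soft-margin SVM with e1071
library(e1071)
set.seed(5)
N <- 200
X <- matrix(rnorm(N * 2), N, 2)
y <- factor(ifelse(X[, 1] - X[, 2] + 0.3 * rnorm(N) > 0, 1, -1))
fit <- svm(X, y, kernel = "linear", cost = 1, scale = FALSE)
table(predict(fit, X), y)                 # training confusion matrix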
Large-margin classifiers
The margin is $y f(x)$
LDA, logistic regression and SVM all try to ensure a large margin:
$$\min_\beta \; \sum_{i=1}^N \phi\left(y_i f(x_i)\right) + \lambda \Omega(\beta)$$
where
$$\phi(u) = \begin{cases} (1 - u)^2 & \text{for LDA} \\ \ln\left(1 + e^{-u}\right) & \text{for logistic regression} \\ \max(0, 1 - u) & \text{for SVM} \end{cases}$$
Outline
1 Introduction
2 Linear methods for regression
3 Linear methods for classification
4 Nonlinear methods with positive definite kernels
Feature expansion
We have seen many linear methods for regression and classification, of the form
$$\min_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^N L\left(y_i, \beta^\top x_i\right) + \lambda \|\beta\|_2^2$$
To be nonlinear in $x$, we can apply them after some transformation $x \mapsto \Phi(x) \in \mathbb{R}^q$, where $q$ may be larger than $p$
Example: nonlinear functions of $x$, polynomials, ...
Notation: we define the kernel corresponding to $\Phi$ by
$$K(x, x') = \Phi(x)^\top \Phi(x')$$
Representer theorem
For any solution of
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^q} \; \sum_{i=1}^N L\left(y_i, \beta^\top \Phi(x_i)\right) + \lambda \|\beta\|_2^2$$
there exists $\hat\alpha \in \mathbb{R}^N$ such that
$$\hat\beta = \sum_{i=1}^N \hat\alpha_i \Phi(x_i).$$
Consequence:
$$\hat f(x) = \sum_{i=1}^N \hat\alpha_i K(x_i, x)$$
Solving in α
$\left(f(x_i)\right)_{i=1,\ldots,N} = K\alpha$ and $\|\beta\|_2^2 = \alpha^\top K \alpha$, so we can plug $\alpha$ into the optimization problem instead of $\beta$, and only $K$ is needed
Example: kernel ridge regression:
$$\hat\alpha = (K + \lambda I)^{-1} Y$$
Example: kernel SVM:
$$\max_{0 \le \alpha \le C} \; \sum_{i=1}^N \alpha_i y_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j K(x_i, x_j)$$
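A kernel ridge regression sketch in R with a Gaussian kernel (the kernel bandwidth, lambda and the simulated data are illustrative assumptions):

# Kernel ridge regression: alpha = (K + lambda I)^{-1} y, f(x) = sum_i alpha_i K(x_i, x)
gauss_kernel <- function(A, B, sigma = 0.5) {
  # pairwise squared distances between rows of A and rows of B
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}
set.seed(6)
x <- matrix(seq(0, 1, length.out = 50), ncol = 1)
y <- sin(4 * pi * x) + rnorm(50, sd = 0.2)
lambda <- 0.1
K <- gauss_kernel(x, x)
alpha <- solve(K + lambda * diag(nrow(K)), y)     # dual coefficients
xnew <- matrix(seq(0, 1, length.out = 200), ncol = 1)
fhat <- gauss_kernel(xnew, x) %*% alpha           # nonlinear predictions at new points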
Positive definite kernel
Theorem (Aronszajn)
There exists $\Phi: \mathbb{R}^p \to \mathbb{R}^q$ for some $q$ (possibly infinite) if and only if $K$ is positive definite, i.e., $K(x, x') = K(x', x)$ for any $x, x'$, and
$$\sum_{i=1}^n \sum_{j=1}^n a_i a_j K(x_i, x_j) \ge 0$$
for any $n$, $a$ and $x$.
Examples:
Linear: $K(x, x') = x^\top x'$
Polynomial: $K(x, x') = \left(x^\top x'\right)^d$
Gaussian: $K(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$
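For reference, the three example kernels as R functions on two input vectors (the default degree d and width sigma are arbitrary choices):

# The three example kernels, for two input vectors x and xp
k_linear <- function(x, xp) sum(x * xp)
k_poly   <- function(x, xp, d = 3) sum(x * xp)^d
k_gauss  <- function(x, xp, sigma = 1) exp(-sum((x - xp)^2) / (2 * sigma^2))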