Machine Learning / Statistical Learning:
Introduction and Cross Validation
Ana Karina Fermin
winter 2020-2021
Outline
1 Introduction
2 Supervised Learning
3 Error Estimation
4 Cross Validation and Weights
5 References
Introduction
Machine Learning
Source:azavea
A definition by Tom Mitchell
(http://www.cs.cmu.edu/~tom/)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Introduction
Credit Score, Bank Risk, ...
Source:A.Fermin
Credit Score
Task: Prediction (default or no default)
Data: Client profile, client credit history...
Performance measure: error rate.
Introduction
News Clustering
Source:theverge.com
A news clustering algorithm:
Task: group articles corresponding to the same news
Performance: quality of the clusters
Experience: set of articles
Introduction
Supervised and Unsupervised
Source:KDnuggets
Supervised Learning (Imitation)
Goal: Learn a function $f$ predicting a variable $Y$ from an individual $X$.
Data: Learning set with labeled examples $(X_i, Y_i)$.
Assumption: Future data behaves as past data!
Predicting is not explaining!
Unsupervised Learning (Structure Discovery)
Goal: Discover a structure within a set of individuals $(X_i)$.
Data: Learning set with unlabeled examples $(X_i)$.
Unsupervised learning is not a well-posed setting...
Introduction
A Robot that Learns
Source:RutgersPrepSchool
A robot endowed with a set of sensors playing football:
Task: play football
Performance: score evolution
Experience: current environment and outcome, past games
Introduction
Three Kinds of Learning
Source:BCG
Unsupervised Learning
Task: Clustering/DR
Performance: Quality
Experience: Raw dataset (no ground truth)
Supervised Learning
Task: Prediction
Performance: Average error
Experience: Predictions (ground truth)
Reinforcement Learning
Task: Action
Performance: Total reward
Experience: Reward from the environment (interaction with the environment)
Introduction
Machine Learning
(Unsupervised/Supervised)
Source:scikit-learn.org
ML Methods
Huge catalog of methods.
Need to define the performance measure and the feature design.
Here, we will only cover supervised learning.
Introduction
Credit Score, Bank Risk, ...
Source:A.Fermin
Credit Score
Task: Prediction (default or no default)
Data: Client profile, client credit history...
Performance measure: error rate.
Introduction
Spam detection (Text classification)
Source:http://www.jesperdeleuran.dk/
Spam detection
Task: Prediction (spam or no spam)
Data: email collection
Performance measure: error rate.
Introduction
Detection
Source:A.Fermin
Face detection
Task: Detect the position of faces in an image. A different setting?
Reformulation as a supervised learning problem.
Goal: Detect the presence of faces at several positions and scales.
Data: X = sub-image / Y = presence or absence of a face...
Performance measure: error rate.
Lots of detections in an image: post-processing required...
Performance measure: box precision.
Introduction
Number
Source:Y.LeCun
Reading a ZIP code on an envelope
Task: give a number from an image.
Data: X = image /Y = corresponding number.
Performance measure: error rate.
Introduction
Eucalyptus
[Figure: scatterplot of height (ht) versus circumference (circ)]
Height estimation
Simple (and classical) dataset.
Task: predict the height from circumference.
Data: X = circumference / Y = height.
Performance measure: mean squared error.
Introduction
Under and Over Fitting
Source:geeksforgeeks.com
Finding the Right Complexity
What is best?
A simple model that is stable but false? (oversimplification)
A very complex model that could be correct but is unstable? (conspiracy theory)
Neither of them: a tradeoff that depends on the dataset.
Introduction
ML Pipeline
Source:CDiscount
Learning pipeline
Test and compare models.
Deployment pipeline is different!
Introduction
IFMA/IMP Bloc 3 ML - Goal
Goal
Know the inner mechanisms of the most classical supervised ML methods (Logistic regression, SVM, Neural Nets and Trees) in order to understand their strengths and limitations.
Understand some optimization tools used in ML.
Evaluation
A project (homework) with R.
Introduction
Schedule
IFMA/IMP Bloc 3 ML - 5 Lectures (09h00-12h15)
Fri. 15/01: Statistical Learning: Introduction and Cross Validation
Tue. 19/01: ML Methods: Probabilistic Point of View
Tue. 26/01: ML Methods: Optimization Point of View
Tue. 02/02: ML Methods: SVM
Tue. 09/02: ML Methods: Trees and Ensemble Methods, Neural Networks, etc.
12/03: Homework (project with 2-3 students)
For this homework:
You will have to build a good predictor from a dataset that I will give you.
The goal is not necessarily to obtain the best performance but to perform the work of a good data scientist.
You need to describe the task and the dataset using descriptive statistics and graphics.
You should explain how you have obtained your best predictor, both in terms of strategy and of error estimation.
You are expected to describe the strengths and the limitations of your approach and to propose some possible enhancements.
You may use R or Python. If you use a notebook, please provide a compiled version.
The report should consist of around 20-30 pages and is much more than a code listing.
Originality of the work will be taken into account and any plagiarism will be sanctioned.
Introduction
References
T. Hastie, R. Tibshirani, and J. Friedman.
The Elements of Statistical Learning.
Springer Series in Statistics, 2009
M. Mohri, A. Rostamizadeh, and A. Talwalkar.
Foundations of Machine Learning.
MIT Press, 2012
A. Géron.
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd ed.)
O’Reilly, 2019
Outline
1 Introduction
2 Supervised Learning
3 Error Estimation
4 Cross Validation and Weights
5 References
Supervised Learning
Supervised Learning
Supervised Learning Framework
Input measurement $X \in \mathcal{X}$, output measurement $Y \in \mathcal{Y}$, with $(X,Y) \sim \mathbf{P}$ and $\mathbf{P}$ unknown.
Training data: $\mathcal{D}_n = \{(X_1,Y_1), \ldots, (X_n,Y_n)\}$ (i.i.d. $\sim \mathbf{P}$).
Often $X \in \mathbb{R}^d$ and $Y \in \{-1,1\}$ (classification), or $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ (regression).
A predictor is a function in $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y} \text{ measurable}\}$.
Goal
Construct a good predictor $\hat{f}$ from the training data.
Need to specify the meaning of good.
Classification and regression are almost the same problem!
Supervised Learning
Loss and Probabilistic Framework
Loss function for a generic predictor
Loss function: $\ell(Y, f(X))$ measures the goodness of the prediction of $Y$ by $f(X)$.
Examples:
Prediction loss: $\ell(Y, f(X)) = \mathbf{1}_{Y \neq f(X)}$
Quadratic loss: $\ell(Y, f(X)) = |Y - f(X)|^2$
Risk function
Risk measured as the average loss for a new couple:
$$R(f) = \mathbb{E}_{(X,Y)\sim \mathbf{P}}[\ell(Y, f(X))]$$
Examples:
Prediction loss: $\mathbb{E}[\ell(Y, f(X))] = \mathbf{P}(Y \neq f(X))$
Quadratic loss: $\mathbb{E}[\ell(Y, f(X))] = \mathbb{E}\big[|Y - f(X)|^2\big]$
Beware: as $\hat{f}$ depends on $\mathcal{D}_n$, $R(\hat{f})$ is a random variable!
Supervised Learning
Best Solution
The best solution $f^*$ (which is independent of $\mathcal{D}_n$) is
$$f^* = \arg\min_{f \in \mathcal{F}} R(f) = \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell(Y,f(X))] = \arg\min_{f \in \mathcal{F}} \mathbb{E}_X\big[\mathbb{E}_{Y|X}[\ell(Y,f(X))]\big]$$
Bayes Predictor (explicit solution)
In binary classification with the $0-1$ loss:
$$f^*(X) = \begin{cases} +1 & \text{if } \mathbf{P}(Y=+1|X) \geq \mathbf{P}(Y=-1|X) \iff \mathbf{P}(Y=+1|X) \geq 1/2 \\ -1 & \text{otherwise} \end{cases}$$
In regression with the quadratic loss:
$$f^*(X) = \mathbb{E}[Y|X]$$
Issue: the solution requires knowing $\mathbb{E}[Y|X]$ for all values of $X$!
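To make the plug-in idea concrete, here is a minimal R sketch (an illustration, not taken from the slides) on simulated data where $\mathbf{P}(Y=+1|X)$ is chosen by us, so the Bayes predictor and its risk are explicit; the logistic form of the conditional law is only an assumption for the example.

## Simulated example: P(Y = +1 | X = x) is known, so the Bayes predictor is explicit.
set.seed(42)
n <- 10000
x <- runif(n, -3, 3)
p_plus <- 1 / (1 + exp(-2 * x))           # assumed conditional law P(Y = +1 | X = x)
y <- ifelse(runif(n) < p_plus, +1, -1)    # labels drawn from this law

## Bayes predictor for the 0-1 loss: predict +1 iff P(Y = +1 | X) >= 1/2
f_star <- ifelse(p_plus >= 0.5, +1, -1)
mean(y != f_star)                         # empirical estimate of the Bayes risk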
Supervised Learning
Goal
Machine Learning
Learn a rule to construct a predictor $\hat{f} \in \mathcal{F}$ from the training data $\mathcal{D}_n$ such that the risk $R(\hat{f})$ is small on average or with high probability with respect to $\mathcal{D}_n$.
In practice, the rule should be an algorithm!
Canonical example: Empirical Risk Minimizer
One restricts $f$ to a subset of functions $\mathcal{S} = \{f_\theta,\ \theta \in \Theta\}$.
One replaces the minimization of the average loss by the minimization of the empirical loss:
$$\hat{f} = f_{\hat{\theta}} = \arg\min_{f_\theta,\, \theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$$
Examples:
Linear regression
Linear discrimination with $\mathcal{S} = \{x \mapsto \mathrm{sign}\{x^\top\beta + \beta^{(0)}\} \ /\ \beta \in \mathbb{R}^d,\ \beta^{(0)} \in \mathbb{R}\}$
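As an illustration (not from the slides), the following R sketch performs empirical risk minimization with the quadratic loss over the linear class $f_\theta(x) = \theta_1 + \theta_2 x$ on simulated data, and checks that a generic minimizer of the empirical risk recovers the least-squares solution.

## ERM with the quadratic loss over the class f_theta(x) = theta1 + theta2 * x.
set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.3)

emp_risk <- function(theta) mean((y - (theta[1] + theta[2] * x))^2)
fit <- optim(c(0, 0), emp_risk)   # generic minimizer of the empirical risk
fit$par                           # close to the closed-form ERM given by least squares
coef(lm(y ~ x))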
Supervised Learning
Eucalyptus
[Figure: scatterplot of height (ht) versus circumference (circ)]
Height estimation
Simple (and classical) dataset.
Task: predict the height from circumference.
Data: X = circumference / Y = height.
Performance measure: mean squared error.
Supervised Learning
Eucalyptus
[Figure: scatterplot of height (ht) versus circumference (circ)]
Dataset - P.A. Cornillon
Real dataset of 1429 eucalyptus trees obtained by P.A. Cornillon:
X: circumference / Y: height
Can we predict the height from the circumference?
By a line? By a more complex formula?
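A possible R sketch for these two fits, assuming the data is available as a data frame named eucalyptus with columns circ and ht (the actual object name may differ):

## A line versus a more complex formula, compared by training mean squared error.
fit_line  <- lm(ht ~ circ, data = eucalyptus)               # a line
fit_curve <- lm(ht ~ circ + sqrt(circ), data = eucalyptus)  # a more complex formula
mean(residuals(fit_line)^2)
mean(residuals(fit_curve)^2)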
Supervised Learning
Under-fitting / Over-fitting Issue
Source:A.Ng
Model Complexity Dilemma
What is best: a simple or a complex model?
Too simple to be good? Too complex to be learned?
Supervised Learning
Under-fitting / Over-fitting Issue
Source:Unknown
Under-fitting / Over-fitting
Under-fitting: simple models are too simple.
Over-fitting: complex models are too specific to the training set.
Supervised Learning
Bias-Variance Dilemma
General setting:
$\mathcal{F} = \{\text{measurable functions } \mathcal{X} \to \mathcal{Y}\}$
Best solution: $f^* = \arg\min_{f \in \mathcal{F}} R(f)$
Class $\mathcal{S} \subset \mathcal{F}$ of functions
Ideal target in $\mathcal{S}$: $f^*_{\mathcal{S}} = \arg\min_{f \in \mathcal{S}} R(f)$
Estimate in $\mathcal{S}$: $\hat{f}_{\mathcal{S}}$ obtained with some procedure
Approximation error and estimation error (Bias/Variance):
$$R(\hat{f}_{\mathcal{S}}) - R(f^*) = \underbrace{R(f^*_{\mathcal{S}}) - R(f^*)}_{\text{Approximation error}} + \underbrace{R(\hat{f}_{\mathcal{S}}) - R(f^*_{\mathcal{S}})}_{\text{Estimation error}}$$
The approximation error can be large if the model $\mathcal{S}$ is not suitable.
The estimation error can be large if the model is complex.
Agnostic approach
No assumption (so far) on the law of $(X,Y)$.
Supervised Learning
Under-fitting / Over-fitting Issue
Source:Unknown
Different behavior for different model complexity:
Low complexity models are easily learned, but the approximation error (bias) may be large (under-fit).
High complexity models may contain a good ideal target, but the estimation error (variance) can be large (over-fit).
Bias-variance trade-off $\iff$ avoid overfitting and underfitting.
Rk: better to think in terms of method (including feature engineering and the specific algorithm) rather than only of model.
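The following R sketch (simulated data, illustrative only) shows the typical pattern: the training error always decreases with the polynomial degree, while the error on fresh data eventually increases.

## Training error versus error on new data as the polynomial degree grows.
set.seed(3)
n <- 100
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
x_new <- runif(1000); y_new <- sin(2 * pi * x_new) + rnorm(1000, sd = 0.3)

for (d in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, d))
  err_train <- mean((y - fitted(fit))^2)
  err_new   <- mean((y_new - predict(fit, newdata = data.frame(x = x_new)))^2)
  cat(sprintf("degree %2d: train %.3f  new %.3f\n", d, err_train, err_new))
}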
Supervised Learning
Theoretical Analysis
Statistical Learning Analysis
Error decomposition:
$$R(\hat{f}_{\mathcal{S}}) - R(f^\star) = \underbrace{R(f^\star_{\mathcal{S}}) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat{f}_{\mathcal{S}}) - R(f^\star_{\mathcal{S}})}_{\text{Estimation error}}$$
Bound on the approximation term: approximation theory.
Probabilistic bound on the estimation term: probability theory!
Goal: agnostic bounds, i.e. bounds that do not require assumptions on $\mathbf{P}$! (Statistical Learning?)
Often need mild assumptions on $\mathbf{P}$... (Nonparametric Statistics?)
Supervised Learning
Binary Classification Loss Issue
Empirical Risk Minimizer
$$\hat{f} = \arg\min_{f \in \mathcal{S}} \frac{1}{n} \sum_{i=1}^n \ell^{0/1}(Y_i, f(X_i))$$
Classification loss: $\ell^{0/1}(y, f(x)) = \mathbf{1}_{y \neq f(x)}$
Not convex and not smooth!
Supervised Learning
Probabilistic Point of View Ideal Solution and Estimation
Source:A.Fermin
The best solution $f^*$ (which is independent of $\mathcal{D}_n$) is
$$f^* = \arg\min_{f \in \mathcal{F}} R(f) = \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell(Y,f(X))] = \arg\min_{f \in \mathcal{F}} \mathbb{E}_X\big[\mathbb{E}_{Y|X}[\ell(Y,f(X))]\big]$$
Bayes Predictor (explicit solution)
In binary classification with the $0-1$ loss:
$$f^*(X) = \begin{cases} +1 & \text{if } \mathbf{P}(Y=+1|X) \geq \mathbf{P}(Y=-1|X) \\ -1 & \text{otherwise} \end{cases}$$
Issue: the solution requires knowing $\mathbb{E}[Y|X]$ for all values of $X$!
Solution: replace it by an estimate.
Supervised Learning
Optimization Point of View Loss Convexification
Minimizer of the risk
$$\hat{f} = \arg\min_{f \in \mathcal{S}} \frac{1}{n} \sum_{i=1}^n \ell^{0/1}(Y_i, f(X_i))$$
Issue: the classification loss is not convex or smooth.
Solution: replace it by a convex majorant.
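As an illustration, a short R sketch comparing the $0/1$ loss, seen as a function of the margin $y\,f(x)$, with two classical convex majorants (the hinge loss used by the SVM and a rescaled logistic loss):

## 0/1 loss and two convex majorants as functions of the margin y * f(x).
margin     <- seq(-3, 3, length.out = 200)
loss_01    <- as.numeric(margin < 0)            # 0/1 loss (up to the tie at 0)
loss_hinge <- pmax(0, 1 - margin)               # hinge loss (SVM)
loss_logit <- log(1 + exp(-margin)) / log(2)    # rescaled logistic loss, majorant of 0/1

matplot(margin, cbind(loss_01, loss_hinge, loss_logit), type = "l", lty = 1,
        xlab = "margin y f(x)", ylab = "loss")
legend("topright", c("0/1", "hinge", "logistic"), col = 1:3, lty = 1)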
Supervised Learning
Probabilistic and Optimization Framework
How to find a good function $f$ with a small risk $R(f) = \mathbb{E}[\ell(Y, f(X))]$?
Canonical approach: $\hat{f}_{\mathcal{S}} = \arg\min_{f \in \mathcal{S}} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$
Problems
How to choose $\mathcal{S}$?
How to compute the minimization?
A Probabilistic Point of View
Solution: for $X$, estimate the law of $Y|X$ and plug this estimate into the Bayes classifier: (Generalized) Linear Models, Kernel methods, k-NN, Naive Bayes, Trees, Bagging...
An Optimization Point of View
Solution: if necessary, replace the loss $\ell$ by an upper bound $\ell'$ and minimize the empirical loss: SVR, SVM, Neural Networks, Trees, Boosting...
Outline
1 Introduction
2 Supervised Learning
3 Error Estimation
4 Cross Validation and Weights
5 References
Error Estimation
Example: TwoClass Dataset
Synthetic Dataset
Two features/covariates.
Two classes.
Dataset from Applied Predictive Modeling, M. Kuhn and K. Johnson, Springer
Numerical experiments with R and the caret package.
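A minimal R sketch to load this example, assuming the twoClassData objects (predictors and classes) shipped with the AppliedPredictiveModeling package; the snippets below reuse these two objects.

## Two predictors, two classes: the example dataset used in the experiments.
library(AppliedPredictiveModeling)
library(caret)
data(twoClassData)    # provides `predictors` (2 columns) and `classes` (a factor)
str(predictors)
table(classes)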
Error Estimation
Example: Linear Discrimination
Error Estimation
Example: More Complex Model
Error Estimation
Example: KNN
Error Estimation
Training Error Issue
Source:JMP
Error behaviour
Learning/training error (error made on the learning/training set) decays when the complexity of the method increases.
Quite different behavior when the error is computed on new observations (generalization error).
Overfit for complex methods: parameters learned are too specific to the learning set!
General situation! (Think of polynomial fit...)
Need to use a different criterion than the training error!
Error Estimation
Error Estimation vs Method Selection
Predictor Error Estimation
Goal: Given a predictor $f$, assess its quality.
Method: Hold-out error computation (/ error correction).
Usage: Compute an estimate of the error of a selected $f$ using a test set, to be used to monitor it in the future.
Basic block, very well understood.
Method Selection
Goal: Given an ML method, assess its quality.
Method: Cross Validation (/ error correction).
Usage: Compute error estimates for several ML methods using training/validation sets to choose the most promising one.
Estimates can be pointwise or, better, intervals.
Multiple testing issues in method selection.
Error Estimation
Cross Validation and Error Correction
Two Approaches
Cross validation: very efficient (and almost always used in practice!) but slightly biased, as its target is the error of a predictor trained on only a fraction of the data.
Correction approach: use the empirical loss criterion but correct it with a term increasing with the complexity of $\mathcal{S}$,
$$R_n(\hat{f}_{\mathcal{S}}) \to R_n(\hat{f}_{\mathcal{S}}) + \mathrm{cor}(\mathcal{S})$$
and choose the method with the smallest corrected risk.
Which loss to use?
The loss used in the risk: most natural!
The loss used to estimate $\hat{\theta}$: penalized estimation!
Error Estimation
Cross Validation
Source:M.Kühn
Very simple idea: use a second learning/verification set to compute a verification error.
Sufficient to remove the dependency issue!
Implicit random design setting...
Cross Validation
Use $(1-\epsilon) \times n$ observations to train and $\epsilon \times n$ to verify!
Possible issues:
Validation for a learning set of size $(1-\epsilon) \times n$ instead of $n$?
Unstable error estimate if $\epsilon n$ is too small?
Most classical variations:
Hold Out, Leave One Out, V-fold cross validation.
Error Estimation
Hold Out
Principle
Split the dataset $\mathcal{D}$ in two sets $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ of size $n\times(1-\epsilon)$ and $n\times\epsilon$.
Learn $\hat{f}^{HO}$ from the subset $\mathcal{D}_{train}$.
Compute the empirical error on the subset $\mathcal{D}_{test}$:
$$R_n^{HO}(\hat{f}^{HO}) = \frac{1}{|\mathcal{D}_{test}|} \sum_{(X_i,Y_i)\in\mathcal{D}_{test}} \ell(Y_i, \hat{f}^{HO}(X_i))$$
Predictor Error Estimation
Use $\hat{f}^{HO}$ as predictor.
Use $R_n^{HO}(\hat{f}^{HO})$ as an estimate of the error of this predictor.
Method Selection by Cross Validation
Compute $R_n^{HO}(\hat{f}_{\mathcal{S}}^{HO})$ for all the considered methods,
select the method with the smallest CV error,
re-estimate the $\hat{f}_{\mathcal{S}}$ with all the data.
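A possible R sketch of this hold-out principle on the twoClassData example above, using caret::createDataPartition for the split and a plain k-NN classifier (the choice k = 5 is arbitrary here):

## Hold out: learn on D_train, estimate the error on D_test.
library(caret)
library(class)   # for the basic knn() classifier
set.seed(17)
in_train <- createDataPartition(classes, p = 0.8, list = FALSE)   # (1 - eps) = 0.8
train_x <- predictors[in_train, ]; train_y <- classes[in_train]
test_x  <- predictors[-in_train, ]; test_y  <- classes[-in_train]

pred_ho <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
mean(pred_ho != test_y)   # hold-out estimate of the prediction error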
Error Estimation
Hold Out
Only possible setting for error estimation.
Hold Out Limitation for Method Selection
Biased toward simpler methods as the estimation does not initially use all the data.
Learning variability of $R_n^{HO}(\hat{f}^{HO})$ not taken into account.
Error Estimation
V -fold Cross Validation
Source:M.Kühn
Principle
Split the dataset $\mathcal{D}$ in $V$ sets $\mathcal{D}_v$ of almost equal size.
For $v \in \{1, \ldots, V\}$:
Learn $\hat{f}^{-v}$ from the dataset $\mathcal{D}$ minus the set $\mathcal{D}_v$.
Compute the empirical error:
$$R_n^{-v}(\hat{f}^{-v}) = \frac{1}{n_v} \sum_{(X_i,Y_i)\in\mathcal{D}_v} \ell(Y_i, \hat{f}^{-v}(X_i))$$
Compute the average empirical error:
$$R_n^{CV}(\hat{f}) = \frac{1}{V} \sum_{v=1}^V R_n^{-v}(\hat{f}^{-v})$$
Estimation of the quality of a method, not of a given predictor.
Leave One Out: $V = n$.
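A minimal R sketch of V-fold cross validation written by hand on the same example, with V = 10 folds and k-NN (k = 5) as the method under evaluation:

## V-fold CV: average the empirical errors computed on the held-out folds D_v.
library(caret)
library(class)
set.seed(23)
V <- 10
folds <- createFolds(classes, k = V)   # list of V held-out index sets D_v
cv_errors <- sapply(folds, function(test_idx) {
  pred <- knn(train = predictors[-test_idx, ], test = predictors[test_idx, ],
              cl = classes[-test_idx], k = 5)
  mean(pred != classes[test_idx])      # empirical error on D_v
})
mean(cv_errors)                        # V-fold CV estimate of the risk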
Error Estimation
V -fold Cross Validation
Analysis (when $n$ is a multiple of $V$)
The $R_n^{-v}(\hat{f}^{-v})$ are identically distributed variables, but they are not independent!
Consequence:
$$\mathbb{E}\big[R_n^{CV}(\hat{f})\big] = \mathbb{E}\big[R_n^{-v}(\hat{f}^{-v})\big]$$
$$\mathrm{Var}\big[R_n^{CV}(\hat{f})\big] = \frac{1}{V}\mathrm{Var}\big[R_n^{-v}(\hat{f}^{-v})\big] + \Big(1 - \frac{1}{V}\Big)\mathrm{Cov}\big[R_n^{-v}(\hat{f}^{-v}),\, R_n^{-v'}(\hat{f}^{-v'})\big]$$
Average risk for a sample of size $(1-\frac{1}{V})\,n$.
The variance term is much more complex to analyze!
A fine analysis shows that the larger $V$, the better...
Accuracy/speed tradeoff: $V = 5$ or $V = 10$!
Error Estimation
Cross Validation
Error Estimation
Example: KNN ($\hat{k} = 61$ chosen by cross-validation)
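A sketch of how such a value of $k$ could be obtained with caret: 10-fold cross validation over a grid of candidate $k$ values (the grid below is an assumption, not taken from the slides).

## Select k by 10-fold CV with caret.
library(caret)
set.seed(31)
ctrl <- trainControl(method = "cv", number = 10)
knn_cv <- train(x = predictors, y = classes, method = "knn",
                trControl = ctrl, tuneGrid = data.frame(k = seq(1, 101, by = 2)))
knn_cv$bestTune   # k selected by cross validation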
Error Estimation
Train/Validation/Test
Source:Shan-HungWu&DataLab
Selection Bias Issue
After method selection, the cross validation error estimate is biased.
Furthermore, it qualifies the method and not the final predictor.
Need to (re)estimate the error of the final predictor.
(Train/Validation)/Test strategy
Split the dataset in two: (Train/Validation) and Test.
Use CV on (Train/Validation) to select a method.
Train this method on (Train/Validation) to obtain a single predictor.
Estimate the performance of this predictor on Test.
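A possible R sketch of this strategy on the same example (the split proportion and the grid of $k$ values are arbitrary choices):

## (Train/Validation)/Test: CV-based selection, a final refit, then one test estimate.
library(caret)
set.seed(47)
in_tv  <- createDataPartition(classes, p = 0.8, list = FALSE)   # (Train/Validation) vs Test
x_tv   <- predictors[in_tv, ];  y_tv   <- classes[in_tv]
x_test <- predictors[-in_tv, ]; y_test <- classes[-in_tv]

## train() selects k by CV on (Train/Validation) only, then refits the selected
## method on all of (Train/Validation): this gives the final predictor.
final <- train(x = x_tv, y = y_tv, method = "knn",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = data.frame(k = seq(1, 101, by = 2)))
mean(predict(final, newdata = x_test) != y_test)   # error estimate on Test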
Error Estimation
Error Correction
The empirical loss of an estimator computed on the dataset used to choose it is biased!
The empirical loss is an optimistic estimate of the true loss.
Risk Correction Heuristic
Estimate an upper bound of this optimism for a given family.
Correct the empirical loss by adding this upper bound.
Rk: Finding such an upper bound can be complicated!
Correction often called a penalty.
Error Estimation
Penalization
Penalized Loss
Minimization of
$$\arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i)) + \mathrm{pen}(\theta)$$
where $\mathrm{pen}(\theta)$ is an error correction (penalty).
Penalties
Upper bound of the optimism of the empirical loss.
Depends on the loss and the framework!
Instantiation
Mallows' $C_p$: least squares with $\mathrm{pen}(\theta) = 2\frac{d}{n}\sigma^2$.
AIC heuristics: maximum likelihood with $\mathrm{pen}(\theta) = \frac{d}{n}$.
BIC heuristics: maximum likelihood with $\mathrm{pen}(\theta) = \log(n)\frac{d}{n}$.
Structural Risk Minimization: prediction loss and clever penalty.
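As an illustration of the correction idea, R's AIC() and BIC() compute the corresponding penalized maximum-likelihood criteria (on the $-2\log$-likelihood scale, so up to scaling and constants with respect to the $\mathrm{pen}(\theta)$ written above); a sketch on nested polynomial models:

## Compare models of increasing complexity through corrected (penalized) criteria.
set.seed(5)
n <- 200
x <- runif(n); y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fits <- lapply(1:8, function(d) lm(y ~ poly(x, d)))   # increasing complexity
data.frame(degree = 1:8,
           AIC = sapply(fits, AIC),
           BIC = sapply(fits, BIC))   # pick the model with the smallest corrected criterion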
Outline
1 Introduction
2 Supervised Learning
3 Error Estimation
4 Cross Validation and Weights
5 References
Cross Validation and Weights
Unbalanced and Rebalanced Dataset
Source:UniversityofGranada
Unbalanced Class
Setting: One of the classes is much more present than the other.
Issue: The classifier is too attracted by the majority class!
Rebalanced Dataset
Setting: Class proportions are different in the training and testing sets (stratified sampling).
Issue: Training errors are not estimates of testing errors.
Cross Validation and Weights
Resampling Strategies
Source:Oracle
Resampling
Modify the training dataset so that the classes are more balanced.
Issue: The training data is no longer representative of the testing data.
Hard to do it right!
Cross Validation and Weights
Resampling Effect
Testing
Testing class prob.: $\pi_t(k)$. Testing error target:
$$\mathbb{E}_{\pi_t}[\ell(Y,f(X))] = \sum_k \pi_t(k)\, \mathbb{E}[\ell(Y,f(X)) \mid Y=k]$$
Training
Training class prob.: $\pi_{tr}(k)$. Training error target:
$$\mathbb{E}_{\pi_{tr}}[\ell(Y,f(X))] = \sum_k \pi_{tr}(k)\, \mathbb{E}[\ell(Y,f(X)) \mid Y=k]$$
Implicit Testing Error Using the Training One
Amounts to using a weighted loss:
$$\mathbb{E}_{\pi_{tr}}[\ell(Y,f(X))] = \sum_k \pi_{tr}(k)\, \mathbb{E}[\ell(Y,f(X)) \mid Y=k] = \sum_k \pi_t(k)\, \mathbb{E}\Big[\tfrac{\pi_{tr}(k)}{\pi_t(k)}\, \ell(Y,f(X)) \,\Big|\, Y=k\Big] = \mathbb{E}_{\pi_t}\Big[\tfrac{\pi_{tr}(Y)}{\pi_t(Y)}\, \ell(Y,f(X))\Big]$$
Put more weight on less probable classes!
Cross Validation and Weights
Weighted Loss
In unbalanced situations, the cost of misprediction is often not the same for all classes (e.g. medical diagnosis, credit lending...).
Much better to use this explicitly than to do blind resampling!
Weighted Loss
Weighted loss:
$$\ell(Y,f(X)) \longrightarrow C(Y)\,\ell(Y,f(X))$$
Weighted error target:
$$\mathbb{E}[C(Y)\,\ell(Y,f(X))]$$
Rk: strong link with $\ell$ as $C$ is independent of $f$. Often allows reusing an algorithm constructed for $\ell$.
$C$ may also depend on $X$...
Cross Validation and Weights
Weighted Loss, $\ell^{0/1}$ Loss and Bayes Classifier
The Bayes classifier is now:
$$f^\star = \arg\min_f \mathbb{E}[C(Y)\,\ell(Y,f(X))] = \arg\min_f \mathbb{E}_X\big[\mathbb{E}_{Y|X}[C(Y)\,\ell(Y,f(X))]\big]$$
Bayes Predictor
For the $\ell^{0/1}$ loss,
$$f^\star(X) = \arg\max_k C(k)\, \mathbf{P}(Y=k \mid X)$$
Same effect as a threshold modification in the binary setting!
Allows putting more emphasis on some classes than others.
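A minimal sketch of this threshold modification in the binary case, with made-up costs and made-up estimates of the conditional probabilities:

## Predict +1 as soon as C(+1) P(Y=+1|X) >= C(-1) P(Y=-1|X),
## i.e. as soon as P(Y=+1|X) >= C(-1) / (C(+1) + C(-1)).
cost_plus  <- 5    # assumed cost of mispredicting a true "+1" (e.g. a missed default)
cost_minus <- 1    # assumed cost of mispredicting a true "-1" (e.g. a false alarm)
p_plus <- c(0.10, 0.25, 0.60, 0.90)   # estimated P(Y = +1 | X) for four individuals

threshold <- cost_minus / (cost_plus + cost_minus)   # 1/6 here instead of 1/2
ifelse(p_plus >= threshold, +1, -1)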
Cross Validation and Weights
Linking Weights and Proportions
Cost and Proportions
Testing error target:
$$\mathbb{E}_{\pi_t}[C_t(Y)\,\ell(Y,f(X))] = \sum_k \pi_t(k)\, C_t(k)\, \mathbb{E}[\ell(Y,f(X)) \mid Y=k]$$
Training error target:
$$\mathbb{E}_{\pi_{tr}}[C_{tr}(Y)\,\ell(Y,f(X))] = \sum_k \pi_{tr}(k)\, C_{tr}(k)\, \mathbb{E}[\ell(Y,f(X)) \mid Y=k]$$
They coincide if
$$\pi_t(k)\, C_t(k) = \pi_{tr}(k)\, C_{tr}(k)$$
Lots of flexibility in the choice of $C_t$, $C_{tr}$ or $\pi_{tr}$!
Cross Validation and Weights
Combining Weights and Resampling
Weighted Loss and Resampling
Weighted loss: choice of a weight $C_t \neq 1$.
Resampling: use a $\pi_{tr} \neq \pi_t$.
Stratified sampling may be used to reduce the size of a dataset without losing a low probability class!
Combining Weights and Resampling
Weighted loss: use $C_{tr} = C_t$ as $\pi_{tr} = \pi_t$.
Resampling: use an implicit $C_t(k) = \pi_{tr}(k)/\pi_t(k)$.
Combined: use $C_{tr}(k) = C_t(k)\,\pi_t(k)/\pi_{tr}(k)$.
Most ML methods allow such weights!
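A minimal sketch of the combined rule $C_{tr}(k) = C_t(k)\,\pi_t(k)/\pi_{tr}(k)$, with made-up proportions and costs; the resulting per-class weights can then be expanded into case weights and passed to the weights argument that most ML methods offer:

## From testing costs/proportions to per-class training weights.
pi_t  <- c(pos = 0.05, neg = 0.95)   # assumed class proportions in the test population
pi_tr <- c(pos = 0.50, neg = 0.50)   # proportions in a rebalanced training set
C_t   <- c(pos = 10,   neg = 1)      # assumed misprediction costs at test time

C_tr <- C_t * pi_t / pi_tr           # per-class training weights
C_tr

## With a training data frame `train_df` having a class column `y` (hypothetical):
## case_weights <- C_tr[as.character(train_df$y)]
## then pass `case_weights` through the method's weights argument.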
Outline
1 Introduction
2 Supervised Learning
3 Error Estimation
4 Cross Validation and Weights
5 References
References
References
T. Hastie, R. Tibshirani, and J. Friedman.
The Elements of Statistical Learning.
Springer Series in Statistics, 2009
M. Mohri, A. Rostamizadeh, and A. Talwalkar.
Foundations of Machine Learning.
MIT Press, 2012
A. Géron.
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd ed.)
O’Reilly, 2019
References
Licence and Contributors
Creative Commons Attribution-ShareAlike (CC BY-SA 4.0)
You are free to:
Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Contributors
S. Boucheron, A.K. Fermin, S. Gadat, S. Gaiffas, A. Guilloux, Ch. Keribin, E. Le Pennec, E. Matzner, E. Scornet and X Exed team.