Machine Learning

Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr - http://www.nathalievilla.org

IUT STID (Carcassonne) & SAMM (Université Paris 1)

Formation INRA, Niveau 3

Outline

1 Introduction
  Background and notations
  Underfitting / Overfitting
  Errors
  Use case
2 Neural networks
  Overview
  Multilayer perceptron
  Learning/Tuning
3 CART
  Introduction
  Learning
  Prediction
4 Random forest
  Overview
  Bootstrap/Bagging
  Random forest

Introduction: Background and notations

Background

Purpose: predict Y from X.
What we have: n observations of (X, Y): (x_1, y_1), ..., (x_n, y_n).
What we want: estimate the unknown Y for new observations of X: x_{n+1}, ..., x_m.

X can be:
- numeric variables;
- or factors;
- or a combination of numeric variables and factors.

Y can be:
- a numeric variable (Y ∈ R) ⇒ (supervised) regression (régression);
- a factor ⇒ (supervised) classification (discrimination).

Introduction: Underfitting / Overfitting

Basics

From (x_i, y_i)_i, definition of a machine Φ_n such that:

    ŷ_new = Φ_n(x_new).

- if Y is numeric, Φ_n is called a regression function (fonction de régression);
- if Y is a factor, Φ_n is called a classifier (classifieur);
- Φ_n is said to be trained or learned from the observations (x_i, y_i)_i.

Desirable properties:
- accuracy to the observations: predictions made on known data are close to the observed values;
- generalization ability: predictions made on new data are also accurate.

Conflicting objectives!

Introduction: Underfitting / Overfitting

Underfitting / Overfitting (sous/sur-apprentissage)

[Figure sequence, images not reproduced here:
- the function x → y to be estimated;
- the observations we might have;
- the observations we do have;
- a first estimation from the observations: underfitting;
- a second estimation from the observations: an accurate estimation;
- a third estimation from the observations: overfitting;
- summary.]

Introduction: Errors

Errors

Training error (measures the accuracy to the observations):
- if Y is a factor: misclassification rate

      #{i = 1, ..., n : ŷ_i ≠ y_i} / n

- if Y is numeric: mean squared error (MSE)

      MSE = (1/n) Σ_{i=1}^{n} (ŷ_i - y_i)^2,

  or root mean squared error (RMSE = √MSE), or pseudo-R²: 1 - MSE / Var((y_i)_i).

Test error: a way to prevent overfitting (it estimates the generalization error):
1 split the data into training/test sets (usually 80%/20%);
2 train Φ_n on the training dataset;
3 compute the test error on the remaining data (simple validation).
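As an illustration (not part of the original slides), these error measures and a simple validation split can be written in a few lines of Python/NumPy; the arrays y and y_hat below are assumed to hold observed and predicted values.

    # Minimal sketch (assumed setup): training errors and a simple 80%/20% split.
    import numpy as np

    def misclassification_rate(y, y_hat):
        # #{i : y_hat_i != y_i} / n, for a factor Y
        return np.mean(np.asarray(y) != np.asarray(y_hat))

    def mse(y, y_hat):
        # (1/n) * sum_i (y_hat_i - y_i)^2, for a numeric Y
        return np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2)

    def pseudo_r2(y, y_hat):
        # 1 - MSE / Var((y_i)_i); the RMSE is simply np.sqrt(mse(y, y_hat))
        return 1.0 - mse(y, y_hat) / np.var(y)

    # Simple validation: random 80%/20% split of the n observations
    rng = np.random.default_rng(0)
    n = 56                                   # e.g., the 56 pigs of the use case below
    perm = rng.permutation(n)
    train_idx, test_idx = perm[:int(0.8 * n)], perm[int(0.8 * n):]
    # Phi_n is trained on train_idx only; the test error is computed on test_idx.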

Introduction: Errors

Example

[Figure sequence, images not reproduced here: the observations; the training/test datasets; the training and test errors; summary.]

Introduction: Errors

Linear vs nonparametric

Linear methods:

    Y = β^T X + ε    (an a priori on the type of link between X and Y).

Here: nonparametric methods:

    Y = Φ(X) + ε    where Φ is totally unknown.

(ML) Objective: build Φ_n from the observations such that its generalization error E L_{Φ_n} is (asymptotically) optimal.

Example (regression framework):

    E L_{Φ_n} := E[(Φ_n(X) - Y)^2]  →  inf_Φ E L_Φ   as n → +∞,

whatever the distribution of (X, Y).

Introduction: Use case

Use case description

Data kindly provided by Laurence Liaubet and described in [Liaubet et al., 2011]:
- microarray data: expression of 272 selected genes over 56 individuals (pigs);
- a phenotype of interest (muscle pH) measured on the same 56 individuals (numeric variable).

File 1: gene expressions; file 2: muscle pH.
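A hedged sketch of how these two files could be loaded in Python with pandas; the file names genes.csv and ph.csv are placeholders, since the actual file names are not given in the slides.

    # Assumed file names, for illustration only.
    import pandas as pd

    genes = pd.read_csv("genes.csv", index_col=0)   # 56 pigs x 272 gene expressions
    ph = pd.read_csv("ph.csv", index_col=0)         # muscle pH for the same 56 pigs

    X = genes.values                                # predictors
    y = ph.values.ravel()                           # numeric response -> regression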


Neural networks: Overview

Basics [Bishop, 1995]

Common properties:
- (artificial) neural networks: a general name for supervised and unsupervised methods developed in analogy to the brain;
- a combination (network) of simple elements (neurons).

[Example of a graphical representation, image not reproduced here: a network of neurons connecting INPUTS to OUTPUTS.]

Neural networks: Overview

Main features

A neural network is defined by:
1 the network structure;
2 the neuron type.

Standard examples:
- Multilayer perceptrons (MLP) (perceptron multi-couches): dedicated to supervised problems (classification and regression);
- Radial basis function networks (RBF): same purpose, but based on local smoothing;
- Self-organizing maps (SOM): dedicated to unsupervised problems (clustering), self-organized;
- ...

Neural networks: Overview

MLP: advantages/drawbacks

Advantages:
- classification OR regression (i.e., Y can be a numeric variable or a factor);
- nonparametric method: no prior assumption needed;
- accurate (universal approximation).

Drawbacks:
- hard to train (high computational cost, especially when d is large);
- overfit easily;
- black-box models (hard to interpret).

Neural networks: Multilayer perceptron

Analogy to the brain

[Figure, image not reproduced here: a biological neuron sums its weighted inputs; if the sum Σ exceeds the activation threshold the neuron fires, otherwise it does not.]

Neural networks: Multilayer perceptron

(artificial) Perceptron

Layers:
- an MLP has one input layer (the X values), one output layer (the Y values) and several hidden layers (only 1 is necessary);
- no connections within a layer;
- connections between two consecutive layers (feedforward).

[Example figure, image not reproduced here: a 2-hidden-layer MLP whose inputs are the gene expressions, with weighted links from the inputs to hidden layer 1, from layer 1 to layer 2, and from layer 2 to the output, the pH.]

Neural networks: Multilayer perceptron

A neuron

[Figure, image not reproduced here: a neuron receives inputs v_1, v_2, v_3 weighted by w_1, w_2, w_3, sums them together with a bias w_0 (biais), and passes the result through an activation function.]

Standard activation functions (fonctions de lien / d'activation):
- biologically inspired: the Heaviside (threshold, seuil) function

      S(t) = 0 if t < threshold; 1 otherwise.

  Main issue with the Heaviside function: it is not continuous!
- the logistic function

      S(t) = 1 / (1 + e^{-t}).
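A minimal sketch (not from the slides) of the computation performed by a single neuron with the logistic activation above:

    # One artificial neuron: weighted sum of the inputs plus a bias, passed
    # through the logistic activation S(t) = 1 / (1 + exp(-t)).
    import numpy as np

    def logistic(t):
        return 1.0 / (1.0 + np.exp(-t))

    def neuron(v, w, w0):
        # v: inputs (v_1, ..., v_k); w: weights (w_1, ..., w_k); w0: bias
        return logistic(np.dot(w, v) + w0)

    print(neuron(v=np.array([0.2, -1.0, 0.5]), w=np.array([1.0, 0.3, -2.0]), w0=0.1))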

Neural networks: Multilayer perceptron

Summary

If Y is numeric, linear output:

    ∀ x ∈ R^d,  F_w(x) = Σ_{j=1}^{p} w_j^{(2)} S( Σ_{k=1}^{d} w_{kj}^{(1)} x_k + w_j^{(0)} ).

If Y is a factor, logistic output:

    ∀ x ∈ R^d,  P(Y = C | X = x) ≃ F_w(x) = S( Σ_{j=1}^{p} w_j^{(2)} S( Σ_{k=1}^{d} w_{kj}^{(1)} x_k + w_j^{(0)} ) )

(with a maximum-probability rule for the final classification).

[Figure, image not reproduced here: a one-hidden-layer perceptron with inputs x_1, x_2, ..., x_d, hidden neurons 1 to p, first-layer weights w^{(1)}, second-layer weights w^{(2)} and output F_w(x).]

No analytical expression!
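For a numeric Y, the formula above can be written directly as a short function; this is only an illustration (the weights are drawn at random here, whereas in practice they are learned, see below).

    # One-hidden-layer MLP output F_w(x) with a linear output (numeric Y).
    import numpy as np

    def logistic(t):
        return 1.0 / (1.0 + np.exp(-t))

    def mlp_output(x, W1, w0, w2):
        # W1: (d, p) first-layer weights w_{kj}^{(1)}; w0: (p,) biases w_j^{(0)}
        # w2: (p,) second-layer weights w_j^{(2)}
        hidden = logistic(x @ W1 + w0)   # S(sum_k w_{kj}^{(1)} x_k + w_j^{(0)}), j = 1..p
        return hidden @ w2               # sum_j w_j^{(2)} * hidden_j

    d, p = 272, 5                        # e.g., 272 gene expressions, 5 hidden neurons
    rng = np.random.default_rng(0)
    W1, w0, w2 = rng.normal(size=(d, p)), rng.normal(size=p), rng.normal(size=p)
    print(mlp_output(rng.normal(size=d), W1, w0, w2))

For a factor Y, the same hidden computation is used and the output hidden @ w2 is simply passed through logistic() again.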

Neural networks: Multilayer perceptron

Universal approximation

[Hornik et al., 1989] (among others): for any given Φ, smooth enough, and any precision ε > 0, there exists a one-hidden-layer perceptron (with sigmoid activation functions) that approximates Φ with a precision at most ε.

Neural networks: Learning/Tuning

Learning the weights

p is given. Choose w such that:

    w_n = arg min_w (1/n) Σ_{i=1}^{n} (y_i - F_w(x_i))^2    (MSE minimization).

Main issues:
1 no exact solution (⇒ approximation algorithms, e.g., Newton's method + the backpropagation principle): local minima;
  [Figure, image not reproduced here: the error (erreur) as a function of the weights (poids), showing several local minima.]
2 overfitting: the larger p is, the more flexible the perceptron is and the more it can overfit the data.

Neural networks: Learning/Tuning

Overfitting

Weight decay can help improve the generalization ability:

    w_n = arg min_w (1/n) Σ_{i=1}^{n} (y_i - F_w(x_i))^2 + λ ||w||^2.
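A hedged sketch of such a fit with scikit-learn (not part of the original slides): MLPRegressor's alpha argument is an L2 penalty on the weights and plays the role of λ above; the data are random placeholders standing in for the gene expressions and the pH.

    # Fit a one-hidden-layer perceptron with weight decay (L2 penalty alpha).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(56, 272))        # placeholder for the gene expressions
    y = rng.normal(size=56)               # placeholder for the muscle pH

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    mlp = MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                       alpha=0.01, max_iter=5000, random_state=0)
    mlp.fit(X_train, y_train)
    print("test MSE:", np.mean((mlp.predict(X_test) - y_test) ** 2))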

Neural networks: Learning/Tuning

Tuning p and λ

p and λ are called hyperparameters (hyper-paramètres): they are not learned from the data but tuned, e.g., with:
- a grid search using a simple validation;
- a grid search using a (K-fold) cross validation (better, but computationally expensive when K is large).

Neural networks: Learning/Tuning

Cross validation

Algorithm
 1: Set the search grids for p, G_p, and for λ, G_λ
 2: Split the data into K groups
 3: for p ∈ G_p and λ ∈ G_λ do
 4:   for group = 1..K do
 5:     Train model_{p,λ,group} without the observations in group
 6:     Compute the test error MSE_{p,λ,group} on the observations in group
 7:   end for
 8:   Average MSE_{p,λ,group} over the groups ⇒ MSE_{p,λ}
 9: end for
10: Select the p and λ with minimum MSE_{p,λ}

n-fold cross validation is called Leave-One-Out (LOO). Standard choices for K: LOO or 10-fold CV.
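A hedged scikit-learn sketch of this grid search with K = 10 folds (the grids G_p and G_λ below are illustrative values, not taken from the slides):

    # Grid search + 10-fold cross validation over p (hidden layer size) and
    # lambda (the weight-decay penalty alpha).
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(56, 272)), rng.normal(size=56)   # placeholder data

    grid = {
        "hidden_layer_sizes": [(2,), (5,), (10,)],   # G_p: candidate numbers of hidden neurons
        "alpha": [1e-3, 1e-2, 1e-1],                 # G_lambda: candidate weight-decay values
    }
    search = GridSearchCV(
        MLPRegressor(activation="logistic", max_iter=5000, random_state=0),
        param_grid=grid,
        cv=10,                                       # K = 10 folds
        scoring="neg_mean_squared_error",            # average (negated) MSE over the folds
    )
    search.fit(X, y)
    print(search.best_params_)                       # the p and lambda with minimal CV MSE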


CART: Introduction

Overview

CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].

Advantages:
- classification OR regression (i.e., Y can be a numeric variable or a factor);
- nonparametric method: no prior assumption needed;
- can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method);
- provides an intuitive interpretation.

Drawbacks:
- requires a large training dataset to be efficient;
- as a consequence, trees are often too simple to provide accurate predictions.

CART: Introduction

Example

X = (Gender, Age, Height) and Y = Weight.

[Figure, image not reproduced here: a tree whose root is split on Gender (M / F), with further splits on Age (< 30 / > 30) and Height (< 1.60 m / > 1.60 m); each branching is a split, each intermediate position is a node, and each terminal node Y1, ..., Y5 is a leaf carrying a predicted value.]

CART: Learning

CART learning process

Algorithm
1: Start from the root
2: repeat
3:   move to a new node
4:   if the node is homogeneous or small enough then
5:     STOP
6:   else
7:     split the node into two child nodes with maximal homogeneity
8:   end if
9: until all nodes are processed

CART: Learning

Further details

Homogeneity?
- if Y is a numeric variable: the variance of (y_i)_i for the observations assigned to the node;
- if Y is a factor: node purity, the % of observations assigned to the node whose Y values are not the node's majority class (the Gini index is also sometimes used).

Stopping criteria?
- minimum node size (generally 1 or 5);
- minimum node purity or variance;
- maximum tree depth.

Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning...
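A hedged sketch with scikit-learn (not part of the original slides): min_samples_leaf and max_depth correspond to the minimum node size and maximum tree depth stopping criteria above, and the data are random placeholders.

    # Grow a regression tree; for a factor Y, DecisionTreeClassifier would be
    # used instead (its default impurity measure is the Gini index).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(56, 272)), rng.normal(size=56)   # placeholder data

    tree = DecisionTreeRegressor(min_samples_leaf=5, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(export_text(tree))          # textual view of the splits, nodes and leaves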
