Machine Learning
Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr - http://www.nathalievilla.org
IUT STID (Carcassonne) & SAMM (Université Paris 1)
Formation INRA, Niveau 3
Outline
1 Introduction
  Background and notations
  Underfitting / Overfitting
  Errors
  Use case
2 Neural networks
  Overview
  Multilayer perceptron
  Learning/Tuning
3 CART
  Introduction
  Learning
  Prediction
4 Random forest
  Overview
  Bootstrap/Bagging
  Random forest
Introduction
Background and notations

Background
• Purpose: predict Y from X;
• What we have: n observations of (X, Y): (x_1, y_1), ..., (x_n, y_n);
• What we want: estimate the unknown Y for new values of X: x_{n+1}, ..., x_m.

X can be:
• numeric variables;
• or factors;
• or a combination of numeric variables and factors.

Y can be:
• a numeric variable (Y ∈ ℝ) ⇒ (supervised) regression (French: régression);
• a factor ⇒ (supervised) classification (French: discrimination).
Underfitting / Overfitting

Basics
From the (x_i, y_i)_i, definition of a machine Φ_n s.t.:
$$\hat{y}_{\text{new}} = \Phi_n(x_{\text{new}}).$$
• if Y is numeric, Φ_n is called a regression function (French: fonction de régression);
• if Y is a factor, Φ_n is called a classifier (French: classifieur).
Φ_n is said to be trained or learned from the observations (x_i, y_i)_i.

Desirable properties
• accuracy to the observations: predictions made on known data are close to the observed values;
• generalization ability: predictions made on new data are also accurate.
Conflicting objectives!
Underfitting/Overfitting (French: sous/sur-apprentissage)

[Figure sequence: the function x → y to be estimated; the observations we might have; the observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate); a third estimation (overfitting); summary.]
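The same progression is easy to reproduce numerically. Below is a minimal R sketch (not taken from the course material, the true function and sample size are arbitrary): polynomial fits of increasing degree on simulated data, where degree 1 underfits, degree 3 is roughly right, and degree 20 overfits.

```r
# Illustrative sketch: under/overfitting with polynomial regression
set.seed(42)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)   # unknown function + noise
dat <- data.frame(x = x, y = y)

fit_under <- lm(y ~ x, data = dat)            # degree 1: underfits
fit_good  <- lm(y ~ poly(x, 3), data = dat)   # degree 3: accurate
fit_over  <- lm(y ~ poly(x, 20), data = dat)  # degree 20: overfits

# The training error always decreases with flexibility...
sapply(list(fit_under, fit_good, fit_over),
       function(m) mean(residuals(m)^2))
# ... but the overfitted model generalizes poorly to new data.
```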
Errors

• training error (measures the accuracy to the observations):
  • if Y is a factor: misclassification rate
    $$\frac{\#\{i : \hat{y}_i \neq y_i,\ i = 1, \dots, n\}}{n};$$
  • if Y is numeric: mean squared error (MSE)
    $$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$
    or root mean squared error (RMSE), or pseudo-R²: $1 - \text{MSE}/\text{Var}((y_i)_i)$;
• test error: a way to prevent overfitting (estimates the generalization error):
  1 split the data into training/test sets (usually 80%/20%);
  2 train Φ_n on the training set;
  3 compute the test error on the remaining data (simple validation; see the R sketch below).
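As a minimal sketch of these formulas in R (assuming a data frame `dat` with a numeric response `y`, e.g., the one simulated above; `lm` stands in for any machine Φ_n), simple validation and the associated errors can be computed as follows.

```r
# Simple validation: 80%/20% training/test split
n <- nrow(dat)
train_idx <- sample(n, round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

model <- lm(y ~ ., data = train)   # any machine Phi_n; lm is a stand-in

mse_train <- mean((predict(model, train) - train$y)^2)  # training MSE
mse_test  <- mean((predict(model, test)  - test$y)^2)   # test MSE
rmse_test <- sqrt(mse_test)                             # RMSE
pseudo_R2 <- 1 - mse_test / var(test$y)                 # pseudo-R2

# For a factor response, the misclassification rate would be
# mean(pred_class != test$y), with pred_class the predicted classes.
```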
Example

[Figure sequence: the observations; the training/test datasets; the training and test errors; summary.]
Linear vs Nonparametric

Linear methods:
$$Y = \beta^T X + \epsilon$$
(a priori on the type of link between X and Y).
Here: nonparametric methods:
$$Y = \Phi(X) + \epsilon,$$
where Φ is totally unknown.

(ML) Objective: build Φ_n from the observations such that its generalization error $L_{\Phi_n}$ is (asymptotically) optimal.

Example (regression framework):
$$L_{\Phi_n} := \mathbb{E}\left[(\Phi_n(X) - Y)^2\right] \xrightarrow{n \to +\infty} \inf_{\Phi} L_{\Phi},$$
whatever the distribution of (X, Y).
Use case

Use case description
Data kindly provided by Laurence Liaubet, described in [Liaubet et al., 2011]:
• microarray data: expression of 272 selected genes over 56 individuals (pigs);
• a phenotype of interest (muscle pH) measured on the same 56 individuals (numeric variable).
Two files: file 1 contains the gene expressions; file 2 contains the muscle pH.
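A hypothetical loading sketch in R (the actual file names and formats are not specified in the slides):

```r
# Hypothetical file names; adapt to the actual data files
expr <- read.table("gene_expression.txt", header = TRUE)  # 56 x 272
pH   <- read.table("muscle_pH.txt", header = TRUE)        # 56 x 1
dat  <- data.frame(expr, pH = pH[, 1])  # 56 rows: 272 genes + the phenotype
```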
Neural networks
Overview

Basics [Bishop, 1995]
Common properties
• (artificial) neural networks: a general name for supervised and unsupervised methods developed in analogy to the brain;
• combination (network) of simple elements (neurons).
[Figure: example of a graphical representation of a network, from inputs to outputs.]
Main features
A neural network is defined by:
1 the network structure;
2 the neuron type.

Standard examples
• multilayer perceptrons (MLP) (French: perceptron multi-couches): dedicated to supervised problems (classification and regression);
• radial basis function networks (RBF): same purpose, but based on local smoothing;
• self-organizing maps (SOM): dedicated to unsupervised problems (clustering), self-organized;
• ...
MLP: Advantages/Drawbacks
Advantages
• classification OR regression (i.e., Y can be a numeric variable or a factor);
• nonparametric method: no prior assumption needed;
• accurate (universal approximation).
Drawbacks
• hard to train (high computational cost, especially when d is large);
• overfit easily;
• black-box models (hard to interpret).
Multilayer perceptron

Analogy to the brain
[Figure: a biological neuron computes a weighted sum (Σ) of its inputs; if the sum exceeds the activation threshold the neuron fires, otherwise it does not.]
(artificial) Perceptron
Layers
• an MLP has one input layer (the X values), one output layer (the Y values) and several hidden layers (only 1 is necessary);
• no connections within a layer;
• connections between two consecutive layers (feedforward).
[Figure: a 2-hidden-layer MLP: inputs (gene expressions) connected by weighted links to hidden layer 1, then to hidden layer 2, then to the output (pH).]
A neuron
[Figure: a neuron receives inputs v_1, v_2, v_3 weighted by w_1, w_2, w_3, computes the weighted sum $w_0 + \sum_j w_j v_j$, where $w_0$ is the bias (French: biais), and applies an activation function.]

Standard activation functions (French: fonctions de lien / d'activation)
Biologically inspired: the Heaviside function
$$S(t) = \begin{cases} 0 & \text{if } t < \text{threshold (French: seuil)}; \\ 1 & \text{otherwise.} \end{cases}$$
Main issue with the Heaviside function: it is not continuous!
The logistic function:
$$S(t) = \frac{1}{1 + e^{-t}}$$
[Figure: plot of the logistic function over [−4, 4].]
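As a small sketch, both activation functions and the output of a single neuron are straightforward to write in R (the input values and weights below are arbitrary placeholders):

```r
# The two activation functions of the slide
heaviside <- function(t, threshold = 0) as.numeric(t >= threshold)
logistic  <- function(t) 1 / (1 + exp(-t))

# A neuron: bias w0 plus the weighted sum of the inputs, then the activation S
neuron <- function(v, w, w0, S = logistic) S(w0 + sum(w * v))

neuron(v = c(0.5, -1, 2), w = c(0.1, 0.4, 0.3), w0 = -0.2)
```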
Summary
If Y is numeric, linear output:
$$\forall x \in \mathbb{R}^d, \quad F_w(x) = \sum_{j=1}^{p} w_j^{(2)} S\left(\sum_{k=1}^{d} w_{kj}^{(1)} x_k + w_j^{(0)}\right).$$
If Y is a factor, logistic output:
$$\forall x \in \mathbb{R}^d, \quad P(Y = C \mid X = x) \simeq F_w(x) = S\left(\sum_{j=1}^{p} w_j^{(2)} S\left(\sum_{k=1}^{d} w_{kj}^{(1)} x_k + w_j^{(0)}\right)\right)$$
(with a maximum probability rule for the final classification).
[Figure: a one-hidden-layer MLP with inputs x_1, ..., x_d, hidden neurons 1, ..., p, first-layer weights w^{(1)}, second-layer weights w^{(2)} and output F_w(x).]
No analytical expression!
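A minimal R sketch of the regression formula above, as a forward pass through a one-hidden-layer MLP (the weights are random placeholders; in practice they come from training):

```r
logistic <- function(t) 1 / (1 + exp(-t))

# F_w(x) for a one-hidden-layer MLP with linear output
mlp_forward <- function(x, W1, w0, w2) {
  # W1: d x p input-to-hidden weights; w0: p hidden biases;
  # w2: p hidden-to-output weights
  hidden <- logistic(t(W1) %*% x + w0)  # p hidden activations
  sum(w2 * hidden)                      # linear combination = F_w(x)
}

d <- 3; p <- 2
W1 <- matrix(rnorm(d * p), d, p); w0 <- rnorm(p); w2 <- rnorm(p)
mlp_forward(c(1, 0.5, -1), W1, w0, w2)
```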
Universal approximation
[Hornik et al., 1989] (among others):
For any given Φ, smooth enough, and any precision ε > 0, there exists a one-hidden-layer perceptron (with sigmoid activation functions) that approximates Φ with a precision at most ε.
Learning/Tuning

Learning the weights
p is given. Choose w s.t.:
$$w^n = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - F_w(x_i))^2 \qquad \text{(MSE minimization)}$$
Main issues:
1 no exact solution (⇒ approximation algorithms, e.g., Newton's method + the backpropagation principle): local minima;
[Figure: the error (French: erreur) as a function of the weights (French: poids), with several local minima.]
2 overfitting: the larger p is, the more flexible the perceptron is, and the more it can overfit the data.
Overfitting
Weight decay can help improve the generalization ability:
$$w^n = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - F_w(x_i))^2 + \lambda \|w\|^2$$
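In R, both the MSE minimization and the weight decay penalty are available in the nnet package: `size` corresponds to p and `decay` to λ. A sketch, assuming the pig data loaded in a data frame `dat` with response `pH`:

```r
library(nnet)
set.seed(1)
# size = p (hidden neurons), decay = lambda, linout = TRUE for regression
fit <- nnet(pH ~ ., data = dat, size = 5, decay = 0.01,
            linout = TRUE, maxit = 500, trace = FALSE)
mean((predict(fit, dat) - dat$pH)^2)  # training MSE (optimistic!)
```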
Tuning p and λ
p and λ are called hyperparameters (French: hyper-paramètres): they are not learned from the data but tuned, e.g., by
• grid search using a simple validation;
• grid search using a (K-fold) cross validation (better, but computationally expensive when K is large).
Cross validation
Algorithm
1: Set the grids for p, G_p, and for λ, G_λ
2: Split the data into K groups
3: for p ∈ G_p and λ ∈ G_λ do
4:   for group = 1..K do
5:     Train model_{p,λ,group} without the observations in group
6:     Compute the test error MSE_{p,λ,group} on the observations in group
7:   end for
8:   Average the MSE_{p,λ,group} over the groups ⇒ MSE_{p,λ}
9: end for
10: Select the p and λ with minimum MSE_{p,λ}

n-fold cross validation is called Leave-One-Out (LOO). Standard choices for K: LOO or 10-fold CV.
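The algorithm translates almost line by line into R. The sketch below uses nnet and assumes the data frame `dat` with numeric response `pH`; the grids are arbitrary:

```r
library(nnet)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(dat)))                # step 2
grid  <- expand.grid(p = c(2, 5, 10), lambda = c(0, 0.01, 0.1))  # step 1

cv_mse <- apply(grid, 1, function(g) {             # step 3: loop on the grid
  errs <- sapply(1:K, function(k) {                # step 4: loop on groups
    fit <- nnet(pH ~ ., data = dat[folds != k, ],  # step 5: train without k
                size = g["p"], decay = g["lambda"],
                linout = TRUE, maxit = 500, trace = FALSE)
    mean((predict(fit, dat[folds == k, ]) - dat$pH[folds == k])^2)  # step 6
  })
  mean(errs)                                       # step 8: MSE_{p,lambda}
})
grid[which.min(cv_mse), ]                          # step 10: best (p, lambda)
```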
CART
Introduction

Overview
CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].
Advantages
• classification OR regression (i.e., Y can be a numeric variable or a factor);
• nonparametric method: no prior assumption needed;
• can deal with a large number of input variables, either numeric variables or factors (variable selection is included in the method);
• provides an intuitive interpretation.
Drawbacks
• requires a large training dataset to be efficient;
• as a consequence, trees are often too simple to provide accurate predictions.
Example
X = (Gender, Age, Height) and Y = Weight.
[Figure: a tree whose root is split into "Height < 1.60m" and "Height > 1.60m", with further splits on Gender (M/F) and Age (< 30 / > 30), and five leaves (terminal nodes) carrying the predictions Y_1, ..., Y_5; the annotations point out the root, a split, a node and a leaf (terminal node).]
Learning

CART learning process
Algorithm
1: Start from the root
2: repeat
3:   move to a new node
4:   if the node is homogeneous or small enough then
5:     STOP
6:   else
7:     split the node into two child nodes with maximal homogeneity
8:   end if
9: until all nodes are processed
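In R, this learning process is implemented in the rpart package. A sketch on simulated data shaped like the Gender/Age/Height example above (the data-generating model is invented for illustration):

```r
library(rpart)
set.seed(2)
n <- 200
toy <- data.frame(
  Gender = factor(sample(c("M", "F"), n, replace = TRUE)),
  Age    = runif(n, 18, 70),
  Height = runif(n, 1.45, 1.95)
)
toy$Weight <- 40 + 25 * toy$Height + 2 * (toy$Gender == "M") + rnorm(n, sd = 3)

# method = "anova" for a regression tree (Y numeric)
tree <- rpart(Weight ~ Gender + Age + Height, data = toy, method = "anova")
tree  # prints the splits and the leaf predictions
predict(tree, data.frame(Gender = factor("F", levels = levels(toy$Gender)),
                         Age = 25, Height = 1.55))
```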
Further details
Homogeneity?
• if Y is a numeric variable: the variance of the (y_i)_i for the observations assigned to the node;
• if Y is a factor: node purity, the % of observations assigned to the node whose Y values are not in the node's majority class (the Gini index is also sometimes used).
Stopping criteria?
• minimum node size (generally 1 or 5);
• minimum node purity or variance;
• maximum tree depth.
Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning (see the rpart sketch below).
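These stopping criteria map onto rpart.control parameters, as a sketch continuing the toy example above (the parameter values are arbitrary): `minsplit` bounds the node size, `maxdepth` bounds the tree depth, and `cp` (complexity parameter) acts as a homogeneity threshold. Pruning can then be driven by rpart's built-in cross-validation:

```r
library(rpart)
deep <- rpart(Weight ~ ., data = toy, method = "anova",
              control = rpart.control(minsplit = 5, cp = 0.001, maxdepth = 10))

printcp(deep)  # cross-validated error for each subtree
best_cp <- deep$cptable[which.min(deep$cptable[, "xerror"]), "CP"]
pruned  <- prune(deep, cp = best_cp)  # pruned tree
```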