Machine Learning
Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr - http://www.nathalievilla.org
IUT STID (Carcassonne) & SAMM (Université Paris 1)
Formation INRA, Niveau 3
Outline
1 Introduction
  Background and notations
  Underfitting / Overfitting
  Errors
  Use case
2 Neural networks
  Overview
  Multilayer perceptron
  Learning/Tuning
3 CART
  Introduction
  Learning
  Prediction
4 Random forest
  Overview
  Bootstrap/Bagging
  Random forest
Introduction
Background and notations

Background
• Purpose: predict Y from X;
• What we have: n observations of (X, Y): (x_1, y_1), ..., (x_n, y_n);
• What we want: estimate the unknown Y for new values of X: x_{n+1}, ..., x_m.

X can be:
• numeric variables;
• or factors;
• or a combination of numeric variables and factors.

Y can be:
• a numeric variable (Y ∈ ℝ) ⇒ (supervised) regression (French: régression);
• a factor ⇒ (supervised) classification (French: discrimination).
Underfitting / Overfitting

Basics
From the (x_i, y_i)_i, definition of a machine Φ_n s.t.:
$$\hat{y}_{\text{new}} = \Phi_n(x_{\text{new}}).$$
• if Y is numeric, Φ_n is called a regression function (French: fonction de régression);
• if Y is a factor, Φ_n is called a classifier (French: classifieur).
Φ_n is said to be trained or learned from the observations (x_i, y_i)_i.

Desirable properties
• accuracy to the observations: predictions made on known data are close to the observed values;
• generalization ability: predictions made on new data are also accurate.
Conflicting objectives!
Underfitting/Overfitting (French: sous/sur-apprentissage)

[Figure sequence: the function x → y to be estimated; the observations we might have; the observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate); a third estimation (overfitting); summary.]
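The same progression is easy to reproduce numerically. Below is a minimal R sketch (not taken from the course material, the true function and sample size are arbitrary): polynomial fits of increasing degree on simulated data, where degree 1 underfits, degree 3 is roughly right, and degree 20 overfits.

```r
# Illustrative sketch: under/overfitting with polynomial regression
set.seed(42)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)   # unknown function + noise
dat <- data.frame(x = x, y = y)

fit_under <- lm(y ~ x, data = dat)            # degree 1: underfits
fit_good  <- lm(y ~ poly(x, 3), data = dat)   # degree 3: accurate
fit_over  <- lm(y ~ poly(x, 20), data = dat)  # degree 20: overfits

# The training error always decreases with flexibility...
sapply(list(fit_under, fit_good, fit_over),
       function(m) mean(residuals(m)^2))
# ... but the overfitted model generalizes poorly to new data.
```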
Errors

• training error (measures the accuracy to the observations):
  • if Y is a factor: misclassification rate
    $$\frac{\#\{i : \hat{y}_i \neq y_i,\ i = 1, \dots, n\}}{n};$$
  • if Y is numeric: mean squared error (MSE)
    $$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$
    or root mean squared error (RMSE), or pseudo-R²: $1 - \text{MSE}/\text{Var}((y_i)_i)$;
• test error: a way to prevent overfitting (estimates the generalization error):
  1 split the data into training/test sets (usually 80%/20%);
  2 train Φ_n on the training set;
  3 compute the test error on the remaining data (simple validation; see the R sketch below).
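As a minimal sketch of these formulas in R (assuming a data frame `dat` with a numeric response `y`, e.g., the one simulated above; `lm` stands in for any machine Φ_n), simple validation and the associated errors can be computed as follows.

```r
# Simple validation: 80%/20% training/test split
n <- nrow(dat)
train_idx <- sample(n, round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

model <- lm(y ~ ., data = train)   # any machine Phi_n; lm is a stand-in

mse_train <- mean((predict(model, train) - train$y)^2)  # training MSE
mse_test  <- mean((predict(model, test)  - test$y)^2)   # test MSE
rmse_test <- sqrt(mse_test)                             # RMSE
pseudo_R2 <- 1 - mse_test / var(test$y)                 # pseudo-R2

# For a factor response, the misclassification rate would be
# mean(pred_class != test$y), with pred_class the predicted classes.
```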
Example

[Figure sequence: the observations; the training/test datasets; the training and test errors; summary.]
Linear vs Nonparametric

Linear methods:
$$Y = \beta^T X + \epsilon$$
(a priori on the type of link between X and Y).
Here: nonparametric methods:
$$Y = \Phi(X) + \epsilon,$$
where Φ is totally unknown.

(ML) Objective: build Φ_n from the observations such that its generalization error $L_{\Phi_n}$ is (asymptotically) optimal.

Example (regression framework):
$$L_{\Phi_n} := \mathbb{E}\left[(\Phi_n(X) - Y)^2\right] \xrightarrow{n \to +\infty} \inf_{\Phi} L_{\Phi},$$
whatever the distribution of (X, Y).
Use case

Use case description
Data kindly provided by Laurence Liaubet, described in [Liaubet et al., 2011]:
• microarray data: expression of 272 selected genes over 56 individuals (pigs);
• a phenotype of interest (muscle pH) measured on the same 56 individuals (numeric variable).
Two files: file 1 contains the gene expressions; file 2 contains the muscle pH.
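A hypothetical loading sketch in R (the actual file names and formats are not specified in the slides):

```r
# Hypothetical file names; adapt to the actual data files
expr <- read.table("gene_expression.txt", header = TRUE)  # 56 x 272
pH   <- read.table("muscle_pH.txt", header = TRUE)        # 56 x 1
dat  <- data.frame(expr, pH = pH[, 1])  # 56 rows: 272 genes + the phenotype
```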
Neural networks
Overview

Basics [Bishop, 1995]
Common properties
• (artificial) neural networks: a general name for supervised and unsupervised methods developed in analogy to the brain;
• combination (network) of simple elements (neurons).
[Figure: example of a graphical representation of a network, from inputs to outputs.]
Main features
A neural network is defined by:
1 the network structure;
2 the neuron type.

Standard examples
• multilayer perceptrons (MLP) (French: perceptron multi-couches): dedicated to supervised problems (classification and regression);
• radial basis function networks (RBF): same purpose, but based on local smoothing;
• self-organizing maps (SOM): dedicated to unsupervised problems (clustering), self-organized;
• ...
MLP: Advantages/Drawbacks
Advantages
• classification OR regression (i.e., Y can be a numeric variable or a factor);
• nonparametric method: no prior assumption needed;
• accurate (universal approximation).
Drawbacks
• hard to train (high computational cost, especially when d is large);
• overfit easily;
• black-box models (hard to interpret).
Multilayer perceptron

Analogy to the brain
[Figure: a biological neuron computes a weighted sum (Σ) of its inputs; if the sum exceeds the activation threshold the neuron fires, otherwise it does not.]
(artificial) Perceptron
Layers
• an MLP has one input layer (the X values), one output layer (the Y values) and several hidden layers (only 1 is necessary);
• no connections within a layer;
• connections between two consecutive layers (feedforward).
[Figure: a 2-hidden-layer MLP: inputs (gene expressions) connected by weighted links to hidden layer 1, then to hidden layer 2, then to the output (pH).]
A neuron
[Figure: a neuron receives inputs v_1, v_2, v_3 weighted by w_1, w_2, w_3, computes the weighted sum $w_0 + \sum_j w_j v_j$, where $w_0$ is the bias (French: biais), and applies an activation function.]

Standard activation functions (French: fonctions de lien / d'activation)
Biologically inspired: the Heaviside function
$$S(t) = \begin{cases} 0 & \text{if } t < \text{threshold (French: seuil)}; \\ 1 & \text{otherwise.} \end{cases}$$
Main issue with the Heaviside function: it is not continuous!
The logistic function:
$$S(t) = \frac{1}{1 + e^{-t}}$$
[Figure: plot of the logistic function over [−4, 4].]
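As a small sketch, both activation functions and the output of a single neuron are straightforward to write in R (the input values and weights below are arbitrary placeholders):

```r
# The two activation functions of the slide
heaviside <- function(t, threshold = 0) as.numeric(t >= threshold)
logistic  <- function(t) 1 / (1 + exp(-t))

# A neuron: bias w0 plus the weighted sum of the inputs, then the activation S
neuron <- function(v, w, w0, S = logistic) S(w0 + sum(w * v))

neuron(v = c(0.5, -1, 2), w = c(0.1, 0.4, 0.3), w0 = -0.2)
```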
Summary
If Y is numeric, linear output:
$$\forall x \in \mathbb{R}^d, \quad F_w(x) = \sum_{j=1}^{p} w_j^{(2)} S\left(\sum_{k=1}^{d} w_{kj}^{(1)} x_k + w_j^{(0)}\right).$$
If Y is a factor, logistic output:
$$\forall x \in \mathbb{R}^d, \quad P(Y = C \mid X = x) \simeq F_w(x) = S\left(\sum_{j=1}^{p} w_j^{(2)} S\left(\sum_{k=1}^{d} w_{kj}^{(1)} x_k + w_j^{(0)}\right)\right)$$
(with a maximum probability rule for the final classification).
[Figure: a one-hidden-layer MLP with inputs x_1, ..., x_d, hidden neurons 1, ..., p, first-layer weights w^{(1)}, second-layer weights w^{(2)} and output F_w(x).]
No analytical expression!
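A minimal R sketch of the regression formula above, as a forward pass through a one-hidden-layer MLP (the weights are random placeholders; in practice they come from training):

```r
logistic <- function(t) 1 / (1 + exp(-t))

# F_w(x) for a one-hidden-layer MLP with linear output
mlp_forward <- function(x, W1, w0, w2) {
  # W1: d x p input-to-hidden weights; w0: p hidden biases;
  # w2: p hidden-to-output weights
  hidden <- logistic(t(W1) %*% x + w0)  # p hidden activations
  sum(w2 * hidden)                      # linear combination = F_w(x)
}

d <- 3; p <- 2
W1 <- matrix(rnorm(d * p), d, p); w0 <- rnorm(p); w2 <- rnorm(p)
mlp_forward(c(1, 0.5, -1), W1, w0, w2)
```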
Universal approximation
[Hornik et al., 1989] (among others):
For any given Φ, smooth enough, and any precision ε > 0, there exists a one-hidden-layer perceptron (with sigmoid activation functions) that approximates Φ with a precision at most ε.
Learning/Tuning

Learning the weights
p is given. Choose w s.t.:
$$w^n = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - F_w(x_i))^2 \qquad \text{(MSE minimization)}$$
Main issues:
1 no exact solution (⇒ approximation algorithms, e.g., Newton's method + the backpropagation principle): local minima;
[Figure: the error (French: erreur) as a function of the weights (French: poids), with several local minima.]
2 overfitting: the larger p is, the more flexible the perceptron is, and the more it can overfit the data.
Overfitting
Weight decay can help improve the generalization ability:
$$w^n = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - F_w(x_i))^2 + \lambda \|w\|^2$$
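In R, both the MSE minimization and the weight decay penalty are available in the nnet package: `size` corresponds to p and `decay` to λ. A sketch, assuming the pig data loaded in a data frame `dat` with response `pH`:

```r
library(nnet)
set.seed(1)
# size = p (hidden neurons), decay = lambda, linout = TRUE for regression
fit <- nnet(pH ~ ., data = dat, size = 5, decay = 0.01,
            linout = TRUE, maxit = 500, trace = FALSE)
mean((predict(fit, dat) - dat$pH)^2)  # training MSE (optimistic!)
```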
Tuning p and λ
p and λ are called hyperparameters (French: hyper-paramètres): they are not learned from the data but tuned, e.g., by
• grid search using a simple validation;
• grid search using a (K-fold) cross validation (better, but computationally expensive when K is large).
Cross validation
Algorithm
1: Set the grids for p, G_p, and for λ, G_λ
2: Split the data into K groups
3: for p ∈ G_p and λ ∈ G_λ do
4:   for group = 1..K do
5:     Train model_{p,λ,group} without the observations in group
6:     Compute the test error MSE_{p,λ,group} on the observations in group
7:   end for
8:   Average the MSE_{p,λ,group} over the groups ⇒ MSE_{p,λ}
9: end for
10: Select the p and λ with minimum MSE_{p,λ}

n-fold cross validation is called Leave-One-Out (LOO). Standard choices for K: LOO or 10-fold CV.
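The algorithm translates almost line by line into R. The sketch below uses nnet and assumes the data frame `dat` with numeric response `pH`; the grids are arbitrary:

```r
library(nnet)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(dat)))                # step 2
grid  <- expand.grid(p = c(2, 5, 10), lambda = c(0, 0.01, 0.1))  # step 1

cv_mse <- apply(grid, 1, function(g) {             # step 3: loop on the grid
  errs <- sapply(1:K, function(k) {                # step 4: loop on groups
    fit <- nnet(pH ~ ., data = dat[folds != k, ],  # step 5: train without k
                size = g["p"], decay = g["lambda"],
                linout = TRUE, maxit = 500, trace = FALSE)
    mean((predict(fit, dat[folds == k, ]) - dat$pH[folds == k])^2)  # step 6
  })
  mean(errs)                                       # step 8: MSE_{p,lambda}
})
grid[which.min(cv_mse), ]                          # step 10: best (p, lambda)
```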
CART
Introduction

Overview
CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].
Advantages
• classification OR regression (i.e., Y can be a numeric variable or a factor);
• nonparametric method: no prior assumption needed;
• can deal with a large number of input variables, either numeric variables or factors (variable selection is included in the method);
• provides an intuitive interpretation.
Drawbacks
• requires a large training dataset to be efficient;
• as a consequence, trees are often too simple to provide accurate predictions.
Example
X = (Gender, Age, Height) and Y = Weight.
[Figure: a tree whose root is split into "Height < 1.60m" and "Height > 1.60m", with further splits on Gender (M/F) and Age (< 30 / > 30), and five leaves (terminal nodes) carrying the predictions Y_1, ..., Y_5; the annotations point out the root, a split, a node and a leaf (terminal node).]
Learning

CART learning process
Algorithm
1: Start from the root
2: repeat
3:   move to a new node
4:   if the node is homogeneous or small enough then
5:     STOP
6:   else
7:     split the node into two child nodes with maximal homogeneity
8:   end if
9: until all nodes are processed
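In R, this learning process is implemented in the rpart package. A sketch on simulated data shaped like the Gender/Age/Height example above (the data-generating model is invented for illustration):

```r
library(rpart)
set.seed(2)
n <- 200
toy <- data.frame(
  Gender = factor(sample(c("M", "F"), n, replace = TRUE)),
  Age    = runif(n, 18, 70),
  Height = runif(n, 1.45, 1.95)
)
toy$Weight <- 40 + 25 * toy$Height + 2 * (toy$Gender == "M") + rnorm(n, sd = 3)

# method = "anova" for a regression tree (Y numeric)
tree <- rpart(Weight ~ Gender + Age + Height, data = toy, method = "anova")
tree  # prints the splits and the leaf predictions
predict(tree, data.frame(Gender = factor("F", levels = levels(toy$Gender)),
                         Age = 25, Height = 1.55))
```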
Further details
Homogeneity?
• if Y is a numeric variable: the variance of the (y_i)_i for the observations assigned to the node;
• if Y is a factor: node purity, the % of observations assigned to the node whose Y values are not in the node's majority class (the Gini index is also sometimes used).
Stopping criteria?
• minimum node size (generally 1 or 5);
• minimum node purity or variance;
• maximum tree depth.
Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning (see the rpart sketch below).
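These stopping criteria map onto rpart.control parameters, as a sketch continuing the toy example above (the parameter values are arbitrary): `minsplit` bounds the node size, `maxdepth` bounds the tree depth, and `cp` (complexity parameter) acts as a homogeneity threshold. Pruning can then be driven by rpart's built-in cross-validation:

```r
library(rpart)
deep <- rpart(Weight ~ ., data = toy, method = "anova",
              control = rpart.control(minsplit = 5, cp = 0.001, maxdepth = 10))

printcp(deep)  # cross-validated error for each subtree
best_cp <- deep$cptable[which.min(deep$cptable[, "xerror"]), "CP"]
pruned  <- prune(deep, cp = best_cp)  # pruned tree
```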