Machine Learning (Apprentissage Artificiel)
General framework + Supervised Learning
Pr. Fabien MOUTARDE, Center for Robotics
MINES ParisTech, PSL Université Paris
Fabien.Moutarde@mines-paristech.fr
http://people.mines-paristech.fr/fabien.moutarde
What is Statistical Machine-Learning?

[Diagram: Machine Learning at the intersection of Statistics/Data analysis, Optimization, and Artificial Intelligence]
Real-world examples of SML applications

• Handwritten character recognition
• Object category visual recognition
• Speech recognition
• Multi-factorial forecasting
• Natural Language understanding
• Playing Go!
• …

[Figure: a handwritten-digit recognition system mapping digit images to labels (3, 6, …)]
[Figure: a pedestrian recognition system separating "pedestrians" from "non-pedestrians"]
Statistical Machine-Learning
• One of many sub-fields of Artificial Intelligence
• Application of optimization methods to statistical modelling
• Data-driven mathematical modelling, for automated classification, regression, partitioning/clustering, or decision/behavior rules

[Illustration: example of clustering]
Types of Machine-Learning

• Availability of target output data?
  → Supervised learning vs. Unsupervised learning vs. Reinforcement Learning
• Permanent adaptability?
  → offline learning vs. online (life-long) learning
• What kind of (mathematical) model?
  → polynomial/spline, decision tree, neural net, kernel machine, …
• Which objective function?
  → cost function (quadratic error, …), implicit criterion, …
• How to find the best-fitting model?
  → algorithm type (exact solving, gradient descent, quadratic optimization, heuristics, …)
Simplest ML example:
Least Squares Linear Regression

• Model: (straight) line y = ax + b (2 parameters a and b)
• Data: n points with target value, (x_i, y_i) ∈ ℝ²
• Cost function: sum of squared deviations from the line, K = Σ_i (y_i − a·x_i − b)²
• Algorithm: direct (or iterative) solving of the linear system below
$$\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} \cdot \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}$$
[Question: Where does this equation come from?]
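Whatever its derivation, the 2×2 system can be solved directly; a minimal NumPy sketch (the function name and the synthetic data are ours, for illustration):

```python
import numpy as np

def least_squares_line(x, y):
    """Fit y = a*x + b by solving the 2x2 normal equations."""
    n = len(x)
    # Left-hand-side matrix and right-hand-side vector of the system above
    A = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    n       ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Example usage on noisy points around y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)
a, b = least_squares_line(x, y)
print(f"a = {a:.3f}, b = {b:.3f}")
```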
Machine-Learning typology

(Statistical) Machine-Learning divides into:
• Supervised Learning (Classification, Regression)
• Unsupervised Learning
• Reinforcement Learning
Supervised vs. Unsupervised learning

Learning is called "supervised" when there are "target" values for every example in the training dataset:
examples = (input, output) pairs = (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
The goal is to build a (generally non-linear) approximate model for interpolation, in order to be able to GENERALIZE to input values other than those in the training set.

"Unsupervised" = when there are NO target values:
dataset = {x_1, x_2, …, x_n}
The goal is typically either to do data mining (unveil structure in the distribution of examples in input space), or to find an output maximizing a given evaluation function.
Reinforcement Learning (RL)

Goal: find a "policy" a_t = π(s_t) that maximizes the expected cumulative reward

Typical use of RL: learn a BEHAVIOR
SUPERVISED LEARNING: regression or classification

[Figure, left panel (regression): the points are the examples ⇒ the fitted curve is the regression of output vs. input]
[Figure, right panel (classification): input = point position, target output = class label (□ = −1, + = +1) ⇒ learnt function label = f(x) (and separation boundary)]

Regression: continuous output(s)
Classification: discrete output(s)
Supervised learning

Examples (input, output): (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
H = (parameterized) family of mathematical models
Hyper-parameters for the training algorithm

LEARNING ALGORITHM (usually based on an optimization technique)
⇒ finds h* ∈ H such that h*(x_i) ≈ y_i

In most cases, h* = argmin_{h∈H} K(h, {(x_i, y_i)}), where K is the cost function
K = Σ_i loss(h(x_i), y_i) [+ regularization term], with typically loss = ||h(x_i) − y_i||²
Cost function and loss function

• Most supervised Machine-Learning algorithms work by minimizing a "cost function"
• The cost function is generally the average over all training examples of a "loss function":
  K = Σ_i loss(h(x_i), y_i)
  (+ sometimes an additional "regularization" term)
• The loss function is usually some measure of the difference between the target value and the output predicted by the learnt model (see the sketch below)
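A minimal sketch of this decomposition, with the quadratic loss named above and an optional regularization term (the L2 penalty, the weight `lam`, and all names are our illustrative choices):

```python
import numpy as np

def quadratic_loss(y_pred, y_true):
    """loss(h(x_i), y_i) = ||h(x_i) - y_i||^2"""
    return np.sum((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

def cost(h, X, Y, params, lam=0.0):
    """K = average of per-example losses (+ optional regularization term)."""
    losses = [quadratic_loss(h(x), y) for x, y in zip(X, Y)]
    return np.mean(losses) + lam * np.sum(params ** 2)

# Example: linear model h(x) = w.x with 2 features
w = np.array([1.0, -2.0])
h = lambda x: np.dot(w, x)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([1.0, -2.0])
print(cost(h, X, Y, params=w, lam=0.01))  # 0.0 average loss + 0.05 penalty
```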
Many different supervised ML approaches & algorithms
• Linear regressions
• Decision trees (ID3 or CART algorithms)
• Bayesian (probabilistic) methods
• …
• Multi-layer neural networks trained with gradient backpropagation
• Support Vector Machines
• Boosting of "weak" classifiers
• Random forests
• Deep Learning (Convolutional Neural Networks,…)
• …
The usual two distinct phases of supervised Machine-Learning

[Diagram, training phase: labelled examples (pedestrians vs. "non-pedestrians", cars vs. "non-cars") are fed to the STATISTICAL MACHINE-LEARNING ALGORITHM, which produces a CLASSIFIER]
[Diagram, recognition phase: the learnt CLASSIFIER maps a new input to a category (class)]
Typology of classification methods

• By similarity ⇒ Nearest Neighbors (kNN)
• By succession of elementary tests ⇒ Decision Trees
• By probabilistic computations (using hypotheses on the distribution of classes) ⇒ Bayesian methods
• By error minimization (gradient descent, etc.) ⇒ Neural Networks, etc.
  Idem + "margin" maximization ⇒ Support Vector Machines (SVM)
• By voting committee (ensemble methods):
  – using trees ⇒ Random Forests
  – using successive weightings of examples ⇒ Boosting
Nearest Neighbors algorithm

Principle of Nearest Neighbors (kNN) for classification: to classify a new point, find the k closest training examples and take a majority vote among their labels (as sketched below).

[Question: What are the main drawbacks of this method?]
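A minimal sketch of this principle, assuming a Euclidean distance in input space (function and variable names are ours):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training examples."""
    # Euclidean distances from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority label

# Usage: three 2-D points with labels 0/1
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(knn_classify(X_train, y_train, np.array([0.5, 0.2]), k=3))  # -> 0
```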
Linear Multivariate Regression
[From slide by …]

Logistic Multivariate Regression
[From slide by …]
If the target output is binary (classification); a sketch of the logistic model is given below.
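A minimal sketch of binary logistic regression trained by gradient descent on the cross-entropy loss (the learning rate and iteration count are illustrative assumptions, not values from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Binary logistic regression: p(y=1|x) = sigmoid(w.x + b),
    trained by gradient descent on the mean cross-entropy loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)           # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)  # gradient of mean cross-entropy
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny usage example on 1-D separable data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
print(sigmoid(X @ w + b))  # probabilities, increasing with x
```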
Different types of classification errors

BUT: False Negatives ("missed") ≠ False Positives!

• Recall: percentage of relevant examples successfully predicted/retrieved
• Precision: percentage of actually relevant examples among all those returned by the classifier
• Error rate = (FP + FN) / (TP + TN + FP + FN)
Accuracy, recall & precision formulas

Recall (sensitivity, True Positive rate) = (# of correct positive predictions) / (# of real positives) = TP / (TP + FN)

Precision = (# of correct positive predictions) / (# of positive predictions) = TP / (TP + FP)

Accuracy ("correctness") [en français, exactitude] = (# of correct predictions) / (total # of examples) = (TP + TN) / (TP + TN + FP + FN)
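These formulas translate directly into code; a minimal sketch (the function name and the example counts are ours, for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall and precision from the confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    recall    = tp / (tp + fn)   # sensitivity, true-positive rate
    precision = tp / (tp + fp)   # reliability of positive predictions
    return accuracy, recall, precision

print(classification_metrics(tp=40, tn=50, fp=5, fn=5))
# -> (0.9, 0.888..., 0.888...)
```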
Classification performance metrics

• Accuracy = proportion of correct predictions
• Recall (sensitivity) ≈ proportion of "not missed" ≈ "completeness" level [exhaustivité]
• Precision ≈ reliability of predicted positive labels (note: precision is the positive predictive value, not the "specificity", which is TN/(TN+FP))
• Confusion matrix: predicted label vs. true label
Precision-recall trade-off and curve

[Plot: precision vs. recall curves of two classifiers]

Classifier C1 predicts better than C2 iff C1 has both better recall AND better precision; but there is a trade-off between recall and precision
⇒ compare precision-recall curves!
For numeric comparison (or if the curves cross each other), use the Area Under Curve (AUC).
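As an illustration, the curve and its AUC can be computed with scikit-learn (our choice of tool here, not one named in the course):

```python
from sklearn.metrics import precision_recall_curve, auc

# y_true: binary ground-truth labels; y_score: classifier scores/probabilities
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("AUC of precision-recall curve:", auc(recall, precision))
```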
Quality measures of a learnt model: loss function and error types

• Quality measure for a learnt model h:
  E(h) = E[ L(h(x), y) ]
  where L(h(x), y) is the "LOSS function", generally = ||h(x) − y||²
• What optimum for h?
  h* = absolute optimum = argmin_h E(h)
  h*_H = optimum within the family H = argmin_{h∈H} E(h)
  h*_{H,n} = optimum in H from a finite set of n examples = argmin_{h∈H} E_n(h),
  where E_n(h) = (1/n) Σ_i L(h(x_i), y_i)

E(h*_{H,n}) − E(h*) = [E(h*_{H,n}) − E(h*_H)] + [E(h*_H) − E(h*)]
                          ESTIMATION error          MODEL error
Formal definition of SUPERVISED LEARNING

"LEARNING = APPROXIMATE + GENERALIZE"

Given a FINITE set of examples (x_1, y_1), (x_2, y_2), …, (x_n, y_n), where x_i ∈ ℝ^d are input vectors and y_i ∈ ℝ^s are target values (given by the "teacher"), find a function h which "approximates AND GENERALIZES as well as possible" the underlying function f such that y_i = f(x_i) + noise
⇒ goal = minimize the GENERALIZATION error
E_gen = ∫ ||h(x) − f(x)||² p(x) dx
(where p(x) = probability distribution of x)
About over-fitting

[Figure: fitting a data set with polynomials of different orders, from Bishop, "Pattern Recognition and Machine Learning"]

Detection of over-fitting for an iterative algorithm:
[Plot: training-set error and validation-set error vs. learning iterations]

The generalization error cannot be directly measured; only the empirical error on examples can be estimated:
E_emp = (Σ_i ||h(x_i) − y_i||²) / n
Machine-Learning methodology: importance of the validation set

To avoid over-fitting and maximize generalization, it is absolutely essential to use some VALIDATION estimation for optimizing the training hyper-parameters (and the stopping criterion):
– either use a separate validation dataset (random split of the data into Training-set + Validation-set)
– or use CROSS-VALIDATION (see the sketch below):
  • Repeat k times: train on a (k−1)/k proportion of the data + estimate the error on the remaining 1/k portion
  • Average the k error estimations

Example, 3-fold cross-validation on data split into S1, S2, S3:
• Train on S1 ∪ S2, then estimate error errS3 on S3
• Train on S1 ∪ S3, then estimate error errS2 on S2
• Train on S2 ∪ S3, then estimate error errS1 on S1
• Average validation error: (errS1 + errS2 + errS3) / 3
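A minimal sketch of k-fold cross-validation, assuming `train_fn(X, Y)` returns a fitted model and `error_fn(model, X, Y)` returns a scalar error (both callback signatures are our illustrative assumptions):

```python
import numpy as np

def k_fold_cv_error(train_fn, error_fn, X, Y, k=3, seed=0):
    """Estimate validation error by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle before splitting
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                     # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], Y[train])
        errors.append(error_fn(model, X[val], Y[val]))
    return np.mean(errors)                 # average of the k estimates
```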
Empirical error and VC-dimension

• In practice, the only error that can be estimated and minimized is the empirical error computed on a finite set of examples:
  E_emp = (Σ_i ||h(x_i) − y_i||²) / n
• According to "regularization theory" and a theoretical result by Vapnik, minimizing E_emp(h) over h ∈ H also tends to minimize E_gen, provided H has a finite VC-dimension

[Question: what is the VC-dimension of {hyperplanes of ℝ^d}?]

VC-dimension: the maximum cardinality v such that there exists a set S of v points for which all dichotomies of S can be performed by some h ∈ H (VC-dim ≈ complexity of H)
Regularization by adding a penalty to the cost function

Vapnik has shown that:
Proba( max_{h∈H} |E_gen(h) − E_emp(h)| ≥ ε ) < G(n, δ, ε)
where n = # of examples, δ = VC-dim, and G decreases as n/δ grows
⇒ to be sure that E_gen decreases when minimizing E_emp, the smaller n is, the smaller the VC-dim δ needs to be.
A possible way to automatically reduce the VC-dim is to modify the cost function into C = E_emp + Ω(h), where Ω(h) penalizes the "complexity" of h
(⇒ reduction of the "effective" VC-dim)
NB: this is ≈ an application of "Occam's razor"! A concrete sketch is given below.
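For instance, with a linear model h(x) = w·x and Ω(h) = lam·||w||², the penalized cost is minimized in closed form (ridge regression); a minimal sketch, where the penalty strength `lam` is an illustrative choice:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize E_emp + Omega(h), with Omega(h) = lam * ||w||^2,
    for a linear model h(x) = w.x (closed-form ridge solution)."""
    d = X.shape[1]
    # Normal equations of the penalized cost: (X^T X + lam * I) w = X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w
```

Larger `lam` shrinks the weights more strongly, i.e. restricts the "effective" complexity of the model family, as the slide describes.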