Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 1
Apprentissage Artificiel
(Statistical Machine-Learning)
General framework + Supervised Learning
Pr. Fabien MOUTARDE Center for Robotics
MINES ParisTech PSL Université Paris
[email protected]
http://people.mines-paristech.fr/fabien.moutarde
Outline
• Intro: What is Statistical Machine-Learning?
• Typology of Machine-Learning
• General formalism for SUPERVISED Learning
• Evaluating learnt models:
metrics for CLASSIFICATION
• Generalization vs. overfitting
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 3
What is
Statistical Machine-Learning?
STATISTICS Data analysis
OPTIMIZATION ARTIFICIAL
INTELLIGENCE
Machine Learning
Statistical Machine-Learning
• One of many sub-fields of Artificial Intelligence
• Application of optimization methods to statistical modelling
• Data-driven mathematical modelling, for automated classification, regression, partitioning/clustering, or decision/behavior rule
Clustering
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 5
• Handwritten characters recognition
• Object category visual recognition
• Speech recognition
Real-world examples of
Machine-Learning applications
3 3
6
… … 6
Handwritten digits recognition
system
Pedestrians « non-pedestrians »
Pedestrian recognition
system
• Multi-factorial forecasting
• Natural Language understanding
• Playing GO!
• MANY MANY MORE…
One of simplest ML algorithm:
Least Squares Linear Regression
• Model: (straight) line y=ax+b (2 parameters a and b)
• Data: n points with target value (x i ,y i ) ÎÂ 2
• Cost function: sum of squares of deviation from line K= S i (y i -a.x i -b) 2
• Algorithm: direct ( or iterative ) solving of linear system
÷ ÷
÷ ÷
ø ö
ç ç ç ç
è æ
÷ =
÷ ø ç ö ç è
× æ
÷ ÷
÷ ÷
ø ö
ç ç ç ç
è æ
å å å
å å
=
=
=
=
=
n
i
i n
i
i i n
i i
n
i i n
i i
y y x b
a n
x
x x
1 1
1
1 1
2
[Question: Where does this equation come from?]
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 7
Regression vs. classification
input
o u tp u t
points = examples è curve = regression
Input = point position target Output =
class label (
¨=-1,+=+1)
ê Function label=f(x) (and separation
boundary)
Regression Classification
Continuous output(s) Discrete output(s)
Simplest classification method:
Nearest Neighbors algorithm
Principle of Nearest Neighbors (kNN) for classification
[What are the main drawbacks of this method??]
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 9
Outline
• Intro: What is Statistical Machine-Learning?
• Typology of Machine-Learning
• General formalism for SUPERVISED Learning
• Evaluating learnt models:
metrics for CLASSIFICATION
• Generalization vs. overfitting
Supervised vs Unsupervised learning
Learning is called "supervised" when there are "target"
values for every example in training dataset:
examples = (input-output) = (x 1 ,y 1 ),(x 2 ,y 2 ),…,(x n ,y n ) The goal is to build a (generally non-linear) approximate model for interpolation, in order to be able to GENERALIZE to input values other than those in training set
"Unsupervised" = when there are NO target values:
dataset = {x 1 , x 2 , … , x n }
The goal is typically either to do datamining (unveil
structure in the distribution of examples in input space), or
to find an output maximizing a given evaluation function
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 12
Machine-Learning TYPOLOGY
SUPERVISED LEARNING:
regression or classification
input
o u tp u t
Examples {(x
i,y
i), i=1,…N } x
i=input, y
i=target output
è Infer: curve = regression y » h(x)
Input {x
i, i=1,…N } = points positions target Output = class label (
¨=-1,+=+1)
è Infer: label=h(x) (and separation boundary)
Regression Classification
y: Continuous output(s) y: Discrete output(s)
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 14
UNSUPERVISED LEARNING:
Clustering vs. Generative model
Clustering
Points = examples
è partitioning in “groups” (colors) based on similarity
Generative model
From examples x n , estimate the PROBABILITY DISTRIBUTION p(x)
è Can GENERATE new examples SIMILAR to those in training set
Reinforcement Learning (RL)
Goal: find a “policy” a t = p (s t ) that maximizes
Typical use of RL: learn a BEHAVIOR
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 16
Outline
• Intro: What is Statistical Machine-Learning?
• Typology of Machine-Learning
• General formalism for SUPERVISED Learning
• Evaluating learnt models:
metrics for CLASSIFICATION
• Generalization vs. overfitting
Many different supervised ML approaches & algorithms
• Linear regressions
• Decision trees (ID3 or CART algorithms)
• Bayesian (probabilistic) methods
• …
• Multi-layer neural networks trained with gradient backpropagation
• Support Vector Machines
• Boosting of "weak" classifiers
• Random forests
• Deep Learning (Convolutional Neural Networks,…)
• …
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 18
Supervised learning
Examples (input-output) (x 1 ,y 1 ), (x 2 ,y 2 ), … , (x n , y n )
H
(parameterized) family of mathematical models
Hyper-parameters for training algorithm
LEARNING ALGORITHM
(usually based on optimization
technique)
h* Î H
so that h*(x i ) » y i
In most cases, h*= argMin hÎH K ( h, {(x i ,y i )} ) where K=cost K = S i loss( h(x i ),y i ) [+ regularization-term] and loss= || h(x i )-y i || 2
Cost function and loss function
Most supervised Machine-Learning algorithms work by minimizing a "cost function"
• The cost function is generally the average over all training examples of a "loss function"
K = S i loss( h(x i ),y i )
(+ sometimes an additional « regularization » term)
• The loss function is usually some measure of the
difference between target value and prediction by
the output of the learnt model
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 20
Linear Multivariate Regression
[From slide by
]Logistic
Multivariate Regression
[From slide by
]If target output is binary (classification)
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 22
Usual two distinct phases of supervised Machine-Learning
Pedestrians « non-pedestrians »
cars « non-cars»
STATISTICAL MACHINE- LEARNING ALGORITHM
CLASSIFIER Input
Category (class)
Recognition Training
Outline
• Intro: What is Statistical Machine-Learning?
• Typology of Machine-Learning
• General formalism for SUPERVISED Learning
• Evaluating learnt models:
metrics for CLASSIFICATION
• Generalization vs. overfitting
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 24
Different types of classification errors
BUT: False Negatives ("missed") ≠ False Positives!
Recall: percentage of relevant examples successfully predicted/retrieved
Precision: percentage of actually relevant examples among all those returned by the classifier
Error rate =
(FP+FN)/(TP+TN+FP+FN)
Accuracy, recall & precision formulas
# of correct positive predictions
# of real positives = TP
TP + FN (sensitivity) =
True Positive rate Recall
Precision # of correct positive predictions
# of positive predictions
= TP
TP + FP (specificity) =
# of correct predictions
Total # of examples = TP + TN
TP + TN + FP + FN Accuracy
("correctness")
[en français, exactitude]
=
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 26
Classification performance metrics
• Accuracy = proportion of correct
• Recall (sensitivity) » proportion of ”not missed”
» ”completeness” level [exhaustivité]
• Precision (specificity) » reliability of predicted labels
• Confusion matrix: predicted label v.s. true label
Precision-recall trade-off and curve
re call
precision
For numeric comparison (or if curves cross each other), Area Under Curve (AUC)
Classifier C1 predicts better than C2 iff C1 has better recall and precision + Trade-off between recall and precision
è Compare precision-recall
curves!
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 28
Quality measures of learnt model:
loss function and error types
• Quality measure for a learnt model h:
Q(h)= E( L(h(x),y) )
where L(h(x),y) is the « LOSS function » generally = ||h(x)-y|| 2
• What optimum for h?
h * absolute optimum = argMin h (E(h))
h * H optimum within H family = argMin h Î H (E(h))
h * H,n optimum in H from finite set of examples =
argMin h Î H (E n (h)) where E n (h)= (1/N) S i (L(h(x i ),y i ))
ESTIMATION error MODEL error
E(h * H,n )-E(h * )= [E(h * H,n )-E(h * H )] + [E(h * H )-E(h * )]
Outline
• Intro: What is Statistical Machine-Learning?
• Typology of Machine-Learning
• General formalism for SUPERVISED Learning
• Evaluating learnt models:
metrics for CLASSIFICATION
• Generalization vs. overfitting
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 30
Formal definition of SUPERVISED LEARNING
”LEARNING = APPROXIMATE + GENERALIZE”
Given a FINITE set of examples (x 1 ,y 1 ),(x 2 ,y 2 ),…,(x n ,y n ) where x i ÎÂ d = input vectors, and y i ÎÂ s = target values
(given by the ”teacher”), find a function h which
"approximates AND GENERALIZES as best as possible"
the underlying function such that y i = f(x i ) + noise Þ goal = to minimize the GENERALIZATION error
E gen = ò ║h(x) -f(x )║ 2 p(x)dx
(where p(x) = probability distribution of x)
About over-fitting
Fitting a data set to different orders of polynomials
Detection of over-fitting for an iterative algorithm
Training set Validation set
Error
The generalization error cannot be directly measured, only empirical error on examples can be estimated:
E emp = ( S i ||h(x i )-y i || 2 ) /n
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 32
Machine-Learning methodology:
importance of validation set
To avoid over-fitting and maximize generalization, absolutely essential to use some VALIDATION estimation, for optimizing training hyper-parameters (and stopping criterion):
– either use a separate validation dataset (random split of data into Training-set + Validation-set)
– or use CROSS-VALIDATION:
• Repeat k times: train on (k-1)/k proportion of data + estimate error on remaining 1/k portion
• Average the k error estimations
S3 S2 S1
3-fold cross-validation:
• Train on S1 È S2 then estimate errS3 error on S3
• Train on S1 È S3 then estimate errS2 error on S2
• Train on S2 È S3 then estimate errS1 error on S1
• Average validation error: (errS1+errS2+errS3)/3
Empirical error and VC-dimension
• In practice, the only error that can be estimated and minimized is the empirical error computed on a finite set of examples:
E emp = ( S i ||h(x i )-y i || 2 ) /n
• According to « regularization theory » and
theoretical result by Vapnik, minimizing E emp (h) within heH shall also minimize E gen if H has a
finite VC-dimension
[VC-dimension {hyperplanes of ! " } ?]
VC-dimension : maximum cardinal v so that for
any set S of v points, all dichotomies of S can be
performed by one h Î H (VC-dim » complexity of H)
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 34
Regularization by adding penalty to the cost function
Vapnik has shown that:
Proba( max hÎH |E gen (h)–E emp (h)| ³ e ) < G (n, d , e )
where n = # of examples and d =VC-dim and G decreases with d /n Þ to be sure that E gen en decreases when minimizing E emp ,
the smaller n is, the smaller the VC-dim d needs to be A possible way to automatically reduce VC-dim is to
modify the cost function into: C=E emp + W (h) where W (h) penalizes « complexity » of h
( Þ reduction of « effective » VC-dim)
NB: » application of ”Occam’s razor” !!
( » ”why do complicated if it can be done simpler?”)
Usual form of
regularization penalty: L 1 norm
In many cases, the complexity (in VC-dim sense) increases with maximum value of its parameters w i è interesting to penalize large values of w i
Usually done by modifying cost function into C = E emp + l S i ( ||w i || )
Example: LASSO = regularized linear regression Min w ( S j ||y j -w.x j )||
2
2 + l ||w|| 1 )
[L 1 -norm penalization of regressor]
NB: if using L 0 (# of NON-ZERO componants) penalization
(instead of L 1 ), we can obtain SPARSE model
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 36
Data augmentation (for classification)
In the case of CLASSIFICATION, over-fitting
avoidance and better generalization can also be favored by DATA AUGMENTATION:
for each labelled example in training set, generate several slightly distorted variants which shall have the same label
Particularly important (and easy) for image inputs or
time-series inputs
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 38
Synthesis on various algorithms for SUPERVISED Machine-Learning
Supervised learning
Examples (input-output) (x 1 ,y 1 ), (x 2 ,y 2 ), … , (x n , y n )
H
(parameterized) family of mathematical models
Hyper-parameters for training algorithm
LEARNING ALGORITHM
(usually based on optimization
technique)
h Î H
so that
h(x i ) » y i
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 41
Summary of main shallow
SUPERVISED learning algorithms
• Decision trees: naturally adapted to symbolic inputs, very fast, good scaling for very high number of classes, "white" box;
BUT noise sensitive
• Multi-layer neural networks: universal approximators, good generalization, easy handling of multi-class;
BUT optimum model NOT guaranteed, many critical hyper-parameters (# hidden neurons, weight init., learning rate, # training epochs,…)
• Support Vector Machines: maths-guaranteed optimal separation, possible handling of structured input (graphs, etc…) via kernel;
BUT not very efficient for multi-class (K times 1-vs-all SVMs, or at least log(K) times C i -vs-C j ), training computation rises quickly with input dim and # of examples O( max(N,D) * min(N,D)^2 )
• Boosting of « weak » classifiers: simple algo, can build strong classifier from any weak classifier, can select features during training;
BUT not very efficient for multi-class (n times 1-vs-all)
• Random forests: OK for symbolic input, robustness to noise, very fast to compute, efficient for large # of classes and high input dim;
BUT training sometimes long
Model type choice criteria for SUPERVISED learning
MLP Neural Network
ConvNets SVM Boosting Decision Tree
Random Forest
Many classes + + -- -- ++
High dimension of input
- + ++
Many examples
REQUIRED(except if transfer- learning)
-
Interpretability
(« white » box) - -- YES
Data OTHER than vectors of values
Only
“grid”
data
Structured (string,
graph)
symbolic symbolic
Robustness to noise and
erroneous labels + + ++ -- ++
Ease/speed of training - --- + ++ +
Handling of features Learn
them
Automated selection
Execution time - +++ +
Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 44