• Aucun résultat trouvé

(Statistical Machine-Learning)

N/A
N/A
Protected

Academic year: 2022

Partager "(Statistical Machine-Learning)"

Copied!
21
0
0

Texte intégral

(1)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 1

Apprentissage Artificiel

(Statistical Machine-Learning)

General framework + Supervised Learning

Pr. Fabien MOUTARDE Center for Robotics

MINES ParisTech PSL Université Paris

[email protected]

http://people.mines-paristech.fr/fabien.moutarde

Outline

Intro: What is Statistical Machine-Learning?

Typology of Machine-Learning

General formalism for SUPERVISED Learning

Evaluating learnt models:

metrics for CLASSIFICATION

Generalization vs. overfitting

(2)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 3

What is

Statistical Machine-Learning?

STATISTICS Data analysis

OPTIMIZATION ARTIFICIAL

INTELLIGENCE

Machine Learning

Statistical Machine-Learning

One of many sub-fields of Artificial Intelligence

Application of optimization methods to statistical modelling

Data-driven mathematical modelling, for automated classification, regression, partitioning/clustering, or decision/behavior rule

Clustering

(3)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 5

Handwritten characters recognition

Object category visual recognition

Speech recognition

Real-world examples of

Machine-Learning applications

3 3

6

6

Handwritten digits recognition

system

Pedestrians « non-pedestrians »

Pedestrian recognition

system

Multi-factorial forecasting

Natural Language understanding

Playing GO!

MANY MANY MORE…

One of simplest ML algorithm:

Least Squares Linear Regression

Model: (straight) line y=ax+b (2 parameters a and b)

Data: n points with target value (x i ,y i ) ÎÂ 2

Cost function: sum of squares of deviation from line K= S i (y i -a.x i -b) 2

Algorithm: direct ( or iterative ) solving of linear system

÷ ÷

÷ ÷

ø ö

ç ç ç ç

è æ

÷ =

÷ ø ç ö ç è

× æ

÷ ÷

÷ ÷

ø ö

ç ç ç ç

è æ

å å å

å å

=

=

=

=

=

n

i

i n

i

i i n

i i

n

i i n

i i

y y x b

a n

x

x x

1 1

1

1 1

2

[Question: Where does this equation come from?]

(4)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 7

Regression vs. classification

input

o u tp u t

points = examples è curve = regression

Input = point position target Output =

class label (

¨

=-1,+=+1)

ê Function label=f(x) (and separation

boundary)

Regression Classification

Continuous output(s) Discrete output(s)

Simplest classification method:

Nearest Neighbors algorithm

Principle of Nearest Neighbors (kNN) for classification

[What are the main drawbacks of this method??]

(5)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 9

Outline

Intro: What is Statistical Machine-Learning?

Typology of Machine-Learning

General formalism for SUPERVISED Learning

Evaluating learnt models:

metrics for CLASSIFICATION

Generalization vs. overfitting

Supervised vs Unsupervised learning

Learning is called "supervised" when there are "target"

values for every example in training dataset:

examples = (input-output) = (x 1 ,y 1 ),(x 2 ,y 2 ),…,(x n ,y n ) The goal is to build a (generally non-linear) approximate model for interpolation, in order to be able to GENERALIZE to input values other than those in training set

"Unsupervised" = when there are NO target values:

dataset = {x 1 , x 2 , , x n }

The goal is typically either to do datamining (unveil

structure in the distribution of examples in input space), or

to find an output maximizing a given evaluation function

(6)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 12

Machine-Learning TYPOLOGY

SUPERVISED LEARNING:

regression or classification

input

o u tp u t

Examples {(x

i

,y

i

), i=1,…N } x

i

=input, y

i

=target output

è Infer: curve = regression y » h(x)

Input {x

i

, i=1,…N } = points positions target Output = class label (

¨

=-1,+=+1)

è Infer: label=h(x) (and separation boundary)

Regression Classification

y: Continuous output(s) y: Discrete output(s)

(7)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 14

UNSUPERVISED LEARNING:

Clustering vs. Generative model

Clustering

Points = examples

è partitioning in “groups” (colors) based on similarity

Generative model

From examples x n , estimate the PROBABILITY DISTRIBUTION p(x)

è Can GENERATE new examples SIMILAR to those in training set

Reinforcement Learning (RL)

Goal: find a “policy” a t = p (s t ) that maximizes

Typical use of RL: learn a BEHAVIOR

(8)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 16

Outline

Intro: What is Statistical Machine-Learning?

Typology of Machine-Learning

General formalism for SUPERVISED Learning

Evaluating learnt models:

metrics for CLASSIFICATION

Generalization vs. overfitting

Many different supervised ML approaches & algorithms

Linear regressions

Decision trees (ID3 or CART algorithms)

Bayesian (probabilistic) methods

Multi-layer neural networks trained with gradient backpropagation

Support Vector Machines

Boosting of "weak" classifiers

Random forests

Deep Learning (Convolutional Neural Networks,…)

(9)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 18

Supervised learning

Examples (input-output) (x 1 ,y 1 ), (x 2 ,y 2 ), … , (x n , y n )

H

(parameterized) family of mathematical models

Hyper-parameters for training algorithm

LEARNING ALGORITHM

(usually based on optimization

technique)

h* Î H

so that h*(x i ) » y i

In most cases, h*= argMin hÎH K ( h, {(x i ,y i )} ) where K=cost K = S i loss( h(x i ),y i ) [+ regularization-term] and loss= || h(x i )-y i || 2

Cost function and loss function

Most supervised Machine-Learning algorithms work by minimizing a "cost function"

The cost function is generally the average over all training examples of a "loss function"

K = S i loss( h(x i ),y i )

(+ sometimes an additional « regularization » term)

The loss function is usually some measure of the

difference between target value and prediction by

the output of the learnt model

(10)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 20

Linear Multivariate Regression

[From slide by

]

Logistic

Multivariate Regression

[From slide by

]

If target output is binary (classification)

(11)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 22

Usual two distinct phases of supervised Machine-Learning

Pedestrians « non-pedestrians »

cars « non-cars»

STATISTICAL MACHINE- LEARNING ALGORITHM

CLASSIFIER Input

Category (class)

Recognition Training

Outline

Intro: What is Statistical Machine-Learning?

Typology of Machine-Learning

General formalism for SUPERVISED Learning

Evaluating learnt models:

metrics for CLASSIFICATION

Generalization vs. overfitting

(12)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 24

Different types of classification errors

BUT: False Negatives ("missed") ≠ False Positives!

Recall: percentage of relevant examples successfully predicted/retrieved

Precision: percentage of actually relevant examples among all those returned by the classifier

Error rate =

(FP+FN)/(TP+TN+FP+FN)

Accuracy, recall & precision formulas

# of correct positive predictions

# of real positives = TP

TP + FN (sensitivity) =

True Positive rate Recall

Precision # of correct positive predictions

# of positive predictions

= TP

TP + FP (specificity) =

# of correct predictions

Total # of examples = TP + TN

TP + TN + FP + FN Accuracy

("correctness")

[en français, exactitude]

=

(13)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 26

Classification performance metrics

Accuracy = proportion of correct

Recall (sensitivity) » proportion of ”not missed”

» ”completeness” level [exhaustivité]

Precision (specificity) » reliability of predicted labels

Confusion matrix: predicted label v.s. true label

Precision-recall trade-off and curve

re call

precision

For numeric comparison (or if curves cross each other), Area Under Curve (AUC)

Classifier C1 predicts better than C2 iff C1 has better recall and precision + Trade-off between recall and precision

è Compare precision-recall

curves!

(14)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 28

Quality measures of learnt model:

loss function and error types

Quality measure for a learnt model h:

Q(h)= E( L(h(x),y) )

where L(h(x),y) is the « LOSS function » generally = ||h(x)-y|| 2

What optimum for h?

h * absolute optimum = argMin h (E(h))

h * H optimum within H family = argMin h Î H (E(h))

h * H,n optimum in H from finite set of examples =

argMin h Î H (E n (h)) where E n (h)= (1/N) S i (L(h(x i ),y i ))

ESTIMATION error MODEL error

E(h * H,n )-E(h * )= [E(h * H,n )-E(h * H )] + [E(h * H )-E(h * )]

Outline

Intro: What is Statistical Machine-Learning?

Typology of Machine-Learning

General formalism for SUPERVISED Learning

Evaluating learnt models:

metrics for CLASSIFICATION

Generalization vs. overfitting

(15)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 30

Formal definition of SUPERVISED LEARNING

”LEARNING = APPROXIMATE + GENERALIZE”

Given a FINITE set of examples (x 1 ,y 1 ),(x 2 ,y 2 ),…,(x n ,y n ) where x i ÎÂ d = input vectors, and y i ÎÂ s = target values

(given by the ”teacher”), find a function h which

"approximates AND GENERALIZES as best as possible"

the underlying function such that y i = f(x i ) + noise Þ goal = to minimize the GENERALIZATION error

E gen = ò ║h(x) -f(x )║ 2 p(x)dx

(where p(x) = probability distribution of x)

About over-fitting

Fitting a data set to different orders of polynomials

Detection of over-fitting for an iterative algorithm

Training set Validation set

Error

The generalization error cannot be directly measured, only empirical error on examples can be estimated:

E emp = ( S i ||h(x i )-y i || 2 ) /n

(16)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 32

Machine-Learning methodology:

importance of validation set

To avoid over-fitting and maximize generalization, absolutely essential to use some VALIDATION estimation, for optimizing training hyper-parameters (and stopping criterion):

either use a separate validation dataset (random split of data into Training-set + Validation-set)

or use CROSS-VALIDATION:

Repeat k times: train on (k-1)/k proportion of data + estimate error on remaining 1/k portion

Average the k error estimations

S3 S2 S1

3-fold cross-validation:

Train on S1 È S2 then estimate errS3 error on S3

Train on S1 È S3 then estimate errS2 error on S2

Train on S2 È S3 then estimate errS1 error on S1

Average validation error: (errS1+errS2+errS3)/3

Empirical error and VC-dimension

In practice, the only error that can be estimated and minimized is the empirical error computed on a finite set of examples:

E emp = ( S i ||h(x i )-y i || 2 ) /n

According to « regularization theory » and

theoretical result by Vapnik, minimizing E emp (h) within heH shall also minimize E gen if H has a

finite VC-dimension

[VC-dimension {hyperplanes of ! " } ?]

VC-dimension : maximum cardinal v so that for

any set S of v points, all dichotomies of S can be

performed by one h Î H (VC-dim » complexity of H)

(17)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 34

Regularization by adding penalty to the cost function

Vapnik has shown that:

Proba( max hÎH |E gen (h)–E emp (h)| ³ e ) < G (n, d , e )

where n = # of examples and d =VC-dim and G decreases with d /n Þ to be sure that E gen en decreases when minimizing E emp ,

the smaller n is, the smaller the VC-dim d needs to be A possible way to automatically reduce VC-dim is to

modify the cost function into: C=E emp + W (h) where W (h) penalizes « complexity » of h

( Þ reduction of « effective » VC-dim)

NB: » application of ”Occam’s razor” !!

( » ”why do complicated if it can be done simpler?”)

Usual form of

regularization penalty: L 1 norm

In many cases, the complexity (in VC-dim sense) increases with maximum value of its parameters w i è interesting to penalize large values of w i

Usually done by modifying cost function into C = E emp + l S i ( ||w i || )

Example: LASSO = regularized linear regression Min w ( S j ||y j -w.x j )||

2

2 + l ||w|| 1 )

[L 1 -norm penalization of regressor]

NB: if using L 0 (# of NON-ZERO componants) penalization

(instead of L 1 ), we can obtain SPARSE model

(18)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 36

Data augmentation (for classification)

In the case of CLASSIFICATION, over-fitting

avoidance and better generalization can also be favored by DATA AUGMENTATION:

for each labelled example in training set, generate several slightly distorted variants which shall have the same label

Particularly important (and easy) for image inputs or

time-series inputs

(19)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 38

Synthesis on various algorithms for SUPERVISED Machine-Learning

Supervised learning

Examples (input-output) (x 1 ,y 1 ), (x 2 ,y 2 ), … , (x n , y n )

H

(parameterized) family of mathematical models

Hyper-parameters for training algorithm

LEARNING ALGORITHM

(usually based on optimization

technique)

h Î H

so that

h(x i ) » y i

(20)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 41

Summary of main shallow

SUPERVISED learning algorithms

Decision trees: naturally adapted to symbolic inputs, very fast, good scaling for very high number of classes, "white" box;

BUT noise sensitive

Multi-layer neural networks: universal approximators, good generalization, easy handling of multi-class;

BUT optimum model NOT guaranteed, many critical hyper-parameters (# hidden neurons, weight init., learning rate, # training epochs,…)

Support Vector Machines: maths-guaranteed optimal separation, possible handling of structured input (graphs, etc…) via kernel;

BUT not very efficient for multi-class (K times 1-vs-all SVMs, or at least log(K) times C i -vs-C j ), training computation rises quickly with input dim and # of examples O( max(N,D) * min(N,D)^2 )

Boosting of « weak » classifiers: simple algo, can build strong classifier from any weak classifier, can select features during training;

BUT not very efficient for multi-class (n times 1-vs-all)

Random forests: OK for symbolic input, robustness to noise, very fast to compute, efficient for large # of classes and high input dim;

BUT training sometimes long

Model type choice criteria for SUPERVISED learning

MLP Neural Network

ConvNets SVM Boosting Decision Tree

Random Forest

Many classes + + -- -- ++

High dimension of input

- + ++

Many examples

REQUIRED

(except if transfer- learning)

-

Interpretability

(« white » box) - -- YES

Data OTHER than vectors of values

Only

“grid”

data

Structured (string,

graph)

symbolic symbolic

Robustness to noise and

erroneous labels + + ++ -- ++

Ease/speed of training - --- + ++ +

Handling of features Learn

them

Automated selection

Execution time - +++ +

(21)

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 44

Some REFERENCE TEXTBOOKS on Statistical Machine-Learning

Introduction au machine learning

C. Azencott, Dunod (2018).

https://www.dunod.com/sciences-techniques/introduction-au-machine-learning-0

The Elements of Statistical Learning (2 nd edition)

T. Hastier, R. Tibshirani & J. Friedman, Springer, 2009.

http://statweb.stanford.edu/~tibs/ElemStatLearn/

Deep Learning

I. Goodfellow, Y. Bengio & A. Courville, MIT press, 2016.

http://www.deeplearningbook.org/

Pattern recognition and Machine-Learning

C. M. Bishop, Springer, 2006.

Introduction to Data Mining

P.N. Tan, M. Steinbach & V. Kumar, AddisonWeasley, 2006.

Apprentissage artificiel : concepts et algorithmes

A. Cornuéjols, L. Miclet & Y. Kodratoff, Eyrolles, 2002.

Références

Documents relatifs

Here, the linear regression suggests that on the average, the salary increases with the number of years since Ph.D:.. • the slope of 985.3 indicates that each additional year since

In the picture below the effect of the capacity |F | is analyzed on the variance term, bias term and total excess risk, for a fixed dataset.. In particular there are two

The slice selected for the response variable comes from the dataset House prices of the Scottish portal 5 with the temporal dimension fixed to 2012, the measure type fixed to mean,

Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Sept.2021

Statistical Machine-Learning: framework + supervised ML, Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2018 1..

Gaussian graphi- cal models (GGMs) [1–3], probabilistic Boolean networks (PBNs) [4–7], Bayesian networks (BNs) [8,9], differential equation based [10,11] and mutual information

If this picture is correct, and the present data do not contradict it, as it is possible to classify the combined dataset with 100% accuracy using a decision tree, then it is

Statistical machine learning has the chance of exploiting statistical regularities in the data that cannot easily be captured by logical statements and can handle