
(1)

An overview of Machine Learning

Alessandro Rudi, Francis Bach – Inria, ENS Paris

Thanks to Pierre Gaillard for the slides!

February 4, 2021

1

(2)

Overview

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

2

(3)

Introduction

(4)

General information

Teachers: Alessandro Rudi and Francis Bach.

Practical sessions: Raphaël Berthier.

Website: https://www.di.ens.fr/appstat/

The class lasts 52 hours (30 hours of lectures + 22 hours of practical sessions) and can be validated for 9 ECTS. Final grade: 50% final exam, 50% homework.

Special online format: inverted classroom!
- Read the lecture notes beforehand
- Short review of the material
- ≈ one mandatory question per student per session
- Have video feeds open

3


(6)

Prerequisites

Linear algebra (matrix operations, linear systems)

Probability (e.g. notion of random variables, conditional expectation)

Basic coding skills in Python: Jupyter notebooks, Anaconda

- If you do not know the Python language, please read (and code the examples of) this 10-minute introduction to Python:
  https://www.stavros.io/tutorials/python/
- For next week: run
  https://www.di.ens.fr/appstat/spring-2020/TP/TD0-prerequisites/crash_test.ipynb

4

(7)

Rule

Ask questions if something is unclear (we like them, especially if you think they are stupid).

5

(8)

What is ML?

Machine Learning: artificial intelligence which can learn and model some phenomena without being explicitly programmed.

Examples of “success stories”:
- Spam classification
- Machine translation
- Speech recognition
- Self-driving cars

6

(9)

What is ML? Examples

Image from Lorenzo Rosasco lecture

7

(10)

Search engines

8


(13)

Text recognition

Image from Francis Bach lecture

9


(15)

Bioinformatics

Large data – Complex data

10

(16)

What is ML?

Machine Learning: artificial intelligence which can learn and model some phenomena without being explicitly programmed

Machine Learning ⊂ Statistics + Computer Sciences

[Venn diagram: Mathematics (statistics, optimization), Computer science, and Domain expertise; the intersections are labeled Applied Machine Learning, Traditional research, Machine learning, and Dangerous software.]

Image from Francis Bach lecture

11

(17)

What is ML?

Machine Learning: artificial intelligence which can learn and model some phenomena without being explicitly programmed

Machine Learning ⊂ Statistics + Computer Sciences

[Diagram: traditional programming takes an input and a hand-written program and produces an output; machine learning takes training data (input–output pairs, e.g. images labeled ‘zebra’ or ‘tiger’) and a learning program, and produces a chosen program / model that maps inputs to outputs.]

Image from Francis Bach lecture

11

(18)

Why is ML successful?

Image from Francis Bach lecture

12

(19)

Sexiest job of the century

13

(20)

The machine learning revolution

Big data / machine learning / data science / artificial intelligence / deep learning, a revolution?

- Technical progress: increase in computing power and storage capacity, lower costs
- Exponential increase in the amount of data: Volume, Variability, Velocity, Veracity
  – IBM: 10^18 bytes created each day — 90% of the data is less than 2 years old
  – In all areas: sciences, industries, personal life
  – In all forms: video, text, clicks, numbers
- Methodological advancement to analyze complex datasets: high-dimensional statistics, deep learning, reinforcement learning, …

14

(21)

Moore’s law: more computing power

15

(22)

Moore’s Law: reduced costs

Limits:
– data-transfer rates do not keep up
– miniaturization → reaching the limits of classical physics → quantum mechanics

16


(25)

Overview of Machine Learning

18


(27)

Overview of most popular machine learning methods

Two main categoriesof machine learning algorithms:

- Supervised learning: predict an output Y from some input data X. The training data has a known label Y.

  Examples:
  – X is a picture, and Y is a cat or a dog
  – X is a picture, and Y ∈ {0, …, 9} is a digit
  – X consists of videos captured by a robot playing table tennis, and Y are the parameters for the robot to return the ball correctly
  – X is a music track and Y are the audio signals of each instrument

- Unsupervised learning: the training data is not labeled and does not have a known result.

  Examples:
  – detect change points in a non-stationary time series
  – detect outliers
  – cluster data in homogeneous groups
  – compress data without losing much information
  – density estimation

- Others: reinforcement learning, semi-supervised learning, online learning, …

19


(29)

Overview of most popular machine learning methods

Two main categories of machine learning algorithms:

- Supervised learning: predict an output Y from some input data X. The training data has a known label Y.

  Classification / regression methods: SVM, logistic regression, random forests, Lasso, Ridge, nearest neighbors, neural networks, …

- Unsupervised learning: the training data is not labeled and does not have a known result.

  Clustering: K-means, the Apriori algorithm, Birch, Ward, spectral clustering, …
  Dimensionality reduction: PCA, ICA, word embeddings, …

- Others: reinforcement learning, semi-supervised learning, online learning, …

19

(30)

Supervised learning

(31)

Supervised learning

Goal: from training data, we want to predict an output Y (or the best action) from the observation of some input X.

Difficulties: Y is not a deterministic function of X. There can be some noise:

Y = f(X) + ε

The function f is unknown and can be sophisticated.
→ hard to perform well systematically

Possible theoretical approaches: perform well
- in the worst case: minimax theory, game theory
- in average, or with high probability

Algorithmic approaches:
- local averages: K-nearest neighbors, decision trees
- empirical risk minimization: linear regression, Lasso, spline regression, SVM, logistic regression
- online learning
- deep learning
- probabilistic models: graphical models, Bayesian methods

20


(32)

Supervised learning: theory

Some data (X, Y) ∈ 𝒳 × 𝒴 is distributed according to a probability distribution P.

We observe training data D_n := {(X_1, Y_1), …, (X_n, Y_n)}.

We must form predictions in a decision set 𝒜 by choosing a prediction function f : 𝒳 (observation) → 𝒜 (decision).

Our performance is measured by a loss function ℓ : 𝒜 × 𝒴 → ℝ. We define the risk

R(f) := E[ℓ(f(X), Y)] = expected loss of f

Goal: minimize R(f) by approaching the performance of the oracle f* = argmin_{f ∈ ℱ} R(f).

             Least squares regression        Classification
𝒜 = 𝒴       ℝ                               {0, 1, …, K − 1}
ℓ(a, y)      (a − y)²                        1_{a ≠ y}
R(f)         E[(f(X) − Y)²]                  P(f(X) ≠ Y)
f*           E[Y | X]                        argmax_k P(Y = k | X)

21

(33)

Supervised learning: theory

22

(34)

Outline

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

23

(35)

Empirical risk minimization

Idea: estimate R(f) thanks to the training data with the empirical risk

R̂_n(f) := (1/n) Σ_{i=1}^n ℓ(f(X_i), Y_i)   (average error on training data)   ≈   R(f) = E[ℓ(f(X), Y)]   (expected error)

We estimate f̂_n by minimizing the empirical risk:

f̂_n ∈ argmin_{f ∈ ℱ} R̂_n(f).

Many methods are based on empirical risk minimization: ordinary least squares, logistic regression, Ridge, Lasso, …

Choosing the right model: ℱ is a set of models which needs to be properly chosen:

R(f̂_n) = min_{f ∈ ℱ} R(f)   (approximation error)   +   R(f̂_n) − min_{f ∈ ℱ} R(f)   (estimation error)

24
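To make the definition concrete (this is not from the slides), the empirical risk of any candidate predictor can be computed directly from the training sample. A minimal NumPy sketch with the square loss; the helper name and the toy data are illustrative:

```python
import numpy as np

def empirical_risk(f, X, Y, loss=lambda a, y: (a - y) ** 2):
    """Average loss of predictor f on the training sample (X_i, Y_i)."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(predictions, Y))

# Toy data: Y = 2X + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=50)
Y = 2 * X + 0.1 * rng.standard_normal(50)

print(empirical_risk(lambda x: 2 * x, X, Y))   # close to the noise variance 0.01
print(empirical_risk(lambda x: 0.0, X, Y))     # a much worse predictor, much larger risk
```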

(36)

Overfitting

[Figure: training error and expected error as a function of the complexity of ℱ; low complexity underfits, high complexity overfits, and the best choice lies in between.]

25

(37)

Overfitting: example in regression

[Figure: data (X, Y) and a fitted linear model Y = aX + b.]

Training error: 0.1 — Expected error: 0.08

26

(38)

Overfitting: example in regression

[Figure: data (X, Y) and a fitted cubic model Y = aX + bX² + cX³ + d.]

Training error: 0.03 — Expected error: 0.05

26

(39)

Overfitting: example in regression

[Figure: data (X, Y) and a fitted polynomial model of degree 14.]

Training error: 0.01 — Expected error: 0.17

26
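The plots are not reproduced here, but the underfitting/overfitting pattern is easy to recreate. A hedged NumPy sketch comparing polynomial degrees 1, 3 and 14 on synthetic data (the data-generating function and the resulting error values are illustrative, not those of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = 1 + np.sin(4 * x) + 0.2 * rng.standard_normal(n)   # unknown f plus noise
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)        # large sample to estimate the expected error

for degree in (1, 3, 14):
    coeffs = np.polyfit(x_train, y_train, deg=degree)      # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.3f}, expected error {test_err:.3f}")
```

The training error keeps decreasing with the degree, while the estimated expected error eventually increases: the degree-14 fit overfits the 30 training points.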

(40)

Least square linear regression

27

(41)

Least squares linear regression

Given training data (X_i, Y_i) for i = 1, …, n, with X_i ∈ ℝ^d and Y_i ∈ ℝ, learn a predictor f such that the expected square loss

E[(f(X) − Y)²]

is small.

We assume here that f is a linear combination of the input x = (x_1, …, x_d):

f_w(x) = Σ_{i=1}^d w_i x_i = w^⊤x

[Figure: scatter plot of (X, Y) with the fitted line f(X).]

28

(42)

Ordinary Least Squares

Input X ∈ ℝ^d, output Y ∈ ℝ, and ℓ is the square loss: ℓ(a, y) = (a − y)².

The Ordinary Least Squares regression (OLS) minimizes the empirical risk

R̂_n(w) = (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)²

This is minimized in w ∈ ℝ^d when X^⊤Xw − X^⊤Y = 0, where X = (X_1, …, X_n)^⊤ ∈ ℝ^{n×d} and Y = (Y_1, …, Y_n)^⊤ ∈ ℝ^n.

[Figure: scatter plot of (X, Y) with the fitted line f(X).]

Assuming X is injective (i.e., X^⊤X is invertible), there is an exact solution

ŵ = (X^⊤X)^{−1} X^⊤Y.

What happens if d ≫ n?

29
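As an illustration (assuming NumPy, which the slides do not prescribe), the normal equations X^⊤Xw = X^⊤Y can be solved directly instead of inverting X^⊤X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = np.arange(1, d + 1, dtype=float)
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# Solve X^T X w = X^T Y (assumes X^T X is invertible, i.e. X has full column rank)
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent and numerically safer: the least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(w_hat, w_lstsq))
```

In practice `np.linalg.lstsq` is the safer default, since it also handles rank-deficient X.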

(43)

Geo-science

30

(44)

Ordinary Least Squares: how to compute ŵ_n?

If X^⊤X is invertible, the OLS estimator has the closed form:

ŵ_n ∈ argmin_w R̂_n(w) = (X^⊤X)^{−1} X^⊤Y.

Question: how to compute it?

- inversion of (X^⊤X) can be prohibitive (the cost is O(d³)!)
- QR decomposition: we write X = QR, with Q an orthogonal matrix and R an upper-triangular matrix. One then needs to solve the triangular linear system Rŵ = Q^⊤Y.
- iterative approximation with convex optimization algorithms [Bottou, Curtis, and Nocedal 2016]: (stochastic) gradient descent, Newton, …

  w_{i+1} = w_i − η ∇R̂_n(w_i)

31
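A sketch of the QR and gradient-descent options on a toy problem (the step size η is picked by hand for this example; this is not the course's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Option 2: QR decomposition, then solve the triangular system R w = Q^T Y
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ Y)   # scipy.linalg.solve_triangular would exploit the structure

# Option 3: gradient descent on R_hat_n(w) = (1/n) * ||Y - X w||^2
def grad(w):
    return (2.0 / n) * X.T @ (X @ w - Y)

w = np.zeros(d)
eta = 0.1                      # step size; would need tuning in general
for _ in range(2000):
    w = w - eta * grad(w)

print(np.max(np.abs(w - w_qr)))   # the two solutions agree up to optimization error
```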

(45)

Beyond vectors?

Image from Lorenzo Rosasco lecture

32

(46)

Beyond vectors?

Image from Lorenzo Rosasco lecture

33

(47)

Beyond vectors?

Image from Lorenzo Rosasco lecture

34

(48)

Classification

Given training data (X_i, Y_i) for i = 1, …, n, with X_i ∈ ℝ^d and Y_i ∈ {0, 1}, learn a classifier f(x) such that

f(X_i) > 0 ⇒ Y_i = +1
f(X_i) < 0 ⇒ Y_i = 0

[Figures: a linearly separable dataset split by the hyperplane w^⊤X = 0 (w^⊤X > 0 on one side, w^⊤X < 0 on the other), and a non-linearly separable dataset split by a curve f(X) = 0.]

35

(49)

Linear classification

We would like to find the best linear classifier such that

f_w(X) = w^⊤X > 0 ⇒ Y = +1
f_w(X) = w^⊤X < 0 ⇒ Y = 0

Empirical risk minimization with the binary loss?

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n 1_{Y_i ≠ 1_{w^⊤X_i > 0}}

This is not convex in w. Very hard to compute!

[Figures: separating hyperplane w^⊤X = 0; the binary loss as a function of w^⊤X is a step function.]

36

(50)

Spam filters

Image from Lorenzo Rosasco lecture

37

(51)

Image recognition

Image from Francis Bach lecture

38

(52)

Pedestrian recognition

https://youtu.be/lU4w0o-caFM

39

(53)

Logistic regression

Idea: replace the loss with a convex loss

ℓ(w^⊤X, y) = y log(1 + e^{−w^⊤X}) + (1 − y) log(1 + e^{w^⊤X})

[Figure: binary, logistic, and hinge losses as functions of the prediction ŷ; separating hyperplane w^⊤X = 0.]

Probabilistic interpretation: based on likelihood maximization of the model

P(Y = 1 | X) = 1 / (1 + e^{−w^⊤X}) ∈ [0, 1]

Satisfied for many distributions of X | Y: Bernoulli, Gaussian, Exponential, …

Computation of the minimizer of the empirical risk (no closed form of the solution):
- use a convex optimization algorithm (Newton, gradient descent, …)

40
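A minimal sketch, assuming scikit-learn is available (the slides do not name a library), that fits the logistic model and recovers the estimated probabilities P(Y = 1 | X):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 300, 2
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ w_true))          # P(Y = 1 | X) under the logistic model
Y = (rng.uniform(size=n) < p).astype(int)

clf = LogisticRegression(C=1e6)                # very large C ~ (almost) no regularization
clf.fit(X, Y)

print(clf.coef_)                               # roughly recovers w_true
print(clf.predict_proba(X[:3])[:, 1])          # estimated P(Y = 1 | X) for the first points
```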

(54)

Support Vector Machine (SVM)

In SVM, the linear separator (hyperplane) is chosen by maximizing the margin, not by minimizing the empirical risk.

[Figure: maximum-margin hyperplane w^⊤X = 0 separating the two classes.]

Sparsity: it only depends on a few training points, called the support vectors.

In practice, we use soft margins because no perfect linear separation is possible.

41
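For comparison, a hedged scikit-learn sketch of a soft-margin linear SVM; the synthetic data and the value C = 1.0 are arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + 2, rng.standard_normal((50, 2)) - 2])
y = np.array([1] * 50 + [0] * 50)

svm = SVC(kernel="linear", C=1.0)   # C controls the softness of the margin
svm.fit(X, y)

print(svm.support_vectors_.shape)   # only a few training points define the separator
print(svm.coef_, svm.intercept_)    # the hyperplane w^T x + b = 0
```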

(55)

Non-linear regression/classification

Until now, we have only considered linear predictions of x = (x_1, …, x_d):

f_w(x) = Σ_{i=1}^d w_i x_i.

But this can perform pretty badly… How to perform non-linear regression?

[Figures: a non-linear regression example; a non-linearly separable classification example with decision boundary f(X) = 0.]

42

(56)

Non-linear regression/classification

Idea: map the input X into a higher-dimensional space where the problem is linear.

Example: given an input x = (x_1, x_2, x_3), perform a linear method on a transformation of the input such as

Φ(x) = (x_1x_1, x_1x_2, …, x_3x_2, x_3x_3) ∈ ℝ^9

Linear transformations of Φ(x) are polynomials of x! The previous methods work by replacing x with Φ(x).

[Figures: a non-linear regression example; a non-linearly separable classification example.]

42
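One convenient way to build such a Φ in code is a polynomial feature map; a sketch assuming scikit-learn (the degree-5 transformation below is illustrative, not the specific Φ of the slide):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.cos(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)   # clearly non-linear in X

# Linear model on Phi(x) = (1, x, x^2, ..., x^5): a polynomial in x
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(X, Y)

print(model.predict(np.array([[0.0], [0.5]])))   # close to cos(0) = 1 and cos(1.5) ~ 0.07
```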

(57)

Example: Word Embedding (Word2Vec)

http://wordrepresentation.appspot.com


43

(58)

Spline regression

A spline of degree p is a function formed by connecting polynomial segments of degree p so that:

- the function is continuous
- the function has p − 1 continuous derivatives
- the p-th derivative is constant between knots

This can be done by choosing the good transformation Φ_p(x) and the right regularization ‖Φ_p(x)‖.

Difficulties: choosing the number of knots and the degree.

44
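As one possible concrete choice of Φ_p (an assumption, not necessarily the construction intended in the slides), here is a truncated-power spline basis fitted by least squares:

```python
import numpy as np

def spline_features(x, knots, degree=3):
    """Truncated power basis: 1, x, ..., x^p, and (x - t)_+^p for each knot t."""
    cols = [x ** k for k in range(degree + 1)]
    cols += [np.maximum(x - t, 0.0) ** degree for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(6 * x) + 0.1 * rng.standard_normal(100)

Phi = spline_features(x, knots=[0.25, 0.5, 0.75])       # cubic spline with 3 knots
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # least squares on Phi(x)
print(Phi.shape, np.mean((Phi @ w - y) ** 2))           # basis size and training error
```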

(59)

Regularization

How to avoid over-fitting if there is not enough data?

[Figure: training and expected error as a function of the complexity of ℱ (underfitting / overfitting / best choice).]

Control the complexity of the solution:
- explicitly, by choosing ℱ small enough: choose the degree of the polynomials, …
- implicitly, by adding a regularization term:

min_{f ∈ ℱ} R̂_n(f) + λ‖f‖²

The higher the norm ‖f‖ is, the more complex the function is.

We do not need to know the best complexity ℱ in advance. The complexity is controlled by λ, which needs to be calibrated.

45

(60)

Ridge regression

The most classic regularization in statistics for linear regression:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d w_i²

The exact solution is unique because the problem is now strongly convex:

ŵ_n = (X^⊤X + nλI)^{−1} X^⊤Y

The regularization parameter λ controls the matrix conditioning:
- if λ = 0: ordinary linear regression
- if λ → ∞: ŵ_n → 0

46
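The closed form above translates directly into code; a minimal NumPy sketch (synthetic data and arbitrary λ values):

```python
import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge estimator (X^T X + n*lam*I)^{-1} X^T Y, as on the slide."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))
Y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(n)

for lam in (0.0, 0.1, 10.0):
    w = ridge(X, Y, lam)
    print(f"lambda = {lam:5.1f}   ||w|| = {np.linalg.norm(w):.3f}")   # shrinks towards 0 as lambda grows
```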

(61)

The Lasso: how to choose among a large set of variables with few observations

The Lasso corresponds to L1 regularization:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d |w_i|

Powerful if d ≫ n: many potential variables, few observations.

ŵ_n is sparse: most of its values will be 0 → can be used to choose variables.

Another formulation of the Lasso: there exists β > 0 such that

ŵ_n ∈ argmin_{‖w‖_1 ≤ β} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)²

47

(62)

The Lasso: how to choose among a large set of variables with few observations

The Lasso corresponds to L1 regularization:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d |w_i|

Powerful if d ≫ n: many potential variables, few observations.

ŵ_n is sparse: most of its values will be 0 → can be used to choose variables.

The Lasso is biased: ŵ_n^⊤X ≠ E[Y | X]. Hence, it is better to:
- perform the Lasso
- choose the variables with ŵ_i ≠ 0
- perform Ridge on this sub-model only

Another solution is the Elastic Net:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ_1 Σ_{i=1}^d |w_i| + λ_2 Σ_{i=1}^d w_i²

Many extensions of the Lasso exist: Group Lasso, …

47

(63)

Lasso: the regularization path

The Lasso corresponds to L1 regularization:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d |w_i|

Plot of the evolution of the coefficients of ŵ_n as a function of λ:

[Figure: Lasso regularization path — coefficients entering and leaving the model as λ decreases.]

48
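Such a path can be computed, for instance, with scikit-learn's `lasso_path` (an assumption; note that its `alpha` scaling convention differs from the λ written above):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, d = 40, 100                                   # d >> n: more candidate variables than observations
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                    # only 3 relevant variables
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# Coefficients of w_hat along a decreasing grid of regularization strengths
alphas, coefs, _ = lasso_path(X, Y, n_alphas=50)
n_selected = (np.abs(coefs) > 1e-8).sum(axis=0)  # number of non-zero coefficients per alpha
print(list(zip(np.round(alphas, 3)[:5], n_selected[:5])))   # strong regularization -> few variables kept
```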

(64)

Probabilistic prediction

In some situations, we are not interested in predicting the average case E[Y | X] only, but in the distribution of Y | X → give a measure of uncertainty on our prediction.

Solution: modify the loss function.
- square loss ℓ(a, y) = (a − y)²: prediction of the expected value
- absolute loss ℓ(a, y) = |a − y|: prediction of the median (50% chance to be above Y, and 50% chance to be below)
- pinball loss ℓ(a, y) = (y − a)(τ − 1_{y < a}): prediction of the τ-quantile ((1 − τ) chance to be above Y and τ chance to be below)

[Figures: pinball loss as a function of (forecast − observation); example of quantile forecasts vs. observations.]

49
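A small sketch of the pinball loss as written above, checking numerically that minimizing its average over a grid recovers the τ-quantile (NumPy and the synthetic Gaussian sample are assumptions):

```python
import numpy as np

def pinball_loss(a, y, tau):
    """Pinball (quantile) loss l(a, y) = (y - a) * (tau - 1_{y < a})."""
    a, y = np.asarray(a, dtype=float), np.asarray(y, dtype=float)
    return (y - a) * (tau - (y < a))

# Minimizing the average pinball loss over a grid approximates the tau-quantile
rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)
grid = np.linspace(-3, 3, 601)
risks = [np.mean(pinball_loss(a, y, tau=0.9)) for a in grid]
print(grid[np.argmin(risks)], np.quantile(y, 0.9))   # both close to the 0.9-quantile of N(0, 1) ~ 1.28
```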


(69)

Outline

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

50

(70)

How to choose the parameters? Test set

All the methods in machine learning depend on learning parameters. How to choose them?

First solution: use a test set.
- randomly choose 70% of the data to be in the training set
- the remainder is a test set

[Figure: the initial training set split into a new training set and a test set.]

We choose the parameter with the smallest error on the test set.

+ very simple
− wastes data: the best method is fitted with only 70% of the data
− with bad luck, the test set might be lucky or unlucky

51
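A hedged sketch of this hold-out procedure, assuming scikit-learn and using a ridge penalty `alpha` as the parameter to calibrate (both are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X[:, 0] + 0.1 * rng.standard_normal(200)

# 70% training / 30% test split, as on the slide
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, random_state=0)

for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, Y_train)
    test_err = np.mean((model.predict(X_test) - Y_test) ** 2)
    print(f"alpha = {alpha:5.2f}   test error = {test_err:.4f}")   # keep the alpha with the smallest test error
```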

(71)

How to choose the parameters? Cross-validation

Cross-validation:
- randomly break the data into K groups
- for each group, use it as a test set and train on the (K − 1) other groups

[Figure: K-fold split of the initial training set into a test set and a new training set.]

We choose the parameter with the smallest average error on the test sets.

+ only 1/K of the data is lost for training
− K times more expensive

In practice: choose K ≈ 10.

52
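The same calibration with 10-fold cross-validation, again assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X[:, 0] + 0.1 * rng.standard_normal(200)

# 10-fold cross-validation: average held-out error over the K folds
for alpha in (0.01, 0.1, 1.0, 10.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, Y,
                             scoring="neg_mean_squared_error", cv=10)
    print(f"alpha = {alpha:5.2f}   CV error = {-scores.mean():.4f}")
```

scikit-learn also bundles this loop in helpers such as `GridSearchCV`.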

(72)

Outline

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

53

(73)

K-Nearest Neighbors

Classify data based on similarity with neighbors.

When observing a new input x, find the k closest training data points to x and, for
- classification: predict the most frequently occurring class
- regression: predict the average value

[Figure: K = 3 example.]

54

(74)

K-Nearest Neighbors

Classify data based on similarity with neighbors.

When observing a new input x, find the k closest training data points to x and, for
- classification: predict the most frequently occurring class
- regression: predict the average value

[Figure: K = 1 example.]

54
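A minimal k-nearest-neighbors sketch, assuming scikit-learn (the synthetic data and k = 3 are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + 1.5, rng.standard_normal((50, 2)) - 1.5])
y = np.array([1] * 50 + [0] * 50)

# k = 3 nearest neighbours, majority vote among them
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[1.0, 1.0], [-1.0, -1.0]]))   # most frequent class among the 3 closest points
```

For regression, `KNeighborsRegressor` averages the values of the k neighbors instead of taking a majority vote.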
