
(1)

An overview of Machine Learning

Alessandro Rudi, Francis Bach – Inria, ENS Paris

Thanks to Pierre Gaillard for the slides!

February 4, 2021

1

(2)

Overview

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

2

(3)

Introduction

(4)

General information

Teachers: Alessandro Rudi and Francis Bach.

Practical sessions: Raphaël Berthier.

Website: https://www.di.ens.fr/appstat/

The class lasts 52 hours (30 hours of lectures + 22 hours of practical sessions) and can be validated for 9 ECTS. Final grade: 50% final exam, 50% homework.

Special online format: inverted classroom!
- Read the lecture notes beforehand
- Short review of the material
- ≈ one mandatory question per student per session
- Have video feeds open

3


(6)

Prerequisites

Linear algebra (matrix operations, linear systems)

Probability (e.g. notion of random variables, conditional expectation)

Basic coding skills in Python: Jupyter notebooks, Anaconda

- If you do not know the Python language, please read (and code the examples of) this 10-minute introduction to Python:
  https://www.stavros.io/tutorials/python/
- For next week: run
  https://www.di.ens.fr/appstat/spring-2020/TP/TD0-prerequisites/crash_test.ipynb

4

(7)

Rule

Ask questions if something is unclear (we like them, especially if you think they are stupid).

5

(8)

What is ML?

Machine Learning: artificial intelligence which can learn and model some phenomena without being explicitly programmed.

Examples of “success stories”:
- Spam classification
- Machine translation
- Speech recognition
- Self-driving cars

6

(9)

What is ML? Examples

Image from Lorenzo Rosasco lecture

7

(10)

Search engines

8


(13)

Text recognition

Image from Francis Bach lecture

9


(15)

Bioinformatics

Large data – Complex data

10

(16)

What is ML?

Machine Learning: artificial intelligence which can learn and model some phenomena without being explicitly programmed

Machine Learning ⊂ Statistics + Computer Sciences

[Venn diagram: Mathematics (statistics, optimization), Computer science, and Domain expertise; the intersections are labeled Applied Machine Learning, Traditional research, Machine learning, and Dangerous software.]

Image from Francis Bach lecture

11

(17)

What is ML?

Machine Learning: artificial intelligence which can learn and model some phenomena without being explicitly programmed

Machine Learning ⊂ Statistics + Computer Sciences

[Diagram: traditional programming takes an input and a hand-written program and produces an output; machine learning takes training data (input–output pairs, e.g. images labeled ‘zebra’ or ‘tiger’) and a learning program, and produces a chosen program / model that maps inputs to outputs.]

Image from Francis Bach lecture

11

(18)

Why is ML successful?

Image from Francis Bach lecture

12

(19)

Sexiest job of the century

13

(20)

The machine learning revolution

Big data / machine learning / data science / artificial intelligence / deep learning, a revolution?

- Technical progress: increase in computing power and storage capacity, lower costs
- Exponential increase in the amount of data: Volume, Variability, Velocity, Veracity
  – IBM: 10^18 bytes created each day — 90% of the data is less than 2 years old
  – In all areas: sciences, industries, personal life
  – In all forms: video, text, clicks, numbers
- Methodological advancement to analyze complex datasets: high-dimensional statistics, deep learning, reinforcement learning, …

14

(21)

Moore’s law: more computing power

15

(22)

Moore’s Law: reduced costs

Limits:
– data-transfer rates do not keep up
– miniaturization → reaching the limits of classical physics → quantum mechanics

16


(25)

Overview of Machine Learning

18


(27)

Overview of most popular machine learning methods

Two main categoriesof machine learning algorithms:

- Supervised learning: predict an output Y from some input data X. The training data has a known label Y.

  Examples:
  – X is a picture, and Y is a cat or a dog
  – X is a picture, and Y ∈ {0, …, 9} is a digit
  – X consists of videos captured by a robot playing table tennis, and Y are the parameters for the robot to return the ball correctly
  – X is a music track and Y are the audio signals of each instrument

- Unsupervised learning: the training data is not labeled and does not have a known result.

  Examples:
  – detect change points in a non-stationary time series
  – detect outliers
  – cluster data in homogeneous groups
  – compress data without losing much information
  – density estimation

- Others: reinforcement learning, semi-supervised learning, online learning, …

19


(29)

Overview of most popular machine learning methods

Two main categories of machine learning algorithms:

- Supervised learning: predict an output Y from some input data X. The training data has a known label Y.

  Classification / regression methods: SVM, logistic regression, random forests, Lasso, Ridge, nearest neighbors, neural networks, …

- Unsupervised learning: the training data is not labeled and does not have a known result.

  Clustering: K-means, the Apriori algorithm, Birch, Ward, spectral clustering, …
  Dimensionality reduction: PCA, ICA, word embeddings, …

- Others: reinforcement learning, semi-supervised learning, online learning, …

19

(30)

Supervised learning

(31)

Supervised learning

Goal: from training data, we want to predict an output Y (or the best action) from the observation of some input X.

Difficulties: Y is not a deterministic function of X. There can be some noise:

Y = f(X) + ε

The function f is unknown and can be sophisticated.
→ hard to perform well systematically

Possible theoretical approaches: perform well
- in the worst case: minimax theory, game theory
- in average, or with high probability

Algorithmic approaches:
- local averages: K-nearest neighbors, decision trees
- empirical risk minimization: linear regression, Lasso, spline regression, SVM, logistic regression
- online learning
- deep learning
- probabilistic models: graphical models, Bayesian methods

20


(32)

Supervised learning: theory

Some data (X, Y) ∈ 𝒳 × 𝒴 is distributed according to a probability distribution P.

We observe training data D_n := {(X_1, Y_1), …, (X_n, Y_n)}.

We must form predictions in a decision set 𝒜 by choosing a prediction function f : 𝒳 (observation) → 𝒜 (decision).

Our performance is measured by a loss function ℓ : 𝒜 × 𝒴 → ℝ. We define the risk

R(f) := E[ℓ(f(X), Y)] = expected loss of f

Goal: minimize R(f) by approaching the performance of the oracle f* = argmin_{f ∈ ℱ} R(f).

             Least squares regression        Classification
𝒜 = 𝒴       ℝ                               {0, 1, …, K − 1}
ℓ(a, y)      (a − y)²                        1_{a ≠ y}
R(f)         E[(f(X) − Y)²]                  P(f(X) ≠ Y)
f*           E[Y | X]                        argmax_k P(Y = k | X)

21

(33)

Supervised learning: theory

22

(34)

Outline

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

23

(35)

Empirical risk minimization

Idea: estimate R(f) thanks to the training data with the empirical risk

R̂_n(f) := (1/n) Σ_{i=1}^n ℓ(f(X_i), Y_i)   (average error on training data)   ≈   R(f) = E[ℓ(f(X), Y)]   (expected error)

We estimate f̂_n by minimizing the empirical risk:

f̂_n ∈ argmin_{f ∈ ℱ} R̂_n(f).

Many methods are based on empirical risk minimization: ordinary least squares, logistic regression, Ridge, Lasso, …

Choosing the right model: ℱ is a set of models which needs to be properly chosen:

R(f̂_n) = min_{f ∈ ℱ} R(f)   (approximation error)   +   R(f̂_n) − min_{f ∈ ℱ} R(f)   (estimation error)

24
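To make the definition concrete (this is not from the slides), the empirical risk of any candidate predictor can be computed directly from the training sample. A minimal NumPy sketch with the square loss; the helper name and the toy data are illustrative:

```python
import numpy as np

def empirical_risk(f, X, Y, loss=lambda a, y: (a - y) ** 2):
    """Average loss of predictor f on the training sample (X_i, Y_i)."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(predictions, Y))

# Toy data: Y = 2X + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=50)
Y = 2 * X + 0.1 * rng.standard_normal(50)

print(empirical_risk(lambda x: 2 * x, X, Y))   # close to the noise variance 0.01
print(empirical_risk(lambda x: 0.0, X, Y))     # a much worse predictor, much larger risk
```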

(36)

Overfitting

[Figure: training error and expected error as a function of the complexity of ℱ; low complexity underfits, high complexity overfits, and the best choice lies in between.]

25

(37)

Overfitting: example in regression

[Figure: data (X, Y) and a fitted linear model Y = aX + b.]

Training error: 0.1 — Expected error: 0.08

26

(38)

Overfitting: example in regression

[Figure: data (X, Y) and a fitted cubic model Y = aX + bX² + cX³ + d.]

Training error: 0.03 — Expected error: 0.05

26

(39)

Overfitting: example in regression

[Figure: data (X, Y) and a fitted polynomial model of degree 14.]

Training error: 0.01 — Expected error: 0.17

26
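The plots are not reproduced here, but the underfitting/overfitting pattern is easy to recreate. A hedged NumPy sketch comparing polynomial degrees 1, 3 and 14 on synthetic data (the data-generating function and the resulting error values are illustrative, not those of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = 1 + np.sin(4 * x) + 0.2 * rng.standard_normal(n)   # unknown f plus noise
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)        # large sample to estimate the expected error

for degree in (1, 3, 14):
    coeffs = np.polyfit(x_train, y_train, deg=degree)      # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.3f}, expected error {test_err:.3f}")
```

The training error keeps decreasing with the degree, while the estimated expected error eventually increases: the degree-14 fit overfits the 30 training points.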

(40)

Least square linear regression

27

(41)

Least squares linear regression

Given training data (X_i, Y_i) for i = 1, …, n, with X_i ∈ ℝ^d and Y_i ∈ ℝ, learn a predictor f such that the expected square loss

E[(f(X) − Y)²]

is small.

We assume here that f is a linear combination of the input x = (x_1, …, x_d):

f_w(x) = Σ_{i=1}^d w_i x_i = w^⊤x

[Figure: scatter plot of (X, Y) with the fitted line f(X).]

28

(42)

Ordinary Least Squares

Input X ∈ ℝ^d, output Y ∈ ℝ, and ℓ is the square loss: ℓ(a, y) = (a − y)².

The Ordinary Least Squares regression (OLS) minimizes the empirical risk

R̂_n(w) = (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)²

This is minimized in w ∈ ℝ^d when X^⊤Xw − X^⊤Y = 0, where X = (X_1, …, X_n)^⊤ ∈ ℝ^{n×d} and Y = (Y_1, …, Y_n)^⊤ ∈ ℝ^n.

[Figure: scatter plot of (X, Y) with the fitted line f(X).]

Assuming X is injective (i.e., X^⊤X is invertible), there is an exact solution

ŵ = (X^⊤X)^{−1} X^⊤Y.

What happens if d ≫ n?

29
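As an illustration (assuming NumPy, which the slides do not prescribe), the normal equations X^⊤Xw = X^⊤Y can be solved directly instead of inverting X^⊤X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = np.arange(1, d + 1, dtype=float)
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# Solve X^T X w = X^T Y (assumes X^T X is invertible, i.e. X has full column rank)
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent and numerically safer: the least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(w_hat, w_lstsq))
```

In practice `np.linalg.lstsq` is the safer default, since it also handles rank-deficient X.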

(43)

Geo-science

30

(44)

Ordinary Least Squares: how to compute ŵ_n?

If X^⊤X is invertible, the OLS estimator has the closed form:

ŵ_n ∈ argmin_w R̂_n(w) = (X^⊤X)^{−1} X^⊤Y.

Question: how to compute it?

- inversion of (X^⊤X) can be prohibitive (the cost is O(d³)!)
- QR decomposition: we write X = QR, with Q an orthogonal matrix and R an upper-triangular matrix. One then needs to solve the triangular linear system Rŵ = Q^⊤Y.
- iterative approximation with convex optimization algorithms [Bottou, Curtis, and Nocedal 2016]: (stochastic) gradient descent, Newton, …

  w_{i+1} = w_i − η ∇R̂_n(w_i)

31
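A sketch of the QR and gradient-descent options on a toy problem (the step size η is picked by hand for this example; this is not the course's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Option 2: QR decomposition, then solve the triangular system R w = Q^T Y
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ Y)   # scipy.linalg.solve_triangular would exploit the structure

# Option 3: gradient descent on R_hat_n(w) = (1/n) * ||Y - X w||^2
def grad(w):
    return (2.0 / n) * X.T @ (X @ w - Y)

w = np.zeros(d)
eta = 0.1                      # step size; would need tuning in general
for _ in range(2000):
    w = w - eta * grad(w)

print(np.max(np.abs(w - w_qr)))   # the two solutions agree up to optimization error
```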

(45)

Beyond vectors?

Image from Lorenzo Rosasco lecture

32

(46)

Beyond vectors?

Image from Lorenzo Rosasco lecture

33

(47)

Beyond vectors?

Image from Lorenzo Rosasco lecture

34

(48)

Classification

Given training data (X_i, Y_i) for i = 1, …, n, with X_i ∈ ℝ^d and Y_i ∈ {0, 1}, learn a classifier f(x) such that

f(X_i) > 0 ⇒ Y_i = +1
f(X_i) < 0 ⇒ Y_i = 0

[Figures: a linearly separable dataset split by the hyperplane w^⊤X = 0 (w^⊤X > 0 on one side, w^⊤X < 0 on the other), and a non-linearly separable dataset split by a curve f(X) = 0.]

35

(49)

Linear classification

We would like to find the best linear classifier such that

f_w(X) = w^⊤X > 0 ⇒ Y = +1
f_w(X) = w^⊤X < 0 ⇒ Y = 0

Empirical risk minimization with the binary loss?

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n 1_{Y_i ≠ 1_{w^⊤X_i > 0}}

This is not convex in w. Very hard to compute!

[Figures: separating hyperplane w^⊤X = 0; the binary loss as a function of w^⊤X is a step function.]

36

(50)

Spam filters

Image from Lorenzo Rosasco lecture

37

(51)

Image recognition

Image from Francis Bach lecture

38

(52)

Pedestrian recognition

https://youtu.be/lU4w0o-caFM

39

(53)

Logistic regression

Idea: replace the loss with a convex loss

ℓ(w^⊤X, y) = y log(1 + e^{−w^⊤X}) + (1 − y) log(1 + e^{w^⊤X})

[Figure: binary, logistic, and hinge losses as functions of the prediction ŷ; separating hyperplane w^⊤X = 0.]

Probabilistic interpretation: based on likelihood maximization of the model

P(Y = 1 | X) = 1 / (1 + e^{−w^⊤X}) ∈ [0, 1]

Satisfied for many distributions of X | Y: Bernoulli, Gaussian, Exponential, …

Computation of the minimizer of the empirical risk (no closed form of the solution):
- use a convex optimization algorithm (Newton, gradient descent, …)

40
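A minimal sketch, assuming scikit-learn is available (the slides do not name a library), that fits the logistic model and recovers the estimated probabilities P(Y = 1 | X):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 300, 2
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ w_true))          # P(Y = 1 | X) under the logistic model
Y = (rng.uniform(size=n) < p).astype(int)

clf = LogisticRegression(C=1e6)                # very large C ~ (almost) no regularization
clf.fit(X, Y)

print(clf.coef_)                               # roughly recovers w_true
print(clf.predict_proba(X[:3])[:, 1])          # estimated P(Y = 1 | X) for the first points
```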

(54)

Support Vector Machine (SVM)

In SVM, the linear separator (hyperplane) is chosen by maximizing the margin, not by minimizing the empirical risk.

[Figure: maximum-margin hyperplane w^⊤X = 0 separating the two classes.]

Sparsity: it only depends on a few training points, called the support vectors.

In practice, we use soft margins because no perfect linear separation is possible.

41
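For comparison, a hedged scikit-learn sketch of a soft-margin linear SVM; the synthetic data and the value C = 1.0 are arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + 2, rng.standard_normal((50, 2)) - 2])
y = np.array([1] * 50 + [0] * 50)

svm = SVC(kernel="linear", C=1.0)   # C controls the softness of the margin
svm.fit(X, y)

print(svm.support_vectors_.shape)   # only a few training points define the separator
print(svm.coef_, svm.intercept_)    # the hyperplane w^T x + b = 0
```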

(55)

Non-linear regression/classification

Until now, we have only considered linear predictions of x = (x_1, …, x_d):

f_w(x) = Σ_{i=1}^d w_i x_i.

But this can perform pretty badly… How to perform non-linear regression?

[Figures: a non-linear regression example; a non-linearly separable classification example with decision boundary f(X) = 0.]

42

(56)

Non-linear regression/classification

Idea: map the input X into a higher-dimensional space where the problem is linear.

Example: given an input x = (x_1, x_2, x_3), perform a linear method on a transformation of the input such as

Φ(x) = (x_1x_1, x_1x_2, …, x_3x_2, x_3x_3) ∈ ℝ^9

Linear transformations of Φ(x) are polynomials of x! The previous methods work by replacing x with Φ(x).

[Figures: a non-linear regression example; a non-linearly separable classification example.]

42
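One convenient way to build such a Φ in code is a polynomial feature map; a sketch assuming scikit-learn (the degree-5 transformation below is illustrative, not the specific Φ of the slide):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.cos(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)   # clearly non-linear in X

# Linear model on Phi(x) = (1, x, x^2, ..., x^5): a polynomial in x
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(X, Y)

print(model.predict(np.array([[0.0], [0.5]])))   # close to cos(0) = 1 and cos(1.5) ~ 0.07
```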

(57)

Example: Word Embedding (Word2Vec)

http://wordrepresentation.appspot.com


43

(58)

Spline regression

A spline of degree p is a function formed by connecting polynomial segments of degree p so that:

- the function is continuous
- the function has p − 1 continuous derivatives
- the p-th derivative is constant between knots

This can be done by choosing the good transformation Φ_p(x) and the right regularization ‖Φ_p(x)‖.

Difficulties: choosing the number of knots and the degree.

44
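As one possible concrete choice of Φ_p (an assumption, not necessarily the construction intended in the slides), here is a truncated-power spline basis fitted by least squares:

```python
import numpy as np

def spline_features(x, knots, degree=3):
    """Truncated power basis: 1, x, ..., x^p, and (x - t)_+^p for each knot t."""
    cols = [x ** k for k in range(degree + 1)]
    cols += [np.maximum(x - t, 0.0) ** degree for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(6 * x) + 0.1 * rng.standard_normal(100)

Phi = spline_features(x, knots=[0.25, 0.5, 0.75])       # cubic spline with 3 knots
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # least squares on Phi(x)
print(Phi.shape, np.mean((Phi @ w - y) ** 2))           # basis size and training error
```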

(59)

Regularization

How to avoid over-fitting if there is not enough data?

[Figure: training and expected error as a function of the complexity of ℱ (underfitting / overfitting / best choice).]

Control the complexity of the solution:
- explicitly, by choosing ℱ small enough: choose the degree of the polynomials, …
- implicitly, by adding a regularization term:

min_{f ∈ ℱ} R̂_n(f) + λ‖f‖²

The higher the norm ‖f‖ is, the more complex the function is.

We do not need to know the best complexity ℱ in advance. The complexity is controlled by λ, which needs to be calibrated.

45

(60)

Ridge regression

The most classic regularization in statistics for linear regression:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d w_i²

The exact solution is unique because the problem is now strongly convex:

ŵ_n = (X^⊤X + nλI)^{−1} X^⊤Y

The regularization parameter λ controls the matrix conditioning:
- if λ = 0: ordinary linear regression
- if λ → ∞: ŵ_n → 0

46
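The closed form above translates directly into code; a minimal NumPy sketch (synthetic data and arbitrary λ values):

```python
import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge estimator (X^T X + n*lam*I)^{-1} X^T Y, as on the slide."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))
Y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(n)

for lam in (0.0, 0.1, 10.0):
    w = ridge(X, Y, lam)
    print(f"lambda = {lam:5.1f}   ||w|| = {np.linalg.norm(w):.3f}")   # shrinks towards 0 as lambda grows
```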

(61)

The Lasso: how to choose among a large set of variables with few observations

The Lasso corresponds to L1 regularization:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d |w_i|

Powerful if d ≫ n: many potential variables, few observations.

ŵ_n is sparse: most of its values will be 0 → can be used to choose variables.

Another formulation of the Lasso: there exists β > 0 such that

ŵ_n ∈ argmin_{‖w‖_1 ≤ β} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)²

47

(62)

The Lasso: how to choose among a large set of variables with few observations

The Lasso corresponds to L1 regularization:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d |w_i|

Powerful if d ≫ n: many potential variables, few observations.

ŵ_n is sparse: most of its values will be 0 → can be used to choose variables.

The Lasso is biased: ŵ_n^⊤X ≠ E[Y | X]. Hence, it is better to:
- perform the Lasso
- choose the variables with ŵ_i ≠ 0
- perform Ridge on this sub-model only

Another solution is the Elastic Net:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ_1 Σ_{i=1}^d |w_i| + λ_2 Σ_{i=1}^d w_i²

Many extensions of the Lasso exist: Group Lasso, …

47

(63)

Lasso: the regularization path

The Lasso corresponds to L1 regularization:

ŵ_n = argmin_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n (Y_i − w^⊤X_i)² + λ Σ_{i=1}^d |w_i|

Plot of the evolution of the coefficients of ŵ_n as a function of λ:

[Figure: Lasso regularization path — coefficients entering and leaving the model as λ decreases.]

48
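Such a path can be computed, for instance, with scikit-learn's `lasso_path` (an assumption; note that its `alpha` scaling convention differs from the λ written above):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, d = 40, 100                                   # d >> n: more candidate variables than observations
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                    # only 3 relevant variables
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# Coefficients of w_hat along a decreasing grid of regularization strengths
alphas, coefs, _ = lasso_path(X, Y, n_alphas=50)
n_selected = (np.abs(coefs) > 1e-8).sum(axis=0)  # number of non-zero coefficients per alpha
print(list(zip(np.round(alphas, 3)[:5], n_selected[:5])))   # strong regularization -> few variables kept
```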

(64)

Probabilistic prediction

In some situations, we are not interested in predicting the average case E[Y | X] only, but in the distribution of Y | X → give a measure of uncertainty on our prediction.

Solution: modify the loss function.
- square loss ℓ(a, y) = (a − y)²: prediction of the expected value
- absolute loss ℓ(a, y) = |a − y|: prediction of the median (50% chance to be above Y, and 50% chance to be below)
- pinball loss ℓ(a, y) = (y − a)(τ − 1_{y < a}): prediction of the τ-quantile ((1 − τ) chance to be above Y and τ chance to be below)

[Figures: pinball loss as a function of (forecast − observation); example of quantile forecasts vs. observations.]

49
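A small sketch of the pinball loss as written above, checking numerically that minimizing its average over a grid recovers the τ-quantile (NumPy and the synthetic Gaussian sample are assumptions):

```python
import numpy as np

def pinball_loss(a, y, tau):
    """Pinball (quantile) loss l(a, y) = (y - a) * (tau - 1_{y < a})."""
    a, y = np.asarray(a, dtype=float), np.asarray(y, dtype=float)
    return (y - a) * (tau - (y < a))

# Minimizing the average pinball loss over a grid approximates the tau-quantile
rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)
grid = np.linspace(-3, 3, 601)
risks = [np.mean(pinball_loss(a, y, tau=0.9)) for a in grid]
print(grid[np.argmin(risks)], np.quantile(y, 0.9))   # both close to the 0.9-quantile of N(0, 1) ~ 1.28
```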


(69)

Outline

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

50

(70)

How to choose the parameters? Test set

All the methods in machine learning depend on learning parameters. How to choose them?

First solution: use a test set.
- randomly choose 70% of the data to be in the training set
- the remainder is a test set

[Figure: the initial training set split into a new training set and a test set.]

We choose the parameter with the smallest error on the test set.

+ very simple
− wastes data: the best method is fitted with only 70% of the data
− with bad luck, the test set might be lucky or unlucky

51
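A hedged sketch of this hold-out procedure, assuming scikit-learn and using a ridge penalty `alpha` as the parameter to calibrate (both are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X[:, 0] + 0.1 * rng.standard_normal(200)

# 70% training / 30% test split, as on the slide
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, random_state=0)

for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, Y_train)
    test_err = np.mean((model.predict(X_test) - Y_test) ** 2)
    print(f"alpha = {alpha:5.2f}   test error = {test_err:.4f}")   # keep the alpha with the smallest test error
```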

(71)

How to choose the parameters? Cross-validation

Cross-validation:
- randomly break the data into K groups
- for each group, use it as a test set and train on the (K − 1) other groups

[Figure: K-fold split of the initial training set into a test set and a new training set.]

We choose the parameter with the smallest average error on the test sets.

+ only 1/K of the data is lost for training
− K times more expensive

In practice: choose K ≈ 10.

52
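The same calibration with 10-fold cross-validation, again assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X[:, 0] + 0.1 * rng.standard_normal(200)

# 10-fold cross-validation: average held-out error over the K folds
for alpha in (0.01, 0.1, 1.0, 10.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, Y,
                             scoring="neg_mean_squared_error", cv=10)
    print(f"alpha = {alpha:5.2f}   CV error = {-scores.mean():.4f}")
```

scikit-learn also bundles this loop in helpers such as `GridSearchCV`.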

(72)

Outline

Introduction

Supervised learning
  Empirical risk minimization: OLS, Logistic regression, Ridge, Lasso, Quantile regression
  Calibration of the parameters: cross-validation
  Local averages
  Deep learning

Unsupervised learning
  Clustering
  Dimensionality Reduction

Algorithms

Planning of the class

53

(73)

K-Nearest Neighbors

Classify data based on similarity with neighbors.

When observing a new input x, find the k closest training data points to x and, for
- classification: predict the most frequently occurring class
- regression: predict the average value

[Figure: K = 3 example.]

54

(74)

K-Nearest Neighbors

Classify data based on similarity with neighbors.

When observing a new input x, find the k closest training data points to x and, for
- classification: predict the most frequently occurring class
- regression: predict the average value

[Figure: K = 1 example.]

54
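A minimal k-nearest-neighbors sketch, assuming scikit-learn (the synthetic data and k = 3 are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + 1.5, rng.standard_normal((50, 2)) - 1.5])
y = np.array([1] * 50 + [0] * 50)

# k = 3 nearest neighbours, majority vote among them
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[1.0, 1.0], [-1.0, -1.0]]))   # most frequent class among the 3 closest points
```

For regression, `KNeighborsRegressor` averages the values of the k neighbors instead of taking a majority vote.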
