Supervised machine learning: An introduction
Romain Raveaux
romain.raveaux@univ-tours.fr Maître de conférences
Université de Tours
Laboratoire d’informatique (LIFAT) Equipe RFAI
Outline
• Introduction
• Supervised/Reinforcement/Unsupervised
• Probability
• Generative/Discriminative models
• Fitting probability models
• Fitting discriminative models
• Fitting generative models
• Inference
Learning paradigms
• Supervised learning
• the task of learning a function f
• that maps an input to an output: f: X → Y
• based on example input-output pairs: {(x_i, y_i)}_{i=1}^{N}, with x_i ∈ X and y_i ∈ Y
• Reinforcement learning: f = f_1 ∘ f_2 ∘ … ∘ f_k : X → Ȳ
• differs from supervised learning
• f is composed of sub-functions (in X) that produce an output ȳ_i
• the feedback is incomplete (ȳ_i is not completely given: ȳ_i ⊂ y_i ∈ Y)
• Unsupervised learning:
• the task of learning a function f
• that maps an input to an output: f: X → Z
• based on input examples: {x_i}_{i=1}^{N}, with x_i ∈ X
• no labels (y_i) are given
• Self-supervised learning
• is autonomous supervised learning
• eliminates the prerequisite of having humans label the data
Supervised learning
Regression
Credit : https://openclassrooms.com/fr/courses/4525266
Supervised learning
Classification
Credit : https://towardsdatascience.com/supervised-learning-basics-of-classification-and-main-algorithms-c16b06806cd3
Self-supervised learning
Supervised learning
• Regression: f: X → Y and y ∈ ℝ
• Classification: f: X → Y and y ∈ {a, b, c, d}
• Structured classification: f: X → Y and y ∈ {a, b, c, d}^{N×M}
• Structured regression: f: X → Y and y ∈ ℝ^{N×M}
• Structured regression: f: X → Y and y ∈ G (a graph space)
• Structured learning when the output is complex:
• Segmentation mask for instance
Probability = rational way to deal with uncertainty
Pr(X, Y): joint probability
Pr(Y|X): conditional probability
Pr(X): marginal probability
Probability : Joint, Conditional and Marginal
Probability : Link between Joint, Conditional and Marginal : The product rule
Product rule
Link between Joint and Conditional probabilities
Conditional probability
• Pr(X|Y = 1) :
• The conditional probability Pr(X|Y = 1) can be recovered from the joint distribution Pr(X,Y).
• In particular, we examine the appropriate slice Pr(X,Y=1) of the joint distribution
• The values in the slice tell us about the relative probability that X takes various values having observed Y=1,
• But they do not themselves form a valid probability distribution
• because they do not sum to one
• We hence normalize by the total probability in the slice :
• Pr(X | Y = 1) = Pr(X, Y = 1) / Pr(Y = 1)
Probability : Link between Joint and Marginal probabilities : The sum rule
Sum rule
Link between Joint and Marginal probabilities
Sum rule: Pr(X) = Σ_Y Pr(X, Y)
The sum rule is one of the rules of probability; it is also called marginalization.
Bayes’ rule : provides a quantification of uncertainty
Symmetry
Note that the likelihood (p(X|Y)) is not a probability distribution over Y, and its integral with respect to Y does not (necessarily) equal one.
p(Y|X) p(X) = p(X|Y) p(Y)   (product rule, applied in both directions thanks to the symmetry of the joint)
Bayes’ rule: p(Y|X) = p(X|Y) p(Y) / p(X)
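As a small numerical illustration (the joint table below is made up for this note, not taken from the slides), the sum rule, the slice-and-normalize conditional, and Bayes' rule can be checked with a few lines of Python:

import numpy as np

# Rows index X in {0, 1, 2}; columns index Y in {0, 1}. Entries are Pr(X, Y) and sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)              # sum rule: Pr(X) = sum_Y Pr(X, Y)
p_y = joint.sum(axis=0)              # sum rule: Pr(Y) = sum_X Pr(X, Y)

p_x_given_y1 = joint[:, 1] / p_y[1]            # slice and normalize: Pr(X | Y=1) = Pr(X, Y=1) / Pr(Y=1)
p_y1_given_x = p_x_given_y1 * p_y[1] / p_x     # Bayes' rule: Pr(Y=1 | X) = Pr(X | Y=1) Pr(Y=1) / Pr(X)

# Product rule check: Pr(X, Y=1) = Pr(X | Y=1) Pr(Y=1)
assert np.allclose(joint[:, 1], p_x_given_y1 * p_y[1])
print(p_x, p_y, p_x_given_y1, p_y1_given_x)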
Expectation of a function f(x)
Entropy
• f(x) = −log p(x)
• the negative logarithm of p(x):
• low-probability events x correspond to high information content.
• Entropy = the expectation of f(x)
• For a fixed mean and variance, the continuous distribution that maximizes the entropy is the Gaussian!
Discrete case: H[x] = −Σ_x p(x) log p(x). Continuous case: H[x] = −∫ p(x) log p(x) dx.
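A quick sketch of these definitions in the discrete case (the probability values below are illustrative):

import numpy as np

p = np.array([0.2, 0.3, 0.5])   # a discrete distribution Pr(x)
info = -np.log(p)               # f(x) = -log p(x): low-probability events carry high information content
entropy = np.sum(p * info)      # H[x] = E[f(x)] = -sum_x p(x) log p(x)
print(info, entropy)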
Learning : We need Data
• A database : Training set
• 𝑥𝑖 ∶ a real or discrete datum, a measurement of the world.
• 𝑦𝑖 ∶ a real or discrete state of the world. A target, a ground truth value, …
Learning = Modelling data
• Discriminative models: Pr(y|x). Generative models: Pr(x|y),
• or just Pr(x) if there are no output labels (this is also a generative model)
Supervised learning
Credit: https://developers.google.com/machine-learning/gan/images/generative_v_discriminative.png
Discriminative: Pr(y|x). Generative: Pr(x|y).
Inference
• We take a new set of measurements
• A new datum: x_new
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤 (to infer from 𝑥𝑛𝑒𝑤)
• We take a datum x_new and use it to infer the state of the world y.
Inference with a discriminative model
• We take a new set of measurements
• A new datum : 𝑥𝑛𝑒𝑤
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤
• Inference with a discriminative model:
• Easy as : Pr(y|𝑥𝑛𝑒𝑤)
• Given a new datum, we obtain the probability distribution of the world states
• world states = classes in a classification context
• world states = the domain of real values in a regression context
Inference with a generative model
• We take a new set of measurements
• A new datum : 𝑥𝑛𝑒𝑤
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤
• Inference with a generative model:
• The generative model : Pr(𝑥𝑛𝑒𝑤|y)
• Pr(y | x_new) = Pr(x_new | y) × Pr(y) / Pr(x_new)
Bayes’ rule
Supervised learning : What can we learn?
We want to find these distributions:
• Pr(y|x) or
• Pr(x|y)
Where can we act?
1. On the parameters of the distributions.
2. Let’s call them W
Parametrized distributions are :
• Pr(y|x,W) or
• Pr(x|y,W)
Supervised learning : fitting probability models
We want to find the parameters (W) such that the probability distribution fits the data.
Let us take an example !!! A Gaussian distribution.
Supervised learning : fitting probability models = Finding 𝜇, 𝜎2
Fitting probability models : How to fit the data ?
• We find the parameters to fit the data
• Three main ways :
• 1°) Maximum likelihood
• Assuming each data point was drawn independently from the same distribution (i.i.d.: independent and identically distributed)
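Under the i.i.d. assumption the likelihood of the whole training set factorizes, and taking the log turns the product into a sum (a standard rewriting of the criterion, consistent with the assumption above):

W_ML = argmax_W Pr(x_1, …, x_N | W) = argmax_W Π_{i=1}^{N} Pr(x_i | W) = argmax_W Σ_{i=1}^{N} log Pr(x_i | W)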
Supervised learning : fitting probability models
• Three ways :
• 2°) Maximum a posteriori
• Assuming each data point was drawn independently from the distribution (i.i.d)
Bayes’ rule: Pr(W | x_1, …, x_N) ∝ Pr(x_1, …, x_N | W) Pr(W)
Supervised learning : fitting probability models
• Three ways :
• 3°) The Bayesian approach
• Beyond the scope of this lecture
The parameter W is treated as a random variable.
This view makes a lot of sense:
• W depends on the data set {𝑥1, 𝑥2, 𝑥3, …}
which itself is randomly sampled.
For any prior distribution Pr(W) over the space of weight vectors, one can derive the posterior probability of W for given observations
Supervised learning : fitting probability models : On a Gaussian
• Fitting by maximum likelihood:
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the Negative Log Likelihood
How to solve this optimization problem ? How to find the minimum of a function ?
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the Negative Log Likelihood
How to solve this optimization problem ? How to find the minimum of a function ? Find where the derivative is equal to zero !!!!
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the Negative Log Likelihood
The problem: we search for the parameter μ.
The solution: for this problem, there is a closed-form, analytical solution: μ̂ = (1/N) Σ_{i=1}^{N} x_i.
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the negative log likelihood. The problem: we search for the parameter σ².
The solution: σ̂² = (1/N) Σ_{i=1}^{N} (x_i − μ̂)².
The full steps to obtain this result can be found at:
http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html
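A minimal sketch of these closed-form solutions in Python (the data is synthetic, drawn with arbitrary true parameters, just to check the estimators):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # assumed i.i.d. Gaussian data

mu_hat = x.mean()                      # closed-form ML estimate of mu
var_hat = np.mean((x - mu_hat) ** 2)   # closed-form ML estimate of sigma^2 (the biased, 1/N version)
print(mu_hat, var_hat)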
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
• Overfitting phenomenon of the maximum likelihood
• The maximum likelihood approach systematically underestimates the variance of the distribution.
• Hypothesis : Data come from a Gaussian distribution with parameters 𝜇 and 𝜎2
• Consider the expectations of these quantities (μ̂, σ̂²)
No problem for the mean: E[μ̂] = μ
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
• Overfitting phenomenon of the maximum likelihood
• The maximum likelihood approach systematically underestimates the variance of the distribution.
• Hypothesis : Data come from a Gaussian distribution with parameters 𝜇 and 𝜎2
• Consider the expectations of these quantities (μ̂, σ̂²)
Problem for the variance: E[σ̂²] = ((N − 1)/N) σ²
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
• Overfitting phenomenon of the maximum likelihood
• http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html
• The estimator underestimates the true variance by a factor (N − 1)/N
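A small simulation sketch of this bias (the sample size and number of trials below are arbitrary): averaging the ML variance estimate over many small samples gives roughly (N − 1)/N times the true variance.

import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma2 = 5, 100_000, 1.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
# 1/N (biased) variance estimate, computed independently for each trial
var_ml = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(var_ml.mean(), (N - 1) / N * sigma2)  # both are close to 0.8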
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum a posteriori: W=[mean,variance]
• Well it depends on the prior (Pr(W) )
• See conjugate distributions
• the result (the posterior) is proportional to a new distribution which has the same form as the conjugate prior.
• Non-informative prior
Learning a discriminative model
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• Fitting by maximum of likelihood:
• That’s what your favorite standard neural network does
Link to : Minimizing the cross entropy : a nice trip from Maximum likelihood to Kullback–Leibler divergence
http://romain.raveaux.free.fr/document/CrossEntropy.html
Breaking news: in the context of a discriminative model for probabilistic classification, minimizing the cross entropy is equivalent to maximizing the likelihood.
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car; ŷ_11 = 0.2 is the first term of the vector
One-hot encoding
ŷ_1: probability distribution over the classes.
ŷ_11 = probability of observing class 1 = 0.2
ŷ_1 = Pr(y_1 | x_1, W)
Just some notation:
ŷ_i depends on the parameters W but we do not show it: ŷ_i(W)
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car; ŷ_11 = 0.2 is the first term of the vector
One-hot encoding
max ŷ_1 = max Pr(y_1 | x_1, W) = 0.5
argmax ŷ_1 = argmax Pr(y_1 | x_1, W) = 3
Maximum likelihood
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car
One-hot encoding
Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
max log ŷ_1 = max log Pr(y_1 | x_1, W) = log 0.5
argmax log ŷ_1 = argmax log Pr(y_1 | x_1, W) = 3
Applying the log does not change the argmax.
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car. One-hot encoding. Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
Cross entropy: H(y_1, ŷ_1) = −Σ_{k=1}^{3} y_1k log ŷ_1k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5
Entropy: H(ŷ_1) = −Σ_{k=1}^{3} ŷ_1k log ŷ_1k
Recall: Entropy
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car. One-hot encoding. Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
Cross entropy: H(y_1, ŷ_1) = −Σ_{k=1}^{3} y_1k log ŷ_1k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5
Maximum log likelihood: max log ŷ_1 = max log Pr(y_1 | x_1, W) = log 0.5
Negative log likelihood: min −log ŷ_1 = min −log Pr(y_1 | x_1, W) = −log 0.5
The cross-entropy sum reduces to this single negative log likelihood term because the one-hot target contains zeros everywhere except for one significant non-zero value.
And in this case, the "1" of the target and the largest predicted probability are located at the same position: the prediction agrees with the label (the ground truth), so the parameters are good.
argmax f(x) = argmin −f(x)
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car. One-hot encoding. Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
Cross entropy: H(y_1, ŷ_1) = −Σ_{k=1}^{3} y_1k log ŷ_1k = −log 0.5. Negative log likelihood: min −log ŷ_1 = min −log Pr(y_1 | x_1, W) = −log 0.5
min_W −log Pr(y_1 | x_1, W) = min_W H(y_1, ŷ_1(W)), where ŷ(W) is the prediction: ŷ depends on the parameters W
Minimizing the negative log likelihood is equivalent to minimizing the cross entropy in a classification context.
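The same three-class example, written out in Python: with a one-hot target, the cross entropy reduces to the negative log likelihood of the true class.

import numpy as np

y1 = np.array([0.0, 0.0, 1.0])        # one-hot ground truth: car
y1_hat = np.array([0.2, 0.3, 0.5])    # prediction Pr(y | x1, W)

cross_entropy = -np.sum(y1 * np.log(y1_hat))  # H(y1, y1_hat)
nll = -np.log(y1_hat[np.argmax(y1)])          # -log Pr(y1 = car | x1, W)
print(cross_entropy, nll)                     # both equal -log(0.5), about 0.693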
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
The log does not impact the arg max
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
The log does not impact the arg max
Log rule
Applying the log rule
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
arg max f(x) = arg min –f(x)
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
Minimizing the negative log likelihood is equivalent to minimizing the cross
entropy
ŷ(W): the prediction ŷ depends on the parameters W
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
Maximizing the log likelihood is equivalent to minimizing the cross entropy in a classification context.
http://romain.raveaux.free.fr/document/CrossEntropy.html
Breaking news:
1. In the context of a discriminative model for probabilistic classification,
2. minimizing the cross entropy = minimizing the negative log-likelihood = maximizing the likelihood.
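Written out over a training set of N examples, the chain of equivalences summarized above reads:

argmax_W Σ_{i=1}^{N} log Pr(y_i | x_i, W) = argmin_W −Σ_{i=1}^{N} Σ_k y_ik log ŷ_ik(W) = argmin_W Σ_{i=1}^{N} H(y_i, ŷ_i(W))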
Learning a discriminative model (maximum of likelihood)
What would happen in a regression context???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
Learning a discriminative model (maximum of likelihood)
Ok the classification case BUT
What would happen in a regression context ???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
So we need to make a hypothesis:
y_i has a Gaussian distribution with mean ŷ_i(W)
Learning a discriminative model (maximum of likelihood)
Ok the classification case BUT
What would happen in a regression context???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
Learning a discriminative model (maximum of likelihood)
Ok the classification case BUT
What would happen in a regression context???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
Negative log-likelihood: change the product into a sum of logs.
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
Recall of the Gaussian distribution
Computing the log of the Gaussian
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
How to find the minimum with respect to W ????
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
How to find the minimum with respect to W ????
What’s that ????
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
How to find the minimum with respect to W ????
What’s that ???? The least squares criterion appears !!!
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
Maximizing the likelihood is equivalent to minimizing the sum-of-squares error
function !!!!
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
1. Maximizing the likelihood is equivalent to minimizing the sum-of-squares error function !!!!
2. Thus the sum-of-squares error function has arisen as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
1. Said differently :
• We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model.
y_i = ŷ_i(W) + ε, where ε is Gaussian noise with zero mean and variance σ²
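A short sketch of this equivalence in Python (the data and the one-parameter linear predictor are made up for illustration): the negative log likelihood under the Gaussian noise model and the sum-of-squares error are minimized by the same parameter value, since the variance factor and the additive constants do not move the arg min.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + rng.normal(0.0, 0.1, size=50)   # synthetic regression targets

def neg_log_likelihood(w, sigma2=0.01):
    y_hat = w * x                              # the prediction depends on the parameter w
    return 0.5 * np.sum((y - y_hat) ** 2) / sigma2 + 0.5 * len(x) * np.log(2 * np.pi * sigma2)

def sum_of_squares(w):
    return 0.5 * np.sum((y - w * x) ** 2)

ws = np.linspace(2.0, 4.0, 201)
print(ws[np.argmin([neg_log_likelihood(w) for w in ws])],
      ws[np.argmin([sum_of_squares(w) for w in ws])])   # same minimizer for both criteria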
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
Link to : Minimizing the least squares error
http://romain.raveaux.free.fr/document/LeastSquaresError.html
Breaking news: in the context of a discriminative model for regression, the sum-of-squares error function arises as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.
Fitting by maximum of likelihood:
That’s what your favorite standard neural network does
Learning a discriminative model (Maximum a posteriori)
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• By maximum a posteriori (MAP)
Bayes’ rule
Learning a discriminative model (Maximum a posteriori)
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• Maximum a posteriori (MAP)
D (the data) does not depend on W, so Pr(D) can be dropped from the arg max
Learning a discriminative model (Maximum a posteriori)
• A discriminative model with its parameters:
Pr(y|x,W)
• By maximum a posteriori (MAP)
Independent and identically distributed
Learning a discriminative model (Maximum a posteriori)
• A discriminative model with its parameters: Pr(y|x,W)
• By maximum a posteriori (MAP)
• Consequence :
• The same than Maximum of Likelihood +
• A prior has appeared : Pr(W)
• Your favorite neural network can do that if you specify some structure on Pr(W)
Learning a discriminative model (Maximum a posteriori)
• Let us try a Gaussian assumption on Pr(W) !!!!
• α controls the variance of the distribution. I is the identity matrix.
• The log :
Learning a discriminative model (Maximum a posteriori)
• Let us substitute this prior into the MAP problem
• The maximum a posteriori problem becomes:
Learning a discriminative model (Maximum a posteriori)
• Let us substitute this prior into the MAP problem
• The maximum a posteriori problem becomes:
What’s that?????
Learning a discriminative model (Maximum a posteriori)
• Let us substitute this prior into the MAP problem
• The maximum a posteriori problem becomes:
W^T = (W_1, W_2, …, W_D)
W^T W = W_1 W_1 + W_2 W_2 + … + W_D W_D
W^T W = W_1² + W_2² + … + W_D² = ‖W‖₂²  →  L2 regularization or L2 penalty
Cost function = Loss (e.g. MSE or cross entropy) + Regularization term
1. The L2 regularizer is known in the machine learning literature as weight decay.
2. It encourages weight values to decay towards zero.
3. In statistics, it provides an example of a parameter shrinkage method because it shrinks parameter values towards zero.
4. The error function remains a quadratic function.
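A minimal sketch of such a cost function (assuming, for illustration only, a linear model and a sum-of-squares data loss; alpha sets the strength of the penalty and is tied to the variance of the Gaussian prior on W):

import numpy as np

def cost(W, X, y, alpha=1.0):
    y_hat = X @ W                                # the prediction depends on the parameters W
    loss = 0.5 * np.sum((y - y_hat) ** 2)        # data term: sum-of-squares (MSE-style) loss
    l2_penalty = 0.5 * alpha * np.sum(W ** 2)    # -log Pr(W) up to constants: the weight decay term
    return loss + l2_penalty

The same pattern applies with a cross-entropy data term in a classification setting: the prior only adds the quadratic penalty to whatever loss the likelihood gives.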
Learning a discriminative model (Maximum a posteriori)
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• Maximum a posteriori (MAP)
• Your favorite neural network can do that if you specify some structure on Pr(W)
Link to : Minimizing the least squares error with quadratic regularization
http://romain.raveaux.free.fr/document/LeastSquaresError.html
Breaking news : maximizing the posterior distribution is equivalent to minimizing a regularized error function
Learning a generative model
• A generative model : Pr(x|y)
• A generative model with its parameters: Pr(x|y,W)
• Maximum of likelihood
• That’s what a Generative Adversarial Network (GAN) does
• Can be used for classification:
• See Naive Bayes classifier
http://romain.raveaux.free.fr/document/NaiveBayesClassifier.html
Inference
• Learning is great but we want to predict
• Here comes the inference time:
• we take a new set of measurements and use the model to tell us about the world state.
• Inference with a discriminative model:
• W* : Learned parameters
• 𝑥𝑛𝑒𝑤 : new datum
• Easy as : Pr(y|𝑥𝑛𝑒𝑤,W*)
Inference
• Learning is great but we want to predict
• Here comes the inference time:
• we take a new set of measurements and use the model to tell us about the world state.
• Inference with a generative model:
• Use Bayes’ rule: posterior distribution
See Naive Bayes classifier:
http://romain.raveaux.free.fr/document/NaiveBayesClassifier.html
Inference with a generative model:
• An application :
• Classification
• Pixel classification
• Skin or not skin pixel
• 𝑥𝑖 ∈ {0, … , 255} : Gray level
• 𝑦𝑖 ∈ {0; 1} : Skin or not
Inference with a generative model:
• An application :
• Classification
• Pixel classification
• Skin or not skin pixel
• 𝑥𝑖 ∈ {0, … , 255} : Gray level
• 𝑦𝑖 ∈ {0; 1} : Skin or not
• Pr(𝑥𝑛𝑒𝑤|y,W)
• Modelled by a categorical distribution
• W = {λ_k}_{k=0}^{255}
• W is obtained by maximum likelihood:
λ_k = (number of pixels of gray level k) / (total number of pixels)
Inference with a generative model:
• An application :
• Classification
• Pixel classification
• Skin or not skin pixel
• 𝑥𝑖 ∈ {0, … , 255} : Gray level
• 𝑦𝑖 ∈ {0; 1} : Skin or not
• Pr(y | W)
• Modelled by a Bernoulli distribution
• W = {λ_k}_{k=0}^{1}
• W is obtained by maximum likelihood:
λ_k = (number of pixels of class k) / (total number of pixels)
Inference with a generative model:
Taking the arg max
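A sketch of this skin-pixel pipeline end to end (the tiny training arrays are placeholders; a real dataset would provide many labelled pixels): fit the per-class categorical likelihoods and the Bernoulli prior by counting, then apply Bayes' rule to a new gray level and take the arg max.

import numpy as np

x_train = np.array([200, 190, 30, 40, 210, 50])  # gray levels in {0, ..., 255}
y_train = np.array([1, 1, 0, 0, 1, 0])           # 1 = skin, 0 = not skin

# Likelihood Pr(x | y=k, W): one categorical distribution per class, fitted by counting
likelihood = np.zeros((2, 256))
for k in (0, 1):
    counts = np.bincount(x_train[y_train == k], minlength=256)
    likelihood[k] = counts / counts.sum()

# Prior Pr(y): a Bernoulli distribution fitted by class frequencies
prior = np.bincount(y_train, minlength=2) / len(y_train)

def posterior(x_new):
    unnorm = likelihood[:, x_new] * prior        # Pr(x_new | y) Pr(y)
    return unnorm / unnorm.sum()                 # divide by Pr(x_new) (sum rule over y)

# Query a gray level that was seen in training; with so few pixels, unseen gray
# levels get zero probability and would need smoothing in a real application.
p = posterior(200)
print(p, p.argmax())                             # taking the arg max gives the predicted class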
Conclusion
• Machine learning
• Models
• Generative
• Discriminative
• Inference
• Maximum a posteriori (especially for generative models)
• Maximum likelihood
• Learning
• Maximum likelihood (overfitting ?)
• Maximum a posteriori (include some knowledge about the parameters)
A Probabilistic View
Learning and inference explained by decision theory and probabilities : A summary
Next
• We will connect local models !!!
• http://romain.raveaux.free.fr/document/Connecting%20local%20models%20 the%20case%20of%20chains%20.pdf