Supervised machine learning: An introduction
Romain Raveaux
romain.raveaux@univ-tours.fr Maître de conférences
Université de Tours
Laboratoire d’informatique (LIFAT) Equipe RFAI
Outline
• Introduction
• Supervised/Reinforcement/Unsupervised
• Probability
• Generative/Discriminative models
• Fitting probability models
• Fitting discriminative models
• Fitting generative models
• Inference
Learning paradigms
• Supervised learning
• the task of learning a function f
• that maps an input to an output: f: X → Y
• based on example input-output pairs: {(x_i, y_i)}_{i=1}^{N}, with x_i ∈ X and y_i ∈ Y
• Reinforcement learning: f = f_1 ∘ f_2 ∘ … ∘ f_k : X → Ȳ
• differs from supervised learning
• f is composed of sub-functions (in X) that produce an output ȳ_i
• the feedback is incomplete (ȳ_i is not completely given: ȳ_i ⊂ y_i ∈ Y)
• Unsupervised learning:
• the task of learning a function f
• that maps an input to an output: f: X → Z
• based on input examples: {x_i}_{i=1}^{N}, with x_i ∈ X
• no labels (y_i) are given
• Self-supervised learning
• is autonomous supervised learning
• eliminates the prerequisite of having humans label the data
Supervised learning
Regression
Credit : https://openclassrooms.com/fr/courses/4525266
Supervised learning
Classification
Credit : https://towardsdatascience.com/supervised-learning-basics-of-classification-and-main-algorithms-c16b06806cd3
Self-supervised learning
Supervised learning
• Regression: f: X → Y and y ∈ ℝ
• Classification: f: X → Y and y ∈ {a, b, c, d}
• Structured classification: f: X → Y and y ∈ {a, b, c, d}^{N×M}
• Structured regression: f: X → Y and y ∈ ℝ^{N×M}
• Structured regression: f: X → Y and y ∈ G (a graph space)
• Structured learning when the output is complex:
• Segmentation mask for instance
Probability = rational way to deal with uncertainty
Pr(X, Y): joint probability
Pr(Y|X): conditional probability
Pr(X): marginal probability
Probability : Joint, Conditional and Marginal
Probability : Link between Joint, Conditional and Marginal : The product rule
Product rule
Link between Joint and Conditional probabilities
Conditional probability
• Pr(X|Y = 1) :
• The conditional probability Pr(X|Y = 1) can be recovered from the joint distribution Pr(X,Y).
• In particular, we examine the appropriate slice Pr(X,Y=1) of the joint distribution
• The values in the slice tell us about the relative probability that X takes various values having observed Y=1,
• But they do not themselves form a valid probability distribution
• because they do not sum to one
• We hence normalize by the total probability in the slice :
• Pr(X | Y = 1) = Pr(X, Y = 1) / Pr(Y = 1)
Probability : Link between Joint and Marginal probabilities : The sum rule
Sum rule
Link between Joint and Marginal probabilities
Sum rule: Pr(X) = Σ_Y Pr(X, Y)
The sum rule is one of the rules of probability; it is also called marginalization.
Bayes’ rule : provides a quantification of uncertainty
Symmetry
Note that the likelihood (p(X|Y)) is not a probability distribution over Y, and its integral with respect to Y does not (necessarily) equal one.
p(Y|X) p(X) = p(X|Y) p(Y)   (product rule, applied in both directions thanks to the symmetry of the joint)
Bayes’ rule: p(Y|X) = p(X|Y) p(Y) / p(X)
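As a small numerical illustration (the joint table below is made up for this note, not taken from the slides), the sum rule, the slice-and-normalize conditional, and Bayes' rule can be checked with a few lines of Python:

import numpy as np

# Rows index X in {0, 1, 2}; columns index Y in {0, 1}. Entries are Pr(X, Y) and sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)              # sum rule: Pr(X) = sum_Y Pr(X, Y)
p_y = joint.sum(axis=0)              # sum rule: Pr(Y) = sum_X Pr(X, Y)

p_x_given_y1 = joint[:, 1] / p_y[1]            # slice and normalize: Pr(X | Y=1) = Pr(X, Y=1) / Pr(Y=1)
p_y1_given_x = p_x_given_y1 * p_y[1] / p_x     # Bayes' rule: Pr(Y=1 | X) = Pr(X | Y=1) Pr(Y=1) / Pr(X)

# Product rule check: Pr(X, Y=1) = Pr(X | Y=1) Pr(Y=1)
assert np.allclose(joint[:, 1], p_x_given_y1 * p_y[1])
print(p_x, p_y, p_x_given_y1, p_y1_given_x)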
Expectation of a function f(x)
Entropy
• f(x) = −log p(x)
• the negative logarithm of p(x):
• low-probability events x correspond to high information content.
• Entropy = the expectation of f(x)
• For a fixed mean and variance, the continuous distribution that maximizes the entropy is the Gaussian!
Discrete case: H[x] = −Σ_x p(x) log p(x). Continuous case: H[x] = −∫ p(x) log p(x) dx.
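A quick sketch of these definitions in the discrete case (the probability values below are illustrative):

import numpy as np

p = np.array([0.2, 0.3, 0.5])   # a discrete distribution Pr(x)
info = -np.log(p)               # f(x) = -log p(x): low-probability events carry high information content
entropy = np.sum(p * info)      # H[x] = E[f(x)] = -sum_x p(x) log p(x)
print(info, entropy)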
Learning : We need Data
• A database : Training set
• 𝑥𝑖 ∶ a real or discrete datum, a measurement of the world.
• 𝑦𝑖 ∶ a real or discrete state of the world. A target, a ground truth value, …
Learning = Modelling data
• Discriminative models: Pr(y|x). Generative models: Pr(x|y),
• or just Pr(x) if there are no output labels (this is also a generative model)
Supervised learning
Credit: https://developers.google.com/machine-learning/gan/images/generative_v_discriminative.png
Discriminative: Pr(y|x). Generative: Pr(x|y).
Inference
• We take a new set of measurements
• A new datum: x_new
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤 (to infer from 𝑥𝑛𝑒𝑤)
• We take a datum x_new and use it to infer the state of the world y.
Inference with a discriminative model
• We take a new set of measurements
• A new datum : 𝑥𝑛𝑒𝑤
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤
• Inference with a discriminative model:
• Easy as : Pr(y|𝑥𝑛𝑒𝑤)
• Given a new datum, we obtain the probability distribution of the world states
• world states = classes in a classification context
• world states = the domain of real values in a regression context
Inference with a generative model
• We take a new set of measurements
• A new datum : 𝑥𝑛𝑒𝑤
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤
• Inference with a generative model:
• The generative model : Pr(𝑥𝑛𝑒𝑤|y)
• Pr(y | x_new) = Pr(x_new | y) × Pr(y) / Pr(x_new)
Bayes’ rule
Supervised learning : What can we learn?
We want to find these distributions:
• Pr(y|x) or
• Pr(x|y)
Where can we act?
1. On the parameters of the distributions.
2. Let’s call them W
Parametrized distributions are :
• Pr(y|x,W) or
• Pr(x|y,W)
Supervised learning : fitting probability models
We want to find the parameters (W) such that the probability distribution fits the data.
Let us take an example !!! A Gaussian distribution.
Supervised learning : fitting probability models = Finding 𝜇, 𝜎2
Fitting probability models : How to fit the data ?
• We find the parameters to fit the data
• Three main ways :
• 1°) Maximum likelihood
• Assuming each data point was drawn independently from the same distribution (i.i.d.: independent and identically distributed)
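Under the i.i.d. assumption the likelihood of the whole training set factorizes, and taking the log turns the product into a sum (a standard rewriting of the criterion, consistent with the assumption above):

W_ML = argmax_W Pr(x_1, …, x_N | W) = argmax_W Π_{i=1}^{N} Pr(x_i | W) = argmax_W Σ_{i=1}^{N} log Pr(x_i | W)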
Supervised learning : fitting probability models
• Three ways :
• 2°) Maximum a posteriori
• Assuming each data point was drawn independently from the distribution (i.i.d)
Bayes’ rule: Pr(W | x_1, …, x_N) ∝ Pr(x_1, …, x_N | W) Pr(W)
Supervised learning : fitting probability models
• Three ways :
• 3°) The Bayesian approach
• Beyond the scope of this lecture
The parameter W is treated as a random variable.
This view makes a lot of sense:
• W depends on the data set {𝑥1, 𝑥2, 𝑥3, …}
which itself is randomly sampled.
For any prior distribution Pr(W) over the space of weight vectors, one can derive the posterior probability of W for given observations
Supervised learning : fitting probability models : On a Gaussian
• Fitting by maximum likelihood:
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the Negative Log Likelihood
How to solve this optimization problem ? How to find the minimum of a function ?
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the Negative Log Likelihood
How to solve this optimization problem ? How to find the minimum of a function ? Find where the derivative is equal to zero !!!!
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the Negative Log Likelihood
The problem: we search for the parameter μ.
The solution: for this problem, there is a closed-form, analytical solution: μ̂ = (1/N) Σ_{i=1}^{N} x_i.
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
Minimizing the negative log likelihood. The problem: we search for the parameter σ².
The solution: σ̂² = (1/N) Σ_{i=1}^{N} (x_i − μ̂)².
The full steps to obtain this result can be found at:
http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html
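A minimal sketch of these closed-form solutions in Python (the data is synthetic, drawn with arbitrary true parameters, just to check the estimators):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # assumed i.i.d. Gaussian data

mu_hat = x.mean()                      # closed-form ML estimate of mu
var_hat = np.mean((x - mu_hat) ** 2)   # closed-form ML estimate of sigma^2 (the biased, 1/N version)
print(mu_hat, var_hat)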
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
• Overfitting phenomenon of the maximum likelihood
• The maximum likelihood approach systematically underestimates the variance of the distribution.
• Hypothesis : Data come from a Gaussian distribution with parameters 𝜇 and 𝜎2
• Consider the expectations of these quantities (μ̂, σ̂²)
No problem for the mean: E[μ̂] = μ
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
• Overfitting phenomenon of the maximum likelihood
• The maximum likelihood approach systematically underestimates the variance of the distribution.
• Hypothesis : Data come from a Gaussian distribution with parameters 𝜇 and 𝜎2
• Consider the expectations of these quantities (μ̂, σ̂²)
Problem for the variance: E[σ̂²] = ((N − 1)/N) σ²
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum of likelihood:
• Overfitting phenomenon of the maximum likelihood
• http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html
• The estimator underestimates the true variance by a factor (N − 1)/N
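A small simulation sketch of this bias (the sample size and number of trials below are arbitrary): averaging the ML variance estimate over many small samples gives roughly (N − 1)/N times the true variance.

import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma2 = 5, 100_000, 1.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
# 1/N (biased) variance estimate, computed independently for each trial
var_ml = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(var_ml.mean(), (N - 1) / N * sigma2)  # both are close to 0.8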
Supervised learning : fitting probability models : On a Gaussian
• Fitting by Maximum a posteriori: W=[mean,variance]
• Well it depends on the prior (Pr(W) )
• See conjugate distributions
• the result (the posterior) is proportional to a new distribution which has the same form as the conjugate prior.
• Non-informative prior
Learning a discriminative model
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• Fitting by maximum of likelihood:
• That’s what your favorite standard neural network does
Link to : Minimizing the cross entropy : a nice trip from Maximum likelihood to Kullback–Leibler divergence
http://romain.raveaux.free.fr/document/CrossEntropy.html
Breaking news: in the context of a discriminative model for probabilistic classification, minimizing the cross entropy is equivalent to maximizing the likelihood.
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car; ŷ_11 = 0.2 is the first term of the vector
One-hot encoding
ŷ_1: probability distribution over the classes.
ŷ_11 = probability of observing class 1 = 0.2
ŷ_1 = Pr(y_1 | x_1, W)
Just some notation:
ŷ_i depends on the parameters W but we do not show it: ŷ_i(W)
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car; ŷ_11 = 0.2 is the first term of the vector
One-hot encoding
max ŷ_1 = max Pr(y_1 | x_1, W) = 0.5
argmax ŷ_1 = argmax Pr(y_1 | x_1, W) = 3
Maximum likelihood
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car
One-hot encoding
Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
max log ŷ_1 = max log Pr(y_1 | x_1, W) = log 0.5
argmax log ŷ_1 = argmax log Pr(y_1 | x_1, W) = 3
Applying the log does not change the argmax.
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car. One-hot encoding. Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
Cross entropy: H(y_1, ŷ_1) = −Σ_{k=1}^{3} y_1k log ŷ_1k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5
Entropy: H(ŷ_1) = −Σ_{k=1}^{3} ŷ_1k log ŷ_1k
Recall: Entropy
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car. One-hot encoding. Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
Cross entropy: H(y_1, ŷ_1) = −Σ_{k=1}^{3} y_1k log ŷ_1k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5
Maximum log likelihood: max log ŷ_1 = max log Pr(y_1 | x_1, W) = log 0.5
Negative log likelihood: min −log ŷ_1 = min −log Pr(y_1 | x_1, W) = −log 0.5
The cross-entropy sum reduces to this single negative log likelihood term because the one-hot target contains zeros everywhere except for one significant non-zero value.
And in this case, the "1" of the target and the largest predicted probability are located at the same position: the prediction agrees with the label (the ground truth), so the parameters are good.
argmax f(x) = argmin −f(x)
Learning a discriminative model (maximum of likelihood)
y_1 = (0, 0, 1) → (landscape, house, car) → y_1 = car
A classification problem with 3 classes: a landscape, a house or a car
ŷ_1 = (0.2, 0.3, 0.5) → (landscape, house, car) → ŷ_1 = car. One-hot encoding. Applying the log: log ŷ_1 = (log 0.2, log 0.3, log 0.5)
Cross entropy: H(y_1, ŷ_1) = −Σ_{k=1}^{3} y_1k log ŷ_1k = −log 0.5. Negative log likelihood: min −log ŷ_1 = min −log Pr(y_1 | x_1, W) = −log 0.5
min_W −log Pr(y_1 | x_1, W) = min_W H(y_1, ŷ_1(W)), where ŷ(W) is the prediction: ŷ depends on the parameters W
Minimizing the negative log likelihood is equivalent to minimizing the cross entropy in a classification context.
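The same three-class example, written out in Python: with a one-hot target, the cross entropy reduces to the negative log likelihood of the true class.

import numpy as np

y1 = np.array([0.0, 0.0, 1.0])        # one-hot ground truth: car
y1_hat = np.array([0.2, 0.3, 0.5])    # prediction Pr(y | x1, W)

cross_entropy = -np.sum(y1 * np.log(y1_hat))  # H(y1, y1_hat)
nll = -np.log(y1_hat[np.argmax(y1)])          # -log Pr(y1 = car | x1, W)
print(cross_entropy, nll)                     # both equal -log(0.5), about 0.693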
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
The log does not impact the arg max
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
The log does not impact the arg max
Log rule
Applying the log rule
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
arg max f(x) = arg min –f(x)
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
Minimizing the negative log likelihood is equivalent to minimizing the cross
entropy
ŷ(W): the prediction ŷ depends on the parameters W
Learning a discriminative model (maximum of likelihood)
• Maximum likelihood :
Maximizing the log likelihood is equivalent to minimizing the cross entropy in a classification context.
http://romain.raveaux.free.fr/document/CrossEntropy.html
Breaking news:
1. In the context of a discriminative model for probabilistic classification,
2. minimizing the cross entropy = minimizing the negative log-likelihood = maximizing the likelihood.
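Written out over a training set of N examples, the chain of equivalences summarized above reads:

argmax_W Σ_{i=1}^{N} log Pr(y_i | x_i, W) = argmin_W −Σ_{i=1}^{N} Σ_k y_ik log ŷ_ik(W) = argmin_W Σ_{i=1}^{N} H(y_i, ŷ_i(W))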
Learning a discriminative model (maximum of likelihood)
What would happen in a regression context???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
Learning a discriminative model (maximum of likelihood)
Ok the classification case BUT
What would happen in a regression context ???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
So we need to make a hypothesis:
y_i has a Gaussian distribution with mean ŷ_i(W)
Learning a discriminative model (maximum of likelihood)
Ok the classification case BUT
What would happen in a regression context???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
Learning a discriminative model (maximum of likelihood)
Ok the classification case BUT
What would happen in a regression context???
Let us say that 𝑦𝑖 ∈ ℝ
No more “One hot encoding”
Negative log-likelihood: change the product into a sum of logs.
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
Recall of the Gaussian distribution
Computing the log of the Gaussian
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
How to find the minimum with respect to W ????
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
How to find the minimum with respect to W ????
What’s that ????
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
How to find the minimum with respect to W ????
What’s that ???? The least squares criterion appears !!!
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
Maximizing the likelihood is equivalent to minimizing the sum-of-squares error
function !!!!
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
1. Maximizing the likelihood is equivalent to minimizing the sum-of-squares error function !!!!
2. Thus the sum-of-squares error function has arisen as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
1. Said differently :
• We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model.
y_i = ŷ_i(W) + ε, where ε is Gaussian noise with zero mean and variance σ²
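A short sketch of this equivalence in Python (the data and the one-parameter linear predictor are made up for illustration): the negative log likelihood under the Gaussian noise model and the sum-of-squares error are minimized by the same parameter value, since the variance factor and the additive constants do not move the arg min.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + rng.normal(0.0, 0.1, size=50)   # synthetic regression targets

def neg_log_likelihood(w, sigma2=0.01):
    y_hat = w * x                              # the prediction depends on the parameter w
    return 0.5 * np.sum((y - y_hat) ** 2) / sigma2 + 0.5 * len(x) * np.log(2 * np.pi * sigma2)

def sum_of_squares(w):
    return 0.5 * np.sum((y - w * x) ** 2)

ws = np.linspace(2.0, 4.0, 201)
print(ws[np.argmin([neg_log_likelihood(w) for w in ws])],
      ws[np.argmin([sum_of_squares(w) for w in ws])])   # same minimizer for both criteria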
Learning a discriminative model (maximum of likelihood)
1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context
4. With a Gaussian hypothesis :
Link to : Minimizing the least squares error
http://romain.raveaux.free.fr/document/LeastSquaresError.html
Breaking news: in the context of a discriminative model for regression, the sum-of-squares error function arises as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.
Fitting by maximum of likelihood:
That’s what your favorite standard neural network does
Learning a discriminative model (Maximum a posteriori)
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• By maximum a posteriori (MAP)
Bayes’ rule
Learning a discriminative model (Maximum a posteriori)
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• Maximum a posteriori (MAP)
D (the data) does not depend on W, so Pr(D) can be dropped from the arg max
Learning a discriminative model (Maximum a posteriori)
• A discriminative model with its parameters:
Pr(y|x,W)
• By maximum a posteriori (MAP)
Independent and identically distributed
Learning a discriminative model (Maximum a posteriori)
• A discriminative model with its parameters: Pr(y|x,W)
• By maximum a posteriori (MAP)
• Consequence :
• The same than Maximum of Likelihood +
• A prior has appeared : Pr(W)
• Your favorite neural network can do that if you specify some structure on Pr(W)
Learning a discriminative model (Maximum a posteriori)
• Let us try a Gaussian assumption on Pr(W) !!!!
• α controls the variance of the distribution. I is the identity matrix.
• The log :
Learning a discriminative model (Maximum a posteriori)
• Let us substitute this prior into the MAP problem
• The maximum a posteriori problem becomes:
Learning a discriminative model (Maximum a posteriori)
• Let us substitute this prior into the MAP problem
• The maximum a posteriori problem becomes:
What’s that?????
Learning a discriminative model (Maximum a posteriori)
• Let us substitute this prior into the MAP problem
• The maximum a posteriori problem becomes:
W^T = (W_1, W_2, …, W_D)
W^T W = W_1 W_1 + W_2 W_2 + … + W_D W_D
W^T W = W_1² + W_2² + … + W_D² = ‖W‖₂²  →  L2 regularization or L2 penalty
Cost function = Loss (e.g. MSE or cross entropy) + Regularization term
1. The L2 regularizer is known in the machine learning literature as weight decay.
2. It encourages weight values to decay towards zero.
3. In statistics, it provides an example of a parameter shrinkage method because it shrinks parameter values towards zero.
4. The error function remains a quadratic function.
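A minimal sketch of such a cost function (assuming, for illustration only, a linear model and a sum-of-squares data loss; alpha sets the strength of the penalty and is tied to the variance of the Gaussian prior on W):

import numpy as np

def cost(W, X, y, alpha=1.0):
    y_hat = X @ W                                # the prediction depends on the parameters W
    loss = 0.5 * np.sum((y - y_hat) ** 2)        # data term: sum-of-squares (MSE-style) loss
    l2_penalty = 0.5 * alpha * np.sum(W ** 2)    # -log Pr(W) up to constants: the weight decay term
    return loss + l2_penalty

The same pattern applies with a cross-entropy data term in a classification setting: the prior only adds the quadratic penalty to whatever loss the likelihood gives.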
Learning a discriminative model (Maximum a posteriori)
• A discriminative model : Pr(y|x)
• A discriminative model with its parameters:
Pr(y|x,W)
• Maximum a posteriori (MAP)
• Your favorite neural network can do that if you specify some structure on Pr(W)
Link to : Minimizing the least squares error with quadratic regularization
http://romain.raveaux.free.fr/document/LeastSquaresError.html
Breaking news : maximizing the posterior distribution is equivalent to minimizing a regularized error function
Learning a generative model
• A generative model : Pr(x|y)
• A generative model with its parameters: Pr(x|y,W)
• Maximum of likelihood
• That’s what a Generative Adversarial Network (GAN) does
• Can be used for classification:
• See Naive Bayes classifier
http://romain.raveaux.free.fr/document/NaiveBayesClassifier.html
Inference
• Learning is great but we want to predict
• Here comes the inference time:
• we take a new set of measurements and use the model to tell us about the world state.
• Inference with a discriminative model:
• W* : Learned parameters
• 𝑥𝑛𝑒𝑤 : new datum
• Easy as : Pr(y|𝑥𝑛𝑒𝑤,W*)
Inference
• Learning is great but we want to predict
• Here comes the inference time:
• we take a new set of measurements and use the model to tell us about the world state.
• Inference with a generative model:
• Use Bayes’ rule: posterior distribution
See Naive Bayes classifier:
http://romain.raveaux.free.fr/document/NaiveBayesClassifier.html
Inference with a generative model:
• An application :
• Classification
• Pixel classification
• Skin or not skin pixel
• 𝑥𝑖 ∈ {0, … , 255} : Gray level
• 𝑦𝑖 ∈ {0; 1} : Skin or not
Inference with a generative model:
• An application :
• Classification
• Pixel classification
• Skin or not skin pixel
• 𝑥𝑖 ∈ {0, … , 255} : Gray level
• 𝑦𝑖 ∈ {0; 1} : Skin or not
• Pr(𝑥𝑛𝑒𝑤|y,W)
• Modelled by a categorical distribution
• W = {λ_k}_{k=0}^{255}
• W is obtained by maximum likelihood:
λ_k = (number of pixels of gray level k) / (total number of pixels)
Inference with a generative model:
• An application :
• Classification
• Pixel classification
• Skin or not skin pixel
• 𝑥𝑖 ∈ {0, … , 255} : Gray level
• 𝑦𝑖 ∈ {0; 1} : Skin or not
• Pr(y | W)
• Modelled by a Bernoulli distribution
• W = {λ_k}_{k=0}^{1}
• W is obtained by maximum likelihood:
λ_k = (number of pixels of class k) / (total number of pixels)
Inference with a generative model:
Taking the arg max
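A sketch of this skin-pixel pipeline end to end (the tiny training arrays are placeholders; a real dataset would provide many labelled pixels): fit the per-class categorical likelihoods and the Bernoulli prior by counting, then apply Bayes' rule to a new gray level and take the arg max.

import numpy as np

x_train = np.array([200, 190, 30, 40, 210, 50])  # gray levels in {0, ..., 255}
y_train = np.array([1, 1, 0, 0, 1, 0])           # 1 = skin, 0 = not skin

# Likelihood Pr(x | y=k, W): one categorical distribution per class, fitted by counting
likelihood = np.zeros((2, 256))
for k in (0, 1):
    counts = np.bincount(x_train[y_train == k], minlength=256)
    likelihood[k] = counts / counts.sum()

# Prior Pr(y): a Bernoulli distribution fitted by class frequencies
prior = np.bincount(y_train, minlength=2) / len(y_train)

def posterior(x_new):
    unnorm = likelihood[:, x_new] * prior        # Pr(x_new | y) Pr(y)
    return unnorm / unnorm.sum()                 # divide by Pr(x_new) (sum rule over y)

# Query a gray level that was seen in training; with so few pixels, unseen gray
# levels get zero probability and would need smoothing in a real application.
p = posterior(200)
print(p, p.argmax())                             # taking the arg max gives the predicted class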
Conclusion
• Machine learning
• Models
• Generative
• Discriminative
• Inference
• Maximum a posteriori (especially for generative models)
• Maximum likelihood
• Learning
• Maximum likelihood (overfitting ?)
• Maximum a posteriori (include some knowledge about the parameters)
A Probabilistic View
Learning and inference explained by decision theory and probabilities : A summary
Next
• We will connect local models !!!
• http://romain.raveaux.free.fr/document/Connecting%20local%20models%20 the%20case%20of%20chains%20.pdf