(1)

Supervised machine learning An introduction

Romain Raveaux

romain.raveaux@univ-tours.fr — Associate Professor (Maître de conférences)

Université de Tours

Laboratoire d’informatique (LIFAT), RFAI team


(2)

Outline

Introduction

Supervised / Reinforcement / Unsupervised

Probability

Generative/Discriminative models

Fitting probability models

Fitting discriminative models

Fitting generative models

Inference


(3)

Learning paradigms

Supervised learning

task of learning a function: 𝑓

that maps an input to an output ∶ 𝑓: 𝑋 → 𝑌

based on example input–output pairs: {(x_i, y_i)}, i = 1…N, with x_i ∈ X and y_i ∈ Y

Reinforcement learning: f = f_1 ∘ f_2 ∘ … ∘ f_k : X → Ȳ

differs from supervised learning:

f is composed of sub-functions (acting in X) to produce the output y_i,

Incomplete feedback (the full y_i is not given; only ȳ_i ⊂ y_i, with y_i ∈ Y)

Unsupervised learning :

task of learning a function: 𝑓

that maps an input to an output ∶ 𝑓: 𝑋 → 𝑍

based on input examples: {x_i}, i = 1…N, with x_i ∈ X

No labels (𝑦𝑖) are given

Self-supervised learning

is autonomous supervised learning:

it eliminates the prerequisite of having humans label the data.

(4)

Supervised learning

Regression

Credit : https://openclassrooms.com/fr/courses/4525266

(5)

Supervised learning

Classification

Credit : https://towardsdatascience.com/supervised-learning-basics-of-classification-and-main-algorithms-c16b06806cd3

(6)

Self-supervised learning

(7)

Supervised learning

Regression : f : X → Y and y ∈ ℝ

Classification : f : X → Y and y ∈ {a, b, c, d}

Structured classification : f : X → Y and y ∈ {a, b, c, d}^(N×M)

Structured regression : f : X → Y and y ∈ ℝ^(N×M)

Structured regression : f : X → Y and y ∈ G (a graph space)

Structured learning arises when the output is complex:

a segmentation mask, for instance

(8)

Probability = rational way to deal with uncertainty

Pr(X,Y) = joint probability

Pr(Y|X) = conditional probability

Pr(X) = marginal probability

(9)

Probability : Joint, Conditional and Marginal

(10)

Probability : Link between Joint, Conditional and Marginal : The product rule

Product rule

(11)

Link between Joint and Conditional probabilities

(12)

Conditional probability

Pr(X|Y = 1) :

The conditional probability Pr(X|Y = 1) can be recovered from the joint distribution Pr(X,Y).

In particular, we examine the appropriate slice Pr(X,Y=1) of the joint distribution

The values in the slice tell us about the relative probability that X takes various values having observed Y=1,

But they do not themselves form a valid probability distribution:

they sum to Pr(Y = 1), not to one.

We hence normalize by the total probability in the slice:

Pr(X | Y = 1) = Pr(X, Y = 1) / Pr(Y = 1)
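A minimal Python sketch of this slicing-and-normalizing step, using a small made-up joint table (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical joint distribution Pr(X, Y) over X in {0, 1, 2} and Y in {0, 1}.
# Rows index X, columns index Y; the whole table sums to one.
joint = np.array([[0.10, 0.05],
                  [0.20, 0.25],
                  [0.10, 0.30]])

slice_y1 = joint[:, 1]             # Pr(X, Y=1): sums to Pr(Y=1), not to one
p_y1 = slice_y1.sum()              # marginal Pr(Y=1), the normalizer
cond = slice_y1 / p_y1             # Pr(X | Y=1): a valid distribution

print(cond, cond.sum())            # [0.083 0.417 0.5], sums to 1.0
```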

(13)

Probability : Link between Joint and Marginal probabilities : The sum rule

Sum rule

(14)

Link between Joint and Marginal probabilities


(15)

The Rules of Probability

called Marginalization

(16)

Bayes’ rule : provide a quantification of uncertainty

Symmetry

Note that the likelihood (p(X|Y)) is not a probability distribution over Y, and its integral with respect to Y does not (necessarily) equal one.

p(Y|X) p(X) = p(X|Y) p(Y) (product rule)

Bayes’ rule
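A small numeric illustration of Bayes’ rule (the prior and likelihood values below are hypothetical):

```python
# Bayes' rule: Pr(Y|X) = Pr(X|Y) Pr(Y) / Pr(X), with
# Pr(X) = sum_y Pr(X|Y=y) Pr(Y=y) by the sum and product rules.
prior = {0: 0.7, 1: 0.3}            # Pr(Y=y), illustrative
likelihood = {0: 0.2, 1: 0.9}       # Pr(X=1 | Y=y), illustrative

evidence = sum(likelihood[y] * prior[y] for y in prior)            # Pr(X=1)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)                    # {0: ~0.34, 1: ~0.66}, sums to one
```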

(17)

Expectation of a function f(x)

(18)

Entropy

f(x) = −log p(x)

The negative log of p(x):

low-probability events x correspond to high information content.

Entropy = expectation of f(x) under p(x)

Among distributions with a fixed mean and variance, the distribution that maximizes the entropy is a Gaussian !!

Discrete case: H[p] = −Σ_x p(x) log p(x); continuous case: H[p] = −∫ p(x) log p(x) dx
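A short sketch of the discrete entropy as the expectation of −log p(x) (the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Discrete entropy H[p] = E[-log p(x)] = -sum_x p(x) log p(x)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform: maximal over 4 outcomes, log 4
print(entropy([0.97, 0.01, 0.01, 0.01]))   # peaked: low entropy, little surprise
```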

(19)

Learning : We need Data

A database : Training set

𝑥𝑖 a real or discrete datum, a measurement of the world.

𝑦𝑖 a real or discrete state of the world. A target, a ground truth value, …

(20)

Learning = Modelling data

Discriminative models fit Pr(y|x); generative models fit Pr(x|y),

or just Pr(x) if there are no output labels (this is a generative model)

(21)

Supervised learning

Credit: https://developers.google.com/machine-learning/gan/images/generative_v_discriminative.png

Discriminative: Pr(y|x). Generative: Pr(x|y).

(22)

Inference

We take a new set of measurements

A new datum : 𝑥𝑛𝑒𝑤

And we use the model to tell us about the world state.

We want to make a prediction from 𝑥𝑛𝑒𝑤 (to infer from 𝑥𝑛𝑒𝑤)

We take a datum 𝑥𝑛𝑒𝑤 and use it to infer the corresponding state of the world.

(23)

Inference with a discriminative model

We take a new set of measurements

A new datum : 𝑥𝑛𝑒𝑤

And we use the model to tell us about the world state.

We want to make a prediction from 𝑥𝑛𝑒𝑤

Inference with a discriminative model:

Easy as : Pr(y|𝑥𝑛𝑒𝑤)

Given a new datum, we obtain the probability distribution of the world states

world states = classes in a classification context

world states = the domain of real values in a regression context

(24)

Inference with a generative model

We take a new set of measurements

A new datum : 𝑥𝑛𝑒𝑤

And we use the model to tell us about the world state.

We want to make a prediction from 𝑥𝑛𝑒𝑤

Inference with a generative model:

The generative model : Pr(𝑥𝑛𝑒𝑤|y)

Pr(y | 𝑥𝑛𝑒𝑤) = Pr(𝑥𝑛𝑒𝑤 | y) × Pr(y) / Pr(𝑥𝑛𝑒𝑤)

Bayes’ rule
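A minimal sketch of this inference step, assuming we already have class-conditional likelihoods Pr(x_new | y) and a prior Pr(y) (the values are hypothetical):

```python
import numpy as np

likelihood = np.array([0.02, 0.10, 0.05])   # Pr(x_new | y), one value per class
prior      = np.array([0.50, 0.30, 0.20])   # Pr(y)

evidence  = np.sum(likelihood * prior)      # Pr(x_new), by marginalization
posterior = likelihood * prior / evidence   # Pr(y | x_new), Bayes' rule
print(posterior, posterior.argmax())        # class with the highest posterior
```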

(25)

Supervised learning : What can we learn?

We want to find these distributions:

Pr(y|x) or

Pr(x|y)

Where can we act?

1. On the parameters of the distributions.

2. Let’s call them W

Parametrized distributions are :

Pr(y|x,W) or

Pr(x|y,W)

(26)

Supervised learning : fitting probability models

We want to find the parameters (W) such that the probability distribution fits the data.

Let us take an example !!! A Gaussian distribution.

(27)

Supervised learning : fitting probability models = finding μ, σ²

(28)

Fitting probability models : How to fit the data ?

• We find the parameters to fit the data

Three main ways :

1°) Maximum likelihood

Assuming each data point was drawn independently from the distribution (i.i.d)

Independent and identically distributed i.i.d

(29)

Supervised learning : fitting probability models

Three ways :

2°) Maximum a posteriori

Assuming each data point was drawn independently from the distribution (i.i.d)

Bayes’ rule

(30)

Supervised learning : fitting probability models

Three ways :

3°) The Bayesian approach

Beyond the scope of this lecture

Parameter W is treated as a random variable.

This view makes a lot of sense:

W depends on the data set {𝑥1, 𝑥2, 𝑥3, …}

which itself is randomly sampled.

For any prior distribution Pr(W) over the space of weight vectors, one can derive the posterior probability of W for given observations

(31)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

(32)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Minimizing the Negative Log Likelihood

How to solve this optimization problem ? How to find the minimum of a function ?

(33)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Minimizing the Negative Log Likelihood

How to solve this optimization problem ? How to find the minimum of a function ? Find where the derivative is equal to zero !!!!

(34)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Minimizing the Negative Log Likelihood

The problem : we search the parameter μ that minimizes the negative log likelihood.

The solution : for this problem, there is a closed-form solution, an analytical solution:

μ̂ = (1/N) Σ_i x_i (the sample mean).

(35)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Minimizing the Negative Log Likelihood

The problem : we search the parameter σ²

The solution : σ̂² = (1/N) Σ_i (x_i − μ̂)², as sketched below.

The full steps to obtain this result can be found at:

http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html
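A quick sketch of these closed-form maximum likelihood estimates on synthetic data (the true μ = 2 and σ = 3 below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)   # synthetic Gaussian data

mu_ml  = x.mean()                       # mu_hat = (1/N) sum_i x_i
var_ml = np.mean((x - mu_ml) ** 2)      # sigma_hat^2 = (1/N) sum_i (x_i - mu_hat)^2

print(mu_ml, var_ml)                    # close to 2 and 9 for N = 1000
```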

(36)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Overfitting phenomenon of the maximum likelihood

The maximum likelihood approach systematically underestimates the variance of the distribution.

Hypothesis : Data come from a Gaussian distribution with parameters 𝜇 and 𝜎2

Consider the expectations of these estimators (μ̂, σ̂²)

No problem for the mean: E[μ̂] = μ

(37)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Overfitting phenomenon of the maximum likelihood

The maximum likelihood approach systematically underestimates the variance of the distribution.

Hypothesis : Data come from a Gaussian distribution with parameters 𝜇 and 𝜎2

Consider the expectations of these estimators (μ̂, σ̂²)

!!! Problem for the variance: E[σ̂²] = ((N − 1)/N) σ²

(38)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum of likelihood:

Overfitting phenomenon of the maximum likelihood

The maximum likelihood estimate underestimates the true variance by a factor (N − 1)/N.

Full derivation: http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html
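A small simulation of this bias (the sample size, number of trials and true variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 100_000
samples = rng.normal(0.0, 1.0, size=(trials, N))      # true variance = 1
mu_hat = samples.mean(axis=1, keepdims=True)

var_ml       = np.mean((samples - mu_hat) ** 2, axis=1)           # divide by N
var_unbiased = np.sum((samples - mu_hat) ** 2, axis=1) / (N - 1)  # divide by N-1

print(var_ml.mean())         # ~ (N-1)/N = 0.8: biased low
print(var_unbiased.mean())   # ~ 1.0: unbiased
```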

(39)

Supervised learning : fitting probability models : On a Gaussian

Fitting by Maximum a posteriori: W=[mean,variance]

Well, it depends on the prior Pr(W).

See conjugate distributions:

with a conjugate prior, the posterior is proportional to a distribution of the same family as the prior.

With a non-informative prior, MAP reduces to maximum likelihood.

(40)

Learning a discriminative model

A discriminative model : Pr(y|x)

A discriminative model with its parameters:

Pr(y|x,W)

Fitting by maximum of likelihood:

That’s what your favorite standard neural network does

Link to : Minimizing the cross entropy : a nice trip from Maximum likelihood to Kullback–Leibler divergence

http://romain.raveaux.free.fr/document/CrossEntropy.html

Breaking news : In the context of a discriminative model for probabilistic classification, minimizing the cross entropy is equivalent to maximizing the likelihood

(41)

Learning a discriminative model (maximum of likelihood)

y1 = (0, 0, 1) over (landscape, house, car) → y1 = car

A classification problem with 3 classes : a landscape, a house or a car

ŷ1 = (0.2, 0.3, 0.5) over (landscape, house, car) → ŷ1 = car; ŷ1,1 = 0.2 is the first term of the vector

One-hot encoding

ŷ1 : probability distribution over the classes.

ŷ1,1 = probability of observing class 1 = 0.2

ŷ1 = Pr(y1 | x1, W)

Just some notation:

ŷi depends on the parameters W, but we do not always show it: ŷi(W)

(42)

Learning a discriminative model (maximum of likelihood)

y1 = (0, 0, 1) over (landscape, house, car) → y1 = car

A classification problem with 3 classes : a landscape, a house or a car

ŷ1 = (0.2, 0.3, 0.5) over (landscape, house, car) → ŷ1 = car; ŷ1,1 = 0.2 is the first term of the vector

One-hot encoding

max ŷ1 = max Pr(y1 | x1, W) = 0.5

arg max ŷ1 = arg max Pr(y1 | x1, W) = 3

Maximum likelihood

(43)

Learning a discriminative model (maximum of likelihood)

y1 = (0, 0, 1) over (landscape, house, car) → y1 = car

A classification problem with 3 classes : a landscape, a house or a car

ŷ1 = (0.2, 0.3, 0.5) over (landscape, house, car) → ŷ1 = car (one-hot encoding)

log ŷ1 = (log 0.2, log 0.3, log 0.5)

max log ŷ1 = max log Pr(y1 | x1, W) = log 0.5

arg max log ŷ1 = arg max log Pr(y1 | x1, W) = 3

Applying the log does not change the arg max.

(44)

Learning a discriminative model (maximum of likelihood)

y1 = (0, 0, 1) over (landscape, house, car) → y1 = car

A classification problem with 3 classes : a landscape, a house or a car

ŷ1 = (0.2, 0.3, 0.5) over (landscape, house, car) → ŷ1 = car (one-hot encoding); log ŷ1 = (log 0.2, log 0.3, log 0.5)

Cross entropy : H(y1, ŷ1) = − Σ_{k=1..3} y1,k log ŷ1,k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5

Recall : entropy : H(ŷ1) = − Σ_{k=1..3} ŷ1,k log ŷ1,k

(45)

Learning a discriminative model (maximum of likelihood)

y1 = (0, 0, 1) over (landscape, house, car) → y1 = car

A classification problem with 3 classes : a landscape, a house or a car

ŷ1 = (0.2, 0.3, 0.5) over (landscape, house, car) → ŷ1 = car (one-hot encoding); log ŷ1 = (log 0.2, log 0.3, log 0.5)

Cross entropy : H(y1, ŷ1) = − Σ_{k=1..3} y1,k log ŷ1,k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5

Maximum log likelihood : max log ŷ1 = max log Pr(y1 | x1, W) = log 0.5

Negative log likelihood : min −log ŷ1 = min −log Pr(y1 | x1, W) = −log 0.5

The cross-entropy sum reduces to this single term because the one-hot label is zero everywhere except at the true class.

And in this case the “1” of the label and the largest predicted probability are in the same position: the prediction agrees with the ground truth, so the parameters are good.

arg max f(x) = arg min −f(x)

(46)

Learning a discriminative model (maximum of likelihood)

y1 = (0, 0, 1) over (landscape, house, car) → y1 = car

A classification problem with 3 classes : a landscape, a house or a car

ŷ1 = (0.2, 0.3, 0.5) over (landscape, house, car) → ŷ1 = car (one-hot encoding); log ŷ1 = (log 0.2, log 0.3, log 0.5)

Cross entropy : H(y1, ŷ1) = − Σ_{k=1..3} y1,k log ŷ1,k = −0·log 0.2 − 0·log 0.3 − 1·log 0.5 = −log 0.5

Negative log likelihood : min −log ŷ1 = min −log Pr(y1 | x1, W) = −log 0.5

min_W −log Pr(y1 | x1, W) = min_W H(y1, ŷ1(W)); the prediction ŷ(W) depends on the parameters W

Minimizing the negative log likelihood is equivalent to minimizing the cross entropy in a classification context
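A sketch reproducing the three-class example above: the cross entropy against the one-hot label equals the negative log likelihood of the true class.

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0])    # one-hot label: class "car"
y_pred = np.array([0.2, 0.3, 0.5])    # predicted distribution Pr(y1 | x1, W)

cross_entropy = -np.sum(y_true * np.log(y_pred))    # -log 0.5
neg_log_lik   = -np.log(y_pred[y_true.argmax()])    # -log 0.5

print(cross_entropy, neg_log_lik)                   # both ~ 0.693
```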

(47)

Learning a discriminative model (maximum of likelihood)

Maximum likelihood :

The log does not impact the arg max

(48)

Learning a discriminative model (maximum of likelihood)

Maximum likelihood :

The log does not impact the arg max

Log rule

Applying the log rule

(49)

Learning a discriminative model (maximum of likelihood)

Maximum likelihood :

arg max f(x) = arg min –f(x)

(50)

Learning a discriminative model (maximum of likelihood)

Maximum likelihood :

Minimizing the negative log likelihood is equivalent to minimizing the cross entropy

ŷ(W) : the prediction ŷ depends on the parameters W

(51)

Learning a discriminative model (maximum of likelihood)

Maximum likelihood :

Maximizing the log likelihood is equivalent to minimizing the cross entropy in a classification context.

http://romain.raveaux.free.fr/document/CrossEntropy.html

Breaking news :

1. In the context of a discriminative model for probabilistic classification,

2. minimizing the cross entropy = minimizing the negative log-likelihood = maximizing the likelihood.

(52)

Learning a discriminative model (maximum of likelihood)

What would happen in regression context???

Let us say that 𝑦𝑖 ∈ ℝ

No more “One hot encoding”

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

(53)

Learning a discriminative model (maximum of likelihood)

Ok the classification case BUT

What would happen in a regression context ???

Let us say that 𝑦𝑖 ∈ ℝ

No more “One hot encoding”

So we need to make a hypothesis:

y_i has a Gaussian distribution with mean ŷ_i(W)

(54)

Learning a discriminative model (maximum of likelihood)

Ok the classification case BUT

What would happen in regression context???

Let us say that 𝑦𝑖 ∈ ℝ

No more “One hot encoding”

(55)

Learning a discriminative model (maximum of likelihood)

Ok the classification case BUT

What would happen in regression context???

Let us say that 𝑦𝑖 ∈ ℝ

No more “One hot encoding”

Negative log likelihood:

the log changes the product into a sum of logs

(56)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

Recall of the Gaussian distribution

Calculus of the log of the Gaussian

(57)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

(58)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

How to find the minimum with respect to W ????

(59)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

How to find the minimum with respect to W ????

What’s that ????

(60)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

How to find the minimum with respect to W ????

What’s that ???? The least squares criteria appears !!!

(61)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

4. With a Gaussian hypothesis :

Maximizing the likelihood is equivalent to minimizing the sum-of-squares error function !!!!

(62)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

4. With a Gaussian hypothesis :

1. Maximizing the likelihood is equivalent to minimizing the sum-of-squares error function !!!!

2. Thus the sum-of-squares error function has arisen as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.

(63)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

4. With a Gaussian hypothesis :

1. Said differently :

We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model.

y_i = ŷ_i(W) + ε

where ε is Gaussian noise with zero mean and variance σ²
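A minimal sketch of this equivalence on a one-parameter linear model (the model, the noise level σ² = 0.25 and the data below are illustrative assumptions): the negative log likelihood under Gaussian noise and the sum-of-squares error are minimized by the same W.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + rng.normal(0.0, 0.5, size=100)     # y_i = y_hat_i(W) + Gaussian noise

def neg_log_likelihood(w, sigma2=0.25):
    """NLL under y_i ~ N(w * x_i, sigma2): a scaled sum of squares plus constants."""
    r = y - w * x
    return 0.5 * np.sum(r ** 2) / sigma2 + 0.5 * len(y) * np.log(2 * np.pi * sigma2)

def sum_of_squares(w):
    return 0.5 * np.sum((y - w * x) ** 2)

ws = np.linspace(2.0, 4.0, 201)
print(ws[np.argmin([neg_log_likelihood(w) for w in ws])])   # same minimizer...
print(ws[np.argmin([sum_of_squares(w) for w in ws])])       # ...as least squares
```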

(64)

Learning a discriminative model (maximum of likelihood)

1. Learning a discriminative model, 2. by the maximum of likelihood 3. BUT in a regression context

4. With a Gaussian hypothesis :

Link to : Minimizing the least squares error

http://romain.raveaux.free.fr/document/LeastSquaresError.html

Breaking news : in the context of a discriminative model for regression, the sum-of-squares error function arises as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.

Fitting by maximum of likelihood:

That’s what your favorite standard neural network does

(65)

Learning a discriminative model (Maximum a posteriori)

A discriminative model : Pr(y|x)

A discriminative model with its parameters:

Pr(y|x,W)

By maximum a posteriori (MAP)

Bayes’ rule

(66)

Learning a discriminative model (Maximum a posteriori)

A discriminative model : Pr(y|x)

A discriminative model with its parameters:

Pr(y|x,W)

Maximum a posteriori (MAP)

The data D does not depend on W, so the denominator Pr(D) can be dropped from the arg max

(67)

Learning a discriminative model (Maximum a posteriori)

A discriminative model with its parameters:

Pr(y|x,W)

By maximum a posteriori (MAP)

Independent and identically distributed

(68)

Learning a discriminative model (Maximum a posteriori)

A discriminative model with its parameters: Pr(y|x,W)

By maximum a posteriori (MAP)

Consequence :

The same as maximum likelihood, plus

a prior has appeared : Pr(W)

Your favorite neural network can do that if you specify some structure on Pr(W)

(69)

Learning a discriminative model (Maximum a posteriori)

Let us try a Gaussian assumption on Pr(W) !!!!

α controls the variance of the distribution. I is the identity matrix.

The log :

(70)

Learning a discriminative model (Maximum a posteriori)

Let us plug this prior into the MAP problem

The maximum a posteriori problem becomes :

(71)

Learning a discriminative model (Maximum a posteriori)

Let us plug this prior into the MAP problem

The maximum a posteriori problem becomes :

What’s that?????

(72)

Learning a discriminative model (Maximum a posteriori)

Let us plug this prior into the MAP problem

The maximum a posteriori problem becomes :

Wᵀ = (W_1, W_2, … , W_D)

WᵀW = W_1 W_1 + W_2 W_2 + … + W_D W_D

WᵀW = W_1² + W_2² + ⋯ + W_D² = ||W||₂² : the L2 regularization, or L2 penalty

Cost function = loss (e.g. MSE or cross entropy) + regularization term

1. The L2 regularizer is known in the machine learning literature as weight decay.

2. It encourages weight values to decay towards zero.

3. In statistics, it provides an example of a parameter shrinkage method because it shrinks parameter values towards zero.

4. The error function remains a quadratic function, as in the sketch below.
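A sketch of the resulting MAP objective: the data loss plus an L2 penalty coming from the Gaussian prior on W (the linear model, the squared-error loss and the value of α are illustrative assumptions):

```python
import numpy as np

def map_objective(W, X, y, alpha=0.1):
    """Negative log posterior, up to constants: data loss + (alpha/2) * ||W||^2.

    The Gaussian prior on W contributes the L2 penalty (weight decay);
    alpha controls how strongly the weights are pulled towards zero.
    """
    y_hat = X @ W                                  # a linear model, for illustration
    data_loss = 0.5 * np.sum((y - y_hat) ** 2)     # negative log likelihood term
    l2_penalty = 0.5 * alpha * np.dot(W, W)        # negative log prior term
    return data_loss + l2_penalty

# Illustrative call on random data
rng = np.random.default_rng(0)
X, y, W = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
print(map_objective(W, X, y))
```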

(73)

Learning a discriminative model (Maximum a posteriori)

A discriminative model : Pr(y|x)

A discriminative model with its parameters:

Pr(y|x,W)

Maximum a posteriori (MAP)

Your favorite neural network can do that if you specify some structure on Pr(W)

Link to : Minimizing the least squares error with quadratic regularization

http://romain.raveaux.free.fr/document/LeastSquaresError.html

Breaking news : maximizing the posterior distribution is equivalent to minimizing a regularized error function

(74)

Learning a generative model

A generative model : Pr(x|y)

A generative model with its parameters: Pr(x|y,W)

Maximum of likelihood

That’s what a Generative Adversarial Network (GAN) does

Can be used for classification:

See Naive Bayes classifier

http://romain.raveaux.free.fr/document/NaiveBayesClassifier.html

(75)

Inference

Learning is great but we want to predict

Here comes the inference time:

we take a new set of measurements and use the model to tell us about the world state.

Inference with a discriminative model:

W* : Learned parameters

𝑥𝑛𝑒𝑤 : new datum

Easy as : Pr(y|𝑥𝑛𝑒𝑤,W*)

(76)

Inference

Learning is great but we want to predict

Here comes the inference time:

we take a new set of measurements and use the model to tell us about the world state.

Inference with a generative model:

Use Bayes’ rule : posterior distribution

See Naive Bayes classifier:

http://romain.raveaux.free.fr/document/NaiveBayesClassifier.html

(77)

Inference with a generative model:

An application :

Classification

Pixel classification

Skin or not skin pixel

𝑥𝑖 ∈ {0, … , 255} : Gray level

𝑦𝑖 ∈ {0; 1} : Skin or not

(78)

Inference with a generative model:

An application :

Classification

Pixel classification

Skin or not skin pixel

𝑥𝑖 ∈ {0, … , 255} : Gray level

𝑦𝑖 ∈ {0; 1} : Skin or not

Pr(𝑥𝑛𝑒𝑤|y,W)

modelled by a categorical distribution

W = {λ_k}, k = 0 … 255

W is obtained by maximum likelihood:

λ_k = (number of pixels of gray level k) / (total number of pixels), computed per class y

(79)

Inference with a generative model:

An application :

Classification

Pixel classification

Skin or not skin pixel

𝑥𝑖 ∈ {0, … , 255} : Gray level

𝑦𝑖 ∈ {0; 1} : Skin or not

Pr(y|W)

modelled by a Bernoulli distribution

W = {λ_k}, k ∈ {0, 1}

W is obtained by maximum likelihood:

λ_k = (number of pixels of class k) / (total number of pixels)
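A minimal sketch of these two maximum likelihood fits and of the Bayes-rule inference that follows; the training pixels here are random placeholders, not a real skin dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=10_000)   # gray levels, placeholder data
y = rng.integers(0, 2, size=10_000)     # skin / not-skin labels, placeholder data

# Pr(y | W): Bernoulli, fit by class counts / total number of pixels
prior = np.bincount(y, minlength=2) / len(y)

# Pr(x | y, W): one categorical distribution per class,
# lambda_k = (pixels of gray level k in class y) / (pixels of class y)
likelihood = np.zeros((2, 256))
for c in (0, 1):
    counts = np.bincount(x[y == c], minlength=256)
    likelihood[c] = counts / counts.sum()

# Inference on a new pixel: posterior proportional to Pr(x_new | y) Pr(y), then arg max
x_new = 128
posterior = likelihood[:, x_new] * prior
posterior /= posterior.sum()
print(posterior, posterior.argmax())
```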

(80)

Inference with a generative model:

(81)

Inference with a generative model:

Taking the arg max

(82)

Conclusion

Machine learning

Models

Generative

Discriminative

Inference

Maximum a posteriori (especially for generative models)

Maximum likelihood

Learning

Maximum likelihood (overfitting ?)

Maximum a posteriori (includes some knowledge about the parameters)

(83)

A Probabilistic View

Learning and inference explained by decision theory and probabilities : A summary

(84)

Next

We will connect local models !!!

http://romain.raveaux.free.fr/document/Connecting%20local%20models%20 the%20case%20of%20chains%20.pdf
