
Connecting local models


Academic year: 2022


(1)

Supervised machine learning: Connecting local models

The case of chains

Romain Raveaux

Maître de conférences (Associate Professor)

Université de Tours

Laboratoire d’informatique (LIFAT), RFAI team

(2)

Content :

• This presentation is a follow up to « Supervised machine learning : the case of independent samples »

http://romain.raveaux.free.fr/document/courssupervisedmachinelearningRaveaux.pdf

• The case of a chain :

Probabilistic models connected to form a chain:

A first step beyond Independent and Identically Distributed (I.I.D)

Models

Inference

Learning

Directed and undirected models

(3)

Content :

• The case of a chain :

Probabilistic models connected to form a chain:

A first step beyond Independent and Identically Distributed (I.I.D)

Models

Directed

Undirected

Inference

Directed model : HMM

Undirected models : MRF, CRF

Learning

Directed model : HMM

Undirected models : MRF, CRF

(4)

Independent and Identically Distributed

• A set of samples :

This set is described by a single distribution $\Pr(x)$.

Each sample is drawn from $\Pr(x)$.

Each sample is independent:

The joint distribution $\Pr(x_1, \dots, x_M)$ is the product, over all data points, of the probability distribution evaluated at each data point: $\Pr(x_1, \dots, x_M) = \prod_{i=1}^{M} \Pr(x_i)$.

(5)

Beyond : Independent and Identically Distributed

• For many applications, however, the i.i.d. assumption will be a poor one.

Describe sequential data (time series)

the rainfall measurements on successive days at a particular location

the sequence of characters in an English sentence

the daily values of a currency exchange rate

• In the general case, the character $x_M$ depends on its M-1 parent characters $x_1, \dots, x_{M-1}$.

(6)

A first-order Markov chain

• Markov Assumption :

The future depends only on the present: $\Pr(x_i \mid x_1, \dots, x_{i-1}) = \Pr(x_i \mid x_{i-1})$

(7)

A first-order Markov chain

We have three samples :
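For three samples, the first-order assumption gives $\Pr(x_1, x_2, x_3) = \Pr(x_1)\Pr(x_2 \mid x_1)\Pr(x_3 \mid x_2)$. A minimal Python sketch of this product follows; the two-state weather chain and all its numbers are invented for illustration and do not come from the slides.

```python
# Hypothetical two-state chain; every number here is illustrative only.
initial = {"rain": 0.3, "dry": 0.7}            # Pr(x_1)
transition = {                                  # Pr(x_i | x_{i-1})
    "rain": {"rain": 0.6, "dry": 0.4},
    "dry": {"rain": 0.2, "dry": 0.8},
}

def chain_probability(sequence):
    """Joint probability of a sequence under a first-order Markov chain."""
    prob = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]          # one conditional per link
    return prob

# Pr(dry, dry, rain) = Pr(dry) * Pr(dry | dry) * Pr(rain | dry)
p = chain_probability(["dry", "dry", "rain"])
```

Each factor looks only one step back, which is exactly what the Markov assumption allows.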

(8)

A second-order Markov chain

Markov Assumption :

The future depends only on the present and yesterday.

(9)

Directed models

• Directed models

The joint probability distribution is factorized into a product of conditional distributions

Markov chains are examples of directed models

(10)

Undirected models

• Undirected models

The joint probability distribution is factorized into a product of potential functions (𝜙)

The potential function (𝜙) always returns a positive number

𝜙 expresses how compatible the values taken by the variables x are with one another.

(11)

Undirected models

• The joint probability distribution is factorized into a product of potential functions (𝜙)

• The potential (𝜙) function always returns a positive number

• The term Z is known as the partition function and normalizes the

product of these positive functions so that the total probability is one.

• C is the number of cliques in the graphical model
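The factorization these bullets describe is commonly written as below. This is a reconstruction, not the slide's own equation: the clique index $c$ and the notation $\phi_c$ are assumed here for illustration.

```latex
\Pr(x_1, \dots, x_M) = \frac{1}{Z} \prod_{c=1}^{C} \phi_c(x_1, \dots, x_M),
\qquad
Z = \sum_{x_1} \dots \sum_{x_M} \prod_{c=1}^{C} \phi_c(x_1, \dots, x_M)
```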

(12)

Undirected model : Markov random fields

• Markov random fields (MRF) are undirected models

• MRF are undirected models with Markov assumptions :

Potential functions operate on a subset of the variables (𝑆)

(13)

Undirected model : Conditional random fields

• A conditional random field (CRF) is a special case of an MRF

• in which the probability of x is conditioned on another variable,

• for instance $\Pr(x_1, \dots, x_M \mid z_1, \dots, z_M)$.

This is described in more detail later.

(14)

OK, let us go back to a generative model.

The product rule gives us : Pr(Y,X)= Pr(X|Y)Pr(Y)

a Markov chain of variables y

(15)

OK, let us go back to a generative model.

Let’s put it together

(16)

I know this model

• This is known as a

Hidden Markov model (HMM) when 𝑦𝑖 is discrete

Kalman Filter model when 𝑦𝑖 is continuous

(17)

HMM : Factor graph

Factor graph representation

(18)

HMM : Factor graph

Factor graph representation

• There is one node per variable (circles)

• One function node per term in the factorization (squares).

• Each function node connects to all of the variables associated with this term.

(19)

Let us see some undirected models

• Markov random fields (MRF)

• Conditional random fields (CRF)

(20)

Markov Random Fields (MRF)

• HMM is a directed model

The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k.  Pr(𝑥𝑖|𝑦𝑖 = 𝑘)

The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)

• Undirected model

$\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function

Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ with $y_i \in \{-1, 1\}$

Returns larger values when the adjacent states are more compatible.

(21)

Markov Random Fields (MRF)

MRF = Generative model

(22)

Markov Random Fields (MRF)

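HMM:

$\Pr(y_1, y_2, y_3) = \Pr(y_1) \Pr(y_2 \mid y_1) \Pr(y_3 \mid y_2)$

$\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$

MRF:

$\Pr(y_1, y_2, y_3) = \frac{1}{Z_1} \zeta(y_1, y_2) \zeta(y_2, y_3) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$ with $M = 3$

$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \dots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$

The difference lies in the dependencies between the y variables.

MRF = Generative model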

(23)

Conditional Random Fields (CRF)

HMM is a directed model

The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k.  Pr(𝑥𝑖|𝑦𝑖 = 𝑘)

The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)

CRF is an undirected model

$\Pr(x_i \mid y_i = k)$ is replaced by $\phi(x_i, y_i)$: a similarity function

Returns larger values when the measurement $x_i$ and the world state $y_i$ are more compatible.

Example: $\phi(x_i, y_i) = x_i \cdot y_i$ with $x_i \in \{-1, 1\}$ and $y_i \in \{-1, 1\}$

$\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function

Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ with $y_i \in \{-1, 1\}$

Returns larger values when the adjacent states are more compatible.

CRF are conditional models :

The x variables are given (observed/fixed).

(24)

From Markov Random Fields (MRF) to Conditional Random Fields (CRF)

MRF = Generative model. So far, this is still an MRF.

(25)

From Markov Random Fields (MRF) to Conditional Random Fields (CRF)

• We use Bayes’ rule to condition on the data X

• To obtain the posterior distribution

From MRF to CRF

(26)

Conditional Random Fields (CRF)

CRF = Discriminative model (from MRF to CRF)

CRF is a special case of MRF

The data nodes 𝑥1, … , 𝑥𝑀 are fixed/given

(27)

Inference

• We take a new set of measurements

A new sequence : 𝑥𝑛𝑒𝑤

• And we use the model to tell us about the world state.

We want to make a prediction from 𝑥𝑛𝑒𝑤 (to infer from 𝑥𝑛𝑒𝑤)

• We take a datum 𝑥𝑛𝑒𝑤 and use it to infer the state of the world, the sequence y.

(28)

Inference

• Directed models :

HMM

• Undirected models :

MRF and CRF

(29)

Inference for HMM

(30)

Inference for HMM : maximum a posteriori (MAP)

Inference with a generative model:

𝑥𝑛𝑒𝑤 are observed variables

We seek the maximum of $\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new})$

Use Bayes’ rule to obtain the posterior distribution

Bayes’ rule

(31)

Inference for HMM : maximum a posteriori (MAP)

• $\Pr(x)$ does not depend on the y variables, so it can be removed from the optimization.
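Written out, the argument reads as follows. This is a reconstruction from the surrounding text, using the shorthand $x = (x_1^{new}, \dots, x_M^{new})$ and $y = (y_1, \dots, y_M)$, which is assumed here for brevity.

```latex
\hat{y} = \arg\max_{y} \Pr(y \mid x)
        = \arg\max_{y} \frac{\Pr(x \mid y)\,\Pr(y)}{\Pr(x)}
        = \arg\max_{y} \Pr(x \mid y)\,\Pr(y)
```

Dividing by a positive constant does not move the location of the maximum, which is why the denominator can be dropped.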

(32)

Inference for HMM : maximum a posteriori (MAP)

• Let us detail the model :

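• Likelihood: $\Pr(x_1, \dots, x_M \mid y_1, \dots, y_M) = \prod_{i=1}^{M} \Pr(x_i \mid y_i)$

• Prior: $\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$. The prior is a Markov chain of the variables y.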

(33)

Inference for HMM : maximum a posteriori (MAP)

• Let us introduce a log function

To change the products into sums for a better numerical computation.

• Let us introduce a minus to change from a maximization to minimization problem : negative log
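The numerical motivation for the log can be seen in a small Python sketch. The sequence of 200 identical probabilities is a made-up example, chosen only so that the plain product leaves the range of double-precision floats.

```python
import math

# Hypothetical per-step probabilities for a long sequence (illustrative values).
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p                      # 1e-400 is not representable: underflows to 0.0

# The negative log turns the product into a sum that stays finite.
neg_log = -sum(math.log(p) for p in probs)
```

Once the product has collapsed to zero, every candidate sequence looks equally bad, whereas the summed negative logs can still be compared.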

(34)

Inference for HMM

Where

$U_i$ is a unary term and depends only on a single variable $y_i$, and

$P_i$ is a pairwise term, depending on two variables $y_i$ and $y_{i-1}$.
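One way to write the resulting minimization, consistent with the negative log of the HMM factorization, is sketched below. It is a reconstruction rather than the slide's own formula; folding the $-\log \Pr(y_1)$ term into $U_1$ is a convention assumed here.

```latex
\hat{y}_{1 \dots M} = \arg\min_{y_1, \dots, y_M}
  \left[ \sum_{i=1}^{M} U_i(y_i) + \sum_{i=2}^{M} P_i(y_i, y_{i-1}) \right],
\qquad
U_i(y_i) = -\log \Pr(x_i \mid y_i),
\quad
P_i(y_i, y_{i-1}) = -\log \Pr(y_i \mid y_{i-1})
```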

(35)

Inference for HMM

We consider that all the distributions are known

How can we solve this optimization problem ?

y variables are discrete variables

Let us enumerate all possible combinations:

𝑦𝑖 can take K different states

A sequence is composed of M elements

There are $K^M$ possibilities

Can we do better than brute-force enumeration?
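To make the brute-force baseline concrete, here is a minimal Python sketch that enumerates all $K^M$ labellings of a tiny HMM and keeps the cheapest one. The model (two states, binary observations) and every probability in it are hypothetical values chosen for illustration.

```python
import itertools
import math

# Hypothetical toy HMM: K = 2 states, M = 3 binary observations.
K = 2
prior = [0.6, 0.4]                    # prior[k]    = Pr(y_1 = k)
trans = [[0.7, 0.3], [0.4, 0.6]]      # trans[l][k] = Pr(y_i = k | y_{i-1} = l)
emit = [[0.9, 0.1], [0.2, 0.8]]       # emit[k][o]  = Pr(x_i = o | y_i = k)
x_new = [0, 1, 1]                     # observed sequence

def cost(y):
    """Negative log of Pr(x | y) Pr(y) for one candidate labelling y."""
    total = -math.log(prior[y[0]]) - math.log(emit[y[0]][x_new[0]])
    for i in range(1, len(y)):
        total += -math.log(trans[y[i - 1]][y[i]]) - math.log(emit[y[i]][x_new[i]])
    return total

# Enumerate every labelling: K**M candidates, exponential in the sequence length.
candidates = list(itertools.product(range(K), repeat=len(x_new)))
best = min(candidates, key=cost)
```

With K = 2 and M = 3 there are only 8 candidates, but the list grows by a factor of K with every extra element in the sequence.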

(36)

Inference for HMM

We can do better !!!

But why ???

The problem comes from the dependence between the y variables

We can take advantage of the factorization

(37)

Inference for HMM

The problem is broken down into simpler problems !!!

(38)

Inference for HMM

We consider that all the distributions are known

How can we solve this optimization problem ?

y variables are discrete variables

It can be solved in polynomial time using the Viterbi algorithm which is an example of dynamic programming.
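A compact Python sketch of this dynamic program is given below. It is one possible way to write the Viterbi recursion, not the author's implementation; the function name `viterbi`, the cost-table layout and the toy HMM numbers are all assumptions made for the example.

```python
import math

def viterbi(unary, pairwise):
    """Minimise sum_i U_i(y_i) + sum_{i>=2} P_i(y_i, y_{i-1}) along a chain.

    unary[i][k]       : cost U_i(y_i = k)
    pairwise[i][l][k] : cost P_i(y_i = k, y_{i-1} = l), defined for i >= 1 (0-based)
    """
    M, K = len(unary), len(unary[0])
    best = list(unary[0])        # best[k]: cheapest path ending in state k so far
    back = []                    # back[i-1][k]: best predecessor of state k at step i
    for i in range(1, M):
        new_best, pointers = [], []
        for k in range(K):
            l_star = min(range(K), key=lambda l: best[l] + pairwise[i][l][k])
            pointers.append(l_star)
            new_best.append(best[l_star] + pairwise[i][l_star][k] + unary[i][k])
        best = new_best
        back.append(pointers)
    # Trace the optimal path backwards from the cheapest final state.
    state = min(range(K), key=lambda k: best[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, min(best)

# Hypothetical toy HMM (K = 2 states, binary observations).
K = 2
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]      # trans[l][k] = Pr(y_i = k | y_{i-1} = l)
emit = [[0.9, 0.1], [0.2, 0.8]]       # emit[k][o]  = Pr(x_i = o | y_i = k)
x_new = [0, 1, 1]

unary = [[-math.log(emit[k][x]) for k in range(K)] for x in x_new]
for k in range(K):
    unary[0][k] -= math.log(prior[k])  # fold Pr(y_1) into the first unary term
step = [[-math.log(trans[l][k]) for k in range(K)] for l in range(K)]
pairwise = [None] + [step for _ in x_new[1:]]

path, total_cost = viterbi(unary, pairwise)
```

Each of the M-1 steps compares K predecessors for each of K states, so the work grows as $MK^2$ rather than $K^M$.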

(39)

Inference for HMM : MAP

• Case of K = 5 classes: $y_i \in \{1, 2, 3, 4, 5\}$

• Build a graph

The set of vertices $\{V_{i,k}\}_{i=1,k=1}^{M,5}$

Each vertex $V_{i,k}$ has a set of incoming edges $\{(V_{i-1,l}, V_{i,k})\}_{l=1}^{5}$

Each vertex $V_{i,k}$ has a cost $U_i(y_i = k)$

Each edge $(V_{i-1,l}, V_{i,k})$ has a cost $P_i(y_i = k, y_{i-1} = l)$

• Find the shortest path from left to right

With the distance defined as $d((V_{i-1,l}, V_{i,k})) = U_{i-1}(y_{i-1} = l) + P_i(y_i = k, y_{i-1} = l)$

Dijkstra’s algorithm can be used

(40)

Inference for HMM : Viterbi in short

For this slide only: Change of notation y variables become w

For this slide only: Change of notation i becomes n

Find the shortest path from left to right

Picture taken from:

(41)

Inference for HMM with Dijkstra’s algorithm

There are two extra nodes at the start and the end of the graph.
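The shortest-path view can be sketched in Python with a standard priority-queue Dijkstra. This is an illustrative sketch under stated assumptions: all costs are non-negative (true for negative logs of discrete probabilities), the pairwise cost is shared by all steps, and the node encoding, the function name `trellis_shortest_path` and the toy HMM numbers are invented for the example.

```python
import heapq
import math

def trellis_shortest_path(unary, pairwise):
    """Dijkstra on the trellis, with an extra start node and an extra end node.

    unary[i][k]    : cost of vertex V_{i,k}
    pairwise[l][k] : cost of moving from state l to state k (same for every step)
    Edge weight follows d((V_{i-1,l}, V_{i,k})) = U_{i-1} + P_i.
    """
    M, K = len(unary), len(unary[0])
    source, sink = (-1, 0), (M, 0)
    dist, parent = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist[node]:
            continue                              # stale queue entry
        i, l = node
        if node == source:
            edges = [((0, k), 0.0) for k in range(K)]
        elif i == M - 1:
            edges = [(sink, unary[i][l])]         # pay the last vertex cost on exit
        else:
            edges = [((i + 1, k), unary[i][l] + pairwise[l][k]) for k in range(K)]
        for nxt, weight in edges:
            if d + weight < dist.get(nxt, math.inf):
                dist[nxt] = d + weight
                parent[nxt] = node
                heapq.heappush(heap, (d + weight, nxt))
    labels, node = [], parent[sink]
    while node != source:
        labels.append(node[1])
        node = parent[node]
    return labels[::-1], dist[sink]

# Hypothetical toy HMM (K = 2 states, binary observations).
K = 2
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
x_new = [0, 1, 1]
unary = [[-math.log(emit[k][x]) for k in range(K)] for x in x_new]
for k in range(K):
    unary[0][k] -= math.log(prior[k])             # fold Pr(y_1) into the first vertex cost
pairwise = [[-math.log(trans[l][k]) for k in range(K)] for l in range(K)]

labels, total_cost = trellis_shortest_path(unary, pairwise)
```

The two extra nodes give the search a single origin and a single destination, so the shortest start-to-end path visits exactly one vertex per position in the sequence.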

(42)

Inference for HMM : Complexity

• MAP inference :

Brute force approach: $O(K^M)$

Enumerating all the combinations

Sequence of length 3, K=5: 111; 112; 113; 114; 115; 121; 122; …

Viterbi: $O(MK^2)$

(43)

Inference for MRF

(44)

Inference for Markov Random Fields (MRF)

• HMM is a directed model

The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k.  Pr(𝑥𝑖|𝑦𝑖 = 𝑘)

The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)

• Undirected model

$\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function

Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ with $y_i \in \{-1, 1\}$

Returns larger values when the adjacent states are more compatible.

(45)

Inference for Markov Random Fields (MRF)

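HMM:

$\Pr(y_1, y_2, y_3) = \Pr(y_1) \Pr(y_2 \mid y_1) \Pr(y_3 \mid y_2)$

$\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$

MRF:

$\Pr(y_1, y_2, y_3) = \frac{1}{Z_1} \zeta(y_1, y_2) \zeta(y_2, y_3) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$ with $M = 3$

$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \dots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$

The difference lies in the dependencies between the y variables.

MRF = Generative model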

(46)

Inference for MRF : maximum a posteriori (MAP)

Inference with a generative model:

𝑥𝑛𝑒𝑤 are observed variables

We seek the maximum of $\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new})$

Use Bayes’ rule to obtain the posterior distribution

Bayes’ rule

MRF are generative models

So far no changes with respect to HMM.

(47)

Inference for MRF : maximum a posteriori (MAP)

• $\Pr(x)$ does not depend on the y variables, so it can be removed from the optimization.

MRF are generative models

So far no changes with respect to HMM.

(48)

Inference for MRF : maximum a posteriori (MAP)

• Let us detail the model :

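• Likelihood: $\Pr(x_1, \dots, x_M \mid y_1, \dots, y_M) = \prod_{i=1}^{M} \Pr(x_i \mid y_i)$

• Prior: $\Pr(y_1, \dots, y_M) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$. The prior is a Markov random field of the variables y.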

(49)

Inference for MRF : maximum a posteriori (MAP)

• What about 𝑍1?

Does it impact the optimization problem ?

Ask the students ?

(50)

Inference for MRF : maximum a posteriori (MAP)

• What about 𝑍1?

Does it impact the optimization problem ?

Does 𝑍1 depend on a specific value of 𝑦1, … , 𝑦𝑀 ?

𝑍1 is a sum over all possible values of 𝑦1, … , 𝑦𝑀

It is constant with respect to the states 𝑦1, … , 𝑦𝑀

So 𝑍1 does not impact the optimization problem.

So $Z_1$ can be removed from the optimization problem

$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \dots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$

(51)

Inference for MRF : maximum a posteriori (MAP)

• Let us introduce a log function

To change the products into sums for a better numerical computation.

• Let us introduce a minus to change from a maximization to minimization problem : negative log

(52)

Inference for MRF

Where

$U_i$ is a unary term and depends only on a single variable $y_i$, and

$P_i$ is a pairwise term, depending on two variables $y_i$ and $y_{i-1}$.

Inference for MRF is close to inference for HMM

$\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$

Can be solved by dynamic programming (Viterbi algorithm)

In the case of chains

(53)

Inference for CRF

(54)

Inference for Conditional Random Fields (CRF)

HMM is a directed model

The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k.  Pr(𝑥𝑖|𝑦𝑖 = 𝑘)

The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)

CRF is an undirected model

$\Pr(x_i \mid y_i = k)$ is replaced by $\phi(x_i, y_i)$: a similarity function

Returns larger values when the measurement $x_i$ and the world state $y_i$ are more compatible.

Example: $\phi(x_i, y_i) = x_i \cdot y_i$ with $x_i \in \{-1, 1\}$ and $y_i \in \{-1, 1\}$

$\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function

Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ with $y_i \in \{-1, 1\}$

Returns larger values when the adjacent states are more compatible.

CRF are conditional models :

The x variables are given (observed/fixed).

(55)

Inference for Conditional Random Fields (CRF)

Maximum a posteriori : MAP

(56)

Inference for Conditional Random Fields (CRF)

Maximum a posteriori : MAP

Does not depend on y

Does not depend on y

So we get :

(57)

Inference for Conditional Random Fields (CRF)

• Let us take the negative log:

(58)

Inference for Conditional Random Fields (CRF)

Where

$U_i$ is a unary term and depends only on a single variable $y_i$, and

$P_i$ is a pairwise term, depending on two variables $y_i$ and $y_{i-1}$.

Inference for CRF is close to inference for MRF

$\Pr(x_i \mid y_i)$ is replaced by $\phi(x_i, y_i)$

Can be solved by dynamic programming (Viterbi algorithm)

In the case of chains

(59)

Learning

• Learning with HMM

• Learning with MRF

• Learning with CRF

(60)

Learning

• The data set:

$\mathcal{D} = \{(x^{(j)}, y^{(j)})\}_{j=1}^{N}$ (N data samples)

$x^{(j)} = (x_1^{(j)}, \dots, x_i^{(j)}, \dots, x_M^{(j)})$ (one input sequence of M elements)

$y^{(j)} = (y_1^{(j)}, \dots, y_i^{(j)}, \dots, y_M^{(j)})$ (one output sequence of M elements)

• $y_i^{(j)} \in \{1, 2, \dots, K\}$: usually a discrete domain (K classes)

(61)

Learning with HMM

(62)

Learning with HMM

• So far there is no learning just Inference

We just give a sequence (x) and output the labeled sequence (y)

• Where can learning be introduced?

Where are the parameters ?

• Let us take an example :

The measurements (x) have a normal distribution

The class variable (y) follows a categorical law.

This hidden Markov model has parameters: $W = \{\mu_k, \sigma_k, \lambda_k\}_{k=1}^{K}$

One Gaussian per class to model $x_1^{(j)}, \dots, x_i^{(j)}, \dots, x_M^{(j)}$

Categorical distribution

(63)

Learning with HMM

• Let us express the joint probability of the data set 𝒟

For one input and one output, the HMM model is :

(64)

Learning with HMM

• Let us express the joint probability of the data set 𝒟

For the whole data set :

Assuming each data sequence was drawn independently from the same distribution (i.i.d.: independent and identically distributed)

(66)

Learning with HMM

Let us fit the probability model by maximum likelihood

(69)

Learning with HMM

• Supervised learning

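Relatively simple. We first isolate the part of the model that we want to learn.

For example, we might learn the parameters of the measurement model $\Pr(x_i \mid y_i)$ from paired examples of $x_i$ and $y_i$.

We can learn these parameters in isolation using the ML, MAP, or Bayesian methods.

The same applies for the transition model $\Pr(y_i \mid y_{i-1})$, learned from pairs of consecutive states.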
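A minimal Python sketch of supervised maximum-likelihood fitting, assuming one Gaussian per class for the measurements and a table of transition probabilities for the states. The two training sequences are hypothetical, and estimating the transitions by normalised counts is the standard ML estimate rather than something spelled out on the slide.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical labelled sequences: (measurements x, states y in {0, 1}).
data = [
    ([0.1, 0.3, 2.1, 1.9], [0, 0, 1, 1]),
    ([2.2, 1.8, 0.2, 0.0], [1, 1, 0, 0]),
]

# Measurement model: fit one Gaussian per class from the paired (x_i, y_i).
by_class = defaultdict(list)
for xs, ys in data:
    for x, y in zip(xs, ys):
        by_class[y].append(x)
mu = {k: mean(values) for k, values in by_class.items()}
sigma = {k: pstdev(values) for k, values in by_class.items()}  # ML estimate divides by n

# Transition model: normalised counts of consecutive state pairs (y_{i-1}, y_i).
counts = Counter()
for _, ys in data:
    counts.update(zip(ys, ys[1:]))
states = sorted(by_class)
trans = {
    l: {k: counts[(l, k)] / sum(counts[(l, j)] for j in states) for k in states}
    for l in states
}
```

Because the labels are observed, the two parts of the model never interact during fitting: each is estimated from its own sufficient statistics.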

(70)

Learning with HMM

• Unsupervised learning

More challenging

Beyond the scope of this presentation (dedicated to supervised learning)

Requires notions such as:

1. Expectation-Maximization method

2. Forward-backward method

3. 1 + 2 = Baum-Welch algorithm

See: Gaussian Mixture and Expectation Maximization algorithm, http://romain.raveaux.free.fr/document/GaussianMixtureandExpectationMaximization.html

(71)

Limit of HMM

• Parameters ($W_1$ and $W_2$) of the distributions are shared through time

All time steps have the same parameters

Corresponding to the assumption of a stationary time series.

• HMM are generative models

Discriminative models infer the world state (y) more directly

• Although this is more general than the independence model, it is still very restrictive.

Markov hypothesis: the future depends solely on the present

(72)

Learning with MRF

(73)

Learning with MRF

• Let us express the joint probability of the data set 𝒟

For one input and one output, the MRF model is :

(75)

Learning with MRF

• Let us express the joint probability of the data set 𝒟

For the whole data set :

Assuming each data sequence was drawn independently from the same distribution (i.i.d.: independent and identically distributed)

(77)

Learning with MRF

Let us fit the probability model by maximum likelihood

(80)

Learning with MRF

Let us fit the probability model by maximum likelihood

Like learning with HMM.

Here we have a difference: Z

Z appears in the learning stage !!!

ouch

(81)

Learning with MRF

Let us fit the probability model by maximum likelihood

Z does not depend on y (remember?)

Cool, we just have to compute it once

(82)

Learning with MRF

Let us fit the probability model by maximum likelihood

Is computing Z challenging?

$K^M$ summations to do!

How can Z be computed efficiently? An important question!

(83)

Learning with MRF

Let us fit the probability model by maximum likelihood

• How to compute Z efficiently?

Generally speaking, take advantage of the structure of the problem at hand

Two arguments can be used in the case of chains:

Exploit the structured factorization of the problem

Based on conditional independence

We can use one of them to compute Z efficiently

Let us exploit the factorization property

(84)

Forward algorithm for computing Z

We observe that not every term in the product is relevant to every summation.

We can re-arrange the summation terms so that only the variables over which they sum are to the right

We proceed from right to left, computing each summation in turn.

This technique is known as variable elimination.

(85)

Forward algorithm for computing Z

• Let us denote the rightmost three terms :

(86)

Forward algorithm for computing Z

• At the 𝑖𝑡ℎ stage we compute :

• We repeat this process until we have computed the full expression.

• This solution consists of M-1 summations over K values (K classes)

• It is much more efficient than explicitly summing over all $K^M$ configurations
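The recursion can be checked against brute force on a small chain. In the Python sketch below, the potential $\zeta = \exp(y_i \cdot y_{i-1})$ is an assumption made here so that the potential stays strictly positive; the chain length and function names are likewise illustrative.

```python
import itertools
import math

# Hypothetical chain MRF: states in {-1, +1}, M = 6 variables.
states = [-1, 1]
M = 6

def zeta(y_i, y_prev):
    # Assumed positive potential: exponentiated product of adjacent states.
    return math.exp(y_i * y_prev)

def partition_brute_force():
    """Sum the product of potentials over all K**M configurations."""
    total = 0.0
    for y in itertools.product(states, repeat=M):
        total += math.prod(zeta(y[i], y[i - 1]) for i in range(1, M))
    return total

def partition_forward():
    """Eliminate one variable per stage: M-1 stages, each a sum over K values."""
    f = {k: 1.0 for k in states}                 # nothing eliminated yet
    for _ in range(1, M):
        # f_i(y_i) = sum over y_{i-1} of zeta(y_i, y_{i-1}) * f_{i-1}(y_{i-1})
        f = {k: sum(zeta(k, l) * f[l] for l in states) for k in states}
    return sum(f.values())                       # final sum over y_M

z_slow = partition_brute_force()
z_fast = partition_forward()
```

Both functions return the same value; only the forward version remains practical as M grows.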

(87)

Forward algorithm for computing Z

• At the 𝑖𝑡ℎ stage we compute :

• This solution consists of M-1 summations over K values (K classes)

• We can now perform the learning stage for the MRF ???

(88)

Learning with MRF

Well we are not safe yet !!!

Let us take the negative-log

How can we solve this problem ????

Ask the student ?

(89)

Learning with MRF

Well we are not safe yet !!!

Let us take the negative-log

The min is where the derivative is equal to zero :

Let us assume that 𝜁 is differentiable

(90)

Learning with MRF

• This problem can be solved by gradient descent :

Do you remember?

(91)

Learning with MRF

• This problem can be solved by gradient descent :

Where 𝛼 is the learning rate
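The update in question is the usual rule $W \leftarrow W - \alpha \nabla L(W)$. A minimal Python sketch on a stand-in objective follows; the quadratic $L(w) = (w - 3)^2$, the step count and the function name are assumptions for illustration and have nothing to do with the MRF likelihood itself.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=200):
    """Repeat w <- w - alpha * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Hypothetical stand-in objective L(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_hat = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

A learning rate that is too large makes the iterates diverge, while one that is too small slows convergence.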

(92)

Learning with MRF

• This problem can be solved by gradient descent :

Where 𝛼 is the learning rate

• But How to compute the gradient ?

(93)

Learning with MRF

• The second term is easy to compute :

The sum of the derivatives of the log of $\zeta$

(94)

Learning with MRF

• The first term is tricky:

It contains the sum of the derivatives of the log of $\zeta$

It involves an intractable sum over all possible values of y

(95)

Learning with MRF

• The first term is tricky:

It contains the sum of the derivatives of the log of $\zeta$

It involves an intractable sum over all possible values of y

Intractable? Not for us, not in the case of chains:

Thanks to the forward algorithm

We can do that with M-1 summations over K values (K classes)
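Putting the pieces together, a toy end-to-end fit might look like the Python sketch below. Everything specific in it is an assumption for illustration: the one-parameter potential $\zeta = \exp(w \cdot y_i \cdot y_{i-1})$, the three training sequences, and the use of a central finite difference in place of the analytic gradient. Only the forward recursion for $\log Z$ follows the slides.

```python
import math

states = [-1, 1]

def log_partition(w, M):
    """log Z(w) for zeta = exp(w * y_i * y_{i-1}), via the forward recursion."""
    f = {k: 1.0 for k in states}
    for _ in range(1, M):
        f = {k: sum(math.exp(w * k * l) * f[l] for l in states) for k in states}
    return math.log(sum(f.values()))

def neg_log_likelihood(w, sequences):
    """Sum over sequences of log Z(w) minus the log of the unnormalised score."""
    total = 0.0
    for y in sequences:
        score = sum(w * y[i] * y[i - 1] for i in range(1, len(y)))
        total += log_partition(w, len(y)) - score
    return total

# Hypothetical training sequences of states.
sequences = [[1, 1, 1, -1, -1], [-1, -1, 1, 1, 1], [1, 1, 1, 1, -1]]

w, alpha, eps = 0.0, 0.05, 1e-5
for _ in range(500):
    # Assumption: numerical gradient by central finite difference.
    grad = (neg_log_likelihood(w + eps, sequences)
            - neg_log_likelihood(w - eps, sequences)) / (2 * eps)
    w -= alpha * grad
```

For this toy potential the optimum has a closed form, $\tanh(w)$ equal to the average product of adjacent states, which gives a quick sanity check on the learned value.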

(96)

Learning with CRF

(97)

Learning with CRF

• Let us express the conditional probability over the observed data x

For one input and one output, the CRF model is :

(99)

Learning with CRF

• Let us express the conditional probability of the data set 𝒟

For the whole data set :

Assuming each data sequence was drawn independently from the same distribution (i.i.d.: independent and identically distributed)

(101)

Learning with CRF

Let us fit the probability model by maximum likelihood

(102)

Learning with CRF

Let us fit the probability model by maximum likelihood

Z does not depend on y (remember?)

Does not depend on W

(104)

Learning with CRF

Let us fit the probability model by maximum likelihood

It is close to the case of MRF

We have to compute Z efficiently

To do so, we can use the factorization property of the problem (The case of chains)

(105)

Learning with CRF : Computation of Z

• Let us use the forward algorithm

• W is observed/fixed/given, so we omit it in this explanation

It makes notation simpler

(106)

Learning with CRF : Computation of Z

We observe that not every term in the product is relevant to every summation.

We can re-arrange the summation terms so that only the variables over which they sum are to the right

We proceed from right to left, computing each summation in turn.

This technique is known as variable elimination.

(107)

Forward algorithm for computing Z

• Let us denote the rightmost three terms :

(108)

Forward algorithm for computing Z

• At the 𝑖𝑡ℎ stage we compute :

• We repeat this process until we have computed the full expression.

• This solution consists of M summations over K values (K classes)

• It is much more efficient than explicitly summing over all $K^M$ configurations

(109)

Forward algorithm for computing Z

• At the 𝑖𝑡ℎ stage we compute :

• This solution consists of M summations over K values (K classes)

• We can now perform the learning stage for the CRF
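For the CRF, the same recursion applies with the unary potentials included, so that Z depends on the observed x. In the Python sketch below, the exponentiated potentials $\exp(x_i \cdot y_i)$ and $\exp(y_i \cdot y_{i-1})$ are assumptions made to keep both potentials strictly positive, and the observed sequence is hypothetical.

```python
import itertools
import math

states = [-1, 1]

# Assumed potentials: exponentiated versions of the example similarities.
def phi(x_i, y_i):
    return math.exp(x_i * y_i)

def zeta(y_i, y_prev):
    return math.exp(y_i * y_prev)

def partition(x):
    """Z(x) by the forward recursion: one summation over K values per stage."""
    f = {k: phi(x[0], k) for k in states}
    for x_i in x[1:]:
        f = {k: phi(x_i, k) * sum(zeta(k, l) * f[l] for l in states) for k in states}
    return sum(f.values())

def conditional(y, x):
    """Pr(y | x) for the chain CRF: product of potentials divided by Z(x)."""
    score = math.prod(phi(x_i, y_i) for x_i, y_i in zip(x, y))
    score *= math.prod(zeta(y[i], y[i - 1]) for i in range(1, len(y)))
    return score / partition(x)

x = [1, 1, -1, -1]                               # hypothetical observed sequence
labellings = list(itertools.product(states, repeat=len(x)))
total = sum(conditional(y, x) for y in labellings)
most_likely = max(labellings, key=lambda y: conditional(y, x))
```

Since Z(x) changes with every training sequence, it has to be recomputed for each one during learning, which is why an efficient recursion matters.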

(110)

Some food for thought

• For learning :

We have used Maximum Likelihood but Maximum a posteriori (MAP) could have been used. A prior Pr(W) would have appeared.

One possible way to compute the gradient is the contrastive divergence algorithm, a method for approximating it.

• The forward algorithm can be performed from the end of the chain:

Then called backward recursion

• Forward-backward algorithm is a special case of belief propagation

• The methods described here also work for trees, not only for chains

(111)

Summary

(112)

Some quotes

(113)

Conclusion

• We have seen :

How to take into account dependencies within a data set

Sequential dependencies (time dependencies)

HMM model

A generative model

How to infer world states (labels y) from a given sequence x

Thanks to the maximum of the posterior distribution (Maximum A Posteriori : MAP)

How supervised learning could be achieved

• Next :

Beyond the case of chains :

Go to graphs

http://romain.raveaux.free.fr/document/Structured%20Output%20Learning.pdf
