Supervised machine learning Connecting local models
The case of chains
Romain Raveaux
[email protected] Associate Professor (Maître de conférences)
Université de Tours
Laboratoire d’informatique (LIFAT), RFAI team
Content :
• This presentation is a follow-up to « Supervised machine learning : the case of independent samples »
• http://romain.raveaux.free.fr/document/courssupervisedmachinelearningRaveaux.pdf
• The case of a chain :
• Probabilistic models connected to form a chain:
• A first step beyond Independent and Identically Distributed (I.I.D)
• Models
• Inference
• Learning
• Directed and undirected models
Content :
• The case of a chain :
• Probabilistic models connected to form a chain:
• A first step beyond Independent and Identically Distributed (I.I.D)
• Models
• Directed
• Undirected
• Inference
• Directed model : HMM
• Undirected models : MRF, CRF
• Learning
• Directed model : HMM
• Undirected models : MRF, CRF
Independent and Identically Distributed
• A set of samples: $\{x_1, \dots, x_M\}$
• This set is described by a single distribution $\Pr(x)$
• Each sample is drawn from $\Pr(x)$
• Each sample is independent of the others
• The joint distribution $\Pr(x_1, \dots, x_M)$ is the product, over all data points, of the probability distribution evaluated at each data point: $\Pr(x_1, \dots, x_M) = \prod_{i=1}^{M} \Pr(x_i)$
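As a minimal sketch of the i.i.d. factorization, the joint probability of a few samples can be computed as a product (here a sum of logs). The distribution `pr` and the sample list are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical categorical distribution Pr(x) over three symbols.
pr = {"a": 0.5, "b": 0.3, "c": 0.2}

def iid_log_joint(samples, pr):
    """Log of Pr(x_1, ..., x_M) = prod_i Pr(x_i) under the i.i.d. assumption."""
    return sum(math.log(pr[x]) for x in samples)

samples = ["a", "b", "a", "c"]
log_joint = iid_log_joint(samples, pr)
joint = math.exp(log_joint)  # equals 0.5 * 0.3 * 0.5 * 0.2
```

Summing logs rather than multiplying probabilities avoids numerical underflow for long sequences.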
Beyond : Independent and Identically Distributed
• For many applications, however, the i.i.d. assumption will be a poor one.
• Describe sequential data (time series)
• the rainfall measurements on successive days at a particular location
• the sequence of characters in an English sentence
• the daily values of a currency exchange rate
• The character $x_M$ depends on its M-1 parent characters $x_1, \dots, x_{M-1}$.
A first-order Markov chain
• Markov Assumption :
• The future depends only on the present: $\Pr(x_i \mid x_1, \dots, x_{i-1}) = \Pr(x_i \mid x_{i-1})$
A first-order Markov chain
We have three samples: $\Pr(x_1, x_2, x_3) = \Pr(x_1) \Pr(x_2 \mid x_1) \Pr(x_3 \mid x_2)$
A second-order Markov chain
Markov Assumption :
The future depends only on the present and the previous step: $\Pr(x_i \mid x_1, \dots, x_{i-1}) = \Pr(x_i \mid x_{i-1}, x_{i-2})$
Directed models
• Directed models
• The joint probability distribution is factorized into a product of conditional distributions
• Markov chains are examples of directed models
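The factorization of a first-order Markov chain into conditional distributions can be sketched as follows. The two-state chain, its `initial` distribution and its `transition` table are hypothetical numbers, not taken from the slides.

```python
# Hypothetical two-state chain (0 = dry, 1 = rainy).
initial = [0.7, 0.3]                 # Pr(x_1)
transition = [[0.8, 0.2],            # Pr(x_i | x_{i-1} = 0)
              [0.4, 0.6]]            # Pr(x_i | x_{i-1} = 1)

def chain_probability(sequence, initial, transition):
    """Pr(x_1..x_M) = Pr(x_1) * prod_{i>=2} Pr(x_i | x_{i-1})."""
    prob = initial[sequence[0]]
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        prob *= transition[prev][cur]
    return prob

p = chain_probability([0, 0, 1], initial, transition)  # 0.7 * 0.8 * 0.2
```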
Undirected models
• Undirected models
• The joint probability distribution is factorized into a product of potential functions (𝜙)
• The potential function (𝜙) always returns a positive number
• 𝜙 expresses how compatible a configuration of the variables x is.
• The term Z is known as the partition function; it normalizes the product of these positive functions so that the total probability is one: $\Pr(x_1, \dots, x_M) = \frac{1}{Z} \prod_{c=1}^{C} \phi_c(x_1, \dots, x_M)$
• C is the number of cliques in the graphical model
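To make the role of Z concrete, here is a small sketch that normalizes a product of potentials by brute-force enumeration. It assumes a chain of three binary variables with one clique per edge; the potential `phi` is a hypothetical choice made positive through an exponential.

```python
import itertools
import math

def phi(a, b):
    """Hypothetical positive potential: larger when neighbours agree."""
    return math.exp(a * b)

def unnormalized(x):
    """Product of the pairwise potentials along the chain (one clique per edge)."""
    return math.prod(phi(a, b) for a, b in zip(x[:-1], x[1:]))

states = (-1, 1)
M = 3
configs = list(itertools.product(states, repeat=M))
Z = sum(unnormalized(x) for x in configs)        # partition function
pr = {x: unnormalized(x) / Z for x in configs}   # normalized distribution
```

Enumeration is only feasible for tiny models, since the number of configurations grows exponentially with the number of variables.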
Undirected model : Markov random fields
• Markov random fields (MRF) are undirected models
• MRF are undirected models with Markov assumptions :
• Potential functions operate on a subset of the variables (𝑆)
Undirected model : Conditional random fields
• A conditional random field (CRF) is a special case of MRF
• Where the probability of x is conditioned by another variable
• Pr(𝑥1, . . 𝑥𝑀|𝑧1, … , 𝑧𝑀) for instance.
• This is described in more detail later.
Let us go back to a generative model.
• The product rule gives us: Pr(Y, X) = Pr(X | Y) Pr(Y)
• Pr(Y) is a Markov chain of the variables y
Let us go back to a generative model.
• Let’s put it together
I know this model
• This is known as a
• Hidden Markov model (HMM) when 𝑦𝑖 is discrete
• Kalman Filter model when 𝑦𝑖 is continuous
HMM : Factor graph
Factor graph representation
• There is one node per variable (circles)
• One function node per term in the factorization (squares).
• Each function node connects to all of the variables associated with this term.
Let us see some undirected models
• Markov random fields (MRF)
• Conditional random fields (CRF)
Markov Random Fields (MRF)
• HMM is a directed model
• The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k. Pr(𝑥𝑖|𝑦𝑖 = 𝑘)
• The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)
• Undirected model
• Pr(𝑦𝑖|𝑦𝑖−1) is replaced by 𝜁 𝑦𝑖, 𝑦𝑖−1 : similarity function
• Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ and $y_i \in \{-1, 1\}$
• Returns larger values when the adjacent states are more compatible.
Markov Random Fields (MRF)
MRF = Generative model
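Markov Random Fields (MRF)
• HMM:
• $\Pr(y_1, y_2, y_3) = \Pr(y_1) \Pr(y_2 \mid y_1) \Pr(y_3 \mid y_2)$
• $\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$
• MRF:
• $\Pr(y_1, y_2, y_3) = \frac{1}{Z_1} \zeta(y_1, y_2) \zeta(y_2, y_3)$
• $\Pr(y_1, \dots, y_M) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$
• $Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \cdots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$
• The difference lies in the dependencies between the y variables
• MRF = generative model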
Conditional Random Fields (CRF)
• HMM is a directed model
• The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k. Pr(𝑥𝑖|𝑦𝑖 = 𝑘)
• The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)
• CRF is an undirected model
• $\Pr(x_i \mid y_i = k)$ is replaced by $\phi(x_i, y_i)$: a similarity function
• Returns larger values when the measurement $x_i$ and the world state $y_i$ are more compatible.
• Example: $\phi(x_i, y_i) = x_i \cdot y_i$ with $x_i \in [-1, 1]$ and $y_i \in \{-1, 1\}$
• $\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function
• Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ and $y_i \in \{-1, 1\}$
• Returns larger values when the adjacent states are more compatible.
• CRF are conditional models :
• The x variables are given (observed/fixed).
From Markov Random Fields (MRF) to Conditional Random Fields (CRF)
• MRF = generative model
• So far, this is still an MRF
From Markov Random Fields (MRF) to Conditional Random Fields (CRF)
• We use Bayes’ rule to condition on the data X
• To obtain the posterior distribution
Conditional Random Fields (CRF)
• From MRF to CRF: CRF = discriminative model
• A CRF is a special case of an MRF
• The data nodes $x_1, \dots, x_M$ are fixed/given
Inference
• We take a new set of measurements
• A new sequence : 𝑥𝑛𝑒𝑤
• And we use the model to tell us about the world state.
• We want to make a prediction from 𝑥𝑛𝑒𝑤 (to infer from 𝑥𝑛𝑒𝑤)
• We take a datum 𝑥𝑛𝑒𝑤 and use it to infer the state of the world, the sequence y.
Inference
• Directed models :
• HMM
• Undirected models :
• MRF and CRF
Inference for HMM
Inference for HMM : maximum a posteriori (MAP)
Inference with a generative model:
• 𝑥𝑛𝑒𝑤 are observed variables
• We seek the maximum of $\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new})$
• Use Bayes’ rule to obtain the posterior distribution: $\Pr(y \mid x) = \frac{\Pr(x \mid y) \Pr(y)}{\Pr(x)}$
Inference for HMM : maximum a posteriori (MAP)
• $\Pr(x)$ does not depend on the y variables, so it can be removed from the optimization:
• $\hat{y} = \arg\max_{y} \Pr(y \mid x) = \arg\max_{y} \frac{\Pr(x \mid y) \Pr(y)}{\Pr(x)} = \arg\max_{y} \Pr(x \mid y) \Pr(y)$
Inference for HMM : maximum a posteriori (MAP)
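• Let us detail the model:
• $\Pr(x \mid y) = \prod_{i=1}^{M} \Pr(x_i \mid y_i)$ = the likelihood.
• $\Pr(y) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$ = the prior. The prior is a Markov chain of the variables y.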
Inference for HMM : maximum a posteriori (MAP)
• Let us introduce a log function
• To change the products into sums for a better numerical computation.
• Let us introduce a minus to change from a maximization to minimization problem : negative log
Inference for HMM
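$\hat{y} = \arg\min_{y} \left[ \sum_{i=1}^{M} U_i(y_i) + \sum_{i=2}^{M} P_i(y_i, y_{i-1}) \right]$
Where
• $U_i(y_i) = -\log \Pr(x_i \mid y_i)$ is a unary term and depends only on a single variable $y_i$ (the term $-\log \Pr(y_1)$ can be folded into $U_1$), and
• $P_i(y_i, y_{i-1}) = -\log \Pr(y_i \mid y_{i-1})$ is a pairwise term, depending on two variables $y_i$ and $y_{i-1}$.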
Inference for HMM
• We consider that all the distributions are known
• How can we solve this optimization problem ?
• y variables are discrete variables
• Let us enumerate all possible combinations:
• 𝑦𝑖 can take K different states
• A sequence is composed of M elements
• There are $K^M$ possibilities
• Can we do better than a brute force enumeration ???
Inference for HMM
• We can do better !!!
• But why ???
• The problem comes from the dependence between the y variables
• We can take advantage of the factorization
Inference for HMM
The problem is broken down into simpler problems !!!
Inference for HMM
• We consider that all the distributions are known
• How can we solve this optimization problem ?
• y variables are discrete variables
• It can be solved in polynomial time using the Viterbi algorithm which is an example of dynamic programming.
Inference for HMM : MAP
• Case of K = 5 classes: $y_i \in \{1, 2, 3, 4, 5\}$
• Build a graph
• The set of vertices $\{V_{i,k}\}_{i=1,k=1}^{M,5}$
• Each vertex $V_{i,k}$ has a set of incoming edges $(V_{i-1,l}, V_{i,k})_{l=1}^{5}$
• Each vertex $V_{i,k}$ has a cost $U_i(y_i = k)$
• Each edge $(V_{i-1,l}, V_{i,k})$ has a cost $P_i(y_i = k, y_{i-1} = l)$
• Find the shortest path from left to right
• Where the distance along an edge is $d((V_{i-1,l}, V_{i,k})) = U_{i-1}(y_{i-1} = l) + P_i(y_i = k, y_{i-1} = l)$
• Dijkstra’s algorithm can be used
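The shortest-path view above can be sketched as a small dynamic program in the spirit of the Viterbi algorithm. This is an illustrative sketch, not the author's implementation: it assumes the pairwise cost is the same at every step (a single `pairwise` table), and the cost values in the usage lines are hypothetical.

```python
def viterbi(unary, pairwise):
    """Minimise sum_i U_i(y_i) + sum_{i>=2} P(y_{i-1}, y_i) over label sequences.

    unary[i][k]    : cost U_i(y_i = k), e.g. -log Pr(x_i | y_i = k)
    pairwise[l][k] : cost of moving from state l to state k,
                     e.g. -log Pr(y_i = k | y_{i-1} = l)
    """
    M, K = len(unary), len(unary[0])
    cost = list(unary[0])   # best cost of any path ending in each state
    back = []               # back[i-1][k] = best predecessor of state k at step i
    for i in range(1, M):
        new_cost, pointers = [], []
        for k in range(K):
            best_l = min(range(K), key=lambda l: cost[l] + pairwise[l][k])
            pointers.append(best_l)
            new_cost.append(cost[best_l] + pairwise[best_l][k] + unary[i][k])
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest final state.
    state = min(range(K), key=lambda k: cost[k])
    best_cost = cost[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_cost

# Hypothetical costs for M = 3 steps and K = 2 states.
unary = [[0.1, 2.0], [1.5, 0.3], [0.2, 1.0]]
pairwise = [[0.2, 1.0], [1.0, 0.2]]
path, best_cost = viterbi(unary, pairwise)
```

Each step only needs the best cost of reaching every state at the previous step, which is why the full enumeration of sequences is avoided.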
Inference for HMM : Viterbi in short
• For this slide only: Change of notation y variables become w
• For this slide only: Change of notation i becomes n
• Find the shortest path from left to right
Picture taken from:
Inference for HMM with Dijkstra’s algorithm
There are two extra nodes at the start and the end of the graph.
Inference for HMM : Complexity
• MAP inference :
• Brute force approach: $O(K^M)$
• Enumerating all the combinations
• Sequence of length 3, K=5: 111; 112; 113; 114; 115; 121; 122; …
• Viterbi: $O(MK^2)$
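As a quick illustration of the two complexities, the sketch below enumerates the label sequences for the slide's example (K = 5, M = 3) and counts the work of each approach. The operation counts are rough orders of magnitude, not exact instruction counts.

```python
import itertools

K, M = 5, 3
labels = range(1, K + 1)
sequences = list(itertools.product(labels, repeat=M))  # 111, 112, 113, ...

brute_force_evaluations = len(sequences)   # K ** M sequences to score
viterbi_evaluations = (M - 1) * K * K      # one K x K sweep per transition

# The gap widens quickly with the sequence length.
long_brute_force = K ** 20
long_viterbi = (20 - 1) * K * K
```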
Inference for MRF
Inference for Markov Random Fields (MRF)
• HMM is a directed model
• The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k. Pr(𝑥𝑖|𝑦𝑖 = 𝑘)
• The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)
• Undirected model
• Pr(𝑦𝑖|𝑦𝑖−1) is replaced by 𝜁 𝑦𝑖, 𝑦𝑖−1 : similarity function
• Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ and $y_i \in \{-1, 1\}$
• Returns larger values when the adjacent states are more compatible.
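Inference for Markov Random Fields (MRF)
• HMM:
• $\Pr(y_1, y_2, y_3) = \Pr(y_1) \Pr(y_2 \mid y_1) \Pr(y_3 \mid y_2)$
• $\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$
• MRF:
• $\Pr(y_1, y_2, y_3) = \frac{1}{Z_1} \zeta(y_1, y_2) \zeta(y_2, y_3)$
• $\Pr(y_1, \dots, y_M) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$
• $Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \cdots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$
• The difference lies in the dependencies between the y variables
• MRF = generative model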
Inference for MRF : maximum a posteriori (MAP)
Inference with a generative model:
• 𝑥𝑛𝑒𝑤 are observed variables
• We seek the maximum of $\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new})$
• Use Bayes’ rule to obtain the posterior distribution
Bayes’ rule
• MRF are generative models
• So far no changes with respect to HMM.
Inference for MRF : maximum a posteriori (MAP)
• $\Pr(x)$ does not depend on the y variables, so it can be removed from the optimization:
• $\hat{y} = \arg\max_{y} \Pr(y \mid x) = \arg\max_{y} \frac{\Pr(x \mid y) \Pr(y)}{\Pr(x)} = \arg\max_{y} \Pr(x \mid y) \Pr(y)$
• MRF are generative models
• So far no changes with respect to HMM.
Inference for MRF : maximum a posteriori (MAP)
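• Let us detail the model:
• $\Pr(x \mid y) = \prod_{i=1}^{M} \Pr(x_i \mid y_i)$ = the likelihood.
• $\Pr(y) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$ = the prior. The prior is a Markov Random Field of the variables y.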
Inference for MRF : maximum a posteriori (MAP)
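• $\hat{y} = \arg\max_{y} \left[ \prod_{i=1}^{M} \Pr(x_i \mid y_i) \right] \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$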
• What about 𝑍1?
• Does it impact the optimization problem ?
• Does 𝑍1 depend on a specific value of 𝑦1, … , 𝑦𝑀 ?
• 𝑍1 is a sum over all possible values of 𝑦1, … , 𝑦𝑀
• It is constant with respect to the states 𝑦1, … , 𝑦𝑀
• So 𝑍1 does not impact the optimization problem.
• So 𝑍1 can be removed from the optimization problem
$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \cdots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$
Inference for MRF : maximum a posteriori (MAP)
• Let us introduce a log function
• To change the products into sums for a better numerical computation.
• Let us introduce a minus to change from a maximization to minimization problem : negative log
Inference for MRF
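$\hat{y} = \arg\min_{y} \left[ \sum_{i=1}^{M} U_i(y_i) + \sum_{i=2}^{M} P_i(y_i, y_{i-1}) \right]$
Where
• $U_i(y_i) = -\log \Pr(x_i \mid y_i)$ is a unary term and depends only on a single variable $y_i$, and
• $P_i(y_i, y_{i-1}) = -\log \zeta(y_i, y_{i-1})$ is a pairwise term, depending on two variables $y_i$ and $y_{i-1}$.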
Inference for MRF is close to inference for HMM
• $\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$
• Can be solved by dynamic programming (Viterbi algorithm)
• In the case of chains
Inference for CRF
Inference for Conditional Random Fields (CRF)
• HMM is a directed model
• The tendency to observe the measurements 𝑥𝑖 given that state 𝑦𝑖 takes value k. Pr(𝑥𝑖|𝑦𝑖 = 𝑘)
• The current state is dependent on the previous one Pr(𝑦𝑖|𝑦𝑖−1)
• CRF is an undirected model
• $\Pr(x_i \mid y_i = k)$ is replaced by $\phi(x_i, y_i)$: a similarity function
• Returns larger values when the measurement $x_i$ and the world state $y_i$ are more compatible.
• Example: $\phi(x_i, y_i) = x_i \cdot y_i$ with $x_i \in [-1, 1]$ and $y_i \in \{-1, 1\}$
• $\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function
• Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ and $y_i \in \{-1, 1\}$
• Returns larger values when the adjacent states are more compatible.
• CRF are conditional models :
• The x variables are given (observed/fixed).
Inference for Conditional Random Fields (CRF)
Maximum a posteriori: MAP
• The normalization terms of the posterior do not depend on y, so they can be dropped from the optimization.
• So we get: $\hat{y} = \arg\max_{y} \prod_{i=1}^{M} \phi(x_i, y_i) \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$
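Inference for Conditional Random Fields (CRF)
• Let us take the negative log: $\hat{y} = \arg\min_{y} \left[ \sum_{i=1}^{M} U_i(y_i) + \sum_{i=2}^{M} P_i(y_i, y_{i-1}) \right]$
Where
• $U_i(y_i) = -\log \phi(x_i, y_i)$ is a unary term and depends only on a single variable $y_i$, and
• $P_i(y_i, y_{i-1}) = -\log \zeta(y_i, y_{i-1})$ is a pairwise term, depending on two variables $y_i$ and $y_{i-1}$.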
Inference for CRF is close to inference for MRF
• $\Pr(x_i \mid y_i)$ is replaced by $\phi(x_i, y_i)$
• Can be solved by dynamic programming (Viterbi algorithm)
• In the case of chains
Learning
• Learning with HMM
• Learning with MRF
• Learning with CRF
Learning
• The data set:
• $\mathcal{D} = \{(x^{(j)}, y^{(j)})\}_{j=1}^{N}$: N data samples
• $x^{(j)} = (x_1^{(j)}, \dots, x_i^{(j)}, \dots, x_M^{(j)})$: one input sequence of M elements
• $y^{(j)} = (y_1^{(j)}, \dots, y_i^{(j)}, \dots, y_M^{(j)})$: one output sequence of M elements
• $y_i^{(j)} \in \{1, 2, \dots, K\}$: K classes, usually a discrete domain
Learning with HMM
Learning with HMM
• So far there is no learning just Inference
• We just give a sequence (x) and output the labeled sequence (y)
• Where can learning be introduced?
• Where are the parameters ?
• Let us take an example :
• The measurements (x) have a normal distribution
• The class variable (y) follows a categorical law.
• This hidden Markov model has parameters: $W = \{\mu_k, \sigma_k, \lambda_k\}_{k=1}^{K}$
• One Gaussian per class to model $x_1^{(j)}, \dots, x_i^{(j)}, \dots, x_M^{(j)}$
Categorical distribution
Learning with HMM
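• Let us express the joint probability of the data set 𝒟
• For one input and one output, the HMM model is: $\Pr(x, y \mid W) = \Pr(y_1 \mid W) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1}, W) \prod_{i=1}^{M} \Pr(x_i \mid y_i, W)$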
Learning with HMM
• Let us express the joint probability of the data set 𝒟
• For the whole data set: $\Pr(\mathcal{D} \mid W) = \prod_{j=1}^{N} \Pr(x^{(j)}, y^{(j)} \mid W)$
• Assuming each data sequence was drawn independently from the distribution (independent and identically distributed, i.i.d.)
Learning with HMM
• Let us fit the probability model by maximum likelihood: $\hat{W} = \arg\max_{W} \prod_{j=1}^{N} \Pr(x^{(j)}, y^{(j)} \mid W)$
Learning with HMM
• Supervised learning
• Relatively simple. We first isolate the part of the model that we want to learn.
• For example, we might learn the parameters of the measurement term $\Pr(x_i \mid y_i)$
• from paired examples of $x_i$ and $y_i$.
• We can learn these parameters in isolation using the ML, MAP, or Bayesian methods.
• The same applies for the transition term $\Pr(y_i \mid y_{i-1})$
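A minimal sketch of supervised maximum-likelihood fitting, assuming one Gaussian per class for the measurements and a transition table estimated by counting. The function name `fit_hmm_supervised` and the toy data are hypothetical; the sketch also assumes every class appears in the data and has at least one outgoing transition.

```python
import math
from collections import Counter

def fit_hmm_supervised(sequences, K):
    """Maximum-likelihood estimates from labelled sequences [(x, y), ...].

    x is a list of floats, y a list of labels in 0..K-1 (same length).
    Returns per-class Gaussian (mean, std) and the K x K transition table.
    """
    values = [[] for _ in range(K)]
    transitions = Counter()
    for x, y in sequences:
        for xi, yi in zip(x, y):
            values[yi].append(xi)          # measurements grouped by class
        for prev, cur in zip(y[:-1], y[1:]):
            transitions[(prev, cur)] += 1  # count label transitions

    gaussians = []
    for v in values:
        mu = sum(v) / len(v)
        sigma = math.sqrt(sum((xi - mu) ** 2 for xi in v) / len(v))
        gaussians.append((mu, sigma))

    trans = []
    for l in range(K):
        row_total = sum(transitions[(l, k)] for k in range(K))
        trans.append([transitions[(l, k)] / row_total for k in range(K)])
    return gaussians, trans

# Hypothetical labelled data: two sequences of length 4, K = 2 classes.
data = [([0.0, 0.2, 1.0, 1.2], [0, 0, 1, 1]),
        ([0.1, 0.9, 1.1, -0.1], [0, 1, 1, 0])]
gaussians, trans = fit_hmm_supervised(data, K=2)
```

Because the labels are observed, each part of the model is estimated separately from simple statistics, with no iterative optimization.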
Learning with HMM
• Unsupervised learning
• More challenging
• Beyond the scope of this presentation (dedicated to supervised learning)
• Requires notions such as:
1. Expectation Maximization method
2. Forward-Backward method
3. 1+2 = Baum-Welch algorithm
• See also: Gaussian Mixture and Expectation Maximization algorithm
• http://romain.raveaux.free.fr/document/GaussianMixtureandExpectationMaximization.html
Limits of HMM
• Parameters ($W_1$ and $W_2$) of the distributions are shared through time
• All time steps have the same parameters
• Corresponding to the assumption of a stationary time series.
• HMMs are generative models
• Discriminative models infer information on the world state (y) more directly
• Although the HMM is more general than the independence model, it is still very restrictive.
• Markov hypothesis: the future depends solely on the present
Learning with MRF
Learning with MRF
• Let us express the joint probability of the data set 𝒟
• For one input and one output, the MRF model is :
Learning with MRF
• Let us express the joint probability of the data set 𝒟
• For the whole data set :
Assuming each data sequence was drawn independently from the distribution (independent and identically distributed, i.i.d.)
Learning with MRF
• Let us fit the probability model by the maximum of likelihood
Like learning with HMM.
Here we have a difference: Z
Z appears in the learning stage !!!
ouch
Learning with MRF
• Let us fit the probability model by the maximum of likelihood
Z does not depend on y (do you remember ? )
Cool, we just have to compute it once
Learning with MRF
• Let us fit the probability model by the maximum of likelihood
Computing Z is challenging ???
$K^M$ summations to do !!!
How to compute Z efficiently ??? An important question !!!
Learning with MRF
• Let us fit the probability model by the maximum of likelihood
• How to compute Z efficiently?
• Generally speaking, take advantage of the structure of the problem you are dealing with
• Two arguments can be used in the case of chains:
• Exploit the structured factorization of the problem
• Based on conditional independence
• We can use one of them to compute Z efficiently
• Let us exploit the factorization property
Forward algorithm for computing Z
• We observe that not every term in the product is relevant to every summation.
• We can re-arrange the summation terms so that only the variables over which they sum are to the right
• We proceed from right to left, computing each summation in turn.
• This technique is known as variable elimination.
Forward algorithm for computing Z
• Let us denote the rightmost three terms :
Forward algorithm for computing Z
• At the 𝑖𝑡ℎ stage we compute :
• We repeat this process until we have computed the full expression.
• This solution consists of M-1 summations over K values (K classes)
• This is much more efficient than explicitly computing all $K^M$ terms of the sum
• We can now perform the learning stage for the MRF ???
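The variable-elimination idea can be sketched for a chain of pairwise potentials and checked against brute-force enumeration. The potential `zeta` below is a hypothetical positive choice (an exponential of the label product), and the sketch covers only the pairwise terms over y.

```python
import itertools
import math

def zeta(a, b, w=1.0):
    """Hypothetical positive pairwise potential between neighbouring labels."""
    return math.exp(w * a * b)

def partition_forward(states, M, potential=zeta):
    """Z = sum over all y of prod_{i>=2} zeta(y_i, y_{i-1}), by variable elimination."""
    # f[b] is the sum over y_1..y_{i-1} of the potentials seen so far, with y_i = b.
    f = {b: 1.0 for b in states}
    for _ in range(M - 1):  # M-1 elimination stages
        f = {b: sum(f[a] * potential(b, a) for a in states) for b in states}
    return sum(f.values())  # final sum over y_M

def partition_brute_force(states, M, potential=zeta):
    """Same quantity by enumerating all K**M label sequences."""
    total = 0.0
    for y in itertools.product(states, repeat=M):
        total += math.prod(potential(y[i], y[i - 1]) for i in range(1, M))
    return total

states = (-1, 1)
Z_fast = partition_forward(states, M=6)
Z_slow = partition_brute_force(states, M=6)
```

Each stage costs on the order of K² operations, so the whole recursion scales linearly with the chain length instead of exponentially.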
Learning with MRF
• Well we are not safe yet !!!
• Let us take the negative-log
• How can we solve this problem ????
Learning with MRF
• Well we are not safe yet !!!
• Let us take the negative-log
• The min is where the derivative is equal to zero :
• Let us assume that 𝜁 is differentiable
Learning with MRF
• This problem can be solved by gradient descent :
• Do you remember?
Learning with MRF
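• This problem can be solved by gradient descent: $W \leftarrow W - \alpha \frac{\partial L}{\partial W}$
• Where $\alpha$ is the learning rate and $L$ is the negative log-likelihood
• But how do we compute the gradient?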
Learning with MRF
• The second term is easy to compute:
• It is the sum of the derivatives of $\log \zeta$
Learning with MRF
• The first term is tricky:
• It contains the sum of the derivatives of $\log \zeta$
• It involves a sum over all possible values of y, which is intractable in general
• But not for us, not in the case of chains:
• Thanks to the forward algorithm
• We can do that with M-1 summations over K values (K classes)
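The following sketch puts the pieces together for a deliberately tiny case: a single parameter `w` with the hypothetical potential $\zeta(a, b) = \exp(w \cdot a \cdot b)$, labels in {-1, 1}, and only the pairwise terms over y. The log of Z comes from the forward recursion. For brevity the gradient is approximated by central differences, which is a substitute for the analytic gradient discussed above, not the method of the slides.

```python
import math

STATES = (-1, 1)

def log_partition(w, M):
    """log Z(w) for zeta(a, b) = exp(w * a * b), via the forward recursion."""
    f = {b: 1.0 for b in STATES}
    for _ in range(M - 1):
        f = {b: sum(f[a] * math.exp(w * a * b) for a in STATES) for b in STATES}
    return math.log(sum(f.values()))

def neg_log_likelihood(w, sequences):
    """Average of log Z(w) - sum_i log zeta(y_i, y_{i-1}) over the label sequences."""
    total = 0.0
    for y in sequences:
        agreement = sum(a * b for a, b in zip(y[:-1], y[1:]))
        total += log_partition(w, len(y)) - w * agreement
    return total / len(sequences)

def fit(sequences, alpha=0.2, steps=200, eps=1e-5):
    """Gradient descent w <- w - alpha * dL/dw with a central-difference gradient."""
    w = 0.0
    for _ in range(steps):
        grad = (neg_log_likelihood(w + eps, sequences)
                - neg_log_likelihood(w - eps, sequences)) / (2 * eps)
        w -= alpha * grad
    return w

# Hypothetical training label sequences of length M = 5.
data = [(1, 1, 1, -1, -1), (-1, -1, -1, -1, 1), (1, 1, -1, -1, -1)]
w_hat = fit(data)
```

A positive `w_hat` reflects that neighbouring labels in the toy data agree more often than they disagree.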
Learning with CRF
Learning with CRF
• Let us express the conditional probability over the observed data x
• For one input and one output, the CRF model is :
Learning with CRF
• Let us express the conditional probability of the data set 𝒟
• For the whole data set :
Assuming each data sequence was drawn independently from the distribution (independent and identically distributed, i.i.d.)
Learning with CRF
• Let us fit the probability model by the maximum of likelihood
Learning with CRF
• Let us fit the probability model by the maximum of likelihood
Z does not depend on y (do you remember ? )
Does not depend on W
Learning with CRF
• Let us fit the probability model by the maximum of likelihood
• It is close to the case of MRF
• We have to compute Z efficiently
• To do so, we can use the factorization property of the problem (The case of chains)
Learning with CRF : Computation of Z
• Let us use the forward algorithm
• W is observed/fixed/given, so we drop it from the notation for the explanation
• It makes notation simpler
Learning with CRF : Computation of Z
• We observe that not every term in the product is relevant to every summation.
• We can re-arrange the summation terms so that only the variables over which they sum are to the right
• We proceed from right to left, computing each summation in turn.
• This technique is known as variable elimination.
Forward algorithm for computing Z
• Let us denote the rightmost three terms :
Forward algorithm for computing Z
• At the 𝑖𝑡ℎ stage we compute :
• We repeat this process until we have computed the full expression.
• This solution consists of M summations over K values (K classes)
• This is much more efficient than explicitly computing all $K^M$ terms of the sum
• We can now perform the learning stage for the CRF
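For the CRF, Z depends on the observed sequence x. The sketch below computes Z(x) with the same forward recursion, now including unary terms, and uses it to normalize the score of a labelling. The potentials `phi` and `zeta` and the measurements in `x` are hypothetical; exponentials are used to keep the potentials positive.

```python
import itertools
import math

STATES = (-1, 1)

def phi(x_i, y_i):
    """Hypothetical unary potential: compatibility of measurement and label."""
    return math.exp(x_i * y_i)

def zeta(a, b):
    """Hypothetical pairwise potential: compatibility of adjacent labels."""
    return math.exp(a * b)

def crf_partition(x):
    """Z(x) = sum_y prod_i phi(x_i, y_i) prod_{i>=2} zeta(y_i, y_{i-1})."""
    f = {b: phi(x[0], b) for b in STATES}
    for x_i in x[1:]:
        f = {b: phi(x_i, b) * sum(f[a] * zeta(b, a) for a in STATES) for b in STATES}
    return sum(f.values())

def crf_score(x, y):
    """Unnormalised score of one labelling y for the observed sequence x."""
    score = math.prod(phi(x_i, y_i) for x_i, y_i in zip(x, y))
    return score * math.prod(zeta(b, a) for a, b in zip(y[:-1], y[1:]))

def crf_probability(x, y):
    """Pr(y | x) = score(x, y) / Z(x)."""
    return crf_score(x, y) / crf_partition(x)

x = (0.9, 0.4, -0.7, -1.0)
labellings = list(itertools.product(STATES, repeat=len(x)))
total = sum(crf_probability(x, y) for y in labellings)
best = max(labellings, key=lambda y: crf_probability(x, y))
```

The enumeration over `labellings` is there only to check the normalization on this tiny example; Z(x) itself is obtained without enumeration.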
Some food for thought
• For learning:
• We have used maximum likelihood, but maximum a posteriori (MAP) could have been used. A prior Pr(W) would have appeared.
• One possible solution to compute the gradient is the contrastive divergence algorithm. This is a method for approximating the gradient.
• The forward algorithm can be performed from the end of the chain:
• Then called backward recursion
• Forward-backward algorithm is a special case of belief propagation
• Methods described here could work for Trees not only for chains
Conclusion
• We have seen :
• How to take into account dependencies in a data set
• Sequential dependencies (time dependencies)
• HMM model
• A generative model
• How to infer world states (labels y) from a given sequence x
• Thanks to the maximum of the posterior distribution (Maximum A Posteriori : MAP)
• How supervised learning could be achieved
• Next :
• Beyond the case of chains :
• Go to graphs
• http://romain.raveaux.free.fr/document/Structured%20Output%20Learning.pdf