Connecting local models

(1)

Supervised machine learning Connecting local models

The case of chains

Romain Raveaux

Associate Professor (Maître de conférences)

Université de Tours

Laboratoire d’informatique (LIFAT) Equipe RFAI

(2)

Content :

• This presentation is a follow-up to "Supervised machine learning: the case of independent samples"

• http://romain.raveaux.free.fr/document/courssupervisedmachinelearningRaveaux.pdf

• The case of a chain :

• Probabilistic models connected to form a chain:

• A first step beyond Independent and Identically Distributed (I.I.D)

• Models

• Inference

• Learning

• Directed and undirected models

(3)

Content :

• The case of a chain :

• Probabilistic models connected to form a chain:

• A first step beyond Independent and Identically Distributed (I.I.D)

• Models

• Directed

• Undirected

• Inference

• Directed model : HMM

• Undirected models : MRF, CRF

• Learning

• Directed model : HMM

• Undirected models : MRF, CRF

(4)

Independent and Identically Distributed

• A set of samples: $\{x_i\}_{i=1}^{M}$

• This set is described by a single distribution $\Pr(x)$

• Each sample is drawn from $\Pr(x)$

• Each sample is independent of the others:

• The joint distribution $\Pr(x_1, \dots, x_M)$ is the product, over all data points, of the probability distribution evaluated at each data point.
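Written as a formula, the independence statement above reads as follows (a direct transcription of the sentence, with $M$ the number of samples):

```latex
\Pr(x_1, \dots, x_M) = \prod_{i=1}^{M} \Pr(x_i)
```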

(5)

Beyond : Independent and Identically Distributed

• For many applications, however, the i.i.d. assumption will be a poor one.

• Describe sequential data (time series)

• the rainfall measurements on successive days at a particular location

• the sequence of characters in an English sentence

• the daily values of a currency exchange rate

• The character $x_M$ depends on its $M-1$ parent characters $x_1, \dots, x_{M-1}$.

(6)

A first-order Markov chain

• Markov Assumption :

• The future depends only on the present: $\Pr(x_i \mid x_1, \dots, x_{i-1}) = \Pr(x_i \mid x_{i-1})$

(7)

A first-order Markov chain

We have three samples: $\Pr(x_1, x_2, x_3) = \Pr(x_1)\,\Pr(x_2 \mid x_1)\,\Pr(x_3 \mid x_2)$
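The first-order factorization for a short chain can be checked numerically. The sketch below uses a hypothetical two-state chain; the initial distribution, the transition matrix, and the function name `markov_chain_probability` are invented for the example.

```python
import numpy as np

def markov_chain_probability(sequence, initial, transition):
    """Joint probability of a state sequence under a first-order Markov chain."""
    prob = initial[sequence[0]]
    # Each later state is conditioned only on its immediate predecessor.
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        prob *= transition[prev, curr]
    return prob

# Hypothetical chain: state 0 = dry, state 1 = rainy.
initial = np.array([0.6, 0.4])            # Pr(x_1)
transition = np.array([[0.8, 0.2],        # transition[a, b] = Pr(x_i = b | x_{i-1} = a)
                       [0.3, 0.7]])

# Pr(x_1 = 0, x_2 = 0, x_3 = 1) = Pr(0) * Pr(0 | 0) * Pr(1 | 0)
p = markov_chain_probability([0, 0, 1], initial, transition)
print(p)
```

Summing this quantity over all eight sequences of length three gives one, which is a quick sanity check that the factorization defines a valid joint distribution.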

(8)

A second-order Markov chain

Markov Assumption :

The future depends only on the present and yesterday.
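In symbols, the second-order assumption could be written as below. The treatment of the first two variables, which have fewer than two predecessors, is an assumption of this sketch.

```latex
\Pr(x_i \mid x_1, \dots, x_{i-1}) = \Pr(x_i \mid x_{i-1}, x_{i-2})

\Pr(x_1, \dots, x_M) = \Pr(x_1)\,\Pr(x_2 \mid x_1) \prod_{i=3}^{M} \Pr(x_i \mid x_{i-1}, x_{i-2})
```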

(9)

Directed models

• Directed models

• The joint probability distribution is factorized into a product of conditional distributions

• Markov chains are examples of directed models

(10)

Undirected models

• Undirected models

• The joint probability distribution is factorized into a product of potential functions (𝜙)

• The potential function (𝜙) always returns a positive number

• 𝜙 expresses how compatible the values of the variables x are with each other.

(11)

Undirected models

• The joint probability distribution is factorized into a product of potential functions (𝜙)

• The potential function (𝜙) always returns a positive number

• The term Z is known as the partition function; it normalizes the product of these positive functions so that the total probability is one.

• C is the number of cliques in the graphical model
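As a rough illustration of the role of Z, the sketch below enumerates every configuration of a tiny model and normalizes the product of potentials. The three binary variables, the two pairwise cliques, and the potential exp(a·b) are assumptions chosen for the example; they are not taken from the slides.

```python
import itertools
import math

def unnormalized(x, cliques):
    """Product of the clique potentials for one configuration x."""
    value = 1.0
    for indices, phi in cliques:
        value *= phi(*(x[i] for i in indices))
    return value

def partition_function(cliques, num_vars, states):
    """Brute-force Z: sum of the unnormalized product over all configurations."""
    return sum(unnormalized(x, cliques)
               for x in itertools.product(states, repeat=num_vars))

# Hypothetical model: 3 variables in {-1, +1}, C = 2 cliques, strictly positive potential.
phi = lambda a, b: math.exp(a * b)
cliques = [((0, 1), phi), ((1, 2), phi)]
states = (-1, 1)

Z = partition_function(cliques, 3, states)
probs = {x: unnormalized(x, cliques) / Z
         for x in itertools.product(states, repeat=3)}
print(Z, sum(probs.values()))
```

Enumeration costs one evaluation per configuration, so it grows exponentially with the number of variables. That is why the chain-specific recursions matter for longer sequences.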

(12)

Undirected model : Markov random fields

• Markov random fields (MRF) are undirected models

• MRF are undirected models with Markov assumptions :

• Potential functions operate on a subset of the variables (𝑆)

(13)

Undirected model : Conditional random fields

• A conditional random field (CRF) is a special case of MRF

• Where the probability of x is conditioned on another variable

• For instance, $\Pr(x_1, \dots, x_M \mid z_1, \dots, z_M)$

• This is described in more detail later in the presentation.

(14)

OK, let us go back to a generative model.

• The product rule gives us: $\Pr(Y, X) = \Pr(X \mid Y)\,\Pr(Y)$

• Here the prior $\Pr(Y)$ is a Markov chain of the variables y

(15)

OK, let us go back to a generative model.

• Let’s put it together
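Putting the product rule and the Markov chain prior together gives the factorization below. It relies on the usual HMM assumption that each measurement $x_i$ depends only on its own state $y_i$, which is stated here as an assumption, consistent with the term $\Pr(x_i \mid y_i = k)$ used elsewhere in the slides.

```latex
\Pr(x_1, \dots, x_M, y_1, \dots, y_M)
  = \underbrace{\Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})}_{\Pr(Y)}
    \; \underbrace{\prod_{i=1}^{M} \Pr(x_i \mid y_i)}_{\Pr(X \mid Y)}
```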

(16)

I know this model

• This is known as a

• Hidden Markov model (HMM) when 𝑦_𝑖 is discrete

• Kalman Filter model when 𝑦_𝑖 is continuous

(17)

HMM : Factor graph

Factor graph representation

(18)

HMM : Factor graph

Factor graph representation

• There is one node per variable (circles)

• One function node per term in the factorization (squares).

• Each function node connects to all of the variables associated with this term.

(19)

Let us see some undirected models

• Markov random fields (MRF)

• Conditional random fields (CRF)

(20)

Markov Random Fields (MRF)

• HMM is a directed model

• The tendency to observe the measurement $x_i$ given that state $y_i$ takes value $k$: $\Pr(x_i \mid y_i = k)$

• The current state depends on the previous one: $\Pr(y_i \mid y_{i-1})$

• Undirected model

• $\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$: a similarity function

• Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ with $y_i \in \{-1, 1\}$

• Returns larger values when the adjacent states are more compatible.

• Since a potential must be positive, a variant such as $\exp(y_i \cdot y_{i-1})$ can be used in practice.

(21)

Markov Random Fields (MRF)

MRF = Generative model

(22)

Markov Random Fields (MRF)

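HMM:

$$\Pr(y_1, y_2, y_3) = \Pr(y_1)\,\Pr(y_2 \mid y_1)\,\Pr(y_3 \mid y_2)$$

$$\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$$

MRF:

$$\Pr(y_1, y_2, y_3) = \frac{1}{Z_1}\,\zeta(y_2, y_1)\,\zeta(y_3, y_2)$$

$$\Pr(y_1, \dots, y_M) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$$

$$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \cdots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$$

The difference lies in the dependencies between the y variables.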

(23)

Conditional Random Fields (CRF)

• CRF is an undirected model

• $\Pr(x_i \mid y_i = k)$ is replaced by $\phi(x_i, y_i)$: a similarity function

• Returns larger values when the measurement $x_i$ and the world state $y_i$ are more compatible.

• Example: $\phi(x_i, y_i) = x_i \cdot y_i$ with $x_i \in [-1, 1]$ and $y_i \in \{-1, 1\}$

• Example: $\zeta(y_i, y_{i-1}) = y_i \cdot y_{i-1}$ with $y_i \in \{-1, 1\}$

• CRF are conditional models :

• The x variables are given (observed/fixed).

(24)

From Markov Random Fields (MRF) to Conditional Random Fields (CRF)

MRF = Generative model

So far, this is still an MRF

(25)

From Markov Random Fields (MRF) to Conditional Random Fields (CRF)

• We use Bayes' rule to condition on the data X

• To obtain the posterior distribution

From MRF to CRF

(26)

Conditional Random Fields (CRF)

CRF = Discriminative model

From MRF to CRF

CRF is a special case of MRF

The data nodes 𝑥₁, … , 𝑥_𝑀 are fixed/given
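With the data nodes fixed, one way to write the resulting chain model is sketched below. It uses the potentials $\phi$ and $\zeta$ from the slides; the symbol $Z(x)$ for the data-dependent normalizer is notation introduced for this sketch.

```latex
\Pr(y_1, \dots, y_M \mid x_1, \dots, x_M)
  = \frac{1}{Z(x)} \prod_{i=1}^{M} \phi(x_i, y_i) \prod_{i=2}^{M} \zeta(y_i, y_{i-1})

Z(x) = \sum_{y_1 \in Y} \cdots \sum_{y_M \in Y}
       \prod_{i=1}^{M} \phi(x_i, y_i) \prod_{i=2}^{M} \zeta(y_i, y_{i-1})
```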

(27)

Inference

• We take a new set of measurements

• A new sequence : 𝑥^𝑛𝑒𝑤

• And we use the model to tell us about the world state.

• We want to make a prediction from 𝑥^𝑛𝑒𝑤 (to infer from 𝑥^𝑛𝑒𝑤)

• We take a datum 𝑥^𝑛𝑒𝑤 and use it to infer the state of the world, the sequence y.

(28)

Inference

• Directed models :

• HMM

• Undirected models :

• MRF and CRF

(29)

Inference for HMM

(30)

Inference for HMM : maximum a posteriori (MAP)

Inference with a generative model:

• 𝑥^𝑛𝑒𝑤 are observed variables

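• We seek the maximum of $\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new})$

• Use Bayes' rule to obtain the posterior distribution:

$$\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new}) = \frac{\Pr(x_1^{new}, \dots, x_M^{new} \mid y_1, \dots, y_M)\,\Pr(y_1, \dots, y_M)}{\Pr(x_1^{new}, \dots, x_M^{new})}$$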

(31)

Inference for HMM : maximum a posteriori (MAP)

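• $\Pr(x_1^{new}, \dots, x_M^{new})$ does not depend on the y variables, so it can be removed from the optimization:

$$\hat{y}_1, \dots, \hat{y}_M = \arg\max_{y_1, \dots, y_M} \Pr(x_1^{new}, \dots, x_M^{new} \mid y_1, \dots, y_M)\,\Pr(y_1, \dots, y_M)$$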

(32)

Inference for HMM : maximum a posteriori (MAP)

• Let us detail the model:

• $\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$ is the prior. The prior is a Markov chain of the variables y.

(33)

Inference for HMM : maximum a posteriori (MAP)

• Let us introduce a log function

• This changes the products into sums, which behave better numerically.

• Let us introduce a minus sign to turn the maximization into a minimization problem: the negative log

(34)

Inference for HMM

Where

• $U_i$ is a unary term that depends only on a single variable $y_i$, and

• $P_i$ is a pairwise term that depends on two variables, $y_i$ and $y_{i-1}$.
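For reference, taking the negative log of the HMM factorization leads to an objective of the following shape. This is a sketch under the assumption that the initial prior $-\log \Pr(y_1)$ is folded into the first unary term; the slides may group the terms differently.

```latex
\hat{y}_1, \dots, \hat{y}_M
  = \arg\min_{y_1, \dots, y_M}
    \left[ \sum_{i=1}^{M} U_i(y_i) + \sum_{i=2}^{M} P_i(y_i, y_{i-1}) \right]

U_1(y_1) = -\log \Pr(x_1^{new} \mid y_1) - \log \Pr(y_1), \qquad
U_i(y_i) = -\log \Pr(x_i^{new} \mid y_i) \quad (i \ge 2)

P_i(y_i, y_{i-1}) = -\log \Pr(y_i \mid y_{i-1})
```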

(35)

Inference for HMM

• We consider that all the distributions are known

• How can we solve this optimization problem ?

• y variables are discrete variables

• Let us enumerate all possible combinations:

• 𝑦_𝑖 can take K different states

• A sequence is composed of M elements

• So there are $K^M$ possibilities

• Can we do better than brute-force enumeration?

(36)

Inference for HMM

• We can do better !!!

• But why ???

• The problem comes from the dependence between the y variables

• We can take advantage of the factorization

(37)

Inference for HMM

The problem is broken down into simpler problems !!!

(38)

Inference for HMM

• We consider that all the distributions are known

• How can we solve this optimization problem ?

• y variables are discrete variables

• It can be solved in polynomial time using the Viterbi algorithm which is an example of dynamic programming.
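A minimal sketch of that dynamic program follows, written for the unary/pairwise cost form of the objective. The function name `viterbi`, the array layout, the example numbers, and the assumption that the pairwise cost is the same at every step are choices made for this illustration.

```python
import numpy as np

def viterbi(unary, pairwise):
    """Minimize sum_i U_i(y_i) + sum_{i>=2} P(y_{i-1}, y_i) over label sequences.

    unary:    (M, K) array, unary[i, k] = U_i(y_i = k)
    pairwise: (K, K) array, pairwise[l, k] = P(y_{i-1} = l, y_i = k)
    """
    M, K = unary.shape
    cost = np.zeros((M, K))             # cost[i, k]: best cost of a prefix ending in state k
    back = np.zeros((M, K), dtype=int)  # back[i, k]: best predecessor of state k at step i
    cost[0] = unary[0]
    for i in range(1, M):
        # total[l, k]: cost of reaching state k at step i through state l at step i-1
        total = cost[i - 1][:, None] + pairwise + unary[i][None, :]
        back[i] = np.argmin(total, axis=0)
        cost[i] = np.min(total, axis=0)
    # Backtrack from the cheapest final state.
    path = [int(np.argmin(cost[-1]))]
    for i in range(M - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    path.reverse()
    return path, float(np.min(cost[-1]))

# Hypothetical 2-state HMM; costs are negative log-probabilities.
prior = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])   # transition[l, k] = Pr(y_i = k | y_{i-1} = l)
emission = np.array([[0.8, 0.2],
                     [0.3, 0.7]])     # emission[k, o] = Pr(x_i = o | y_i = k)
observations = [0, 0, 1, 1]

unary = -np.log(emission[:, observations].T)  # shape (M, K)
unary[0] -= np.log(prior)                     # fold the initial prior into U_1
pairwise = -np.log(transition)

path, cost = viterbi(unary, pairwise)
print(path, cost)
```

Each of the M steps compares K predecessors for each of K states, which gives the O(MK²) cost, compared with the K^M sequences a brute-force search would visit.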

(39)

Inference for HMM : MAP

• Case of K = 5 classes: $y_i \in \{1, 2, 3, 4, 5\}$

• Build a graph

• The set of vertices $\{V_{i,k}\}_{i=1,k=1}^{M,5}$

• Each vertex $V_{i,k}$ has a set of incoming edges $\{(V_{i-1,l}, V_{i,k})\}_{l=1}^{5}$

• Each vertex $V_{i,k}$ has a cost $U_i(y_i = k)$

• Each edge $(V_{i-1,l}, V_{i,k})$ has a cost $P_i(y_i = k, y_{i-1} = l)$

• Find the shortest path from left to right

• The distance is defined as $d((V_{i-1,l}, V_{i,k})) = U_{i-1}(y_{i-1} = l) + P_i(y_i = k, y_{i-1} = l)$

• Dijkstra's algorithm can be used

(40)

Inference for HMM : Viterbi in short

• For this slide only: Change of notation y variables become w

• For this slide only: Change of notation i becomes n

• Find the shortest path from left to right

Picture taken from:

(41)

Inference for HMM with Dijkstra's algorithm

There are two extra nodes at the start and the end of the graph.
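The graph construction with its two extra nodes can be sketched as follows. This is an illustrative implementation, not the author's: the function name `map_by_dijkstra`, the node encoding, and the example costs are hypothetical. Dijkstra's algorithm also requires all costs to be non-negative, which holds for negative logs of discrete probabilities.

```python
import heapq
import math

def map_by_dijkstra(unary, pairwise):
    """Shortest left-to-right path through the trellis.

    unary[i][k]    = U_i(y_i = k)
    pairwise[l][k] = P(y_i = k, y_{i-1} = l), assumed identical at every step
    """
    M, K = len(unary), len(unary[0])
    START, END = ("start",), ("end",)

    def edges(node):
        if node == START:
            # Entering the chain is free; vertex costs are paid when leaving a vertex.
            return [((0, k), 0.0) for k in range(K)]
        i, l = node
        if i == M - 1:
            return [(END, unary[i][l])]
        # d((V_{i,l}, V_{i+1,k})) = U_i(l) + P(k, l)
        return [((i + 1, k), unary[i][l] + pairwise[l][k]) for k in range(K)]

    dist, parent, done = {START: 0.0}, {}, set()
    heap, counter = [(0.0, 0, START)], 0   # counter breaks ties between equal distances
    while heap:
        d, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == END:
            break
        for nxt, weight in edges(node):
            if d + weight < dist.get(nxt, math.inf):
                dist[nxt], parent[nxt] = d + weight, node
                counter += 1
                heapq.heappush(heap, (d + weight, counter, nxt))

    # Walk back from the end node to recover the state sequence.
    path, node = [], parent[END]
    while node != START:
        path.append(node[1])
        node = parent[node]
    return path[::-1], dist[END]

# Hypothetical costs for M = 4 steps and K = 2 states.
unary = [[0.2, 1.6], [0.2, 1.6], [1.6, 0.4], [1.6, 0.4]]
pairwise = [[0.1, 2.3], [2.3, 0.1]]
path, cost = map_by_dijkstra(unary, pairwise)
print(path, cost)
```

Because the trellis is layered and acyclic, the Viterbi recursion visits the same edges in a fixed order and needs no priority queue; Dijkstra's algorithm is shown here only because the slides mention it.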

(42)

Inference for HMM : Complexity

• MAP inference :

• Brute force approach : O(𝐾^𝑀)

• Enumerating all the combinations

• Sequence of length 3, K=5 : 111; 112 ; 113; 114; 115; 121; 122; ….

• Viterbi : O(M𝐾²)

(43)

Inference for MRF

(44)

Inference for Markov Random Fields (MRF)

• Undirected model

(45)

Inference for Markov Random Fields (MRF)

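HMM:

$$\Pr(y_1, y_2, y_3) = \Pr(y_1)\,\Pr(y_2 \mid y_1)\,\Pr(y_3 \mid y_2)$$

$$\Pr(y_1, \dots, y_M) = \Pr(y_1) \prod_{i=2}^{M} \Pr(y_i \mid y_{i-1})$$

MRF:

$$\Pr(y_1, y_2, y_3) = \frac{1}{Z_1}\,\zeta(y_2, y_1)\,\zeta(y_3, y_2)$$

$$\Pr(y_1, \dots, y_M) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$$

$$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \cdots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$$

The difference lies in the dependencies between the y variables.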

(46)

Inference for MRF : maximum a posteriori (MAP)

Inference with a generative model:

• 𝑥^𝑛𝑒𝑤 are observed variables

• We seek the maximum of $\Pr(y_1, \dots, y_M \mid x_1^{new}, \dots, x_M^{new})$

• Use Bayes' rule to obtain the posterior distribution

• MRF are generative models

• So far no changes with respect to HMM.

(47)

Inference for MRF : maximum a posteriori (MAP)

• $\Pr(x_1^{new}, \dots, x_M^{new})$ does not depend on the y variables, so it can be removed from the optimization.

• MRF are generative models

• So far no changes with respect to HMM.

(48)

Inference for MRF : maximum a posteriori (MAP)

• Let us detail the model:

• $\Pr(y_1, \dots, y_M) = \frac{1}{Z_1} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$ is the prior. The prior is a Markov random field of the variables y.

(49)

Inference for MRF : maximum a posteriori (MAP)

• What about $Z_1$?

• Does it impact the optimization problem?

• Question for the students.

(50)

Inference for MRF : maximum a posteriori (MAP)

• What about 𝑍₁?

• Does it impact the optimization problem ?

• Does 𝑍₁ depend on a specific value of 𝑦₁, … , 𝑦_𝑀 ?

• 𝑍₁ is a sum over all possible values of 𝑦₁, … , 𝑦_𝑀

• It is constant with respect to the states 𝑦₁, … , 𝑦_𝑀

• So 𝑍₁ does not impact the optimization problem.

• So 𝑍₁ can be removed from the optimization problem

$$Z_1 = \sum_{y_1 \in Y} \sum_{y_2 \in Y} \cdots \sum_{y_M \in Y} \prod_{i=2}^{M} \zeta(y_i, y_{i-1})$$

(51)

Inference for MRF : maximum a posteriori (MAP)

• Let us introduce a log function

• This changes the products into sums, which behave better numerically.

• Let us introduce a minus sign to turn the maximization into a minimization problem: the negative log

(52)

Inference for MRF

Where

• $U_i$ is a unary term that depends only on a single variable $y_i$, and

• $P_i$ is a pairwise term that depends on two variables, $y_i$ and $y_{i-1}$.

Inference for MRF is close to inference for HMM

• $\Pr(y_i \mid y_{i-1})$ is replaced by $\zeta(y_i, y_{i-1})$

• Can be solved by dynamic programming (Viterbi algorithm)

• In the case of chains

(53)

Inference for CRF

(54)

Inference for Conditional Random Fields (CRF)

• CRF is an undirected model

• $\Pr(x_i \mid y_i = k)$ is replaced by $\phi(x_i, y_i)$: a similarity function

• Returns larger values when the measurement $x_i$ and the world state $y_i$ are more compatible.

• Example: $\phi(x_i, y_i) = x_i \cdot y_i$ with $x_i \in [-1, 1]$ and $y_i \in \{-1, 1\}$

• CRF are conditional models :

• The x variables are given (observed/fixed).

(55)

Inference for Conditional Random Fields (CRF)

Maximum a posteriori : MAP

(56)

Inference for Conditional Random Fields (CRF)

Maximum a posteriori : MAP

Does not depend on y

So we get :

(57)

Inference for Conditional Random Fields (CRF)

• Let us take the negative log:

(58)

Inference for Conditional Random Fields (CRF)

Where

• $U_i$ is a unary term that depends only on a single variable $y_i$, and

• $P_i$ is a pairwise term that depends on two variables, $y_i$ and $y_{i-1}$.

Inference for CRF is close to inference for MRF

• $\Pr(x_i \mid y_i)$ is replaced by $\phi(x_i, y_i)$

• Can be solved by dynamic programming (Viterbi algorithm)

• In the case of chains

(59)

Learning

• Learning with HMM

• Learning with MRF

• Learning with CRF

(60)

Learning

• The data set:

• $\mathcal{D} = \{(x^{(j)}, y^{(j)})\}_{j=1}^{N}$: N data samples

• $x^{(j)} = (x_1^{(j)}, \dots, x_i^{(j)}, \dots, x_M^{(j)})$: one input sequence of M elements

• $y^{(j)} = (y_1^{(j)}, \dots, y_i^{(j)}, \dots, y_M^{(j)})$: one output sequence of M elements

• $y_i^{(j)} \in \{1, 2, \dots, K\}$: usually a discrete domain, with K classes

(61)

Learning with HMM

(62)

Learning with HMM

• So far there has been no learning, only inference

• We just give a sequence (x) and output the labeled sequence (y)

• Where can learning be introduced?

• Where are the parameters?

• Let us take an example:

• The measurements (x) have a normal distribution

• The class variable (y) follows a categorical distribution.

• This hidden Markov model has parameters $W = \{\mu_k, \sigma_k, \lambda_k\}_{k=1}^{K}$

• One Gaussian $(\mu_k, \sigma_k)$ per class to model $x_1^{(j)}, \dots, x_i^{(j)}, \dots, x_M^{(j)}$

• $\lambda_k$: parameters of the categorical distribution

(63)

Learning with HMM

• Let us express the joint probability of the data set 𝒟

• For one input and one output, the HMM model is :

(64)

Learning with HMM

• For the whole data set :

Assuming each data sequence was drawn independently from the same distribution (independent and identically distributed, i.i.d.)

(65)

Learning with HMM

(66)

Learning with HMM

• Let us fit the probability model by maximum likelihood

(67)

Learning with HMM

(68)

Learning with HMM

(69)

Learning with HMM

• Supervised learning

• Relatively simple. We first isolate the part of the model that we want to learn.

• For example, we might learn the parameters of the measurement model $\Pr(x_i \mid y_i)$ from paired examples of $x_i$ and $y_i$.

• We can learn these parameters in isolation using the ML, MAP, or Bayesian methods.

• The same applies to the parameters of the transition model $\Pr(y_i \mid y_{i-1})$.
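To make the supervised case concrete, here is a sketch of maximum likelihood estimation from labeled sequences, assuming one Gaussian per class for the measurements and a categorical transition model. The function name `fit_hmm_supervised` and the toy data are hypothetical, and reading the transition rows as the $\lambda_k$ parameters is an interpretation, not something the slides state.

```python
import numpy as np

def fit_hmm_supervised(xs, ys, K):
    """ML estimates from paired sequences; labels are integers in 0..K-1.

    Assumes every class occurs in the data and is followed by at least one
    transition, otherwise the divisions below are undefined.
    """
    x_all = np.concatenate(xs)
    y_all = np.concatenate(ys)
    # Measurement model: one Gaussian per class, fitted in isolation.
    mu = np.array([x_all[y_all == k].mean() for k in range(K)])
    sigma = np.array([x_all[y_all == k].std() for k in range(K)])
    # Initial state distribution from the first label of each sequence.
    initial = np.bincount([y[0] for y in ys], minlength=K) / len(ys)
    # Transition model: normalized counts of consecutive label pairs.
    counts = np.zeros((K, K))
    for y in ys:
        for prev, curr in zip(y[:-1], y[1:]):
            counts[prev, curr] += 1
    transition = counts / counts.sum(axis=1, keepdims=True)
    return mu, sigma, initial, transition

# Toy data set with N = 2 labeled sequences and K = 2 classes (invented values).
xs = [np.array([0.1, -0.2, 5.1, 4.9]), np.array([5.2, 0.0, 0.3])]
ys = [np.array([0, 0, 1, 1]), np.array([1, 0, 0])]

mu, sigma, initial, transition = fit_hmm_supervised(xs, ys, K=2)
print(mu, sigma, initial, transition, sep="\n")
```

Each block of parameters is estimated from its own sufficient statistics (per-class samples, first labels, label pairs), which is what learning the parts "in isolation" amounts to here.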

(70)

Learning with HMM

• Unsupervised learning

• More challenging

• Beyond the scope of this presentation (dedicated to supervised learning)

• Requires notions such as:

1. Expectation Maximization method

2. Forward-Backward method

3. 1 + 2 = Baum-Welch algorithm

• Further reading: Gaussian Mixture and Expectation Maximization algorithm, http://romain.raveaux.free.fr/document/GaussianMixtureandExpectationMaximization.html

(71)

Limit of HMM

• Parameters ($W_1$ and $W_2$) of the distributions are shared through time

• All time steps have the same parameters

• This corresponds to the assumption of a stationary time series.

• HMMs are generative models

• Discriminative models provide a more direct way to infer information about the world state (y)

• Although this is more general than the independence model, it is still very restrictive.

• Markov hypothesis: the future depends solely on the present

(72)

Learning with MRF

(73)

Learning with MRF

• For one input and one output, the MRF model is :

(74)

Learning with MRF

• For one input and one output, the MRF model is :

(75)

Learning with MRF

Assuming each data sequence was drawn independently from the same distribution (independent and identically distributed, i.i.d.)

(76)

Learning with MRF

(77)

Learning with MRF

(78)

Learning with MRF

(79)

Learning with MRF

(80)

Learning with MRF

Like learning with HMM.

Here we have a difference: Z

Z appears in the learning stage !!!

ouch

(81)

Learning with MRF

Z does not depend on y (do you remember ? )

Cool: for a given W, we only have to compute it once

(82)

Learning with MRF

Is computing Z challenging?

$K^M$ terms to sum!

How to compute Z efficiently? An important question!

(83)

Learning with MRF

• How to compute Z efficiently?

• Generally speaking, take advantage of the structure of the problem you are dealing with

• Two arguments can be used in the case of chains:

• Exploit the structured factorization of the problem

• Based on conditional independence

• We can use one of them to compute Z efficiently

• Let us exploit the factorization property

(84)

Forward algorithm for computing Z

• We observe that not every term in the product is relevant to every summation.

• We can re-arrange the summation terms so that only the variables over which they sum are to the right

• We proceed from right to left, computing each summation in turn.

• This technique is known as variable elimination.
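For the chain MRF, the rearrangement could look as follows. The intermediate functions $m_i$ are names introduced for this sketch; the slides may label them differently.

```latex
Z = \sum_{y_M} \sum_{y_{M-1}} \zeta(y_M, y_{M-1}) \cdots
    \sum_{y_2} \zeta(y_3, y_2) \sum_{y_1} \zeta(y_2, y_1)

m_1(y_2) = \sum_{y_1} \zeta(y_2, y_1), \qquad
m_i(y_{i+1}) = \sum_{y_i} \zeta(y_{i+1}, y_i)\, m_{i-1}(y_i) \quad (i = 2, \dots, M-1)

Z = \sum_{y_M} m_{M-1}(y_M)
```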

(85)

Forward algorithm for computing Z

• Let us denote the rightmost three terms :

(86)

Forward algorithm for computing Z

• At the 𝑖^𝑡ℎ stage we compute :

• We repeat this process until we have computed the full expression.

• This solution consists of M-1 summations over K values (K classes)

• This is much more efficient than explicitly computing all $K^M$ terms of the sum
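The recursion can be compared against brute-force enumeration on a small chain. In the sketch below the function names, the matrix layout `zeta[a, b]`, and the positive potential exp(y_i · y_{i-1}) are assumptions made for the example.

```python
import itertools
import numpy as np

def partition_forward(zeta, M):
    """Z for a chain of M variables; zeta[a, b] = ζ(y_i = a, y_{i-1} = b)."""
    K = zeta.shape[0]
    message = np.ones(K)
    for _ in range(M - 1):
        # Eliminate one variable: new[a] = sum_b zeta[a, b] * message[b]
        message = zeta @ message
    # Final sum over the last remaining variable.
    return float(message.sum())

def partition_brute_force(zeta, M):
    """Z by explicit enumeration of all K^M label sequences."""
    K = zeta.shape[0]
    total = 0.0
    for y in itertools.product(range(K), repeat=M):
        term = 1.0
        for i in range(1, M):
            term *= zeta[y[i], y[i - 1]]
        total += term
    return total

# Hypothetical potential on states {-1, +1}: ζ(y_i, y_{i-1}) = exp(y_i * y_{i-1}).
states = np.array([-1.0, 1.0])
zeta = np.exp(np.outer(states, states))

Z_fast = partition_forward(zeta, M=4)
Z_slow = partition_brute_force(zeta, M=4)
print(Z_fast, Z_slow)
```

Each elimination step is a K-by-K matrix-vector product, so the whole computation costs on the order of MK² operations rather than K^M.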

(87)

Forward algorithm for computing Z

• This solution consists of M-1 summations over K values (K classes)

• Can we now perform the learning stage for the MRF?

(88)

Learning with MRF

• Well, we are not done yet!

• Let us take the negative log

• How can we solve this problem?

• Question for the students.

(89)

Learning with MRF

• Well, we are not done yet!

• Let us take the negative-log

• The min is where the derivative is equal to zero :

• Let us assume that 𝜁 is differentiable

(90)

Learning with MRF

• This problem can be solved by gradient descent :

• Do you remember?

(91)

Learning with MRF

• Where 𝛼 is the learning rate
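The update referred to here is the usual gradient descent step. In the sketch below, $L(W)$ denotes the negative log-likelihood being minimized and $t$ the iteration index; both symbols are introduced for this illustration.

```latex
W^{(t+1)} = W^{(t)} - \alpha \left. \frac{\partial L(W)}{\partial W} \right|_{W = W^{(t)}}
```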

(92)

Learning with MRF

• Where 𝛼 is the learning rate

• But how do we compute the gradient?

(93)

Learning with MRF

• The second term is easy to compute :

• The sum of the derivatives of log of 𝜁

(94)

Learning with MRF

• The first term is trickier:

• It contains the sum of the derivatives of log of 𝜁

• It involves an intractable sum over all possible values of the y

(95)

Learning with MRF

• The first term is trickier:

• It contains the sum of the derivatives of log of 𝜁

• It involves an intractable sum over all possible values of the y

• Intractable? Not for us, not in the case of chains:

• Thanks to the forward algorithm

• We can do that with M-1 summations over K values (K classes)
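The pieces can be assembled into a small end-to-end example. The sketch below is a simplification: it learns only the chain prior over y and ignores the measurements x, uses a hypothetical one-parameter potential ζ(y_i, y_{i-1}) = exp(w · y_i · y_{i-1}) with y in {-1, +1}, and approximates the gradient by central finite differences instead of deriving it analytically. The function names and the toy sequences are invented.

```python
import numpy as np

def log_partition(w, M):
    """log Z of a length-M chain, computed with the forward recursion."""
    states = np.array([-1.0, 1.0])
    zeta = np.exp(w * np.outer(states, states))
    message = np.ones(2)
    for _ in range(M - 1):
        message = zeta @ message
    return np.log(message.sum())

def negative_log_likelihood(w, sequences):
    """Sum over i.i.d. sequences of -log Pr(y); Z appears once per sequence."""
    total = 0.0
    for y in sequences:
        agreement = np.sum(y[1:] * y[:-1])      # sum_i y_i * y_{i-1}
        total += -w * agreement + log_partition(w, len(y))
    return total

def fit(sequences, alpha=0.05, steps=500, eps=1e-5):
    """Plain gradient descent on w with learning rate alpha."""
    w = 0.0
    for _ in range(steps):
        grad = (negative_log_likelihood(w + eps, sequences)
                - negative_log_likelihood(w - eps, sequences)) / (2 * eps)
        w -= alpha * grad
    return w

# Toy labeled sequences: 9 of the 12 adjacent pairs agree.
sequences = [np.array([1, 1, 1, -1, -1]),
             np.array([-1, -1, -1, -1, 1]),
             np.array([1, 1, -1, -1, -1])]

w_hat = fit(sequences)
print(w_hat)
```

For this particular potential the optimum has a closed form, tanh(w) equal to the average agreement between neighbors, which gives an independent check on the gradient descent result.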

(96)

Learning with CRF

(97)

Learning with CRF

• Let us express the conditional probability over the observed data x

• For one input and one output, the CRF model is :

(98)

Learning with CRF

• Let us express the conditional probability over the observed data x

• For one input and one output, the CRF model is :

(99)

Learning with CRF

• Let us express the conditional probability of the data set 𝒟

Assuming each data sequence was drawn independently from the same distribution (independent and identically distributed, i.i.d.)

(100)

Learning with CRF

• Let us express the conditional probability of the data set 𝒟

(101)

Learning with CRF

(102)

Learning with CRF

Z does not depend on y (do you remember ? )

Does not depend on W

(103)

Learning with CRF

(104)

Learning with CRF

• It is close to the case of MRF

• We have to compute Z efficiently

• To do so, we can use the factorization property of the problem (The case of chains)

(105)

Learning with CRF : Computation of Z

• Let us use the forward algorithm

• W is fixed/given during this computation, so we drop it from the notation

• This makes the notation simpler

(106)

Learning with CRF : Computation of Z

• We observe that not every term in the product is relevant to every summation.

• We can re-arrange the summation terms so that only the variables over which they sum are to the right

• We proceed from right to left, computing each summation in turn.

• This technique is known as variable elimination.

(107)

Forward algorithm for computing Z

• Let us denote the rightmost three terms :

(108)

Forward algorithm for computing Z

• We repeat this process until we have computed the full expression.

• This solution consists of M summations over K values (K classes)

• This is much more efficient than explicitly computing all $K^M$ terms of the sum

(109)

Forward algorithm for computing Z

• This solution consists of M summations over K values (K classes)

• We can now perform the learning stage for the CRF
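For the CRF, the same recursion applies once the data-dependent potentials are multiplied in at each step. The sketch below assumes positive potentials of the form exp(x_i · y_i) and exp(y_i · y_{i-1}); the function name `crf_partition`, the array layout, and the example measurements are hypothetical.

```python
import numpy as np

def crf_partition(phi, zeta):
    """Z(x) for a chain CRF with the measurements x held fixed.

    phi[i, k]  = φ(x_i, y_i = k)
    zeta[a, b] = ζ(y_i = a, y_{i-1} = b)
    """
    message = phi[0].copy()
    for i in range(1, phi.shape[0]):
        # Sum out y_{i-1}, then weight by the unary potential of y_i.
        message = phi[i] * (zeta @ message)
    # Final sum over y_M.
    return float(message.sum())

# Hypothetical example with M = 3 measurements and states {-1, +1}.
states = np.array([-1.0, 1.0])
x = np.array([0.5, -1.0, 0.2])
phi = np.exp(np.outer(x, states))        # shape (M, K)
zeta = np.exp(np.outer(states, states))  # shape (K, K)

Z_x = crf_partition(phi, zeta)
print(Z_x)
```

Unlike the MRF case, Z(x) depends on the observed sequence, so it has to be recomputed for every training sequence and every parameter update.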

(110)

Some food for thought

• For learning:

• We have used Maximum Likelihood, but Maximum a posteriori (MAP) could have been used. A prior Pr(W) would then have appeared.

• One possible way to compute the gradient is the contrastive divergence algorithm, a method for approximating the gradient.

• The forward algorithm can be performed from the end of the chain:

• Then called backward recursion

• Forward-backward algorithm is a special case of belief propagation

• The methods described here also work for trees, not only for chains

(111)

Summary

(112)

Some quotes

(113)

Conclusion

• We have seen :

• How to take into account dependencies within a data set

• Sequential dependencies (time dependencies)

• HMM model

• A generative model

• How to infer world states (labels y) from a given sequence x

• By maximizing the posterior distribution (Maximum A Posteriori: MAP)

• How supervised learning could be achieved

• Next :

• Beyond the case of chains :

• Go to graphs

• http://romain.raveaux.free.fr/document/Structured%20Output%20Learning.pdf