Online Parameter estimation in HMM


Sylvain Le Corff, Gersende Fort, Eric Moulines

Télécom ParisTech

April 19, 2012

(2)

1 Motivation

2 MLE in Missing Data Models

3 Online EM Algorithm for IID Models

4 The online EM algorithm in the HMM case

5 Online computations


(4)

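$Y \overset{\text{def}}{=} \{Y_t\}_{t \in \mathbb{Z}}$ is the observation process, whereas $X \overset{\text{def}}{=} \{X_t\}_{t \in \mathbb{Z}}$ are the hidden states.

The distribution of the HMM is specified by

1 A family of transition kernels $\{m_\theta\}_{\theta \in \Theta}$ on $\mathsf{X} \times \mathcal{B}(\mathsf{X})$ governing the transition of the hidden chain.

2 A family of transition kernels $\{g_\theta\}_{\theta \in \Theta}$ on $\mathsf{X} \times \mathcal{B}(\mathsf{Y})$, the conditional likelihood of the observations.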

(5)

Online estimation in HMM

1 Objective: estimate the parameter $\theta$ using the maximum likelihood estimator (or a quasi-MLE if the model is misspecified).

2 Requirements

- No storage of the observations: the memory footprint must not grow with time.

- Constant complexity per incoming observation.

- The parameters are updated as each new observation arrives.

3 Applications

- Inference of partially observed Markov chains for very large data sets (e.g. proteomics, volatility from high-frequency data).

- Online localization and mapping in robotics.


(7)

MLE in Missing data: the IID case

(Curved) exponential family model.

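$p_\theta(x_t, y_t) = \exp\left(\langle s(x_t, y_t), \psi(\theta) \rangle - A(\theta)\right)$ with respect to some $\sigma$-finite dominating measure.

Explicit complete-data MLE.

$S \mapsto \bar\theta(S) = \arg\max_\theta \left\{ \langle S, \psi(\theta) \rangle - A(\theta) \right\}$ is available in closed form.

IID Data.

$(Y_t)$ is an i.i.d. process with marginal $\pi$ (not necessarily equal to $f_{\theta_\star}$).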
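As an illustration that is not in the slides, a two-component Gaussian mixture with unit variances fits this framework. The model, the parameterization $\theta = (w, \mu_0, \mu_1)$ and the indexing of $S$ are assumptions made for this sketch; the terms $-\tfrac12 y^2 - \tfrac12 \log(2\pi)$ are absorbed into the dominating measure.

```latex
% Assumed example: X_t \in \{0,1\}, P(X_t = 1) = w, Y_t | X_t = j \sim N(\mu_j, 1).
\begin{align*}
\log p_\theta(x, y)
  &= \sum_{j=0}^{1} \mathbf{1}\{x = j\}
     \left(\log w_j - \tfrac{1}{2}\mu_j^2 + \mu_j y\right)
     - \tfrac{1}{2} y^2 - \tfrac{1}{2}\log(2\pi),
  \qquad w_1 = w,\; w_0 = 1 - w, \\
s(x, y) &= \left(\mathbf{1}\{x=0\},\; \mathbf{1}\{x=1\},\;
                 \mathbf{1}\{x=0\}\,y,\; \mathbf{1}\{x=1\}\,y\right), \\
\psi(\theta) &= \left(\log w_0 - \tfrac{1}{2}\mu_0^2,\;
                      \log w_1 - \tfrac{1}{2}\mu_1^2,\; \mu_0,\; \mu_1\right),
  \qquad A(\theta) = 0, \\
\bar\theta(S) &= \left(w = \frac{S_2}{S_1 + S_2},\;
                       \mu_0 = \frac{S_3}{S_1},\;
                       \mu_1 = \frac{S_4}{S_2}\right)
  \quad \text{for } S = (S_1, S_2, S_3, S_4).
\end{align*}
```

The last line follows by setting the gradient of $\langle S, \psi(\theta) \rangle - A(\theta)$ to zero, which is what makes the complete-data MLE explicit here.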

(8)

The (Usual) Expectation-Maximization Algorithm

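$k$-th EM iteration (with $T$ observations).

E-Step

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\theta_{k-1}}\left[ s(X_t, Y_t) \mid Y_t \right].$$

M-Step

$$\theta_k = \bar\theta(S_{T,k}).$$

Can be fully reparameterized in the domain of sufficient statistics:

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar\theta(S_{T,k-1})}\left[ s(X_t, Y_t) \mid Y_t \right] \overset{\text{def}}{=} \Phi_T(S_{T,k-1}).$$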
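The reparameterized iteration can be sketched as a map acting on averaged sufficient statistics. The code below is a minimal illustration under assumptions that are not in the slides: a two-component Gaussian mixture with unit variances, parameter `theta = (w, mu0, mu1)`, synthetic data, and hypothetical helper names (`posterior_x1`, `theta_bar`, `em_map`).

```python
import math
import random

def posterior_x1(y, theta):
    """P(X = 1 | Y = y) for the assumed mixture (1-w) N(mu0, 1) + w N(mu1, 1)."""
    w, mu0, mu1 = theta
    a0 = (1.0 - w) * math.exp(-0.5 * (y - mu0) ** 2)
    a1 = w * math.exp(-0.5 * (y - mu1) ** 2)
    return a1 / (a0 + a1)

def theta_bar(S):
    """Complete-data MLE written as a function of the averaged statistics S."""
    s0, s1, sy0, sy1 = S
    return (s1 / (s0 + s1), sy0 / s0, sy1 / s1)

def em_map(S, ys):
    """One EM iteration in the domain of sufficient statistics (the map Phi_T)."""
    theta = theta_bar(S)
    acc = [0.0, 0.0, 0.0, 0.0]
    for y in ys:
        p1 = posterior_x1(y, theta)   # E-step for a single observation
        acc[0] += 1.0 - p1
        acc[1] += p1
        acc[2] += (1.0 - p1) * y
        acc[3] += p1 * y
    return [a / len(ys) for a in acc]

# Synthetic data from the assumed model with w = 0.3, mu0 = -1, mu1 = 2.
random.seed(0)
ys = [random.gauss(2.0, 1.0) if random.random() < 0.3 else random.gauss(-1.0, 1.0)
      for _ in range(2000)]

S = [0.5, 0.5, -0.25, 0.25]   # corresponds to w = 0.5, mu0 = -0.5, mu1 = 0.5
for _ in range(50):
    S = em_map(S, ys)
w_hat, mu0_hat, mu1_hat = theta_bar(S)
```

Each call to `em_map` sweeps over all `T` observations, which is exactly the cost the online variants are designed to avoid.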

(9)

The Limiting EM Recursion

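As $T$ goes to infinity, the sequence of EM mappings $(\Phi_T)$ converges to a limit.

Sufficient Statistics Update

$$S_k = \mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S_{k-1})}\left[ s(X_0, Y_0) \mid Y_0 \right] \right] \overset{\text{def}}{=} \Phi(S_{k-1}).$$

Parameter Update

$$\theta_k = \bar\theta(S_k).$$

Some results

1 The Kullback-Leibler divergence between $\pi$ and the marginal distribution $f_{\theta_k}$, $\mathrm{KL}(\pi \,\|\, f_{\theta_k})$, decreases monotonically with $k$.

2 The sequence $(\theta_k)$ converges to $\{\theta : \nabla_\theta \mathrm{KL}(\pi \,\|\, f_\theta) = 0\}$.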


(11)

Online EM: Rationale

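Objective: find the roots of

$$\mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S)}\left[ s(X_0, Y_0) \mid Y_0 \right] \right] - S = 0.$$

Stochastic Approximation (or Robbins-Monro) setup.

$\mathbb{E}_{\bar\theta(S)}\left[ s(X_n, Y_n) \mid Y_n \right]$ is seen as a noisy observation of $\mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S)}\left[ s(X_0, Y_0) \mid Y_0 \right] \right]$.

$$S_n = S_{n-1} + \gamma_n \left( \mathbb{E}_{\bar\theta(S_{n-1})}\left[ s(X_n, Y_n) \mid Y_n \right] - S_{n-1} \right),$$

where $(\gamma_n)$ is a sequence of decreasing positive stepsizes.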

(12)

Online EM Algorithm

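Stochastic E-Step

$$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n \mathbb{E}_{\theta_{n-1}}\left[ s(X_n, Y_n) \mid Y_n \right].$$

M-Step

$$\theta_n = \bar\theta(S_n).$$

Practical Recommendations

$\gamma_n = c / n^\alpha$ with $\alpha \in [0.7, 0.9]$.

Skip the M-step for the first 10 to 20 observations.

(Optional) Use Polyak-Ruppert averaging.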
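The recursion and the three recommendations can be put together in a short sketch. Everything specific below is an assumption made for illustration and is not taken from the slides: the two-component unit-variance Gaussian mixture, the synthetic data, the values `c=1.0`, `alpha=0.7`, `burn_in=20`, `avg_start=1000`, and the hypothetical function names.

```python
import math
import random

def posterior_x1(y, theta):
    """P(X = 1 | Y = y) for the assumed mixture (1-w) N(mu0, 1) + w N(mu1, 1)."""
    w, mu0, mu1 = theta
    a0 = (1.0 - w) * math.exp(-0.5 * (y - mu0) ** 2)
    a1 = w * math.exp(-0.5 * (y - mu1) ** 2)
    return a1 / (a0 + a1)

def theta_bar(S):
    """Complete-data MLE as a function of the running statistics S."""
    s0, s1, sy0, sy1 = S
    return (s1 / (s0 + s1), sy0 / s0, sy1 / s1)

def online_em(ys, theta0, c=1.0, alpha=0.7, burn_in=20, avg_start=1000):
    theta = theta0
    S = [0.0, 0.0, 0.0, 0.0]
    theta_avg = None
    for n, y in enumerate(ys, start=1):
        gamma = c / n ** alpha                    # stepsize gamma_n = c / n^alpha
        p1 = posterior_x1(y, theta)               # stochastic E-step under theta_{n-1}
        s_new = [1.0 - p1, p1, (1.0 - p1) * y, p1 * y]
        S = [(1.0 - gamma) * a + gamma * b for a, b in zip(S, s_new)]
        if n > burn_in:                           # M-step skipped during burn-in
            theta = theta_bar(S)
        if n > avg_start:                         # Polyak-Ruppert average of the iterates
            k = n - avg_start
            theta_avg = theta if k == 1 else tuple(
                a + (t - a) / k for a, t in zip(theta_avg, theta))
    return theta, theta_avg

# Synthetic stream from the assumed model with w = 0.3, mu0 = -1, mu1 = 2.
random.seed(0)
ys = [random.gauss(2.0, 1.0) if random.random() < 0.3 else random.gauss(-1.0, 1.0)
      for _ in range(20000)]
theta_last, theta_avg = online_em(ys, (0.5, -0.5, 0.5))
```

Each observation is touched once and then discarded; only `S`, `theta` and the running average are kept, so memory and per-observation cost stay constant.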


(14)

The EM Algorithm for HMMs

1 The EM update with $T$ observations is now

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar\theta(S_{T,k-1})}\left[ s(X_{t-1}, X_t, Y_t) \mid Y_{1:T} \right].$$

2 The conditional expectation depends on the whole record $Y_{1:T}$, hence on future values of the observations.

Problem 1: how to compute additive functionals recursively in time?

Problem 2: how to adapt the parameters within such a framework?

(15)

The limiting EM for HMMs

An iteration of the EM algorithm reads

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar\theta(S_{T,k-1})}\left[ s(X_{t-1}, X_t, Y_t) \mid Y_{1:T} \right].$$

Assuming that

- $(Y_t)$ is an ergodic process with distribution $\pi$,

- the HMM satisfies some form of forgetting property,

the limiting EM recursion becomes (as $T \to \infty$)

$$S_k = \mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S_{k-1})}\left[ s(X_{-1}, X_0, Y_0) \mid Y_{-\infty:\infty} \right] \right].$$

Idea: develop a sequential algorithm that approximates the limiting EM!


(17)

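The M-step is performed on blocks of observations $Y_{T_k:T_{k+1}}$, for an appropriately chosen sequence of time instants $\{T_k, k \geq 1\}$.

The parameters are kept constant while accumulating the information brought by the observations $Y_{T_k:T_{k+1}}$.

Algorithm

1 Block $n$

- From $T_{n-1}+1$ to $T_n$, compute recursively

$$\bar S_{\tau_n}(\theta_{n-1}, Y) = \frac{1}{\tau_n} \mathbb{E}_{\theta_{n-1}}\left[ \sum_{t=1}^{\tau_n} S(X_{t-1}, X_t, Y_{t+T_{n-1}}) \,\middle|\, Y_{T_{n-1}+1:T_{n-1}+\tau_n} \right].$$

- Compute

$$\Sigma_n \overset{\text{def}}{=} \left(1 - \frac{\tau_n}{T_n}\right) \Sigma_{n-1} + \frac{\tau_n}{T_n} \bar S_{\tau_n}(\theta_{n-1}, Y).$$

2 Parameter update:

- $\theta_n \overset{\text{def}}{=} \bar\theta\left[\bar S_{\tau_n}(\theta_{n-1}, Y)\right]$.

- $\tilde\theta_n \overset{\text{def}}{=} \bar\theta\left[\Sigma_n\right]$.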
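The block structure can be sketched as a generic driver. This is only an illustration under stated assumptions: `block_online_em`, `block_statistic` and `theta_bar` are hypothetical names, the block sizes are arbitrary, and the toy usage replaces the HMM smoothing step by a plain block mean of i.i.d. Gaussian data (no hidden state), purely to check the weights in the recursion for $\Sigma_n$.

```python
import random

def block_online_em(ys, block_sizes, theta0, block_statistic, theta_bar):
    """Block online EM skeleton.

    block_statistic(theta, block) must return the block statistic (a list)
    computed with the parameter frozen at theta; theta_bar maps a statistic
    to a parameter. Both are supplied by the user (hypothetical interface).
    """
    theta, theta_avg = theta0, theta0
    Sigma, T, start = None, 0, 0
    for tau in block_sizes:
        block = ys[start:start + tau]
        if len(block) < tau:          # not enough data left for a full block
            break
        S_block = block_statistic(theta, block)   # parameter kept constant in the block
        T += tau
        w = tau / T
        # Sigma_n = (1 - tau_n / T_n) Sigma_{n-1} + (tau_n / T_n) S_block
        Sigma = S_block if Sigma is None else [
            (1.0 - w) * a + w * b for a, b in zip(Sigma, S_block)]
        theta = theta_bar(S_block)    # block estimate theta_n
        theta_avg = theta_bar(Sigma)  # averaged estimate
        start += tau
    return theta, theta_avg

# Toy check (assumption: i.i.d. N(3, 1) data, statistic = block mean).
random.seed(1)
ys = [random.gauss(3.0, 1.0) for _ in range(700)]
block_sizes = [10 + 2 * n for n in range(1, 21)]   # slowly growing blocks, 620 obs in total
theta_last, theta_avg = block_online_em(
    ys, block_sizes, theta0=0.0,
    block_statistic=lambda theta, block: [sum(block) / len(block)],
    theta_bar=lambda S: S[0])
```

In this toy case the averaged estimate coincides with the mean of all observations consumed so far, which confirms that $\Sigma_n$ weights each block statistic by its block length $\tau_j / T_n$.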


(19)

Online computation of additive functionals

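Consider the following additive functional:

$$\bar S_T = \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) \,\middle|\, Y_{1:T} \right].$$

By the tower property of the conditional expectation,

$$\bar S_T = \mathbb{E}\left[ \rho_T(X_T) \mid Y_{1:T} \right] = \phi_T[\rho_T],$$

where $\phi_T$ is the filtering distribution at time $T$ and

$$\rho_T(x_T) \overset{\text{def}}{=} \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) \,\middle|\, Y_{1:T}, X_T = x_T \right].$$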

(20)

Online computation of additive functionals

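Decompose

$$\frac{1}{T} \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) = \left(1 - \frac{1}{T}\right) \frac{1}{T-1} \sum_{t=1}^{T-1} S(X_{t-1}, X_t, Y_t) + \frac{1}{T} S(X_{T-1}, X_T, Y_T).$$

Then, use that $X_{0:T}$ given $Y_{0:T}$ is a Markov chain:

$$\rho_T(x_T) = \left(1 - \frac{1}{T}\right) B_{T|T-1}[x_T, \rho_{T-1}] + \frac{1}{T} B_{T|T-1}[x_T, S(\cdot, x_T, Y_T)].$$

Here $B_{T|T-1}$ is the backward Markov transition kernel

$$B_{T|T-1}(x_T, dx_{T-1}) \overset{\text{def}}{=} \frac{\phi_{T-1}(dx_{T-1})\, m(x_{T-1}, x_T)}{\int \phi_{T-1}(dx_{T-1})\, m(x_{T-1}, x_T)},$$

where $\phi_{T-1}$ is the filtering distribution at time $T-1$.

The computations can be carried out forward in time!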
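For a finite state space the recursion can be checked exactly. The sketch below is an illustration with assumed ingredients that are not in the slides: a two-state chain, a binary observation alphabet, arbitrary numerical values for `nu`, `m`, `g` and `ys`, and hypothetical function names. It runs the forward-only recursion and compares it with a brute-force sum over all hidden paths.

```python
from itertools import product

def forward_smoothing(nu, m, g, ys, S):
    """Forward-only computation of (1/T) E[sum_t S(X_{t-1}, X_t, Y_t) | Y_{1:T}]."""
    K = len(nu)
    phi = list(nu)        # filtering distribution at time 0
    rho = [0.0] * K       # rho_0 = 0
    for t, y in enumerate(ys, start=1):
        new_rho, unnorm = [], []
        for j in range(K):
            # Unnormalised backward kernel B_{t|t-1}(j, i) = phi_{t-1}(i) m(i, j).
            back = [phi[i] * m[i][j] for i in range(K)]
            norm = sum(back)
            new_rho.append(sum(
                back[i] / norm * ((1 - 1 / t) * rho[i] + S(i, j, y) / t)
                for i in range(K)))
            unnorm.append(norm * g[j][y])     # prediction times likelihood
        total = sum(unnorm)
        phi = [u / total for u in unnorm]     # filtering distribution at time t
        rho = new_rho
    return sum(phi[j] * rho[j] for j in range(K))

def brute_force(nu, m, g, ys, S):
    """Same quantity by enumerating every hidden path (exponential cost)."""
    K, T = len(nu), len(ys)
    num = den = 0.0
    for path in product(range(K), repeat=T + 1):
        w, val = nu[path[0]], 0.0
        for t in range(1, T + 1):
            w *= m[path[t - 1]][path[t]] * g[path[t]][ys[t - 1]]
            val += S(path[t - 1], path[t], ys[t - 1])
        num += w * val / T
        den += w
    return num / den

# Assumed toy model: 2 hidden states, observations in {0, 1}.
nu = [0.6, 0.4]
m = [[0.9, 0.1], [0.2, 0.8]]
g = [[0.7, 0.3], [0.1, 0.9]]
ys = [0, 1, 1, 0, 1, 0]
S = lambda i, j, y: float(i != j)   # fraction of time steps with a state change

online = forward_smoothing(nu, m, g, ys, S)
exact = brute_force(nu, m, g, ys, S)
```

The forward version only ever stores `phi` and `rho` (two vectors of length `K`), with a cost of order `K**2` per observation.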

(21)

Online computation for additive functional

This sequential computation can be done exactly only when an explicit expression for the filter is available:

1 Linear Gaussian models.

2 HMMs with finite state spaces.

In the online framework, sequential Monte Carlo methods (a.k.a. particle filters) are appealing:

1 these methods are easy to implement and to tweak (as long as the dimension of the hidden space is not too large);

2 these methods are amenable to parallel computations.

(22)

Particle approximation of the additive functional

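$\phi_t$ is approximated by weighted samples $\{(\xi_t^i, \omega_t^i)\}_{i=1}^N$:

$$\phi_t^N[h] = \sum_{i=1}^{N} \omega_t^i h(\xi_t^i).$$

The backward kernel can be approximated at the current particle locations:

$$B_{t|t-1}^N(\xi_t^i, dx_{t-1}) \overset{\text{def}}{=} \frac{\phi_{t-1}^N(dx_{t-1})\, m(x_{t-1}, \xi_t^i)}{\int \phi_{t-1}^N(dx_{t-1})\, m(x_{t-1}, \xi_t^i)}.$$

The functions $\rho_t$ can then be computed at all particle locations (the computational cost grows like $N^2$; algorithms with linear complexity may be derived, but they do not proceed entirely forward in time):

$$\rho_t^N(\xi_t^i) = B_{t|t-1}^N\left[ \xi_t^i, \left(1 - \frac{1}{t}\right) \rho_{t-1}^N(\cdot) + \frac{1}{t} S(\cdot, \xi_t^i, Y_t) \right].$$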

(23)

Particle filtering in action

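1 Computation of $\phi_t^N$ with $Y_t$ and $\phi_{t-1}^N$.

2 Computation of $\{\rho_t^N(\xi_t^i)\}_{i=1}^N$ with $Y_t$ and $\phi_t^N$.

For each particle $\xi_t^i$, weights $\{\tilde\omega_{t-1}^{i,j} = \omega_{t-1}^j\, m(\xi_{t-1}^j, \xi_t^i)\}_{j=1}^N$ are computed to match the target kernel.

$$B_{t|t-1}^N(\xi_t^i, dx_{t-1}) = \frac{\sum_{j=1}^N \omega_{t-1}^j\, m(\xi_{t-1}^j, \xi_t^i)\, \delta_{\xi_{t-1}^j}(dx_{t-1})}{\sum_{j=1}^N \omega_{t-1}^j\, m(\xi_{t-1}^j, \xi_t^i)} = \sum_{j=1}^{N} \frac{\tilde\omega_{t-1}^{i,j}}{\sum_{k=1}^N \tilde\omega_{t-1}^{i,k}}\, \delta_{\xi_{t-1}^j}(dx_{t-1}).$$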
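The two steps can be sketched in a few lines. Several choices below are assumptions made for this sketch, not specified in the slides: a bootstrap particle filter with multinomial resampling at every step, a hidden AR(1) chain with kernel $m(x, x') = N(x'; \phi x, \sigma^2)$ and observation density $N(y; 0, \beta^2 e^{x})$, the made-up observation list `ys`, and the hypothetical function name.

```python
import math
import random

def particle_forward_smoothing(ys, S, phi=0.8, sigma2=0.2, beta2=1.0, N=50, seed=0):
    """O(N^2) particle estimate of (1/T) E[sum_t S(X_{t-1}, X_t, Y_t) | Y_{1:T}]."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    # Time-0 particles drawn from the stationary law of the hidden chain.
    xs = [rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2)) for _ in range(N)]
    ws = [1.0 / N] * N
    rho = [0.0] * N
    for t, y in enumerate(ys, start=1):
        # Step 1: filter update (resample, propagate, reweight with g up to a constant).
        anc = rng.choices(range(N), weights=ws, k=N)
        new_xs = [phi * xs[a] + sigma * rng.gauss(0.0, 1.0) for a in anc]
        new_ws = [math.exp(-0.5 * y * y / (beta2 * math.exp(x)) - 0.5 * x)
                  for x in new_xs]
        total = sum(new_ws)
        new_ws = [w / total for w in new_ws]
        # Step 2: backward weights and update of rho at every new particle.
        new_rho = []
        for xi in new_xs:
            bw = [ws[j] * math.exp(-0.5 * (xi - phi * xs[j]) ** 2 / sigma2)
                  for j in range(N)]
            norm = sum(bw)
            new_rho.append(sum(
                bw[j] / norm * ((1.0 - 1.0 / t) * rho[j] + S(xs[j], xi, y) / t)
                for j in range(N)))
        xs, ws, rho = new_xs, new_ws, new_rho
    return sum(w * r for w, r in zip(ws, rho))

ys = [0.3, -1.2, 0.5, 0.1, -0.4, 0.9, -0.2, 0.05]   # made-up observations
est_one = particle_forward_smoothing(ys, lambda x_prev, x, y: 1.0)
est_cross = particle_forward_smoothing(ys, lambda x_prev, x, y: x_prev * x)
```

With the constant functional $S \equiv 1$ the recursion must return 1 up to rounding, which gives a cheap sanity check of the backward weights; `est_cross` is a noisy estimate of the smoothed average of $X_{t-1} X_t$.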

(24)

[Figure, repeated over slides 24 to 31: three panels titled "Backward kernel from time t = 8 to time t = 7", "Filtering distributions at time t = 8" and "Genealogical history".]

(32)

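Consider the following stochastic volatility model (SVM):

$$\begin{cases} X_{t+1} = \phi X_t + \sigma U_t, \\ Y_t = \beta e^{X_t/2} V_t, \end{cases}$$

where $X_0 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1 - \phi^2}\right)$, and $U_t$ and $V_t$ are i.i.d. $\mathcal{N}(0, 1)$.

Data sampled using $\phi = 0.8$, $\sigma^2 = 0.2$ and $\beta^2 = 1$.

Runs started with $\phi = 0.1$, $\sigma^2 = 0.6$ and $\beta^2 = 2$.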
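A data set of this kind can be generated with a few lines. The function name `simulate_sv`, the seed and the sample size are assumptions for illustration; the parameter defaults are the values quoted above for the sampled data.

```python
import math
import random

def simulate_sv(T, phi=0.8, sigma2=0.2, beta2=1.0, seed=0):
    """Simulate hidden states X_0..X_{T-1} and observations Y_0..Y_{T-1}."""
    rng = random.Random(seed)
    sigma, beta = math.sqrt(sigma2), math.sqrt(beta2)
    # X_0 is drawn from the stationary law N(0, sigma^2 / (1 - phi^2)).
    x = rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))   # Y_t = beta e^{X_t/2} V_t
        x = phi * x + sigma * rng.gauss(0.0, 1.0)                   # X_{t+1} = phi X_t + sigma U_t
    return xs, ys

xs, ys = simulate_sv(50000)
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
```

Since the chain starts from its stationary law, `var_x` should be close to $\sigma^2 / (1 - \phi^2) \approx 0.556$ for long runs.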

(33)

[Figure: panels of parameter estimates plotted against the number of observations (horizontal axis up to 5 × 10^4).]

Figure: Estimation of $\phi$, $\sigma^2$ and $\beta^2$ without (left) and with (right) averaging. Each graph represents the empirical median (bold line) and the first and last quartiles (dotted lines) over 50 independent Monte Carlo runs. The averaging procedure is started after 1500 observations.

(34)

[Figure: two panels plotted against the number of blocks (horizontal axis from 50 to 200).]

Figure: Empirical variance of the estimation of $\beta^2$ with P-BOEM (top) and its averaged version (bottom), for two choices of $N_n$ relative to $\tau_n$: the dotted line (expression garbled in the source) and $N_n = \tau_n$ (bold line).

(35)

Results on online EM procedures.

Convergence of the limiting EM to the stationary points of the limiting log-likelihood.

Control of the fluctuation of the Monte Carlo approximation on each block.

The averaging procedure leads to an optimal rate of convergence in $L_p$.
