Online Parameter estimation in HMM

(1)

Sylvain Le Corff, Gersende Fort, Eric Moulines

Télécom ParisTech

April 19, 2012

(2)

1 Motivation

2 MLE in Missing Data Models

3 Online EM Algorithm for IID Models

4 The online EM algorithm in the HMM case

5 Online computations

(3)

Outline

1 Motivation

(4)

$Y \overset{\text{def}}{=} \{Y_t\}_{t \in \mathbb{Z}}$ is the observation process, whereas $X \overset{\text{def}}{=} \{X_t\}_{t \in \mathbb{Z}}$ are the hidden states.

The distribution of the HMM is specified by

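1 A family of transition kernels $\{m_\theta\}_{\theta \in \Theta}$ on $\mathsf{X} \times \mathcal{B}(\mathsf{X})$ governing the transition of the hidden chain.

2 A family of transition kernels $\{g_\theta\}_{\theta \in \Theta}$ on $\mathsf{X} \times \mathcal{B}(\mathsf{Y})$, the conditional likelihood of the observations.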

(5)

Online estimation in HMM

1 Objective: estimate the parameter $\theta$ using the maximum likelihood estimator (or a quasi-MLE if the distribution is misspecified).

2 Requirements

- No storage of the observations, no growing memory capacity.

- Constant complexity per incoming observation.

- The parameters are updated as each new observation arrives.

3 Applications

- Inference of partially observed Markov chains for very large data sets (e.g. proteomics, volatility from high-frequency data, etc.)

- Online localization and mapping in robotics.

(6)

Outline

2 MLE in Missing Data Models

(7)

MLE in Missing data: the IID case

(Curved) exponential family model.

$$p_\theta(x_t, y_t) = \exp\big(\langle s(x_t, y_t), \psi(\theta)\rangle - A(\theta)\big)$$

with respect to some σ-finite dominating measure.

Explicit complete-data MLE.

$$S \mapsto \bar\theta(S) = \arg\max_\theta \big\{\langle S, \psi(\theta)\rangle - A(\theta)\big\}$$

is available in closed form.

IID Data.

$(Y_t)$ is an i.i.d. process with marginal $\pi$ (not necessarily equal to $f_{\theta_\star}$).
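A worked instance may make the abstract notation concrete. The example below is not taken from the slides: it assumes a two-component Gaussian mixture with unit variances, hidden label $X_t \in \{1, 2\}$, and parameter $\theta = (w_1, w_2, \mu_1, \mu_2)$, and writes out $s$, $\psi$, $A$ and $\bar\theta$ for that model. The term $-y^2/2 - \tfrac12\log(2\pi)$ does not depend on $\theta$ and is absorbed into the dominating measure.

```latex
% Illustrative assumption: P(X_t = j) = w_j and Y_t | X_t = j ~ N(mu_j, 1), j = 1, 2.
\begin{align*}
\log p_\theta(x, y) &= \sum_{j=1}^{2} \mathbf{1}\{x = j\}
    \Big(\log w_j - \tfrac{\mu_j^2}{2} + \mu_j y\Big)
    - \tfrac{y^2}{2} - \tfrac12 \log(2\pi), \\
s(x, y) &= \big(\mathbf{1}\{x = j\},\ \mathbf{1}\{x = j\}\, y\big)_{j = 1, 2}, \qquad
\psi(\theta) = \big(\log w_j - \tfrac{\mu_j^2}{2},\ \mu_j\big)_{j = 1, 2}, \qquad
A(\theta) = 0, \\
\bar\theta(S):\quad w_j &= \frac{S_j^{(0)}}{S_1^{(0)} + S_2^{(0)}}, \qquad
\mu_j = \frac{S_j^{(1)}}{S_j^{(0)}},
\end{align*}
% where S = (S_j^{(0)}, S_j^{(1)})_{j = 1, 2} collects the averaged indicator and
% indicator-times-observation statistics.
```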

(8)

The (Usual) Expectation-Maximization Algorithm

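k-th EM iteration (with $T$ observations).

E-Step

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} E_{\theta_{k-1}}\big[s(X_t, Y_t) \mid Y_t\big].$$

M-Step

$$\theta_k = \bar\theta(S_{T,k}).$$

Can be fully reparameterized in the domain of sufficient statistics:

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} E_{\bar\theta(S_{T,k-1})}\big[s(X_t, Y_t) \mid Y_t\big] \overset{\text{def}}{=} \Phi_T(S_{T,k-1}).$$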

(9)

The Limiting EM Recursion

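As $T$ goes to infinity, the sequence of EM mappings $(\Phi_T)$ converges to a limit.

Sufficient Statistics Update

$$S_k = E_\pi\Big[E_{\bar\theta(S_{k-1})}\big[s(X_0, Y_0) \mid Y_0\big]\Big] = \Phi_\infty(S_{k-1}).$$

Parameter Update

$$\theta_k = \bar\theta(S_k).$$

Some results

1 The Kullback-Leibler divergence $\mathrm{KL}(\pi \,\|\, f_{\theta_k})$ between $\pi$ and the marginal distribution $f_{\theta_k}$ decreases monotonically with $k$.

2 Convergence to $\{\theta : \nabla_\theta \mathrm{KL}(\pi \,\|\, f_\theta) = 0\}$.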

(10)

Outline

3 Online EM Algorithm for IID Models

(11)

Online EM: Rationale

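Objective: find the roots of

$$E_\pi\Big[E_{\bar\theta(S)}\big[s(X_0, Y_0) \mid Y_0\big]\Big] - S = 0.$$

Stochastic Approximation (or Robbins-Monro) setup.

$E_{\bar\theta(S)}[s(X_n, Y_n) \mid Y_n]$ is seen as a noisy observation of $E_\pi\big[E_{\bar\theta(S)}[s(X_0, Y_0) \mid Y_0]\big]$.

$$S_n = S_{n-1} + \gamma_n \Big(E_{\bar\theta(S_{n-1})}\big[s(X_n, Y_n) \mid Y_n\big] - S_{n-1}\Big),$$

where $(\gamma_n)$ is a sequence of decreasing positive stepsizes.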

(12)

Online EM Algorithm

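Stochastic E-Step

$$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}\big[s(X_n, Y_n) \mid Y_n\big].$$

M-Step

$$\theta_n = \bar\theta(S_n).$$

Practical Recommendations

$\gamma_n = c / n^{\alpha}$ with $\alpha \in [0.7, 0.9]$.

Do not perform the M-step for the first 10–20 observations.

(Optional) Use Polyak-Ruppert averaging.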
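The two steps and the practical recommendations can be sketched on a toy model. The code below is a minimal illustration, not the authors' implementation: it assumes a two-component Gaussian mixture with unit variances (hidden label $X_n$, sufficient statistics $\mathbf{1}\{X_n = j\}$ and $\mathbf{1}\{X_n = j\} Y_n$), and the function name, the initial values, $c = 1$, $\alpha = 0.8$ and the burn-in of 20 observations are all choices made for this example.

```python
import math
import random


def online_em_mixture(ys, mu_init=(-1.0, 1.0), w_init=(0.5, 0.5),
                      c=1.0, alpha=0.8, burn_in=20):
    """Online EM for a two-component, unit-variance Gaussian mixture (toy model)."""
    mu, w = list(mu_init), list(w_init)
    s0 = [0.0, 0.0]  # running estimate of E[1{X = j}]
    s1 = [0.0, 0.0]  # running estimate of E[1{X = j} Y]
    for n, y in enumerate(ys, start=1):
        gamma = c / n ** alpha  # stepsize gamma_n = c / n^alpha
        # Stochastic E-step: posterior P(X_n = j | Y_n) under the current parameter.
        lik = [w[j] * math.exp(-0.5 * (y - mu[j]) ** 2) for j in range(2)]
        tot = lik[0] + lik[1]
        for j in range(2):
            r = lik[j] / tot
            s0[j] = (1 - gamma) * s0[j] + gamma * r
            s1[j] = (1 - gamma) * s1[j] + gamma * r * y
        # M-step in closed form, skipped for the first `burn_in` observations.
        if n > burn_in:
            for j in range(2):
                w[j] = s0[j] / (s0[0] + s0[1])
                mu[j] = s1[j] / max(s0[j], 1e-12)
    return w, mu


# Simulated data: weight 0.4 on mean -2, weight 0.6 on mean +2 (assumed values).
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.4 else rng.gauss(2.0, 1.0)
        for _ in range(20000)]
w_hat, mu_hat = online_em_mixture(data)
print(w_hat, mu_hat)
```

Each observation is used once and then discarded, so memory and per-observation cost stay constant, which is the requirement stated in the motivation.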

(13)

Outline

4 The online EM algorithm in the HMM case

(14)

The EM Algorithm for HMMs

1 The EM update with $T$ observations is now

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} E_{\bar\theta(S_{T,k-1})}\big[s(X_{t-1}, X_t, Y_t) \mid Y_{1:T}\big].$$

2 Dependence of the conditional expectation on the future values $Y_{1:T}$.

Problem 1: how to compute additive functionals recursively in time?

Problem 2: how to adapt the parameters within such a framework?

(15)

The limiting EM for HMMs

An iteration of the EM algorithm reads

$$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} E_{\bar\theta(S_{T,k-1})}\big[s(X_{t-1}, X_t, Y_t) \mid Y_{1:T}\big].$$

Assuming that

- $(Y_t)$ is an ergodic process with distribution $\pi$,

- the HMM satisfies some form of forgetting property,

the limiting EM recursion becomes (as $T \to \infty$)

$$S_k = E_\pi\Big[E_{\bar\theta(S_{k-1})}\big[s(X_{-1}, X_0, Y_0) \mid Y_{-\infty:\infty}\big]\Big].$$

Idea: develop a sequential algorithm that approximates the limiting EM!

(17)

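The M-step is performed on blocks of observations $Y_{T_k:T_{k+1}}$, for an appropriately chosen sequence of time instants $\{T_k, k \geq 1\}$.

The parameters are kept constant while accumulating the information brought by the observations $Y_{T_k:T_{k+1}}$.

Algorithm

1 Block $n$

- From $T_{n-1}+1$ to $T_n$, compute recursively

$$\bar S_{\tau_n}(\theta_{n-1}, Y) = \frac{1}{\tau_n} E_{\theta_{n-1}}\Big[\sum_{t=1}^{\tau_n} S(X_{t-1}, X_t, Y_{t+T_{n-1}}) \,\Big|\, Y_{T_{n-1}+1:T_{n-1}+\tau_n}\Big].$$

- Compute

$$\Sigma_n \overset{\text{def}}{=} \Big(1 - \frac{\tau_n}{T_n}\Big) \Sigma_{n-1} + \frac{\tau_n}{T_n} \bar S_{\tau_n}(\theta_{n-1}, Y).$$

2 Parameter update:

- $\theta_n \overset{\text{def}}{=} \bar\theta[\bar S_{\tau_n}(\theta_{n-1}, Y)]$.

- $\tilde\theta_n \overset{\text{def}}{=} \bar\theta[\Sigma_n]$.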
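The averaging recursion for $\Sigma_n$ can be read as a running weighted mean of the block statistics. The sketch below illustrates this under two assumptions that the slides do not state explicitly: $T_0 = 0$, so that $T_n = \tau_1 + \dots + \tau_n$, and a scalar statistic. The function name and the numbers are hypothetical.

```python
def averaged_statistics(block_stats, block_lengths):
    """Apply Sigma_n = (1 - tau_n / T_n) Sigma_{n-1} + (tau_n / T_n) S_bar_n block by block."""
    sigma, total_len = 0.0, 0
    history = []
    for s_bar, tau in zip(block_stats, block_lengths):
        total_len += tau  # T_n = T_{n-1} + tau_n, with T_0 = 0 (assumption)
        sigma = (1 - tau / total_len) * sigma + (tau / total_len) * s_bar
        history.append(sigma)
    return history


# Hypothetical block statistics and block lengths.
stats = [1.0, 2.0, 4.0]
lengths = [1, 2, 5]
sigmas = averaged_statistics(stats, lengths)

# Under the assumption above, Sigma_n is the length-weighted mean of the block statistics.
weighted_mean = sum(s * t for s, t in zip(stats, lengths)) / sum(lengths)
print(sigmas[-1], weighted_mean)
```

With growing block lengths, later blocks (computed with better parameter estimates) receive more weight in $\Sigma_n$ than earlier ones.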

(18)

Outline

5 Online computations

(19)

Online computation of additive functionals

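Consider the following additive functional:

$$\bar S_T = \frac{1}{T} E\Big[\sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) \,\Big|\, Y_{1:T}\Big].$$

By the tower property of the conditional expectation,

$$\bar S_T = E\big[\rho_T(X_T) \mid Y_{1:T}\big] = \phi_T[\rho_T],$$

where $\phi_T$ is the filtering distribution at time $T$ and

$$\rho_T(x_T) \overset{\text{def}}{=} E\Big[\frac{1}{T} \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) \,\Big|\, Y_{1:T}, X_T = x_T\Big].$$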

(20)

Online computation of additive functionals

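Decompose

$$\frac{1}{T} \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) = \Big(1 - \frac{1}{T}\Big) \frac{1}{T-1} \sum_{t=1}^{T-1} S(X_{t-1}, X_t, Y_t) + \frac{1}{T} S(X_{T-1}, X_T, Y_T).$$

Then, use that $X_{0:T}$ is a Markov chain conditionally on $Y_{1:T}$:

$$\rho_T(x_T) = \Big(1 - \frac{1}{T}\Big) B_{T|T-1}[x_T, \rho_{T-1}] + \frac{1}{T} B_{T|T-1}[x_T, S(\cdot, x_T, Y_T)].$$

Here $B_{T|T-1}$ is the backward Markov transition kernel

$$B_{T|T-1}(x_T, dx_{T-1}) \overset{\text{def}}{=} \frac{\phi_{T-1}(dx_{T-1})\, m(x_{T-1}, x_T)}{\int \phi_{T-1}(dx_{T-1})\, m(x_{T-1}, x_T)},$$

where $\phi_{T-1}$ is the filtering distribution at time $T-1$.

The computations can be carried out forward in time!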
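For a finite state space the filter is explicit, so the forward recursion on $\rho_T$ can be checked against a brute-force sum over all hidden paths. The sketch below does this for a small two-state chain. Everything specific in it is an assumption made for illustration: the function names, the initial law `nu` for $X_0$ (with no observation at time 0), the transition matrix, the Gaussian emission likelihood, and the choice of $S$ as the indicator of a $0 \to 1$ transition.

```python
import itertools
import math


def smoothed_average(ys, nu, m, g, stat):
    """Forward-only value of (1/T) E[sum_t stat(X_{t-1}, X_t, Y_t) | Y_{1:T}]."""
    K = len(nu)
    phi = list(nu)   # filter at time 0
    rho = [0.0] * K  # rho_0 = 0
    for t, y in enumerate(ys, start=1):
        # Unnormalised backward weights: back[j][i] = phi_{t-1}(i) m(i, j).
        back = [[phi[i] * m[i][j] for i in range(K)] for j in range(K)]
        new_rho, new_phi = [], []
        for j in range(K):
            norm = sum(back[j])
            new_rho.append(sum(
                back[j][i] / norm * ((1 - 1 / t) * rho[i] + stat(i, j, y) / t)
                for i in range(K)))
            new_phi.append(norm * g(j, y))  # prediction times likelihood
        total = sum(new_phi)
        phi = [p / total for p in new_phi]
        rho = new_rho
    return sum(phi[j] * rho[j] for j in range(K))


def brute_force(ys, nu, m, g, stat):
    """Same quantity by enumerating every path x_0, ..., x_T (small T only)."""
    K, T = len(nu), len(ys)
    num = den = 0.0
    for path in itertools.product(range(K), repeat=T + 1):
        weight, total = nu[path[0]], 0.0
        for t in range(1, T + 1):
            weight *= m[path[t - 1]][path[t]] * g(path[t], ys[t - 1])
            total += stat(path[t - 1], path[t], ys[t - 1])
        num += weight * total / T
        den += weight
    return num / den


nu = [0.5, 0.5]
m = [[0.9, 0.1], [0.2, 0.8]]
means = [-1.0, 1.0]
g = lambda j, y: math.exp(-0.5 * (y - means[j]) ** 2)
stat = lambda i, j, y: 1.0 if (i, j) == (0, 1) else 0.0
ys = [-0.8, -1.3, 0.4, 1.1, 0.9]

online = smoothed_average(ys, nu, m, g, stat)
exact = brute_force(ys, nu, m, g, stat)
print(online, exact)
```

The forward version stores only the current filter and one value of $\rho$ per state, whereas the enumeration grows exponentially with $T$.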

(21)

Online computation of additive functionals

This sequential computation can be carried out exactly only when an explicit expression for the filter is available:

1 Linear Gaussian models.

2 HMMs with finite state spaces.

In the online framework, sequential Monte Carlo methods (a.k.a. particle filters) are appealing:

1 these methods are easy to implement and to tweak (as long as the dimension of the hidden space is not too large).

2 these methods are amenable to parallel computations.

(22)

Particle approximation of the additive functional

$\phi_t$ is approximated by weighted samples $\{(\xi_t^i, \omega_t^i)\}_{i=1}^{N}$:

$$\phi_t^N[h] = \sum_{i=1}^{N} \omega_t^i h(\xi_t^i).$$

The backward kernel can be approximated at the current particle locations:

$$B_{t|t-1}^N(\xi_t^i, dx_{t-1}) \overset{\text{def}}{=} \frac{\phi_{t-1}^N(dx_{t-1})\, m(x_{t-1}, \xi_t^i)}{\int \phi_{t-1}^N(dx_{t-1})\, m(x_{t-1}, \xi_t^i)}.$$

The functions $\rho_t$ can then be computed at all particle locations (the computational cost grows like $N^2$; algorithms with linear complexity may be derived, but they do not proceed entirely forward in time):

$$\rho_t^N(\xi_t^i) = B_{t|t-1}^N\Big[\xi_t^i, \Big(1 - \frac{1}{t}\Big) \rho_{t-1}^N(\cdot) + \frac{1}{t} S(\cdot, \xi_t^i, Y_t)\Big].$$

(23)

Particle filtering in action

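1 Computation of $\phi_t^N$ with $Y_t$ and $\phi_{t-1}^N$.

2 Computation of $\{\rho_t^N(\xi_t^i)\}_{i=1}^{N}$ with $Y_t$ and $\phi_t^N$.

For each particle $\xi_t^i$, weights $\{\tilde\omega_{t-1}^{i,j} = \omega_{t-1}^j m(\xi_{t-1}^j, \xi_t^i)\}_{j=1}^{N}$ are computed to match the target kernel:

$$B_{t|t-1}^N(\xi_t^i, dx_{t-1}) = \frac{\sum_{j=1}^{N} \omega_{t-1}^j m(\xi_{t-1}^j, \xi_t^i)\, \delta_{\xi_{t-1}^j}(dx_{t-1})}{\sum_{j=1}^{N} \omega_{t-1}^j m(\xi_{t-1}^j, \xi_t^i)} = \sum_{j=1}^{N} \frac{\tilde\omega_{t-1}^{i,j}}{\sum_{k=1}^{N} \tilde\omega_{t-1}^{i,k}}\, \delta_{\xi_{t-1}^j}(dx_{t-1}).$$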
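The two steps can be sketched with a bootstrap particle filter and the $O(N^2)$ update of $\rho_t^N$. This is an illustration under assumptions that do not come from the slides: a scalar model $X_t = a X_{t-1} + \sigma U_t$, $Y_t = X_t + V_t$ with standard Gaussian noises, multinomial resampling at every step, and hypothetical names and default values throughout.

```python
import math
import random


def particle_smoothed_average(ys, stat, n_particles=50, a=0.8, sigma=0.5, seed=0):
    """Particle estimate of (1/T) E[sum_t stat(X_{t-1}, X_t, Y_t) | Y_{1:T}]."""
    rng = random.Random(seed)
    N = n_particles
    # Transition density m and likelihood g, both up to constants (assumed model).
    m = lambda x_prev, x: math.exp(-0.5 * ((x - a * x_prev) / sigma) ** 2)
    g = lambda x, y: math.exp(-0.5 * (y - x) ** 2)
    # Time 0: particles from the stationary law, uniform weights, rho_0 = 0.
    xi = [rng.gauss(0.0, sigma / math.sqrt(1 - a * a)) for _ in range(N)]
    w = [1.0 / N] * N
    rho = [0.0] * N
    for t, y in enumerate(ys, start=1):
        # Step 1: filter update (resample, propagate through m, weight by g).
        ancestors = rng.choices(range(N), weights=w, k=N)
        new_xi = [a * xi[j] + sigma * rng.gauss(0.0, 1.0) for j in ancestors]
        new_w = [g(x, y) for x in new_xi]
        total = sum(new_w)
        new_w = [v / total for v in new_w]
        # Step 2: update rho at every new particle with the backward weights.
        new_rho = []
        for x in new_xi:
            bw = [w[j] * m(xi[j], x) for j in range(N)]  # omega_{t-1}^j m(xi_{t-1}^j, x)
            norm = sum(bw)
            new_rho.append(sum(
                bw[j] / norm * ((1 - 1 / t) * rho[j] + stat(xi[j], x, y) / t)
                for j in range(N)))
        xi, w, rho = new_xi, new_w, new_rho
    return sum(wi * r for wi, r in zip(w, rho))


ys = [0.3, -0.1, 0.8, 1.2, 0.5, -0.4]
# Smoothed average of the hidden state (a Monte Carlo estimate, so it varies with the seed).
est = particle_smoothed_average(ys, lambda x_prev, x, y: x)
# Sanity check: a statistic depending on y only must return the average of the observations.
check = particle_smoothed_average(ys, lambda x_prev, x, y: y)
print(est, check, sum(ys) / len(ys))
```

The inner loop over `j` for each new particle is where the quadratic cost mentioned above comes from.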

(24)

[Figure, three panels: backward kernel from time t = 8 to time t = 7; filtering distributions at time t = 8; genealogical history.]

(32)

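Consider the following stochastic volatility model (SVM):

$$X_{t+1} = \phi X_t + \sigma U_t, \qquad Y_t = \beta e^{X_t/2} V_t,$$

where $X_0 \sim \mathcal{N}\big(0, \sigma^2 / (1 - \phi^2)\big)$, and $U_t$ and $V_t$ are i.i.d. $\mathcal{N}(0, 1)$.

Data sampled using $\phi = 0.8$, $\sigma^2 = 0.2$ and $\beta^2 = 1$.

Runs started with $\phi = 0.1$, $\sigma^2 = 0.6$ and $\beta^2 = 2$.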
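Data from this model can be simulated in a few lines. The sketch below uses the parameter values quoted for the data ($\phi = 0.8$, $\sigma^2 = 0.2$, $\beta^2 = 1$); the function name, the seed and the sample size are choices made for this illustration, and the code only generates data, it does not run the estimation procedure.

```python
import math
import random


def simulate_sv(n, phi=0.8, sigma2=0.2, beta2=1.0, seed=0):
    """Simulate n steps of X_{t+1} = phi X_t + sigma U_t, Y_t = beta exp(X_t / 2) V_t."""
    rng = random.Random(seed)
    sigma, beta = math.sqrt(sigma2), math.sqrt(beta2)
    # X_0 is drawn from the stationary law N(0, sigma^2 / (1 - phi^2)).
    x = rng.gauss(0.0, math.sqrt(sigma2 / (1 - phi ** 2)))
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(beta * math.exp(x / 2) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys


xs, ys = simulate_sv(50000)
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
# The empirical variance of the hidden chain should be near sigma^2 / (1 - phi^2).
print(var_x, 0.2 / (1 - 0.8 ** 2))
```

Only `ys` would be available to an estimation procedure; `xs` is returned here so that the stationary variance can be checked.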

(33)

Figure: Estimation of $\phi$, $\sigma^2$ and $\beta^2$ without (left) and with (right) averaging, plotted against the number of observations (up to 5 × 10⁴). Each graph represents the empirical median (bold line) and first and last quartiles (dotted line) over 50 independent Monte Carlo runs. The averaging procedure is started after 1500 observations.

(34)

Figure: Empirical variance of the estimation of $\beta^2$ with P-BOEM (top) and its averaged version (bottom), plotted against the number of blocks, when $N_n = \sqrt{\tau_n}$ (dotted line) and when $N_n = \tau_n$ (bold line).

(35)

Results on online EM procedures.

Convergence of the limiting EM to the stationary points of the limiting log-likelihood.

Control of the fluctuation of the Monte Carlo approximation on each block.

The averaging procedure leads to an optimal rate of convergence in $L_p$.