Online Parameter estimation in HMM
Sylvain Le Corff, Gersende Fort, Eric Moulines
Télécom ParisTech
April 19, 2012
1 Motivation
2 MLE in Missing Data Models
3 Online EM Algorithm for IID Models
4 The online EM algorithm in the HMM case
5 Online computations
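$Y \stackrel{\mathrm{def}}{=} \{Y_t\}_{t \in \mathbb{Z}}$ is the observation process, whereas $X \stackrel{\mathrm{def}}{=} \{X_t\}_{t \in \mathbb{Z}}$ is the process of hidden states.
The distribution of the HMM is specified by
1 A family of transition kernels $\{m_\theta\}_{\theta \in \Theta}$ on $\mathsf{X} \times \mathcal{B}(\mathsf{X})$ governing the transitions of the hidden chain.
2 A family of transition kernels $\{g_\theta\}_{\theta \in \Theta}$ on $\mathsf{X} \times \mathcal{B}(\mathsf{Y})$, the conditional likelihood of the observations.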
Online estimation in HMM
1 Objective: estimate the parameter $\theta$ using the maximum likelihood estimator (or a quasi-MLE if the model is misspecified).
2 Requirements
  - No storage of the observations, no growing memory requirement.
  - Constant complexity per incoming observation.
  - The parameters are updated as each new observation arrives.
3 Applications
  - Inference of partially observed Markov chains for very large data sets (e.g. proteomics, volatility from high-frequency data, etc.).
  - Online localization and mapping in robotics.
MLE in Missing data: the IID case
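(Curved) exponential family model.
$p_\theta(x_t, y_t) = \exp\left( \langle s(x_t, y_t), \psi(\theta) \rangle - A(\theta) \right)$ with respect to some $\sigma$-finite dominating measure.
Explicit complete-data MLE.
$S \mapsto \bar\theta(S) = \arg\max_\theta \left\{ \langle S, \psi(\theta) \rangle - A(\theta) \right\}$ is available in closed form.
IID data.
$(Y_t)$ is an i.i.d. process with marginal distribution $\pi$ (not necessarily equal to $f_{\theta_\star}$).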
The (Usual) Expectation-Maximization Algorithm
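k-th EM iteration (with $T$ observations).
E-step
$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\theta_{k-1}}\left[ s(X_t, Y_t) \mid Y_t \right]$.
M-step
$\theta_k = \bar\theta(S_{T,k})$.
The iteration can be fully reparameterized in the domain of sufficient statistics:
$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar\theta(S_{T,k-1})}\left[ s(X_t, Y_t) \mid Y_t \right] \stackrel{\mathrm{def}}{=} \Phi_T(S_{T,k-1})$.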
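As an illustration of the EM map $\Phi_T$ acting on sufficient statistics, here is a minimal sketch for a model that is not taken from the slides: a two-component Gaussian mixture with unit variances, where $X_t$ is the component label and $s(x, y) = (1\{x = j\}, 1\{x = j\}\, y)_{j=0,1}$. The function names, the data and the initial values are all assumptions chosen for the example.

```python
import numpy as np

def e_step_stats(y, w, mu):
    """E_theta[s(X_t, Y_t) | Y_t] for every observation (unit-variance mixture)."""
    logp = np.log(w) - 0.5 * (y[:, None] - mu) ** 2   # shape (T, 2), constants cancel
    logp -= logp.max(axis=1, keepdims=True)            # numerical stabilisation
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)                  # posterior label probabilities
    return r, r * y[:, None]

def theta_bar(s0, s1):
    """Closed-form complete-data MLE: weights and means from the statistics."""
    return s0 / s0.sum(), s1 / s0

def em_map(y, s0, s1):
    """One EM iteration written as S -> Phi_T(S)."""
    w, mu = theta_bar(s0, s1)
    r, ry = e_step_stats(y, w, mu)
    return r.mean(axis=0), ry.mean(axis=0)

# Synthetic data: weight 0.3 on mean +2, weight 0.7 on mean -2 (illustrative values).
rng = np.random.default_rng(0)
T = 5000
labels = rng.random(T) < 0.3
y = np.where(labels, 2.0, -2.0) + rng.standard_normal(T)

# Initial statistics corresponding to weights (0.5, 0.5) and means (-1, 1).
s0, s1 = np.array([0.5, 0.5]), np.array([-0.5, 0.5])
for k in range(50):
    s0, s1 = em_map(y, s0, s1)
w, mu = theta_bar(s0, s1)
print("weights:", w, "means:", mu)
```

The parameter only appears through `theta_bar`, which is what makes it possible to state the whole recursion in terms of $S$.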
The Limiting EM Recursion
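As $T$ goes to infinity, the sequence of EM mappings $(\Phi_T)$ converges to a limit $\Phi_\infty$.
Sufficient statistics update
$S_k = \mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S_{k-1})}\left[ s(X_0, Y_0) \mid Y_0 \right] \right] = \Phi_\infty(S_{k-1})$.
Parameter update
$\theta_k = \bar\theta(S_k)$.
Some results
1 The Kullback-Leibler divergence $\mathrm{KL}(\pi \,\|\, f_{\theta_k})$ between $\pi$ and the marginal distribution $f_{\theta_k}$ is non-increasing in $k$.
2 The sequence $(\theta_k)$ converges to $\{\theta : \nabla_\theta \mathrm{KL}(\pi \,\|\, f_\theta) = 0\}$.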
Online EM: Rationale
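Objective: find the roots of
$\mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S)}\left[ s(X_0, Y_0) \mid Y_0 \right] \right] - S = 0$.
Stochastic approximation (or Robbins-Monro) setup.
$\mathbb{E}_{\bar\theta(S)}\left[ s(X_n, Y_n) \mid Y_n \right]$ is seen as a noisy observation of $\mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S)}\left[ s(X_0, Y_0) \mid Y_0 \right] \right]$.
$S_n = S_{n-1} + \gamma_n \left( \mathbb{E}_{\bar\theta(S_{n-1})}\left[ s(X_n, Y_n) \mid Y_n \right] - S_{n-1} \right)$,
where $(\gamma_n)$ is a sequence of decreasing positive step sizes.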
Online EM Algorithm
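Stochastic E-step
$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n \mathbb{E}_{\theta_{n-1}}\left[ s(X_n, Y_n) \mid Y_n \right]$.
M-step
$\theta_n = \bar\theta(S_n)$.
Practical recommendations
$\gamma_n = c / n^{\alpha}$ with $\alpha \in [0.7, 0.9]$.
Do not perform the M-step for the first 10–20 observations.
(Optional) Use Polyak-Ruppert averaging.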
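The stochastic E-step and M-step can be sketched on a toy model. The example below is an assumption for illustration only (a two-component Gaussian mixture with unit variances, not a model from the slides); the function names, the step-size constants `c` and `alpha`, and the `burn_in` length are hypothetical choices that follow the recommendations listed here.

```python
import numpy as np

def cond_stats(y, w, mu):
    """E_theta[s(X_n, Y_n) | Y_n = y] for a unit-variance two-component mixture."""
    logp = np.log(w) - 0.5 * (y - mu) ** 2
    r = np.exp(logp - logp.max())
    r /= r.sum()
    return r, r * y

def online_em(ys, w, mu, c=1.0, alpha=0.8, burn_in=20):
    """Online EM: one stochastic E-step and (after burn-in) one M-step per observation."""
    s0, s1 = w.copy(), w * mu                      # statistics matching the initial parameter
    for n, y in enumerate(ys, start=1):
        gamma = min(1.0, c * n ** (-alpha))        # step size gamma_n = c / n^alpha
        r, ry = cond_stats(y, w, mu)
        s0 = (1 - gamma) * s0 + gamma * r          # stochastic E-step
        s1 = (1 - gamma) * s1 + gamma * ry
        if n > burn_in:                            # skip the M-step on the first observations
            w, mu = s0 / s0.sum(), s1 / s0         # M-step: theta_n = theta_bar(S_n)
    return w, mu

# Synthetic stream: weight 0.3 on mean +2, weight 0.7 on mean -2 (illustrative values).
rng = np.random.default_rng(1)
n_obs = 20000
labels = rng.random(n_obs) < 0.3
ys = np.where(labels, 2.0, -2.0) + rng.standard_normal(n_obs)
w_hat, mu_hat = online_em(ys, np.array([0.5, 0.5]), np.array([-1.0, 1.0]))
print("weights:", w_hat, "means:", mu_hat)
```

Each observation is processed once and then discarded, so memory and per-observation cost stay constant.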
The EM Algorithm for HMMs
1 The EM update with $T$ observations is now
$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar\theta(S_{T,k-1})}\left[ s(X_{t-1}, X_t, Y_t) \mid Y_{1:T} \right]$.
2 The conditional expectation depends on future values through $Y_{1:T}$.
Problem 1: how to compute additive functionals recursively in time?
Problem 2: how to adapt the parameters within such a framework?
The limiting EM for HMMs
An iteration of the EM algorithm reads
$S_{T,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar\theta(S_{T,k-1})}\left[ s(X_{t-1}, X_t, Y_t) \mid Y_{1:T} \right]$.
Assuming that
  - $(Y_t)$ is an ergodic process with distribution $\pi$,
  - the HMM satisfies some form of forgetting property,
the limiting EM recursion becomes (as $T \to \infty$)
$S_k = \mathbb{E}_\pi\left[ \mathbb{E}_{\bar\theta(S_{k-1})}\left[ s(X_{-1}, X_0, Y_0) \mid Y_{-\infty:\infty} \right] \right]$.
Idea: develop a sequential algorithm that approximates the limiting EM recursion.
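The M-step is performed on blocks of observations $Y_{T_k:T_{k+1}}$, for an appropriately chosen sequence of time instants $\{T_k, k \geq 1\}$.
The parameters are kept constant while accumulating the information brought by the observations $Y_{T_k:T_{k+1}}$.
Algorithm
1 Block $n$
  - From $T_{n-1}+1$ to $T_n$, compute recursively
    $\bar S_{\tau_n}(\theta_{n-1}, Y) = \frac{1}{\tau_n} \mathbb{E}_{\theta_{n-1}}\left[ \sum_{t=1}^{\tau_n} S(X_{t-1}, X_t, Y_{t+T_{n-1}}) \,\middle|\, Y_{T_{n-1}+1:T_{n-1}+\tau_n} \right]$.
  - Compute
    $\Sigma_n \stackrel{\mathrm{def}}{=} \left( 1 - \frac{\tau_n}{T_n} \right) \Sigma_{n-1} + \frac{\tau_n}{T_n} \bar S_{\tau_n}(\theta_{n-1}, Y)$.
2 Parameter update:
  - $\theta_n \stackrel{\mathrm{def}}{=} \bar\theta\left[ \bar S_{\tau_n}(\theta_{n-1}, Y) \right]$.
  - $\tilde\theta_n \stackrel{\mathrm{def}}{=} \bar\theta\left[ \Sigma_n \right]$.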
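The averaging recursion for $\Sigma_n$ can be checked numerically. The sketch below assumes $T_0 = 0$ and $T_n = T_{n-1} + \tau_n$, in which case $\Sigma_n$ is the average of the block statistics weighted by block length. The block schedule `block_sizes` is a hypothetical choice, and the block statistics are replaced by plain block means of synthetic numbers rather than HMM conditional expectations.

```python
import numpy as np

def block_sizes(n_blocks, c=10.0, a=1.2):
    """Hypothetical slowly increasing schedule tau_n = floor(c * n**a)."""
    return [int(c * n ** a) for n in range(1, n_blocks + 1)]

def averaged_statistic(block_stats, taus):
    """Sigma_n = (1 - tau_n / T_n) * Sigma_{n-1} + (tau_n / T_n) * S_bar_n, with T_0 = 0."""
    sigma, T = 0.0, 0
    for s_bar, tau in zip(block_stats, taus):
        T += tau                                   # T_n = T_{n-1} + tau_n
        sigma = (1 - tau / T) * sigma + (tau / T) * s_bar
    return sigma

# Stand-in for the block statistics: block means of an arbitrary data stream.
rng = np.random.default_rng(0)
taus = block_sizes(8)
data = rng.standard_normal(sum(taus))
edges = np.cumsum([0] + taus)
block_means = [data[a:b].mean() for a, b in zip(edges[:-1], edges[1:])]

sigma = averaged_statistic(block_means, taus)
print(sigma, data.mean())   # the two values coincide up to rounding
```

Because the first block has weight $\tau_1 / T_1 = 1$ under this assumption, the initial value of $\Sigma_0$ plays no role.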
Online computation of additive functionals
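Consider the following additive functional:
$\bar S_T = \frac{1}{T} \mathbb{E}\left[ \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) \,\middle|\, Y_{1:T} \right]$.
By the tower property of conditional expectation,
$\bar S_T = \mathbb{E}\left[ \rho_T(X_T) \mid Y_{1:T} \right] = \phi_T[\rho_T]$,
where $\phi_T$ is the filtering distribution at time $T$ and
$\rho_T(x_T) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) \,\middle|\, Y_{1:T}, X_T = x_T \right]$.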
Online computation of additive functionals
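Decompose
$\frac{1}{T} \sum_{t=1}^{T} S(X_{t-1}, X_t, Y_t) = \left( 1 - \frac{1}{T} \right) \frac{1}{T-1} \sum_{t=1}^{T-1} S(X_{t-1}, X_t, Y_t) + \frac{1}{T} S(X_{T-1}, X_T, Y_T)$.
Then, use that $X_{0:T} \mid Y_{0:T}$ is a Markov chain:
$\rho_T(x_T) = \left( 1 - \frac{1}{T} \right) B_{T|T-1}[x_T, \rho_{T-1}] + \frac{1}{T} B_{T|T-1}[x_T, S(\cdot, x_T, Y_T)]$.
Here $B_{T|T-1}$ is the backward Markov transition kernel
$B_{T|T-1}(x_T, dx_{T-1}) \stackrel{\mathrm{def}}{=} \frac{\phi_{T-1}(dx_{T-1})\, m(x_{T-1}, x_T)}{\int \phi_{T-1}(dx_{T-1})\, m(x_{T-1}, x_T)}$,
where $\phi_{T-1}$ is the filtering distribution at time $T-1$.
The computations can be carried out forward in time!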
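For a finite state space the recursion on $\rho_T$ is exact and can be coded in a few lines. The sketch below is an illustration under stated assumptions: a scalar statistic, a strictly positive transition matrix `m`, observation likelihoods supplied as an array `g`, and an initial distribution `nu` for $X_0$; all names are hypothetical. A brute-force enumeration over all hidden paths is included only to check the forward-only computation on a tiny example.

```python
import itertools
import numpy as np

def forward_smoothed_mean(nu, m, g, s):
    """Forward-only computation of (1/T) E[sum_t S(X_{t-1}, X_t, Y_t) | Y_{1:T}].

    nu: (K,) law of X_0; m: (K, K) with m[i, j] = P(X_t = j | X_{t-1} = i);
    g: (T, K) with g[t-1, j] = likelihood of Y_t in state j;
    s: (T, K, K) with s[t-1, i, j] = S(i, j, Y_t).
    """
    phi = nu.copy()                                # filter phi_0
    rho = np.zeros_like(nu)                        # rho_0 = 0
    for t in range(1, g.shape[0] + 1):
        joint = phi[:, None] * m                   # joint[i, j] = phi_{t-1}(i) m(i, j)
        back = joint / joint.sum(axis=0)           # back[i, j] = B_{t|t-1}(j, {i})
        rho = (back * ((1 - 1 / t) * rho[:, None] + s[t - 1] / t)).sum(axis=0)
        phi = joint.sum(axis=0) * g[t - 1]         # filter update with Y_t
        phi /= phi.sum()
    return phi @ rho                               # phi_T[rho_T]

def brute_force_mean(nu, m, g, s):
    """Same quantity by summing over all K**(T+1) hidden paths (tiny T only)."""
    T, K = g.shape
    num = den = 0.0
    for path in itertools.product(range(K), repeat=T + 1):
        p, total = nu[path[0]], 0.0
        for t in range(1, T + 1):
            p *= m[path[t - 1], path[t]] * g[t - 1, path[t]]
            total += s[t - 1, path[t - 1], path[t]]
        num += p * total / T
        den += p
    return num / den

rng = np.random.default_rng(0)
K, T = 3, 5
nu = np.full(K, 1 / K)
m = rng.random((K, K))
m /= m.sum(axis=1, keepdims=True)
g = rng.random((T, K)) + 0.1                       # arbitrary positive likelihood values
s = rng.standard_normal((T, K, K))                 # arbitrary statistic values
online_value = forward_smoothed_mean(nu, m, g, s)
exact_value = brute_force_mean(nu, m, g, s)
print(online_value, exact_value)
```

Only `phi` and `rho` are carried from one time step to the next, so no observation needs to be stored.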
Online computation of additive functionals
This sequential computation can be carried out exactly only when an explicit expression for the filter is available:
1 Linear Gaussian models.
2 HMMs with finite state spaces.
In the online framework, sequential Monte Carlo methods (a.k.a. particle filters) are appealing:
1 these methods are easy to implement and to tune (as long as the dimension of the hidden state space is not too large);
2 these methods are amenable to parallel computation.
Particle approximation of the additive functional
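$\phi_t$ is approximated by weighted samples $\{(\xi_t^i, \omega_t^i)\}_{i=1}^{N}$:
$\phi_t^N[h] = \sum_{i=1}^{N} \omega_t^i h(\xi_t^i)$.
The backward kernel can be approximated at the current particle locations:
$B^N_{t|t-1}(\xi_t^i, dx_{t-1}) \stackrel{\mathrm{def}}{=} \frac{\phi^N_{t-1}(dx_{t-1})\, m(x_{t-1}, \xi_t^i)}{\int \phi^N_{t-1}(dx_{t-1})\, m(x_{t-1}, \xi_t^i)}$.
The functions $\rho_t$ can then be computed at all particle locations:
$\rho_t^N(\xi_t^i) = B^N_{t|t-1}\left[ \xi_t^i, \left( 1 - \frac{1}{t} \right) \rho^N_{t-1}(\cdot) + \frac{1}{t} S(\cdot, \xi_t^i, Y_t) \right]$.
The computational cost grows like $N^2$; algorithms with linear complexity may be derived, but they do not proceed entirely forward in time.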
Particle filtering in action
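1 Computation of $\phi_t^N$ with $Y_t$ and $\phi_{t-1}^N$.
2 Computation of $\{\rho_t^N(\xi_t^i)\}_{i=1}^{N}$ with $Y_t$ and $\phi_t^N$.
For each particle $\xi_t^i$, weights $\{\tilde\omega_{t-1}^{i,j} = \omega_{t-1}^j m(\xi_{t-1}^j, \xi_t^i)\}_{j=1}^{N}$ are computed to match the target kernel.
$B^N_{t|t-1}(\xi_t^i, dx_{t-1}) = \frac{\sum_{j=1}^{N} \omega_{t-1}^j m(\xi_{t-1}^j, \xi_t^i)\, \delta_{\xi_{t-1}^j}(dx_{t-1})}{\sum_{j=1}^{N} \omega_{t-1}^j m(\xi_{t-1}^j, \xi_t^i)} = \sum_{j=1}^{N} \frac{\tilde\omega_{t-1}^{i,j}}{\sum_{k=1}^{N} \tilde\omega_{t-1}^{i,k}}\, \delta_{\xi_{t-1}^j}(dx_{t-1})$.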
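The $O(N^2)$ particle update of $\rho_t^N$ can be sketched as a single vectorised function. This is an illustration only: the function name `update_rho`, its argument layout, the Gaussian autoregressive transition used in the example and the scalar statistic are assumptions, the weights are assumed strictly positive, and the particle filter that produces the particles and weights is not shown.

```python
import numpy as np

def update_rho(xi_prev, w_prev, rho_prev, xi_new, y, t, log_m, stat):
    """One step rho_{t-1}^N -> rho_t^N, evaluated at every new particle.

    log_m(x_prev, x_new): log transition density (up to a constant in x_prev);
    stat(x_prev, x_new, y): statistic S; both must broadcast over NumPy arrays.
    """
    # log_w[i, j] = log( w_prev[j] * m(xi_prev[j], xi_new[i]) )
    log_w = np.log(w_prev)[None, :] + log_m(xi_prev[None, :], xi_new[:, None])
    log_w -= log_w.max(axis=1, keepdims=True)      # stabilise before exponentiating
    back = np.exp(log_w)
    back /= back.sum(axis=1, keepdims=True)        # row i: backward kernel from xi_new[i]
    integrand = (1 - 1 / t) * rho_prev[None, :] + stat(xi_prev[None, :], xi_new[:, None], y) / t
    return (back * integrand).sum(axis=1)          # N sums of N terms each

# Illustrative particles for an AR(1) transition X_t = 0.8 X_{t-1} + 0.4 U_t.
rng = np.random.default_rng(0)
N = 50
xi_prev = rng.standard_normal(N)
w_prev = np.full(N, 1 / N)
xi_new = 0.8 * xi_prev + 0.4 * rng.standard_normal(N)

def log_m(x_prev, x_new):
    return -0.5 * ((x_new - 0.8 * x_prev) / 0.4) ** 2

# Sanity checks: a constant statistic stays constant, and at t = 1 a statistic
# depending only on the new state is returned unchanged.
rho_const = update_rho(xi_prev, w_prev, np.ones(N), xi_new, 0.0, 5, log_m, lambda a, b, y: 1.0)
rho_first = update_rho(xi_prev, w_prev, np.zeros(N), xi_new, 0.0, 1, log_m, lambda a, b, y: b)
print(rho_const[:3], rho_first[:3], xi_new[:3])
```

The $N \times N$ array `back` makes the quadratic cost explicit: every new particle is paired with every previous particle.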
[Figure: particle filtering illustration with three panels: "Backward kernel from time t = 8 to time t = 7", "Filtering distributions at time t = 8", and "Genealogical history".]
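Consider the following stochastic volatility model (SVM):
$X_{t+1} = \phi X_t + \sigma U_t$, $\quad Y_t = \beta e^{X_t/2} V_t$,
where $X_0 \sim \mathcal{N}\left( 0, \frac{\sigma^2}{1 - \phi^2} \right)$, and $U_t$ and $V_t$ are i.i.d. $\mathcal{N}(0,1)$.
Data sampled using $\phi = 0.8$, $\sigma^2 = 0.2$ and $\beta^2 = 1$.
Runs started with $\phi = 0.1$, $\sigma^2 = 0.6$ and $\beta^2 = 2$.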
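To reproduce the kind of data used in this experiment, the model can be simulated directly. The sketch below reads the model as $X_{t+1} = \phi X_t + \sigma U_t$, $Y_t = \beta e^{X_t/2} V_t$ with a stationary initial state, which is an interpretation of the slide; the function name `simulate_sv`, the seed and the sample size are assumptions.

```python
import numpy as np

def simulate_sv(T, phi=0.8, sigma2=0.2, beta2=1.0, seed=0):
    """Simulate T steps of the stochastic volatility model, started at stationarity."""
    rng = np.random.default_rng(seed)
    sigma, beta = np.sqrt(sigma2), np.sqrt(beta2)
    x = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(sigma2 / (1 - phi ** 2)))   # X_0 ~ N(0, sigma^2 / (1 - phi^2))
    for t in range(T - 1):
        x[t + 1] = phi * x[t] + sigma * rng.standard_normal()  # hidden log-volatility
    y = beta * np.exp(x / 2) * rng.standard_normal(T)          # observations
    return x, y

x, y = simulate_sv(50_000)
print("empirical var of X:", x.var(), "stationary var:", 0.2 / (1 - 0.8 ** 2))
```

The default arguments match the values used to sample the data; the initial values of the runs would be passed to the estimation procedure, not to the simulator.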
[Figure panels: parameter estimates plotted against the number of observations, up to $5 \times 10^4$.]
Figure: Estimation of $\phi$, $\sigma^2$ and $\beta^2$ without (left) and with (right) averaging. Each graph represents the empirical median (bold line) and first and last quartiles (dotted line) over 50 independent Monte Carlo runs. The averaging procedure is started after 1500 observations.
[Figure panels: empirical variance plotted against the number of blocks, up to 200.]
Figure: Empirical variance of the estimation of $\beta^2$ with P-BOEM (top) and its averaged version (bottom) when $N_n = \sqrt{\tau_n}$ (dotted line) and when $N_n = \tau_n$ (bold line).
Results on online EM procedures.
Convergence of the limiting EM to the stationary points of the limiting log-likelihood.
Control of the fluctuation of the Monte Carlo approximation on each block.
The averaging procedure leads to an optimal rate of convergence in $L_p$.