Data analysis and stochastic modeling
Lecture 5 – Mixture models and the EM algorithm
Guillaume Gravier
guillaume.gravier@irisa.fr
What are we going to talk about?
1. A gentle introduction to probability
2. Data analysis
3. Cluster analysis
4. Estimation theory and practice
5. Mixture models
◦ notion of mixture models
◦ Gaussian mixture models
◦ parameter estimation with the EM algorithm
◦ EM and cluster analysis
6. Random processes
7. Hypothesis testing
8. Queueing systems
MIXTURE MODEL DEFINITION
Why mixture models?
[Figure: a true distribution compared with a single Gaussian model and a mixture model — courtesy of J-F. Bonastre]
Mixture model – definition
A mixture model is a weighted sum of laws, the likelihood of a sample x being given by

f(x) = \sum_{i=1}^{K} \pi_i f_i(x)

with the constraint that \sum_{i=1}^{K} \pi_i = 1.
◦ f_i(·) can be any density and the f_i(·)'s need not be from the same family
◦ f(·) is a density since \int_{-\infty}^{\infty} f(x)\,dx = 1
◦ the model can extend to discrete variables
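As an illustration (not from the original slides), here is a minimal sketch of how such a mixture density could be evaluated numerically, assuming two Gaussian components; the weights, means and standard deviations are arbitrary example values.

```python
import numpy as np
from scipy.stats import norm

# Example mixture: K = 2 Gaussian components (illustrative parameters)
weights = np.array([0.3, 0.7])   # pi_i, must sum to 1
means   = np.array([0.0, 2.0])   # mu_i
stds    = np.array([1.0, 1.0])   # sigma_i

def mixture_density(x):
    """f(x) = sum_i pi_i f_i(x), each f_i a univariate Gaussian density."""
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

# Numerical check that the density integrates to ~1 on a wide grid
grid = np.linspace(-10, 12, 10001)
dx = grid[1] - grid[0]
print(np.sum(mixture_density(grid)) * dx)   # approximately 1.0
```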
Examples of mixture model densities
[Plots of two-component Gaussian mixture densities for different weightings: w1=0.7, w2=0.3; w1=0.3, w2=0.7; and w1=0.9, w2=0.1]
Multivariate Gaussian mixture model
◦ Each component of the mixture is a multivariate Gaussian density

f_i(x) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right)
◦ Parameters of the model
⊲ weights of each component (K)
⊲ mean vectors (Kn)
⊲ covariance matrices (Kn(n+1)/2)
◦ In practice, we often assume diagonal covariance matrices (see the sketch below)
⊲ far fewer parameters (Kn << Kn(n+1)/2)
⊲ much less computation (saves the matrix inversion)
⊲ can be compensated for by using more components
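A hedged sketch of the diagonal-covariance case, showing how the inverse and determinant of Σ reduce to element-wise operations on the variance vector (the function name and interface are illustrative, not from the slides):

```python
import numpy as np

def diag_gaussian_logpdf(x, mu, var):
    """Log-density of a multivariate Gaussian with diagonal covariance.

    x, mu, var are length-n vectors; var holds the diagonal of Sigma,
    so no matrix inversion is needed.
    """
    n = x.shape[0]
    return -0.5 * (n * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mu) ** 2 / var))
```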
Gaussian mixture model
[courtesy of A. A. D’Souza and J-F. Bonastre]
Mixture models from the generative viewpoint
A sample is drawn according to the law of the component f_i(·) of the mixture with probability π_i. In practice, sampling is a two-step process:
1. choose a component i of the mixture according to the discrete law defined by the weights π_j
2. draw a sample according to the law f_i(·)

Example: π = [0.3, 0.7], f_1 = N(0, 1), f_2 = N(2, 1)
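A minimal sketch of this two-step sampling procedure (assuming NumPy, with the example parameters above):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])   # pi
means   = np.array([0.0, 2.0])   # f_1 = N(0, 1), f_2 = N(2, 1)
stds    = np.array([1.0, 1.0])

def sample_mixture(n):
    # step 1: pick a component index for each sample from the discrete law (weights)
    z = rng.choice(len(weights), size=n, p=weights)
    # step 2: draw each sample from the Gaussian of its component
    return rng.normal(means[z], stds[z]), z

x, z = sample_mixture(1000)
print(np.bincount(z) / len(z))   # roughly [0.3, 0.7]
```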
Mixture models from the generative viewpoint
Interpretation: from a generative point of view, the samples in a mixture model are drawn from one of the components of the model, in the proportions defined by the weights, i.e.
◦ for each component i, there is a set of samples distributed according to f_i(x)
◦ the proportion of such samples is given by π_i
⇒ a hidden variable indicates the component to which a sample belongs!
Mixture models and hidden variables
The law of x is the marginal over the hidden variable Z, i.e.

f(x) = \sum_{i=1}^{K} \underbrace{\pi_i}_{P[Z=i]} \, \underbrace{f_i(x)}_{p(x|Z=i)}

Sampling ⇒ (Z, X)
Likelihood ⇒ p(X) = \sum_Z p(X, Z)
Mixture models and hidden variables
◦ Conditional density of x given z

p(x|z) = f_z(x) = \sum_{i=1}^{K} f_i(x) \, I(z=i)

◦ Joint density of (x, z)

p(x, z) = \pi_z f_z(x) = \left( \sum_{i=1}^{K} f_i(x) \, I(z=i) \right) \left( \sum_{i=1}^{K} \pi_i \, I(z=i) \right)

◦ Marginal density of x

p(x) = \sum_z p(x, z) = \sum_z \sum_i \pi_i f_i(x) \, I(z=i) = \sum_i \pi_i f_i(x)
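A small numerical check (an illustrative sketch, not from the slides) that summing the joint density p(x, z) over the hidden variable recovers the mixture density, using the two-component example from earlier:

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])
means   = np.array([0.0, 2.0])
stds    = np.array([1.0, 1.0])

def joint(x, z):
    """p(x, z) = pi_z * f_z(x)."""
    return weights[z] * norm.pdf(x, means[z], stds[z])

def marginal(x):
    """p(x) = sum_z p(x, z) = sum_i pi_i f_i(x)."""
    return sum(joint(x, z) for z in range(len(weights)))

x = 1.5
print(marginal(x))                                  # via the joint density
print(np.dot(weights, norm.pdf(x, means, stds)))    # mixture formula directly
```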
PARAMETER ESTIMATION
Maximum likelihood parameter estimation
◦ Let x = {x_1, . . . , x_N} be a set of training samples from which we want to estimate the parameters of a Gaussian mixture model with K components, i.e.
⊲ weights {w_1, . . . , w_K},
⊲ mean vectors {μ_1, . . . , μ_K},
⊲ variance vectors {σ_1, . . . , σ_K}.
◦ Maximum likelihood criterion

\ln f(x) = \sum_{i=1}^{N} \ln\left( \sum_{j=1}^{K} w_j f_j(x_i; \theta_j) \right)
⇒ direct maximization (nearly) impossible!
Direct maximum likelihood parameter estimation
◦ Directly solving the ML equations
⊲ they often do not exist!
⊲ complex equations when they do exist
◦ Gradient descent algorithms
⊲ non-convex likelihood function ⇒ non-uniqueness of the solution
⊲ need for prior knowledge on the domain of θ
◦ The Expectation-Maximization algorithm
⊲ nice and elegant solution!
Maximum likelihood with complete data
◦ The set of training samples x is incomplete!
◦ Assume for each sample x_i we know the component indicator function z_i
◦ The set {x_1, z_1, . . . , x_N, z_N} is known as the complete data
◦ Maximum likelihood estimates can be obtained from the complete data (sketched below), e.g.

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} I(z_j = i)
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, I(z_j = i)}{\sum_{j=1}^{N} I(z_j = i)}

⇒ but the variables z_j are not known!
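A hedged sketch of these complete-data estimators, assuming the labels z were observed (they are not, which is precisely what EM works around):

```python
import numpy as np

def complete_data_estimates(x, z, K):
    """ML estimates of weights and means when the component labels z are known.

    x: (N,) samples, z: (N,) integer labels in [0, K).
    """
    N = len(x)
    w_hat  = np.array([np.sum(z == i) / N for i in range(K)])
    mu_hat = np.array([x[z == i].mean() for i in range(K)])
    return w_hat, mu_hat
```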
The Expectation-Maximization principle
The Expectation-Maximization (EM) principle compensates for missing (aka latent) data, replacing them by their expectations.
EM Iterative principle
1. estimate the missing variables given a current estimate of the parameters
2. estimate new parameters given the current estimate of the missing variables
3. repeatsteps 1 and 2 until convergence
Note: this principle applies to many problems, not only for maximum likelihood parameter estimation!
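The principle translates into a very simple loop. The sketch below is an illustration only, with hypothetical `e_step` and `m_step` callables standing in for the model-specific computations:

```python
def em(x, theta0, e_step, m_step, n_iter=100):
    """Generic EM skeleton: alternate expectation of the missing
    variables and re-estimation of the parameters."""
    theta = theta0
    for _ in range(n_iter):
        expectations = e_step(x, theta)    # step 1: expected missing variables
        theta = m_step(x, expectations)    # step 2: new parameter estimates
    return theta
```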
The auxiliary function
The EM algorithm aims at maximizing an auxiliary function defined as

Q(\theta, \hat{\theta}) = E[\ln f(z, x; \theta) \,|\, x; \hat{\theta}]

where f(z, x; θ) is the likelihood function of the complete data.
Estimation step (E): compute the expected quantities in Q(θ, θ̂) (given θ̂ = θ_n)
Maximization step (M): maximize the auxiliary function w.r.t. the (true) parameters θ (given the expected quantities) to obtain a new estimate θ_{n+1}, i.e.

\theta_{n+1} = \arg\max_{\theta} Q(\theta, \theta_n)
A simple example
◦ x = {x_1, . . . , x_N} from two classes with prior probabilities π and 1 − π
◦ class indicator function z = {z_1, . . . , z_N}, z_i ∈ {−1, 1}
◦ joint distribution of (x, z)

\ln p(x, z; \theta) = \sum_{i=1}^{N} \left( \ln p(x_i | z_i; \theta) + \ln P[Z = z_i; \theta] \right)
                    = \sum_{i=1}^{N} \sum_{j \in \{-1,1\}} \left( \ln p(x_i | j; \theta) + \ln P[Z = j; \theta] \right) I(z_i = j)

◦ auxiliary function

Q(\theta, \theta_n) = \sum_{i=1}^{N} \sum_{j \in \{-1,1\}} \left( \ln p(x_i | j; \theta) + \ln P[Z = j; \theta] \right) E[I(z_i = j) \,|\, x; \theta_n]
A simple example: maximization equations
◦ auxiliary function (cont'd)

Q(\theta, \theta_n) = \sum_{j \in \{-1,1\}} \sum_{i=1}^{N} \ln p(x_i; \theta(j)) \, E[I(z_i = j) \,|\, x; \theta_n]
  + \ln(\pi) \sum_{i=1}^{N} E[I(z_i = -1) \,|\, x; \theta_n] + \ln(1 - \pi) \sum_{i=1}^{N} E[I(z_i = 1) \,|\, x; \theta_n]

◦ Maximization w.r.t. π

\pi_{n+1} = \frac{\sum_i E[I(z_i = -1) \,|\, x; \theta_n]}{\sum_{j \in \{-1,1\}} \sum_i E[I(z_i = j) \,|\, x; \theta_n]}

◦ Maximization w.r.t. θ(j)
⊲ only depends on the expectation E[I(z_i = j) | x; θ_n]
Simple example: expectation computation
The maximization (M-step) of Q(θ, θ_n) requires the computation of the expectations (E-step)

E[I(z_i = j) \,|\, x; \theta_n] = P[Z_i = j \,|\, x_i; \theta_n] = \gamma_j^{(n)}(i) ,

for i ∈ [1, N] and j ∈ {−1, 1}, which is given by

P[Z_i = j \,|\, x_i; \theta_n] = \frac{p(x_i | Z_i = j; \theta_n) \, P[Z_i = j; \theta_n]}{\sum_{k \in \{-1,1\}} p(x_i | Z_i = k; \theta_n) \, P[Z_i = k; \theta_n]} .

In our two-class example, for j = −1 we have

p(x_i | Z_i = j; \theta_n) \, P[Z_i = j; \theta_n] = \pi_n \, p(x_i; \theta_n(-1))

where π_n and θ_n(−1) are the current estimates of the parameters.
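A minimal sketch of this E-step for the two-class Gaussian case (the parameter values passed at the end are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, pi, mu, sigma):
    """gamma_j(i) = P[Z_i = j | x_i; theta_n] for a two-class Gaussian model.

    pi is the prior of class -1; columns are ordered (class -1, class +1).
    """
    priors = np.array([pi, 1.0 - pi])
    # p(x_i | Z_i = j; theta_n) * P[Z_i = j; theta_n], rows = samples, columns = classes
    joint = priors * norm.pdf(x[:, None], loc=mu, scale=sigma)
    # normalize each row by the sum over classes (the denominator above)
    return joint / joint.sum(axis=1, keepdims=True)

gamma = responsibilities(np.array([-0.5, 0.3, 2.1]), pi=0.3,
                         mu=np.array([0.0, 2.0]), sigma=np.array([1.0, 1.0]))
print(gamma)   # each row sums to 1
```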
EM and hidden class
In the case where the latent variable indicates membership to a class, we have the following interpretation:
◦ P[Z_i = j | x_i; θ_n] = degree (∈ [0, 1]) of membership of the i-th observation to class j
◦ maximization relies on standard estimators based on the degrees of membership (weighted standard estimators) [see the Gaussian mixture example]
EM from an algorithmic viewpoint
1. choose some initial (good) parameters θ_0
2. n ← 0
3. while not happy (with convergence)
(a) for i = 1 → N and j = 1 → K, compute the component posteriors

\gamma_j^{(n)}(i) = P[Z_i = j \,|\, x_i; \theta_n]

(b) for each parameter α in θ, compute the new parameter value α_{n+1} from the quantities γ_j^{(n)}(i) (by maximizing Q(θ, θ_n))
(c) n ← n + 1
Properties of the EM algorithm
Property 1
The series of estimators {θ^{(n)}} is such that the likelihood of the data increases with each iteration of the algorithm.

It can be shown that

Q(\theta_{n+1}, \theta_n) - Q(\theta_n, \theta_n) = \ln p(x; \theta_{n+1}) - \ln p(x; \theta_n) + \underbrace{E\left[ \ln \frac{p(z \,|\, x; \theta_{n+1})}{p(z \,|\, x; \theta_n)} \,\Big|\, x; \theta_n \right]}_{\le 0}

which implies that

Q(\theta_{n+1}, \theta_n) \ge Q(\theta_n, \theta_n) \;\Longrightarrow\; p(x; \theta_{n+1}) \ge p(x; \theta_n)
Properties of the EM algorithm
Property 2
The EM algorithm enables the computation of the gradient of the log-likelihood function at the points θ_n.

It can be verified that, under fairly mild assumptions,

\left. \frac{\partial Q(\theta, \theta_n)}{\partial \theta} \right|_{\theta = \theta_n} = \left. \frac{\partial \ln f(x; \theta)}{\partial \theta} \right|_{\theta = \theta_n} + \underbrace{\left. \frac{\partial E[\ln p(z \,|\, x; \theta) \,|\, x; \theta_n]}{\partial \theta} \right|_{\theta = \theta_n}}_{= 0}
Convergence of the EM algorithm
Property 1: the series of estimators {θ^{(n)}} is such that the likelihood of the data increases with each iteration of the algorithm.
Property 2: the EM algorithm enables the computation of the gradient of the log-likelihood function at the points θ_n.
⇓
The EM estimates converge toward stationary points of the log-likelihood function ln p(x; θ).
Convergence of the EM algorithm in practice
◦ The convergence is only guaranteed toward a local maximum of the likelihood function ln p(x; θ).
⊲ need for a good initial guess θ_0
⊲ need to avoid degenerate solutions!
◦ In practice, convergence is controlled by two factors:
⊲ increase of the log-likelihood of the data
⊲ fixed number of iterations
◦ Constraints on the parameter space are often used to avoid bad or degenerate solutions (see the sketch below), e.g.
⊲ minimum variance floor
⊲ initialization based on the (segmental) k-means algorithm
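A hedged sketch (not from the slides) of how two of these practical safeguards, a stopping test on the log-likelihood increase and a minimum variance floor, are commonly implemented; `e_step`, `m_step` and `loglik` are hypothetical model-specific callables:

```python
import numpy as np

def em_with_safeguards(x, theta0, e_step, m_step, loglik,
                       max_iter=100, tol=1e-4, var_floor=1e-3):
    """EM loop with the practical convergence controls discussed above.

    theta is assumed to be a dict holding a 'var' entry among its parameters.
    """
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        gamma = e_step(x, theta)
        theta = m_step(x, gamma)
        theta['var'] = np.maximum(theta['var'], var_floor)  # variance floor
        ll = loglik(x, theta)
        if ll - prev_ll < tol:        # stop when the increase becomes negligible
            break
        prev_ll = ll
    return theta
```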
EM for Gaussian mixtures
◦ Joint likelihood of (x, z)

\ln f(x, z) = \sum_{i=1}^{N} \sum_{j=1}^{K} \ln\left( w_j f(x_i; \mu_j, \sigma_j) \right) I(z_i = j)

◦ Auxiliary function

Q(\theta, \theta_n) \propto \sum_{j=1}^{K} \sum_{i=1}^{N} \ln(w_j) \, E[I(z_i = j) \,|\, x; \theta_n]
  - \frac{1}{2} \sum_{j=1}^{K} \sum_{i=1}^{N} \sum_{k=1}^{d} \left( \ln(\sigma_{jk}^2) + \frac{(x_{ik} - \mu_{jk})^2}{\sigma_{jk}^2} \right) E[I(z_i = j) \,|\, x; \theta_n]
EM for Gaussian mixtures (cont’d)
◦ Compute the expectations at iteration n

\gamma_j^{(n)}(i) = \frac{w_j^{(n)} \, p(x_i; \mu_j^{(n)}, \sigma_j^{(n)})}{\sum_k w_k^{(n)} \, p(x_i; \mu_k^{(n)}, \sigma_k^{(n)})}

where the parameters correspond to the current estimate θ^{(n)}.
◦ Maximization

w_j^{(n+1)} = \frac{\sum_{i=1}^{N} \gamma_j^{(n)}(i)}{\sum_{k=1}^{K} \sum_{i=1}^{N} \gamma_k^{(n)}(i)}
\qquad
\mu_{jk}^{(n+1)} = \frac{\sum_{i=1}^{N} \gamma_j^{(n)}(i) \, x_{ik}}{\sum_{i=1}^{N} \gamma_j^{(n)}(i)}
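A compact sketch of these E and M steps for a one-dimensional Gaussian mixture (an illustration under the notation above; it also includes a variance update of the same weighted form, which the slide leaves implicit):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, w, mu, var, n_iter=20):
    """EM for a 1-D Gaussian mixture; x: (N,), w/mu/var: (K,)."""
    for _ in range(n_iter):
        # E-step: responsibilities gamma_j(i)
        joint = w * norm.pdf(x[:, None], mu, np.sqrt(var))   # shape (N, K)
        gamma = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation
        Nj = gamma.sum(axis=0)                               # effective counts per component
        w = Nj / Nj.sum()
        mu = (gamma * x[:, None]).sum(axis=0) / Nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, var
```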
The EM at work
[Figures showing the EM iterations on a Gaussian mixture: initialization, iterations 1 to 4, and iteration 20 — courtesy of A. W. Moore and J-F. Bonastre]
The EM at work: another example
[courtesy of J-F. Bonastre]
EM and k-means clustering
◦ Maximum likelihood estimates with known class membership

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} I(z_j = i)
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, I(z_j = i)}{\sum_{j=1}^{N} I(z_j = i)}

◦ EM estimates with unknown class membership

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} E[I(z_j = i) \,|\, x; \theta_n]
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, E[I(z_j = i) \,|\, x; \theta_n]}{\sum_{j=1}^{N} E[I(z_j = i) \,|\, x; \theta_n]}
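As an illustrative sketch of the parallel (not from the slides), the only difference between the two mean updates is hard versus soft assignment, i.e. indicator versus posterior:

```python
import numpy as np

def kmeans_style_update(x, mu):
    """Hard assignment: each sample counts only for its nearest mean."""
    z = np.argmin(np.abs(x[:, None] - mu), axis=1)            # plays the role of I(z_j = i)
    return np.array([x[z == i].mean() for i in range(len(mu))])

def em_style_update(x, gamma):
    """Soft assignment: each sample counts for every component,
    weighted by the posterior E[I(z_j = i) | x]."""
    return (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
```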
EM and k-means clustering
[Figure: the EM algorithm compared side by side with k-means]
A practical use of the Gaussian law
[Figure: a two-component Gaussian mixture fitted to energy values, with a low-energy Gaussian and a high-energy Gaussian; a decision threshold is set at m1 − a · s1]
EM, sufficient statistics and the exponential family
◦ Joint density is from the exponential family

f(x, z; \theta) = \exp\left( \alpha(\theta)' a(x, z) + b(x, z) - \beta(\theta) \right)

◦ E-step ⇒ estimate the sufficient statistic a(x, z) by computing its expectation under the posterior law, given a current estimate of the parameters
◦ Examples:

\sum_j I(z_j = i) \;\longrightarrow\; \sum_j E[I(z_j = i) \,|\, x; \theta_n]
\sum_j x_j \, I(z_j = i) \;\longrightarrow\; \sum_j x_j \, E[I(z_j = i) \,|\, x; \theta_n]
\sum_j (x_j - \hat{\mu}_i)^2 \, I(z_j = i) \;\longrightarrow\; \sum_j (x_j - \hat{\mu}_i)^2 \, E[I(z_j = i) \,|\, x; \theta_n]
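A sketch of how such expected sufficient statistics are typically accumulated in one pass before the M-step (illustrative NumPy code under the notation above; the second-order statistic is accumulated in raw form and centered afterwards):

```python
import numpy as np

def expected_sufficient_stats(x, gamma):
    """Accumulate expected sufficient statistics of a 1-D Gaussian mixture.

    x: (N,) samples; gamma: (N, K) posteriors E[I(z_j = i) | x; theta_n].
    Returns zeroth-, first- and second-order statistics per component.
    """
    s0 = gamma.sum(axis=0)                       # sum_j E[I(z_j = i) | x]
    s1 = (gamma * x[:, None]).sum(axis=0)        # sum_j x_j E[I(z_j = i) | x]
    s2 = (gamma * x[:, None] ** 2).sum(axis=0)   # sum_j x_j^2 E[I(z_j = i) | x]
    return s0, s1, s2

# M-step directly from the statistics: mu = s1 / s0, var = s2 / s0 - mu**2
```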
Variants
The EM principle enables many variants when the E-step and/or the M-step are intractable
◦ Monte-Carlo EM: replace the exact computation of the expected quantities by some Monte-Carlo approximations obtained using the current parameters
◦ Generalized EM: simply increase the auxiliary function rather than maximizing it, e.g. using a gradient algorithm
◦ Variational EM: replace the auxiliary function Q by a simpler variational approximation based on a factorial distribution, Q ≃ ∏_i Q_i
◦ . . .
Choosing the number of components
◦ Experimentation...
◦ Information criterion (see the sketch below)

I(x, \theta) = \ln p(x; \theta) - g(\#\text{parameters}, \#\text{data})

⊲ Akaike information criterion (AIC)
⊲ Bayesian information criterion (BIC)
⊲ . . .
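A hedged sketch of how the number of components could be selected with BIC; `fit` and `loglik` are hypothetical helpers (for instance an `em_gmm`-style routine as above) and the parameter count assumes a one-dimensional mixture:

```python
import numpy as np

def bic(loglik_value, n_params, n_data):
    """Penalized log-likelihood (written so that higher is better)."""
    return loglik_value - 0.5 * n_params * np.log(n_data)

def choose_K(x, candidates, fit, loglik):
    """Fit a mixture for each candidate K and keep the one with the best BIC."""
    scores = {}
    for K in candidates:
        params = fit(x, K)
        n_params = (K - 1) + K + K      # weights + means + variances (1-D case)
        scores[K] = bic(loglik(x, params), n_params, len(x))
    return max(scores, key=scores.get)
```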
Bibliography
◦ A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM algorithm. Journal of the Royal Statistical Society. Series B, Vol. 39, No. 1, pp. 1–38, 1977.
◦ T. K. Moon. The Expectation-Maximization Algorithm. IEEE Signal Processing Magazine, pp. 46–60, November, 1996.
◦ G. J. McLachlan, T. Krishnan. The EM algorithm and extensions. Wiley Series in Probability and Statistics, 1997.