Data analysis and stochastic modeling
Lecture 5 – Mixture models and the EM algorithm
Guillaume Gravier
guillaume.gravier@irisa.fr
What are we going to talk about?
1. A gentle introduction to probability
2. Data analysis
3. Cluster analysis
4. Estimation theory and practice
5. Mixture models
◦ notion of mixture models
◦ Gaussian mixture models
◦ parameter estimation with the EM algorithm
◦ EM and cluster analysis
6. Random processes
7. Hypothesis testing
8. Queueing systems
MIXTURE MODEL DEFINITION
Why mixture models?
[Figure: a true distribution compared with a single Gaussian model and a mixture model — courtesy of J-F. Bonastre]
Mixture model – definition
A mixture model is a weighted sum of laws, the likelihood of a sample x being given by

f(x) = \sum_{i=1}^{K} \pi_i f_i(x)

with the constraint that \sum_{i=1}^{K} \pi_i = 1.
◦ f_i(·) can be any density and the f_i(·)'s need not be from the same family
◦ f(·) is a density since \int_{-\infty}^{\infty} f(x)\,dx = 1
◦ the model can extend to discrete variables
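As an illustration (not from the original slides), here is a minimal sketch of how such a mixture density could be evaluated numerically, assuming two Gaussian components; the weights, means and standard deviations are arbitrary example values.

```python
import numpy as np
from scipy.stats import norm

# Example mixture: K = 2 Gaussian components (illustrative parameters)
weights = np.array([0.3, 0.7])   # pi_i, must sum to 1
means   = np.array([0.0, 2.0])   # mu_i
stds    = np.array([1.0, 1.0])   # sigma_i

def mixture_density(x):
    """f(x) = sum_i pi_i f_i(x), each f_i a univariate Gaussian density."""
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

# Numerical check that the density integrates to ~1 on a wide grid
grid = np.linspace(-10, 12, 10001)
dx = grid[1] - grid[0]
print(np.sum(mixture_density(grid)) * dx)   # approximately 1.0
```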
Examples of mixture model densities
[Plots of two-component Gaussian mixture densities for different weightings: w1=0.7, w2=0.3; w1=0.3, w2=0.7; and w1=0.9, w2=0.1]
Multivariate Gaussian mixture model
◦ Each component of the mixture is a multivariate Gaussian density

f_i(x) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right)
◦ Parameters of the model
⊲ weights of each component (K)
⊲ mean vectors (Kn)
⊲ covariance matrices (Kn(n+1)/2)
◦ In practice, we often assume diagonal covariance matrices (see the sketch below)
⊲ far fewer parameters (Kn << Kn(n+1)/2)
⊲ much less computation (saves the matrix inversion)
⊲ can be compensated for by using more components
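A hedged sketch of the diagonal-covariance case, showing how the inverse and determinant of Σ reduce to element-wise operations on the variance vector (the function name and interface are illustrative, not from the slides):

```python
import numpy as np

def diag_gaussian_logpdf(x, mu, var):
    """Log-density of a multivariate Gaussian with diagonal covariance.

    x, mu, var are length-n vectors; var holds the diagonal of Sigma,
    so no matrix inversion is needed.
    """
    n = x.shape[0]
    return -0.5 * (n * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mu) ** 2 / var))
```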
Gaussian mixture model
[courtesy of A. A. D’Souza and J-F. Bonastre]
Mixture models from the generative viewpoint
A sample is drawn according to the law of the component f_i(·) of the mixture with probability π_i. In practice, sampling is a two-step process:
1. choose a component i of the mixture according to the discrete law defined by the weights π_j
2. draw a sample according to the law f_i(·)

Example: π = [0.3, 0.7], f_1 = N(0, 1), f_2 = N(2, 1)
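A minimal sketch of this two-step sampling procedure (assuming NumPy, with the example parameters above):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])   # pi
means   = np.array([0.0, 2.0])   # f_1 = N(0, 1), f_2 = N(2, 1)
stds    = np.array([1.0, 1.0])

def sample_mixture(n):
    # step 1: pick a component index for each sample from the discrete law (weights)
    z = rng.choice(len(weights), size=n, p=weights)
    # step 2: draw each sample from the Gaussian of its component
    return rng.normal(means[z], stds[z]), z

x, z = sample_mixture(1000)
print(np.bincount(z) / len(z))   # roughly [0.3, 0.7]
```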
Mixture models from the generative viewpoint
Interpretation: from a generative point of view, the samples in a mixture model are drawn from one of the components of the model, in the proportions defined by the weights, i.e.
◦ for each component i, there is a set of samples distributed according to f_i(x)
◦ the proportion of such samples is given by π_i
⇒ a hidden variable indicates the component to which a sample belongs!
Mixture models and hidden variables
The law of x is the marginal over the hidden variable Z, i.e.

f(x) = \sum_{i=1}^{K} \underbrace{\pi_i}_{P[Z=i]} \, \underbrace{f_i(x)}_{p(x|Z=i)}

Sampling ⇒ (Z, X)
Likelihood ⇒ p(X) = \sum_Z p(X, Z)
Mixture models and hidden variables
◦ Conditional density of x given z

p(x|z) = f_z(x) = \sum_{i=1}^{K} f_i(x) \, I(z=i)

◦ Joint density of (x, z)

p(x, z) = \pi_z f_z(x) = \left( \sum_{i=1}^{K} f_i(x) \, I(z=i) \right) \left( \sum_{i=1}^{K} \pi_i \, I(z=i) \right)

◦ Marginal density of x

p(x) = \sum_z p(x, z) = \sum_z \sum_i \pi_i f_i(x) \, I(z=i) = \sum_i \pi_i f_i(x)
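A small numerical check (an illustrative sketch, not from the slides) that summing the joint density p(x, z) over the hidden variable recovers the mixture density, using the two-component example from earlier:

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])
means   = np.array([0.0, 2.0])
stds    = np.array([1.0, 1.0])

def joint(x, z):
    """p(x, z) = pi_z * f_z(x)."""
    return weights[z] * norm.pdf(x, means[z], stds[z])

def marginal(x):
    """p(x) = sum_z p(x, z) = sum_i pi_i f_i(x)."""
    return sum(joint(x, z) for z in range(len(weights)))

x = 1.5
print(marginal(x))                                  # via the joint density
print(np.dot(weights, norm.pdf(x, means, stds)))    # mixture formula directly
```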
PARAMETER ESTIMATION
Maximum likelihood parameter estimation
◦ Let x = {x_1, . . . , x_N} be a set of training samples from which we want to estimate the parameters of a Gaussian mixture model with K components, i.e.
⊲ weights {w_1, . . . , w_K},
⊲ mean vectors {μ_1, . . . , μ_K},
⊲ variance vectors {σ_1, . . . , σ_K}.
◦ Maximum likelihood criterion

\ln f(x) = \sum_{i=1}^{N} \ln\left( \sum_{j=1}^{K} w_j f_j(x_i; \theta_j) \right)
⇒ direct maximization (nearly) impossible!
Direct maximum likelihood parameter estimation
◦ Directly solving the ML equations
⊲ they often do not exist!
⊲ complex equations when they do exist
◦ Gradient descent algorithms
⊲ non-convex likelihood function ⇒ non-uniqueness of the solution
⊲ need for prior knowledge on the domain of θ
◦ The Expectation-Maximization algorithm
⊲ nice and elegant solution!
Maximum likelihood with complete data
◦ The set of training samples x is incomplete!
◦ Assume for each sample x_i we know the component indicator function z_i
◦ The set {x_1, z_1, . . . , x_N, z_N} is known as the complete data
◦ Maximum likelihood estimates can be obtained from the complete data (sketched below), e.g.

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} I(z_j = i)
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, I(z_j = i)}{\sum_{j=1}^{N} I(z_j = i)}

⇒ but the variables z_j are not known!
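A hedged sketch of these complete-data estimators, assuming the labels z were observed (they are not, which is precisely what EM works around):

```python
import numpy as np

def complete_data_estimates(x, z, K):
    """ML estimates of weights and means when the component labels z are known.

    x: (N,) samples, z: (N,) integer labels in [0, K).
    """
    N = len(x)
    w_hat  = np.array([np.sum(z == i) / N for i in range(K)])
    mu_hat = np.array([x[z == i].mean() for i in range(K)])
    return w_hat, mu_hat
```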
The Expectation-Maximization principle
The Expectation-Maximization (EM) principle compensates for missing (aka latent) data, replacing them by their expectations.
EM Iterative principle
1. estimate the missing variables given a current estimate of the parameters
2. estimate new parameters given the current estimate of the missing variables
3. repeatsteps 1 and 2 until convergence
Note: this principle applies to many problems, not only for maximum likelihood parameter estimation!
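The principle translates into a very simple loop. The sketch below is an illustration only, with hypothetical `e_step` and `m_step` callables standing in for the model-specific computations:

```python
def em(x, theta0, e_step, m_step, n_iter=100):
    """Generic EM skeleton: alternate expectation of the missing
    variables and re-estimation of the parameters."""
    theta = theta0
    for _ in range(n_iter):
        expectations = e_step(x, theta)    # step 1: expected missing variables
        theta = m_step(x, expectations)    # step 2: new parameter estimates
    return theta
```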
The auxiliary function
The EM algorithm aims at maximizing an auxiliary function defined as

Q(\theta, \hat{\theta}) = E[\ln f(z, x; \theta) \,|\, x; \hat{\theta}]

where f(z, x; θ) is the likelihood function of the complete data.
Estimation step (E): compute the expected quantities in Q(θ, θ̂) (given θ̂ = θ_n)
Maximization step (M): maximize the auxiliary function w.r.t. the (true) parameters θ (given the expected quantities) to obtain a new estimate θ_{n+1}, i.e.

\theta_{n+1} = \arg\max_{\theta} Q(\theta, \theta_n)
A simple example
◦ x = {x_1, . . . , x_N} from two classes with prior probabilities π and 1 − π
◦ class indicator function z = {z_1, . . . , z_N}, z_i ∈ {−1, 1}
◦ joint distribution of (x, z)

\ln p(x, z; \theta) = \sum_{i=1}^{N} \left( \ln p(x_i | z_i; \theta) + \ln P[Z = z_i; \theta] \right)
                    = \sum_{i=1}^{N} \sum_{j \in \{-1,1\}} \left( \ln p(x_i | j; \theta) + \ln P[Z = j; \theta] \right) I(z_i = j)

◦ auxiliary function

Q(\theta, \theta_n) = \sum_{i=1}^{N} \sum_{j \in \{-1,1\}} \left( \ln p(x_i | j; \theta) + \ln P[Z = j; \theta] \right) E[I(z_i = j) \,|\, x; \theta_n]
A simple example: maximization equations
◦ auxiliary function (cont'd)

Q(\theta, \theta_n) = \sum_{j \in \{-1,1\}} \sum_{i=1}^{N} \ln p(x_i; \theta(j)) \, E[I(z_i = j) \,|\, x; \theta_n]
  + \ln(\pi) \sum_{i=1}^{N} E[I(z_i = -1) \,|\, x; \theta_n] + \ln(1 - \pi) \sum_{i=1}^{N} E[I(z_i = 1) \,|\, x; \theta_n]

◦ Maximization w.r.t. π

\pi_{n+1} = \frac{\sum_i E[I(z_i = -1) \,|\, x; \theta_n]}{\sum_{j \in \{-1,1\}} \sum_i E[I(z_i = j) \,|\, x; \theta_n]}

◦ Maximization w.r.t. θ(j)
⊲ only depends on the expectation E[I(z_i = j) | x; θ_n]
Simple example: expectation computation
The maximization (M-step) of Q(θ, θ_n) requires the computation of the expectations (E-step)

E[I(z_i = j) \,|\, x; \theta_n] = P[Z_i = j \,|\, x_i; \theta_n] = \gamma_j^{(n)}(i) ,

for i ∈ [1, N] and j ∈ {−1, 1}, which is given by

P[Z_i = j \,|\, x_i; \theta_n] = \frac{p(x_i | Z_i = j; \theta_n) \, P[Z_i = j; \theta_n]}{\sum_{k \in \{-1,1\}} p(x_i | Z_i = k; \theta_n) \, P[Z_i = k; \theta_n]} .

In our two-class example, for j = −1 we have

p(x_i | Z_i = j; \theta_n) \, P[Z_i = j; \theta_n] = \pi_n \, p(x_i; \theta_n(-1))

where π_n and θ_n(−1) are the current estimates of the parameters.
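A minimal sketch of this E-step for the two-class Gaussian case (the parameter values passed at the end are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, pi, mu, sigma):
    """gamma_j(i) = P[Z_i = j | x_i; theta_n] for a two-class Gaussian model.

    pi is the prior of class -1; columns are ordered (class -1, class +1).
    """
    priors = np.array([pi, 1.0 - pi])
    # p(x_i | Z_i = j; theta_n) * P[Z_i = j; theta_n], rows = samples, columns = classes
    joint = priors * norm.pdf(x[:, None], loc=mu, scale=sigma)
    # normalize each row by the sum over classes (the denominator above)
    return joint / joint.sum(axis=1, keepdims=True)

gamma = responsibilities(np.array([-0.5, 0.3, 2.1]), pi=0.3,
                         mu=np.array([0.0, 2.0]), sigma=np.array([1.0, 1.0]))
print(gamma)   # each row sums to 1
```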
EM and hidden class
In the case where the latent variable indicates membership to a class, we have the following interpretation:
◦ P[Z_i = j | x_i; θ_n] = degree (∈ [0, 1]) of membership of the i-th observation to class j
◦ maximization relies on standard estimators based on the degrees of membership (weighted standard estimators) [see the Gaussian mixture example]
EM from an algorithmic viewpoint
1. choose some initial (good) parameters θ_0
2. n ← 0
3. while not happy (with convergence)
(a) for i = 1 → N and j = 1 → K, compute the component posteriors

\gamma_j^{(n)}(i) = P[Z_i = j \,|\, x_i; \theta_n]

(b) for each parameter α in θ, compute the new parameter value α_{n+1} from the quantities γ_j^{(n)}(i) (by maximizing Q(θ, θ_n))
(c) n ← n + 1
Properties of the EM algorithm
Property 1
The series of estimators {θ^{(n)}} is such that the likelihood of the data increases with each iteration of the algorithm.

It can be shown that

Q(\theta_{n+1}, \theta_n) - Q(\theta_n, \theta_n) = \ln p(x; \theta_{n+1}) - \ln p(x; \theta_n) + \underbrace{E\left[ \ln \frac{p(z \,|\, x; \theta_{n+1})}{p(z \,|\, x; \theta_n)} \,\Big|\, x; \theta_n \right]}_{\le 0}

which implies that

Q(\theta_{n+1}, \theta_n) \ge Q(\theta_n, \theta_n) \;\Longrightarrow\; p(x; \theta_{n+1}) \ge p(x; \theta_n)
Properties of the EM algorithm
Property 2
The EM algorithm enables the computation of the gradient of the log-likelihood function at the points θ_n.

It can be verified that, under fairly mild assumptions,

\left. \frac{\partial Q(\theta, \theta_n)}{\partial \theta} \right|_{\theta = \theta_n} = \left. \frac{\partial \ln f(x; \theta)}{\partial \theta} \right|_{\theta = \theta_n} + \underbrace{\left. \frac{\partial E[\ln p(z \,|\, x; \theta) \,|\, x; \theta_n]}{\partial \theta} \right|_{\theta = \theta_n}}_{= 0}
Convergence of the EM algorithm
Property 1: the series of estimators {θ^{(n)}} is such that the likelihood of the data increases with each iteration of the algorithm.
Property 2: the EM algorithm enables the computation of the gradient of the log-likelihood function at the points θ_n.
⇓
The EM estimates converge toward stationary points of the log-likelihood function ln p(x; θ).
Convergence of the EM algorithm in practice
◦ The convergence is only guaranteed toward a local maximum of the likelihood function ln p(x; θ).
⊲ need for a good initial guess θ_0
⊲ need to avoid degenerate solutions!
◦ In practice, convergence is controlled by two factors:
⊲ increase of the log-likelihood of the data
⊲ fixed number of iterations
◦ Constraints on the parameter space are often used to avoid bad or degenerate solutions (see the sketch below), e.g.
⊲ minimum variance floor
⊲ initialization based on the (segmental) k-means algorithm
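A hedged sketch (not from the slides) of how two of these practical safeguards, a stopping test on the log-likelihood increase and a minimum variance floor, are commonly implemented; `e_step`, `m_step` and `loglik` are hypothetical model-specific callables:

```python
import numpy as np

def em_with_safeguards(x, theta0, e_step, m_step, loglik,
                       max_iter=100, tol=1e-4, var_floor=1e-3):
    """EM loop with the practical convergence controls discussed above.

    theta is assumed to be a dict holding a 'var' entry among its parameters.
    """
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        gamma = e_step(x, theta)
        theta = m_step(x, gamma)
        theta['var'] = np.maximum(theta['var'], var_floor)  # variance floor
        ll = loglik(x, theta)
        if ll - prev_ll < tol:        # stop when the increase becomes negligible
            break
        prev_ll = ll
    return theta
```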
EM for Gaussian mixtures
◦ Joint likelihood of (x, z)

\ln f(x, z) = \sum_{i=1}^{N} \sum_{j=1}^{K} \ln\left( w_j f(x_i; \mu_j, \sigma_j) \right) I(z_i = j)

◦ Auxiliary function

Q(\theta, \theta_n) \propto \sum_{j=1}^{K} \sum_{i=1}^{N} \ln(w_j) \, E[I(z_i = j) \,|\, x; \theta_n]
  - \frac{1}{2} \sum_{j=1}^{K} \sum_{i=1}^{N} \sum_{k=1}^{d} \left( \ln(\sigma_{jk}^2) + \frac{(x_{ik} - \mu_{jk})^2}{\sigma_{jk}^2} \right) E[I(z_i = j) \,|\, x; \theta_n]
EM for Gaussian mixtures (cont’d)
◦ Compute the expectations at iteration n

\gamma_j^{(n)}(i) = \frac{w_j^{(n)} \, p(x_i; \mu_j^{(n)}, \sigma_j^{(n)})}{\sum_k w_k^{(n)} \, p(x_i; \mu_k^{(n)}, \sigma_k^{(n)})}

where the parameters correspond to the current estimate θ^{(n)}.
◦ Maximization

w_j^{(n+1)} = \frac{\sum_{i=1}^{N} \gamma_j^{(n)}(i)}{\sum_{k=1}^{K} \sum_{i=1}^{N} \gamma_k^{(n)}(i)}
\qquad
\mu_{jk}^{(n+1)} = \frac{\sum_{i=1}^{N} \gamma_j^{(n)}(i) \, x_{ik}}{\sum_{i=1}^{N} \gamma_j^{(n)}(i)}
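A compact sketch of these E and M steps for a one-dimensional Gaussian mixture (an illustration under the notation above; it also includes a variance update of the same weighted form, which the slide leaves implicit):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, w, mu, var, n_iter=20):
    """EM for a 1-D Gaussian mixture; x: (N,), w/mu/var: (K,)."""
    for _ in range(n_iter):
        # E-step: responsibilities gamma_j(i)
        joint = w * norm.pdf(x[:, None], mu, np.sqrt(var))   # shape (N, K)
        gamma = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation
        Nj = gamma.sum(axis=0)                               # effective counts per component
        w = Nj / Nj.sum()
        mu = (gamma * x[:, None]).sum(axis=0) / Nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, var
```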
The EM at work
[Figures showing the EM iterations on a Gaussian mixture: initialization, iterations 1 to 4, and iteration 20 — courtesy of A. W. Moore and J-F. Bonastre]
The EM at work: another example
[courtesy of J-F. Bonastre]
EM and k-means clustering
◦ Maximum likelihood estimates with known class membership

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} I(z_j = i)
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, I(z_j = i)}{\sum_{j=1}^{N} I(z_j = i)}

◦ EM estimates with unknown class membership

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} E[I(z_j = i) \,|\, x; \theta_n]
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, E[I(z_j = i) \,|\, x; \theta_n]}{\sum_{j=1}^{N} E[I(z_j = i) \,|\, x; \theta_n]}
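As an illustrative sketch of the parallel (not from the slides), the only difference between the two mean updates is hard versus soft assignment, i.e. indicator versus posterior:

```python
import numpy as np

def kmeans_style_update(x, mu):
    """Hard assignment: each sample counts only for its nearest mean."""
    z = np.argmin(np.abs(x[:, None] - mu), axis=1)            # plays the role of I(z_j = i)
    return np.array([x[z == i].mean() for i in range(len(mu))])

def em_style_update(x, gamma):
    """Soft assignment: each sample counts for every component,
    weighted by the posterior E[I(z_j = i) | x]."""
    return (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
```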
EM and k-means clustering
[Figure: the EM algorithm compared side by side with k-means]
A practical use of the Gaussian law
[Figure: a two-component Gaussian mixture fitted to energy values, with a low-energy Gaussian and a high-energy Gaussian; a decision threshold is set at m1 − a · s1]
EM, sufficient statistics and the exponential family
◦ Joint density is from the exponential family

f(x, z; \theta) = \exp\left( \alpha(\theta)' a(x, z) + b(x, z) - \beta(\theta) \right)

◦ E-step ⇒ estimate the sufficient statistic a(x, z) by computing its expectation under the posterior law, given a current estimate of the parameters
◦ Examples:

\sum_j I(z_j = i) \;\longrightarrow\; \sum_j E[I(z_j = i) \,|\, x; \theta_n]
\sum_j x_j \, I(z_j = i) \;\longrightarrow\; \sum_j x_j \, E[I(z_j = i) \,|\, x; \theta_n]
\sum_j (x_j - \hat{\mu}_i)^2 \, I(z_j = i) \;\longrightarrow\; \sum_j (x_j - \hat{\mu}_i)^2 \, E[I(z_j = i) \,|\, x; \theta_n]
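A sketch of how such expected sufficient statistics are typically accumulated in one pass before the M-step (illustrative NumPy code under the notation above; the second-order statistic is accumulated in raw form and centered afterwards):

```python
import numpy as np

def expected_sufficient_stats(x, gamma):
    """Accumulate expected sufficient statistics of a 1-D Gaussian mixture.

    x: (N,) samples; gamma: (N, K) posteriors E[I(z_j = i) | x; theta_n].
    Returns zeroth-, first- and second-order statistics per component.
    """
    s0 = gamma.sum(axis=0)                       # sum_j E[I(z_j = i) | x]
    s1 = (gamma * x[:, None]).sum(axis=0)        # sum_j x_j E[I(z_j = i) | x]
    s2 = (gamma * x[:, None] ** 2).sum(axis=0)   # sum_j x_j^2 E[I(z_j = i) | x]
    return s0, s1, s2

# M-step directly from the statistics: mu = s1 / s0, var = s2 / s0 - mu**2
```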
Variants
The EM principle enables many variants when the E-step and/or the M-step are intractable
◦ Monte-Carlo EM: replace the exact computation of the expected quantities by some Monte-Carlo approximations obtained using the current parameters
◦ Generalized EM: simply increase the auxiliary function rather than maximizing it, e.g. using a gradient algorithm
◦ Variational EM: replace the auxiliary function Q by a simpler variational approximation based on a factorial distribution, Q ≃ ∏_i Q_i
◦ . . .
Choosing the number of components
◦ Experimentation...
◦ Information criterion (see the sketch below)

I(x, \theta) = \ln p(x; \theta) - g(\#\text{parameters}, \#\text{data})

⊲ Akaike information criterion (AIC)
⊲ Bayesian information criterion (BIC)
⊲ . . .
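A hedged sketch of how the number of components could be selected with BIC; `fit` and `loglik` are hypothetical helpers (for instance an `em_gmm`-style routine as above) and the parameter count assumes a one-dimensional mixture:

```python
import numpy as np

def bic(loglik_value, n_params, n_data):
    """Penalized log-likelihood (written so that higher is better)."""
    return loglik_value - 0.5 * n_params * np.log(n_data)

def choose_K(x, candidates, fit, loglik):
    """Fit a mixture for each candidate K and keep the one with the best BIC."""
    scores = {}
    for K in candidates:
        params = fit(x, K)
        n_params = (K - 1) + K + K      # weights + means + variances (1-D case)
        scores[K] = bic(loglik(x, params), n_params, len(x))
    return max(scores, key=scores.get)
```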
Bibliography
◦ A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM algorithm. Journal of the Royal Statistical Society. Series B, Vol. 39, No. 1, pp. 1–38, 1977.
◦ T. K. Moon. The Expectation-Maximization Algorithm. IEEE Signal Processing Magazine, pp. 46–60, November, 1996.
◦ G. J. McLachlan, T. Krishnan. The EM algorithm and extensions. Wiley Series in Probability and Statistics, 1997.