Data analysis and stochastic modeling

Lecture 5 – Mixture models and the EM algorithm

Guillaume Gravier

guillaume.gravier@irisa.fr


What are we gonna talk about?

1. A gentle introduction to probability
2. Data analysis
3. Cluster analysis
4. Estimation theory and practice
5. Mixture models
   ◦ notion of mixture models
   ◦ Gaussian mixture models
   ◦ parameter estimation with the EM algorithm
   ◦ EM and cluster analysis
6. Random processes
7. Hypothesis testing
8. Queueing systems


MIXTURE MODEL DEFINITION

Why mixture models?

[Figure: the true distribution compared with a single Gaussian model and a mixture model.]

[courtesy of J-F. Bonastre]

Mixture model – definition

A mixture model is a weighted sum of laws, the likelihood of a sample x being given by

f(x) = \sum_{i=1}^{K} \pi_i \, f_i(x)

with the constraint that

\sum_{i=1}^{K} \pi_i = 1 .

◦ f_i() can be any density and the f_i()'s need not be from the same family
◦ f() is a density since \int_{-\infty}^{+\infty} f(x)\,dx = 1
◦ the model extends to discrete variables
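As a concrete illustration, here is a minimal Python sketch of such a weighted sum of densities (assuming NumPy and SciPy are available; the weights and component parameters are made-up values):

    import numpy as np
    from scipy.stats import norm

    # Hypothetical two-component Gaussian mixture: pi = [0.3, 0.7],
    # f_1 = N(0, 1), f_2 = N(2, 1) -- illustrative values only.
    weights = np.array([0.3, 0.7])
    means = np.array([0.0, 2.0])
    stds = np.array([1.0, 1.0])

    def mixture_pdf(x):
        """f(x) = sum_i pi_i * f_i(x): weighted sum of the component densities."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        comp = norm.pdf(x[:, None], loc=means, scale=stds)  # shape (N, K)
        return comp @ weights

    print(mixture_pdf([0.0, 1.0, 2.0]))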

Examples of mixture model densities

[Figures: densities of two-component mixtures for different weight settings, e.g. (w1 = 0.7, w2 = 0.3), (w1 = 0.3, w2 = 0.7) and (w1 = 0.9, w2 = 0.1).]

Multivariate Gaussian mixture model

◦ Each component of the mixture is a multivariate Gaussian density

f_i(x) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i) \right)

◦ Parameters of the model
   weights of each component (K)
   mean vectors (Kn)
   covariance matrices (Kn(n + 1)/2)
◦ In practice, we often assume diagonal covariance matrices
   far fewer parameters (Kn << Kn(n + 1)/2)
   much less computation (saves the matrix inversion)
   can be compensated for with more components
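To make the savings concrete, here is a small sketch of a diagonal-covariance Gaussian component (NumPy assumed; the parameter values are made up): only n variances are stored instead of n(n + 1)/2 covariance entries, and no matrix inversion is needed.

    import numpy as np

    def diag_gaussian_pdf(x, mu, var):
        """Multivariate Gaussian density with diagonal covariance diag(var)."""
        x, mu, var = map(np.asarray, (x, mu, var))
        n = mu.size
        norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.prod(var))
        quad = np.sum((x - mu) ** 2 / var)   # replaces (x - mu)^T Sigma^{-1} (x - mu)
        return np.exp(-0.5 * quad) / norm_const

    # Illustrative 3-dimensional component.
    print(diag_gaussian_pdf([0.1, -0.2, 0.3], mu=[0.0, 0.0, 0.0], var=[1.0, 2.0, 0.5]))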


Gaussian mixture model

[courtesy of A. A. D’Souza and J-F. Bonastre]


Mixture models from the generative viewpoint

A sample is drawn according to the law of the component f_i() of the mixture with probability \pi_i.

Practically, sampling is a two-step process:

1. choose a component i of the mixture according to the discrete law defined by the weights
2. draw a sample according to the law f_i()

Example: w = [0.3, 0.7], f_1 = \mathcal{N}(0, 1), f_2 = \mathcal{N}(2, 1)
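A minimal sketch of this two-step sampling for the example above (NumPy assumed; the random seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([0.3, 0.7])         # mixture weights
    means = np.array([0.0, 2.0])     # f_1 = N(0, 1), f_2 = N(2, 1)
    stds = np.array([1.0, 1.0])

    def sample_mixture(n):
        # step 1: choose components according to the discrete law defined by the weights
        comp = rng.choice(len(w), size=n, p=w)
        # step 2: draw each sample from the law of its chosen component
        return rng.normal(means[comp], stds[comp]), comp

    x, z = sample_mixture(5)
    print(x, z)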

Mixture models from the generative viewpoint

Interpretation: from a generative point of view, the samples of a mixture model are drawn from one of the components of the model, in the proportions defined by the weights, i.e.

◦ for each component i, there is a set of samples distributed according to f_i(x)
◦ the proportion of such samples is given by \pi_i

⇒ hidden variable indicating the component to which a sample belongs!

Mixture models and hidden variables

The law of x is the marginal over the hidden variable Z, i.e.

f(x) = \sum_{i=1}^{K} \underbrace{\pi_i}_{P[Z = i]} \, \underbrace{f_i(x)}_{p(x \mid Z = i)}

Sampling generates the pair (Z, X); the likelihood of X marginalizes over Z:

p(X) = \sum_{Z} p(X, Z)

Mixture models and hidden variables

◦ Conditional density of x given z

p(x \mid z) = f_z(x) = \sum_{i=1}^{K} f_i(x) \, \mathbb{I}(z = i)

◦ Joint density of (x, z)

p(x, z) = \pi_z \, f_z(x) = \left( \sum_{i=1}^{K} f_i(x) \, \mathbb{I}(z = i) \right) \left( \sum_{i=1}^{K} \pi_i \, \mathbb{I}(z = i) \right)

◦ Marginal density of x

p(x) = \sum_{z} p(x, z) = \sum_{z} \sum_{i} \pi_i \, f_i(x) \, \mathbb{I}(z = i) = \sum_{i} \pi_i \, f_i(x)
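A quick numerical sanity check of these identities (NumPy/SciPy assumed, illustrative parameter values): summing the joint density p(x, z) = \pi_z f_z(x) over z recovers the mixture density.

    import numpy as np
    from scipy.stats import norm

    pi = np.array([0.3, 0.7])                                 # P[Z = i]
    means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

    def joint(x, z):
        """p(x, z) = pi_z * f_z(x)."""
        return pi[z] * norm.pdf(x, means[z], stds[z])

    x = 1.5
    marginal = sum(joint(x, z) for z in range(len(pi)))       # sum_z p(x, z)
    mixture = np.dot(pi, norm.pdf(x, means, stds))            # sum_i pi_i f_i(x)
    assert np.isclose(marginal, mixture)
    print(marginal, mixture)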

PARAMETER ESTIMATION

Maximum likelihood parameter estimation

◦ Let x = \{x_1, \ldots, x_N\} be a set of training samples from which we want to estimate the parameters of a Gaussian mixture model with K components, i.e.
   weights \{w_1, \ldots, w_K\},
   mean vectors \{\mu_1, \ldots, \mu_K\},
   variance vectors \{\sigma_1, \ldots, \sigma_K\}.

◦ Maximum likelihood criterion

\ln f(x) = \sum_{i=1}^{N} \ln\!\left( \sum_{j=1}^{K} w_j \, f_j(x_i; \theta_j) \right)

⇒ direct maximization (nearly) impossible!
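The criterion itself is cheap to evaluate; the difficulty is maximizing it in closed form because of the logarithm of a sum. A minimal evaluation sketch (NumPy/SciPy assumed, made-up parameter values):

    import numpy as np
    from scipy.stats import norm

    def gmm_log_likelihood(x, w, mu, sigma):
        """ln f(x) = sum_i ln( sum_j w_j N(x_i; mu_j, sigma_j) )."""
        x, w, mu, sigma = map(np.asarray, (x, w, mu, sigma))
        comp = norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, K)
        return np.sum(np.log(comp @ w))

    x = np.array([-0.5, 0.2, 1.9, 2.3, 2.8])
    print(gmm_log_likelihood(x, w=[0.3, 0.7], mu=[0.0, 2.0], sigma=[1.0, 1.0]))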


Direct maximum likelihood parameter estimation

◦ Directly solving the ML equations
   they often do not exist!
   complex equations when they do exist
◦ Gradient descent algorithms
   non-convex likelihood function, non-unicity of the solution
   need for prior knowledge on the domain of \theta
◦ The Expectation-Maximization algorithm
   a nice and elegant solution!

Maximum likelihood with complete data

◦ The set of training samples x is incomplete!
◦ Assume that for each sample x_i we know the component indicator z_i
◦ The set \{x_1, z_1, \ldots, x_N, z_N\} is known as the complete data
◦ Maximum likelihood estimates can be obtained from the complete data, e.g.

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}(z_j = i)
\qquad \text{and} \qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, \mathbb{I}(z_j = i)}{\sum_{j=1}^{N} \mathbb{I}(z_j = i)}

⇒ but the variables z_j are not known!
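If the indicators z_j were observed, these estimates would just be per-component frequencies and averages; a minimal sketch (NumPy assumed, made-up data):

    import numpy as np

    x = np.array([-0.4, 0.1, 0.3, 2.1, 1.8, 2.5])   # observations
    z = np.array([0, 0, 0, 1, 1, 1])                # known component indicators
    K = 2

    w_hat = np.array([np.mean(z == i) for i in range(K)])     # (1/N) sum_j I(z_j = i)
    mu_hat = np.array([x[z == i].mean() for i in range(K)])   # indicator-weighted means

    print(w_hat, mu_hat)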

The Expectation-Maximization principle

The Expectation-Maximization (EM) principle compensates for missing (aka latent) data, replacing them by their expectations.

EM iterative principle

1. estimate the missing variables given a current estimate of the parameters
2. estimate new parameters given the current estimate of the missing variables
3. repeat steps 1 and 2 until convergence

Note: this principle applies to many problems, not only for maximum likelihood parameter estimation!

The auxiliary function

The EM algorithm aims at maximizing an auxiliary function defined as

Q(\theta, \hat{\theta}) = E\!\left[ \ln f(z, x; \theta) \,\middle|\, x; \hat{\theta} \right]

where f(z, x; \theta) is the likelihood function of the complete data.

Estimation step (E)
   compute the expected quantities in Q(\theta, \hat{\theta}) (given \hat{\theta} = \theta_n)

Maximization step (M)
   maximize the auxiliary function w.r.t. the (true) parameters \theta (given the expected quantities) to obtain a new estimate \hat{\theta} = \theta_{n+1}, i.e.

\theta_{n+1} = \arg\max_{\theta} Q(\theta, \theta_n)

A simple example

◦ x = \{x_1, \ldots, x_N\} drawn from two classes with prior probabilities \pi and 1 - \pi
◦ class indicator function z = \{z_1, \ldots, z_N\}, z_i \in \{-1, 1\}
◦ joint distribution of (x, z)

\ln p(x, z; \theta) = \sum_{i=1}^{N} \left( \ln p(x_i \mid z_i; \theta) + \ln P[Z = z_i; \theta] \right)
                    = \sum_{i=1}^{N} \sum_{j \in \{-1, 1\}} \left( \ln p(x_i \mid j; \theta) + \ln P[Z = j; \theta] \right) \mathbb{I}(z_i = j)

◦ auxiliary function

Q(\theta, \theta_n) = \sum_{i=1}^{N} \sum_{j \in \{-1, 1\}} \left( \ln p(x_i \mid j; \theta) + \ln P[Z = j; \theta] \right) E[\mathbb{I}(z_i = j) \mid x; \theta_n]

A simple example: maximization equations

◦ auxiliary function (cont'd)

Q(\theta, \theta_n) = \sum_{j \in \{-1, 1\}} \sum_{i=1}^{N} \ln p(x_i; \theta(j)) \, E[\mathbb{I}(z_i = j) \mid x; \theta_n]
   + \ln(\pi) \sum_{i=1}^{N} E[\mathbb{I}(z_i = -1) \mid x; \theta_n]
   + \ln(1 - \pi) \sum_{i=1}^{N} E[\mathbb{I}(z_i = 1) \mid x; \theta_n]

◦ Maximization w.r.t. \pi

\pi_{n+1} = \frac{\sum_{i} E[\mathbb{I}(z_i = -1) \mid x; \theta_n]}{\sum_{j \in \{-1, 1\}} \sum_{i} E[\mathbb{I}(z_i = j) \mid x; \theta_n]}

◦ Maximization w.r.t. \theta(j) only depends on the expectation E[\mathbb{I}(z_i = j) \mid x; \theta_n]

Simple example: expectation computation

The maximization (M-step) of Q(\theta, \theta_n) requires the computation of the expectations (E-step)

E[\mathbb{I}(z_i = j) \mid x; \theta_n] = P[Z_i = j \mid x_i, \theta_n] = \gamma_j^{(n)}(i) ,

for i \in [1, N] and j \in \{-1, 1\}, which is given by

P[Z_i = j \mid x_i; \theta_n] = \frac{p(x_i \mid Z_i = j; \theta_n) \, P[Z_i = j; \theta_n]}{\sum_{k \in \{-1, 1\}} p(x_i \mid Z_i = k; \theta_n) \, P[Z_i = k; \theta_n]} .

In our two-class example, for j = -1 we have

p(x_i \mid Z_i = j; \theta_n) \, P[Z_i = j; \theta_n] = \pi_n \, p(x_i; \theta_n(-1))

where \pi_n and \theta_n(-1) are the current estimates of the parameters.
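A minimal sketch of this E-step for two Gaussian classes, together with the \pi update from the previous slide (NumPy/SciPy assumed; data and current parameters are made up):

    import numpy as np
    from scipy.stats import norm

    x = np.array([-0.8, -0.1, 0.3, 1.7, 2.2, 2.6])   # observations
    pi_n = 0.5                                       # current estimate of P[Z = -1]
    theta_n = {-1: (0.0, 1.0), 1: (2.0, 1.0)}        # (mean, std) of each class

    # E-step: gamma_j(i) = P[Z_i = j | x_i; theta_n]
    num = {j: (pi_n if j == -1 else 1 - pi_n) * norm.pdf(x, *theta_n[j]) for j in (-1, 1)}
    denom = num[-1] + num[1]
    gamma = {j: num[j] / denom for j in (-1, 1)}

    # M-step for the prior: pi_{n+1} = sum_i gamma_{-1}(i) / N
    pi_next = gamma[-1].sum() / len(x)
    print(pi_next)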

EM and hidden class

When the latent variable indicates membership to a class, we have the following interpretation:

◦ P[Z_i = j \mid x_i; \theta_n] is the degree (\in [0, 1]) of membership of the i-th observation to class j
◦ maximization relies on standard estimators weighted by these degrees of membership (weighted standard estimators) [see the Gaussian mixture example]

EM from an algorithmic viewpoint

1. choose some initial (good) parameters \theta_0
2. n \leftarrow 0
3. while not happy (with convergence)
   (a) for i = 1 \to N and j = 1 \to K, compute the component posterior
       \gamma_j^{(n)}(i) = P[Z_i = j \mid x_i; \theta_n]
   (b) for each parameter \alpha in \theta, compute the new value \alpha_{n+1} from the quantities \gamma_j^{(n)}(i) (by maximizing Q(\theta, \theta_n))
   (c) n \leftarrow n + 1
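This loop translates almost line by line into code. A hedged skeleton in Python, where e_step and m_step are hypothetical callables (e.g. the Gaussian-mixture versions sketched after the update formulas below):

    import numpy as np

    def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
        """Generic EM loop: alternate posterior computation and parameter updates."""
        theta, prev_ll = theta0, -np.inf
        for n in range(max_iter):                 # "while not happy"
            gamma, ll = e_step(x, theta)          # (a) component posteriors gamma_j(i)
            theta = m_step(x, gamma)              # (b) new parameters from the gammas
            if ll - prev_ll < tol:                # convergence control on the log-likelihood
                break
            prev_ll = ll
        return theta

Convergence here is controlled both by the increase of the log-likelihood and by a maximum number of iterations, as discussed on the convergence slides below.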

Properties of the EM algorithm

Property 1
The series of estimators \{\theta^{(n)}\} is such that the likelihood of the data increases with each iteration of the algorithm.

It can be shown that

Q(\theta_{i+1}, \theta_i) - Q(\theta_i, \theta_i) = \ln p(x; \theta_{i+1}) - \ln p(x; \theta_i) + \underbrace{E\!\left[ \ln \frac{p(z \mid x; \theta_{i+1})}{p(z \mid x; \theta_i)} \,\middle|\, x; \theta_i \right]}_{\leq 0}

which implies that

Q(\theta_{i+1}, \theta_i) \geq Q(\theta_i, \theta_i) \implies p(x; \theta_{i+1}) \geq p(x; \theta_i)

Properties of the EM algorithm

Property 2
The EM algorithm enables the computation of the gradient of the log-likelihood function at the points \theta_i.

It can be verified that, under some not very restrictive assumptions,

\left. \frac{\partial Q(\theta, \theta_i)}{\partial \theta} \right|_{\theta = \theta_i}
= \left. \frac{\partial \ln f(x; \theta)}{\partial \theta} \right|_{\theta = \theta_i}
+ \underbrace{\left. \frac{\partial E[\ln p(z \mid x; \theta) \mid x; \theta_i]}{\partial \theta} \right|_{\theta = \theta_i}}_{= 0}

Convergence of the EM algorithm

Property 1
The series of estimators \{\theta^{(n)}\} is such that the likelihood of the data increases with each iteration of the algorithm.

Property 2
The EM algorithm enables the computation of the gradient of the log-likelihood function at the points \theta_i.

⇒ The EM estimates converge toward stationary points of the log-likelihood function \ln p(x; \theta).

Convergence of the EM algorithm in practice

◦ Convergence is only guaranteed toward a local maximum of the likelihood function \ln p(x; \theta)
   need for a good initial guess \theta_0
   need to avoid degenerate solutions!
◦ In practice, convergence is controlled by two factors
   increase of the log-likelihood of the data
   a fixed number of iterations
◦ Constraints on the parameter space are often used to avoid bad or degenerate solutions, e.g.
   a minimum variance floor
   initialization based on the (segmental) k-means algorithm

EM for Gaussian mixtures

◦ Joint likelihood of (x, z)

\ln f(x, z) = \sum_{i=1}^{N} \sum_{j=1}^{K} \ln\!\left( w_j \, f(x_i; \mu_j, \sigma_j) \right) \mathbb{I}(z_i = j)

◦ Auxiliary function

Q(\theta, \theta_n) \propto \sum_{j=1}^{K} \sum_{i=1}^{N} \ln(w_j) \, E[\mathbb{I}(z_i = j) \mid x; \theta_n]
   - \frac{1}{2} \sum_{j=1}^{K} \sum_{i=1}^{N} \sum_{k=1}^{d} \left( \ln(\sigma_{jk}^2) + \frac{(x_{ik} - \mu_{jk})^2}{\sigma_{jk}^2} \right) E[\mathbb{I}(z_i = j) \mid x; \theta_n]

EM for Gaussian mixtures (cont'd)

◦ Compute the expectations at iteration n

\gamma_j^{(n)}(i) = \frac{w_j^{(n)} \, p(x_i; \mu_j^{(n)}, \sigma_j^{(n)})}{\sum_{k} w_k^{(n)} \, p(x_i; \mu_k^{(n)}, \sigma_k^{(n)})}

where the parameters correspond to the current estimate \theta^{(n)}.

◦ Maximization

w_j^{(n+1)} = \frac{\sum_{i=1}^{N} \gamma_j^{(n)}(i)}{\sum_{k=1}^{K} \sum_{i=1}^{N} \gamma_k^{(n)}(i)}
\qquad
\mu_{jk}^{(n+1)} = \frac{\sum_{i=1}^{N} \gamma_j^{(n)}(i) \, x_{ik}}{\sum_{i=1}^{N} \gamma_j^{(n)}(i)}
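A minimal sketch of these E and M steps for a one-dimensional Gaussian mixture, usable with the generic loop above (NumPy/SciPy assumed). The slide gives the weight and mean updates; the variance update included here follows the same weighted-estimator pattern and is added as an assumption for completeness:

    import numpy as np
    from scipy.stats import norm

    def e_step(x, theta):
        """gamma_j(i) = w_j N(x_i; mu_j, sigma_j) / sum_k w_k N(x_i; mu_k, sigma_k)."""
        w, mu, sigma = theta
        comp = w * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, K)
        gamma = comp / comp.sum(axis=1, keepdims=True)
        return gamma, np.sum(np.log(comp.sum(axis=1)))         # posteriors, log-likelihood

    def m_step(x, gamma):
        n_j = gamma.sum(axis=0)                                    # sum_i gamma_j(i)
        w = n_j / n_j.sum()                                        # weight update
        mu = (gamma * x[:, None]).sum(axis=0) / n_j                # mean update
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n_j   # assumed variance update
        return w, mu, np.sqrt(np.maximum(var, 1e-6))               # with a small variance floor

With the skeleton above, a run could look like em(np.array([-0.5, 0.2, 1.9, 2.4]), (np.ones(2) / 2, np.array([0.0, 2.0]), np.ones(2)), e_step, m_step), where the initial parameters are arbitrary guesses.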

The EM at work: initialization

[courtesy of A. W. Moore and J-F. Bonastre]


The EM at work: iteration 1

[courtesy of A. W. Moore and J-F. Bonastre]


The EM at work: iteration 2

[courtesy of A. W. Moore and J-F. Bonastre]


The EM at work: iteration 3

[courtesy of A. W. Moore and J-F. Bonastre]

The EM at work: iteration 4

[courtesy of A. W. Moore and J-F. Bonastre]


The EM at work: iteration 20

[courtesy of A. W. Moore and J-F. Bonastre]


The EM at work: another example

[courtesy J-F. Bonastre]


EM and k-means clustering

◦ Maximum likelihood estimates with known class membership

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}(z_j = i)
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, \mathbb{I}(z_j = i)}{\sum_{j=1}^{N} \mathbb{I}(z_j = i)}

◦ EM estimates with unknown class membership

\hat{w}_i = \frac{1}{N} \sum_{j=1}^{N} E[\mathbb{I}(z_j = i) \mid x, \theta_n]
\qquad
\hat{\mu}_i = \frac{\sum_{j=1}^{N} x_j \, E[\mathbb{I}(z_j = i) \mid x, \theta_n]}{\sum_{j=1}^{N} E[\mathbb{I}(z_j = i) \mid x, \theta_n]}

EM and k-means clustering

[Figure: side-by-side comparison of the EM algorithm and k-means.]
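The connection shows up directly in code: k-means amounts to EM with the soft memberships E[\mathbb{I}(z_j = i) \mid x, \theta_n] replaced by hard 0/1 assignments. A small hedged sketch of the two mean updates (NumPy assumed; the responsibilities come from any E-step such as the one sketched above):

    import numpy as np

    def em_means(x, gamma):
        """Soft EM update: means weighted by the degrees of membership."""
        return (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)

    def kmeans_means(x, gamma):
        """k-means style update: replace gamma by hard 0/1 assignments (argmax)."""
        hard = (gamma == gamma.max(axis=1, keepdims=True)).astype(float)
        return (hard * x[:, None]).sum(axis=0) / hard.sum(axis=0)

    x = np.array([-0.5, 0.2, 1.9, 2.4])
    gamma = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])  # made-up posteriors
    print(em_means(x, gamma), kmeans_means(x, gamma))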

A practical use of the Gaussian law

[Figure: a mixture of a "low energy" Gaussian and a "high energy" Gaussian fitted to an energy distribution; a decision threshold is placed at m1 − a·s1.]

EM, sufficient statistics and the exponential family

◦ Joint density from the exponential family

f(x, z; \theta) = \exp\!\left( \alpha(\theta) \, a(x, z) + b(x, z) - \beta(\theta) \right)

◦ E-step
   estimate the sufficient statistic a(x, z) by computing its expectation under the posterior law given a current estimate of the parameters

◦ Examples:

\sum_{j} \mathbb{I}(z_j = i) \longrightarrow \sum_{j} E[\mathbb{I}(z_j = i) \mid x; \theta_n]

\sum_{j} x_j \, \mathbb{I}(z_j = i) \longrightarrow \sum_{j} x_j \, E[\mathbb{I}(z_j = i) \mid x; \theta_n]

\sum_{j} (x_j - \hat{\mu}_i)^2 \, \mathbb{I}(z_j = i) \longrightarrow \sum_{j} (x_j - \hat{\mu}_i)^2 \, E[\mathbb{I}(z_j = i) \mid x; \theta_n]
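In implementation terms, the E-step of a Gaussian mixture thus reduces to accumulating these three expected statistics per component in one pass over the data; a hedged sketch (NumPy assumed; gamma holds posteriors from an E-step such as the one sketched earlier):

    import numpy as np

    def accumulate_stats(x, gamma):
        """Expected sufficient statistics of a 1-D Gaussian mixture, per component."""
        n_i = gamma.sum(axis=0)                                   # sum_j E[I(z_j = i) | x]
        s1_i = (gamma * x[:, None]).sum(axis=0)                   # sum_j x_j E[I(z_j = i) | x]
        mu_i = s1_i / n_i
        s2_i = (gamma * (x[:, None] - mu_i) ** 2).sum(axis=0)     # sum_j (x_j - mu_i)^2 E[...]
        return n_i, s1_i, s2_i

    x = np.array([-0.5, 0.2, 1.9, 2.4])
    gamma = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])  # made-up posteriors
    n_i, s1_i, s2_i = accumulate_stats(x, gamma)
    print(n_i / len(x), s1_i / n_i, s2_i / n_i)   # weights, means, variances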

Variants

The EM principle enables many variants when the E-step and/or the M-step are intractable:

◦ Monte-Carlo EM: replace the exact computation of the expected quantities by Monte-Carlo approximations obtained using the current parameters
◦ Generalized EM: simply increase the auxiliary function rather than maximizing it, e.g. using a gradient algorithm
◦ Variational EM: replace the auxiliary function Q by a simpler variational approximation based on a factorial distribution, Q \simeq \prod_i Q_i
◦ . . .

Choosing the number of components

◦ Experimentation...
◦ Information criteria of the form

I(x, \theta) = \ln p(x; \theta) - g(\#\text{parameters}, \#\text{data})

   Akaike information criterion (AIC)
   Bayesian information criterion (BIC)
   . . .
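As an illustration, a hedged sketch of BIC-style selection for a one-dimensional Gaussian mixture (the log-likelihood values and the parameter counting are made up; the penalty follows the usual BIC form of g):

    import numpy as np

    def bic_score(log_likelihood, n_params, n_data):
        """ln p(x; theta) - (n_params / 2) * ln(n_data): pick the K that maximizes it."""
        return log_likelihood - 0.5 * n_params * np.log(n_data)

    N = 500
    # Hypothetical fitted log-likelihoods for K = 1..4 components.
    for K, ll in zip(range(1, 5), [-820.0, -765.0, -742.0, -739.0]):
        n_params = (K - 1) + K + K        # free weights, means and variances of a 1-D GMM
        print(K, round(bic_score(ll, n_params, N), 1))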


Bibliography

◦ A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM algorithm. Journal of the Royal Statistical Society. Series B, Vol. 39, No. 1, pp. 1–38, 1977.

◦ T. K. Moon. The Expectation-Maximization Algorithm. IEEE Signal Processing Magazine, pp. 46–60, November, 1996.

◦ G. J. McLachlan, T. Krishnan. The EM algorithm and extensions. Wiley Series in Probability and Statistics, 1997.

