An R-package for Finite Mixture Models

(1)

An R-package for finite mixture models

Jang SCHILTZ (University of Luxembourg) joint work with

Jean-Daniel GUIGOU (University of Luxembourg),

& Bruno LOVAT (University of Lorraine)

December 7, 2013

(3)

Outline

1 The Basic Finite Mixture Model of Nagin

2 Generalizations of the basic model

3 Muthén’s model

4 Research Agenda

(8)

General description of Nagin’s model

We have a collection of individual trajectories.

We try to divide the population into a number of homogeneous sub-populations and to estimate a mean trajectory for each sub-population.

Hence, this model can be interpreted as functional fuzzy cluster analysis.

(11)

The Likelihood Function (1)

Consider a population of size $N$ and a variable of interest $Y$.

Let $Y_i = (y_{i1}, y_{i2}, \dots, y_{iT})$ be $T$ measures of the variable, taken at times $t_1, \dots, t_T$ for subject number $i$.

$P(Y_i)$ denotes the probability of $Y_i$:

count data ⇒ Poisson distribution

binary data ⇒ Binary logit distribution

censored data ⇒ Censored normal distribution

Aim of the analysis: Find $r$ groups of trajectories of a given kind (for instance polynomials of degree 4, $P(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \beta_4 t^4$).

(18)

The Likelihood Function (2)

$\pi_j$: probability that a given subject belongs to group number $j$

⇒ $\pi_j$ is the size of group $j$.

We try to estimate a set of parameters $\Omega = \{\beta_0^j, \beta_1^j, \beta_2^j, \beta_3^j, \beta_4^j, \pi_j\}$ which maximizes the probability of the measured data.

$P^j(Y_i)$: probability of $Y_i$ if subject $i$ belongs to group $j$

$$\Rightarrow\quad P(Y_i) = \sum_{j=1}^{r} \pi_j P^j(Y_i). \tag{1}$$

Finite mixture model, Daniel S. Nagin (Carnegie Mellon University):

finite: sums across a finite number of groups

mixture: population composed of a mixture of unobserved groups

(26)

The Likelihood Function (3)

Hypothesis: In a given group, conditional independence is assumed for the sequential realizations of the elements of $Y_i$!

$$\Rightarrow\quad P^j(Y_i) = \prod_{t=1}^{T} p^j(y_{it}), \tag{2}$$

where $p^j(y_{it})$ denotes the probability of $y_{it}$ given membership in group $j$.

Likelihood of the estimator:

$$L = \prod_{i=1}^{N} P(Y_i) = \prod_{i=1}^{N} \sum_{j=1}^{r} \pi_j \prod_{t=1}^{T} p^j(y_{it}). \tag{3}$$
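A minimal Python sketch of how the likelihood (3) could be evaluated on the log scale. It is an illustration only, not code from any of the packages discussed in this talk: the function name, the `point_prob` callback and the toy Bernoulli groups are assumptions made for the example.

```python
import math

def mixture_log_likelihood(data, pis, point_prob):
    """Return log L = sum_i log( sum_j pi_j * prod_t p^j(y_it) ), the log of (3).

    data       : list of N trajectories, each a list of T observations
    pis        : list of r group sizes pi_j (non-negative, summing to 1)
    point_prob : hypothetical callback point_prob(j, t, y) giving p^j(y_it)
    """
    total = 0.0
    for trajectory in data:
        p_i = 0.0
        for j, pi_j in enumerate(pis):
            # conditional independence within group j: product over time, eq. (2)
            p_ij = 1.0
            for t, y in enumerate(trajectory):
                p_ij *= point_prob(j, t, y)
            p_i += pi_j * p_ij          # mixture over groups, eq. (1)
        total += math.log(p_i)
    return total

# Toy example (assumed): binary data, two groups with constant success rates.
rates = [0.2, 0.8]

def bernoulli(j, t, y):
    return rates[j] if y == 1 else 1.0 - rates[j]

toy = [[1, 1, 0], [0, 0, 0]]
print(mixture_log_likelihood(toy, [0.5, 0.5], bernoulli))
```

For long trajectories the inner product can underflow, so a practical implementation would probably accumulate log-probabilities instead.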

(31)

The case of a normal distribution (1)

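$$y_{it} = \beta_0^j + \beta_1^j \mathrm{Age}_{it} + \beta_2^j \mathrm{Age}_{it}^2 + \beta_3^j \mathrm{Age}_{it}^3 + \beta_4^j \mathrm{Age}_{it}^4 + \varepsilon_{it}, \tag{4}$$

where $\varepsilon_{it} \sim N(0, \sigma)$, $\sigma$ being a constant standard deviation.

Notations:

$\beta^j t_{it} = \beta_0^j + \beta_1^j \mathrm{Age}_{it} + \beta_2^j \mathrm{Age}_{it}^2 + \beta_3^j \mathrm{Age}_{it}^3 + \beta_4^j \mathrm{Age}_{it}^4$.

$\varphi$: density of the standard normal distribution.

Hence,

$$p^j(y_{it}) = \frac{1}{\sigma}\,\varphi\!\left(\frac{y_{it} - \beta^j t_{it}}{\sigma}\right). \tag{5}$$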
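As an illustration of (4) and (5), the following Python sketch evaluates a polynomial trajectory and the resulting normal density of one observation. The function names and the numbers in the example are assumptions chosen for the illustration; they do not come from the talk.

```python
import math

def trajectory(beta, age):
    """Evaluate beta[0] + beta[1]*age + beta[2]*age**2 + ... (Horner scheme)."""
    value = 0.0
    for b in reversed(beta):
        value = value * age + b
    return value

def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def point_density(y, age, beta, sigma):
    """p^j(y_it) = (1/sigma) * phi((y_it - beta^j t_it) / sigma), as in (5)."""
    return std_normal_pdf((y - trajectory(beta, age)) / sigma) / sigma

# Assumed example: quadratic trajectory 1 + 2*age + 3*age^2 evaluated at age 2.
print(trajectory([1.0, 2.0, 3.0], 2.0))
print(point_density(17.0, 2.0, [1.0, 2.0, 3.0], 2.0))
```

The coefficient list may have any length, so the same sketch covers the degree-4 polynomials used above as well as lower degrees.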

(37)

The case of a normal distribution (2)

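So we get

$$L = \frac{1}{\sigma^{NT}} \prod_{i=1}^{N} \sum_{j=1}^{r} \pi_j \prod_{t=1}^{T} \varphi\!\left(\frac{y_{it} - \beta^j t_{it}}{\sigma}\right), \tag{6}$$

since each of the $NT$ factors $p^j(y_{it})$ contributes one factor $1/\sigma$.

The estimates of $\pi_j$ must lie in $[0, 1]$.

It is difficult to enforce this constraint in model estimation.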

(41)

A computational trick

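Instead, we estimate the real parameters $\theta_j$ such that

$$\pi_j = \frac{e^{\theta_j}}{\sum_{k=1}^{r} e^{\theta_k}}. \tag{7}$$

Finally,

$$L = \frac{1}{\sigma^{NT}} \prod_{i=1}^{N} \sum_{j=1}^{r} \frac{e^{\theta_j}}{\sum_{k=1}^{r} e^{\theta_k}} \prod_{t=1}^{T} \varphi\!\left(\frac{y_{it} - \beta^j t_{it}}{\sigma}\right). \tag{8}$$

This likelihood is too complicated to yield closed-form equations for the estimates.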
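The reparameterization (7) can be sketched in a few lines of Python. The function name is an assumption; subtracting the largest parameter before exponentiating is a standard numerical safeguard that leaves the result unchanged.

```python
import math

def group_sizes(thetas):
    """Map unconstrained real parameters theta_j to group sizes pi_j, eq. (7)."""
    shift = max(thetas)                       # avoids overflow in exp
    weights = [math.exp(t - shift) for t in thetas]
    total = sum(weights)
    return [w / total for w in weights]

print(group_sizes([0.0, 0.0]))
print(group_sizes([-1000.0, 3.0, 1000.0]))
```

Whatever real values the optimizer proposes for the parameters, the resulting group sizes lie in [0, 1] and sum to 1, so no explicit constraint is needed. Adding the same constant to every parameter gives the same group sizes, so in practice one of them is usually fixed (for instance at 0).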

(45)

Available Software

SAS-based Proc Traj procedure

By Bobby L. Jones (Carnegie Mellon University).

Uses a quasi-Newton maximum search routine.

Since the likelihood is neither convex nor a contraction, there are issues with local maxima.

R-package crimCV

By Jason D. Nielsen (Carleton University Ottawa).

Just implements a zero-inflated Poisson model.

(51)

Future Software

R-package FMM

By Jang Schiltz & Mounir Shal (University of Luxembourg).

Uses the EM algorithm.

Allows the estimation of a generalised version of Nagin’s model, as well as Muthén’s model.

Completion will probably take us another year.
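To give an idea of what an EM iteration looks like for this kind of model, here is a small Python sketch for a deliberately simplified special case: every group trajectory is a constant (a polynomial of degree 0) and all groups share one standard deviation. It is not the code of the FMM package; the function name, the starting values and the toy data are assumptions made for the illustration.

```python
import math

def em_constant_trajectories(data, mus, pis, sigma, n_iter=50):
    """EM for a normal mixture with constant group trajectories mu_j
    and a common standard deviation sigma (simplified special case)."""
    n, T, r = len(data), len(data[0]), len(mus)
    for _ in range(n_iter):
        # E-step: posterior weights w[i][j] proportional to pi_j * prod_t p^j(y_it).
        # Terms shared by all groups cancel in the normalization and are omitted.
        w = []
        for traj in data:
            logs = [math.log(pis[j])
                    - sum((y - mus[j]) ** 2 for y in traj) / (2.0 * sigma ** 2)
                    for j in range(r)]
            top = max(logs)
            e = [math.exp(v - top) for v in logs]
            s = sum(e)
            w.append([v / s for v in e])
        # M-step: weighted updates of group sizes, group means and sigma.
        nj = [sum(w[i][j] for i in range(n)) for j in range(r)]
        pis = [x / n for x in nj]
        mus = [sum(w[i][j] * sum(data[i]) for i in range(n)) / (T * nj[j])
               for j in range(r)]
        ss = sum(w[i][j] * (y - mus[j]) ** 2
                 for i in range(n) for j in range(r) for y in data[i])
        sigma = math.sqrt(ss / (n * T))
    return mus, pis, sigma

# Toy data (assumed): two subjects around 1 and three subjects around 5.
toy = [[0.9, 1.0, 1.1], [1.1, 1.0, 0.9],
       [4.9, 5.0, 5.1], [5.1, 5.0, 4.9], [5.0, 5.2, 4.8]]
print(em_constant_trajectories(toy, mus=[0.0, 6.0], pis=[0.5, 0.5], sigma=1.0))
```

The sketch has no safeguards against empty groups or a vanishing standard deviation, and like any EM procedure it can stop at a local maximum, so several starting values would normally be tried. For polynomial trajectories the update of the group means becomes a weighted least squares fit.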

(56)

Model Selection (1)

Bayesian Information Criterion:

$$\mathrm{BIC} = \log(L) - 0.5\,k \log(N), \tag{9}$$

where $k$ denotes the number of parameters in the model.

Rule:

The bigger the BIC, the better the model!
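Formula (9) translates directly into Python. The function name is an assumption, and the log-likelihoods and parameter counts in the comparison are made-up numbers that only show how the rule is applied.

```python
import math

def bic(log_likelihood, k, n):
    """BIC = log(L) - 0.5 * k * log(N), as in (9); larger values are better."""
    return log_likelihood - 0.5 * k * math.log(n)

# Illustrative (made-up) comparison of two candidate models on N = 500 subjects.
candidates = {"3 groups": (-1250.0, 17), "4 groups": (-1235.0, 23)}
scores = {name: bic(ll, k, 500) for name, (ll, k) in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With this sign convention the penalty is subtracted from the log-likelihood, which is why the larger value wins. Many software packages report BIC with the opposite sign and a factor of 2, where smaller is better.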

(59)

Model Selection (2)

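Leave-one-out Cross-Validation Approach:

$$\mathrm{CVE} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{T} \sum_{t=1}^{T} \left| y_{it} - \hat{y}_{it}^{[-i]} \right|, \tag{10}$$

where $\hat{y}_{it}^{[-i]}$ is the prediction for $y_{it}$ from the model fitted without subject $i$.

Rule:

The smaller the CVE, the better the model!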
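A short Python sketch of the cross-validation error (10), assuming the deviation between observation and prediction is taken in absolute value. The function name and the toy numbers are assumptions. The sketch only averages the errors; producing the leave-one-out predictions requires refitting the model once per subject, which is not shown here.

```python
def cross_validation_error(observed, predicted_loo):
    """CVE: mean over subjects of the mean absolute prediction error over time.

    observed      : list of N trajectories y_i
    predicted_loo : list of N predicted trajectories, where entry i comes from
                    a model fitted without subject i (assumed precomputed)
    """
    total = 0.0
    for y_i, yhat_i in zip(observed, predicted_loo):
        total += sum(abs(y - yh) for y, yh in zip(y_i, yhat_i)) / len(y_i)
    return total / len(observed)

# Toy numbers (assumed): two subjects, two time points each.
print(cross_validation_error([[1, 2], [3, 4]], [[1, 3], [2, 4]]))
```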

(62)

Their "proof" that CVE is better than BIC

(64)

Posterior Group-Membership Probabilities

Posterior probability of individual $i$’s membership in group $j$: $P(j \mid Y_i)$.

Bayes’s theorem

$$\Rightarrow\quad P(j \mid Y_i) = \frac{P(Y_i \mid j)\,\hat{\pi}_j}{\sum_{k=1}^{r} P(Y_i \mid k)\,\hat{\pi}_k}. \tag{11}$$

Bigger groups have on average larger probability estimates.

To be classified into a small group, an individual really needs to be strongly consistent with it.
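The effect of group size on the posterior probabilities (11) can be checked with a small Python sketch. The function name and the numbers are assumptions; the computation works with log-likelihoods to avoid underflow.

```python
import math

def posterior_probabilities(log_p_given_group, pi_hat):
    """P(j | Y_i) from log P(Y_i | j) and estimated group sizes, eq. (11)."""
    logs = [lp + math.log(p) for lp, p in zip(log_p_given_group, pi_hat)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

pi_hat = [0.9, 0.1]   # one big and one small group (assumed values)

# Data equally likely under both groups: the posterior equals the group sizes.
print(posterior_probabilities([math.log(0.1), math.log(0.1)], pi_hat))

# Data nine times more likely under the small group: only now is it a tie.
print(posterior_probabilities([math.log(0.01), math.log(0.09)], pi_hat))
```

In the second call the likelihood under the small group has to be nine times larger just to compensate for the difference in group sizes.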

(69)

Model Fit (1)

Diagnostic 1: Average Posterior Probability of Assignment

AvePP should be at least 0.7 for all groups.

Diagnostic 2: Odds of Correct Classification

$$\mathrm{OCC}_j = \frac{\mathrm{AvePP}_j / (1 - \mathrm{AvePP}_j)}{\hat{\pi}_j / (1 - \hat{\pi}_j)}. \tag{12}$$

$\mathrm{OCC}_j$ should be greater than 5 for all groups.
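Both diagnostics can be computed from the matrix of posterior probabilities. The Python sketch below assumes the usual convention that each individual is assigned to the group with the highest posterior probability and that AvePP for a group is the average of that probability over the individuals assigned to it. The function names and the toy numbers are assumptions.

```python
def odds(p):
    """Odds p / (1 - p), with infinity for p = 1."""
    return float("inf") if p >= 1.0 else p / (1.0 - p)

def fit_diagnostics(posteriors, pi_hat):
    """Return (AvePP, OCC) per group; posteriors[i][j] = P(j | Y_i)."""
    r = len(pi_hat)
    sums, counts = [0.0] * r, [0] * r
    for row in posteriors:
        j = max(range(r), key=lambda g: row[g])   # maximum-probability rule
        sums[j] += row[j]
        counts[j] += 1
    avepp = [sums[j] / counts[j] if counts[j] else float("nan") for j in range(r)]
    occ = [odds(a) / odds(p) for a, p in zip(avepp, pi_hat)]   # eq. (12)
    return avepp, occ

# Toy posteriors (assumed) for three individuals and two groups.
post = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
avepp, occ = fit_diagnostics(post, [0.6, 0.4])
print(avepp, occ)
```

In this toy example both groups reach the 0.7 threshold for AvePP, but neither reaches an OCC of 5, so the classification would not be considered sharp enough. The sketch assumes every estimated group size lies strictly between 0 and 1.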

(73)

Model Fit (2)

Diagnostic 3: Comparing $\hat{\pi}_j$ to the Proportion of the Sample Assigned to Group $j$

The ratio of the two should be close to 1.

Diagnostic 4: Confidence Intervals for Group Membership Probabilities

The confidence intervals for the group membership probability estimates should be narrow, i.e. the standard deviation of $\hat{\pi}_j$ should be small.
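Diagnostic 3 can be sketched in Python as follows, assuming individuals are assigned to the group with the highest posterior probability and taking the ratio as assigned proportion divided by estimated group size. The function name and the toy numbers are assumptions.

```python
def assignment_ratio(posteriors, pi_hat):
    """Ratio (proportion assigned to group j) / pi_hat_j for each group j."""
    r = len(pi_hat)
    counts = [0] * r
    for row in posteriors:
        counts[max(range(r), key=lambda g: row[g])] += 1
    n = len(posteriors)
    return [(counts[j] / n) / pi_hat[j] for j in range(r)]

# Toy posteriors (assumed): three individuals go to group 0, one to group 1.
post = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
print(assignment_ratio(post, [0.75, 0.25]))
```

Ratios far from 1 indicate that the hard assignment and the estimated group sizes disagree, which points to poorly separated groups.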

(75)

An application example

The data: first dataset

Salaries of workers in the private sector in Luxembourg from 1940 to 2006.

About 7 million salary lines corresponding to 718,054 workers.

Some sociological variables:

gender (male, female)

nationality and residency (Luxembourgish residents, foreign residents, foreign non-residents)

working status (white collar worker, blue collar worker)

year of birth

age in the first year of professional activity

(83)

Result for 9 groups (dataset 1)
