Dealing with missing data in model-based clustering through a MNAR model

(1)

HAL Id: hal-02103347

https://hal.inria.fr/hal-02103347

Submitted on 18 Apr 2019


Christophe Biernacki, Gilles Celeux, Julie Josse, Fabien Laporte

To cite this version:

Christophe Biernacki, Gilles Celeux, Julie Josse, Fabien Laporte. Dealing with missing data in model-based clustering through a MNAR model. CRoNos & MDA 2019 - Meeting and Workshop on Multivariate Data Analysis and Software, Apr 2019, Limassol, Cyprus. hal-02103347

(2)


Final CRoNoS meeting and Workshop on Multivariate Data Analysis and Software 14-16 April 2019, Limassol, Cyprus

(3)

Take home message

1 The missing data pattern may convey some information on clustering

2 Embed the missingness mechanism directly within the clustering modeling step

(4)

Outline

1 Introduction

2 A model-based MNAR clustering approach

3 Inference procedures

4 Medical study illustration

5 Concluding remarks

(5)

Missing data: an inevitable event

The larger the datasets, the more missing data may appear. . .

Two traditional solutions (for obtaining a filled dataset)

Discard individuals with missing data: expect to add variance into the analysis

Impute missing data: expect to add modeling bias into the analysis

General guidelines

Obtaining a non-missing dataset is not the final goal

Missing data management should take into account the initial analysis target

Our analysis target: model-based clustering

Embed missing data management into this paradigm. . .

(6)

Missing data: notations

$y = \{y_1, \ldots, y_n\}$: full dataset with $n$ individuals

$y_i = (y_i^1, \ldots, y_i^d) \in \mathbb{R}^d$: full individual $i \in \{1, \ldots, n\}$

$c = \{c_1, \ldots, c_n\}$: pattern of missing data for the full dataset

$c_i = (c_i^1, \ldots, c_i^d) \in \{0,1\}^d$: pattern of missing data for individual $i \in \{1, \ldots, n\}$

$c_i^j = 1 \Leftrightarrow y_i^j$ is missing

$o_i = \{j : c_i^j = 0\}$: indexes of the observed variables for individual $i$

$y_i^{o_i}$: values of the observed variables for individual $i$

$y^o = \{y_1^{o_1}, \ldots, y_n^{o_n}\}$: the observed values in $y$

$m_i = \{j : c_i^j = 1\}$: indexes of the missing variables for individual $i$

$y_i^{m_i}$: values of the missing variables for individual $i$

$y^m = \{y_1^{m_1}, \ldots, y_n^{m_n}\}$: the missing values in $y$

$y = \{y^o, y^m\}$ is the full dataset with its observed and missing parts
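As a minimal sketch of this notation, the pattern c and the index sets o_i and m_i can be derived from a data matrix in which missing entries are coded as NaN. The toy values and the variable names are illustrative assumptions, and indexes are 0-based in the code while the slides count from 1.

```python
import numpy as np

# Toy dataset (hypothetical values): n = 3 individuals, d = 3 variables,
# with np.nan marking a missing entry.
y = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 0.5],
              [2.0, 4.0, 6.0]])

# c[i, j] = 1 if y[i, j] is missing, 0 otherwise.
c = np.isnan(y).astype(int)

# o[i] and m[i]: indexes of the observed and missing variables of individual i.
o = [np.flatnonzero(c_i == 0) for c_i in c]
m = [np.flatnonzero(c_i == 1) for c_i in c]

# Observed values of each individual, y_i^{o_i}.
y_obs = [y_i[o_i] for y_i, o_i in zip(y, o)]
```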

(7)

Missing data: typology of the missing mechanisms

Missing completely at random (MCAR):

$$P(c \mid y; \psi) = P(c; \psi) \quad \forall y$$

Missing at random (MAR):

$$P(c \mid y; \psi) = P(c \mid y^o; \psi) \quad \forall y^m$$

Missing not at random (MNAR): the mechanism is neither MCAR nor MAR
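The three mechanisms can be illustrated by simulation. The sketch below masks the second column of a two-column dataset. The logistic form and the coefficients a0 and a1 are illustrative assumptions, not values from the slides. The first column is kept fully observed so that the MAR mechanism has an observed variable to depend on.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate_patterns(y, rng, a0=-1.0, a1=2.0):
    """Draw missingness indicators for column 1 of y under three mechanisms.

    Column 0 is assumed fully observed, so the MAR mechanism may depend on it.
    """
    u = rng.uniform(size=y.shape[0])
    c_mcar = (u < sigmoid(a0)).astype(int)                  # ignores y entirely
    c_mar = (u < sigmoid(a0 + a1 * y[:, 0])).astype(int)    # depends on the observed column
    c_mnar = (u < sigmoid(a0 + a1 * y[:, 1])).astype(int)   # depends on the masked value itself
    return c_mcar, c_mar, c_mnar

rng = np.random.default_rng(0)
y = rng.normal(size=(10_000, 2))
c_mcar, c_mar, c_mnar = simulate_patterns(y, rng)
```

Only under the MNAR draw is the indicator correlated with the value it hides, which is what makes the observed part of that column unrepresentative.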

(8)

Clustering: model-based approach

Partition with $K$ clusters: $z = (z_1, \ldots, z_n)$ where $z_i = (z_i^1, \ldots, z_i^K) \in \{0,1\}^K$

$z_i^k = 1$ if $y_i$ belongs to cluster $k$, $z_i^k = 0$ otherwise

Gaussian mixture: $y_1, \ldots, y_n$ are i.i.d. from the mixture

$$f(y_i; \theta) = \sum_{k=1}^{K} \pi_k \phi_k(y_i; \theta_k)$$

where:

$\pi_k = P(z_i^k = 1)$

$\phi_k(\cdot; \theta_k)$: $d$-variate Gaussian pdf with mean vector and covariance matrix $\theta_k = (\mu_k, \Sigma_k)$, $d$-variate multinomial pdf with probability vector $\theta_k = p_k$, or mixed pdf.

$\theta = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$
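For the Gaussian case, the mixture density can be evaluated as in the following sketch. The function names are hypothetical and only NumPy is assumed. Each component density is computed from its mean vector and covariance matrix, then weighted by the proportions.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """d-variate Gaussian density evaluated at each row of y."""
    d = mu.shape[0]
    diff = y - mu
    # Quadratic form (y - mu)^T Sigma^{-1} (y - mu), row by row.
    quad = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return np.exp(-0.5 * (quad + logdet + d * np.log(2.0 * np.pi)))

def mixture_pdf(y, pi, mus, sigmas):
    """f(y_i; theta) = sum_k pi_k * phi_k(y_i; mu_k, Sigma_k) for each row y_i."""
    comps = np.column_stack([gaussian_pdf(y, mu, sig) for mu, sig in zip(mus, sigmas)])
    return comps @ np.asarray(pi)
```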

Question we address in this work

Which distribution $P(c \mid y, z; \psi)$ should be proposed in this clustering context?


(10)

Logistic model: a natural and flexible candidate

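$$P(c \mid y, z; \psi) = \prod_{i=1}^{n} \prod_{j=1}^{d} P(c_i^j \mid y, z; \psi)$$

MCAR, with $\psi = \alpha_0$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0$$

MNARz (MNARz$^j$), with $\psi = (\alpha_0, \beta_1^{1 \ldots d}, \ldots, \beta_K^{1 \ldots d})$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \sum_{k=1}^{K} \beta_k^j z_i^k$$

MNARy, with $\psi = (\alpha_0, \alpha_1, \ldots, \alpha_d)$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \alpha_j y_i^j$$

MNARyz, with $\psi = (\alpha_0, \alpha_1, \ldots, \alpha_d, \beta_1, \ldots, \beta_K)$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \alpha_j y_i^j + \sum_{k=1}^{K} \beta_k z_i^k$$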
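The four logistic models differ only in which terms enter the linear predictor, so a single function can cover them. This is a sketch with a hypothetical signature: leaving `alpha` and `beta` unset gives MCAR, setting only `beta` gives MNARz (a `(K, d)` array gives the MNARz$^j$ variant), setting only `alpha` gives MNARy, and setting both gives MNARyz.

```python
import numpy as np

def missing_prob(y, z, alpha0, alpha=None, beta=None):
    """P(c_i^j = 1 | y, z; psi) as an (n, d) array.

    y: (n, d) complete data, z: (n, K) one-hot partition,
    alpha: (d,) slopes on y or None, beta: (K,) or (K, d) cluster effects or None.
    """
    n, d = y.shape
    eta = np.full((n, d), float(alpha0))      # intercept alpha_0
    if alpha is not None:
        eta += y * np.asarray(alpha)          # alpha_j * y_i^j
    if beta is not None:
        beta = np.asarray(beta, dtype=float)
        if beta.ndim == 1:
            beta = beta[:, None]              # same effect for every variable j
        eta += z @ beta                       # sum_k beta_k z_i^k
    return 1.0 / (1.0 + np.exp(-eta))
```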

(11)

MNARz analysis: it depends on y through z!

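$$P(c_i^j = 1 \mid y; \theta, \psi) = \sum_{k=1}^{K} P(c_i^j = 1 \mid y, z_i^k = 1; \psi)\, P(z_i^k = 1 \mid y; \theta)$$

Example of a univariate Gaussian model with the three components

$$0.2\,\mathcal{N}(\cdot; 0, 1) + 0.3\,\mathcal{N}(\cdot; 1, 2) + 0.5\,\mathcal{N}(\cdot; 2, 3)$$

and with parameters of the logit expression: $\alpha_0 = 1$, $\beta_1 = 1$, $\beta_2 = -1$, $\beta_3 = 1$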
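The example can be evaluated numerically as in the sketch below. It assumes that the second parameter of each N(·; ·, ·) is a variance, which the slide does not state, and the function name is hypothetical. Under MNARz the probability of missingness is constant within a cluster, but it still varies with y once z is integrated out through the posterior membership probabilities.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
mu = np.array([0.0, 1.0, 2.0])
var = np.array([1.0, 2.0, 3.0])      # assumption: second parameter read as a variance
alpha0 = 1.0
beta = np.array([1.0, -1.0, 1.0])

def prob_missing_given_y(y):
    """P(c = 1 | y) = sum_k P(c = 1 | z^k = 1) * P(z^k = 1 | y)."""
    y = np.atleast_1d(y).astype(float)[:, None]
    dens = pi * np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    post = dens / dens.sum(axis=1, keepdims=True)        # P(z^k = 1 | y)
    p_k = 1.0 / (1.0 + np.exp(-(alpha0 + beta)))         # P(c = 1 | z^k = 1)
    return post @ p_k

grid = np.linspace(-5.0, 8.0, 50)
p_grid = prob_missing_given_y(grid)
```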

(12)

MNARz analysis: pattern c gives information on partition z!

Bayes error of a MNARz model with two components and 20% of missing data

$\pi_k = 0.5$, $\lVert \mu_2 - \mu_1 \rVert$ varies, $\Sigma_1 = \Sigma_2 = I$, $|\beta_2 - \beta_1|$ varies

[Figure "Distance along one variable": rate of good classification (0.5 to 1.0) against the distance between centers (0.25 to 3), one curve per value of $\lVert \beta \rVert_2$ in {0, 0.4, 0.8, 1.2}]

Both $\mu_k$ and $\beta_k$ act on the Bayes error


(14)

Ignorable vs. non ignorable model

A missing mechanism is ignorable if the likelihood can be decomposed as

$$L(\theta, \psi; \underbrace{y^o, c}_{\text{observed data}}) = L(\psi; c \mid y^o) \times L(\theta; y^o)$$

Some simple algebra shows that this occurs when the missing mechanism is not MNAR

Inference of θ

“If the missing mechanism is ignorable then likelihood-based inferences for θ from $L(\theta; y^o)$ will be the same as likelihood based inference for θ from $L(\theta, \psi; y^o, c)$.”

([Little and Rubin, 2002] Section 6.2)

MCAR is ignorable

MNARz, MNARy and MNARyz are non ignorable
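The "simple algebra" can be spelled out as follows. This is a sketch of the standard argument, written for continuous $y^m$ and assuming that θ and ψ are distinct parameters. It is not taken from the slides.

```latex
\begin{align*}
L(\theta, \psi; y^o, c)
  &= \int f(y^o, y^m; \theta)\, P(c \mid y^o, y^m; \psi)\, dy^m \\
  &= P(c \mid y^o; \psi) \int f(y^o, y^m; \theta)\, dy^m
     && \text{(MAR: } P(c \mid y; \psi) = P(c \mid y^o; \psi) \text{)} \\
  &= L(\psi; c \mid y^o) \times L(\theta; y^o)
\end{align*}
```

Under MNARy the factor $P(c \mid y^o, y^m; \psi)$ cannot leave the integral, and under MNARz it depends on $z$, whose distribution involves θ, so the factorization fails in both cases.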

(15)

EM algorithm: looks simple

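Decomposition of $Q(\theta, \psi; \hat\theta, \hat\psi)$

The expected complete log-likelihood, conditional on the observed data, is:

$$E\left[ L(\theta, \psi; c, y, z) \mid y^o, c; \hat\theta, \hat\psi \right] = Q_y(\theta; \hat\theta, \hat\psi) + Q_c(\psi; \hat\theta, \hat\psi)$$

$$Q_y(\theta; \hat\theta, \hat\psi) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_i^k\, E\left[ \log\left(\pi_k \phi_k(y_i; \theta_k)\right) \mid y_i^{o_i}, c_i; \hat\theta, \hat\psi \right]$$

$$Q_c(\psi; \hat\theta, \hat\psi) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_i^k\, E\left[ \log P(c_i \mid z_i^k = 1, y_i; \theta, \psi) \mid y_i^{o_i}, c_i; \hat\theta, \hat\psi \right]$$

$$\tau_i^k = P(z_i^k = 1 \mid c_i, y_i^{o_i}; \hat\theta, \hat\psi) = \frac{\hat\pi_k\, \phi_k(y_i^{o_i}; \theta_k^{o_i})\, P(c_i \mid z_i^k = 1, y_i^{o_i}; \hat\psi)}{\sum_{h=1}^{K} \hat\pi_h\, \phi_h(y_i^{o_i}; \theta_h^{o_i})\, P(c_i \mid z_i^h = 1, y_i^{o_i}; \hat\psi)}$$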
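The responsibilities $\tau_i^k$ can be computed as in the following sketch for the MNARz model. Under MNARz, $P(c_i \mid z_i^k = 1)$ is a product of Bernoulli terms with logit $\alpha_0 + \beta_k$. Two choices here are assumptions made for brevity and do not come from the slides: the covariance matrices are diagonal, so the density of the observed part is a product over observed variables, and the computation is done on the log scale. The function name is hypothetical.

```python
import numpy as np

def e_step_mnarz(y, pi, mu, var, alpha0, beta):
    """tau[i, k] for a diagonal Gaussian mixture under MNARz.

    y: (n, d) with np.nan for missing entries, pi: (K,),
    mu, var: (K, d) means and variances, beta: (K,) cluster effects.
    """
    c = np.isnan(y)                                   # missingness pattern
    y0 = np.where(c, 0.0, y)                          # placeholder for missing entries
    # Univariate Gaussian log-densities, shape (n, K, d).
    log_phi = -0.5 * (np.log(2.0 * np.pi * var)[None]
                      + (y0[:, None, :] - mu[None]) ** 2 / var[None])
    # Keep observed variables only: log phi_k(y_i^{o_i}), shape (n, K).
    log_phi = np.where(c[:, None, :], 0.0, log_phi).sum(axis=2)
    # log P(c_i | z_i^k = 1): Bernoulli terms with logit alpha0 + beta_k.
    eta = alpha0 + np.asarray(beta, dtype=float)
    log_p1 = -np.log1p(np.exp(-eta))                  # log P(c = 1 | k)
    log_p0 = -np.log1p(np.exp(eta))                   # log P(c = 0 | k)
    n_miss = c.sum(axis=1)[:, None]
    log_pc = n_miss * log_p1 + (y.shape[1] - n_miss) * log_p0
    log_tau = np.log(pi) + log_phi + log_pc
    log_tau -= log_tau.max(axis=1, keepdims=True)     # numerical stabilisation
    tau = np.exp(log_tau)
    return tau / tau.sum(axis=1, keepdims=True)
```

With two identical Gaussian components and different $\beta_k$, the responsibilities are driven by the number of missing entries alone, which illustrates how the pattern c informs the partition.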

(16)

EM and/or SEM algorithms

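MCAR (and also MAR ...): classical formula! ... (EM, SEM)

$$\tau_i^k \propto \hat\pi_k\, \phi_k(y_i^{o_i}; \theta_k^{o_i})$$

MNARz: needs some new calculus but still simple ... (EM, SEM)

$$\tau_i^k \propto \hat\pi_k\, \phi_k(y_i^{o_i}; \theta_k^{o_i}) \prod_{j=1}^{d} \left(1 + \exp(-r_i^j \hat\beta_k)\right)^{-1} \quad \text{where } r_i^j = \begin{cases} 1 & \text{if } c_i^j = 1 \\ -1 & \text{otherwise} \end{cases}$$

MNARy: needs approximations ... (EM, SEM)

$$P(c_i^j \mid y_i^{o_i}, z_i^k = 1; \psi) = \begin{cases} \displaystyle\int_{-\infty}^{+\infty} \frac{1}{1 + \exp(-\alpha_j y_i^j)}\, \phi_k(y_i^j; \theta_k^j)\, dy_i^j & \text{if } c_i^j = 1 \\[2ex] \dfrac{1}{1 + \exp(\alpha_j y_i^j)} & \text{otherwise} \end{cases}$$

In the Gaussian case, there is no closed form [Pirjol, 2013] (same for MNARyz)

But SEM is still simple in that case thanks to random drawing instead of expectation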
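One possible approximation of the logistic-normal integral is Gauss-Hermite quadrature, sketched below. The slides do not say which approximation the authors use, so this choice is an assumption, as are the function name and the `alpha0` argument (set to 0 to match the integrand written on this slide).

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def logistic_normal_integral(alpha, mu, sigma, alpha0=0.0, n_nodes=40):
    """Approximate the integral of sigmoid(alpha0 + alpha * y) * N(y; mu, sigma^2) dy.

    Gauss-Hermite nodes x and weights w integrate exp(-x^2) * g(x), hence the
    change of variable y = mu + sqrt(2) * sigma * x and the 1 / sqrt(pi) factor.
    """
    x, w = hermgauss(n_nodes)
    y = mu + np.sqrt(2.0) * sigma * x
    return np.sum(w / (1.0 + np.exp(-(alpha0 + alpha * y)))) / np.sqrt(np.pi)
```

In an SEM iteration this integral is not needed: the missing values are drawn at random and the logistic term is evaluated at the drawn values.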

(17)

Link with some usual procedures!

Concatenation [Jones, 1996]: model equivalence

$$\text{MNARz}^j(y^o, c) \iff \text{MCAR}\big( (y^o \mid c) \big)$$

$$\text{MNARz}(y^o, c) \iff \text{MCAR}\Big( \big( y^o \,\Big|\, \textstyle\sum_{j=1}^{d} c^j_{\cdot} \big) \Big)$$

“All Available Cases” [Little and Rubin, 2002]: estimation equivalence

In case of conditional independence between variables, whether the mechanism is MCAR or MNAR*:

Classical (S)EM ⇔ (S)EM without estimating the missing $y^m$ ... an opportunity to reduce the computing time
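The concatenation equivalence above amounts to appending the pattern to the data before running a standard MCAR clustering. A minimal sketch follows, with a hypothetical function name: the full indicator matrix is appended for MNARz$^j$, and the per-individual count of missing entries for MNARz.

```python
import numpy as np

def augment_with_pattern(y, per_variable=True):
    """Concatenate y (np.nan for missing) with its missingness pattern.

    per_variable=True appends the d indicator columns c (MNARz^j case);
    per_variable=False appends the single column sum_j c^j (MNARz case).
    """
    c = np.isnan(y).astype(float)
    extra = c if per_variable else c.sum(axis=1, keepdims=True)
    return np.hstack([y, extra])
```

The appended columns are discrete, so the clustering model applied to the augmented data has to handle mixed variables.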

(18)

What about model selection?

Can select between MCAR and MNAR* with any information criterion (BIC, ICL)

Even if the missing mechanism is ignorable for MCAR...

... need to model $c$ to compare a MCAR and a MNAR model
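To make the last point concrete, the sketch below computes the contribution of $c$ to the log-likelihood of the MCAR model, so that its criterion is computed on the same data $(y^o, c)$ as a MNAR model. The function names are hypothetical, and the BIC is written in the "larger is better" form, which is an assumption about the convention.

```python
import numpy as np

def mcar_pattern_loglik(c):
    """Maximised log-likelihood of the pattern c under MCAR.

    Every c_i^j is Bernoulli(p) with logit(p) = alpha_0; the MLE of p is mean(c).
    """
    c = np.asarray(c)
    n1 = c.sum()
    n0 = c.size - n1
    p = n1 / c.size
    loglik = 0.0
    if n1 > 0:
        loglik += n1 * np.log(p)
    if n0 > 0:
        loglik += n0 * np.log(1.0 - p)
    return loglik

def bic(loglik, n_params, n):
    """BIC in the 'larger is better' form: loglik - (n_params / 2) * log(n)."""
    return loglik - 0.5 * n_params * np.log(n)
```

For the MCAR model the criterion would then use the log-likelihood of $y^o$ plus `mcar_pattern_loglik(c)`, with one extra parameter for $\alpha_0$.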


(20)

Hospital Data

Number of patients: n = 5 146

Number of features: d = 7

Age, Size, Weight, Cardiac frequency, Hemoglobin concentration, Temperature, Minimum Diastolic and Systolic Blood Pressure

Percentage of missing data: 6.4%

(21)

ICL comparison

[Figure "ICL Comparison": ICL (about −97400 to −96200) against the number of clusters (1 to 8), one curve per model: MCAR, MNARz, MNARy, MNARyz]

MCAR, MNARy and MNARz are equivalent until K = 3

MNARz and MNARyz clearly indicate the presence of an additional cluster (K = 4)

(22)

Missing Pattern

It seems that MNARz modelling leads to a cluster free of missing data


(24)

Summary

Interest of putting a model on $c$

Interest of the simple but meaningful model MNARz

Link between our models and usual methods

Ongoing works

Deeper analysis of the previous results with doctors...

Implement the proposed models and algorithms in the Mixmod software (a)

Use mixed data algorithms for the medical study with the same MNAR* models

(a) http://www.mixmod.org

(25)

References

Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433):222–230.

Little, R. J. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley.

Pirjol, D. (2013). The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, 237(1):460–469.