Dealing with missing data in model-based clustering through a MNAR model


Academic year: 2021


HAL Id: hal-02103347

https://hal.inria.fr/hal-02103347

Submitted on 18 Apr 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Dealing with missing data in model-based clustering through a MNAR model

Christophe Biernacki, Gilles Celeux, Julie Josse, Fabien Laporte

To cite this version:

Christophe Biernacki, Gilles Celeux, Julie Josse, Fabien Laporte. Dealing with missing data in model-based clustering through a MNAR model. CRoNos & MDA 2019 - Meeting and Workshop on Multivariate Data Analysis and Software, Apr 2019, Limassol, Cyprus. hal-02103347

(2)


Final CRoNoS meeting and Workshop on Multivariate Data Analysis and Software 14-16 April 2019, Limassol, Cyprus

(3)

Take home message

1 The missing data pattern may convey some information on clustering

2 Embed the missingness mechanism directly within the clustering modeling step

(4)

Outline

1 Introduction

2 A model-based MNAR clustering approach

3 Inference procedures

4 Medical study illustration

5 Concluding remarks

(5)

Missing data: an inevitable event

The larger the datasets, the more missing data may appear. . .

Two traditional solutions (for obtaining a filled dataset)

Discard individuals with missing data: expect to add variance to the analysis

Impute missing data: expect to add modeling bias to the analysis

General guidelines

Obtaining a non-missing dataset is not the final goal

Missing data management should take into account the initial analysis target

Our analysis target: model-based clustering

Embed missing data management into this paradigm. . .

(6)

Missing data: notations

$y = \{y_1, \ldots, y_n\}$: full dataset with $n$ individuals

$y_i = (y_i^1, \ldots, y_i^d) \in \mathbb{R}^d$: full individual $i \in \{1, \ldots, n\}$

$c = \{c_1, \ldots, c_n\}$: pattern of missing data for the full dataset

$c_i = (c_i^1, \ldots, c_i^d) \in \{0,1\}^d$: pattern of missing data for individual $i \in \{1, \ldots, n\}$

$c_i^j = 1 \Leftrightarrow y_i^j$ is missing

$o_i = \{j : c_i^j = 0\}$: indexes of the observed variables for individual $i$

$y_i^{o_i}$: values of the observed variables for individual $i$

$y^o = \{y_1^{o_1}, \ldots, y_n^{o_n}\}$: the observed values in $y$

$m_i = \{j : c_i^j = 1\}$: indexes of the missing variables for individual $i$

$y_i^{m_i}$: values of the missing variables for individual $i$

$y^m = \{y_1^{m_1}, \ldots, y_n^{m_n}\}$: the missing values in $y$

$y = \{y^o, y^m\}$ is the full dataset with its observed and missing parts

(7)

Missing data: typology of the missing mechanisms

Missing completely at random (MCAR):

$$P(c \mid y; \psi) = P(c; \psi) \quad \forall y$$

Missing at random (MAR):

$$P(c \mid y; \psi) = P(c \mid y^o; \psi) \quad \forall y^m$$

Missing not at random (MNAR): the mechanism is neither MCAR nor MAR
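The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings: the bivariate Gaussian data, the logistic form of the missingness probabilities and every numerical value are arbitrary choices made for illustration, not taken from the slides. Only the second variable is allowed to go missing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
# Two correlated Gaussian variables (illustrative values).
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# c = True means "second variable is missing" for that individual.
# MCAR: the probability of being missing is a constant.
c_mcar = rng.random(n) < 0.3
# MAR: it depends only on the first variable, which is always observed.
c_mar = rng.random(n) < sigmoid(-1.0 + 1.5 * y[:, 0])
# MNAR: it depends on the very value that goes missing.
c_mnar = rng.random(n) < sigmoid(-1.0 + 1.5 * y[:, 1])

def observed_mean(c):
    """Mean of the second variable over the individuals where it is observed."""
    return y[~c, 1].mean()

means = {name: observed_mean(c)
         for name, c in [("MCAR", c_mcar), ("MAR", c_mar), ("MNAR", c_mnar)]}
```

In this toy setting the complete-case mean stays close to the true value 0 under MCAR, while it is shifted downward under MAR and even more under MNAR, which is one way to see why the mechanism matters for the analysis.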

(8)

Clustering: model-based approach

Partition with $K$ clusters: $z = (z_1, \ldots, z_n)$ where $z_i = (z_{i1}, \ldots, z_{iK}) \in \{0,1\}^K$

$z_{ik} = 1$ if $y_i$ belongs to cluster $k$, $z_{ik} = 0$ otherwise

Gaussian mixture: $y_1, \ldots, y_n$ are i.i.d. from the mixture

$$f(y_i; \theta) = \sum_{k=1}^{K} \pi_k \phi_k(y_i; \theta_k)$$

where:

$\pi_k = P(z_{ik} = 1)$

$\phi_k(\cdot; \theta_k)$: $d$-variate Gaussian pdf with mean vector and covariance matrix $\theta_k = (\mu_k, \Sigma_k)$, $d$-multinomial pdf with $\theta_k = p_k$ probabilities vector, or mixed pdf.

$\theta = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$

Question we address in this work

Which distribution $P(c \mid y, z; \psi)$ to propose in this clustering context?

(10)

Logistic model: a natural and flexible candidate

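$$P(c \mid y, z; \psi) = \prod_{i=1}^{n} \prod_{j=1}^{d} P(c_i^j \mid y, z; \psi)$$

MCAR, with $\psi = \alpha_0$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0$$

MNARz (MNARz$^j$), with $\psi = (\alpha_0, \beta_1^{1 \ldots d}, \ldots, \beta_K^{1 \ldots d})$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \sum_{k=1}^{K} \beta_k^j z_{ik}$$

MNARy, with $\psi = (\alpha_0, \alpha_1, \ldots, \alpha_d)$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \alpha_j y_i^j$$

MNARyz, with $\psi = (\alpha_0, \alpha_1, \ldots, \alpha_d, \beta_1, \ldots, \beta_K)$

$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \alpha_j y_i^j + \sum_{k=1}^{K} \beta_k z_{ik}$$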
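The logistic models above can be written as one small function that returns the matrix of probabilities $P(c_i^j = 1 \mid y, z; \psi)$. This is a minimal sketch, not the authors' implementation: the function name `missing_prob`, its signature and the model labels are hypothetical, and it assumes that plain MNARz uses one coefficient $\beta_k$ per cluster while MNARz$^j$ uses one coefficient $\beta_k^j$ per cluster and variable.

```python
import numpy as np

def missing_prob(model, y, z, alpha0, alpha=None, beta=None):
    """P(c_i^j = 1 | y, z) for an (n, d) data matrix y and a one-hot (n, K) partition z.

    alpha: shape (d,), used by MNARy and MNARyz.
    beta:  shape (K,) for MNARz and MNARyz, shape (K, d) for MNARzj (assumed layout).
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    eta = np.full(y.shape, float(alpha0))           # intercept alpha_0 everywhere
    if model in ("MNARy", "MNARyz"):
        eta += np.asarray(alpha) * y                # adds alpha_j * y_i^j
    if model in ("MNARz", "MNARyz"):
        eta += (z @ np.asarray(beta))[:, None]      # adds beta_k of the cluster of i
    elif model == "MNARzj":
        eta += z @ np.asarray(beta)                 # adds beta_k^j
    elif model not in ("MCAR", "MNARy"):
        raise ValueError(f"unknown model: {model}")
    return 1.0 / (1.0 + np.exp(-eta))               # inverse logit

# Tiny usage example with two individuals, two variables, two clusters.
y_demo = np.array([[0.0, 1.0], [2.0, -1.0]])
z_demo = np.array([[1, 0], [0, 1]])
p_mcar = missing_prob("MCAR", y_demo, z_demo, alpha0=0.0)
p_z = missing_prob("MNARz", y_demo, z_demo, alpha0=0.0, beta=[1.0, -1.0])
p_yz = missing_prob("MNARyz", y_demo, z_demo, alpha0=0.0,
                    alpha=[0.5, 0.5], beta=[1.0, -1.0])
```

Under MCAR every entry has the same probability of being missing; under MNARz the probability is constant within a row and changes only with the cluster of the individual.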

(11)

MNARz analysis: it depends on y through z!

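$$P(c_i^j = 1 \mid y; \theta, \psi) = \sum_{k=1}^{K} P(c_i^j = 1 \mid y, z_{ik} = 1; \psi)\, P(z_{ik} = 1 \mid y; \theta)$$

Example of a univariate Gaussian model with the three components

$$0.2\,\mathcal{N}(\cdot; 0, 1) + 0.3\,\mathcal{N}(\cdot; 1, 2) + 0.5\,\mathcal{N}(\cdot; 2, 3)$$

and with parameters of the logit expression: $\alpha_0 = 1$, $\beta_1 = 1$, $\beta_2 = -1$, $\beta_3 = 1$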
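The dependence on $y$ through $z$ can be checked numerically on this example. The sketch below evaluates $P(c = 1 \mid y)$ on a grid by weighting the per-cluster probabilities with the posterior cluster probabilities. One assumption is made loudly: the second argument of each $\mathcal{N}(\cdot; m, v)$ is read as a variance, which the slide does not state; the function names are illustrative.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
mu = np.array([0.0, 1.0, 2.0])
var = np.array([1.0, 2.0, 3.0])    # assumption: second argument of N(.; m, v) is a variance
alpha0 = 1.0
beta = np.array([1.0, -1.0, 1.0])

def posterior_z(y):
    """P(z_k = 1 | y) for each point of a 1-D array y; result has shape (len(y), 3)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]
    dens = pi * np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return dens / dens.sum(axis=1, keepdims=True)

def prob_missing_given_y(y):
    """P(c = 1 | y) = sum_k P(c = 1 | z_k = 1) P(z_k = 1 | y)."""
    p_k = 1.0 / (1.0 + np.exp(-(alpha0 + beta)))   # P(c = 1 | z_k = 1), one value per cluster
    return posterior_z(y) @ p_k

grid = np.linspace(-4.0, 8.0, 7)
probs = prob_missing_given_y(grid)
```

Although the mechanism is written in terms of $z$ only, `probs` is not constant over the grid: the probability of being missing varies with $y$ because the posterior weight of the second cluster (the only one with a negative $\beta_k$) varies with $y$.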

(12)

MNARz analysis: pattern c gives information on partition z!

Plot of the Bayes error of a MNARz model with two components and 20% of missing data: $\pi_k = 0.5$, $\|\mu_2 - \mu_1\|$ varies, $\Sigma_1 = \Sigma_2 = I$, $|\beta_2 - \beta_1|$ varies

[Figure "Distance along one variable": rate of good classification (from 0.5 to 1.0) against the distance between centers (from 0.25 to 3), one curve per value of $\|\beta\|_2 \in \{0, 0.4, 0.8, 1.2\}$]

Both $\mu_k$ and $\beta_k$ act on the Bayes error

(14)

Ignorable vs. non ignorable model

A missing mechanism is ignorable if the likelihood can be decomposed as

$$L(\theta, \psi; \underbrace{y^o, c}_{\text{observed data}}) = L(\psi; c \mid y^o) \times L(\theta; y^o)$$

Some simple algebra shows that this occurs when the missing mechanism is not MNAR
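The algebra in question can be sketched as follows. This is the standard argument, written with the notation of the earlier slides; it additionally assumes, as is usual for ignorability, that $\theta$ and $\psi$ are distinct parameters.

```latex
\begin{align*}
L(\theta, \psi; y^o, c)
  &= \int f(y^o, y^m; \theta)\, P(c \mid y^o, y^m; \psi)\, dy^m \\
  &= P(c \mid y^o; \psi) \int f(y^o, y^m; \theta)\, dy^m
     && \text{(MAR: } P(c \mid y; \psi) = P(c \mid y^o; \psi)\ \ \forall y^m \text{)} \\
  &= L(\psi; c \mid y^o) \times L(\theta; y^o).
\end{align*}
```

MCAR is the special case where $P(c \mid y^o; \psi)$ does not depend on $y^o$ either. When the mechanism depends on $y^m$, the factor cannot be pulled out of the integral and the decomposition fails.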

Inference of θ

“If the missing mechanism is ignorable then likelihood-based inferences for $\theta$ from $L(\theta; y^o)$ will be the same as likelihood based inference for $\theta$ from $L(\theta, \psi; y^o, c)$.”

([Little and Rubin, 2002] Section 6.2)

MCAR is ignorable

MNARz, MNARy and MNARyz are non ignorable

(15)

EM algorithm: looks simple

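Decomposition of $Q(\theta, \psi; \hat\theta, \hat\psi)$

The expected complete log-likelihood, conditional on the observed data, is:

$$E\big[\log L(\theta, \psi; c, y, z) \mid y^o, c; \hat\theta, \hat\psi\big] = Q_y(\theta; \hat\theta, \hat\psi) + Q_c(\psi; \hat\theta, \hat\psi)$$

$$Q_y(\theta; \hat\theta, \hat\psi) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}\, E\big[\log(\pi_k \phi_k(y_i; \theta_k)) \mid y_i^{o_i}, c_i; \hat\theta, \hat\psi\big]$$

$$Q_c(\psi; \hat\theta, \hat\psi) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}\, E\big[\log P(c_i \mid z_{ik} = 1, y_i; \psi) \mid y_i^{o_i}, c_i; \hat\theta, \hat\psi\big]$$

$$\tau_{ik} = P(z_{ik} = 1 \mid c_i, y_i^{o_i}; \hat\theta, \hat\psi) = \frac{\hat\pi_k\, \phi_k(y_i^{o_i}; \hat\theta_k^{o_i})\, P(c_i \mid z_{ik} = 1, y_i^{o_i}; \hat\psi)}{\sum_{h=1}^{K} \hat\pi_h\, \phi_h(y_i^{o_i}; \hat\theta_h^{o_i})\, P(c_i \mid z_{ih} = 1, y_i^{o_i}; \hat\psi)}$$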
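The formula for $\tau_{ik}$ can be sketched in code for the MNARz mechanism, where $\operatorname{logit} P(c_i^j = 1 \mid z_{ik} = 1) = \alpha_0 + \beta_k$. The sketch makes two assumptions that are not in the slides: the Gaussian components have diagonal covariance matrices (so the density of the observed coordinates is a product over observed variables), and the function name and argument layout are hypothetical.

```python
import numpy as np

def mnarz_responsibilities(y, c, pi, mu, var, alpha0, beta):
    """tau_ik under MNARz, assuming diagonal Gaussian components.

    y:  (n, d) data; entries where c == 1 are ignored (they may be NaN)
    c:  (n, d) missingness indicators, 1 = missing
    pi: (K,) proportions; mu, var: (K, d) means and variances; beta: (K,)
    """
    obs = (c == 0)
    y0 = np.where(obs, y, 0.0)[:, None, :]                             # (n, 1, d), NaN-free
    log_norm = -0.5 * (np.log(2.0 * np.pi * var) + (y0 - mu) ** 2 / var)  # (n, K, d)
    # log phi_k(y_i^{o_i}): sum the log-densities over observed coordinates only
    log_phi = np.where(obs[:, None, :], log_norm, 0.0).sum(axis=2)        # (n, K)
    # log P(c_i | z_ik = 1): d independent Bernoulli entries with logit alpha0 + beta_k
    eta = alpha0 + np.asarray(beta, dtype=float)                       # (K,)
    n_miss = c.sum(axis=1)[:, None]                                    # (n, 1)
    d = c.shape[1]
    log_pc = -n_miss * np.logaddexp(0.0, -eta) - (d - n_miss) * np.logaddexp(0.0, eta)
    log_tau = np.log(pi) + log_phi + log_pc
    log_tau -= log_tau.max(axis=1, keepdims=True)                      # numerical stability
    tau = np.exp(log_tau)
    return tau / tau.sum(axis=1, keepdims=True)

# Toy example: two well-separated clusters, cluster 1 has many more missing values.
y_toy = np.array([[0.1, np.nan], [3.9, 4.2], [np.nan, np.nan]])
c_toy = np.isnan(y_toy).astype(int)
tau = mnarz_responsibilities(
    y_toy, c_toy,
    pi=np.array([0.5, 0.5]),
    mu=np.array([[0.0, 0.0], [4.0, 4.0]]),
    var=np.ones((2, 2)),
    alpha0=0.0, beta=np.array([2.0, -2.0]),
)
```

In the toy example the third individual has no observed value at all, yet it is assigned to the first cluster with high probability, purely because of its missingness pattern.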

(16)

EM and/or SEM algorithms

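MCAR (and also MAR...): classical formula! (EM, SEM)

$$\tau_{ik} \propto \hat\pi_k\, \phi_k(y_i^{o_i}; \hat\theta_k^{o_i})$$

MNARz: needs some new calculations but remains simple (EM, SEM)

$$\tau_{ik} \propto \hat\pi_k\, \phi_k(y_i^{o_i}; \hat\theta_k^{o_i}) \prod_{j=1}^{d} \big(1 + \exp(-r_{ij} \hat\beta_k)\big)^{-1} \quad \text{where } r_{ij} = \begin{cases} 1 & \text{if } c_i^j = 1 \\ -1 & \text{otherwise} \end{cases}$$

MNARy: needs approximations (EM, SEM)

$$P(c_i^j \mid y_i^{o_i}, z_{ik} = 1; \psi) = \begin{cases} \displaystyle\int_{-\infty}^{+\infty} \frac{1}{1 + \exp(-(\alpha_j y_i^j))}\, \phi_k(y_i^j; \theta_k^j)\, dy_i^j & \text{if } c_i^j = 1 \\[2ex] \dfrac{1}{1 + \exp(\alpha_j y_i^j)} & \text{otherwise} \end{cases}$$

In the Gaussian case, there is no closed form [Pirjol, 2013] (same for MNARyz)

But SEM is still simple in that case, thanks to random drawing instead of expectation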
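To give an idea of what such an approximation can look like, the logistic-normal integral can be evaluated numerically. The sketch below uses Gauss-Hermite quadrature, which is a standard technique swapped in here for illustration and not necessarily the approximation used by the authors, and compares it with a plain Monte Carlo average. The function name, the optional `alpha0` intercept and all numerical values are assumptions.

```python
import numpy as np

def logistic_normal_integral(alpha_j, m, s, alpha0=0.0, degree=40):
    """Approximate E[1 / (1 + exp(-(alpha0 + alpha_j * Y)))] for Y ~ N(m, s^2)."""
    # Nodes and weights for integrals of the form  int exp(-x^2) g(x) dx
    x, w = np.polynomial.hermite.hermgauss(degree)
    y = m + np.sqrt(2.0) * s * x            # change of variable y = m + sqrt(2) s x
    g = 1.0 / (1.0 + np.exp(-(alpha0 + alpha_j * y)))
    return np.sum(w * g) / np.sqrt(np.pi)

# Monte Carlo check: average the logistic function over Gaussian draws.
rng = np.random.default_rng(1)
draws = rng.normal(0.5, 1.2, size=200_000)
mc = np.mean(1.0 / (1.0 + np.exp(-0.8 * draws)))
gh = logistic_normal_integral(alpha_j=0.8, m=0.5, s=1.2)
```

The Monte Carlo line is close in spirit to the stochastic step mentioned on the slide: the expectation is replaced by random draws of the missing value.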

(17)

Link with some usual procedures!

Concatenation [Jones, 1996]: model equivalence

$$\text{MNARz}^j(y^o, c) \iff \text{MCAR}\big((y^o \mid c)\big)$$

$$\text{MNARz}(y^o, c) \iff \text{MCAR}\Big(\Big(y^o \,\Big|\, \sum_{j=1}^{d} c^j\Big)\Big)$$
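Reading $(y^o \mid c)$ as the column-wise concatenation of the data table with the indicator table, the augmented datasets of the two equivalences can be built as below. This is only a sketch of the data preparation under that reading: the helper name `augment_with_pattern` is hypothetical, and fitting an MCAR mixture on the augmented table (with a suitable distribution for the added columns) is not shown.

```python
import numpy as np

def augment_with_pattern(y, c, per_variable=False):
    """Return y with extra columns built from the missingness indicators c.

    per_variable=True  appends the d indicator columns c^1, ..., c^d (MNARz^j case).
    per_variable=False appends one column, the number of missing entries per row (MNARz case).
    """
    c = np.asarray(c)
    extra = c if per_variable else c.sum(axis=1, keepdims=True)
    return np.hstack([np.asarray(y, dtype=float), extra])

y_small = np.array([[1.0, np.nan], [2.0, 3.0]])
c_small = np.isnan(y_small).astype(int)
aug_sum = augment_with_pattern(y_small, c_small)                      # shape (2, 3)
aug_all = augment_with_pattern(y_small, c_small, per_variable=True)   # shape (2, 4)
```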

“All Available Cases” [Little and Rubin, 2002]: estimation equivalence

In case of conditional independence between variables, whether the mechanism is MCAR or MNAR*:

Classical (S)EM $\iff$ (S)EM without estimating the missing $y^m$... an opportunity to reduce the computing time

(18)

What about model selection?

Can select between MCAR and MNAR* with any information criterion (BIC, ICL)

Even if the missing mechanism is ignorable for MCAR...

... need to model $c$ to compare a MCAR and a MNAR model

(20)

Hospital Data

Number of patients: $n = 5146$

Number of features: $d = 7$

Features: Age, Size, Weight, Cardiac frequency, Hemoglobin concentration, Temperature, Minimum Diastolic and Systolic Blood Pressure

Percentage of missing data: 6.4%

(21)

ICL comparison

[Figure "ICL Comparison": ICL (from about −97400 to −96200) against the number of clusters (1 to 8), one curve per model: MCAR, MNARz, MNARy, MNARyz]

MCAR, MNARy and MNARz are equivalent until $K = 3$

MNARz and MNARyz clearly indicate the presence of an additional cluster ($K = 4$)

(22)

Missing Pattern

It seems that MNARz modelling leads to a cluster free of missing data

(24)

Summary

Interest of putting a model on $c$

Interest of the simple but meaningful model MNARz

Link between our models and usual methods

Ongoing work

Deeper analysis of the previous results with doctors...

Implement the proposed models and algorithms in the Mixmod software (http://www.mixmod.org)

Use mixed data algorithms for the medical study with the same MNAR* models

(25)

References

Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433):222–230.

Little, R. J. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley.

Pirjol, D. (2013). The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, 237(1):460–469.
