HAL Id: hal-02103347
https://hal.inria.fr/hal-02103347
Submitted on 18 Apr 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Dealing with missing data in model-based clustering through a MNAR model
Christophe Biernacki, Gilles Celeux, Julie Josse, Fabien Laporte
To cite this version:
Christophe Biernacki, Gilles Celeux, Julie Josse, Fabien Laporte. Dealing with missing data in model-based clustering through a MNAR model. CRoNos & MDA 2019 - Meeting and Workshop on Multivariate Data Analysis and Software, Apr 2019, Limassol, Cyprus. hal-02103347
Final CRoNoS meeting and Workshop on Multivariate Data Analysis and Software 14-16 April 2019, Limassol, Cyprus
Take home message
1 The missing data pattern may convey some information on clustering
2 Embed the missingness mechanism directly within the clustering modeling step
Outline
1 Introduction
2 A model-based MNAR clustering approach
3 Inference procedures
4 Medical study illustration
5 Concluding remarks
Missing data: an inevitable event
The larger the datasets, the more missing data may appear. . .
Two traditional solutions (for obtaining a filled dataset)
Discard individuals with missing data: expect to add variance to the analysis
Impute missing data: expect to add modeling bias to the analysis
General guidelines
Obtaining a non-missing dataset is not the final goal
Missing data management should take into account the initial analysis target
Our analysis target: model-based clustering
Embed missing data management into this paradigm. . .
Missing data: notations
$y = \{y_1, \ldots, y_n\}$: full dataset with $n$ individuals
$y_i = (y_i^1, \ldots, y_i^d) \in \mathbb{R}^d$: full individual $i \in \{1, \ldots, n\}$
$c = \{c_1, \ldots, c_n\}$: pattern of missing data for the full dataset
$c_i = (c_i^1, \ldots, c_i^d) \in \{0,1\}^d$: pattern of missing data for individual $i \in \{1, \ldots, n\}$
$c_i^j = 1 \Leftrightarrow y_i^j$ is missing
$o_i = \{j : c_i^j = 0\}$: indexes of the observed variables for individual $i$
$y_i^{o_i}$: values of the observed variables for individual $i$
$y^o = \{y_1^{o_1}, \ldots, y_n^{o_n}\}$: the observed values in $y$
$m_i = \{j : c_i^j = 1\}$: indexes of the missing variables for individual $i$
$y_i^{m_i}$: values of the missing variables for individual $i$
$y^m = \{y_1^{m_1}, \ldots, y_n^{m_n}\}$: the missing values in $y$
$y = \{y^o, y^m\}$ is the full dataset with its observed and missing parts
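As a small illustration of this notation, the sketch below builds the pattern $c$ and the index sets $o_i$, $m_i$ from a toy dataset. The data values are made up for the example, NaN is assumed to encode a missing entry, and indexes are 0-based in the code whereas the slides count from 1.

```python
import math

# Toy dataset (hypothetical values): n = 3 individuals, d = 3 variables.
# NaN is assumed to mark a missing entry.
y = [
    [1.2, math.nan, 0.5],
    [math.nan, math.nan, 2.0],
    [0.3, 1.1, 4.2],
]

# c[i][j] = 1 if y[i][j] is missing, 0 otherwise.
c = [[1 if math.isnan(v) else 0 for v in row] for row in y]

# o[i]: indexes of the observed variables, m[i]: indexes of the missing ones.
o = [[j for j, cij in enumerate(row) if cij == 0] for row in c]
m = [[j for j, cij in enumerate(row) if cij == 1] for row in c]

# Observed values of each individual, i.e. y_i^{o_i}.
y_obs = [[y[i][j] for j in o[i]] for i in range(len(y))]
```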
Missing data: typology of the missing mechanisms
Missing completely at random (MCAR):
$$P(c \mid y; \psi) = P(c; \psi) \quad \forall y$$
Missing at random (MAR):
$$P(c \mid y; \psi) = P(c \mid y^o; \psi) \quad \forall y^m$$
Missing not at random (MNAR): the mechanism is neither MCAR nor MAR
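To make the three mechanisms concrete, here is a minimal sketch for a bivariate individual $(y^1, y^2)$ where $y^1$ is assumed to be always observed and only $y^2$ can be missing. The function name, the logistic form and the coefficients are assumptions chosen for illustration, not taken from the talk.

```python
import math

def sigmoid(t):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_y2_missing(mechanism, y1, y2):
    """Probability that y2 is missing; y1 is assumed always observed.

    The intercept -1.0 and unit slopes are arbitrary illustrative values.
    """
    if mechanism == "MCAR":
        return sigmoid(-1.0)        # depends on no data at all
    if mechanism == "MAR":
        return sigmoid(-1.0 + y1)   # depends only on the observed value y1
    if mechanism == "MNAR":
        return sigmoid(-1.0 + y2)   # depends on the possibly missing value y2
    raise ValueError(f"unknown mechanism: {mechanism}")
```

Under MAR the probability is unchanged when only `y2` varies, whereas under MNAR it moves with `y2` itself.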
Clustering: model-based approach
Partition with $K$ clusters: $z = (z_1, \ldots, z_n)$ where $z_i = (z_{i1}, \ldots, z_{iK}) \in \{0,1\}^K$
$z_{ik} = 1$ if $y_i$ belongs to cluster $k$, $z_{ik} = 0$ otherwise
Gaussian mixture: $y_1, \ldots, y_n$ are i.i.d. from the mixture
$$f(y_i; \theta) = \sum_{k=1}^{K} \pi_k \phi_k(y_i; \theta_k)$$
where:
$\pi_k = P(z_{ik} = 1)$
$\phi_k(\cdot; \theta_k)$: $d$-variate Gaussian pdf with mean vector and covariance matrix $\theta_k = (\mu_k, \Sigma_k)$, $d$-multinomial pdf with probability vector $\theta_k = p_k$, or mixed pdf
$\theta = (\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K)$
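The mixture density and the resulting cluster membership probabilities can be sketched as follows in the univariate Gaussian case. The function names are hypothetical, and the restriction to $d = 1$ with variance parameters is an assumption made to keep the example short.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, pis, means, variances):
    """f(x; theta) = sum_k pi_k * phi_k(x; theta_k)."""
    return sum(p * normal_pdf(x, mu, v) for p, mu, v in zip(pis, means, variances))

def posterior(x, pis, means, variances):
    """P(z_k = 1 | x; theta) for each cluster k, by Bayes' rule."""
    weighted = [p * normal_pdf(x, mu, v) for p, mu, v in zip(pis, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

For instance, `posterior(0.5, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])` gives equal probabilities to both clusters, since 0.5 lies halfway between the two means.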
Question we address in this work
Which distribution $P(c \mid y, z; \psi)$ to propose in this clustering context?
2 A model-based MNAR clustering approach
Logistic model: a natural and flexible candidate
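$$P(c \mid y, z; \psi) = \prod_{i=1}^{n} \prod_{j=1}^{d} P(c_i^j \mid y, z; \psi)$$
MCAR, with $\psi = \alpha_0$:
$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0$$
MNARz (MNARz$^j$), with $\psi = (\alpha_0, \beta_1^{1 \ldots d}, \ldots, \beta_K^{1 \ldots d})$:
$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \sum_{k=1}^{K} \beta_k^j z_{ik}$$
MNARy, with $\psi = (\alpha_0, \alpha_1, \ldots, \alpha_d)$:
$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \alpha_j y_i^j$$
MNARyz, with $\psi = (\alpha_0, \alpha_1, \ldots, \alpha_d, \beta_1, \ldots, \beta_K)$:
$$\operatorname{logit}(P(c_i^j = 1 \mid y, z; \psi)) = \alpha_0 + \alpha_j y_i^j + \sum_{k=1}^{K} \beta_k z_{ik}$$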
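The four logistic models differ only in which terms enter the linear predictor. A minimal sketch, assuming a hypothetical helper `missing_prob` that takes the cluster index `k` of the individual (so that the sum over $k$ of $\beta_k z_{ik}$ reduces to a single $\beta_k$):

```python
import math

def sigmoid(t):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-t))

def missing_prob(model, alpha0, k=None, beta=None, y_ij=None, alpha_j=None):
    """P(c_i^j = 1 | y, z; psi) for an individual belonging to cluster k.

    beta is the list (beta_1, ..., beta_K); for MNARz^j, pass the
    coefficients of variable j. Names and signature are illustrative.
    """
    if model not in ("MCAR", "MNARz", "MNARy", "MNARyz"):
        raise ValueError(f"unknown model: {model}")
    eta = alpha0
    if model in ("MNARz", "MNARyz"):
        eta += beta[k]            # cluster effect: sum_k beta_k * z_ik
    if model in ("MNARy", "MNARyz"):
        eta += alpha_j * y_ij     # effect of the value itself
    return sigmoid(eta)
```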
MNARz analysis: it depends on y through z!
$$P(c_i^j = 1 \mid y; \theta, \psi) = \sum_{k=1}^{K} P(c_i^j = 1 \mid y, z_{ik} = 1; \psi)\, P(z_{ik} = 1 \mid y; \theta)$$
Example of a univariate Gaussian model with the three components $0.2\,\mathcal{N}(\cdot; 0, 1) + 0.3\,\mathcal{N}(\cdot; 1, 2) + 0.5\,\mathcal{N}(\cdot; 2, 3)$
and with parameters of the logit expression: $\alpha_0 = 1$, $\beta_1 = 1$, $\beta_2 = -1$, $\beta_3 = 1$
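The dependence on $y$ through $z$ can be checked numerically on this example. The sketch below is one possible reading: the second argument of each Gaussian is assumed to be a variance (the slide does not say whether it is a variance or a standard deviation), and the function name is hypothetical.

```python
import math

PIS = [0.2, 0.3, 0.5]
MEANS = [0.0, 1.0, 2.0]
VARIANCES = [1.0, 2.0, 3.0]   # assumption: second argument read as a variance
ALPHA0 = 1.0
BETAS = [1.0, -1.0, 1.0]

def normal_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sigmoid(t):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_missing_given_y(y):
    """P(c = 1 | y) = sum_k P(c = 1 | z_k = 1) * P(z_k = 1 | y) under MNARz."""
    weighted = [p * normal_pdf(y, mu, v) for p, mu, v in zip(PIS, MEANS, VARIANCES)]
    total = sum(weighted)
    return sum((w / total) * sigmoid(ALPHA0 + b) for w, b in zip(weighted, BETAS))

# Missingness probability along a grid of y values.
curve = [(y, prob_missing_given_y(y)) for y in (-3.0, -1.0, 0.0, 1.0, 2.0, 4.0)]
```

Although $y$ does not appear in the MNARz logit, the probability varies with $y$ because the cluster membership probabilities do.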
MNARz analysis: pattern c gives information on partition z!
Bayes error of a MNARz model with two components and 20% of missing data: $\pi_k = 0.5$, $\|\mu_2 - \mu_1\|$ varies, $\Sigma_1 = \Sigma_2 = I$, $|\beta_2 - \beta_1|$ varies
[Figure: "Distance along one variable". Good classification rate (y-axis, 0.5 to 1.0) against the centers' distance (x-axis, 0.25 to 3), one curve per value of ||β||2 in {0, 0.4, 0.8, 1.2}.]
Both $\mu_k$ and $\beta_k$ act on the Bayes error
3 Inference procedures
Ignorable vs. non ignorable model
A missing mechanism is ignorable if the likelihood can be decomposed as
$$L(\theta, \psi; \underbrace{y^o, c}_{\text{observed data}}) = L(\psi; c \mid y^o) \times L(\theta; y^o)$$
Some simple algebra shows that this occurs when the missing mechanism is not MNAR
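The algebra can be sketched as follows for the MAR case (MCAR being a special case). This is a reconstruction under two assumptions that the slide does not state: the partition $z$ is left aside, and $\theta$ and $\psi$ are taken to be distinct parameters.

```latex
\begin{align*}
L(\theta, \psi; y^o, c)
  &= \int f(y^o, y^m; \theta)\, P(c \mid y^o, y^m; \psi)\, dy^m \\
  &= \int f(y^o, y^m; \theta)\, P(c \mid y^o; \psi)\, dy^m
     && \text{(MAR: no dependence on } y^m\text{)} \\
  &= P(c \mid y^o; \psi) \int f(y^o, y^m; \theta)\, dy^m \\
  &= L(\psi; c \mid y^o) \times L(\theta; y^o).
\end{align*}
```

Under MNAR, $P(c \mid y^o, y^m; \psi)$ cannot be taken out of the integral, so the factorization fails.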
Inference of θ
“If the missing mechanism is ignorable then likelihood-based inferences for $\theta$ from $L(\theta; y^o)$ will be the same as likelihood based inference for $\theta$ from $L(\theta, \psi; y^o, c)$.”
([Little and Rubin, 2002] Section 6.2)
MCAR is ignorable
MNARz, MNARy and MNARyz are non ignorable
EM algorithm: looks simple
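Decomposition of $Q(\theta, \psi; \hat\theta, \hat\psi)$
The expected complete-data log-likelihood, conditional on the observed data, is:
$$\mathbb{E}\left[\log L(\theta, \psi; c, y, z) \mid y^o, c; \hat\theta, \hat\psi\right] = Q_y(\theta; \hat\theta, \hat\psi) + Q_c(\psi; \hat\theta, \hat\psi)$$
$$Q_y(\theta; \hat\theta, \hat\psi) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}\, \mathbb{E}\left[\log(\pi_k \phi_k(y_i; \theta_k)) \mid y_i^{o_i}, c_i; \hat\theta, \hat\psi\right]$$
$$Q_c(\psi; \hat\theta, \hat\psi) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}\, \mathbb{E}\left[\log P(c_i \mid z_{ik} = 1, y_i; \psi) \mid y_i^{o_i}, c_i; \hat\theta, \hat\psi\right]$$
$$\tau_{ik} = P(z_{ik} = 1 \mid c_i, y_i^{o_i}; \hat\theta, \hat\psi) = \frac{\hat\pi_k \phi_k(y_i^{o_i}; \theta_k^{o_i})\, P(c_i \mid z_{ik} = 1, y_i^{o_i}; \hat\psi)}{\sum_{h=1}^{K} \hat\pi_h \phi_h(y_i^{o_i}; \theta_h^{o_i})\, P(c_i \mid z_{ih} = 1, y_i^{o_i}; \hat\psi)}$$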
EM and/or SEM algorithms
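MCAR (and also MAR...): classical formula! (EM, SEM)
$$\tau_{ik} \propto \hat\pi_k \phi_k(y_i^{o_i}; \theta_k^{o_i})$$
MNARz: needs some new calculus but still simple... (EM, SEM)
$$\tau_{ik} \propto \hat\pi_k \phi_k(y_i^{o_i}; \theta_k^{o_i}) \prod_{j=1}^{d} \left(1 + \exp(-r_{ij} \hat\beta_k)\right)^{-1} \quad \text{where } r_{ij} = \begin{cases} 1 & \text{if } c_i^j = 1 \\ -1 & \text{otherwise} \end{cases}$$
MNARy: needs approximations... (EM, SEM)
$$P(c_i^j \mid y_i^{o_i}, z_{ik} = 1; \psi) = \begin{cases} \displaystyle\int_{-\infty}^{+\infty} \frac{1}{1 + \exp(-\alpha_j y_i^j)}\, \phi_k(y_i^j; \theta_k^j)\, dy_i^j & \text{if } c_i^j = 1 \\[2ex] \dfrac{1}{1 + \exp(\alpha_j y_i^j)} & \text{otherwise} \end{cases}$$
In the Gaussian case, there is no closed form [Pirjol, 2013] (same for MNARyz)
But SEM is still simple in that case thanks to random drawing instead of expectation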
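The MNARz formula for $\tau_{ik}$ can be sketched as below for a single individual. This is an illustration rather than the authors' implementation: a Gaussian mixture with diagonal covariance matrices is assumed, so that $\phi_k(y_i^{o_i}; \theta_k^{o_i})$ is a product of univariate densities over the observed variables, `None` is assumed to encode a missing entry, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities_mnarz(y_i, pis, means, variances, betas):
    """tau_ik for one individual under MNARz (diagonal Gaussian assumed).

    y_i: list of d values, None for a missing entry.
    means[k][j], variances[k][j]: parameters of cluster k, variable j.
    betas[k]: logistic coefficient of cluster k.
    """
    weights = []
    for k in range(len(pis)):
        w = pis[k]
        for j, value in enumerate(y_i):
            if value is None:
                r = 1                      # c_i^j = 1: entry is missing
            else:
                r = -1                     # c_i^j = 0: entry is observed
                w *= normal_pdf(value, means[k][j], variances[k][j])
            w *= 1.0 / (1.0 + math.exp(-r * betas[k]))
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]
```

With all `betas` equal, the logistic factors cancel in the normalization and the classical MCAR formula is recovered; with different `betas`, even an individual with no observed value is assigned informative membership probabilities.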
Link with some usual procedures!
Concatenation [Jones, 1996]: model equivalence
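$$\text{MNARz}^j(y^o, c) \iff \text{MCAR}\big((y^o \mid c)\big)$$
$$\text{MNARz}(y^o, c) \iff \text{MCAR}\Big(\big(y^o \,\big|\, \textstyle\sum_{j=1}^{d} c^j\big)\Big)$$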
“All Available Cases” [Little and Rubin, 2002]: estimation equivalence
In case of conditional independence between variables, whether MCAR or MNAR*:
Classical (S)EM $\iff$ (S)EM without estimating the missing $y^m$
... an opportunity to reduce the computing time
What about model selection?
One can select between MCAR and MNAR* with any information criterion (BIC, ICL)
Even if the missing mechanism is ignorable for MCAR...
... $c$ still needs to be modelled to compare a MCAR and a MNAR model
4 Medical study illustration
Hospital Data
Number of patients: $n = 5146$
Number of features: $d = 7$
Features: Age, Size, Weight, Cardiac frequency, Hemoglobin concentration, Temperature, Minimum Diastolic and Systolic Blood Pressure
Percentage of missing data: 6.4%
ICL comparison
[Figure: "ICL Comparison". ICL (y-axis, about −97400 to −96200) against the number of clusters (x-axis, 1 to 8), one curve per model: MCAR, MNARz, MNARy, MNARyz.]
MCAR, MNARy and MNARz are equivalent until $K = 3$
MNARz and MNARyz clearly indicate the presence of an additional cluster ($K = 4$)
Missing Pattern
It seems that MNARz modelling leads to a cluster free of missing data
5 Concluding remarks
Summary
Interest of putting a model on $c$
Interest of the simple but meaningful model MNARz
Link between our models and usual methods
Ongoing works
Deeper analysis of the previous results with doctors...
Implement the proposed models/algorithms in the Mixmod software (http://www.mixmod.org)
Use mixed data algorithms for the medical study with the same MNAR* models
References
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433):222–230.
Little, R. J. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley.
Pirjol, D. (2013). The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, 237(1):460–469.