HAL Id: hal-01949171
https://hal.archives-ouvertes.fr/hal-01949171
Submitted on 9 Dec 2018
mixedClust: an R package for mixed data classification, clustering and co-clustering
Margot Selosse, Julien Jacques, Christophe Biernacki
To cite this version:
Margot Selosse, Julien Jacques, Christophe Biernacki. mixedClust: an R package for mixed data classification, clustering and co-clustering. 25th Summer Session Working Group on Model-Based Clustering, Jul 2018, Ann Arbor, United States. ⟨hal-01949171⟩
Package functionalities
The package provides model-based algorithms for clustering, co-clustering and classification of mixed-type data.
The main functions are:
    mixedClust      # to perform clustering
    mixedCoclust    # to perform co-clustering
    mixedClassif    # to perform classification, in a parsimonious way or not
    predictions     # use the result from mixedClassif for predictions
Notations
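- $x$: matrix with $N$ rows and $J = J_1 + \dots + J_d + \dots + J_D$ columns.
- $x$ is composed of several matrices: $x^1, \dots, x^D$.
- $x^d$: $N \times J_d$ matrix.
- $x^d$ is made of variables from one of 5 different types: Continuous, Nominal, Ordinal, Integer or Functional.
- In unsupervised methods: $G$ row clusters and $H_1, \dots, H_D$ column clusters ($H_d$ for $x^d$).

$$x = \left( x^1 \mid \dots \mid x^D \right), \qquad x^d = \left( x_{ij}^d \right)_{1 \le i \le N;\; 1 \le j \le J_d}.$$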
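To make the block notation concrete, here is a minimal R sketch that assembles $x$ from two blocks and derives the 0-based column offset at which each block begins. The block sizes (17 and 71 columns) mirror the real dataset described later in the poster; the values, the number of answer levels and the variable names `blocks`, `Jd` and `starts` are illustrative assumptions, not part of the package.

```r
set.seed(42)
N <- 10  # number of rows (toy value)

# Hypothetical blocks: values and numbers of levels are made up
x1 <- matrix(sample(1:2, N * 17, replace = TRUE), nrow = N)  # J1 = 17 nominal columns
x2 <- matrix(sample(1:5, N * 71, replace = TRUE), nrow = N)  # J2 = 71 ordinal columns

blocks <- list(x1, x2)
x <- do.call(cbind, blocks)   # x = (x^1 | x^2), an N x (J1 + J2) matrix

Jd <- sapply(blocks, ncol)                 # block widths J_1, ..., J_D
starts <- c(0, cumsum(Jd)[-length(Jd)])    # 0-based offset where each block begins

dim(x)
starts
```

With these block sizes the offsets come out as 0 and 17, which is the same form as the `dlist` vector in the poster's R code section.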
Models (for D = 2)
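Legend: observed partitions vs. latent partitions (the graphical models are not reproduced here).

- Clustering:

$$p(x;\Theta) = \sum_{v \in V} p(v;\Theta) \times p(x^1 \mid v;\Theta)\, p(x^2 \mid v;\Theta)$$

- Co-clustering:

$$p(x;\Theta) = \sum_{v, w^1, w^2} p(v;\Theta)\, p(w^1;\Theta)\, p(w^2;\Theta) \times p(x^1 \mid v, w^1;\Theta)\, p(x^2 \mid v, w^2;\Theta)$$

- Classification without parsimony:

$$p(x;\Theta) = p(v;\Theta) \times p(x^1 \mid v;\Theta)\, p(x^2 \mid v;\Theta)$$

- Classification with parsimony (obtained by clustering the features):

$$p(x;\Theta) = \sum_{w^1, w^2} p(v;\Theta)\, p(w^1;\Theta)\, p(w^2;\Theta) \times p(x^1 \mid v, w^1;\Theta)\, p(x^2 \mid v, w^2;\Theta)$$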
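The sum over row partitions in the clustering model can be checked by brute force on a tiny case. The sketch below assumes, as is standard for mixture models but not spelled out on the poster, that row labels are independent with $p(v;\Theta) = \prod_i \gamma_{v_i}$ and that rows are conditionally independent given $v$. Under these assumptions the sum over all partitions equals the product over rows of the usual mixture density. All numbers are made up, and a single Bernoulli column stands in for the blocks.

```r
# Toy setting (all values are made up): N = 3 rows, one binary column, G = 2 row clusters
x     <- c(1, 0, 1)
gamma <- c(0.3, 0.7)   # mixing proportions of the row clusters
alpha <- c(0.9, 0.2)   # P(x_i = 1 | row cluster g)

# Explicit sum over all G^N = 8 row partitions v
partitions <- as.matrix(expand.grid(rep(list(1:2), length(x))))
lik_sum <- sum(apply(partitions, 1, function(v) {
  prod(gamma[v]) * prod(dbinom(x, 1, alpha[v]))
}))

# Product over rows of the mixture density sum_g gamma_g * f(x_i; alpha_g)
lik_prod <- prod(sapply(x, function(xi) sum(gamma * dbinom(xi, 1, alpha))))

all.equal(lik_sum, lik_prod)
```

No such factorization is available once column partitions are latent as well, because every cell then depends on both a row label and a column label.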
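The parameters we want to estimate are:

$$\Theta = \left( \alpha_{gh}^d,\ \gamma_g,\ \rho_h^d \right)_{1 \le g \le G,\; 1 \le h \le H_d,\; 1 \le d \le D}$$

- $\alpha_{gh}^d$: parameters of the distribution of the $g$th row cluster and $h$th column cluster of $x^d$. They depend on the type of $x^d$.
- $\gamma_g$: mixing proportion of the $g$th row cluster.
- $\rho_h^d$: mixing proportion of the $h$th column cluster of $x^d$.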
Inference
The EM algorithm and the BIC criterion are not tractable in co-clustering because of the double missing structure: both the row partition and the column partitions are latent.
Consequently, we use:
- a Stochastic EM (SEM) algorithm, with a Gibbs sampler to simulate the latent variables;
- the ICL-BIC criterion for model selection.
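The following is a minimal sketch of an SEM-Gibbs loop for the simplest possible case: one binary block with Bernoulli cells. It is not the package's implementation. The function name `sem_gibbs`, its arguments, the smoothed updates that keep probabilities away from 0 and 1, and the simulated data are all assumptions made for illustration. Each iteration samples the row labels given the column labels, then the column labels given the row labels, then updates the parameters from the completed data; estimates are averaged after a burn-in period.

```r
set.seed(1)
# Simulated binary data with 2 row clusters and 2 column clusters (not the poster's data)
N <- 40; J <- 20
v_true <- rep(1:2, each = N / 2)
w_true <- rep(1:2, each = J / 2)
alpha_true <- matrix(c(0.9, 0.1, 0.2, 0.8), 2, 2)
prob <- alpha_true[cbind(rep(v_true, J), rep(w_true, each = N))]
x <- matrix(rbinom(N * J, 1, prob), N, J)

sem_gibbs <- function(x, G, H, n_iter = 50, n_burn = 20) {
  N <- nrow(x); J <- ncol(x)
  w <- sample(H, J, replace = TRUE)              # random initial column labels
  gamma <- rep(1 / G, G); rho <- rep(1 / H, H)
  alpha <- matrix(runif(G * H, 0.3, 0.7), G, H)
  kept <- list()
  for (it in seq_len(n_iter)) {
    # Gibbs step 1: sample row labels v given column labels w
    W <- diag(H)[w, , drop = FALSE]              # J x H indicator matrix
    logp <- (x %*% W) %*% t(log(alpha)) + ((1 - x) %*% W) %*% t(log(1 - alpha))
    logp <- sweep(logp, 2, log(gamma), "+")      # N x G log-weights
    v <- apply(logp, 1, function(l) sample(G, 1, prob = exp(l - max(l))))
    # Gibbs step 2: sample column labels w given row labels v
    V <- diag(G)[v, , drop = FALSE]              # N x G indicator matrix
    logq <- (t(x) %*% V) %*% log(alpha) + (t(1 - x) %*% V) %*% log(1 - alpha)
    logq <- sweep(logq, 2, log(rho), "+")        # J x H log-weights
    w <- apply(logq, 1, function(l) sample(H, 1, prob = exp(l - max(l))))
    # M step on the completed data (add-one smoothing is an assumption of this sketch)
    W <- diag(H)[w, , drop = FALSE]
    ones  <- t(V) %*% x %*% W                    # number of ones per block
    sizes <- outer(colSums(V), colSums(W))       # number of cells per block
    alpha <- (ones + 1) / (sizes + 2)
    gamma <- (colSums(V) + 1) / (N + G)
    rho   <- (colSums(W) + 1) / (J + H)
    if (it > n_burn) kept[[length(kept) + 1]] <- alpha
  }
  list(alpha = Reduce(`+`, kept) / length(kept), v = v, w = w)
}

fit <- sem_gibbs(x, G = 2, H = 2)
round(fit$alpha, 2)
```

The arguments `n_iter` and `n_burn` play the same role as `nbSEM` (total number of iterations) and `nbSEMburn` (burn-in period) in the poster's R code. A fuller implementation would also handle label switching and derive the final partitions from the post-burn-in draws rather than from the last one.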
Results for classification on real dataset
Dataset
- Trauma survey: 823 people answered 88 psychological questions about anxiety, depression, anger and possibly traumatizing life events. 307 of them were diagnosed with trauma, and 516 were declared not traumatized.
- $x^1$: categorical data from 17 questions about traumatizing life events.
- $x^2$: ordinal data from 71 questions about anger, depression and anxiety.
- 2/3 of the dataset was used to train the model. The remaining 1/3 was then used for prediction.
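A 2/3 and 1/3 split of this kind can be sketched as follows. The poster does not say how the split was drawn, so a uniformly random split is assumed here. The data are simulated placeholders with the same dimensions as the survey; the numbers of answer levels and the 1/2 label coding are made up. The object names `M.train`, `y.train` and `M.test` follow the poster's R code, while `M`, `y`, `y.test` and `train_idx` are hypothetical.

```r
set.seed(123)
N <- 823

# Simulated stand-ins for the survey answers (numbers of levels are assumptions)
M <- cbind(
  matrix(sample(1:2, N * 17, replace = TRUE), nrow = N),  # 17 categorical questions
  matrix(sample(1:5, N * 71, replace = TRUE), nrow = N)   # 71 ordinal questions
)
# 307 traumatized (coded 1) and 516 not traumatized (coded 2), in random order
y <- sample(rep(c(1, 2), times = c(307, 516)))

# Random 2/3 - 1/3 split
train_idx <- sample(N, size = floor(2 * N / 3))
M.train <- M[train_idx, ]
y.train <- y[train_idx]
M.test  <- M[-train_idx, ]
y.test  <- y[-train_idx]

c(train = nrow(M.train), test = nrow(M.test))
```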
Results
Model                   precision   recall   specificity
not parsimonious        0.75        0.80     0.83
(H1, H2) = (3, 8)       0.78        0.88     0.85
(H1, H2) = (2, 5)       0.82        0.92     0.88

Table: Precision, recall and specificity for different values of kc.
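For reference, the three measures in the table can be computed from true and predicted labels as in the sketch below. The helper `classification_metrics` and the toy label vectors are hypothetical; they are not part of the package and do not reproduce the poster's results.

```r
# Precision, recall and specificity for a binary problem;
# `positive` is the label treated as the positive class
classification_metrics <- function(truth, predicted, positive) {
  tp <- sum(predicted == positive & truth == positive)  # true positives
  fp <- sum(predicted == positive & truth != positive)  # false positives
  fn <- sum(predicted != positive & truth == positive)  # false negatives
  tn <- sum(predicted != positive & truth != positive)  # true negatives
  c(precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy labels (made up): 1 = traumatized, 0 = not traumatized
truth     <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
metrics <- classification_metrics(truth, predicted, positive = 1)
metrics
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, so precision and recall are both 3/4 and specificity is 5/6.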
On classification with parsimony:
- Predictions are better with parsimony than without it.
- The parsimonious model has fewer parameters, which makes interpretation easier.
Feature clusters for parsimonious classification
Figure: Patient classification and feature clusters. Categorical answers about life events on the left. Ordinal answers about Anger/Anxiety/Depression on the right.
R code
    # M.train, y.train and M.test are assumed to already exist in the workspace
    library(mixedClust)  # assumes the package is installed

    ######## Defining the dataset properties ########
    dlist = c(0, 17)                   # defining where each type begins in the complete dataset
    distrib = c("Multinomial", "Bos")  # defining the distribution types

    ######## defining the SEM-Gibbs algorithm configuration ########
    nbSEM = 250       # total number of iterations
    nbSEMburn = 200   # burn-in period
    nbindmini = 10    # minimum number of elements in one block
    init = "kmeans"   # initialization type

    ######## defining the number of clusters ########
    kr = 2        # Two classes: Traumatized/Not Traumatized
    kc = c(2, 5)  # Introducing parsimony (put to 0 for no parsimony)

    ######## running the classification function ########
    classif = mixedClassif(M.train, y.train, dlist, kr = kr, kc = kc, init, distrib, nbSEM, nbSEMburn, nbindmini)

    prediction <- predictions(classif, M.test)

    ######## printing predicted labels ########
    prediction@zr_topredict
M. Selosse (1), J. Jacques (1), C. Biernacki (2)
(1) Université de Lyon, Lyon 2, ERIC EA 3083; (2) Université Lille 1, Inria