mixedClust: an R package for mixed data classiication, clustering and co-clustering

(1)

HAL Id: hal-01949171

https://hal.archives-ouvertes.fr/hal-01949171

Submitted on 9 Dec 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

mixedClust: an R package for mixed data classiication, clustering and co-clustering

Margot Selosse, Julien Jacques, Christophe Biernacki

To cite this version:

Margot Selosse, Julien Jacques, Christophe Biernacki. mixedClust: an R package for mixed data classiication, clustering and co-clustering. 25th Summer Session Working Group on Model-Based Clustering, Jul 2018, Ann Arbor, United States. �hal-01949171�

(2)

mixedClust: an R package for mixed data classification, clustering and co-clustering

Package functionalities

The package provides model-based algorithm for clustering, co-clustering and classification with mixed-type data.

Principal functions are:

1 m i x e d C l u s t # t o p e r f o r m c l u s t e r i n g m i x e d C o c l u s t # t o p e r f o r m co−c l u s t e r i n g

3 m i x e d C l a s s i f # t o p e r f o r m c l a s s i f i c a t i o n , i n a p a r s i m o n i o u s way o r n o t p r e d i c t i o n s # u s e t h e r e s u l t from m i x e d C l a s s i f f o r p r e d i c t i o n s

Notations

I x:N rows andJ = J₁+. . .+J_d +. . .+J_D columns I xcomposed of several matrices :x¹, . . . ,x^D I x^d:N ×J_d matrix

I x^d is made of variables from one of5different types:Continuous,Nominal, Ordinal,IntegerorFunctional.

I In unsupervised methods:Gclusters in line,H₁. . .H_Dclusters in column

x=











 x¹





 . . .





 x^D













,x^d =(x_ij^d)_1≤i_≤N_{; 1≤j}_≤J_d.

Models (for D = 2)

Legend :-Observed partitions-Latent partitions

I Clustering:

p(x;Θ)= Í

v ∈Vp(v;Θ) ×p(x¹|v;Θ)p(x²|v;Θ)

I Co-clustering:

p(x;Θ)= Í

v,w¹,w²

p(v;Θ)p(w¹;Θ)p(w²;Θ) ×p(x¹|v,w¹;Θ)p(x²|v,w²;Θ)

I Classification without parsimony:

p(x;Θ)=p(v;Θ) ×p(x¹|v;Θ)p(x²|v;Θ)

I Classification with parsimony (obtained by clustering the features):

p(x;Θ)= Í

w¹,w²

p(v;Θ)p(w¹;Θ)p(w²;Θ) ×p(x¹|v,w¹;Θ)p(x²|v,w²;Θ)

The parameters we want to estimate are:

Θ= α_дh^d ,γ_д, ρ_h^d

1≤h≤Hdand1≤d≤D

I α_дh^d : parameters of distribution ofд^th row-cluster andh^thcolumn-cluster of x^d.It will depend on the type ofx^d.

I γд: mixing proportion ofд^th row-cluster

I ρ^d_h: mixing proportion ofh^th column cluster forx^d

Inference

EM and BIC not tractable in co-clustering, due to the double missing structure.

Consequently, we use:

I Stochastic EM algorithm, with a Gibbs sampler for the latent variables simulation

I ICL-BIC criterion for model selection

Results for classification on real dataset

Dataset

I Trauma-survey:823persons answered to88psychological questions about anxiety, depression, anger and possibly traumatizing life events.307of them were diagnosed with trauma, and516were declared not traumatized.

I x¹: Categorical data from17questions about traumatizing life events.

I x²: Ordinal data from71questions about anger, depression and anxiety.

I 2/3of the dataset was used to train the model. The last1/3was then used for prediction.

Results

precision recall specificity not parsimonious 0.75 0.80 0.83

(H₁,H₂)= (3,8) 0.78 0.88 0.85 (H1,H2)= (2,5) 0.82 0.92 0.88 Table:Precision, recall and specificity for different kc.

On classification with parsimony:

I Better results are obtained on predictions when we introduce parsimony than when we don’t.

I Parsimony training result gives less parameters, which makes easier the interpretation.

Features clusters for parsimonious classification

Figure:Patients classification and features clusters. Categorical answers about life events on the left. Ordinal answers about Anger/Anxiety/Depression on the right.

R code

# # # # # # # # D e f i n i n g t h e d a t a s e t p r o p e r t i e s # # # # # #

2 d l i s t = c( 0 , 1 7 ) # d e f i n i n g where e a c h t y p e b e g i n s i n t h e c o m p l e t e d a t a s e t

d i s t r i b = c(" M u l t i n o m i a l " ," Bos ") # d e f i n i n g t h e d i s t r i b u t i o n t y p e s

4

# # # # # # # # d e f i n i n g t h e SEM−G i b b s a l g o r i t h m c o n f i g u r a t i o n # # # # # # # #

6 nbSEM = 2 5 0 # t o t a l number o f i t e r a t i o n s nbSEMburn = 2 0 0 # burn−i n p e r i o d

8 n b i n d m i n i = 10 # minimum number o f e l e m e n t s i n one b l o c k i n i t = " kmeans " # i n i t i a l i z a t i o n t y p e

10

# # # # # # # # d e f i n i n g t h e number o f c l u s t e r s # # # # # # # #

12 k r = 2 # Two c l a s s e s : T r a u m a t i z e d/Not T r a u m a t i z e d

kc = c( 2 , 5 ) # I n t r o d u c i n g p a r s i m o n y ( p u t t o 0 f o r no p a r s i m o n y )

14

# # # # # # # # r u n n i n g t h e c l a s s i f i c a t i o n f u n c t i o n # # # # # # # #

16 c l a s s i f = m i x e d C l a s s i f (M. t r a i n , y . t r a i n , d l i s t , k r = kr , kc = kc , i n i t , d i s t r i b , nbSEM , nbSEMburn , n b i n d m i n i )

18 p r e d i c t i o n <− p r e d i c t i o n s ( c l a s s i f , M. t e s t )

20 # # # # # # # # p r i n t i n g p r e d i c t e d l a b e l s # # # # # # # # p r e d i c t i o n @ z r_t o p r e d i c t ;

References

[1] V. Robert. Classification et détection de signaux de pharmacovigilance dans des bases de données de grandes dimensions. PhD Thesis, Université Paris Sud, 2017.

[2] G. Govaert, M. Nadif. Co-Clustering.John Wiley & Sons, 2013.

[3] C. Keribin, G. Govaert, and G. Celeux. Estimation d’un modèle à blocs latents par l’algorithme SEM.42èmes Journées de Statistique, 2010.

M. Selosse¹, J. Jacques¹, C. Biernacki²

1: Université de Lyon, Lyon 2, ERIC EA 3083 ,²: Université Lille 1, Inria