• Aucun résultat trouvé

mixedClust: an R package for mixed data classiication, clustering and co-clustering

N/A
N/A
Protected

Academic year: 2021

Partager "mixedClust: an R package for mixed data classiication, clustering and co-clustering"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01949171

https://hal.archives-ouvertes.fr/hal-01949171

Submitted on 9 Dec 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

mixedClust: an R package for mixed data classiication, clustering and co-clustering

Margot Selosse, Julien Jacques, Christophe Biernacki

To cite this version:

Margot Selosse, Julien Jacques, Christophe Biernacki. mixedClust: an R package for mixed data classiication, clustering and co-clustering. 25th Summer Session Working Group on Model-Based Clustering, Jul 2018, Ann Arbor, United States. �hal-01949171�

(2)

mixedClust: an R package for mixed data classification, clustering and co-clustering

Package functionalities

The package provides model-based algorithm for clustering, co-clustering and classification with mixed-type data.

Principal functions are:

1 m i x e d C l u s t # t o p e r f o r m c l u s t e r i n g m i x e d C o c l u s t # t o p e r f o r m coc l u s t e r i n g

3 m i x e d C l a s s i f # t o p e r f o r m c l a s s i f i c a t i o n , i n a p a r s i m o n i o u s way o r n o t p r e d i c t i o n s # u s e t h e r e s u l t from m i x e d C l a s s i f f o r p r e d i c t i o n s

Notations

I x:N rows andJ = J1+. . .+Jd +. . .+JD columns I xcomposed of several matrices :x1, . . . ,xD I xd:N ×Jd matrix

I xd is made of variables from one of5different types:Continuous,Nominal, Ordinal,IntegerorFunctional.

I In unsupervised methods:Gclusters in line,H1. . .HDclusters in column

x=

x1

. . .

xD

,xd =(xijd)1≤i≤N; 1≤j≤Jd.

Models (for D = 2)

Legend :-Observed partitions-Latent partitions

I Clustering:

p(x;Θ)= Í

v Vp(v;Θ) ×p(x1|v;Θ)p(x2|v;Θ)

I Co-clustering:

p(x;Θ)= Í

v,w1,w2

p(v;Θ)p(w1;Θ)p(w2;Θ) ×p(x1|v,w1;Θ)p(x2|v,w2;Θ)

I Classification without parsimony:

p(x;Θ)=p(v;Θ) ×p(x1|v;Θ)p(x2|v;Θ)

I Classification with parsimony (obtained by clustering the features):

p(x;Θ)= Í

w1,w2

p(v;Θ)p(w1;Θ)p(w2;Θ) ×p(x1|v,w1;Θ)p(x2|v,w2;Θ)

The parameters we want to estimate are:

Θ= αдhd ,γд, ρhd

1≤h≤Hdand1≤d≤D

I αдhd : parameters of distribution ofдth row-cluster andhthcolumn-cluster of xd.It will depend on the type ofxd.

I γд: mixing proportion ofдth row-cluster

I ρdh: mixing proportion ofhth column cluster forxd

Inference

EM and BIC not tractable in co-clustering, due to the double missing structure.

Consequently, we use:

I Stochastic EM algorithm, with a Gibbs sampler for the latent variables simulation

I ICL-BIC criterion for model selection

Results for classification on real dataset

Dataset

I Trauma-survey:823persons answered to88psychological questions about anxiety, depression, anger and possibly traumatizing life events.307of them were diagnosed with trauma, and516were declared not traumatized.

I x1: Categorical data from17questions about traumatizing life events.

I x2: Ordinal data from71questions about anger, depression and anxiety.

I 2/3of the dataset was used to train the model. The last1/3was then used for prediction.

Results

precision recall specificity not parsimonious 0.75 0.80 0.83

(H1,H2)= (3,8) 0.78 0.88 0.85 (H1,H2)= (2,5) 0.82 0.92 0.88 Table:Precision, recall and specificity for different kc.

On classification with parsimony:

I Better results are obtained on predictions when we introduce parsimony than when we don’t.

I Parsimony training result gives less parameters, which makes easier the interpretation.

Features clusters for parsimonious classification

Figure:Patients classification and features clusters. Categorical answers about life events on the left. Ordinal answers about Anger/Anxiety/Depression on the right.

R code

# # # # # # # # D e f i n i n g t h e d a t a s e t p r o p e r t i e s # # # # # #

2 d l i s t = c( 0 , 1 7 ) # d e f i n i n g where e a c h t y p e b e g i n s i n t h e c o m p l e t e d a t a s e t

d i s t r i b = c(" M u l t i n o m i a l " ," Bos ") # d e f i n i n g t h e d i s t r i b u t i o n t y p e s

4

# # # # # # # # d e f i n i n g t h e SEM−G i b b s a l g o r i t h m c o n f i g u r a t i o n # # # # # # # #

6 nbSEM = 2 5 0 # t o t a l number o f i t e r a t i o n s nbSEMburn = 2 0 0 # burni n p e r i o d

8 n b i n d m i n i = 10 # minimum number o f e l e m e n t s i n one b l o c k i n i t = " kmeans " # i n i t i a l i z a t i o n t y p e

10

# # # # # # # # d e f i n i n g t h e number o f c l u s t e r s # # # # # # # #

12 k r = 2 # Two c l a s s e s : T r a u m a t i z e d/Not T r a u m a t i z e d

kc = c( 2 , 5 ) # I n t r o d u c i n g p a r s i m o n y ( p u t t o 0 f o r no p a r s i m o n y )

14

# # # # # # # # r u n n i n g t h e c l a s s i f i c a t i o n f u n c t i o n # # # # # # # #

16 c l a s s i f = m i x e d C l a s s i f (M. t r a i n , y . t r a i n , d l i s t , k r = kr , kc = kc , i n i t , d i s t r i b , nbSEM , nbSEMburn , n b i n d m i n i )

18 p r e d i c t i o n <− p r e d i c t i o n s ( c l a s s i f , M. t e s t )

20 # # # # # # # # p r i n t i n g p r e d i c t e d l a b e l s # # # # # # # # p r e d i c t i o n @ z r_t o p r e d i c t ;

References

[1] V. Robert. Classification et détection de signaux de pharmacovigilance dans des bases de données de grandes dimensions. PhD Thesis, Université Paris Sud, 2017.

[2] G. Govaert, M. Nadif. Co-Clustering.John Wiley & Sons, 2013.

[3] C. Keribin, G. Govaert, and G. Celeux. Estimation d’un modèle à blocs latents par l’algorithme SEM.42èmes Journées de Statistique, 2010.

M. Selosse1, J. Jacques1, C. Biernacki2

1: Université de Lyon, Lyon 2, ERIC EA 3083 ,2: Université Lille 1, Inria

Références

Documents relatifs

These normalized count rates were then transformed into local isotopie compositions by comparing the average SIMS values with radiochemical^ determined isotopie compositions

The ZIP employ two different process : a binary distribution that generate structural

In order to investigate the effect of the sample size on clustering results in high- dimensional spaces, we try to cluster the data for different dataset sizes as this phenomenon

We set up a multi-stage model in which choosing between patent and trade secrecy is a¤ected by three parameters : the patent strength de…ned as the probability that the right is

After adjustment for age, smoking status, – – pack-years, pack-years , energy intake, race/ethnicity, US region, body mass index and physical activity, the consumption of cured meats

To test whether the vesicular pool of Atat1 promotes the acetyl- ation of -tubulin in MTs, we isolated subcellular fractions from newborn mouse cortices and then assessed

Néanmoins, la dualité des acides (Lewis et Bronsted) est un système dispendieux, dont le recyclage est une opération complexe et par conséquent difficilement applicable à

Cette mutation familiale du gène MME est une substitution d’une base guanine par une base adenine sur le chromosome 3q25.2, ce qui induit un remplacement d’un acide aminé cystéine