HAL Id: hal-01884822
https://hal.inria.fr/hal-01884822
Submitted on 1 Oct 2018
blockcluster, simerge and C++ with R
Serge Iovleff, Seydou Nourou Sylla
To cite this version:
Serge Iovleff, Seydou Nourou Sylla. blockcluster, simerge and C++ with R. Mixture Models: Theory and Applications, Jun 2018, Paris, France. hal-01884822
Modal project-team; G4BBM team, Institut Pasteur de Dakar
Summary
blockcluster package
simerge: Block clustering of binary data with Gaussian co-variables
C++ Programming with R: The simerge package
Preliminary Results
References
blockcluster package
Co-Clustering
“Aims to organize data-set into a set of homogeneous blocks by simultaneous clustering of individuals and variables.”
Figure: Binary data set (a), data reorganized by a partition on I (b), by partitions on I and J simultaneously (c), and summary matrix (d).
Model Based Approach
$x$ is a data set doubly indexed by a set $I$ with $n$ elements (individuals) and a set $J$ with $m$ elements (variables).
$z = (z_{11}, \ldots, z_{ng})$ with $z_{ik} = 1$ if $i$ belongs to cluster $k$ and $z_{ik} = 0$ otherwise,
$w = (w_{11}, \ldots, w_{md})$ with $w_{j\ell} = 1$ if $j$ belongs to cluster $\ell$ and $w_{j\ell} = 0$ otherwise,
\[
f(x;\theta) = \sum_{(z,w)\in Z\times W} p(z;\theta)\,p(w;\theta)\,f(x \mid z,w;\theta) \tag{1}
\]
where $Z$ and $W$ denote the sets of all possible labellings $z$ of $I$ and $w$ of $J$. There are $g^n \times d^m$ possible labellings.
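To give a feel for the size of the sum in (1), the number of labellings $g^n \times d^m$ can be evaluated on a log scale. The sketch below is only an illustration; the function name `log10Labellings` is hypothetical and is not part of blockcluster.

```cpp
#include <cassert>
#include <cmath>

// log10 of the number of labellings g^n * d^m appearing in equation (1).
// Working on the log scale avoids overflow for realistic n and m.
double log10Labellings(int n, int g, int m, int d)
{
  assert(n > 0 && g > 0 && m > 0 && d > 0);
  return n * std::log10(static_cast<double>(g))
       + m * std::log10(static_cast<double>(d));
}
```

With n = 100, g = 2, m = 50 and d = 3, the result is just under 54, that is, roughly $10^{54}$ terms, which is why the sum cannot be enumerated directly.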
blockcluster: R package for co-clustering
- R interface to the C++ library coclust (using STK++ in the background),
- Simple and robust API,
- Extends four basic functions: "Plot", "Summary", "Show", "Print",
- Implements an "intelligent" estimation strategy.
Example
    library(blockcluster)
    data(gaussiandata)
    out <- coclusterGaussian(gaussiandata, model = "pi_rho_sigma2kl", nbcocluster = c(2, 3))
    plot(out)
    plot(out, type = "distribution")
Example: Gaussian distribution
Figure: Simulated and co-clustered data (a), data block distributions (b)
Example: Binary distribution
Figure: Simulated and co-clustered data (a), data block distributions (b)
Example: Categorical distribution
Figure: Simulated and co-clustered data (a), data block distributions (b). Panel (a) shows the original and the co-clustered data with a color scale over the levels 1 to 5. Panel (b) shows the histogram/density of the data values in each block (1,1) to (3,2), the mixtures of rows 1 to 3 and of columns 1 and 2, and the final mixture.
Example: Poisson distribution
Figure: Simulated and co-clustered data (a), data block distributions (b). Panel (a) shows the original and the co-clustered data with a color scale from 0 to 30. Panel (b) shows histograms of the data values in each block (1,1) to (2,3), the mixture densities of rows 1 and 2 and of columns 1 to 3, and the final mixture density (histograms of classes of contingency data).
Development history
- First versions developed during ADT coclust (October 2011 to October 2013). They implement binary, Poisson and Gaussian models, with the BEM and BCEM algorithms.
- Release 3.0 in 2014:
  1. Added support for categorical data,
  2. Added Bayesian inference estimation algorithms,
  3. But remained unstable in certain situations (crashes).
- Release 4.0 in November/December 2015:
  1. Uses STK++ as the background library (the code became cleaner and more compact),
  2. Fixed (a lot of) crash issues.
- Enhancements in release 4.2 in November/December 2016 (ADT Massicc):
  1. Added selection criteria,
  2. Added Gibbs estimation algorithms.
simerge: Block clustering of binary data with Gaussian co-variables
Simerge : Statistical Inference for the Management of Extreme Risks, Genetics and Global Epidemiology
http://mistis.inrialpes.fr/simerge/index.html
SIMERGE is a LIRIMA project-team started in January 2015. It includes:
- Mistis (Inria Grenoble - Rhône-Alpes, France)
- LERSTAD (Laboratoire d'Etudes et de Recherches en Statistiques et Développement, Université Gaston Berger, Sénégal)
- IRD (Institut de Recherche pour le Développement, G4BBM team, Dakar, Sénégal)
- LEM (Lille Economie et Management, Université Lille 2)
- Modal (Inria Lille Nord-Europe)
The associate team is built on two research themes:
1. Spatial extremes, with application to the management of extreme risks
2. Classification, with application to genetics and global epidemiology
Challenge
Build statistical models in order to test association between diseases and human host genetics in a context of genome-wide screening.
Figure: Genotypes on 719,656 SNPs (Single Nucleotide Polymorphisms) typed on 481 individuals in Senegal, in a rural area where malaria and arboviral diseases are endemic. One quantitative malaria phenotype measured on two sites: the individual effect on the risk of having a malaria attack (iPFA).
Statistical Model
"For blockcluster to reveal a genetic cause of the iPFA, the populations would have to have been exposed to the disease for several millennia" (Cheikh Loucoubar, head of G4BBM)
$x$ is a binary data set.
$y$ is a data set (co-variables) of $\mathbb{R}^p$ indexed by $I$.
The classical block model formulation for binary data is extended to
\[
f(x,y;\theta) = \sum_{(z,w)\in Z\times W} p(z;\theta)\,p(w;\theta)\,f(x \mid y,z,w;\theta)\,f(y \mid z;\theta). \tag{2}
\]
The dependency between $x_{ij}$ and $y_i$ is modeled by the canonical link for binary response data:
\[
f(x_{ij} \mid y_i, \beta_{z_i w_j}) = \operatorname{logis}(\beta_{z_i w_j}^T y_i)^{x_{ij}} \bigl(1-\operatorname{logis}(\beta_{z_i w_j}^T y_i)\bigr)^{1-x_{ij}} \tag{3}
\]
\[
f(y \mid z;\theta) = \prod_i \phi(y_i;\mu_{z_i},\Sigma_{z_i})
\]
with $\phi$ the multivariate Gaussian density.
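Taking the log of equation (3) gives $x_{ij}\,\eta - \log(1+e^{\eta})$ with $\eta = \beta^T y_i$. The following is a minimal standalone sketch of that log-density in plain C++ (it does not use STK++); the names `log1pExp` and `logBernoulliLogit` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable evaluation of log(1 + exp(eta)).
double log1pExp(double eta)
{
  return (eta > 0.0) ? eta + std::log1p(std::exp(-eta))
                     : std::log1p(std::exp(eta));
}

// Log of equation (3): x * eta - log(1 + exp(eta)), with eta = beta^T y.
double logBernoulliLogit(int x, std::vector<double> const& y,
                         std::vector<double> const& beta)
{
  assert(x == 0 || x == 1);
  assert(y.size() == beta.size());
  double eta = 0.0;
  for (std::size_t h = 0; h < y.size(); ++h) { eta += beta[h] * y[h]; }
  return x * eta - log1pExp(eta);
}
```

The branch in `log1pExp` keeps the exponential from overflowing when the linear predictor is large, which matters once many SNP columns are summed.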
Estimation
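The EM algorithm is not feasible because the quantity $e_{ikj\ell} = P(z_{ik} w_{j\ell} = 1 \mid x, y, \theta)$ is not computable.
Take $q(z,w) = t(z)\,r(w) = t \times r$ with $t$ and $r$ matrices of sizes $(n,g)$ and $(m,d)$; then
\[
l(\theta) = \tilde{F}_C(t,r,\theta) + KL\bigl(q(z,w)\,\|\,p(z,w \mid x,y,\theta)\bigr) \tag{4}
\]
with $KL(q\,\|\,p)$ denoting the Kullback-Leibler divergence and $\tilde{F}_C$ denoting the free energy or fuzzy criterion
\[
\tilde{F}_C(t,r,\theta) = \sum_k t_{.k}\log\pi_k + \sum_\ell r_{.\ell}\log\rho_\ell + \sum_{i,j,k,\ell} t_{ik}\,r_{j\ell}\,\log f(x_{ij},y_i;\theta_{k\ell}) + H(t) + H(r) \tag{5}
\]
with $H(t)$ and $H(r)$ denoting the entropies of $t$ and $r$.
Maximization of the likelihood $l(\theta)$ is replaced by the following maximization:
\[
\operatorname*{argmax}_{t,r,\theta}\ \tilde{F}_C(t,r,\theta).
\]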
BEM algorithm
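Initialization: Set $t^{(0)}$, $r^{(0)}$ and $\theta^{(0)} = (\pi^{(0)}, \rho^{(0)}, \beta^{(0)}, \mu^{(0)}, \Sigma^{(0)})$.
(a) Row-EStep: Compute $t^{(c+1)}$ using the formula
\[
t_{ik}^{(c+1)} = \frac{\pi_k^{(c)} \prod_{j,l} \bigl[f(x_{ij} \mid y_i;\beta_{kl}^{(c)})\,\phi(y_i;\mu_k^{(c)},\Sigma_k^{(c)})\bigr]^{r_{jl}^{(c)}}}{\sum_{k'} \pi_{k'}^{(c)} \prod_{j,l} \bigl[f(x_{ij} \mid y_i;\beta_{k'l}^{(c)})\,\phi(y_i;\mu_{k'}^{(c)},\Sigma_{k'}^{(c)})\bigr]^{r_{jl}^{(c)}}}.
\]
(b) Row-MStep: Compute $\pi^{(c+1)}$, $\mu^{(c+1)}$, $\Sigma^{(c+1)}$ and estimate $\beta^{(c+1/2)}$.
(c) Col-EStep: Compute $r^{(c+1)}$ using the formula
\[
r_{jl}^{(c+1)} = \frac{\rho_l^{(c)} \prod_{i,k} f(x_{ij} \mid y_i;\beta_{kl}^{(c+1/2)})^{t_{ik}^{(c+1)}}}{\sum_{l'} \rho_{l'}^{(c)} \prod_{i,k} f(x_{ij} \mid y_i;\beta_{kl'}^{(c+1/2)})^{t_{ik}^{(c+1)}}}.
\]
(d) Col-MStep: Compute $\rho^{(c+1)}$ and estimate $\beta^{(c+1)}$.
Iterate: Repeat (a)-(b)-(c)-(d) until convergence.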
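Both E-steps end with a normalization of products of many small densities, which is usually carried out on the log scale. The sketch below shows one common way to do this (the log-sum-exp trick) for a single row of $t$; it assumes the unnormalized log-weights $\log\pi_k + \sum_{j,l} r_{jl}\log f(\cdot)$ have already been computed, and the function name `normalizeLogWeights` is hypothetical. It is not taken from the simerge sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Turn unnormalized log-weights (one per cluster k) into probabilities
// t_i1, ..., t_ig summing to one, without underflow.
std::vector<double> normalizeLogWeights(std::vector<double> const& logWeights)
{
  assert(!logWeights.empty());
  // Subtracting the maximum leaves the ratios unchanged but keeps exp() in range.
  double const maxLog = *std::max_element(logWeights.begin(), logWeights.end());
  std::vector<double> t(logWeights.size());
  double total = 0.0;
  for (std::size_t k = 0; k < t.size(); ++k)
  {
    t[k] = std::exp(logWeights[k] - maxLog);
    total += t[k];
  }
  for (std::size_t k = 0; k < t.size(); ++k) { t[k] /= total; }
  return t;
}
```

The same routine applies unchanged to a row of $r$ in the Col-EStep, with the log-weights built from $\log\rho_l$.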
Measuring contribution of a variable
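$m_l$ denotes the number of columns with label $l$, i.e. $m_l = \#\{w_{jl} = 1,\ j = 1,\ldots,m\}$, and for a fixed row $i$ let $m_{il}$ denote the number of elements such that $w_{jl} = 1$ and $x_{ij} = 1$, i.e. $m_{il} = \#\{w_{jl}\,x_{ij} = 1,\ j = 1,\ldots,m\}$. The posterior probability of the co-variable $y$ is
\[
f(y \mid x,z,w,\theta) \propto \prod_{i=1}^{n} \pi_{z_i}\,\phi(y_i;\mu_{z_i},\Sigma_{z_i}) \prod_{l=1}^{d} \rho_l^{m_l}\, \frac{e^{m_{il}\, y_i^T \beta_{z_i l}}}{\bigl(1+e^{y_i^T \beta_{z_i l}}\bigr)^{m_l}} \tag{6}
\]
Taking the log, the contribution of the $j$th variable is computed as
\[
I(j) = \log\rho_{w_j} + \sum_{i=1}^{n} \Bigl( x_{ij}\, y_i^T \beta_{z_i w_j} - \log\bigl(1+\exp(y_i^T \beta_{z_i w_j})\bigr) \Bigr) \tag{7}
\]
using the MAP estimator for $z$ and $w$.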
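Equation (7) translates almost line by line into code. The sketch below is a plain C++ illustration under stated assumptions: labels are stored as 0-based cluster indices (the MAP estimates of $z$ and $w$), `beta[k][l]` holds the vector $\beta_{kl}$, and the function name `contribution` is hypothetical rather than the simerge API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;

// Contribution I(j) of column j, following equation (7).
// x: n x m binary data, y: n x p co-variables,
// z, w: MAP row and column labels (0-based), rho: column proportions.
double contribution(std::size_t j,
                    std::vector<std::vector<int> > const& x,
                    std::vector<Vec> const& y,
                    std::vector<std::size_t> const& z,
                    std::vector<std::size_t> const& w,
                    Vec const& rho,
                    std::vector<std::vector<Vec> > const& beta)
{
  assert(j < w.size());
  assert(x.size() == y.size() && x.size() == z.size());
  std::size_t const l = w[j];
  double value = std::log(rho[l]);
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    Vec const& b = beta[z[i]][l];
    double eta = 0.0;  // linear predictor y_i^T beta_{z_i w_j}
    for (std::size_t h = 0; h < b.size(); ++h) { eta += y[i][h] * b[h]; }
    // stable log(1 + exp(eta))
    double const log1pExp = (eta > 0.0) ? eta + std::log1p(std::exp(-eta))
                                        : std::log1p(std::exp(eta));
    value += x[i][j] * eta - log1pExp;
  }
  return value;
}
```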
C++ Programming with R: The simerge package
Extreme Programming (XP)
Extreme Programming [1] is a discipline of software development based on values of simplicity, communication, feedback, courage, and respect.
- Simple Design: XP teams build software to a simple but always adequate design. They start simple, and through programmer testing and design improvement, they keep it that way.
- Pair Programming: All production software in XP is built by two programmers, sitting side by side, at the same machine.
- Test-Driven Development: XP is obsessed with feedback, and in software development, good feedback requires good testing.
- Design Improvement (Refactoring): XP focuses on delivering business value in every iteration. To accomplish this over the course of the whole project, the software must be well-designed.
- Coding Standard: XP teams follow a common coding standard, so that all the code in the system looks as if it was written by a single (very competent) individual.
[1] https://ronjeffries.com/xprog/what-is-extreme-programming/
Design and Coding Standard
Use an S4 class on the R side and a mirror C++ class.
R side:
    setClass(
      Class = "CoClusterBinary",
      representation = representation(
        # y part
        yid = "matrix",          # covariables
        mukd = "matrix",         # means of yid
        sigmakd = "matrix",      # standard deviations
        isCoMixture = "logical", # yid is a mixture?
        # x part
        xij = "matrix",
        # ....

    # Constructor of the S4 class
    setMethod(
      f = "initialize",
      signature = c("CoClusterBinary"),
      definition = function(.Object, x, y, nbcocluster, isCoMixture)
Mirror C++ class:
    class CoClusterBinaryModel: public STK::IRunnerBase
    {
      public:
        // constructor of the C++ class
        CoClusterBinaryModel(Rcpp::S4 s4Model);
        // ....
        STK::RMatrix<double> yid_;
        STK::RMatrix<double> mukd_;
        STK::RMatrix<double> sdkd_;
        bool isCoMixture_;
        STK::RMatrix<double> xij_;
The C++ constructor gets the R structures and wraps them as STK++ arrays:
    CoClusterBinaryModel::CoClusterBinaryModel(Rcpp::S4 s4Model) :
      // .....
      , yid_((SEXP)s4Model.slot("yid"))
      , mukd_((SEXP)s4Model.slot("mukd"))
      , sdkd_((SEXP)s4Model.slot("sigmakd"))
      , isCoMixture_(s4Model.slot("isCoMixture"))
      , xij_((SEXP)s4Model.slot("xij"))
      // .....
Example: Computation of the Fuzzy Criterion $\tilde{F}_C$
R side
    setMethod(
      f = "logLikelihood",
      signature = "CoClusterBinary",
      definition = function(object) {
        # call the compiled routine; .Call expects the argument name PACKAGE
        .Call("logLikelihood", object, PACKAGE = "simerge")
      }
    )
C++ side
    extern "C" SEXP logLikelihood(SEXP model)
    {
      Rcpp::S4 s4model(model);
      CoClusterBinaryModel coclust(model);
      coclust.computeLogLikelihood();
      coclust.getValues(model);
      return model;
    }
The criterion to compute is
\[
\tilde{F}_C(t,r;\theta) = \sum_k t_{.k}\log\pi_k + \sum_\ell r_{.\ell}\log\rho_\ell + H(t) + H(r) + \sum_{i,j,k,\ell} t_{ik}\,r_{j\ell}\Bigl[ x_{ij}\,y_i^T\beta_{k\ell} - \log\bigl(1+\exp(y_i^T\beta_{k\ell})\bigr) + \log\phi(y_i;\mu_k,\Sigma_k) \Bigr]
\]
R side (entropies $H(t)$ and $H(r)$)
    setMethod(
      f = "entropy",
      signature = "CoClusterBinary",
      definition = function(object) {
        epsilon <- 1e-15   # guards against log(0)
        tik <- object@tik
        rjl <- object@rjl
        object@rowEntropy <- -sum(tik * log(epsilon + tik))
        object@colEntropy <- -sum(rjl * log(epsilon + rjl))
        return(object)
      }
    )
C++ side (entropies $H(t)$ and $H(r)$)
    rowEntropy_ = -tik_.prod((tik_ + RealMin).log()).sum();
    colEntropy_ = -rjl_.prod((rjl_ + RealMin).log()).sum();
R side (block term of the criterion)
    for (k in 1:K) {
      for (l in 1:L) {
        # linear predictor y_i^T beta_kl for every individual i
        eta <- as.vector(yid %*% betakld[k, l, ])
        # sum_i t_ik x_ij eta_i for each column j (1 x m), plus the scalar
        # sum_i t_ik log(1 - logis(eta_i)), then weighted by r_jl
        object@likelihoodkl[k, l] <-
          (crossprod(tik[, k] * eta, xij)
           + sum(tik[, k] * plogis(eta, 0, 1, FALSE, TRUE))) %*% rjl[, l]
      }
    }
C++ side (block term of the criterion)
    for (int k = 0; k < K_; ++k)
    {
      for (int l = 0; l < L_; ++l)
      {
        likelihoodkl_(k, l)
          = ( tik_.col(k).prod(yid_ * betakld_(k, l)).transpose() * xij_
            + tik_.col(k).dot((yid_ * betakld_(k, l)).lcdfc(logis_))
            ) * rjl_.col(l);
      }
    }
C++ side (assembling the criterion)
    gaussianLogLikelihood_ = computeGaussianLogLikelihood();
    fuzzyLogLikelihood_ = likelihoodkl_.sum()
                        + tk_.dot(pik_.log())
                        + rl_.dot(rhol_.log())
                        + gaussianLogLikelihood_;
    fuzzyCriterion_ = fuzzyLogLikelihood_ + rowEntropy_ + colEntropy_;
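For readers without STK++ at hand, the same assembly can be sketched with standard containers. This is an illustration only, not the simerge code: the names `entropy`, `proportionTerm` and `fuzzyCriterion` are hypothetical, the data-dependent part (block and Gaussian terms) is assumed to be precomputed and passed in as `dataTerm`, and zeros are skipped (convention $0\log 0 = 0$) instead of adding a small constant as the slides do.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Entropy H(t) = - sum_{i,k} t_ik log t_ik, with the convention 0 log 0 = 0.
double entropy(Matrix const& t)
{
  double h = 0.0;
  for (std::size_t i = 0; i < t.size(); ++i)
    for (std::size_t k = 0; k < t[i].size(); ++k)
      if (t[i][k] > 0.0) { h -= t[i][k] * std::log(t[i][k]); }
  return h;
}

// Proportion term sum_k t_.k log pi_k, where t_.k is the k-th column sum of t.
double proportionTerm(Matrix const& t, std::vector<double> const& pi)
{
  double s = 0.0;
  for (std::size_t i = 0; i < t.size(); ++i)
  {
    assert(t[i].size() == pi.size());
    for (std::size_t k = 0; k < pi.size(); ++k)
      if (t[i][k] > 0.0) { s += t[i][k] * std::log(pi[k]); }
  }
  return s;
}

// Fuzzy criterion: data term + proportion terms + entropies of t and r.
double fuzzyCriterion(double dataTerm, Matrix const& t, Matrix const& r,
                      std::vector<double> const& pi,
                      std::vector<double> const& rho)
{
  return dataTerm + proportionTerm(t, pi) + proportionTerm(r, rho)
       + entropy(t) + entropy(r);
}
```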
Preliminary Results
Data set
$n = 444$ individuals and $m = 515{,}721$ SNPs retained.
Figure: Histogram of the iPFA variable and fitted Gaussian mixture models obtained with the MixAll package
Model selection
ICL BIC-like approximations lead to the following criterion:
\[
BIC(g,d) = -2\max_\theta \log f(x,y;\theta) + (g-1)\log n + \lambda\log n + (d-1)\log m + g\,d\,(p+1)\log(mn)
\]
with $\lambda$ the number of parameters of the $y$ distribution.
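The penalty terms above are simple to evaluate once the maximized log-likelihood is available. The sketch below is a direct transcription of the formula as written on the slide (the slides themselves note that the implemented criterion was wrong, so this is not a reproduction of that implementation); the function name `bicLike` and its argument names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// BIC(g, d) as stated above. maxLogLik is max_theta log f(x, y; theta),
// lambda the number of parameters of the y distribution, p the dimension
// of the co-variables, n and m the numbers of rows and columns.
double bicLike(double maxLogLik, int g, int d, int p, int lambda,
               long n, long m)
{
  assert(g > 0 && d > 0 && p >= 0 && lambda >= 0 && n > 0 && m > 0);
  double const logN = std::log(static_cast<double>(n));
  double const logM = std::log(static_cast<double>(m));
  // log(mn) is computed as log m + log n to avoid overflowing the product.
  return -2.0 * maxLogLik
       + (g - 1) * logN
       + lambda * logN
       + (d - 1) * logM
       + static_cast<double>(g) * d * (p + 1) * (logM + logN);
}
```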
Figure: Choosing the number of blocks (Note: the implemented criterion was wrong)
Results with (g, d) = (2, 22) and y a Gaussian mixture
Figure: iPFA density (a), proportion of mutation (b), BIC = 290551317
Results with (g, d) = (2, 22) and y a Gaussian random variable
Figure: iPFA density (a), proportion of mutation (b), BIC = 287770996
Influence Measure
Figure: Distribution of the influence across clusters (by columns)
References
Thanks to the G4BBM team
- Maryam DIARRA, Biomathematician: PhD in Applied Mathematics, Saint Louis University (UGB)
- Mamadou DIOP, Computer Scientist, Bioinformatician: Master in Computer Science, Saint Louis University (UGB)
- Cheikh LOUCOUBAR, Biomathematician, Head of the Group: PhD in Statistical Genetics, Dakar University / Paris 5
- Amadou DIALLO, Biomathematician: Bachelor in Mathematics, Minot State University, USA
- Mareme S. THIAM, Master Fellow in Mathematics: M2 Mathematics - Big Data, AIMS
- Seydou Nourou SYLLA, Biomathematician: PhD in Applied Mathematics, Saint Louis University (UGB)
- Dame SY, Data Manager: DTS in Computer Science
- Aboubacry GAYE, Master Fellow in Mathematics: M2 Mathematics, Saint Louis University (UGB)
- Mame Malick DIENG, Computer Scientist: Master in Computer Science, Saint Louis University (UGB)
Main Activities
- Research on human host genetic diversity and its implication in malaria phenotypes
- New grant application
Other Activities
- Support IPD units in data management and analysis
- Teaching in collaboration with universities
Links
- http://www.pasteur.sn/recherche/biostatistique-bio-informatique-et-modelisation/
- https://cran.r-project.org/package=blockcluster
- https://cran.r-project.org/package=rtkore
- https://cran.r-project.org/package=MixAll
- http://www.stkpp.org
- https://modal.lille.inria.fr/wikimodal/doku.php