Supplementary material for
GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
Miroslav Kratochvíl, Oliver Hunewald, Laurent Heirendt, Vasco Verissimo, Jiří Vondrášek, Venkata P. Satagopam,
Reinhard Schneider, Christophe Trefois, Markus Ollert
[Figure S1 plot, two panels; axes: assigned clusters vs. manually gated cell type.]
Clustering comparison on Levine13 dataset: FlowSOM mean(F1)=0.6686, GigaSOM mean(F1)=0.6825. Manually gated cell types: CD11b− Monocyte, CD11bhi Monocyte, CD11bmid Monocyte, CMP, Erythroblast, GMP, HSC, Immature B, Mature CD38lo B, Mature CD38mid B, Mature CD4+ T, Mature CD8+ T, Megakaryocyte, MEP, MPP, Myelocyte, Naive CD4+ T, Naive CD8+ T, NK, Plasma cell, Plasmacytoid DC, Platelet, Pre−B I, Pre−B II.
Clustering comparison on Levine32 dataset: FlowSOM mean(F1)=0.8242, GigaSOM mean(F1)=0.8347. Manually gated cell types: Basophils, CD16− NK, CD16+ NK, CD34+CD38+CD123− HSPCs, CD34+CD38+CD123+ HSPCs, CD34+CD38lo HSCs, CD4 T, CD8 T, Mature B, Monocytes, pDCs, Plasma B, Pre B, Pro B.
Figure S1: Comparison of clustering precision between GigaSOM and FlowSOM on the Levine13 and
Levine32 datasets.
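The mean F1 values reported for Figure S1 summarize how well the assigned clusters agree with the manually gated populations. The exact matching procedure is not spelled out here, so the following is only a minimal sketch of one common variant: for each gated population, the best F1 score over all clusters is taken, and these maxima are averaged. The function name `mean_f1` and the toy labels are hypothetical and are not taken from the benchmark.

```julia
# Mean F1 between manual gating and a clustering (illustrative sketch only).
# Assumption: each gated population is matched to the cluster that gives it
# the highest F1 score, and the per-population maxima are averaged.
function mean_f1(gated::AbstractVector, clusters::AbstractVector)
    @assert length(gated) == length(clusters)
    scores = Float64[]
    for g in unique(gated)
        in_g = gated .== g
        best = 0.0
        for c in unique(clusters)
            in_c = clusters .== c
            tp = count(in_g .& in_c)      # cells that are in both the population and the cluster
            tp == 0 && continue
            prec = tp / count(in_c)       # fraction of the cluster that belongs to the population
            rec = tp / count(in_g)        # fraction of the population captured by the cluster
            best = max(best, 2 * prec * rec / (prec + rec))
        end
        push!(scores, best)
    end
    return sum(scores) / length(scores)
end

# hypothetical toy labels, not data from the paper
gated = ["B", "B", "B", "T", "T", "NK"]
clusters = [1, 1, 2, 2, 2, 3]
println(mean_f1(gated, clusters))
```

Other matching schemes (for example a one-to-one assignment of clusters to populations) would give different numbers, so values computed this way are not necessarily comparable with those in the figure.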
[Figure S2 plot: CPU time consumption of distributed SOM training. Panel columns: 10, 20, 30, 40 and 50 dimensions; panel rows: 10×10 SOM, 20×20 SOM, 40×40 SOM. x-axis: number of allocated CPUs (ticks at 1, 10, 100); y-axis: total single-CPU time spent per cell processed (tick labels from 30 µs to 1 ms); color: cells per process (100Ki, 200Ki, 300Ki).]
Figure S2: Detailed overview of the effect of horizontal scaling on the performance of SOM training
in GigaSOM.jl. The color distinguishes different benchmarked levels of CPU occupancy (100 to 300
thousand cells loaded in memory per CPU).
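As a reading aid for the vertical axis of Figure S2, the quantity "total single-CPU time spent per cell processed" can be understood as the wall-clock time multiplied by the number of allocated CPUs and divided by the number of cells. This interpretation is an assumption, and the timing below is a made-up number used only to show the arithmetic. Under this reading, a flat curve means that adding CPUs costs no extra CPU time per cell.

```julia
# Aggregate CPU time per processed cell (assumed definition, see the text above):
# wall-clock time multiplied by the number of CPUs, divided by the cell count.
cpu_time_per_cell(wall_seconds, n_cpus, n_cells) = wall_seconds * n_cpus / n_cells

cells_per_process = 100 * 1024           # "100Ki" cells loaded per CPU
n_cpus = 10
wall_seconds = 5.0                       # hypothetical measurement, not from the paper
n_cells = cells_per_process * n_cpus     # total number of cells in the run

t = cpu_time_per_cell(wall_seconds, n_cpus, n_cells)
println(round(t * 1e6, digits=2), " us per cell")
```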
[Figure S3 plot: Speedups from spatial indexing structures. Panel columns: 10, 20, 30, 40 and 50 dimensions; panel rows: 10×10 SOM, 20×20 SOM, 40×40 SOM. Axes: speedup with kd-trees and speedup with ball-trees (tick labels from 0.5× to 3×); color: SOM training, SOM classification, EmbedSOM.]
Figure S3: Detailed overview of the speedups gained by using spatial indexes to accelerate the
nearest neighbor queries in the implemented algorithms.
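The kind of query that these spatial indexes accelerate can be sketched with the NearestNeighbors package, which the listing in this supplement also passes to the training through `knnTreeFun`. The example below is a sketch on random data under assumed sizes (a 20-dimensional codebook of a 20×20 SOM and 10,000 cells). It is not the benchmark code behind Figure S3, and the timings it prints depend on the machine.

```julia
using NearestNeighbors, Random

Random.seed!(1)
codebook = rand(20, 400)     # hypothetical SOM codebook, one neuron per column
cells = rand(20, 10_000)     # hypothetical cells, one per column

# the three index types share the same query interface
brute = BruteTree(codebook)  # exhaustive search, used as the baseline
kd = KDTree(codebook)
ball = BallTree(codebook)

# index of the nearest codebook vector for each cell
idx_brute, _ = knn(brute, cells, 1)
idx_kd, _ = knn(kd, cells, 1)
idx_ball, _ = knn(ball, cells, 1)

# all three structures return the same neighbors; only the query time differs
println(idx_kd == idx_brute && idx_ball == idx_brute)

for (name, tree) in (("brute force", brute), ("kd-tree", kd), ("ball-tree", ball))
    t = @elapsed knn(tree, cells, 1)
    println(name, ": ", round(t * 1e3, digits=1), " ms")
end
```

Whether a tree is faster than the exhaustive search depends on the dimensionality and on the number of indexed points, which is why Figure S3 breaks the results down by both.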
[Figure S4 image annotations: CD8 memory; CD8 naive; CD4 resting Thelper; CD4 effector Thelper; CD4 resting Treg; CD4 effector Treg; NK; NKT; γδTCR; autofluorescent cells; dead cells; unmarked leukocytes; debris.]
Figure S4: Larger version of Figure 5, an embedding of more than 1 billion cells from the IMPC
dataset (FR-FCM-ZYX9). The main lineage markers are highlighted as combinations of RGB colors:
CD8 in red, CD4 in green and CD161 in blue. Annotations are based on the presence of several
markers that are not highlighted in the embedding, mainly the Live/Dead marker, γδTCR, CD25,
GITR, and fluorescence scatter intensities.
using Distributed, GigaSOM, GigaScatter, ClusterManagers
import Glob, NearestNeighbors, Distributions

# find the filenames to process
files = Glob.glob("TC_SPL_*.fcs")

# read the number of available workers from environment and start the worker processes
n_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(n_workers, topology=:master_worker)

# load the required packages on all workers
@everywhere using GigaSOM
@everywhere using GigaScatter

# load the data to workers
dataset = loadFCSSet(:IMPCdata, files)

# collect the total number of cells from all workers
dataset_size = distributed_mapreduce(dataset, d -> size(d,1), +)
@info "Loaded total $dataset_size cells."

# preprocess the distributed data
dselect(dataset, Vector(1:18))                  # columns 1-18 contain interesting information
dtransform_asinh(dataset, Vector(7:18), 500.0)  # transform the marker expressions
dscale(dataset, Vector(1:18))                   # scale all columns

# train the SOM and run the embedding (this uses all available Slurm workers)
som = initGigaSOM(dataset, 32, 32)
som = trainGigaSOM(som, dataset, epochs=30,
    rFinal=0.1, radiusFun=expRadius(-20.0),
    knnTreeFun=NearestNeighbors.BallTree)
e = embedGigaSOM(som, dataset, k=16, output=:embeddedIMPC)

# setup constants for rasterization of the embedding
rasterSize = (2048, 2048)
alpha = 0.003
dist = Distributions.Normal()

# run the distributed rasterization
compound_raster = mixedRaster(distributed_mapreduce(
    [dataset, e],
    (d, e) -> begin
        # convert the normalized expressions to numbers between 0 and 1
        colors = hcat(
            Distributions.cdf.(dist, d[:,10]),  # R: CD8 column
            Distributions.cdf.(dist, d[:,16]),  # G: CD4 column
            Distributions.cdf.(dist, d[:,15]),  # B: CD161 column
            fill(alpha, size(d,1)))
        # rasterize the local part of the data
        mixableRaster(rasterize(
            rasterSize,
            Matrix{Float64}(e'), Matrix{Float64}(colors'),
            xlim=(-2,33), ylim=(-2,33)))
    end,
    mixRasters))

# save the result to a PNG
savePNG("impc-embedding.png", compound_raster)
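
The preprocessing step of the listing (asinh transformation of the marker columns with cofactor 500, followed by scaling of all columns) can be tried out on a small local matrix without any cluster setup. The sketch below assumes that the transformation has the form asinh(x / cofactor) and that scaling means zero mean and unit standard deviation per column. Both are assumptions about what `dtransform_asinh` and `dscale` do, and the random matrix merely stands in for real cytometry data.

```julia
using Statistics

# hypothetical local data: 1000 cells (rows) x 18 parameters (columns)
data = abs.(randn(1000, 18)) .* 1000

# asinh transformation of the marker columns (assumed form: asinh(x / cofactor))
cofactor = 500.0
marker_cols = 7:18
data[:, marker_cols] .= asinh.(data[:, marker_cols] ./ cofactor)

# scale every column to zero mean and unit standard deviation (assumed meaning of scaling)
for j in axes(data, 2)
    col = view(data, :, j)
    col .= (col .- mean(col)) ./ std(col)
end

# check that every column is centered after scaling
println(all(abs.(mean(data, dims=1)) .< 1e-8))
```

In the distributed setting the same operations run on each worker's part of the data, with the column statistics combined across workers, which is why the listing calls dedicated distributed functions instead.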