Supplementary material for


GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Miroslav Kratochvíl, Oliver Hunewald, Laurent Heirendt, Vasco Verissimo, Jiří Vondrášek, Venkata P. Satagopam, Reinhard Schneider, Christophe Trefois, Markus Ollert

[Figure S1. Panels: "Clustering comparison on Levine13 dataset" and "Clustering comparison on Levine32 dataset", shown for FlowSOM (mean(F1)=0.6686) and GigaSOM (mean(F1)=0.6825). Axes: assigned clusters vs. manually gated cell type.]

Manually gated cell types, Levine13: CD11b− Monocyte, CD11bhi Monocyte, CD11bmid Monocyte, CMP, Erythroblast, GMP, HSC, Immature B, Mature CD38lo B, Mature CD38mid B, Mature CD4+ T, Mature CD8+ T, Megakaryocyte, MEP, MPP, Myelocyte, Naive CD4+ T, Naive CD8+ T, NK, Plasma cell, Plasmacytoid DC, Platelet, Pre−B I, Pre−B II.

Manually gated cell types, Levine32: Basophils, CD16− NK, CD16+ NK, CD34+CD38+CD123− HSPCs, CD34+CD38+CD123+ HSPCs, CD34+CD38lo HSCs, CD4 T, CD8 T, Mature B, Monocytes, pDCs, Plasma B, Pre B, Pro B.

Figure S1: Results of the clustering precision comparison with FlowSOM on the Levine13 and Levine32 datasets.

[Figure S2. Plot title: "CPU time consumption of distributed SOM training". Panels: 10, 20, 30, 40 and 50 dimensions by 10×10, 20×20 and 40×40 SOM. Horizontal axis: number of allocated CPUs (1 to 100). Vertical axis: total single-CPU time spent per cell processed (30 µs to 1 ms across the panels). Color: cells per process (100Ki, 200Ki, 300Ki).]

Figure S2: Detailed overview of the effect of horizontal scaling on the performance of SOM training in GigaSOM.jl. The color distinguishes the benchmarked levels of CPU occupancy (100 to 300 thousand cells loaded in memory per CPU).

[Figure S3. Plot title: "Speedups from spatial indexing structures". Panels: 10, 20, 30, 40 and 50 dimensions by 10×10, 20×20 and 40×40 SOM. Axes: speedup with kd-trees and speedup with ball-trees (ticks between 0.5× and 3×). Legend: SOM training, SOM classification, EmbedSOM.]

Figure S3: Detailed overview of the speedups provided by using spatial indexes to accelerate the nearest-neighbor queries in the implemented algorithms.

[Figure S4. Annotated populations in the embedding: CD8 memory, CD8 naive, CD4 resting Thelper, CD4 effector Thelper, CD4 resting Treg, CD4 effector Treg, NK, NKT, γδTCR, autofluorescent cells, dead cells, unmarked leukocytes, debris.]

Figure S4: Larger version of Figure 5, an embedding of more than 1 billion cells from the IMPC dataset (FR-FCM-ZYX9). The main lineage markers are highlighted as combinations of RGB colors: CD8 in red, CD4 in green and CD161 in blue. Annotations are based on the presence of several markers that are not highlighted in the embedding, mainly the Live/Dead marker, γδTCR, CD25, GITR, and fluorescence scatter intensities.


using Distributed, GigaSOM, GigaScatter, ClusterManagers
import Glob, NearestNeighbors, Distributions

# find the filenames to process
files = Glob.glob("TC_SPL_*.fcs")

# read the number of available workers from environment and start the worker processes
n_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(n_workers, topology=:master_worker)

# load the required packages on all workers
@everywhere using GigaSOM
@everywhere using GigaScatter

# load the data to workers
dataset = loadFCSSet(:IMPCdata, files)

# collect the total number of cells from all workers
dataset_size = distributed_mapreduce(dataset, d -> size(d,1), +)
@info "Loaded total $dataset_size cells."

# preprocess the distributed data
dselect(dataset, Vector(1:18))                  # columns 1-18 contain interesting information
dtransform_asinh(dataset, Vector(7:18), 500.0)  # transform the marker expressions
dscale(dataset, Vector(1:18))                   # scale all columns

# train the SOM and run the embedding (this uses all available Slurm workers)
som = initGigaSOM(dataset, 32, 32)
som = trainGigaSOM(som, dataset, epochs=30,
    rFinal=0.1, radiusFun=expRadius(-20.0),
    knnTreeFun=NearestNeighbors.BallTree)
e = embedGigaSOM(som, dataset, k=16, output=:embeddedIMPC)

# setup constants for rasterization of the embedding
rasterSize = (2048, 2048)
alpha = 0.003
dist = Distributions.Normal()

# run the distributed rasterization
compound_raster = mixedRaster(distributed_mapreduce(
    [dataset, e],
    (d, e) -> begin
        # convert the normalized expressions to numbers between 0-1
        colors = hcat(
            Distributions.cdf.(dist, d[:,10]),  # R: CD8 column
            Distributions.cdf.(dist, d[:,16]),  # G: CD4 column
            Distributions.cdf.(dist, d[:,15]),  # B: CD161 column
            fill(alpha, size(d,1)))
        # rasterize the local part of the data
        mixableRaster(rasterize(
            rasterSize,
            Matrix{Float64}(e'), Matrix{Float64}(colors'),
            xlim=(-2,33), ylim=(-2,33)))
    end,
    mixRasters))

# save the result to a PNG
savePNG("impc-embedding.png", compound_raster)
