
PINTMF: PENALIZED INTEGRATIVE MATRIX FACTORIZATION METHOD FOR MULTI-OMICS DATA


HAL Id: hal-03154671

https://hal.archives-ouvertes.fr/hal-03154671

Preprint submitted on 2 Mar 2021


Morgane Pierre-Jean, Florence Mauger, Jean-François Deleuze, Edith Le Floch

To cite this version:

Morgane Pierre-Jean, Florence Mauger, Jean-François Deleuze, Edith Le Floch. PINTMF: PENALIZED INTEGRATIVE MATRIX FACTORIZATION METHOD FOR MULTI-OMICS DATA. 2021. ⟨hal-03154671⟩


A PREPRINT

Morgane PIERRE-JEAN
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France
[email protected]

Florence MAUGER
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France

Jean-François DELEUZE
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France

Edith LE FLOCH
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France

March 2, 2021

ABSTRACT

It is increasingly common to explore the genome at several levels rather than at a single omic level. Through integrative statistical methods, omics data have the power to reveal new biological processes, potential biomarkers, and subgroups of a cohort. Matrix factorization (MF) is an unsupervised statistical method that provides a clustering of individuals and also reveals relevant omic variables from the various blocks. Here, we present PIntMF (Penalized Integrative Matrix Factorization), an MF model with sparsity, positivity and equality constraints. To induce sparsity in the model, we use a classical Lasso penalization on the variable and individual matrices. For the matrix of samples, sparsity helps the clustering, and a normalization of the inferred coefficients (matching an equality constraint) is added for better interpretation. Besides, we add an automatic tuning of the sparsity parameters using the well-known glmnet package. We also propose three criteria to help the user choose the number of latent variables. PIntMF was compared to other state-of-the-art integrative methods, including feature selection techniques, on both synthetic and real data. PIntMF succeeds in finding relevant clusters as well as variables in two types of simulated data (correlated and uncorrelated). PIntMF was then applied to two real datasets (diet and cancer), where it reveals interpretable clusters linked to the available clinical data. Our method outperforms the existing ones on two criteria (clustering and variable selection). We show that PIntMF is an easy, fast, and powerful tool to extract patterns and cluster samples from multi-omics data.

1 Introduction

The improvement of high-throughput biological technologies enables the production of various omics data such as genomic, transcriptomic, epigenomic, proteomic, and metabolomic data (Ritchie et al., 2015; Yugi et al., 2016). The generation of these data allows investigating biological processes in cancer or complex diseases. For example, The Cancer Genome Atlas (TCGA (Network et al., 2012)) has already produced numerous omics data for a set of 32 cancer types (Vasaikar et al., 2017). Recently, other multi-omics studies on complex diseases and single-cell data have also been emerging (Rowlands et al., 2014; Bock et al., 2016; Yang, 2020).

However, integrating omics data raises several statistical challenges, such as dealing with a large number of variables, few samples, and data heterogeneity (Bersanelli et al., 2016). Indeed, the statistical distributions of omics data are very heterogeneous. For instance, mutations can be modeled by a binary distribution, while RNAseq data can be modeled by a Negative Binomial distribution and metabolomic data by a Gaussian distribution. Besides, the omic block sizes can vary from one hundred to one billion variables. Furthermore, collecting several omics for a single sample can be difficult because of the cost and the limited access to the biological material.

To identify potential biomarkers and new classifications in complex diseases, unsupervised integrative methods have been developed over the last decade to analyze multi-omics datasets (Tini et al., 2017; Huang et al., 2017; Chauvel et al., 2019; Pierre-Jean et al., 2019; Cantini et al., 2020). Blocks of omics data can be seen as matrices, and relevant information can be extracted using dimension reduction methods, particularly matrix factorization (MF) methods (Sastry et al., 2020) and canonical correlation analysis (CCA) (Tenenhaus and Tenenhaus, 2011).

CCA methods are used to integrate multi-omics data and aim to maximize the correlation between omics under constraints (Tenenhaus and Tenenhaus, 2011; Tenenhaus et al., 2014; Rodosthenous et al., 2020).

Then, MF techniques infer two matrices when applied to a single omic data: the first one describes the structure between variables (e.g., genes, probes, regions) and the second one describes the structure between samples.

One famous MF method is the Non-Negative Matrix Factorization (NMF, (Lee and Seung, 1999)). This method implements non-negativity constraints on the two inferred matrices. NMF provides a way to explain the structure of data by providing variable profiles (dictionary for each dimension). Besides, NMF enables a classification of the samples thanks to the second matrix. The NMF is a commonly applied method used for a single omic block to identify disease subtypes in gene expression data (Burstein et al., 2015) or recently, in DNA methylation data (Reilly et al., 2019).

More recently, extensions of MF have been developed to perform integrative analysis (Mo et al., 2013; Chalise et al., 2014; Chen and Zhang, 2018). MF extensions need to infer more than two matrices: one matrix for each omic block is computed and one matrix for samples.

Matrix factorization has proven to be a powerful technique to integrate heterogeneous data (Chauvel et al., 2019; Pierre-Jean et al., 2019; Cantini et al., 2020). In this article, we propose a Penalized Integrative Matrix Factorization method called PIntMF, to discover new patterns and a new classification of a cohort. First, to add sparsity on the first inferred matrix (corresponding to the variable blocks), we use a common regularization technique: the Least Absolute Shrinkage and Selection Operator (LASSO (Tibshirani, 1996)). Then, sparsity, non-negativity and equality constraints are added to the second matrix (corresponding to the samples) to improve the interpretability of the clustering. Besides, we propose criteria to choose the number of latent variables and to properly initialize the algorithm.

The performance of this new unsupervised model was evaluated on both simulated and real data. We applied PIntMF on a simulated framework introduced by our group in (Pierre-Jean et al., 2019) but also on a simulated framework from (Chung and Kang, 2019). We compared our method to several existing unsupervised methods that perform both variable selection and clustering: intNMF (Chalise and Fridley, 2017), SGCCA (Tenenhaus et al., 2014), MoCluster (Meng et al., 2015), CIMLR (Ramazzotti et al., 2018), and iClusterPlus (Mo and Shen, 2018). Then, we applied the model on a murine liver dataset (Williams et al., 2016) and glioblastoma cancer data from TCGA already used in (Shen et al., 2012).

2 Method

2.1 Model description

In the following, $\mathbf{A}$ denotes a matrix, $\mathbf{a}$ a vector and $a$ a scalar. We consider $K$ matrices $\mathbf{X}_1, \dots, \mathbf{X}_K$ as the input of each method. Each matrix $\mathbf{X}_k$ is of size $n \times J_k$ ($n$ is the number of samples and $J_k$ the number of variables for block $k$). In this article, we propose a model based on the matrix factorization method, i.e.:

$$\mathbf{X}_k \approx \mathbf{W}\mathbf{H}_k \tag{1}$$

where $\mathbf{W}$ denotes a common basis matrix and $\mathbf{H}_k$ a specific coefficient matrix associated with block $k$. $\mathbf{W}$ is of size $n \times P$ and $\mathbf{H}_k$ is of size $P \times J_k$. Therefore, $P$ is the number of latent variables in the model.

To ensure identifiability and improve interpretation of the model, non-negativity and sparsity constraints are imposed on $\mathbf{W}$ (as in the intNMF model described in (Chalise and Fridley, 2017)). $\mathbf{W}$ will be used to cluster samples simultaneously across the $K$ omics blocks. On $\mathbf{H}_k$, a sparsity constraint is imposed to perform variable selection simultaneously with the clustering of samples. Model 1 can be extended to the following optimization problem:

$$\min_{\mathbf{W}, \mathbf{H}_1, \dots, \mathbf{H}_K} \; \sum_{k=1}^{K} \left( \|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2 + \lambda_k \|\mathbf{H}_k\|_1 \right) + \sum_{i=1}^{n} \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{W} \geq 0 \tag{2}$$

where $\|\mathbf{H}_k\|_1 = \sum_{p=1}^{P} \sum_{j=1}^{J_k} |h^k_{pj}|$.

2.2 Solving equation

The optimization problem 2 is not jointly convex in $\mathbf{W}, \mathbf{H}_1, \dots, \mathbf{H}_K$, but it is convex in each matrix separately. Consequently, it can be solved by alternating between $\mathbf{W}$ and $\mathbf{H}_1, \dots, \mathbf{H}_K$ until convergence.
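As an illustration, the objective of problem 2 can be evaluated for given matrices with a few lines of code. This is a minimal sketch in plain Python, not the code of the PIntMF package; the function names `matmul` and `objective` and the list-of-rows matrix layout are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def objective(blocks, w, hs, lambdas, mus):
    """Value of problem 2 for blocks X_k, a shared W and block-wise H_k.

    blocks, hs and lambdas have one entry per block k; mus has one entry per sample i.
    """
    total = 0.0
    for x, h, lam in zip(blocks, hs, lambdas):
        fit = matmul(w, h)
        # squared Frobenius norm of the residual X_k - W H_k
        total += sum((xv - fv) ** 2
                     for x_row, f_row in zip(x, fit)
                     for xv, fv in zip(x_row, f_row))
        # lasso penalty lambda_k * ||H_k||_1
        total += lam * sum(abs(v) for row in h for v in row)
    # row-wise lasso penalty mu_i * ||w_i.||_1
    total += sum(mu * sum(abs(v) for v in row) for mu, row in zip(mus, w))
    return total


# Toy example: one block that W H_1 reproduces exactly, so only the penalties remain.
w = [[1.0, 0.0], [0.5, 0.5]]
h1 = [[2.0, 0.0, 1.0], [0.0, 4.0, 0.0]]
x1 = matmul(w, h1)
value = objective([x1], w, [h1], lambdas=[0.1], mus=[0.2, 0.2])
```

Tracking this value across the alternating updates is one simple way to check that each step does not increase the objective before the rows of W are renormalized.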

2.2.1 Solve on W

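In this step, $\mathbf{H}_k$ is fixed and problem 3 is solved on $\mathbf{W}$.

$$\min_{\mathbf{W}} \; \sum_{k=1}^{K} \|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2 + \sum_{i=1}^{n} \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{W} \geq 0 \tag{3}$$

All individuals are independent for the weights $\mathbf{W}$ when the $\mathbf{H}_k$ are fixed. The problem for an individual $i$ can be written as follows:

$$\min_{\mathbf{w}_{i\bullet}} \; \sum_{k=1}^{K} \|\mathbf{x}^k_{i\bullet} - \mathbf{w}_{i\bullet}\mathbf{H}_k\|^2 + \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{w}_{i\bullet} \geq 0 \tag{4}$$

Equation 4 is equivalent to

$$\min_{\mathbf{w}_{i\bullet}} \; \sum_{k=1}^{K} \sum_{j=1}^{J_k} \left( x^k_{ij} - \mathbf{w}_{i\bullet}\mathbf{h}^k_{\bullet j} \right)^2 + \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{w}_{i\bullet} \geq 0 \tag{5}$$

The optimization problem described by 5 is a classical lasso problem with a positivity constraint. It can be solved easily and quickly with the glmnet R package (Jerome et al., 2010).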
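To make the structure of problem 5 concrete, the sketch below solves it for one individual with a plain cyclic coordinate descent. This is a stand-in chosen for illustration, not the glmnet solver used in the paper, and glmnet scales its loss differently, so penalty values are not numerically interchangeable. The names `stack_design` and `nonneg_lasso_row` and the fixed number of sweeps are assumptions.

```python
def stack_design(hs):
    """Stack the columns of every H_k as rows: row (k, j) holds h^k_{.j}."""
    return [[h[p][j] for p in range(len(h))] for h in hs for j in range(len(h[0]))]


def nonneg_lasso_row(y, a, mu, n_sweeps=200):
    """Minimise sum((y - A w)^2) + mu * sum(w) subject to w >= 0.

    y: row i of every block, concatenated in the same order as stack_design.
    a: design matrix returned by stack_design (list of rows, P columns).
    """
    m, p = len(a), len(a[0])
    w = [0.0] * p
    resid = list(y)  # residual y - A w, with w = 0 at the start
    col_sq = [sum(a[r][q] ** 2 for r in range(m)) for q in range(p)]
    for _ in range(n_sweeps):
        for q in range(p):
            if col_sq[q] == 0.0:
                continue  # an all-zero column cannot reduce the loss
            # inner product of column q with the residual that excludes w_q
            rho = sum(a[r][q] * (resid[r] + a[r][q] * w[q]) for r in range(m))
            # closed-form one-dimensional minimiser, clipped at zero
            new = max(0.0, (rho - mu / 2.0) / col_sq[q])
            delta = new - w[q]
            if delta != 0.0:
                for r in range(m):
                    resid[r] -= a[r][q] * delta
                w[q] = new
    return w


# Toy example with an orthogonal design: the solution is a shifted, clipped copy of y.
w_i = nonneg_lasso_row([3.0, -1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], mu=2.0)
```

Because the rows of W decouple, the same routine can be applied to each individual independently, which is what makes this step easy to parallelize.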

2.2.2 Solve on $\mathbf{H}_k$

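When $\mathbf{W}$ is fixed, each $\mathbf{H}_k$ can be solved independently. In this section, for readability, the index $k$ is dropped from the equations.

$$\min_{\mathbf{H}} Q(\mathbf{H}) = \min_{\mathbf{H}} \; \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2 + \lambda \sum_{p=1}^{P} \sum_{j=1}^{J} |h_{pj}| \tag{6}$$

$$Q(\mathbf{H}) = \mathrm{Trace}\left((\mathbf{X} - \mathbf{W}\mathbf{H})(\mathbf{X} - \mathbf{W}\mathbf{H})^T\right) + \lambda \sum_{p=1}^{P} \sum_{j=1}^{J} |h_{pj}| = \mathrm{vec}(\mathbf{X} - \mathbf{W}\mathbf{H})^T \mathrm{vec}(\mathbf{X} - \mathbf{W}\mathbf{H}) + \lambda \sum_{p=1}^{P} \sum_{j=1}^{J} |h_{pj}|$$

We denote

$$\mathbf{h} = \mathrm{vec}(\mathbf{H}) = (H_{11}, \dots, H_{P1}, \dots, H_{1J}, \dots, H_{PJ})^T \quad \text{and} \quad \mathbf{x} = \mathrm{vec}(\mathbf{X}) = (X_{11}, \dots, X_{n1}, \dots, X_{1J}, \dots, X_{nJ})^T.$$

$$\begin{aligned}
Q(\mathbf{H}) &= (\mathbf{x} - \mathrm{vec}(\mathbf{W}\mathbf{H}))^T (\mathbf{x} - \mathrm{vec}(\mathbf{W}\mathbf{H})) + \lambda \|\mathbf{h}\|_1 \\
&= (\mathbf{x} - (\mathbf{I}_J \otimes \mathbf{W})\,\mathrm{vec}(\mathbf{H}))^T (\mathbf{x} - (\mathbf{I}_J \otimes \mathbf{W})\,\mathrm{vec}(\mathbf{H})) + \lambda \|\mathbf{h}\|_1 \\
&= (\mathbf{x} - \tilde{\mathbf{W}}\mathbf{h})^T (\mathbf{x} - \tilde{\mathbf{W}}\mathbf{h}) + \lambda \|\mathbf{h}\|_1
\end{aligned}$$

where $\mathbf{I}_J$ is the identity matrix of size $J$ and $\tilde{\mathbf{W}} = \mathbf{I}_J \otimes \mathbf{W}$.

We can reformulate the problem as follows:

$$Q(\mathbf{H}) = \|\mathbf{x} - \tilde{\mathbf{W}}\mathbf{h}\|^2 + \lambda \|\mathbf{h}\|_1$$

$\lambda$ will be optimized for each block $k = 1, \dots, K$.

As for $\mathbf{W}$, we used the glmnet package to solve this problem.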
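The vectorization step above relies on the identity vec(WH) = (I_J ⊗ W) vec(H), with vec stacking columns. The following sketch checks it numerically on a small example in plain Python; the helper names (`matmul`, `vec`, `kron_identity`) are hypothetical and the dense construction of I_J ⊗ W is for illustration only, since the matrix is block diagonal and would not be built explicitly for large J.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def vec(m):
    """Stack the columns of a matrix into one list (column-major order)."""
    return [m[r][c] for c in range(len(m[0])) for r in range(len(m))]


def kron_identity(w, j):
    """Dense I_J (x) W: a block-diagonal matrix with J copies of W."""
    n, p = len(w), len(w[0])
    out = [[0] * (j * p) for _ in range(j * n)]
    for b in range(j):
        for r in range(n):
            for c in range(p):
                out[b * n + r][b * p + c] = w[r][c]
    return out


w = [[1, 2], [3, 4], [5, 6]]   # n = 3 samples, P = 2 latent variables
h = [[1, 0], [2, 1]]           # P = 2, J = 2 variables

lhs = vec(matmul(w, h))
w_tilde = kron_identity(w, j=2)
rhs = [sum(a * b for a, b in zip(row, vec(h))) for row in w_tilde]
```

The block-diagonal structure also shows that the squared-error term separates over the J columns of H; only the shared penalty parameter λ ties the columns together.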

2.2.3 Normalization

We would like to consider $\mathbf{W}$ as a weight matrix. To avoid problems of convergence or non-identifiability, each row of $\mathbf{W}$ is normalized by the sum of its weights once the matrix has been computed, i.e. each row is divided by its sum after each step:

$$\mathbf{w}_{i\bullet} = \frac{\mathbf{w}_{i\bullet}}{\sum_{p=1}^{P} w_{ip}} \tag{7}$$

2.3 Stopping criteria

The stopping criterion of the model is determined by the convergence of the matrix $\mathbf{W}$: when the similarity of $\mathbf{W}$ between two successive iterations is stable, the model is considered to have converged and the algorithm stops. The similarity between $\mathbf{W}^{t-1}$ and $\mathbf{W}^{t}$ is measured with the ARI. Users can also define a maximum number of iterations.

2.4 Automatic tuning of sparsity parameters

For each block $\mathbf{X}_k$, we need to calibrate the sparsity parameters $\lambda_k$ and $\mu_i$. The main advantage of the glmnet package is its speed (see Supplementary Materials Fig. S9). Besides, glmnet implements a cross-validation technique to choose the best $\lambda$ or $\mu$. PIntMF takes advantage of glmnet to calibrate the penalty on each block. Therefore, the only parameter that the user needs to tune is the number of latent variables $P$.

2.5 Clustering

In this article, all clusterings are obtained by applying a hierarchical clustering with Ward's method (Ward Jr, 1963) to the matrix W. The number of clusters is set to P.

2.6 Criteria to choose the best model

In this section, we present three different criteria to choose the appropriate number of latent variables (P ).

2.6.1 Mean square error

The number of latent variables can be optimized by looking at the curve of the Mean Square Error (MSE). In this context, the MSE for each dataset $k$ is defined by:

$$\mathrm{MSE}^P_k = \frac{\|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2}{n \times J_k} \tag{8}$$

The total MSE is then defined by averaging the different $\mathrm{MSE}^P_k$:

$$\mathrm{MSE}^P = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}^P_k \tag{9}$$
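Equations 8 and 9 translate directly into code. The following is a minimal plain-Python sketch under the assumption that matrices are stored as lists of rows; `block_mse` and `total_mse` are hypothetical names, not functions of the PIntMF package.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_mse(x, w, h):
    """Equation 8: squared Frobenius error of one block divided by n * J_k."""
    fit = matmul(w, h)
    sq_err = sum((xv - fv) ** 2
                 for x_row, f_row in zip(x, fit)
                 for xv, fv in zip(x_row, f_row))
    return sq_err / (len(x) * len(x[0]))


def total_mse(blocks, w, hs):
    """Equation 9: average of the block-wise MSEs over the K blocks."""
    return sum(block_mse(x, w, h) for x, h in zip(blocks, hs)) / len(blocks)


# Toy example with n = 2 samples, P = 1 latent variable and two blocks.
w = [[1.0], [2.0]]
x1, h1 = [[1.0, 2.0], [2.0, 2.0]], [[1.0, 1.0]]
x2, h2 = [[1.0], [4.0]], [[1.0]]
mse = total_mse([x1, x2], w, [h1, h2])
```

Dividing by n × J_k before averaging keeps a block with many variables from dominating the total, which is the point of computing the MSE per block first.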

2.6.2 Percentage of variation explained (PVE)

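To measure the performance of the method, we computed the Percentage of Variation Explained (Nowak et al., 2011) defined by the following formula:

$$\mathrm{PVE}(\mathbf{W}, \mathbf{H}_k) = 1 - \frac{\|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2}{\|\mathbf{X}_k - \bar{\mathbf{X}}_k \mathbf{1}_{J_k}\|_F^2} \tag{10}$$

where $\bar{\mathbf{X}}_k$ is a vector containing the average profile of each individual, $\bar{X}_{ki} = \frac{\sum_{j} x^k_{ij}}{J_k}$, and $\mathbf{1}_{J_k} = (1, \dots, 1)$ is a row-vector of size $J_k$.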
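The per-block PVE of Equation 10 compares the reconstruction error with the error of a baseline that replaces each individual's profile by its mean. A minimal plain-Python sketch, assuming list-of-rows matrices and the hypothetical name `block_pve`:

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_pve(x, w, h):
    """Equation 10 for one block; undefined if every row of x is constant."""
    fit = matmul(w, h)
    rss = sum((xv - fv) ** 2
              for x_row, f_row in zip(x, fit)
              for xv, fv in zip(x_row, f_row))
    # baseline: each individual's profile replaced by its own mean
    tss = sum((v - sum(row) / len(row)) ** 2 for row in x for v in row)
    return 1.0 - rss / tss


x = [[1.0, 3.0], [2.0, 2.0]]
pve_exact = block_pve(x, [[1.0, 0.0], [0.0, 1.0]], x)      # W H reproduces X
pve_means = block_pve(x, [[1.0], [1.0]], [[2.0, 2.0]])     # W H equals the row means
```

A value of 1 therefore means a perfect reconstruction, 0 means the factorization does no better than the row means, and negative values are possible when it does worse.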

Then, we computed the global PVE as the mean of the PVE on the $K$ blocks, i.e.:

$$\mathrm{PVE} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{PVE}(\mathbf{W}, \mathbf{H}_k) \tag{11}$$

2.6.3 Cophenetic distance

The last criterion is inspired by (Gaujoux and Seoighe, 2010).

We want to assess whether the distances in the tree (after hierarchical clustering on W) accurately reflect the original distances.

One way is to compute the correlation between the cophenetic distances and the original distances generated by the dist() function on W (Sokal and Rohlf, 1962). The clustering is valid if the correlation between the two quantities is high. Note that we use the cophenetic function defined by (Sneath et al., 1973).

The cophenetic correlation usually decreases with the increase of P values. Brunet et al. (2004) suggested choosing the smallest value of P for which this coefficient starts decreasing.

3 Performance criteria

Two criteria are used to assess the performance of our method and to compare it with others.

3.1 Adjusted Rand Index (ARI)

On a simulated dataset and on well known real datasets, it is possible to compute the similarity between the true and the inferred classifications. We use the Adjusted Rand Index as a criterion to evaluate the performance of our method. The Adjusted Rand Index (Rand, 1971) is equal to one when the two classifications that are compared are totally similar and zero or even negative if the classifications are completely different.
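For reference, the Adjusted Rand Index can be computed from the contingency table of two label vectors. The sketch below uses the usual pair-counting form of the index in plain Python; the function name and the convention of returning 1.0 in the degenerate case where the denominator vanishes are assumptions of this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same samples."""
    n = len(labels_a)
    # pairs of samples grouped together in both partitions
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # pairs grouped together in each partition taken alone
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:
        return 1.0  # assumption: degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # same partition, relabelled
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # every pair is split
```

The index depends only on which samples are grouped together, so cluster labels can be permuted freely, and the second example shows that it can indeed fall below zero.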

3.2 Area under the ROC curve (AUROC)

On a simulated dataset, the variables that drive the subgroups are known, and it is easy to compute false-positive and true-positive rates. First, variables are ordered by their standard deviation (from the highest to the lowest) computed on the H matrix, which highlights the variables that differ most between the P components and therefore contribute most to the clusters. To summarize the information of these two quantities, we compute the area under the TPR-FPR curve (AUROC). An AUROC equal to one means that the method selects the variables with no error. An AUROC under 0.50 means that false-positive variables are selected before the true-positive ones.
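The ranking and the AUROC described above can be sketched as follows. The area under the TPR-FPR curve is computed here through its equivalent rank formulation (the proportion of relevant/irrelevant pairs ranked in the right order, ties counting one half). The function names are hypothetical, and the use of the population standard deviation is an assumption; the ranking is the same with the sample version.

```python
from statistics import pstdev


def variable_scores(h):
    """Standard deviation of each variable (column of H) across the P components."""
    return [pstdev(col) for col in zip(*h)]


def auroc(scores, is_relevant):
    """Area under the TPR-FPR curve; needs at least one relevant and one irrelevant variable."""
    pos = [s for s, r in zip(scores, is_relevant) if r]
    neg = [s for s, r in zip(scores, is_relevant) if not r]
    # count pairs where the relevant variable is ranked above the irrelevant one
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy H with P = 2 components and 3 variables; only the third varies across components.
h = [[0.0, 1.0, 5.0],
     [0.0, 1.0, 0.0]]
scores = variable_scores(h)
good = auroc(scores, [False, False, True])
bad = auroc(scores, [True, False, False])
```

A variable whose coefficient is identical in every component gets a score of zero, which matches the intuition that it cannot help to separate the clusters.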

4 Results

4.1 Optimization of the algorithm

4.1.1 Initialization

Often in NMF algorithms (Lee and Seung, 1999), the matrices are initialized with non-negative random values. We assess four kinds of initialization for PIntMF (hierarchical clustering, random, Similarity Network Fusion and Singular Value Decomposition).

The best initialization is based on the SNF algorithm (Wang et al., 2014) (Fig. S1). This initialization has the advantage of taking the K blocks of the analysis into account simultaneously.

Therefore, for all the following analyses, SNF initialization was used.

4.1.2 Computing optimization of H

Several algorithms to solve the Lasso problem on $\mathbf{H}_k$ were tested. glmnet is the fastest package among them (Supplementary Materials Fig. S9).

4.2 Performance on simulated datasets

We assess the performance of PIntMF in two simulated frameworks described below.

4.2.1 Simulations on independent datasets (non-correlated blocks)

The performance of PIntMF to cluster samples and to select relevant variables was evaluated on simulated data described in (Pierre-Jean et al., 2019). The framework of these simulations is composed of three blocks with three different types of distribution (Binary, Beta-like, and Gaussian) to simulate the heterogeneity of the integrative omics data studies. Indeed, a binary distribution could match a mutation (equal to 1 if the gene is mutated and 0 otherwise); a Beta-like distribution could match DNA methylation data, and a Gaussian distribution could match gene expression values.

Four unbalanced groups (composed of 25, 20, 5, and 10 individuals) have been simulated (Benchmarks 1 to 5). Datasets with 2, 3, and 4 balanced groups have also been simulated (Benchmarks 6 to 8). Each benchmark is simulated 50 times.

PIntMF was compared to several integrative unsupervised methods (Pierre-Jean et al., 2019) that perform both clustering and variable selection, namely: intNMF (Chalise et al., 2014), SGCCA (Tenenhaus et al., 2014), MoCluster (Meng et al., 2015), iClusterPlus (Mo et al., 2013), and CIMLR (Ramazzotti et al., 2018).

On the eight simulated benchmarks with various levels of signal to noise ratio, PIntMF and MoCluster outperform the other methods with an ARI equal to 1 in most cases (Fig. 1).

[Figure 1 image: one panel per method (iClusterPlus, CIMLR, SGCCA, MoCluster, PIntMF, intNMF); x-axis: Benchmark 1 to Benchmark 8; y-axis from 0.00 to 1.00]

Figure 1: Adjusted Rand Index of PIntMF, intNMF, SGCCA, MoCluster, iClusterPlus, and CIMLR methods on simulated datasets. B1: Reference, B2: More Gaussian noise, B3: More Gaussian noise and more Binary noise, B4: More Beta noise and more Binary noise, B5: More relevant variables, B6: 2 balanced groups, B7: 3 balanced groups, B8: 4 balanced groups

The performance of variable selection is assessed using the area under ROC curves (AUROC) after computing False Positive Rates (FPR) and True Positive Rates (TPR) (see section 3.2). The computation of the AUROC shows that PIntMF performs as well as MoCluster on the three types of data (Table S1 in Supplementary Materials). Indeed, PIntMF reaches either the first or the second-best AUROC for these simulations. Besides, the lowest AUROC is equal to 0.88 which means that the method is both sensitive and specific.


4.2.2 Simulation based on real data (correlated blocks)

We evaluate the performance of PIntMF on a simulated framework based on real cancer data and developed by (Chung and Kang, 2019). Indeed, the previous framework does not simulate any correlation between omics blocks.

OmicsSIMLA is a simulation tool for generating multi-omics data with disease status. This tool simulates CpGs with methylation proportions, RNA-seq read counts and normalized protein expression levels. Here, we simulated 50 datasets containing 50 cases (i.e., short-term survival) and 50 controls (i.e. long-term survival), and three omics blocks (RNAseq, DNA methylation, and proteins). We try to recover the two groups but also the different features that drive overall survival by using DNA methylation, expression, and protein data. For two of the three blocks (expression and DNA methylation), the variables differentially expressed or methylated between the two groups are known.

The simulated data are described in Supplementary Materials (Section 5).

In these simulations, we also compare the performance of PIntMF to other methods in terms of clustering and variable selection. First, CIMLR does not give any results on these simulations (the algorithm does not converge). For all the other methods, the ARI is equal to 1 (maximum value) for all 50 datasets.

Then, we compare the variable selection performance of PIntMF, intNMF, iClusterPlus, MoCluster, and SGCCA by computing the AUROC on the expression and DNA methylation blocks only (the protein block does not contain any variable simulated with differential abundance; more details are given in Supplementary Materials section 5).

DNA methylation dataset: PIntMF and iClusterPlus outperform the other methods. Their performances are close, but the AUROC of iClusterPlus is significantly higher. The AUROC of PIntMF is in turn significantly higher than those of MoCluster, SGCCA and intNMF (Fig. 2).

Expression dataset: PIntMF is the best method, with an AUROC significantly higher than the others. However, all methods achieve an AUROC higher than 0.92 (Fig. 2).

On these simulations, PIntMF gives similar results to iClusterPlus, but with automatic tuning of parameters. Besides, the algorithm of PIntMF is faster than iClusterPlus.

[Figure 2 image: AUC (y-axis) for MoCluster, SGCCA, intNMF, icluster and PIntMF (x-axis). Panel (a) Methylation: y-axis from 0.7 to 1.1, annotated p-values 0.00024, p < 2.22e−16, p < 2.22e−16, p < 2.22e−16. Panel (b) Expression: y-axis from 0.92 to 1.00, annotated p-values 0.00015, 7.1e−06, 1e−10, 5.5e−15]

Figure 2: AUROC of PIntMF, MoCluster, SGCCA, iClusterPlus and intNMF for OmicsSIMLA simulations on (a) DNA methylation and (b) Gene expression blocks


4.2.3 Stability selection

A jackknife procedure was performed to evaluate the stability of variable selection. To perform this technique, we run the PIntMF model on the data with one sample left out at each step. Therefore, we obtain n datasets containing n − 1 individuals, on each of which we apply the method.

The stability of the selected variables for the Binary, Gaussian, methylation and expression datasets seems to be strong (Fig. S10 in Supplementary Materials). For proteins and for Beta-like data, the jackknife reveals that some selected variables are not stable. The jackknife could therefore be used to remove false-positive variables.
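The leave-one-out procedure described above can be sketched generically as follows. The `select` argument is a hypothetical placeholder for a full PIntMF fit followed by the extraction of the selected variables; the toy selector used in the example is not part of the method and only serves to make the code runnable.

```python
def jackknife_selection_frequency(samples, select):
    """Fraction of leave-one-out runs in which each variable is selected.

    samples: list of per-sample records (one entry per individual).
    select:  callable taking a list of samples and returning a set of variable indices
             (hypothetical stand-in for fitting the model and reading its non-zero rows).
    """
    n = len(samples)
    counts = {}
    for i in range(n):
        subset = samples[:i] + samples[i + 1:]  # data without sample i
        for v in select(subset):
            counts[v] = counts.get(v, 0) + 1
    return {v: c / n for v, c in counts.items()}


def toy_select(subset):
    """Toy selector: keep the variables whose mean over the subset exceeds 0.5."""
    return {j for j, col in enumerate(zip(*subset)) if sum(col) / len(col) > 0.5}


data = [[1, 0], [1, 0], [1, 3]]
freq = jackknife_selection_frequency(data, toy_select)
```

In the toy example the second variable is selected only because of the third sample, so its frequency drops below one; a threshold on this frequency is one possible way to flag unstable variables.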

4.2.4 Summary

To summarize this simulation part (see Table 1), our method PIntMF provides satisfying clustering and variable selection both on correlated blocks (Simulation Framework 2) and on non-correlated blocks (Simulation Framework 1). PIntMF is the only method that performs well on all simulated settings.

We conclude on these two frameworks of simulated data that PIntMF is a fast and flexible tool.

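Method       | Clustering | Variable selection | Automatic tuning | Parameters left to tune
-------------|------------|--------------------|------------------|------------------------
iClusterPlus | +          | ++                 | -                | >2
intNMF       | +++        | -                  | +++              | 1
SGCCA        | ++         | ++                 | -                | >5
MoCluster    | +++        | +++                | +                | >2
CIMLR        | +          | ++                 | +++              | 1
PIntMF       | +++        | +++                | +++              | 1

Table 1: Summary of the performance of PIntMF compared to other methods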

4.3 Applications

In this section, we assess the performance of the PIntMF method on real data by considering two applications. The first one is a dataset from murine liver (Williams et al., 2016) under two different diets already used in two previous comparison articles (Pierre-Jean et al., 2019; Tini et al., 2017), and the objective is to recover the diets of the mice (fat diet or chow diet). The second one is a glioblastoma dataset from TCGA used in (Shen et al., 2012) and the goal is to find the tumor subtypes.

4.3.1 PIntMF highlights variables linked to phenotypes of samples

We analyzed the BXD cohort (composed of 64 samples) (Williams et al., 2016); the mice were divided between two different dietary conditions: chow diet (CD) (6% kcal of fat) or high-fat diet (HFD) (60% kcal of fat). Measurements have been made in the livers of the entire population at the transcriptome, the proteome, and the metabolome levels.

Therefore, we applied PIntMF to this dataset as well as intNMF, MoCluster, SGCCA, iClusterPlus, and CIMLR (Supplementary Materials Table S2).

PIntMF produces a perfect classification of the individuals for this real dataset.

For this dataset, all criteria for the model selection were computed (Supplementary Material Fig. S6), and 2 groups were selected for further analysis.

PIntMF highlights interesting variables that seem to have different abundance between the two groups CD and HFD (Fig. 3): Vitamin E (C29H50O2), Cholesteryl (C36H62O5), Mustard Oil (C4H5NS). The Saa2 gene, which codes for a protein involved in the HDL complex, seems to be differentially expressed between the two groups. Then, the Cidea gene, which is involved in the metabolism of lipids and lipoproteins, has a slightly different level of expression between the two groups. Finally, Cyp2b9, which oxidizes steroids, fatty acids, and xenobiotics, is less expressed in the high-fat diet group. To conclude, PIntMF succeeds in recovering the classification and relevant markers in all datasets.

4.3.2 PIntMF reveals a new classification of non annotated samples on TCGA dataset.

Secondly, we analyze a subset of the glioblastoma dataset from The Cancer Genome Atlas (TCGA): the Glioblastoma study (2009) used in (Shen et al., 2012). The dataset contains three matrices: copy number variation (1599 regions), DNA methylation (1515 CpG), and mRNA expression (1740 genes) in 55 samples. GBM samples were classified into four subtypes (Classical: CL, Mesenchymal: MES, Neural: NL, and Proneural: PN). Besides, there are samples with no subtype (NA). Using the PIntMF method, we highlight samples with no classification that are close to labeled samples. Looking at the three criteria, the best number of latent variables seems to be 5 (Supplementary Materials Fig. S7). For example, the green cluster from PIntMF matches a part of the CL subtype, and one sample labeled as NA is in this green cluster. Then, the purple cluster from PIntMF matches the PN subtype, and one sample labeled as NA can be classified with the PN subtype (Figure 4a). Clusters 1 (red) and 2 (blue) are more heterogeneous. However, the red one is composed of NL and NA labeled samples. The blue one is close to samples labeled as PN.

[Figure 3 image: heatmaps of the selected variables, with side annotations for the PIntMF clustering (1, 2) and the truth (CD, HFD).
Metabolites: C37H70O5, C39H72O5, C4H5NS, C9H15N3O2S, C26H46O6, C29H50O2, C33H64O5, C36H62O5, C24H35O2, C39H60O5.
Proteins: Srd5a−1, R75370, Ccdc44, Lpsb2, Pcn, MTP, galectin−1, GstpiB, Gst3, Ces6.
RNA: Cd36, Lcn2, Saa2, Cyp2b9, 9030619P08Rik, Cyp2a22, Cidea, D630002G06Rik, AB056442, Cyp2b13.]

Figure 3: BXD cohort results: Top 10 selected variables with PIntMF of each dataset (Metabolites, Proteins and RNA), the clustering given by PIntMF and the true clustering are on the right.

We performed a survival analysis to identify a relation between the groups found by PIntMF and the survival rate (Figure 4b). The survival test gives a p-value that is significant at the 5% level (p-value = 0.00013 with the log-rank test). The prognosis for the purple (4) group is better than those of the red and green (1 and 3) groups and even better than those of the orange and blue (2 and 5) groups. Note that the PN subtype is split into two groups (purple and blue) that have two very different survival curves.

The previous study (Shen et al., 2012), performed with the iCluster method (Shen et al., 2009), identified 3 subgroups with a less significant p-value (0.01) than PIntMF for the survival differences between subgroups. Their Cluster 1 matches the PN group, Cluster 2 matches the CL group, and Cluster 3 is mostly composed of the MES subtype. The authors do not give any information about the samples with no subtype.

H matrices exhibit various types of genomic profiles according to the clusters (Figure 4). For instance, the orange cluster (5) shows few alterations at the copy number variation level (Fig. 4c) but a particular profile for DNA methylation and gene expression data (Fig. 4e). The blue cluster (2) has a distinct pattern of expression (Fig. 4d).

5 Discussion

We presented a new model to discover new subgroups of a cohort and potential new markers from several types of omic data. PIntMF is a matrix factorization model with positivity and sparsity constraints (Lasso) on inferred matrices. The method and all scripts of this article are available in an R package named PIntMF.

The main advantage of this method is the automatic tuning of the lasso penalties for both variable and sample matrices. To reduce the computation time (Supplementary Materials Fig. S9), we tried several algorithms to infer the matrices $\mathbf{H}_k$. glmnet is much faster than the others (ncvreg, quadrupen, and biglasso) and was therefore retained for all analyses. We also optimized the initialization of the algorithm, which is obtained by using the SNF algorithm (Wang et al., 2014). This initialization provides, at the end of the algorithm, the best clustering and the best


[Figure 4 image, panel (a): heatmap of W for the 55 samples, annotated with the PIntMF cluster (1 to 5) and the subtype (CL, MES, NA, NL, PN); W values range from 0 to 1]

(a) Heatmap plot of W: Homogeneity between subtypes and subgroups identified by PIntMF

[Figure 4 image, panel (b): survival probability (0.00 to 1.00) against time in months (0 to 100) for the strata clust=1 to clust=5; p = 0.00013]

(b) Kaplan-Meier plot: The subgroups identified by PIntMF show survival differences

[Figure 4 image, following panel: heatmap with clusters ordered 3, 5, 4, 2, 1 and gene symbols as axis labels (EGFR, SEC61G, ...); labels omitted]

C21orf62 SH3BGR

TTC3 ETS2 CBS MX1BACE2COL6A2MX2ITGB2SQSTM1CPLX2 GFPT2ZFP2 LCP2DUSP1 KCNIP1DOCK2HOMER1THBS4 HMP19FGF1SPRY4JAKMIP2SNCBIQGAP2 CCNB1KIAA0888ENC1 HEXBPDE8B PIK3R1C7PPAP2A

PCDHGA8

TRIP13 SLC1A3DAB2

SEPP1HMGCS1GHR FYB PLK2 CD14IL7RRAI14SEMA5APTTG1GPX3ATP10B CYFIP2SLC26A2LOC63920CSF1RGABRA1CD74

HAPLN1

F2RMAP1BDHFR CHD1 PARP8ITGA2C5orf13HISPPD1APCTRIM36 KCNN2PAMCDH10IRF8CYBA IL1R1 IL1R2RARRES3 HRASLS3CD248EFEMP2SERPING1MS4A4AFEN1AHNAKMS4A6A ASRGL1GNG3 FADS2C11orf9SLC15A3 C1QTNF3FAM105A SRD5A1 CTNND2LRP4FOLH1DMN

SLCO3A1

AQP9CCNB2SV2BBLMMYO1EPRC1 NMBNTRK3 IQGAP1PIGBKIAA0101PTPN9

FAH

RAB27A THAP10 ACSBG1 TCF12 ARNT2MNS1ISLRFANCI CA12RPS27L VPS13CKIF23BCL2A1 SCAMP5CTSHANXA2GLCE NEO1DMXL2SCG3 TIPINKIAA1199CSPG4 MYO5CGATMSQRDLSEMA6DCAPN3NDNAPBA2NUSAP1BUB1BCASC5 MEIS2RYR3TRIM2KIAA0922FGFR3 TACC3LDB2 CD38 QDPRUCHL1CFIANK2CXCL10 SLC4A4 SC4MOLIQCG UGT8 ARSJPROM1

GABRA2

TLR1

GABRB1 D4S234ECRMP1 STK32BKIF1B TFRC MSX1C4orf19 CENTD1ABCG2HSPA6CFH

ANKRD15S100B MLL

T4ALK

BTBD3 ZNF91FCGR2AC9orf46

APOBEC3B

IGSF6 GSTT1USP18GOLGA8AFCGR2B SLC2A3FKBP5 TUSC3 FCGBPSCG5CLSTN2C1GALT1PRODH

LOC23117 SELENBP1

PDE4BITGB3BPDIRAS3FLJ10986 GADD45ACLDN1 IL1RAP RPE65ROR1GPR177JUN

CDKN2C DEPDC1

CTHALG6

MTHFD2PPARGC1A KIAA0746

SLIT2ACTG2 CRYZ

LRRC40 HRASLSHES1 APOD FGF12ETV5DGKGSST

IGF2BP2

RTP4RFC4RPL39LTIA1CENPEENOSF1COLEC12TRPM3 TMEM2TJP2 CPEC9orf61FLJ20035 LEPREL1BDH2

SLC39A8

BCL6SLC1A4 FNDC3BHMGB2SCRG1TNIK

SERPINI1

EVI1GOLIM4 PAIP2B ANXA4

TNFSF10

ECT2IL17RB NLGN1 ST

AB1

LAMB2 DOCK3 LRRC2 MOBP CDCP1CX3CR1 NKTR KIF15LTFCCR1 AQP4PMAIP1 RBBP8 NEDD4LMAPK4MALT1

CYB5AST8SIA5

CDH2

CCDC102B

NOL4

PHLPP TWSG1TUBB6FAM38BARHGAP28EPB41L3 CDKN1ASERPINB9TREM1

HIST1H4C HIST1H1A

TRIM38 BTN3A2 TREM2FAM50B

HIST1H2BDSERPINB1 TNFRSF21 HLA−DP A1 HLA−DPB1HSD17B8 KIFC1 HLA−DMA ID4LY86PSMB9 ZNF184GMPRELO

VL2LRRC16EDN1NEDD9 F13A1PHACTR1HLA−E HLA−CTMEM14B HLA−DMBNRN1 LST1AIF1ENPP4C2

ATP6V1G2

CLIC1 CAP2 GSTA4LRRC1BTN3A3GCLC HLA−FMELK

POLR1E LRP2BP

FAT

PDLIM3TLR3ACSL1LPHN3ST3GAL6 TMEM45ACOL8A1PROS1DTNA

SERPINB8

OXTRZNF659TIMP4SLC6A1 BHLHB2EPHB1 CEP70RBP1PCOLCE2SA

TB1

TOP2B PCAFZIC1SCHIP1SSR3 MLF1 PFN2RARRES1WWTR1 TM4SF1PTX3MFSD1CPLXN

PLSCR1 P2R

Y13

PLSCR4 TIP

ARPPLOD2SGEF PLEKHSPA1A BAI3SNAP91 OGFRL1 ANKRD6NT5ERWDD2A ZNF292PGM3ELOVL4 ME1

CYB5R4 RRA

GDPHIPFAM46AMYO6TTKFUT9

POPDC3

AIM1IL17RASLITRK3BCHECLASP2FBXL2TGFBR2REV3L SESN1 WASF1CD164 NR2E1SLC25A4LIPGC3orf14CDH5 GBE1 MAGI1ACOX2FAM107ALRIG1 ADAMTS9FLNBWNT5A CDH11 CENPNGCSHMBPPTPRMCHL1 STACLAMA4GJA1CDC45L GGTLA1 C6orf60DSE

SMPDL3A EPB41L2 CTGF HEY2TPD52L1PTPRK LAMA2 MO XD1VNN2 SQLENDRG1HAS2 ANGPT1 TNFRSF11B SLA ENPP2KHDRBS3ATAD2 DEPDC6ADCY8

CA8CHD7 GGHMYBL1 CCNE2PGCP SDC2 MATN2

HRSP12

TOX

NCALD ZFPM2FZD6

ANKRD46 CHCHD7

HEY1C8orf70SNAI2 MCM4EYA1

SULF1LACTB2 LY96SGK3 TPD52IMPA1CA2FABP5 GEM PMP2 ST18PBK PLA

T

C8orf4SLC20A2MTUS1 FDFT1CTSB FZD3 STC1RBPMS EPHX2 STMN4SLC39A14DUSP26ADAM9

PDGFRL

NAT1LPLPSD3 ChGn

ADAMDEC1 ANGPT2

NEFL

AD

AM28KBTBD11ALCAM RAB6B

TF

NEK11PLXND1 ZBTB20CD200TAGLN3

GRAMD1C

GAP43 LSAMPOSBPL11FSTL1 CD86 PVRL3 PDIA5 MCM2ALDH1L1HCLS1SLC15A2ITGB5RNASET2BRP44L PDE10ANCF4LIF TSTSEC14L2PACRG RGS17 MYCT1 AKAP12FBXO5SO X10 POLR2FAPOBEC3G TCN2RAB32LGALS1MLC1GTSE1WTAP KDELR3 LIMK2HMO X1

APOL6 MYH9TOMM22RAC2THBS2 SEZ6LMAP7SASH1UST EYA4

PHA CTR2 TBPL1 MAP3K5 PERP PEX3 TNF AIP3PLA

GL1DDX58 KIF3AFBN2 IRF1ALDH7A1PDLIM4ACSL6 TGFBISPOCK1KIF20APPICSNCAIP SEMA6ANME5CXCL14 MARCH3P4HA2GRAMD3EGR1LMNB1LRRTM2

SLC22A4

LOX

CDO1

COMMD10

PCSK1CAST GLRXCETN3LRAP PAX6

PRMT3HTATIP2FBXO3 CAT CD44 MPPED2KIF18ALGR4BBO X1 DKFZP586H2123 FJX1SLC1A2 SPON1ZDHHC13 ADM SW AP70

SCUBE2ARNTLCYB5R2DKK3 RRM1DCHS1 TRIM22TPP1IFITM2 IFITM3 CD151 IFITM1SLC7A7 PSME2NOVA1

ARHGAP5

PRKD1RNASE2NDRG2RNASE4 RNASE6SALL2ANG

RNASE1CCNB1IP1MAP4K5PARP2LGALS3PYGL DLG7 NID2POLE2CGRRF1ERO1L PLEKHC1TXNDC1TRIM9 CDKN3CRIP1KIAA0423BAZ1A EGLN3 NFKBIAAHNAK2VRK1FBLN5 FLRT2NRXN3FOS DIO2RPS6KA5TGFB3GALCGPR65 ACTN1

C14orf109 KIAA1622 SERPINA3

CKB ZFYVE21 BDKRB2 NPC2 IFI27 ALDH6A1SERPINA5 WARS KIAA0247SIPA1L1 RTN1 DAAM1 HSPA2 SYNE2 TRMT5SLC38A6CR YL1IFT88 SACS ALO

X5AP NBEAHSPH1EBI2POSTN KDELC1WASF3FGF14UGCGL2DZIP1COL4A2

LOC728215 EFNB2FLJ10154COL4A1MAB21L1 ALG5 EXOSC8 SOHLH2CCNA1LHFP RFC3 DNAJC15 LCP1P2R Y5 C13orf18SPR Y2

CKAP2 EDNRB KCTD12MYCBP2PCDH9HLA−DRA

GLDCSLC1A1VLDLRKIAA0020RFX3PTPRDSNAPC3PSIP1NFIBFAM29AADFPMOBKL2BKLHL9

IDI1PFKP

AKR1C3 PFKFB3PTPLAPTER KLF6 BAMBINEBL

PIP4K2A

SVIL

MAP3K8

ZEB1 NRP1ZNF22RASSF4 C10orf10 CXCL12CCDC6 ARID5BCDC2JMJD1CANK3 DKK1P4HA1SPOCK2DDIT4 PLA UZWINTH2AFY2PPA1IFIT1 LIPA ADD3LGI1 BLNKDNMBP PLCE1 CH25H FER1L3PPP1R3CCEP55IFIT2GSTO1SCD FAS

SORCS3

INA

ACTA2SLIT1RGS10 ABLIM1INPP5FKIAA1598MGMT BNIP3ADAM12 MKI67 H copy number

−2 −1 0 1 2

(c) Copy number variation

2

1

3

4

5 POSTNNNMT PTX3 ASPNCOL1A2 COL3A1 COL1A1MMP7 DKK1

SEMA3CGUCY1A3POPDC3 STEAP1 SLC15A3SMPDL3A SLCO2B1 CCL5CD2LILRB1 DOCK2IFI16 TES

CLEC7A LCP2 CYBBVNN2FYB CENT A2PLEK CCR1MGP SLC11A1 CTSB OLFML3 AD AM28FXYD5 MY O1F ITGAM TNFRSF1B MYH9 CD93 NRP1 SHC1

MAN2B1 TGFBR2ADAMTS1 MXRA8ITGA5COL8A1MVP

OLFML2B DSEIL1R1IL7RRGS2HMOX1 SAMSN1 PSCDBP TLR1 CD14 KIAA0746 IL1B PCOLCE DCN TNF AIP3RAB27ALCP1 CTSCIL4RST

AB1RAC2CNN2GFPT2IL1R2CYP1B1 SLAMF8TNFSF10MAFBBCL2A1 ARPC1BPLAURNCF2GPR65LY96 AQP9FOLR2DAB2AOAH CLEC4E LGALS9ARHGAP15

CD37 OLR1CREG1 C3AR1MS4A6A TBXAS1

EVI2B ACSL1 HPSEC1QBISLRPTER

SCPEP1

FGL2 TFECGIMAP5TLR7OSTF1HLA−DRARGS1DUSP1RNASE1 TYROBP

HLA−DPB1 CD74 HLA−DMA CPM SLACYBA HLA−DP A1 ADORA3HLA−DMB SCIN FCGR2A TLR2 NCKAP1LMAP3K8 NCF4HCKLAIR1PTPRCMND A

CTSS ITGB2SLC7A7CD86CSF1RSYKC2FBP1SEMA3AVAMP8AIF1PYCARDGMFGC1orf38SRGNFCER1GSERPINF1LAPTM5 RNASE6SQRDL HCLS1CD53 LY75CXCL12 PMAIP1AIM2MERTKDDIT4C7SVIL

CLEC4A

TCN2AZGP1SELPLG P2RY13EDNRA MFSD1CTGF

GGTLA1 IL32 IL17RAGRNBMP2K RABGAP1LCOR O1A SERPINB9 MAP1BITGB5NFE2L3FPR1 CHCHD7 MX2 ARHGDIBPROCR CTSZ C11orf75HPR T1ETS2 C21orf91 SELL ADAMDEC1 IRF8RNF128TDO2 FCGR2BCH25H GPNMB LYZ VSIG4MS4A4A CCL2SFRP4 ZFP36GADD45B SERPING1

PPICCYR61 TAGLN FAS

CXCR4 VCAM1ANGPTL4 EBI2 FLJ22662 IFI30 ALO X5AP S100A11ICAM1S100A4BIRC3 GBP2PLTP

CASP1SDC2 CAV1GEMCASP4 P4HA2RNASE2CFI C1SGBP1 TIMP1DPYDSNAI2SLPICOL5A2LAMB1LIF

KDELR3

IBSP BGN CPD FN1COL6A2PTGS2 PLOD2 MALT1STC1BDKRB2CDCP1ADFP

AD AM12 CLEC5AHSP A6 IGFBP3C8orf4PLP2FSTL1TNCACTN1 SERPINH1 WWTR1 NDRG1 APOBEC3B SLC2A3 CA9 AIM1 SERPINE1 CD163 CCL20 F13A1RARRES1TGFBI IL6

TREM1 S100A9 S100A8

PI3 LOX SRPX2 PCSK1PLA U FAM129AABCC3CA12 CHRNA9 NDNUGT8

TMEM100TUSC3 ALCAM BAMBICDR1 SOX10

SA

TB1

SLC1A1SGK

SPOCK1

CLIC2

DEPDC6TXNIPSEMA4D FAM105ALIP

A

IGSF6 ENPP4

EPB41L3RCAN2EVI2A SULF1ENPP2PARP8GCLC

CALCRL

PPA1

GNAI1 CHIC2 TBPL1PLEKHA5GHRPIP4K2A FAM38B PDE1A CADM3ST18C11orf9SLC16A7KLHL9 UCP2CACNA1A

ARHGAP28 HIST1H2BDRASSF4 ENOSF1 PTPRM DNAJB4

TNS3ENO

X1

MED21 CETN3 RRA

GD

USP18 PSPH EMCNNEBL

TMEM106C

SGK3BEST1PRKCB1INPP5FCISD1 ChGnCNTNAP2 GUCY1B3HSPB2PHACTR1 GJB1NRXN3ME1PADI2MGMTCD38CAPN3RAPGEF5APLP1 EFHD1GYPCIFI44LACTB2NUPR1PGDS

SLCO3A1 CTSOPPAP2ADNAJB9FNDC4CO X7A1 SELENBP1MRPS33 STYXL1 NAT1 DKFZP564O0823 QPCT PLAC8 CR YAB ZNF536OSBPL11 AGTPBP1NFASCAMPH C20orf39 CA8 MARCH3 TESC SPA17 C1QTNF3 ACN9 AK1 DNAJC6PRKAR2B IMPA1

LRRC51PIGF NXT2ATP1B1 MAP7 FOLH1DDIT3DKK3

AKR1C3ALDH1A1

LMO3 IFIT1 ASP

A

APOD

KIAA1598

PLP1TFMBP MOGSERPINI1 DYNC1I1 KCNK1 NAP1L2ADD3SNAP25SNCAAK5NEFL MOBPNAV3

VSNL1 ALDH2KIAA1622

DBC1 SV2BSST

S100A1ITM2A PAIP2BSYN1KIAA1107 SLITRK3 RUNDC3BNTRK2 GNG3 GSTM3CRYMSYT1KCNN3PLLP

NCAM2 HPCAL4C20orf42

TOX3

HRASLSPCDH11Y

CHD7NKX2−2NOL4 DLL3 GNG4DCXGPR17 TMSL8 CASC5CDC2 ASPM MKI67 KIF14CENPFKIAA0020CEP55 BAZ1A UGDHDEPDC1CKS2TACC3SHCBP1CDKN3 KIF20A TRIP13 CENPEKIF23 HAS2CHEK1KIAA0101RRM2 MELK

PDGFRA MAD2L1

KIFC1 KIF2C EXO1CDC45LFBXO5 CCNA2 UBE2CKIF4A SPAG5TTKNDC80 LMNB1PBKTOP2ADTLKIF15 NMUFAM64A

HIST1H4C GOLGA8A NET O2 PR OX1 NO TCH1 ZNF91 CCNB1IP1 SACSZEB1LRP1B NO VA1ETV1GRIA3 ZNF83 F2R NLGN4X FZD3ZNF43ZNF228 SEMA5AZIC1SNCAIPHMGCS1C5orf13HEY1UBE2STIA1CPS1 MSH2 PSIP1ENAHCKAP2CBSMYH10 ZNF184CSPG4RFC3 TMEM97

RFX3 MLLT4CLK1 WSB1PFKPPTPN13FLJ10154ZNF573HIP1 IFT52PARP2ZBTB20 EXOSC8MED27MNS1 ENC1 GINS1

RNASEH2A

DCHS1 TRIM24CDKN2CDZIP1 PLCE1

SLC24A3

IFT88PRKD1BLMSALL1 TRIM9SEMA6A LRRTM2ZNF177GAB1 KIF1B GLDCCCND2 JMJD1CH2AFY2

HIST3H2A

ZNF711 PGAP1 REV3LZFP37NFIB SOX5TRAF4 LPHN3ANKRD15ZNF22PAR5

TCF12PPP1R9A

KIF3A APBA2PHC1ZFP2KCNQ2 CLASP2MAPT SHC3PAFAH1B3 MA GI1GAD1RAPGEF4NUDT11PHLPP OLIG2 DSCAMSLIT1SCRG1PGM3 GULP1LRRC40BTG2ZDHHC13 PLA2G4AZNF423ITGB3BP TM4SF1 PGRMC1TMEM14BPBX3CYB5A C7orf44FAM117AEFNA1 C9orf46

LOC23117LOC339047 LOC440350UGCGL2 CXorf45 CCDC131HISPPD1

BCL6

MAP4K5PCDHGA8

ZNF292 ZMYM2 SFRS18NKTR

KIAA1641ZNF84 NASPHNRPM DNMT1 SYNE2 CEP290PPAT

TOMM22PPID FBN2

EPB41L2PTPRKMTHFD2ANGPTL2SMYD3GLCEC15orf5 ZNF124CCDC88ABAZ2B PIK3R1PDE8BLRP2BP MYCBP2ST3GAL6HIPK2 FDFT1

PHLD A1 C13orf27 TTC3 POLR2F IDI1 KIAA1166

REC8PLCB4 PLCB1 TOP2B CP110SMA4MYBL1RMI1 TOX

SNAPC3TNFRSF21

FAM29APSD3 MEIS2

TRMT5 C9orf45VPS13CCCDC6PVRL3CHST11DMXL2 GPX3RPS27L DNMBPCRIM1KIAA0922LIMA1PHIP CTPSZWINT CENPNPEG10ATAD2

TIMELESS

SMC2

KHDRBS3STIL FEN1DBF4DHFRVRK1FAM60A TMPO CHML PRIM1RACGAP1 HMGB2 ORC6LNEK2 DLG7BUB1BCCNB2TPX2KNTC1PRC1

DKFZp762E1312

GTSE1 FOXM1CDCA8 KIF18ASQLE ECT2

RAD51AP1

AURKBMYBL2 SPC25 MCM4EZH2 FANCINUSAP1 IGF2BP3POLE2 PTPR

O

STK32BGGHSTON1CCNE2 GINS2 MCM2 PTTG1 BIRC5DSN1

DONSON TIPIN POLR1E RRM1 IGF2BP2 PCNA NCAPG2

SMC4AURKAKDELC1 C12orf48KPNA2 ASF1BRFC4CDC20 CCNB1 MCM6 RGS17PF4PDK1 CDONDTYMKNLGN4YLRRC1 PTBP2MAGEH1

MOBKL2BSORCS3GADD45G SIRT2 GPSM2 CIT SLC38A1FGF12DGKI LRR TM4RPRMMLL T11 WASF1PDE10AEYA1 CLSTN2CDC7HN1 C16orf80 ODC1SCN1ACDK4

B4GALNT1TMEFF1EPHB1 PDE2ATCEAL2PLCL1ATP10BCXorf57 NMNA

T2

FBXL2 UCHL1USTPFN2GABBR2ZFPM2 CD200 SUSD4ING3RWDD2A ELO

VL4

LOC728215

PIK3R3 CEP70 CRIPTRTN1HEY2

PIK3C2B RPS6KA5BBS10 DOCK3PTPRDGLRB

ANKRD6LOC63920

STMN1PDZRN4KCNA2 THAP10RND2KIAA0888SCN2A BTBD3ST8SIA5GPR56APCLRIG1 RGS5TNIKIGSF3JAKMIP2 NTRK3LRP4ARNT2 SALL2C13orf15 MAPK10FAIM2 HLFCXorf1B3GALT2 SCAMP5 ASRGL1CYFIP2PLEKHB1 SERPINE2PCYT1B NAP1L3ERBB4GABRA1 B3GA

T1PEG3

ANKRD46

AKTIPMANSC1TRIM2SLC25A4DIRAS2 OLFM1 RAB6BKBTBD11FADS2GPR19HPCAHSPB8 TPPP3IL13RA2PMP2 ITM2CRPE65MLF1

SCUBE2

PCP4

D4S234ERAMP3 PROM1DDX3Y EIF1A

Y

RPS4Y1LAMA4KLF6MMP2ID1NID1PXDN

SLC26A2

FLNBP4HA1SLC20A1 CYB5R4HSD17B11GOL

T1B

PL

VAP

FLR

T2

NT5DC3SGCE CDH11TPM4TMEM2CALD1SLC39A14FNDC3B GPC4 FLNAZYXADAM9FAM46A ARID5BZNF659PLOD3LAMC1PLAGL1 PLOD1GALNT2ER

O1L

ACTG2 CTSKAHR NID2

TMEM45AWNT5ATHY1GALSLIT2IGFBP4 MYO1BRND3MFAP4TPM2 CD248ACTA2EMP2

SERPINB8

A2M

SLC16A3CRIP1PRSS23DDR2PRRX1PLXND1SOAT1UAP1SPSB1 RAI14TFPIHOXB2PDIA5 WTAPGBE1

SRD5A1VCAN PTPLABZW2RBBP8USP9YJARID1DTSP

AN13CCNA1 VLDLR IL1RAP COL4A2COL4A1AKAP12BCAT1

VEGF A IGFBP2FRZBTIPARP FOS NDUF A4L2 PCOLCE2UGCG NO X4

ARL4C LIMK2LEF1 IL33SYTL2TAGLN2PLSCR1AEBP1 LRRC2 FBLN5HIG2 EDG2TGFB2DOK5GDPD2 ABCA5 KCNJ8NMBGPR177RDXDCLK1 NEK11 FOXJ1CDO1THNSL2 C11orf63EGLN3 TPST1IQCG SCG5S100A3 NUP107KAL1TRIM36 GRB14MESTC8orf70 ADAM23 ALG6 DNM1

EFCAB2 BNIP3FNBP1LCRY1 YEA TS4 RAP1GAPDLEU1 KLHL7 ID3 HO

XD11 VIL2GAP43 CXADRGSTT1HOXA7 PIK3IP1 EPHX1ADI1 IFIT2SESN1SCD DIO2LDB2OXTRSMO

XIFI6 ENDOD1 BDH2HERC6IFI27MAOA IGSF1C4orf31PER3 LRAP SPOCK2 AXL MAP3K5HERC5FLJ10986 ARSFGNA14IFIH1RSAD2OAS1

GNG11 BCAR3 GDF15METTL7APHKA1 IFI44L SORL1HSD17B6 GLT25D2 PCAF MY

O6 GPRC5BARHGAP5 RAB4A GCSH ABLIM1ABCG2KIAA0828ACSL3ALKEYA2 SPR Y2AASSSPATA6 MAB21L1 DENND2A SGEFCITED1PTN PITPNC1 SDC3 DTNALRRN3 MYO5C PPARGC1APCDH9ALDH6A1 PAX6

PLA2G4C SLC7A11 C20orf103

AGL

TGFBR3

EDG1 QDPRCPEPPP1R3CACYP2CRYL1GSTM4CA11ZFYVE21C4orf19 ACTN2

PDE4B PRMT3 TSP AN6MEG3ADCY8 ABCB1JAM2 CGRRF1 ANK3EXPH5HO XC10SLC1A4CPLX2 SASH1 TST HSP A4L HRSP12CLDN5SNCB RGS7 AHCYL1 ZNF415 TJP2 SLC20A2FBXO3ANKMY2DGKG DNAH7CAP2 GRAMD3

ANK2MTUS1CHN1 EDN1DUS4LBRP44L HIGD1BFLJ13195HOXB7

HSD17B8 MSX1KCTD14GBASFAT LHFPABCA8 ZMA T3 SSF A2

C14orf109DAPK1 ALG5 RPGRPEX3KIAA0423GOLIM4HOMER1NEO1C6orf60 DHCR7CTH FJX1DAAM1

RASSF2 SOHLH2 PILRB LGR4MID1 PLEKHA4 GPILRP1 NT5E SEC14L2 DNAJC15 MPP6 ITGA7

GADD45ALAMB2SLC35F2 SIPA1L1DUSP6 STK17ACD97 ITGA2ETV5SPRY1TRPM3ITGB8

KIAA0754

CHD1 JAG1

CCDC102B

LPHN2LDLR EYA4JUNFZD5ENPEP ARNTL2CDH5TXNDC1PTPN9KDREPHA4NEDD4LFAM3CIDH1IQGAP2 PDGFCADAMTS5RBM7RNF19A

HIST1H1A

ELNODZ4

ANGPT2 DPY19L1PLEKHC1

PSRC1ADAMTS3B4GAL

T5GPC1IGFBP5 ARL4ATFRC NRN1NEDD9EVI1EFNB2GALCCLDN1LIPGEPAS1

DNAJB1 TWSG1

LGALS3BP

PIGB LIN7AEHD2M6PRBP1NRCAMGLI3 IRF9ELTD1MT1X

MOSPD2

CHN2 CRYZ

PDLIM3

TMEM106B

AGA CAT

PFKFB3CHPFATP10D EGR1 TPD52 IFRD1HSPA1AFZD6

C1GAL

T1

SEZ6L2

LXN

ATP2B1 HSPH1MYCT1SSX2IPPERP GAS7 ITPR2DMN

COMMD10 PLA T SC4MOL SLC2A1SA TB2 SEMA6D TRIB2NESFOXG1 B3GAL T1 CCDC144A TLE1 CDH2PDZD2ADAMTS9 SPR Y4 HS3ST3B1

JUNBPROS1RBMS1 NFKBIACALU PLK2RBPMSCTSA LIMS1AHNAKPHACTR2TMEP

AI

MYO1ECD164 WARSROR1

KIAA0247TUBB6PGK1GNSSTAT1SSR3 SNX7

RASGRP3

DDX58 ITPKCBTN3A2 PSME2FLJ20035BTN3A3SLC22A4ABCA1TPP1SQSTM1 C20orf23SP110 SAMD9HLA−F

FAH

APOL6C10orf10 SLFN12REXO2ITGA3IGFBP7CFH IRF1 SAT1TRIM38 IFITM1CAST

BHLHB2 SLC38A6 HEXB P2RY5 PDLIM4CTSL1SLC39A8 PDGFRL RRAS CPVL DRAMFER1L3IFITM2 TNFRSF12A RAB32AHNAK2 F11RLGALS1 FCGBPBA CE2 ARHGAP29 S100A2 SP AG4 THBS2CAV2

PALMDC4orf18RARRES3C9orf95 CD302NMI IFI35ANXA4TMEM140CDKN1AHLA−CANGIFITM3 PSMB9HLA−EVAMP5HSPB1PRUNE2THBS4ANGPT1GSTM5

TNFRSF11B PAMEMP1FAM50B S100A6TRIM22PLS3 SW AP70FKBP5KCTD12COPZ2 APOBEC3G LTBP1 BST2 ICAM3CXCL10 tcag7.1314

UPP1TMBIM1 DIRAS3SDC4 PGCPDYNL

T3MT1E MT1G

BHLHB3

TLR4APOC1TPD52L1

BTK

GIMAP4LST1RGS10CD69NPL IL18 LY86CTSH HEPHTUBA4ARGS4 TLR3 NQO1 SEPP1OGFRL1MGST2RBKS STOMCD58BLVRB

CADPS2 C1orf54 RTP4BLNKGST O1 CX3CR1TREM2 APOC2 SERPINA5 NPC2C3 IL13RA1 CAPG SERPINB1 RNASET2 GLRX HAMP SKAP2RPL39L DHRS3 LOH11CR2A GCA LEPREL1 SH3BGRHT ATIP2 HRASLS3HEBP1MX1 PIRECM2C3orf14 DKFZP586H2123 SSPN CYB5R2 F3

PMP22 CSRP1PNPLA4GJA1CA2GPR37 MAPK4NME5EPHX2 EFHC2SLC16A4PLA2G5ACOX2

S100A13KCNE4 PCSK5TNF AIP6 RNASE4 STEAP3TNFRSF1A LRRC17SCG2 SRPX PYGL FZD7 OSBPL3 XAF1

C13orf18PTGS1 CD151KIAA1199IQGAP1 ZNF217 LAMA2STACTGFB3 CNIH3ARSJGRB10 PDGFDCROT MREG SYPL1 TTC12CYBRD1

TRIP6CSRP2VAV3 LRRC16ALDH7A1ARNTLCHL1NR2E1 MEOX2 FLJ21963ANXA1LGALS3 MO XD1 CXCL14COLEC12 CP RARRES2TMEM176AS100A10SERPINA3 SNX10 CHI3L2RBP1FABP5 FMOD PLA2G2A CD44 EFEMP1ANXA2 CLIC1EFEMP2 C1RL CCDC109B PBEF1ADM FLNCPDPNCHI3L1EMP3LTF NELL2 MA TN2 PLSCR4 PGAM2 GYG2PACRG SLCO1C1 AQP4MA OB C21orf62 CENTD1

HES1LPLITPKBSLC1A3GATMMLC1 CST3NDPRYR3MOSC2RGNBBOX1GRIK1 CNGA3PRCPKCNN2IL17RBACSL6HAPLN1GRAMD1CSLC15A2DD

AH1

CTNNA2 PR

ODH

GABRA2 FAM107A

ITGA6 GMPRRAMP1 ATP1B2 FGFR3SPON1SLC6A1 TSP

AN7NDRG2 KCNIP1 FAM5BWASF3EDNRBACSBG1 SLC4A4SLC1A2GABRB1FGF1HSPA2FXYD1LGI1LPPR4ALDH1L1PSAT1DGKBALDOC S100B C9orf61 KCNJ16ELOVL2

KLHDC8ARPH3A SOCS2 KLHL4 PDGF A SEC61G EGFR DNM3CNTN1PAK3 SNAP91FUT9FAM5CSTMN4DUSP26HMP19KIT PHGDH RUNDC3ANRXN1KIF1A ATP6V1G2 INA CRMP1 KLHL23 SCN3ACRB1GST A4

ACCN2 FGF14GDAP1L1NCAM1 LSAMP SCHIP1 DDX25TIMP4 REEP1FLR T3OPCML TAGLN3THSD7A FAM70A MY

O16 ID4

NCALD FXYD6BMP7NLGN3 NLGN1BAI3

TSP

AN12

PCSK1N

ABA

T

NBEA CA14CTNND2 MPPED2SEZ6L MAP2WSCD1GRIA2 ASCL1 BCHEBEX1CKBTTYH1 CDH10 NCANSCG3C1orf61AGXT2L1 ATP1A2 H expression −5 05 (d) Gene Expression 5 4 1 3 2 OBP2BSLC44A2GPR75TXKCUGBP2ZNF583 PRSS1CUL7 VSIG2C1orf64SLC47A2GALR3KCTD12 C19orf21 TFAP2E

SORBS2

UCNTNXBDHX32 FABP7FNDC3BRDH5 RDH5 RAB34DCLK1HPDSLC7A11 ACSBG1 FKBP10PDE6BHIST1H3ES100A16

CHFRGRB10 BBO

X1AGTAQP4TGFB3 SOCS2DARCPHF20 PDE6B C4orf26KALRNAHR CD36EDAR

PKHD1

SLC22A18

LRFN3 NUPR1 MS4A1MPP7 ARSBPRDM11 C11orf76CNTNAP4LCE1DC13orf29 C1QTNF9LPAR5TAGAPLGTNZNF541AIM2 RCN3CASP2CRX

CD164L2

RGS5

SEMA3BTMPRSS8

TAP1 DAKGDF2TRIM65 TSSK2 UGT1A1THPARK2DEFB118C4orf50 GPR152ABLIM1BTBD6 AZGP1

KR

TAP4−2C3orf22CCR9

CHRNA2GCKRSULT2B1ASB16LMAN1LCD79APCOLCEHK1 MPZKRT13IL1F7PRRG2 GPR35KRTDAPATP8B1C10orf11 RNF186PDE6HPCK1CLCA1 SPINK5ACMSDLCE2DC20orf186 APOBEC4PAPPA2LACR

T

CATSPER1 OP

ALIN PRXTRPV6 MYO1AHTR3BEPX FRKACMSD FOXI1 ADAMTS13CASQ1WFIKKN2 C16orf47IL1RL2 GST

A5

PLA2G3GNASGCNT3MYH7GALR3ABRAITGBL1 BCAS1HBZ

KLHDC7A

FUT1RGSL2 TACR2WFDC13CRCT1ANGPT4CCL16PLA2G4ECYP2A7PART1KLF1B3GNT3SFTPBLY6D

KR

TDAP

C17orf73HTR3CCIB3CYP4F3HFE2RCVRN OR5V1CTSK TEX19

SERPINB12

SPRR3TFF3

MAB21L2 PCOLCERUNX3 IGHG3

SERPINA10FAM107B NLRP14 INS CA CNG3 PDCD1LG2 RBP3 LIMD1HNF4AINS RAP1GAPIGSF9 RIMS3SYNGR2FOSL1 HAMP PDCD1APCDD1MRI1ST6GAL2MICAL1CNFNVPS33AEBI3STMN1C13orf30 CCDC69SULF1COLEC11 C14orf93CDC42EP3ATP10ACR1BNC1

ALS2CR11 C7orf52CNFN SLC4A11 TP63SUSD1FLJ37396 TES PLXNB1SFRP1 SMPD3FAAHADCY5 KCNQ1DN SMPD3SOX8PAX3 D4S234E

PENKPSMD5 TSP50NDNASRGL1ACTL9PSMD5L3MBTL2SKAP1 CYFIP2ST18C2orf82 SLC5A8 CALCRLCCR3 CHFROPRM1 HOXA4 RIN2CDK10SLC8A2 HOXA13GPR27 GRIN1 FOXA2

TMEM147 SRR T RPS6KC1 HSD17B4 SLC25A11BXDC1NDUF A3 GA

TA4MYF6BMP8AEIF4ERUNX3KCTD4RAD51CGRIP1SLC39A7LRFN3SH2D3A C19orf47CUL4A KCTD4CACNG2IL16CRISP2LRRN4CLAKR1C2 TUBA3CHOXA5

PRR

T1CLEC4CRBM17 SMPD3 TRIM54 SOCS4DNAJC5BCCL11KRT33AKRT34KRT33BINHBE KRT14CKMHIF3A SPRR1ARGPD5VWA5B1OR1D2 OR1G1FLJ43826 SLC17A4TPM3 CCL16PKLR

UGT1A3 C12orf59

IL29ASAH2MOGA

T2

NPC1L1SIRPG NR0B2FCER1A TM4SF1UCN3APOC2KRTAP13−3ESM1

WFDC12

ZACN

CELA3B

SLC22A18AS

SDR9C7

IL1R1 AQP3TNNI3MMP26ADH7NCSTN OR2S2DNTTIP2 C10orf81IL5RASEPT12

CEA CAM7 ALO X12B APCS S100A12CR YM SLAMF7

IL1R1 MEFV CD1BADAM7 GMLMGAM FCRLBCNTROB

NEUR OD6KLK12ALDH8A1TM4SF4 ZNF532 ZBTB32 PRDM7LRTM1 BTD CD A C16orf81HEP ACAM2 ADCY10 ALKBH1RGS13CDC45L ZP4

PPAPDC3AQP8MYL4 KLK3OR1F1KIF25LIPEFCGR3A FCGR3A ACTL6B HY

AL4

KCNK18

FUT5 TGM6 SAA4TRIM31 TRPV6 DEF

A1CTSG TRYX3 OR2W1 SPER T SLC36A3SPRR4AD AM29KR T13

C20orf71IL1F9 SGCZ FBXL5SMCPLCE1FZNF280A SH3BP5NOS1 ERP27MSR1

TMEM129 ST6GALNA C1PDIL T GABRA5 FGAPRG4 PGL YRP3 IQCF2 SPRR2A KLK7 FLJ44674UGT1A6XYL T2 C4orf7POU1F1 AD AM21STMN1 ST6GALNA C6 SERPINB5C1orf161WFDC10BC16orf73HIST1H2BO CUZD1PON1 UNC45B CCL8 TP73PF4V1GPR115 EXOC3L CASP14CHST4PSG4 SIGLEC9

LCN1OR7A5 ICAM2 WFDC9 ATP4B

SP ACA3KR T78 NCRNA00161 SPINL W1 AKAP3C14orf68CST9LCYP1A2RIPK3 FFAR1 SCGB1D2

HBE1BTNL2ZNF324 LILRA3C20orf79 C13orf28MND

AGIF C9orf116 NFS1GK2RGS13 MS4A2KIR3DL1KRT9CCL7ZNF266 SUMF1MGMTNTF3FLG PPP1R3ATTLL6 MGMT MSMB KR T15 CDA CLCSPAM1 SDR9C7 LCN6 ABI3IGJ

SLCO1B1ZNF541FLJ40235 FLJ46358FAM12BSLC17A1 SLC34A1TRHRKRTAP10−8 ANP32D CD300EFAT2

SIRPD OR1N1 CD163NLRP10DEFB4

KRTAP13−1C19orf59SYCE1PGL

YRP3 GDEPDYRK4LBP

C21orf56ANXA4

PF4

FAM83FS100A10

PSD4PEX10 USP29 REG3A C2orf53GDF5PRSS16 ZNF274 MTHFRRBM46ISG20L2IL21RSFTPDJPH4 MYT1

OGFOD1

ANK3

C22orf23ANPEP ZC3H7AMYH1 ZNF19 OR7C1HAS1NLRP8SLAIN1PRRG2APODMYT1CD1AFSD1HOXB1 ARPP−21LRRC4DEFB123DGKIFCGR3BLALBALCKDAB2IPOR2C3 SKAP1EDN3C7orf16 CHST13 C7orf16 PRSS16RTEL1

DNASE1L2 AKT3 TMEM140 LYL1 SEC14L4 ANK1KCNQ1XAF1 ENTPD3 UCNISG20 SLC25A10 TTC22 GBGT1RHCGGAS2L1 CYFIP1KRT7 CDH22 P2R Y6CYBA LPAR2OAS2

KCTD14RHCGZMYND15ANKMY1 RASSF1 RASSF1 THNSL2 PYCARDCRIP1 THRBC9orf167NEBL HCG9 HCG9CYB561KIAA0746KCNQ4FTH1A4GAL

T CCDC78GSTM5SLC16A5SPATS1 ALDH1A3TMEM176B HNF1B MOXD1 FAM124BFAM124B CCNA1 CCDC8 LY75 LRRC61 PAOX COR O6 LRRC8E WFDC2 MBP JAG2 ALDH1A3PRR15 PDIA2 ABCG2C6orf150KLF11MYL12ACTSZ

SLC44A3

JAG2

GPR25CDKN2A CDKN2A CDKN2AHOXD3LPCAT2

SERPINB1 RSPH9 ACTA1 HSP A2 CCDC68PCDHA13STEAP4 TRHABCA3CXCL12 CDKN2BSCAPRPL39L HSPA2 TNFRSF10A C6orf227C1orf87HCP5 WNK2 PKP1 HAA O SPINT2OCA2GPR27TP73HSP A2 PCDHGB4 SPD YA LVRNLPAR1MKNK1RHBDD1PCDHGB7 ABLIM3MIXL1HOXD4 GPR124HIST1H4J HIST1H4K GLS2 HIST1H2AI DLX5SYK ENTPD1PCDHGB7 HOXD3VRK2 TBX5 PIP5KL1

APOB SDPRC6orf227RSPH9C1orf107 IQSEC1TBR1TOM1L1ZMYND12 SLC15A3CPXM2 NOD

ALCYBA CTSZ

RASSF3

GBP3

TNFRSF18

CHI3L2 NOXO1CFTR RIPK3CHI3L2ADRA1AAPOL1 CPNE8LY6KC13orf33 CDKN2B SLC5A1 CDKN2BMA T1ACTPS FAM46B ARL4A ARHGAP8 WT1CBR3LXN COL14A1 FZD6 LRRC61FABP5 HAA O ISM2 GPR157

HEYL MESTENTPD1SP100PAPSS1CHRNA4SCN5AERGPRKG2 SUMO3 DLEC1 SMAGP

TNFRSF4

EYA4

SOCS3 SPINT2NGF

C13orf31

MYLKRNF207BAI1GSTM5ASAM TOX2ALDH1A3 FCGR2BPTGISMAP4K2FARP1 NET1HDAC3

CCNA1SLC13A5

TNFRSF10D

C16orf28 C13orf33

MEST PGCP MEST GLRX ETV7

CHRNB4 COG2 GJB6 C10orf10 GAS6 PRPH SYT9 EFEMP1 MT1HFBXL22 RNF43 SLC22A18

IMPDH1SCGB3A1 SLC11A1OCIAD2RHOHOCIAD2PLEK2 F13A1 EPHA2 CILP2 CNTN4 RILPL2FAM123CDAAM2SPNS3 LTB4RACTN2ACSS1 NPPB

ADSSL1 ENTPD3 MPV17LHIST1H1A

TNNI3PGFBSCL2AKR7A3S100B KCNQ1RASSF1OXCT2 DSG2NTRK2LEPCREG1GFI1

TNFRSF10C KR T72 MFSD7 CMTM2 SOCS1 EPHX3 CR YABNGBLPCA T1

TRPV4 WDR85SULT1A1 TBX5ESRP2ZDHHC12GPR126 DNAH3KIFC2ALDH1A3SLC2A2 BEGAINALX4 AQP5MYD88 FGF23 KCNQ1ARHGEF7 TNFRSF1BPCDHB14C19orf35 SLC12A6PSD3PLCD1COX7A1KCNE3 CAMK4SPIBCD244 KLHL1

TNFSF13B TOX2 BANK1CYP26C1IL20RASTEAP4 IER3TCF15 RAC2 MAP2K3APCDD1LACO T12 APCDD1LCPXM2SND1PSKH2 KRT72 SEMA3BMOBKL2ATEKT3 POMCALO X15B EVC2 ZC3HA V1L VPS53 KCNS3ASP A PM20D1 PSMD11SNAI1KIAA1804IGFBP7 PAQR9 TNFRSF10C GNG4 RYR1TRAM1 LHCGRCD14 NCCRP1 MTNR1A PAX1 DSCR6CGB2 TCF12MORN4 CCDC19NAGA C22orf27 SLC5A1

MKX ME1 ME1C7orf13PVRRHOD LPAR2AHNAK ACSL1APH1B AXIN1TET2KCNQ1 TCF21TBX1 FLT4

SLC26A5

GBP4HPNPLAC2RBP1PLAC2SLC6A15FBLN2PLLPENTPD2DLEC1BMP4 GJD2 TFPI2HOXA9TLR2 C1orf115HO XA9 HO XA7 ST6GAL1 NINL KCNQ1DN SNX9KCNK5 KCNC4 MCHR2B4GALT6 B4GAL T6

TSGA14 ADRA2BGALNT14 PRKCDBPADSSL1 PRKCZGIPC2 BANK1 ULBP1C10orf82MNX1

TNFRSF10D

KCNS1VRK2PRRG4CREM CDH1

TNFRSF10D

RAC2

SLC13A5RPL39LPCDHB13 HIST1H4LENTPD2PCDHB12KCNA1SECTM1TLR2BHMT2KLSIX6GCM2CTHRC1TCTEX1D1SPAG17

CCDC140 C1orf115 SLC16A3CYB5R2 SLAMF7 FBLIM1

HPN PHLD A2SST PYCARDPRKCDBPSEMA3F HEYL TMEM171FO XJ1 RAD9B ZNF22 APLNR FO XJ1

EFCAB4B CLEC14AIRAK3 GNAI2

KLKLC1 KIAA0323 GJB6 ECHDC3 UBXN10 PRA C PLA2R1 HTR5ADLX1 PHO X2B KCNK17 SLC7A10LRRC56HO XB8

TCF21 GHDC CCL23 GCNT1TMEM92AMPD3SLPICD244MYOM2GOLM1FAIM3P2RX1 TRIM40ABTB1TREML2 C22orf33 CXCL10 SLC26A4AIM2HESX1 HTR3E CSRP3GPR114 FCGR2A GIMAP1 KCNK17TNFRSF9CD48ASGR2 ZG16B EGFL7 PRTN3

NO

TCH4 F2RL3SDC4CSDC2CD93RGS14DPYSL5MBNL1 PODXLMEGF11GSTP1VSTM2LELANE NFATC2

KCTD17HO XA2 COMPNFATC2 SHKBP1 PR OKR2 TER T

MADDC19orf55 PARP12 BFSP1EPOCHFR IFNA2

CCDC87

HCP5 NAT8

TCF15BTN3A2ZFPL1 MYBL2 GFRA4GIMAP5 TAS2R60TREM1 KCNG1C21orf128MS4A1 STMN2BMS1 KLK4HSPA1L WFDC3PCDHB2 SLC9A2ALX4 SNNLIMD2 KCNA3BNC1 MALL SNX9KCNA3 HOXA9 PCDHB15GNA14PCDHB2UFSP2LRRC34 KCNQ1 SLC27A6CBX7HAND2 H2AFY MYO1BHOXA2 PCDHB15HO XA7 DUSP1FGFBP2NUDT4 MRGPRX2CCDC64 CD53 WNT10A RINL UBXN10GCET2MRGPRFWDR41TRAPPC1FCER2 SPON1NXNHOXB2 ELO VL1 RALGDS MAPRE3 EML1CASP8SH2D3AALDH1A3TRIM58 HOXC11 HO XA11 HO XD11 OLIG3PITX2H2AFYESR1 RASSF1NEUR OG1 FAM69BCDX2 AJAP1

NKX3−2 MAP4K1 RASSF1 PPP1CA

NEUR

OG1HO

XA9

NEUR

OG1ALOX15CDCP1TAC1LDHCRUNX3HOXA9KCNS2 ZFP41MRPL41KLF11SYKCPT1APKDREJPRLHRDLX5HIST1H4IRNF149WT1ACOT4TAC1TACR3 SIM1PPIEID2HTR7WT1IL12BWBP2NLWWP1DLX5 WIT1

C14orf102 SIM1 PR OCA1 NHLRC1ALAS1WT1PTMS TYMP NEUR OG1KLF14 ANLNCCDC96

HHEX NMBRHOXA6

TBX20 TSHR GNA11 PRG2ATP5G2HIST1H4I WT1 TNFRSF8 SLC2A14 DLX5 GPC2 BOLL TNFRSF10D DPYS RIBC2SIM2 HO XD10HTR1B SPAG6 LRRC3 GA TA4

SYNM NOL10 GNG4 TRHDEAMNADCY5LVRNHOXA9BNC1 MTL5

TNFRSF10C CXCL1NEFH ELL2 C1orf188SPAG6TDRD5MOSWDR52SYN2 HT ATIP2 SLC22A16 HO XB4 IZUMO1 EHHADHC19orf41GA TA4 ARHGAP24 DKKL1RPL26L1

FOXI2LOC84856C15orf51P4HA3BRF1 HBQ1FLJ45983CALCA CALCA HOXD9 DIO3CHP2NFE2L3 HOXD9KCNQ1EN2ACOT8

PTH2RTCERG1LCAR TPT WT1 NINL GA TA4PAX9 XRCC3 C4orf32 NXNL2 SO X14 GA TA4

SLC5A8 HOXD12USH1C C1orf59IRF6GATA4SLC5A7 MAP4K1 SPATA18

HO XA11 ICMTHO XB4 ADAMTSL3 CYP2E1RASGRF2MYO1BNEFMHOXD4 HO XD12FOXD3 ALDH1A2KLF14 MAFBNPY STRAD A SRD5A2 WDR69 C7orf13 IGF2DRD5 KLK10 CELSR1 PRSS22 PALM2−AKAP2

CXCL6 KCNQ1 PCSK6ADRA2C SLC27A2MYOD1GPR6 ROR2NPTX2RELNISL1MSX2CXCL6KCNK12TRPA1

SO X17 TMEM132D GA TA4 TCEB2VAX1

PDE4C SYT10 PTGDRFAM19A4 SPHKAP

GBX2

HO

XC11PRKCBPTGER2 HMG20B SRD5A2PRR16 SMOC1NHLRC1PRKCHISL1MYO3AFAM150A ATP8A2BARHL2HHEXZNF540TESCXCL1 DMRT2RFX6FOXB1TBX20 BNC1 ASCL2TLX3ZFP42 HTR1B KCNK5SLC18A3SSTR1

TMEM132D

GREM1 FO

XE3

RASEF

ARHGAP9

DLC1 BNC2DIRAS3CD84TMEM116KCNH4DLC1 IRF6 EIF4ENHEJ1 HDAC1

ASAP3C17orf57 CD300C CHRNA6

CDKN2AIPNL

SLAMF1GST

A3

AMICA1 FOXO1SP140 AQP5CCRL2SH2D2A RNASE6 CD300LFTYROBP CD33ECE1GPR132TRAF3IP3NUAK1 IL4REVI2A STYK1

HLA−DRA

TMC8 VGLL2DOCK9PKP2DMR

T2TUBB6 SERHLSYKC13orf15HOXD4MLNR

FERD3L

SYK

FERD3L ZMYM2

NEFM RGS10IRX2 PAX9

IGF2BP1 CCDC140ECHDC3PHO X2BDGKE VAX2 SUSD3 KLK10GALNT14 INA TRIM58SLC35F3RUNX3 ONECUT2 GPR78GPR6 TRPA1 PRDM14 RALBP1 SLIT2 MSC KLK10NPY

CNTNAP2 CNTNAP2HTR5A ZNF177ZNF702P EFCAB1HOXD8 IRX2GPR83RASGRF1 SLC12A5EGR4NETO1ZNF177 HTR1EWNT2 RAB37SYN2TLX3FOXA1

SLC27A6WBSCR17 KHDRBS2 BNC1 IRX4 MSX2IRF8GA TA6FLT3 PRKCB CYYR1NPR3 SLC25A21 TNFSF9HCN1 EPB41L3MYO3AFO XA1 ZNF560 H methylation −1 −0.5 0 0.5 1

(e) DNA Methylation Figure 4: (a) Heatmap of W. The clustering of PIntMF was compared to glioblastoma subtypes. (b) Survival curves with p-value of log-rank test. (c, d, e) H matrix for the three considered omics blocs on glioblastoma dataset

percentage of explained variation (Fig. S1). Besides, this initialization is performed at the integrative level rather than separately on each block of data.

PIntMF automatically tunes the penalties on the matrices H_k and W, without any user intervention, and we observed that all the matrices are quite sparse on real datasets (Figure 4). The user only needs to choose one parameter: the number of latent variables. This parameter can be chosen by examining the MSE, the cophenetic coefficient, and the PVE (Supplementary Materials Fig. S2 to S6). All these criteria are implemented in the R package. For the non-correlated simulations, only the cophenetic coefficient and the PVE identify the correct number of latent variables.
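As a rough illustration of how two of these criteria behave when the number of latent variables changes, the sketch below computes a pooled MSE and a PVE for a factorization X_k ≈ W H_k. It is not the code of the PIntMF package: the function names are hypothetical, the PVE is assumed to be 1 - RSS/TSS with an uncentered total sum of squares (the package may define it differently), and the toy matrices are invented. The cophenetic coefficient is left out because it requires repeated clusterings.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def squared_norm(m):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in m for v in row)


def reconstruction_criteria(blocks, w, hs):
    """Return (mse, pve) pooled over all blocks for the model X_k ~ W H_k.

    Assumption: PVE = 1 - RSS / TSS, with TSS the uncentered sum of squares.
    """
    rss = 0.0  # residual sum of squares over all blocks
    tss = 0.0  # total sum of squares over all blocks
    n_values = 0
    for x, h in zip(blocks, hs):
        fitted = matmul(w, h)
        residual = [[xv - fv for xv, fv in zip(xr, fr)] for xr, fr in zip(x, fitted)]
        rss += squared_norm(residual)
        tss += squared_norm(x)
        n_values += len(x) * len(x[0])
    return rss / n_values, 1.0 - rss / tss


# Toy example (invented numbers): 4 samples, one block of 3 variables,
# generated exactly by 2 latent variables.
w_two = [[1, 0], [1, 0], [0, 1], [0, 1]]
h_two = [[2, 0, 1], [0, 3, 0]]
x = matmul(w_two, h_two)

# A one-factor model that fits every sample with the column means of x.
w_one = [[1], [1], [1], [1]]
h_one = [[1.0, 1.5, 0.5]]

results = {}
for n_latent, (w, h) in {1: (w_one, h_one), 2: (w_two, h_two)}.items():
    results[n_latent] = reconstruction_criteria([x], w, [h])
    print(n_latent, results[n_latent])
```

In this toy case the PVE reaches 1 and the MSE drops to 0 once the second latent variable is added, which is the kind of elbow one would look for when comparing candidate numbers of latent variables.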

It is still difficult to evaluate the performance of an integrative method on simulations (Cantini et al., 2020). The relationships between omics blocks are complex and often poorly understood, and modeling these links is not easy. To our knowledge, no reference dataset exists to assess performance. Therefore, we evaluated the algorithm on two different simulation frameworks (completely simulated and based on real data) and on two real datasets, and we compared it with several other state-of-the-art integrative methods. On the first simulated dataset (non-correlated blocks), PIntMF outperforms the other methods on both clustering and variable selection: its clustering makes few classification errors. PIntMF is also more robust to heterogeneous data than the other methods: for variable selection, it performs as well on Gaussian distributions as on binary or beta distributions. On the second simulation framework, based on real data (correlated blocks), we observed good performance for both clustering (perfect classification) and variable selection (AUROC above 90%). The applications to two real datasets (BXD and TCGA, Section 4.3) show that the method can handle real data: we found original subgroups as well as interesting variables linked to the clinical phenotypes (diet and overall survival).
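For readers who want to reproduce a variable-selection AUROC on their own simulations, here is a minimal sketch. It assumes that each variable receives a score (for instance the largest absolute coefficient of its column in H_k) and a binary label telling whether it was truly simulated as relevant; this scoring rule and the function name are assumptions for illustration, not necessarily the protocol used in the benchmarks above. The AUROC is computed with the rank-sum (Mann-Whitney) identity, using average ranks for ties.

```python
def auroc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels (rank-sum identity)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank inside the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Invented example: sparse scores where zeros mean "not selected".
scores = [0.9, 0.0, 0.4, 0.0, 0.7, 0.1]
truth = [1, 0, 1, 0, 1, 0]
print(auroc(scores, truth))
```

Because a sparse method sets many coefficients exactly to zero, ties are frequent, which is why the sketch handles them explicitly instead of relying on a strict ordering.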

A weakness of the model is that the convergence of the algorithm to an optimal solution is not mathematically proven. Besides, no significance test is provided for the variable selection because of the use of the LASSO regression (Jain and Xu, 2021). A jackknife procedure could give an idea of the confidence in the selected variables (Supplementary Materials Fig. S10). However, this type of approach is very time-consuming when datasets are large.
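The jackknife idea can be sketched as follows: refit the model once per left-out sample and record how often each variable is selected. The code below is only an illustration under stated assumptions. `toy_select` is a hypothetical stand-in for a full PIntMF refit (it keeps variables whose absolute column mean exceeds a threshold), and the data are invented; in practice the selector would return the variables with non-zero coefficients in the refitted H_k.

```python
def jackknife_selection_frequency(rows, select):
    """Leave each sample out once; return, per variable, the share of refits selecting it."""
    n_vars = len(rows[0])
    counts = [0] * n_vars
    for left_out in range(len(rows)):
        subset = rows[:left_out] + rows[left_out + 1:]
        for j in select(subset):
            counts[j] += 1
    return [c / len(rows) for c in counts]


def toy_select(rows, threshold=1.0):
    """Hypothetical stand-in selector: keep variables with |column mean| > threshold."""
    n = len(rows)
    return {j for j in range(len(rows[0])) if abs(sum(r[j] for r in rows) / n) > threshold}


# Invented data: 4 samples, 3 variables.
data = [
    [2.0, 0.1, 4.0],
    [2.2, -0.2, 0.0],
    [1.8, 0.0, 0.0],
    [2.1, 0.1, 0.0],
]
frequencies = jackknife_selection_frequency(data, toy_select)
print(frequencies)
```

The third variable is selected only because of one extreme sample, so its frequency falls below 1 while the first variable stays at 1. The loop performs one refit per sample, which is the source of the computational cost mentioned above.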

Another improvement of the method would be to handle missing values. Missing values can occur inside a block for a few variables. They could be imputed by the average of other correlated variables, by the values of the nearest neighbor, or by more complex methods such as those proposed by Voillet et al. (2016), González et al. (2009), and Husson and Josse (2013). Commonly, a whole block can also be missing for an individual. In this case, the row of the matrix W for this individual could be computed only from the blocks that are present. Using this row of W, we could then deduce a profile for this patient in the missing block from the matrix H_k inferred with the other individuals.
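One way to write this missing-block strategy down is sketched here. The notation is assumed for illustration and does not come from the paper: K_i is the set of blocks observed for individual i, x_i^(k) is the row of block X_k for that individual, C stands for the positivity and normalization constraints placed on the rows of W, and λ is the sparsity penalty on W.

```latex
% Sketch under assumed notation; H_k is held fixed at the value
% inferred from the individuals with complete data.
\hat{w}_i = \operatorname*{arg\,min}_{w \in \mathcal{C}}
  \sum_{k \in K_i} \left\lVert x_i^{(k)} - w H_k \right\rVert_2^2
  + \lambda \lVert w \rVert_1,
\qquad
\hat{x}_i^{(m)} = \hat{w}_i H_m \quad \text{for } m \notin K_i .
```

The first expression fits the individual's latent coordinates using the observed blocks only; the second projects those coordinates through the H matrix of the missing block to obtain a predicted profile.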

We could also extend PIntMF by including prior information such as the genome structure. For instance, we could force the algorithm to select the same genes in the DNA methylation block and the expression block. A group Lasso penalty (Simon et al., 2013) could be added to the proposed model to include such a prior.
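As a hedged illustration of what such a prior could look like, the penalty below groups, for each gene g, all coefficients of the methylation and expression blocks mapped to that gene. The symbols G (number of genes), p_g (number of coefficients in group g), and the bracket notation for the columns attached to gene g are assumptions introduced here, not notation from the paper.

```latex
% Group penalty tying the two blocks together, gene by gene (sketch).
\Omega\!\left(H_{\mathrm{meth}}, H_{\mathrm{expr}}\right)
  = \lambda \sum_{g=1}^{G} \sqrt{p_g}\,
    \left\lVert \left( H_{\mathrm{meth}}^{[g]},\; H_{\mathrm{expr}}^{[g]} \right) \right\rVert_F
```

Because the Frobenius norm of a group is zero only when every coefficient in it is zero, this term tends to select or discard a gene jointly in both blocks. Keeping the existing Lasso term alongside it would give a sparse-group penalty in the spirit of Simon et al. (2013).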

To conclude, PIntMF is an easy and flexible method to integrate omics data. It exhibits good performance in terms of both classification and variable selection, whether the blocks are correlated or not. Among all tested methods, it is the one that works in most situations. PIntMF is fast and automatically tunes the penalty for each block to select an appropriate number of variables (sparse matrices). Besides, it provides a sparse matrix W that makes the clustering of samples easier. We also provide three criteria, namely the MSE, the PVE, and the cophenetic coefficient, to choose the best number of latent variables.

The integration of several types of omics with our method could help in discovering potential markers even with a small number of patients. Finally, it could also help to classify patients with unknown phenotypes.

6 Software

An R package named PIntMF can be used to reproduce all simulations and figures and is available online at ??.

References

Bersanelli, M., Mosca, E., Remondini, D., Giampieri, E., Sala, C., Castellani, G., and Milanesi, L. (2016). Methods for the integration of multi-omics data: mathematical aspects. BMC bioinformatics, 17(Suppl 2), 15.

Bock, C., Farlik, M., and Sheffield, N. C. (2016). Multi-omics of single cells: strategies and applications. Trends in biotechnology, 34(8), 605–608.

Brunet, J.-P., Tamayo, P., Golub, T. R., and Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12), 4164–4169.

Burstein, M. D., Tsimelzon, A., Poage, G. M., Covington, K. R., Contreras, A., Fuqua, S. A., Savage, M. I., Osborne, C. K., Hilsenbeck, S. G., Chang, J. C., et al. (2015). Comprehensive genomic analysis identifies novel subtypes and targets of triple-negative breast cancer. Clinical Cancer Research, 21(7), 1688–1698.

Cantini, L., Zakeri, P., Hernandez, C., Naldi, A., Thieffry, D., Remy, E., and Baudot, A. (2020). Benchmarking joint multi-omics dimensionality reduction approaches for cancer study. Nature Communications.

Chalise, P. and Fridley, B. L. (2017). Integrative clustering of multi-level omic data based on non-negative matrix factorization algorithm. PloS one, 12(5), e0176278.

Chalise, P., Koestler, D. C., Bimali, M., Yu, Q., and Fridley, B. L. (2014). Integrative clustering methods for high-dimensional molecular data. Translational cancer research, 3(3), 202.

Chauvel, C., Novoloaca, A., Veyre, P., Reynier, F., and Becker, J. (2019). Evaluation of integrative clustering methods for the analysis of multi-omics data. Briefings in Bioinformatics.

Chen, J. and Zhang, S. (2018). Discovery of two-level modular organization from matched genomic data via joint matrix tri-factorization. Nucleic acids research, 46(12), 5967–5976.

Chung, R.-H. and Kang, C.-Y. (2019). A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification. GigaScience, 8(5), giz045.

Gaujoux, R. and Seoighe, C. (2010). A flexible R package for nonnegative matrix factorization. BMC bioinformatics, 11(1), 367.

González, I., Déjean, S., Martin, P. G., Gonçalves, O., Besse, P., and Baccini, A. (2009). Highlighting relationships between heterogeneous biological data through graphical displays based on regularized canonical correlation analysis. Journal of Biological Systems, 17(02), 173–199.

Huang, S., Chaudhary, K., and Garmire, L. X. (2017). More is better: recent progress in multi-omics data integration methods. Frontiers in genetics, 8, 84.

Husson, F. and Josse, J. (2013). Handling missing values in multiple factor analysis. Food quality and preference, 30(2), 77–85.

Jain, R. and Xu, W. (2021). HDSI: High dimensional selection with interactions algorithm on feature selection and testing. PLOS ONE, 16(2), 1–17.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788.

Meng, C., Helm, D., Frejno, M., and Kuster, B. (2015). moCluster: Identifying joint patterns across multiple omics data sets. Journal of proteome research, 15(3), 755–765.

Mo, Q. and Shen, R. (2018). iClusterPlus: Integrative clustering of multi-type genomic data. R package version 1.18.0.

Mo, Q., Wang, S., Seshan, V. E., Olshen, A. B., Schultz, N., Sander, C., Powers, R. S., Ladanyi, M., and Shen, R. (2013). Pattern discovery and cancer gene identification in integrated cancer genomic data. Proceedings of the National Academy of Sciences, 110(11), 4245–4250.

Network, C. G. A. et al. (2012). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418), 61.

Nowak, G., Hastie, T., Pollack, J. R., and Tibshirani, R. (2011). A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics, 12(4), 776–791.

Pierre-Jean, M., Deleuze, J.-F., Le Floch, E., and Mauger, F. (2019). Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Briefings in bioinformatics.

Ramazzotti, D., Lal, A., Wang, B., Batzoglou, S., and Sidow, A. (2018). Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nature communications, 9(1), 4453.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336), 846–850.

Reilly, B., Tanaka, T. N., Diep, D., Yeerna, H., Tamayo, P., Zhang, K., and Bejar, R. (2019). DNA methylation identifies genetically and prognostically distinct subtypes of myelodysplastic syndromes. Blood advances, 3(19), 2845–2858.

Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A., and Kim, D. (2015). Methods of integrating data to uncover genotype–phenotype interactions. Nature Reviews Genetics, 16(2), 85.

Rodosthenous, T., Shahrezaei, V., and Evangelou, M. (2020). Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: A comparison study. Bioinformatics.

Rowlands, D. S., Page, R. A., Sukala, W. R., Giri, M., Ghimbovschi, S. D., Hayat, I., Cheema, B. S., Lys, I., Leikis, M., Sheard, P. W., et al. (2014). Multi-omic integrated networks connect DNA methylation and miRNA with skeletal muscle plasticity to chronic exercise in type 2 diabetic obesity. Physiological genomics, 46(20), 747–765.

Sastry, A. V., Hu, A., Heckmann, D., Poudel, S., Kavvas, E., and Palsson, B. O. (2020). Matrix factorization recovers consistent regulatory signals from disparate datasets. BioRxiv.

Shen, R., Olshen, A. B., and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25(22), 2906–2912.

Shen, R., Mo, Q., Schultz, N., Seshan, V. E., Olshen, A. B., Huse, J., Ladanyi, M., and Sander, C. (2012). Integrative subtype discovery in glioblastoma using iCluster. PloS one, 7(4), e35236.

Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). A sparse-group lasso. Journal of computational and graphical statistics, 22(2), 231–245.

Sneath, P. H., Sokal, R. R., et al. (1973). Numerical taxonomy. The principles and practice of numerical classification.

Sokal, R. R. and Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon, pages 33–40.

Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2), 257–284.

Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.-A., Grill, J., and Frouin, V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, 15(3), 569–583.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.

Tini, G., Marchetti, L., Priami, C., and Scott-Boyer, M.-P. (2017). Multi-omics integration - a comparison of unsupervised clustering methodologies. Briefings in bioinformatics.

Vasaikar, S. V., Straub, P., Wang, J., and Zhang, B. (2017). LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic acids research, 46(D1), D956–D963.

Figure

Figure 1: Adjusted Rand Index of PIntMF, intNMF, SGCCA, MoCluster, iClusterPlus, and CIMLR methods on simulated datasets
Figure 2: AUROC of PIntMF, MoCluster, SGCCA, iClusterPlus and intNMF for OmicsSIMLA simulations on (a) DNA methylation and (b) Gene expression blocks
Figure 3: BXD cohort results: top 10 variables selected by PIntMF in each dataset (metabolites, proteins, and RNA); the clustering given by PIntMF and the true clustering are shown on the right.
