HAL Id: hal-03154671
https://hal.archives-ouvertes.fr/hal-03154671
Preprint submitted on 2 Mar 2021
PINTMF: PENALIZED INTEGRATIVE MATRIX FACTORIZATION METHOD FOR MULTI-OMICS DATA
Morgane Pierre-Jean, Florence Mauger, Jean-François Deleuze, Edith Le Floch
To cite this version:
Morgane Pierre-Jean, Florence Mauger, Jean-François Deleuze, Edith Le Floch. PINTMF: PENALIZED INTEGRATIVE MATRIX FACTORIZATION METHOD FOR MULTI-OMICS DATA. 2021. ⟨hal-03154671⟩
A PREPRINT
Morgane PIERRE-JEAN
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France, [email protected]
Florence MAUGER
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France
Jean-François DELEUZE
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France
Edith LE FLOCH
Université de Paris-Saclay, Centre National de Recherche en Génomique Humaine, CEA, Evry, France
March 2, 2021
ABSTRACT
It is increasingly common to explore the genome at several levels rather than at a single omic level. Through integrative statistical methods, omics data have the power to reveal new biological processes, potential biomarkers, and subgroups of a cohort. Matrix factorization (MF) is an unsupervised statistical method that provides a clustering of individuals and also reveals relevant omic variables from the various blocks. Here, we present PIntMF (Penalized Integrative Matrix Factorization), an MF model with sparsity, positivity, and equality constraints. To induce sparsity in the model, we use a classical Lasso penalization on the variable and individual matrices. For the matrix of samples, sparsity helps the clustering, and a normalization of the inferred coefficients (matching an equality constraint) is added for better interpretation. In addition, the sparsity parameters are tuned automatically using the glmnet package. We also propose three criteria to help the user choose the number of latent variables. PIntMF was compared to other state-of-the-art integrative methods, including feature selection techniques, on both synthetic and real data. PIntMF succeeds in finding relevant clusters as well as variables in two types of simulated data (correlated and uncorrelated). PIntMF was then applied to two real datasets (diet and cancer), where it reveals interpretable clusters linked to available clinical data. Our method outperforms the existing ones on two criteria (clustering and variable selection). We show that PIntMF is an easy, fast, and powerful tool to extract patterns and cluster samples from multi-omics data.
1 Introduction
The improvement of high-throughput biological technologies enables the production of various omics data such as genomic, transcriptomic, epigenomic, proteomic, and metabolomic data (Ritchie et al., 2015; Yugi et al., 2016). The generation of these data allows investigating biological processes in cancer or complex diseases. For example, The Cancer Genome Atlas (TCGA (Network et al., 2012)) has already produced numerous omics data for a set of 32 cancer types (Vasaikar et al., 2017). Recently, other multi-omics studies on complex diseases and single-cell data have also been emerging (Rowlands et al., 2014; Bock et al., 2016; Yang, 2020).
However, integrating omics data raises several statistical challenges, such as dealing with a large number of variables, few samples, and data heterogeneity (Bersanelli et al., 2016). Indeed, the statistical distributions of omics data are very heterogeneous. For instance, mutations can be modeled by a binary distribution, while RNAseq data can be modeled by a Negative Binomial distribution and metabolomic data by a Gaussian distribution. Besides, the omic block sizes can vary from one hundred to one billion variables. Furthermore, collecting several omics for a single sample can be difficult because of the cost and the limited access to the biological material.
To identify potential biomarkers and new classifications in complex diseases, unsupervised integrative methods have been developed over the last decade to analyze multi-omics datasets (Tini et al., 2017; Huang et al., 2017; Chauvel et al., 2019; Pierre-Jean et al., 2019; Cantini et al., 2020). Blocks of omics data can be seen as matrices, and relevant information can be extracted using dimension reduction methods, in particular matrix factorization (MF) methods (Sastry et al., 2020) and canonical correlation analysis (CCA) (Tenenhaus and Tenenhaus, 2011).
CCA methods are used to integrate multi-omics data and aim to maximize the correlation between omics under constraints (Tenenhaus and Tenenhaus, 2011; Tenenhaus et al., 2014; Rodosthenous et al., 2020).
Then, MF techniques infer two matrices when applied to a single omic data: the first one describes the structure between variables (e.g., genes, probes, regions) and the second one describes the structure between samples.
One famous MF method is the Non-Negative Matrix Factorization (NMF, (Lee and Seung, 1999)). This method implements non-negativity constraints on the two inferred matrices. NMF provides a way to explain the structure of data by providing variable profiles (dictionary for each dimension). Besides, NMF enables a classification of the samples thanks to the second matrix. The NMF is a commonly applied method used for a single omic block to identify disease subtypes in gene expression data (Burstein et al., 2015) or recently, in DNA methylation data (Reilly et al., 2019).
More recently, extensions of MF have been developed to perform integrative analysis (Mo et al., 2013; Chalise et al., 2014; Chen and Zhang, 2018). MF extensions need to infer more than two matrices: one matrix for each omic block is computed and one matrix for samples.
Matrix factorization has proven to be a powerful technique to integrate heterogeneous data (Chauvel et al., 2019; Pierre-Jean et al., 2019; Cantini et al., 2020). In this article, we propose a Penalized Integrative Matrix Factorization method called PIntMF to discover new patterns and a new classification of a cohort. First, to add sparsity on the first inferred matrix (corresponding to the variable blocks), we use a common regularization technique: the Least Absolute Shrinkage and Selection Operator (LASSO (Tibshirani, 1996)). Then, sparsity, non-negativity, and equality constraints are added to the second matrix (corresponding to the samples) to improve the interpretability of the clustering. Besides, we propose criteria to choose the number of latent variables and to properly initialize the algorithm.
The performance of this new unsupervised model was evaluated on both simulated and real data. We applied PIntMF on a simulated framework introduced by our group in (Pierre-Jean et al., 2019) but also on a simulated framework from (Chung and Kang, 2019). We compared our method to several existing unsupervised methods that perform both variable selection and clustering: intNMF (Chalise and Fridley, 2017), SGCCA (Tenenhaus et al., 2014), MoCluster (Meng et al., 2015), CIMLR (Ramazzotti et al., 2018), and iClusterPlus (Mo and Shen, 2018). Then, we applied the model on a murine liver dataset (Williams et al., 2016) and glioblastoma cancer data from TCGA already used in (Shen et al., 2012).
2 Method
2.1 Model description
In the following, $\mathbf{A}$ denotes a matrix, $\mathbf{a}$ a vector, and $a$ a scalar. We consider $K$ matrices $\mathbf{X}_1, \dots, \mathbf{X}_K$ as the input of each method. Each matrix $\mathbf{X}_k$ is of size $n \times J_k$ ($n$ is the number of samples and $J_k$ the number of variables for block $k$). In this article, we propose a model based on the matrix factorization method, i.e.:
$$\mathbf{X}_k \approx \mathbf{W}\mathbf{H}_k \qquad (1)$$
where $\mathbf{W}$ denotes a common basis matrix and $\mathbf{H}_k$ a specific coefficient matrix associated with block $k$. $\mathbf{W}$ is of size $n \times P$ and $\mathbf{H}_k$ is of size $P \times J_k$. Therefore, $P$ is the number of latent variables in the model.
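As a minimal illustration of the dimensions involved in model 1, the following sketch builds one shared matrix W and one coefficient matrix per block, then reconstructs each block. The sizes, the random values, and the helper name `matmul` are assumptions chosen for the example; they are not taken from the PIntMF package.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

random.seed(0)
n, P = 6, 2      # toy numbers of samples and latent variables
J = [4, 3]       # J_k: number of variables in each of K = 2 blocks

# Shared non-negative sample matrix W (n x P)
W = [[random.random() for _ in range(P)] for _ in range(n)]
# One coefficient matrix H_k (P x J_k) per block
H = [[[random.gauss(0, 1) for _ in range(Jk)] for _ in range(P)] for Jk in J]
# Every block is reconstructed from the same W
X_hat = [matmul(W, Hk) for Hk in H]

shapes = [(len(Xk), len(Xk[0])) for Xk in X_hat]
print(shapes)  # [(6, 4), (6, 3)]
```

All blocks share the same number of rows n because W is common to every block; only the number of columns J_k changes.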
To ensure identifiability and improve interpretation of the model, non-negativity and sparsity constraints are imposed on $\mathbf{W}$ (as in the intNMF model described in (Chalise and Fridley, 2017)). $\mathbf{W}$ will be used to cluster samples simultaneously across the $K$ omics blocks. On $\mathbf{H}_k$, a sparsity constraint is imposed to perform variable selection simultaneously with the clustering of samples. Model 1 can be extended to the following optimization problem:
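$$\min_{\mathbf{W}, \mathbf{H}_1, \dots, \mathbf{H}_K} \sum_{k=1}^{K} \left( \|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2 + \lambda_k \|\mathbf{H}_k\|_1 \right) + \sum_{i=1}^{n} \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{W} \geq 0 \qquad (2)$$
where $\|\mathbf{H}_k\|_1 = \sum_{p=1}^{P} \sum_{j=1}^{J_k} |h^k_{pj}|$.
2.2 Solving equation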
The optimization problem 2 is not jointly convex in $\mathbf{W}, \mathbf{H}_1, \dots, \mathbf{H}_K$, but it is convex in each matrix separately. Consequently, it can be solved alternately on $\mathbf{W}, \mathbf{H}_1, \dots, \mathbf{H}_K$ until convergence.
2.2.1 Solve on W
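In this step, $\mathbf{H}_k$ is fixed and the problem 3 is solved on $\mathbf{W}$.
$$\min_{\mathbf{W}} \sum_{k=1}^{K} \|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2 + \sum_{i=1}^{n} \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{W} \geq 0 \qquad (3)$$
All individuals are independent for the weights $\mathbf{W}$ when the $\mathbf{H}_k$ are fixed. The problem for an individual $i$ can be written as follows:
$$\min_{\mathbf{w}_{i\bullet}} \sum_{k=1}^{K} \|\mathbf{x}^k_{i\bullet} - \mathbf{w}_{i\bullet}\mathbf{H}_k\|^2 + \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{w}_{i\bullet} \geq 0 \qquad (4)$$
Equation 4 is equivalent to
$$\min_{\mathbf{w}_{i\bullet}} \sum_{k=1}^{K} \sum_{j=1}^{J_k} \left( x^k_{ij} - \mathbf{w}_{i\bullet}\mathbf{h}^k_{\bullet j} \right)^2 + \mu_i \|\mathbf{w}_{i\bullet}\|_1 \quad \text{s.t. } \mathbf{w}_{i\bullet} \geq 0 \qquad (5)$$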
The optimization problem described by 5 is a classical lasso problem with a positivity constraint. It can be solved easily and quickly with the glmnet R package (Jerome et al., 2010).
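To make the structure of problem 5 concrete, here is a minimal sketch that solves it for one individual with a plain coordinate descent. This is an illustrative stand-in and not the glmnet solver used in the article; the function name `nonneg_lasso_row`, its arguments, and the fixed iteration count are assumptions. Note also that glmnet scales the squared loss differently, so the values of the penalty are not directly comparable.

```python
def nonneg_lasso_row(x_blocks, H_blocks, mu, n_iter=200):
    """Minimise sum_k ||x_k - w H_k||^2 + mu * sum(w) subject to w >= 0.

    x_blocks[k] is row i of X_k (length J_k); H_blocks[k] is a P x J_k matrix.
    Illustrative coordinate descent, not the glmnet implementation.
    """
    P = len(H_blocks[0])
    # Stack the K blocks: x has length sum_k J_k, cols[p] is latent profile p
    x = [v for xk in x_blocks for v in xk]
    cols = [[v for Hk in H_blocks for v in Hk[p]] for p in range(P)]
    sq_norm = [sum(v * v for v in c) for c in cols]
    w = [0.0] * P
    resid = x[:]  # current residual x - sum_p w[p] * cols[p]
    for _ in range(n_iter):
        for p in range(P):
            if sq_norm[p] == 0.0:
                continue
            # Inner product of profile p with the residual that excludes p
            rho = sum(c * (r + c * w[p]) for c, r in zip(cols[p], resid))
            # Closed-form one-dimensional minimiser, clipped at zero
            new_wp = max(0.0, (rho - mu / 2.0) / sq_norm[p])
            delta = new_wp - w[p]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(cols[p], resid)]
                w[p] = new_wp
    return w

# Toy example with P = 2 latent variables and K = 2 blocks
H_blocks = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
x_blocks = [[2.0, 0.0], [2.0, 2.0]]   # generated by w = (2, 0)
w_free = nonneg_lasso_row(x_blocks, H_blocks, mu=0.0)
w_pen = nonneg_lasso_row(x_blocks, H_blocks, mu=1.2)
```

Without penalty the generating weights are recovered; with a positive penalty the non-zero weight shrinks toward zero while the other stays at zero, which is the sparsity effect the constraint is meant to produce.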
2.2.2 Solve on Hk
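When $\mathbf{W}$ is fixed, each $\mathbf{H}_k$ can be solved independently. In this section, to be more readable, the index $k$ is removed from the equations.
$$\min_{\mathbf{H}} Q(\mathbf{H}) = \min_{\mathbf{H}} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2 + \lambda \sum_{p=1}^{P} \sum_{j=1}^{J} |h_{pj}| \qquad (6)$$
$$Q(\mathbf{H}) = \mathrm{Trace}\left( (\mathbf{X} - \mathbf{W}\mathbf{H})(\mathbf{X} - \mathbf{W}\mathbf{H})^T \right) + \lambda \sum_{p=1}^{P} \sum_{j=1}^{J} |h_{pj}| = \mathrm{vec}(\mathbf{X} - \mathbf{W}\mathbf{H})^T \mathrm{vec}(\mathbf{X} - \mathbf{W}\mathbf{H}) + \lambda \sum_{p=1}^{P} \sum_{j=1}^{J} |h_{pj}|$$
We denote $\mathbf{h} = \mathrm{vec}(\mathbf{H}) = (H_{11}, \dots, H_{P1}, \dots, H_{1J}, \dots, H_{PJ})^T$ and $\mathbf{x} = \mathrm{vec}(\mathbf{X}) = (X_{11}, \dots, X_{n1}, \dots, X_{1J}, \dots, X_{nJ})^T$.
$$Q(\mathbf{H}) = (\mathbf{x} - \mathrm{vec}(\mathbf{W}\mathbf{H}))^T (\mathbf{x} - \mathrm{vec}(\mathbf{W}\mathbf{H})) + \lambda \|\mathbf{h}\|_1$$
$$= (\mathbf{x} - (\mathbf{I}_J \otimes \mathbf{W})\mathrm{vec}(\mathbf{H}))^T (\mathbf{x} - (\mathbf{I}_J \otimes \mathbf{W})\mathrm{vec}(\mathbf{H})) + \lambda \|\mathbf{h}\|_1$$
$$= (\mathbf{x} - \tilde{\mathbf{W}}\mathbf{h})^T (\mathbf{x} - \tilde{\mathbf{W}}\mathbf{h}) + \lambda \|\mathbf{h}\|_1$$
where $\mathbf{I}_J$ is the identity matrix of size $J$ and $\tilde{\mathbf{W}} = \mathbf{I}_J \otimes \mathbf{W}$.
We can reformulate the problem as follows:
$$Q(\mathbf{H}) = \|\mathbf{x} - \tilde{\mathbf{W}}\mathbf{h}\|^2 + \lambda \|\mathbf{h}\|_1$$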
λ will be optimized for each block k = 1, . . . , K.
As for W, we used the glmnet package to solve this problem.
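The vectorization step used above relies on the identity vec(WH) = (I_J ⊗ W) vec(H), with vec stacking the columns. The following sketch checks it numerically on a small example; the helper names `vec` and `kron_identity` and the toy matrices are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def vec(m):
    """Stack the columns of m into a single list (column-major order)."""
    return [m[i][j] for j in range(len(m[0])) for i in range(len(m))]

def kron_identity(W, J):
    """Return I_J (x) W as a dense (n*J) x (P*J) matrix: J copies of W on the diagonal."""
    n, P = len(W), len(W[0])
    out = [[0.0] * (P * J) for _ in range(n * J)]
    for j in range(J):
        for i in range(n):
            for p in range(P):
                out[j * n + i][j * P + p] = W[i][p]
    return out

W = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]   # n = 3, P = 2
H = [[1.0, 0.0, 2.0], [4.0, 1.0, 0.0]]     # P = 2, J = 3

lhs = vec(matmul(W, H))
W_tilde = kron_identity(W, J=3)
h = vec(H)
rhs = [sum(row[c] * h[c] for c in range(len(h))) for row in W_tilde]
print(lhs == rhs)  # True
```

Because the Kronecker product with the identity is block diagonal, the squared-error term separates over the columns of H, so each variable j can in principle be fitted on its own as a regression on W.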
2.2.3 Normalization
We would like to consider $\mathbf{W}$ as a weight matrix. To avoid problems of convergence or non-identifiability, the normalization by the sum of weights for each row of $\mathbf{W}$ is added after computing the matrix, i.e. each row is divided by its sum after each step:
$$\mathbf{w}_{i\bullet} = \frac{\mathbf{w}_{i\bullet}}{\sum_{p=1}^{P} w_{ip}} \qquad (7)$$
2.3 Stopping criteria
The stopping criterion of the model is determined by the convergence of the matrix $\mathbf{W}$. When the similarity between the matrices $\mathbf{W}$ of two consecutive iterations becomes stable, the model is considered to have converged and the algorithm stops. The similarity between $\mathbf{W}^{t-1}$ and $\mathbf{W}^{t}$ is measured with the Adjusted Rand Index (ARI, see Section 3.1). Users can also define a maximum number of iterations.
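The row normalization of Equation 7 and an ARI-based convergence check can be sketched as follows. The text does not specify how clusterings are derived from W at each iteration, so assigning each sample to its largest weight (`hard_labels`) and stopping when the ARI reaches a threshold are assumptions made only for this illustration, as are all function names.

```python
from collections import Counter
from math import comb

def normalize_rows(W):
    """Divide each row of W by its sum (Eq. 7); all-zero rows are left unchanged."""
    out = []
    for row in W:
        s = sum(row)
        out.append([v / s for v in row] if s > 0 else row[:])
    return out

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two label lists of equal length."""
    index = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case, e.g. a single cluster in both
        return 1.0
    return (index - expected) / (max_index - expected)

def hard_labels(W):
    """Assumed rule: each sample goes to the latent variable with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]

def has_converged(W_prev, W_curr, threshold=1.0):
    """Assumed rule: stop when the two hard clusterings reach the ARI threshold."""
    return adjusted_rand_index(hard_labels(W_prev), hard_labels(W_curr)) >= threshold

W_prev = normalize_rows([[2.0, 1.0], [1.0, 3.0], [4.0, 0.0]])
W_curr = normalize_rows([[3.0, 1.0], [1.0, 1.5], [5.0, 0.5]])
W_other = normalize_rows([[1.0, 3.0], [3.0, 1.0], [4.0, 0.0]])
stable = has_converged(W_prev, W_curr)
changed = has_converged(W_prev, W_other)
```

In practice such a check would be combined with a cap on the number of iterations so that the loop always terminates.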
2.4 Automatic tuning of sparsity parameters
For each block $\mathbf{X}_k$, we need to calibrate the sparsity parameters $\lambda_k$ and $\mu_i$. The main advantage of the glmnet package is its speed (see Supplementary Materials Fig. S9). In addition, glmnet implements a cross-validation technique to choose the best $\lambda$ or $\mu$. PIntMF takes advantage of glmnet to calibrate the penalty on each block. Therefore, the only parameter that the user needs to tune is the number of latent variables $P$.
2.5 Clustering
In this article, all clusterings are obtained by applying a hierarchical clustering with the Ward distance (Ward Jr, 1963) to the matrix W. The number of clusters is set to P.
2.6 Criteria to choose the best model
In this section, we present three different criteria to choose the appropriate number of latent variables (P ).
2.6.1 Mean square error
The number of latent variables can be optimized by looking at the curve of the Mean Square Error (MSE). In this context, the MSE for each dataset $k$ is defined by:
$$\mathrm{MSE}^P_k = \frac{\|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2}{n \times J_k} \qquad (8)$$
The total MSE is then defined by averaging the different $\mathrm{MSE}^P_k$:
$$\mathrm{MSE}^P = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}^P_k \qquad (9)$$
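Equations 8 and 9 translate directly into code. The sketch below computes the per-block and total MSE on a toy factorization; the function names and the example matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block_mse(X, W, H):
    """Squared Frobenius reconstruction error of one block divided by n * J_k (Eq. 8)."""
    n, J = len(X), len(X[0])
    X_hat = matmul(W, H)
    sse = sum((X[i][j] - X_hat[i][j]) ** 2 for i in range(n) for j in range(J))
    return sse / (n * J)

def total_mse(X_blocks, W, H_blocks):
    """Average of the per-block MSE values over the K blocks (Eq. 9)."""
    per_block = [block_mse(X, W, H) for X, H in zip(X_blocks, H_blocks)]
    return sum(per_block) / len(per_block)

W = [[1.0, 0.0], [0.0, 1.0]]
H_blocks = [[[1.0, 2.0], [3.0, 4.0]], [[1.0], [1.0]]]
X_blocks = [[[1.0, 2.0], [3.0, 5.0]], [[1.0], [1.0]]]  # one entry of block 1 is off by 1

mse_1 = block_mse(X_blocks[0], W, H_blocks[0])
mse_all = total_mse(X_blocks, W, H_blocks)
print(mse_1, mse_all)  # 0.25 0.125
```

Dividing by n × J_k before averaging gives every block the same weight in the total, whatever its number of variables.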
2.6.2 Percentage of variation explained (PVE)
To measure the performance of the method, we computed the Percentage of Variation Explained (Nowak et al., 2011) defined by the following formula:
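$$\mathrm{PVE}(\mathbf{W}, \mathbf{H}_k) = 1 - \frac{\|\mathbf{X}_k - \mathbf{W}\mathbf{H}_k\|_F^2}{\|\mathbf{X}_k - \bar{\mathbf{X}}_k \mathbf{1}_{J_k}\|_F^2} \qquad (10)$$
where $\bar{\mathbf{X}}_k$ is a vector containing the average profile of each individual: $\bar{X}^i_k = \frac{\sum_{j} x^k_{ij}}{J_k}$, and $\mathbf{1}_{J_k} = (1, \dots, 1)$ is a row-vector of size $J_k$.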
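A minimal sketch of the per-block PVE of Equation 10 follows. The function name `pve` and the toy data are assumptions; the sketch also assumes that the block is not constant within every row, otherwise the denominator is zero.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def pve(X, W, H):
    """1 - residual sum of squares / sum of squares around each individual's mean (Eq. 10)."""
    n, J = len(X), len(X[0])
    X_hat = matmul(W, H)
    rss = sum((X[i][j] - X_hat[i][j]) ** 2 for i in range(n) for j in range(J))
    row_means = [sum(row) / J for row in X]   # average profile of each individual
    tss = sum((X[i][j] - row_means[i]) ** 2 for i in range(n) for j in range(J))
    return 1.0 - rss / tss

X = [[1.0, 3.0], [2.0, 6.0]]
W = [[1.0], [2.0]]
pve_exact = pve(X, W, [[1.0, 3.0]])    # W H reproduces X exactly
pve_partial = pve(X, W, [[1.0, 2.0]])  # second column is underestimated
print(pve_exact, pve_partial)  # 1.0 0.5
```

Averaging this quantity over the K blocks gives a single global value that can be compared across candidate numbers of latent variables.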
Then, we computed the global PVE as the mean of the PVE on the $K$ blocks, i.e.:
$$\mathrm{PVE} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{PVE}(\mathbf{W}, \mathbf{H}_k) \qquad (11)$$
2.6.3 Cophenetic distance
We were inspired by (Gaujoux and Seoighe, 2010) for the last criterion.
We want to assess if the distances in the tree (after hierarchical clustering on W) reflect the original distances accurately.
One way is to compute the correlation between the cophenetic distances and the original distance data generated by the dist() function on W (Sokal and Rohlf, 1962). The clustering is valid, if the correlation between the two quantities is high. Note that we use the cophenetic function defined by (Sneath et al., 1973).
The cophenetic correlation usually decreases with the increase of P values. Brunet et al. (2004) suggested choosing the smallest value of P for which this coefficient starts decreasing.
3 Performance criteria
Two criteria are used to assess the performance of our method and to compare it with others.
3.1 Adjusted Rand Index (ARI)
On a simulated dataset and on well-known real datasets, it is possible to compute the similarity between the true and the inferred classifications. We use the Adjusted Rand Index as a criterion to evaluate the performance of our method. The Adjusted Rand Index (Rand, 1971) is equal to one when the two classifications being compared are identical, and close to zero or even negative when they are completely different.
3.2 Area under the ROC curve (AUROC)
On a simulated dataset, the variables that drive the subgroups are known, so false-positive and true-positive rates are easy to compute. First, variables are ordered by their standard deviation (from the highest to the lowest) computed on the H matrix; this highlights the variables with the largest differences between the P components, which therefore contribute the most to the clusters. To summarize the information of these two rates, we compute the area under the TPR-FPR curve (AUROC). An AUROC equal to one means that the method selects the variables with no error. An AUROC under 0.50 means that false-positive variables are selected before the true-positive ones.
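The ranking and the AUROC described above can be sketched as follows. The AUROC is computed here through its rank interpretation (the proportion of relevant/irrelevant pairs in which the relevant variable has the higher score, ties counting one half), which equals the area under the TPR-FPR curve. The function names, the use of the population standard deviation, and the toy matrix are assumptions for illustration.

```python
from statistics import pstdev

def variable_scores(H):
    """Standard deviation of each column of H (P x J) across the P components."""
    P, J = len(H), len(H[0])
    return [pstdev([H[p][j] for p in range(P)]) for j in range(J)]

def auroc(scores, is_relevant):
    """Probability that a relevant variable outranks an irrelevant one (ties count 0.5)."""
    pos = [s for s, r in zip(scores, is_relevant) if r]
    neg = [s for s, r in zip(scores, is_relevant) if not r]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Toy H with P = 2 components and J = 4 variables
H = [[0.0, 1.0, 0.0, 5.0],
     [0.0, 1.0, 4.0, 0.0]]
scores = variable_scores(H)                            # [0.0, 0.0, 2.0, 2.5]
perfect = auroc(scores, [False, False, True, True])    # relevant variables rank first
poor = auroc(scores, [True, False, True, False])
print(perfect, poor)  # 1.0 0.375
```

Variables whose coefficients are identical across components get a score of zero, which is why they end up at the bottom of the ranking.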
4 Results
4.1 Optimization of the algorithm
4.1.1 Initialization
Often in NMF algorithms (Lee and Seung, 1999), the matrices are initialized by non-negative random values. We assess four kinds of initialization for PIntMF (hierarchical clustering, random, Similarity Network Fusion (SNF), and Singular Value Decomposition).
The best initialization is based on the SNF algorithm (Wang et al., 2014) (Fig. S1). This initialization has the advantage of taking into account the K blocks of the analysis simultaneously.
Therefore, for all the following analyses, SNF initialization was used.
4.1.2 Computing optimization of H
Several algorithms to solve the Lasso problem on $\mathbf{H}_k$ were tested. glmnet is the fastest package among them (Supplementary Materials Fig. S9).
4.2 Performance on simulated datasets
We assess the performance of PIntMF in two simulated frameworks described below.
4.2.1 Simulations on independent datasets (non-correlated blocks)
The ability of PIntMF to cluster samples and to select relevant variables was evaluated on the simulated data described in (Pierre-Jean et al., 2019). The framework of these simulations is composed of three blocks with three different types of distribution (Binary, Beta-like, and Gaussian) to simulate the heterogeneity of integrative omics studies. Indeed, a binary distribution could match a mutation (equal to 1 if the gene is mutated and 0 otherwise), a Beta-like distribution could match DNA methylation data, and a Gaussian distribution could match gene expression values.
Four unbalanced groups (composed of 25, 20, 5, and 10 individuals) have been simulated (Benchmarks 1 to 5). Datasets with 2, 3, and 4 balanced groups have also been simulated (Benchmarks 6 to 8). Each benchmark is simulated 50 times.
PIntMF was compared to several integrative unsupervised methods (Pierre-Jean et al., 2019) that perform both clustering and variable selection, namely: intNMF (Chalise et al., 2014), SGCCA (Tenenhaus et al., 2014), MoCluster (Meng et al., 2015), iClusterPlus (Mo et al., 2013), and CIMLR (Ramazzotti et al., 2018).
On the eight simulated benchmarks with various levels of signal to noise ratio, PIntMF and MoCluster outperform the other methods with an ARI equal to 1 in most cases (Fig. 1).
[Figure 1: one panel per method (iClusterPlus, CIMLR, SGCCA, MoCluster, PIntMF, intNMF); x-axis: Benchmark 1 to Benchmark 8; y-axis: ARI, from 0.00 to 1.00.]
Figure 1: Adjusted Rand Index of PIntMF, intNMF, SGCCA, MoCluster, iClusterPlus, and CIMLR methods on simulated datasets. B1: Reference, B2: More Gaussian noise, B3: More Gaussian noise and more Binary noise, B4: More Beta noise and more Binary noise, B5: More relevant variables, B6: 2 balanced groups, B7: 3 balanced groups, B8: 4 balanced groups
The performance of variable selection is assessed using the area under ROC curves (AUROC) after computing False Positive Rates (FPR) and True Positive Rates (TPR) (see section 3.2). The computation of the AUROC shows that PIntMF performs as well as MoCluster on the three types of data (Table S1 in Supplementary Materials). Indeed, PIntMF reaches either the first or the second-best AUROC for these simulations. Besides, the lowest AUROC is equal to 0.88 which means that the method is both sensitive and specific.
4.2.2 Simulation based on real data (correlated blocks)
Because the previous framework does not simulate any correlation between omics blocks, we also evaluate the performance of PIntMF on a simulated framework based on real cancer data and developed by (Chung and Kang, 2019).
OmicsSIMLA is a simulation tool for generating multi-omics data with disease status. This tool simulates CpGs with methylation proportions, RNA-seq read counts and normalized protein expression levels. Here, we simulated 50 datasets containing 50 cases (i.e., short-term survival) and 50 controls (i.e. long-term survival), and three omics blocks (RNAseq, DNA methylation, and proteins). We try to recover the two groups but also the different features that drive overall survival by using DNA methylation, expression, and protein data. For two of the three blocks (expression and DNA methylation), the variables differentially expressed or methylated between the two groups are known.
The simulated data are described in Supplementary Materials (Section 5).
In these simulations, we also compare the performance of PIntMF to other methods in terms of clustering and variable selection. First, CIMLR does not give any results on these simulations (the algorithm does not converge). For all the other methods, the ARI is equal to 1 (maximum value) for all 50 datasets.
Then, we compare the variable selection performance of PIntMF, intNMF, iClusterPlus, MoCluster, and SGCCA by computing the AUROC on the expression and DNA methylation blocks only (the protein block does not contain any variable simulated with differential abundance; more details are given in Supplementary Materials, Section 5). DNA methylation dataset: PIntMF and iClusterPlus outperform the other methods; their performances are close, but the AUROC of iClusterPlus is significantly higher. The AUROC of PIntMF is in turn significantly higher than those of MoCluster, SGCCA, and intNMF (Fig. 2).
Expression dataset: PIntMF is the best method with an AUROC significantly higher than the others. However, all methods achieve an AUROC higher than 0.92. (Fig. 2)
On these simulations, PIntMF gives similar results to iClusterPlus, but with automatic tuning of parameters. Besides, the algorithm of PIntMF is faster than iClusterPlus.
[Figure 2: AUC per method (MoCluster, SGCCA, intNMF, iCluster, PIntMF), annotated with p-values; panel (a) Methylation, AUC axis from 0.7 to 1.1; panel (b) Expression, AUC axis from 0.92 to 1.00.]
Figure 2: AUROC of PIntMF, MoCluster, SGCCA, iClusterPlus and intNMF for OmicsSIMLA simulations on (a) DNA methylation and (b) Gene expression blocks
4.2.3 Stability selection
A jackknife procedure was performed to evaluate the stability of variable selection. To do so, we run the PIntMF model on the data with one sample left out at each step. We thus obtain n datasets containing n − 1 individuals, on which we apply the method.
The stability of the selected variables for the Binary, Gaussian, methylation, and expression datasets seems to be strong (Fig. S10 in Supplementary Materials). For the protein and Beta-like data, the jackknife reveals that some selected variables are not stable. The jackknife method could be used to remove false-positive variables.
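The leave-one-out procedure can be sketched generically as below: the selection step is run n times, each time without one sample, and the frequency with which each variable is selected is recorded. The selector used here, `top_variance_selector`, is a hypothetical stand-in that keeps the most variable columns; it only makes the example runnable and does not reproduce the PIntMF selection.

```python
from statistics import pvariance

def jackknife_selection_frequency(X, select):
    """Run `select` on each leave-one-out subset of the rows of X.

    Returns, for every variable selected at least once, the fraction of the
    n leave-one-out runs in which it was selected.
    """
    n = len(X)
    counts = {}
    for left_out in range(n):
        subset = X[:left_out] + X[left_out + 1:]   # n - 1 individuals
        for j in select(subset):
            counts[j] = counts.get(j, 0) + 1
    return {j: c / n for j, c in counts.items()}

def top_variance_selector(rows, n_keep=2):
    """Hypothetical selector: indices of the n_keep columns with the largest variance."""
    J = len(rows[0])
    var = [pvariance([r[j] for r in rows]) for j in range(J)]
    return set(sorted(range(J), key=lambda j: var[j], reverse=True)[:n_keep])

# 4 samples, 3 variables; the third variable is constant
X = [[0.0, 1.0, 5.0],
     [10.0, 2.0, 5.0],
     [0.0, 1.0, 5.0],
     [10.0, 2.0, 5.0]]
freq = jackknife_selection_frequency(X, top_variance_selector)
print(freq)  # {0: 1.0, 1: 1.0}
```

Variables with a frequency well below one are the unstable ones; a threshold on this frequency is one possible way to filter likely false positives.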
4.2.4 Summary
To summarize this simulation part (see Table 1), our method PIntMF provides satisfying clustering and variable selection both on correlated blocks (Simulation Framework 2) and on non-correlated blocks (Simulation Framework 1). PIntMF is the only method that performs well on all simulated settings.
We conclude on these two frameworks of simulated data that PIntMF is a fast and flexible tool.
Method        Clustering  Variable selection  Automatic tuning  Parameters left to tune
iClusterPlus  +           ++                  -                 >2
intNMF        +++         -                   +++               1
SGCCA         ++          ++                  -                 >5
MoCluster     +++         +++                 +                 >2
CIMLR         +           ++                  +++               1
PIntMF        +++         +++                 +++               1
Table 1: Summary of the performance of PIntMF compared to other methods
4.3 Applications
In this section, we assess the performance of the PIntMF method on real data by considering two applications. The first one is a dataset from murine liver (Williams et al., 2016) under two different diets already used in two previous comparison articles (Pierre-Jean et al., 2019; Tini et al., 2017), and the objective is to recover the diets of the mice (fat diet or chow diet). The second one is a glioblastoma dataset from TCGA used in (Shen et al., 2012) and the goal is to find the tumor subtypes.
4.3.1 PIntMF highlights variables linked to phenotypes of samples
We analyzed the BXD cohort (composed of 64 samples) (Williams et al., 2016); the mice were split into two different environmental conditions of diet: chow diet (CD) (6% kcal of fat) or high-fat diet (HFD) (60% kcal of fat). Measurements have been made in the livers of the entire population at the transcriptome, the proteome, and the metabolome levels.
Therefore, we applied PIntMF to this dataset as well as intNMF, MoCluster, SGCCA, iClusterPlus, and CIMLR (Supplementary Materials Table S2).
PIntMF produces a perfect classification of the individuals for this real dataset.
For this dataset, all criteria for the model selection were computed (Supplementary Material Fig. S6), and 2 groups were selected for further analysis.
PIntMF highlights interesting variables that seem to have different abundance between the two groups CD and HFD (Fig. 3): Vitamin E (C29H50O2), Cholesteryl (C36H62O5), and Mustard Oil (C4H5NS). The Saa2 gene, which codes for a protein involved in the HDL complex, seems to be differentially expressed between the two groups. Then, the Cidea gene, which is involved in the metabolism of lipids and lipoproteins, has a slightly different level of expression between the two groups. Finally, Cyp2b9, which oxidizes steroids, fatty acids, and xenobiotics, is less expressed in the high-fat diet group. To conclude, PIntMF succeeds in recovering the classification and relevant markers in all datasets.
4.3.2 PIntMF reveals a new classification of non-annotated samples on TCGA dataset
[Figure 3: heatmaps of the top 10 selected variables per block, with the PIntMF clustering (1, 2) and the true groups (CD, HFD). Metabolites: C37H70O5, C39H72O5, C4H5NS, C9H15N3O2S, C26H46O6, C29H50O2, C33H64O5, C36H62O5, C24H35O2, C39H60O5. Proteins: Srd5a-1, R75370, Ccdc44, Lpsb2, Pcn, MTP, galectin-1, GstpiB, Gst3, Ces6. RNA: Cd36, Lcn2, Saa2, Cyp2b9, 9030619P08Rik, Cyp2a22, Cidea, D630002G06Rik, AB056442, Cyp2b13.]
Figure 3: BXD cohort results: Top 10 selected variables with PIntMF of each dataset (Metabolites, Proteins and RNA), the clustering given by PIntMF and the true clustering are on the right.
Secondly, we analyze a subset of the glioblastoma dataset from The Cancer Genome Atlas (TCGA): the Glioblastoma study (2009) used in (Shen et al., 2012). The dataset contains three matrices: copy number variation (1599 regions), DNA methylation (1515 CpG), and mRNA expression (1740 genes) in 55 samples. GBM samples were classified into four subtypes (Classical: CL, Mesenchymal: MES, Neural: NL, and Proneural: PN). Besides, there are samples with no subtype (NA). Using the PIntMF method, we highlight samples with no classification that are close to labeled samples. Looking at the three criteria, the best number of latent variables seems to be 5 (Supplementary Materials Fig. S7). For example, the green cluster from PIntMF matches a part of the CL subtype, and one sample labeled as NA is in this green cluster. Then, the purple cluster from PIntMF matches the PN subtype, and one sample labeled as NA can be classified with the PN subtype (Figure 4a). Clusters 1 (red) and 2 (blue) are more heterogeneous. However, the red one is composed of NL and NA labeled samples. The blue one is close to samples labeled as PN.
We performed a survival analysis to identify a relation between the groups found by PIntMF and the survival rate (Figure 4b). The log-rank test is significant at the 5% level (p-value = 0.00013). The prognosis for the purple group (4) is better than that of the red and green groups (1 and 3), and better still than that of the blue and orange groups (2 and 5). Note that the PN subtype is split into two groups (purple and blue) that have two very different survival curves.
The previous study (Shen et al., 2012), performed with the iCluster method (Shen et al., 2009), identified 3 subgroups with a less significant p-value (0.01) than PIntMF for the survival differences between subgroups. Their Cluster 1 matches the PN group, Cluster 2 matches the CL group, and Cluster 3 is mostly composed of the MES subtype. The authors do not give any information about the samples with no subtype.
H matrices exhibit various types of genomic profiles according to the clusters (Figure 4). For instance, the orange cluster (5) shows few alterations at the copy number variation level (Fig. 4c) but a particular profile for DNA methylation and gene expression data (Fig. 4e). The blue cluster (2) has a distinct pattern of expression (Fig. 4d).
5 Discussion
We presented a new model to discover new subgroups of a cohort and potential new markers from several types of omic data. PIntMF is a matrix factorization model with positivity and sparsity constraints (Lasso) on inferred matrices. The method and all scripts of this article are available in an R package named PIntMF.
The main advantage of this method is the automatic tuning of the lasso penalties for both variable and sample matrices. To optimize the computation time of the algorithm (Supplementary Materials Fig. S9), we tried several algorithms to infer the matrices $\mathbf{H}_k$. glmnet is very fast compared to the others (ncvreg, quadrupen, and biglasso); it was therefore retained for all analyses. We also optimized the initialization of the algorithm, which is obtained by using the SNF algorithm (Wang et al., 2014). This initialization provides, at the end of the algorithm, the best clustering and the best
1 2 3 4 5 0001 0002 0006 0027 0054 0057 0085 0089 0107 0113 0115 0133 0003 0007 0034 0037 0038 0046 0047 0052 0071 0074 0099 0143 0009 0021 0043 0083 0102 0125 0126 0137 0145 0148 0010 0011 0014 0024 0028 0058 0060 0069 0080 0114 0128 0129 0033 0055 0064 0075 0086 0122 0130 0139 0147 Types PIntMF W values 0 0.5 1 Types CL MES NA NL PN
(a) Heatmap plot of W: Homogeneity between subtypes and subgroups identified by PIntMF
+ + p = 0.00013 0.00 0.25 0.50 0.75 1.00 0 25 50 75 100 Time (months) Sur viv al probability
Strata + clust=1 + clust=2 + clust=3 + clust=4 + clust=5
(b) Kaplan-Meier plot: The subgroups identified by PIntMF show survival differences
3 5 4 2 1 EGFR SEC61G TSP
AN12 ING3FAM3C MEST CALU
KIAA0828
FLNC HIG2
MRPS33
CAV2TFECTESGPR37 CALD1CADPS2 TBXAS1GNAI1 SAMD9ACN9CAV1AZGP1 PILRBAASS ARPC1BSERPINE1PLOD3 SFRP4TSP AN13 ITGB8GPNMBAOAHAHR THSD7ATMEM106B SCIN PTNTRIM24 DENND2A HIPK2 TMEM140 DGKI LAMB1 NRCAMSYPL1LRRC17 IFRD1
PRKAR2BPBEF1 DUS4L LRRN3DNAJB9SGCESEMA3AABCB1PPP1R9ADBF4COL1A2 GNG11PEG10 CR
OT SEMA3C DYNC1I1TRIP6 PCOLCE ELN STEAP1PDGF A GBASHO
XA7CHN2 SNX10CPVLNFE2L3DPY19L1ARL4AIL6BZW2
OSBPL3 MPP6 ANKMY2 DGKB KLHL7 RAPGEF5 ETV1MEO X2 IGF2BP3
GLI3AMPH FGL2FLJ13195STYXL1 HSPB1HIP1TPST1 AEBP1UPP1 TNS3STK17A RAMP3 IGFBP3 PGAM2 GRB10PSPHGIMAP5 GIMAP4 NCAPG2EZH2
CLEC5ARARRES2 CNTNAP2
ZYX
ZNF43CACNA1ACD97 TPM4GDF15BST2 IFI30PLVAP
RNASEH2A
ICAM3 DNMT1MAN2B1ASF1BLDLRICAM1 JUNBDNAJB1 ZNF536AXLZFP36DLL3BLVRBGMFG ITPKC SIRT2
ZNF573 GPIHAMP GADD45BMY O1FCNN2ZNF177ANGPTL4HNRPMC3 M6PRBP1 TPX2 HCKID1 PI3
MYBL2SLPISDC4 MAFB IFT52GDAP1L1TMEP
AI BMP7AURKA EYA2CTSA PLTP UBE2CB4GAL T5 SP AG4 PROCRZNF217DOK5 CTSZKCNQ2 GINS1CST3 C20orf39 CD93 SLC24A3NKX2−2JAG1 C20orf23 RASSF2 PCNA C20orf42PLCB4C20orf103SNAP25FLR T3 PLCB1 PAFAH1B3 PLA UR
ZNF228PEG3 APLP1FXYD1FXYD5COX7A1
TYR
OBPSMO
X
Figure 4: (a) Heatmap of W. The clustering of PIntMF was compared to glioblastoma subtypes. (b) Survival curves with p-value of log-rank test. (c, d, e) H matrices for the three considered omics blocks on the glioblastoma dataset: (c) copy number variation, (d) gene expression, (e) DNA methylation.
percentage of explained variation (Fig. S1). Besides, this initialization is performed at the integrative level rather than separately on each block of data.
PIntMF automatically tunes the penalties on the matrices H_k and W, without any intervention from the user, and we noticed that all the matrices are quite sparse on real datasets (Figure 4). The user needs to choose only one parameter, the number of latent variables. This parameter can be chosen by looking at the MSE, the cophenetic coefficient, and the PVE (Supplementary Materials Fig. S2 to S6). All these criteria are implemented in the R package. For non-correlated data simulations, only the cophenetic coefficient and the PVE allow the correct number of latent variables to be chosen properly.
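To make two of these criteria concrete, the following is a minimal Python sketch (not the implementation of the R package) that computes an MSE and a PVE for a given factorization X_k ≈ W H_k. The function names are hypothetical, and the PVE is assumed here to be one minus the ratio of the residual sum of squares to the total sum of squares of the (centered) blocks; the exact definition used in the package may differ.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def squared_error(x, x_hat):
    """Sum of squared entry-wise differences between two matrices."""
    return sum((u - v) ** 2 for ru, rv in zip(x, x_hat) for u, v in zip(ru, rv))


def mse_and_pve(blocks, w, hs):
    """Return (MSE, PVE) over all blocks.

    blocks: list of n x J_k matrices (assumed centered),
    w:      n x P matrix shared by all blocks,
    hs:     list of P x J_k matrices, one per block.
    PVE is taken as 1 - RSS / TSS (an assumption of this sketch).
    """
    rss = 0.0
    tss = 0.0
    n_values = 0
    for x, h in zip(blocks, hs):
        rss += squared_error(x, matmul(w, h))
        tss += sum(v ** 2 for row in x for v in row)
        n_values += sum(len(row) for row in x)
    return rss / n_values, 1.0 - rss / tss
```

In practice one would fit the model for several candidate numbers of latent variables and look for the value beyond which the MSE stops decreasing and the PVE stops increasing noticeably.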
It is still difficult to evaluate the performance of an integrative method on simulations (Cantini et al., 2020). The relationships between blocks of omics are complex, often not well known, and the modeling of these links is not easy. To our knowledge, there is no reference dataset to assess performance. Therefore, we evaluated the algorithm on two different simulation frameworks (completely simulated and based on real data) and two real datasets. Besides, we compared it with several other state-of-the-art integrative methods. We demonstrated, on the first simulated dataset (non-correlated blocks), that PIntMF outperforms the other methods on both clustering and variable selection. Indeed, on simulated data, the clustering from PIntMF makes few classification errors. We also highlighted that PIntMF is more robust to heterogeneous data than the other methods: for variable selection, it performs as well on Gaussian distributions as on binary or beta distributions. On another simulation framework based on real data (correlated blocks), we observed good performance at both the clustering level (perfect classification) and the variable selection level (AUROC above 90%). With applications on two real datasets (BXD and TCGA data, section 4.3), we demonstrated that the method can deal with real datasets. Besides, the application on the two real datasets yields original subgroups as well as interesting variables linked to the clinical phenotypes (diet and overall survival).
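For readers unfamiliar with the AUROC as a variable selection criterion, a minimal Python sketch is given below. It assumes that each variable receives a score (for instance the absolute value of its inferred coefficient) and a binary label saying whether it is truly relevant in the simulation; the function name and the scoring convention are illustrative assumptions, not the evaluation code used for the results above.

```python
def auroc(scores, labels):
    """Rank-based AUROC: probability that a relevant variable (label 1)
    gets a higher score than an irrelevant one (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant variable")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: absolute coefficients of five variables and their true status.
coefficients = [0.9, 0.0, 0.4, 0.0, 0.1]
truly_relevant = [1, 0, 1, 0, 0]
selection_auroc = auroc(coefficients, truly_relevant)
```

The pairwise loop is quadratic in the number of variables, which is acceptable for a sketch; a sort-based version would be preferable for blocks with many thousands of variables.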
A weakness of the model is that the convergence of the algorithm to an optimal solution is not mathematically justified. Besides, a significance test for the variable selection is not given due to the use of the LASSO regression (Jain and Xu, 2021). Jackknife could provide an idea of the confidence in the selected variables (Supplementary Materials Fig. S10). However, this type of approach is very time-consuming when datasets are large.
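The jackknife idea can be sketched as follows in Python: each individual is left out in turn, the selection is recomputed, and the frequency with which each variable is selected is reported. The `toy_select` function below is a hypothetical stand-in for refitting the model (here a simple threshold on the mean absolute value), used only so that the example is self-contained.

```python
def jackknife_selection_frequency(samples, select):
    """Leave each sample out once and report how often each variable is selected.

    samples: list of rows (one per individual).
    select:  callable taking a list of rows and returning the set of selected
             column indices (stand-in for refitting the model).
    """
    n_vars = len(samples[0])
    counts = [0] * n_vars
    for left_out in range(len(samples)):
        subset = samples[:left_out] + samples[left_out + 1:]
        for j in select(subset):
            counts[j] += 1
    return [c / len(samples) for c in counts]


def toy_select(rows, threshold=0.5):
    """Toy selector (assumption): keep columns whose mean absolute value
    exceeds the threshold."""
    n = len(rows)
    return {j for j in range(len(rows[0]))
            if sum(abs(r[j]) for r in rows) / n > threshold}


data = [[1.0, 0.0, 0.6], [1.2, 0.1, 0.2], [0.9, 0.0, 0.9], [1.1, 0.0, 0.3]]
frequencies = jackknife_selection_frequency(data, toy_select)
```

A frequency close to 1 indicates a variable whose selection does not depend on any single individual. The cost is one full refit per individual, which is what makes the approach expensive on large datasets.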
Another possible improvement of the method is the handling of missing values. Values may be missing inside a block for a few variables. These missing values could be imputed by the average of other correlated variables, by the values of the nearest neighbor, or by more complex methods such as those proposed by Voillet et al. (2016), González et al. (2009), and Husson and Josse (2013). Commonly, a whole block can also be missing for an individual. In this case, the row of the matrix W for this individual could be computed only on the blocks that are present. Thanks to the W matrix, we could then deduce a new profile for this patient in the missing block from the corresponding H_k matrix inferred with the other individuals.
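The missing-block idea can be illustrated with a minimal Python sketch. It assumes that the weights of an individual are non-negative and sum to one, estimates them by projected gradient descent on the observed blocks only, and then predicts the missing block as the weighted combination of the rows of its H_k. The projected gradient solver and all function names are assumptions made for illustration; this is not the update used inside PIntMF.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # the last index satisfying the condition gives the shift
    return [max(x - theta, 0.0) for x in v]


def fit_weights(x_present, h_present, n_iter=500):
    """Estimate the weights of one individual from observed variables only.

    x_present: observed values, concatenated over the present blocks (length J).
    h_present: P rows of length J, the matching columns of the H_k matrices.
    Minimizes 0.5 * ||w H - x||^2 over the simplex by projected gradient.
    """
    p = len(h_present)
    w = [1.0 / p] * p
    # Step size from a crude Lipschitz bound (squared Frobenius norm of H).
    lipschitz = sum(v * v for row in h_present for v in row) or 1.0
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        fitted = [sum(w[k] * h_present[k][j] for k in range(p))
                  for j in range(len(x_present))]
        resid = [f - x for f, x in zip(fitted, x_present)]
        grad = [sum(r * hk for r, hk in zip(resid, h_present[k])) for k in range(p)]
        w = project_to_simplex([wk - step * g for wk, g in zip(w, grad)])
    return w


def predict_block(w, h_missing):
    """Predicted profile in the missing block: w multiplied by its H_k."""
    return [sum(wk * row[j] for wk, row in zip(w, h_missing))
            for j in range(len(h_missing[0]))]
```

For example, `fit_weights(x_obs, h_obs)` followed by `predict_block(w, h_missing)` returns an imputed profile for the block that was not measured for this individual.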
We could also extend PIntMF by including prior information such as the genome structure. For instance, we could force the algorithm to select the same genes in the DNA methylation block and the expression block. A group Lasso penalty (Simon et al., 2013) could be added to the proposed model to include such a prior.
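To illustrate how a group penalty ties variables together, here is a minimal Python sketch of the group soft-thresholding operator, that is, the proximal operator of lam * ||v||_2 applied to one group of coefficients. Treating the methylation and expression coefficients of the same gene as one group is an assumption of this example, and the operator shown corresponds to the plain group penalty rather than the full sparse-group penalty of Simon et al. (2013).

```python
import math


def group_soft_threshold(group, lam):
    """Shrink a whole group of coefficients jointly.

    If the Euclidean norm of the group is at most lam, every coefficient is
    set to zero; otherwise all coefficients are scaled by (1 - lam / norm).
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * v for v in group]


# Hypothetical gene with one methylation and one expression coefficient.
kept = group_soft_threshold([3.0, 4.0], lam=1.0)      # both shrunk, both kept
dropped = group_soft_threshold([0.3, 0.4], lam=1.0)   # both removed together
```

Because the decision is taken on the norm of the group, the coefficients of a gene enter or leave the model together across blocks, which is the behavior sought when the same genes should be selected in the DNA methylation and expression blocks.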
To conclude, PIntMF is an easy and flexible method to integrate omics data. It exhibits good performance in terms of classification or variable selection in both cases (correlated blocks or non-correlated blocks). Among all tested methods, it is the one that works in most situations. PIntMF is fast and automatically tunes the penalty for each block to select an appropriate number of variables (sparse matrices). Besides, it provides a sparse matrix W to perform more easily the clustering of samples. We also provide three criteria namely MSE, PVE, and cophenetic coefficient to choose the best number of latent variables.
The integration of several types of omics with our method could help in discovering potential markers even with a small number of patients. Finally, it could also help to classify patients with unknown phenotypes.
6 Software
An R package named PIntMF can be used to reproduce all simulations and figures and is available online at ??.
References
Bersanelli, M., Mosca, E., Remondini, D., Giampieri, E., Sala, C., Castellani, G., and Milanesi, L. (2016). Methods for the integration of multi-omics data: mathematical aspects. BMC bioinformatics, 17(Suppl 2), 15.
Bock, C., Farlik, M., and Sheffield, N. C. (2016). Multi-omics of single cells: strategies and applications. Trends in biotechnology, 34(8), 605–608.
Brunet, J.-P., Tamayo, P., Golub, T. R., and Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12), 4164–4169.
Burstein, M. D., Tsimelzon, A., Poage, G. M., Covington, K. R., Contreras, A., Fuqua, S. A., Savage, M. I., Osborne, C. K., Hilsenbeck, S. G., Chang, J. C., et al. (2015). Comprehensive genomic analysis identifies novel subtypes and targets of triple-negative breast cancer. Clinical Cancer Research, 21(7), 1688–1698.
Cantini, L., Zakeri, P., Hernandez, C., Naldi, A., Thieffry, D., Remy, E., and Baudot, A. (2020). Benchmarking joint multi-omics dimensionality reduction approaches for cancer study. Nature Communications.
Chalise, P. and Fridley, B. L. (2017). Integrative clustering of multi-level omic data based on non-negative matrix factorization algorithm. PloS one, 12(5), e0176278.
Chalise, P., Koestler, D. C., Bimali, M., Yu, Q., and Fridley, B. L. (2014). Integrative clustering methods for high-dimensional molecular data. Translational cancer research, 3(3), 202.
Chauvel, C., Novoloaca, A., Veyre, P., Reynier, F., and Becker, J. (2019). Evaluation of integrative clustering methods for the analysis of multi-omics data. Briefings in Bioinformatics.
Chen, J. and Zhang, S. (2018). Discovery of two-level modular organization from matched genomic data via joint matrix tri-factorization. Nucleic acids research, 46(12), 5967–5976.
Chung, R.-H. and Kang, C.-Y. (2019). A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification. GigaScience, 8(5), giz045.
Gaujoux, R. and Seoighe, C. (2010). A flexible R package for nonnegative matrix factorization. BMC bioinformatics, 11(1), 367.
González, I., Déjean, S., Martin, P. G., Gonçalves, O., Besse, P., and Baccini, A. (2009). Highlighting relationships between heterogeneous biological data through graphical displays based on regularized canonical correlation analysis. Journal of Biological Systems, 17(02), 173–199.
Huang, S., Chaudhary, K., and Garmire, L. X. (2017). More is better: recent progress in multi-omics data integration methods. Frontiers in genetics, 8, 84.
Husson, F. and Josse, J. (2013). Handling missing values in multiple factor analysis. Food quality and preference, 30(2), 77–85.
Jain, R. and Xu, W. (2021). HDSI: High dimensional selection with interactions algorithm on feature selection and testing. PLOS ONE, 16(2), 1–17.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788.
Meng, C., Helm, D., Frejno, M., and Kuster, B. (2015). moCluster: Identifying joint patterns across multiple omics data sets. Journal of proteome research, 15(3), 755–765.
Mo, Q. and Shen, R. (2018). iClusterPlus: Integrative clustering of multi-type genomic data. R package version 1.18.0.
Mo, Q., Wang, S., Seshan, V. E., Olshen, A. B., Schultz, N., Sander, C., Powers, R. S., Ladanyi, M., and Shen, R. (2013). Pattern discovery and cancer gene identification in integrated cancer genomic data. Proceedings of the National Academy of Sciences, 110(11), 4245–4250.
Network, C. G. A. et al. (2012). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418), 61.
Nowak, G., Hastie, T., Pollack, J. R., and Tibshirani, R. (2011). A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics, 12(4), 776–791.
Pierre-Jean, M., Deleuze, J.-F., Le Floch, E., and Mauger, F. (2019). Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Briefings in bioinformatics.
Ramazzotti, D., Lal, A., Wang, B., Batzoglou, S., and Sidow, A. (2018). Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nature communications, 9(1), 4453.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336), 846–850.
Reilly, B., Tanaka, T. N., Diep, D., Yeerna, H., Tamayo, P., Zhang, K., and Bejar, R. (2019). DNA methylation identifies genetically and prognostically distinct subtypes of myelodysplastic syndromes. Blood advances, 3(19), 2845–2858.
Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A., and Kim, D. (2015). Methods of integrating data to uncover genotype–phenotype interactions. Nature Reviews Genetics, 16(2), 85.
Rodosthenous, T., Shahrezaei, V., and Evangelou, M. (2020). Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: A comparison study. Bioinformatics.
Rowlands, D. S., Page, R. A., Sukala, W. R., Giri, M., Ghimbovschi, S. D., Hayat, I., Cheema, B. S., Lys, I., Leikis, M., Sheard, P. W., et al. (2014). Multi-omic integrated networks connect DNA methylation and miRNA with skeletal muscle plasticity to chronic exercise in type 2 diabetic obesity. Physiological genomics, 46(20), 747–765.
Sastry, A. V., Hu, A., Heckmann, D., Poudel, S., Kavvas, E., and Palsson, B. O. (2020). Matrix factorization recovers consistent regulatory signals from disparate datasets. BioRxiv.
Shen, R., Olshen, A. B., and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25(22), 2906–2912.
Shen, R., Mo, Q., Schultz, N., Seshan, V. E., Olshen, A. B., Huse, J., Ladanyi, M., and Sander, C. (2012). Integrative subtype discovery in glioblastoma using iCluster. PloS one, 7(4), e35236.
Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). A sparse-group lasso. Journal of computational and graphical statistics, 22(2), 231–245.
Sneath, P. H., Sokal, R. R., et al. (1973). Numerical taxonomy. The principles and practice of numerical classification.
Sokal, R. R. and Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon, pages 33–40.
Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2), 257–284.
Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.-A., Grill, J., and Frouin, V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, 15(3), 569–583.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Tini, G., Marchetti, L., Priami, C., and Scott-Boyer, M.-P. (2017). Multi-omics integration - a comparison of unsupervised clustering methodologies. Briefings in bioinformatics.
Vasaikar, S. V., Straub, P., Wang, J., and Zhang, B. (2017). LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic acids research, 46(D1), D956–D963.