HAL Id: hal-03264640
https://hal.archives-ouvertes.fr/hal-03264640
Submitted on 18 Jun 2021
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Introducing group-sparsity and orthogonality constraints in RGCCA
Vincent Guillemot, Arnaud Gloaguen, Arthur Tenenhaus, Cathy Philippe, Hervé Abdi
To cite this version:
Vincent Guillemot, Arnaud Gloaguen, Arthur Tenenhaus, Cathy Philippe, Hervé Abdi. Introducing
group-sparsity and orthogonality constraints in RGCCA. JdS2021 : 52èmes Journées de Statistique,
Jun 2021, Nice, France. �hal-03264640�
Introducing group-sparsity and orthogonality constraints in RGCCA
Vincent Guillemot 1 , Arnaud Gloaguen 2 , Arthur Tenenhaus 2 , Cathy Philippe 3 and Herv´ e Abdi 4
1 Hub de Bioinformatique et Biostatistique, Institut Pasteur, Paris, FR
2 Laboratoire des Signaux et Syst` emes, CentraleSupelec, Gif-Sur-Yvette, FR
3 Neurospin, CEA, Gif-Sur-Yvette, FR
4 The University of Texas at Dallas, Richardson, TX, USA
R´ esum´ e. RGCCA est une m´ ethode flexible et rapide qui—g´ en´ eralisant de nom- breuses m´ ethodes existantes—permet l’analyse de donn´ ees structur´ ees en plusieurs blocs h´ et´ erog` enes. Nous pr´ esentons l’ajout dans RGCCA de deux nouvelles contraintes : une contrainte de parcimonie de groupes et une contrainte d’orthogonalit´ e sur les poids de RGCCA. Ces deux contraintes ont pour but d’augmenter l’interpr´ etabilit´ e de l’analyse de donn´ ees de grande dimension qui poss` edent une structure de groupe. Nous appliquons cette nouvelle m´ ethode—abr´ eg´ ee en gSGCCA—` a l’analyse de donn´ ees de gliome malin p´ ediatrique structur´ ees en trois blocs. Nous montrons sur ces donn´ ees le gain en in- terpr´ etabilit´ e apport´ e par les contraintes de parcimonie et d’orthogonalit´ e.
Mots-cl´ es. RGCCA, parcimonie, parcimonie de groupe, structure
Abstract. RGCCA—a fast and flexible method—generalizes many other well-known methods in order to analyze data-sets comprising multiple blocks of variables. Here we extend RGCCA by adding two new constraints to the RCCCA optimization problem: 1) group sparsity and 2) orthogonality of the block weight vectors. These two constraints facilitate the interpretability of the results when analyzing high dimensional data with a group structure. We illustrate this new method—called gSGCCA—with the analysis of pediatric high-grade glioma data: a set comprising three data blocks. This analysis shows that these new constraints greatly improve the interpretability of the statistical analysis.
Keywords. RGCCA, sparsity, group-sparsity, structure
1 Introduction
Regularized Generalized Canonical Correlation Analysis (RGCCA) [8, 9, 3] is a recent
multiblock component method that generalizes traditional component-based two table
methods—such as partial least square correlation, redundancy analysis, and canonical
correlation—in order to analyze data sets comprising multiple blocks of data. Just like
with other component methods, RGCCA results are often difficult to interpret when there
are (too?) many variables; to mitigate this problem, RGCCA has been extended to be- come Sparse General Canonical Correlation Analysis (SGCCA) [7]: a version of RGCCA that incorporate an ` 1 -norm based constraint in order to generate sparse block weight vec- tors. This sparsification constraint improves the interpretation of the results (because it selects important variables) but at a cost: the block weight vectors are not orthogonal—a pattern that often makes the results difficult to interpret. This trade-off between sparsity and orthogonality is not specific to RGGCA: it affects all component based-methods, es- pecially those based on the singular value decomposition (SVD) and its extensions (e.g., the generalized SVD, GSVD). Recently, however, we found that this trade-off could be eliminated 1) for the SVD: the constrained SVD (CSVD) [4], combines orthogonality and sparsity constraints to the plain SVD, and 2) for the GSVD (including block constraints on observations and variables): as implemented in sparse Multiple Correspondence Analysis (sMCA) [5].
Here, we propose to extend the approach used for the GSVD to create gSGCCA:
the version of RGCCA that includes 1) a group sparsity constraint (and it associated group sparse projection), and 2) an orthogonality constraint on the block weight vectors.
To do so, we applied the same group projection as in sparse MCA, combined with an orthogonality projection with projections onto convex sets (POCS) [1]. We illustrate gSGCCA with the analysis of the pediatric glioma data used in [7].
2 Method
Group sparse GCCA (gSGCCA) is defined as the following optimization problem:
argmax
a
1,a
2,...,a
Jf (a 1 , a 2 , . . . , a J ) =
J
X
j,k=1 j6=k
c jk g (cov (X j a j , X k a k ))
subject to
ka j k 2 = 1 ka j k G
j≤ s j a j ⊥ A j
, ∀j = 1, . . . , J.
(1)
where X 1 , . . . , X J are J centered blocks of data, the function g is defined as any continu- ously differentiable convex function, and the design matrix C = {c jk } is a symmetric J ×J matrix of non-negative elements describing the network of connections between blocks that are to be taken into account. Moreover, a 1 , . . . , a J are block weight vectors (i.e.. the weights applied to each block to obtain the block components), A 1 , . . . , A J are the previ- ously estimated weight vectors combined into matrices, G 1 , . . . , G J are the groups of vari- ables for a block, s 1 , . . . , s J are positive scalars controlling the group sparsity constraint for the block weight vectors, and the group norm is defined as: kxk G = P G
g=1 kx ι
gk 2 ,
where x ι
gis the subvector of x that contains only the elements of group G j . The ` 1,2 -ball
associated with this norm is noted B 1,2 (·). In the next section, we present a monotone convergent algorithm for solving optimization Problem (1).
2.1 The gSGCCA algorithm
The maximization of function f over the parameter vectors a = (a 1 , . . . , a L ), is imple- mented using cyclic Block Coordinate Ascent (BCA [2]); a procedure that updates in turn, each of the parameter vectors while keeping the others fixed. Specifically, let ∇ j f(a) be the partial gradient of f(a) with respect to a j . We want to find an update ˆ a j ∈ Ω j = {ka j k 2 = 1, and ka j k G
j≤ s j , and a j ⊥ A j } such that f (a) ≤ f(a 1 , ..., a j−1 , ˆ a j , a j+1 , ..., a J ). Be- cause f is a continuously differentiable multi-convex function and because a convex func- tion lies above its linear approximation at a j for any ˜ a j ∈ Ω j , the following inequality holds:
f (a 1 , ..., a j−1 , a ˜ j , a j+1 , . . . , a J ) ≥ f(a) + ∇ j f(a) > (˜ a j − a j ) := ` j (˜ a j , a). (2) On the right-hand side of (2), only the term ∇ j f (a) > a ˜ j is relevant to ˜ a j and, so, the solu- tion maximizing the minorizing function ` j (˜ a j , a) over ˜ a j ∈ Ω j is obtained by considering:
ˆ
a j = argmax
˜ a
j∈Ω
j∇ j f (a) > ˜ a j = argmin
˜ a
j∈Ω
jk∇ j f(a) − ˜ a j k 2 2 . (3) This last equality follows from ka j k 2 = 1 as a j ∈ Ω j . This core optimization problem is a projection onto the intersection between the ball defined by the groups, the ` 2 -ball, and the space orthogonal to the already estimated block weight vectors, assembled in A j . This projection on B 1,2 (s j ) ∩ B 2 (1) ∩ A ⊥ j is performed using POCS with two components:
the projection onto the intersection of the group ball and the ` 2 -ball, and the projection onto the orthogonal spaces defined by the already estimated loading vectors combined in the matrix A j . The complete gSGCCA algorithm is presented in Algorithm 1.
3 Application on glioma data
We applied gSGCCA to the glioma data previously analyzed with SGCCA ([7, 6]). This
data-set comprises three blocks of variables: 1) gene expression data (GE), 2) comparative
genomic hybridization data (CGH) and, 3) the location of the tumor in the brain. Here,
we focused on the analysis of only six groups of genes highly associated with different
types of brain tumors and with brain tumor development. The six groups are defined
similarly for both the GE and CGH blocks. For this analysis, we used three different
versions of RGCCA: 1) a classical three-block RGCCA with a complete design (i.e., all
blocks are inter-connected); 2) a structured version of RGCCA where each group of genes
is a block (here the design is complete within each type of data, GE or CGH, and all the
0.00 0.25 0.50 0.75 1.00 value
PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_UP VERHAAK_GLIOBLASTOMA_CLASSICAL VERHAAK_GLIOBLASTOMA_MESENCHYMAL VERHAAK_GLIOBLASTOMA_NEURAL VERHAAK_GLIOBLASTOMA_PRONEURAL
Dim. 1 Dim. 2 Dim. 3 Dim. 4 Dim. 5
PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_UP VERHAAK_GLIOBLASTOMA_CLASSICAL VERHAAK_GLIOBLASTOMA_MESENCHYMAL VERHAAK_GLIOBLASTOMA_NEURAL VERHAAK_GLIOBLASTOMA_PRONEURAL
Dim. 1 Dim. 2 Dim. 3 Dim. 4 Dim. 5
PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_UP VERHAAK_GLIOBLASTOMA_CLASSICAL VERHAAK_GLIOBLASTOMA_MESENCHYMAL VERHAAK_GLIOBLASTOMA_NEURAL VERHAAK_GLIOBLASTOMA_PRONEURAL
Dim. 1 Dim. 2 Dim. 3 Dim. 4 Dim. 5
PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_UP VERHAAK_GLIOBLASTOMA_CLASSICAL VERHAAK_GLIOBLASTOMA_MESENCHYMAL VERHAAK_GLIOBLASTOMA_NEURAL VERHAAK_GLIOBLASTOMA_PRONEURAL
Dim. 1 Dim. 2 Dim. 3 Dim. 4 Dim. 5
Figure 1: Comparison of the block weight vectors, summarized by groups, without sparsity constraints (upper graphs) or with a maximum sparsity constraint (lower graphs), for the GE block (on the left) and the CGH block (on the right).
y cort dipg midl
−6
−3 0 3 6
0 5 10
Dimension 1
Dimension 2
−0.75
−0.50
−0.25 0.00 0.25
−0.3 0.0 0.3 0.6
Dimension 1 (GE)
Dimension 1 (CGH)
−0.50
−0.25 0.00 0.25
−0.2 0.0 0.2
Dimension 1 (GE)
Dimension 1 (CGH)