Contributions à l’analyse de données multivoie : algorithmes et applications


HAL Id: tel-01619919

https://tel.archives-ouvertes.fr/tel-01619919

Submitted on 19 Oct 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Contributions à l’analyse de données multivoie : algorithmes et applications

Olga Lechuga Lopez

To cite this version:

Olga Lechuga Lopez. Contributions a l’analyse de données multivoie : algorithmes et applications. Autre. Université Paris-Saclay, 2017. Français. ⟨NNT : 2017SACLC038⟩. ⟨tel-01619919⟩


Thèse de doctorat de l’Université Paris-Saclay, préparée à CentraleSupélec

École doctorale n°580

Sciences et Technologies de l’Information et de la Communication

Spécialité de doctorat : Mathématiques et Informatique

par

Olga Gisela Lechuga López

Contributions à l’analyse de données multivoie : algorithmes et applications

Thèse présentée et soutenue à Gif-sur-Yvette, le 3 juillet 2017.

Composition du Jury :

M. Hervé Abdi, Professeur, University of Texas, Rapporteur

M. Mohamed Hanafi, Ingénieur de Recherche, Oniris-Nantes, Rapporteur

M. Christophe Ambroise, Professeur, Université d’Évry, Président du jury

M. Robert Sabatier, Professeur, Université de Montpellier, Examinateur

M. Arthur Tenenhaus, Professeur, CentraleSupélec, Directeur de thèse

M. Laurent Le Brusquet, Professeur adjoint, CentraleSupélec, Co-Directeur de thèse


PhD Thesis of University Paris-Saclay, prepared at CentraleSupélec

Doctoral School n°580

Information and Communication Sciences and Technologies

PhD Specialty : Mathematics and Informatics

by

Olga Gisela Lechuga López

Contributions to multiway analysis: algorithms and applications

Defended on the 3rd of July 2017, at Gif-sur-Yvette.

Jury :

M. Hervé Abdi, Professor, University of Texas, Reviewer

M. Mohamed Hanafi, Research Engineer, Oniris-Nantes, Reviewer

M. Christophe Ambroise, Professor, Université d’Évry, President of the jury

M. Robert Sabatier, Professor, Université de Montpellier, Examiner

M. Arthur Tenenhaus, Professor, CentraleSupélec, Supervisor

M. Laurent Le Brusquet, Associate Professor, CentraleSupélec, Co-supervisor


This thesis is the result of the help and support of many people that I would like to acknowledge here.

First of all, I would like to express my deepest gratitude towards my thesis advisors Arthur Tenenhaus and Laurent Le Brusquet for their immense support during these last four years. I am very proud to have been part of their team; their passion for teaching and sharing knowledge kept me motivated throughout this project, and they were always in a good mood, even through stressful periods. My thanks also go to Vincent Perlbarg for his insight and precious help regarding the data he provided. Without their precious time, help and effort this thesis would not have been completed. Their passion and expertise were a great source of motivation for my work, and I cannot thank them enough for investing their time and effort in helping me complete this project.

Secondly, I also want to thank the committee members: a special thanks to my reviewers Hervé Abdi and Mohamed Hanafi for accepting the invitation and spending their valuable time reviewing my thesis. I would also like to thank Christophe Ambroise, Robert Sabatier and Rémy Boyer for their role as examiners.

I also feel very lucky to have been part of the Signaux et Systèmes department at CentraleSupélec, where I got to meet so many wonderful friends. People came and left over the years, but I will always be grateful for the amazing dinners, parties and Tichu games we shared among colleagues: Emanuel Dogaru (hope we meet for some tea!), Ashish Rojatkar, Konstantinos Lekkas, Alexis Brenes, Pierre Prache (un jour à Fontainebleau), Pierre Bisiaux (merci pour tous ces voyages en RER). Also a big thanks to my wonderful colleagues Wacha Bounliphone and Wenmeng Xiong for always sharing their good spirits, and to Luc and Anne Batalie for always being happily available to resolve my administrative doubts.

I also want to thank my very good friends: Alfredo Nájera for being the best room-mate, Naim Jeanbart for all your fun and sunshine, Osvaldo Kim for all the rock and roll, Camila Pedersen for the many things you taught me, Juliette Karnycheff for the relaxing times, and all of Tous chez Nora for making my life in Paris such a happy one. Despite the distance, I also want to thank my friends who have supported me from Mexico, and I am grateful for the love I have always received from my friends in Melba.

Finally, the most important and biggest thanks to my family, as I would not be here today without their support. To Julian and Marcia for the many adventures we’ve shared and the many more to come, you are simply the best froys I could wish for. To my parents Miguel y Olga for all the love and motivation you have given me all my life, there are no words to describe my gratitude.


List of Figures i

List of Tables iii

General Overview 1

1 Introduction 5

1.1 Data structures, notations and definitions . . . 6

1.1.1 Multiway structures . . . 6

1.1.2 Unfolding, vectorization and reshaping a multiway array . . . 8

1.2 Essential concepts . . . 9

1.2.1 Supervised methods . . . 9

1.2.2 Regularization . . . 10

1.2.3 Kronecker product . . . 13

2 Background Methods 15

2.1 Standard Methods . . . 16

2.1.1 Singular Value Decomposition. . . 16

2.1.2 Principal Component Analysis . . . 18

2.1.3 Partial Least Squares. . . 19

2.1.4 Fisher Discriminant Analysis. . . 20

2.1.5 Logistic Regression . . . 21

2.1.6 Cox Regression . . . 25

2.1.7 Regularized Generalized Canonical Correlation Analysis . . . 27

2.2 Existing multiway extensions . . . 32

2.2.1 Principal Component Analysis to Parallel Factor Analysis . . . . 32

2.2.2 Partial Least Squares to N-way Partial Least Squares. . . 35

3 Contributions 37

3.1 Overview . . . 38

3.1.1 Kronecker constraint . . . 38

3.1.2 Regularization for multiway data . . . 39

3.1.3 Dataset description . . . 42

3.2 Multiway Logistic Regression . . . 44

3.2.1 Algorithm . . . 44

3.2.2 Experiments. . . 45

3.3 Multiway Cox Regression . . . 47

3.3.1 Algorithm . . . 47

3.3.2 Experiments. . . 48

3.4 Multiway Fisher Discriminant Analysis. . . 51

3.4.1 Algorithm . . . 52

3.4.2 Computational issues. . . 55

3.4.3 Sparse Multiway Fisher Discriminant Analysis. . . 58

3.4.4 Experiments. . . 61

3.5 Multiway Generalized Canonical Correlation Analysis. . . 65

3.5.1 Algorithm . . . 67

3.5.2 Experiments. . . 69

3.6 Conclusion. . . 70

4 Application 71

4.1 Background: studying the brain . . . 71

4.1.1 Diffusion Tensor Imaging . . . 73

4.1.2 Coma dataset description. . . 75

4.2 Application of Multiway Fisher Discriminant Analysis . . . 76

4.2.1 Entire Brain. . . 76

4.2.2 White matter.. . . 80

4.3 Conclusion. . . 86

5 Conclusion 87

5.1 Discussion and Perspectives . . . 87

5.2 List of publications . . . 89

A Résumé 91

List of Figures

1.1 Graphical representation of a three-way array. . . 7

1.2 Unfolded three way array. . . 8

2.1 Graphical representation of multiblock data. . . 27

2.2 PARAFAC decomposition. . . 33

3.1 Regularization matrix through grouping variables.. . . 40

3.2 Structure of data used for simulations. . . 42

3.3 Example of data used for simulations. . . 43

3.4 LR and MLR obtained weight vectors. . . 46

3.5 MLR obtained discriminant vector related to J . . . 46

3.6 Standard Cox Regression and Multiway Cox Regression: Evolution of performance with respect to the number of covariates J . . . 49

3.7 FDA obtained discriminant vectors.. . . 61

3.8 MFDA obtained discriminant vector related to J . . . 61

3.9 sparse FDA and group lasso FDA obtained discriminant vectors. . . 63

3.10 sparse MFDA obtained discriminant vector related to J . . . 63

3.11 Graphical representation of multiblock and multiway data. . . 65

3.12 Graphical representation of the MGCCA optimization problem. . . 67

3.13 MGCCA application: data structure. . . 69

3.14 MGCCA weight vectors . . . 69

4.1 Entire brain: FDA weight vector (part 1). . . 76

4.1 Entire brain: FDA weight vector (part 2) .. . . 77

4.2 Entire brain: MFDA weight vectors. . . 78

4.3 Entire brain FDA and MFDA cross-validation. . . 79

4.4 White matter: FDA weight vectors (part 1). . . 81

4.4 White matter: FDA weight vectors (part 2). . . 82

4.5 White matter: MFDA weight vectors. . . 82

4.6 White matter: FDA and MFDA cross-validation curves with RI. . . 83

4.7 Anatomic partition of the white matter into 17 different regions. . . 84

4.8 White matter: MFDA weight vectors considering a group regularization matrix.. . . 84


4.9 White matter: FDA and MFDA cross-validation considering a group regularization matrix.

List of Tables

2.1 Special cases of RGCCA 1.1 . . . 30

2.2 Special cases of RGCCA 1.2 . . . 31

2.3 Summary of the standard methods extended to the multiway context . . 32

3.1 MLR obtained discriminant vector related to K. . . 46

3.2 MFDA obtained discriminant vector related to K. . . 62

3.3 Sparse MFDA obtained discriminant vector related to J . . . 64

3.4 Computation time comparisons. . . . 64

3.5 Summary of the methods we extended to the multiway context. . . 70


Motivations

Statistical data analysis methods in their standard formulation are not adapted to take into account the increasing number of parameters and the structure present in emerging data. Examples of this increasing complexity can be found in different areas such as image analysis [Geladi 1989], chemometrics [Smilde 1990, Smilde 1991, Bro 1996a], psychometrics [Sands 1980], sensor array processing [Sidiropoulos 2000], neuroscience [Martínez-Montes 2004], or in the review by [Kolda 2009]; this poses unprecedented demands for new statistical methods.

These structured data sets are usually described by more than two dimensions, forming what is known as a multiway array or tensor, which can be viewed as a stack of matrices. Consider, for example, the situation where the data are contained in a three-dimensional array of size I × J × K, where the dimension I corresponds to the observations, J to the set of variables measured at different occasions and K to the different instances (e.g. of time). In order to apply a given statistical method, the data must be restructured into an I × JK two-way array by concatenating the K modalities next to each other, losing the original structure and potentially leading to:

• a procedure that destroys the wealth of structural information that is inherently contained in the data;

• a very large parameter vector of size J × K to estimate (when J or K is large), which may lead to over-fitting of the model to the data.

These aspects may also lead to the appearance of spurious correlations, yielding a complicated interpretation of the resulting model. This indicates that statistical and computational methods in their two-dimensional formulation are insufficient for the analysis of structured data, and that additional structural constraints need to be considered in order to obtain a relevant model.

This problem is not new: the first works on tensor decomposition date back to [Hitchcock 1927], but these ideas were not further developed until the early 1960s, with the works of [Tucker 1966] generalizing Principal Component Analysis (PCA) and of [Cattell 1944] on factor analysis. Further methods and applications were developed


in the 1970s, when [Carroll 1970] proposed CANDECOMP (canonical decomposition) while [Harshman 1970] proposed PARAFAC (Parallel Factor Analysis). The PARAFAC/CANDECOMP and Tucker models can be considered as the foundation of current multiway generalizations, for which efficient algorithms have been developed since [Kroonenberg 1980, Kroonenberg 1983]. However, even though work has been done in this area, many commonly used statistical methods have still not been generalized to higher dimensions.

In this thesis, we explore the generalization of statistical analysis methods to the multiway context. We focus on commonly used linear algorithms: Logistic Regression (LR), Cox Regression, Fisher Discriminant Analysis (FDA) and Regularized Generalized Canonical Correlation Analysis (RGCCA). The main contributions of this work can be grouped around three main themes that integrate the generalization to multiway data:

• The core of our contribution is a structural constraint imposed on the statistical models. This constraint allows us to separate the importance of the variables related to the J and K dimensions, making it easier to identify the most relevant variables.

• The use of a regularization matrix R, necessary in the high-dimensional context (I ≪ JK), that fits the multiway structure of the data. The simplest choice is to take R equal to the identity, but we will also propose other structures better adapted to the multiway case.

• The interpretability of the models is facilitated since the number of parameters to be estimated is reduced to J + K. In addition, we can independently analyze the importance of the variables and the modalities. Accordingly, sparsity inducing penalties are also considered and further facilitate the interpretation.

Illustrative examples of the different methods will be given through: (i) a spectroscopy data set where the data are collected on a set of I subjects, each characterized by a spectrum of J = 750 variables measured at K = 7 different layers, and (ii) a multi-modal brain Magnetic Resonance Imaging (MRI) data set where K neuroimaging modalities (each characterized by roughly J = 250,000 voxels) are collected on a set of I patients.

In this context, an individuals × voxels × modalities (I × J × K) array is considered. With standard statistical approaches, the number of parameters to estimate is J × K, which is computationally burdensome and leads to a challenging interpretation. It becomes evident that dedicated modeling algorithms able to cope with the inherent structure of the data are mandatory to provide interpretable results.

Layout of the manuscript

The manuscript is structured in five chapters divided as follows:

Chapter 1 introduces different types of data structures and standard tools that are encountered in the multiway context. We start by describing the forms in which multiway data is presented and the different procedures in which it can be re-structured. We also introduce the operators and methodologies required to deal with tensors.

Chapter 2 presents the statistical methods, in their standard formulation, that we generalize to the multiway context. We also review the methods that have already been extended to the multiway case and that serve as a foundation for our contributions.

Chapter 3 encompasses the main contributions of our work. We propose a simple scheme to generalize the methods introduced in the previous chapter to the multiway case. We introduce the multiway extensions of Fisher Discriminant Analysis (FDA), Logistic Regression (LR), Cox Regression and Regularized Generalized Canonical Correlation Analysis (RGCCA), namely MFDA, MLR, Multiway Cox Regression and MGCCA. We show empirical evidence on a spectroscopy data set that the extensions to three-way tensors yield more interpretable results, with similar classification rates.

Chapter 4 compares the standard methods to their multiway counterparts. For this purpose, we use a Diffusion Tensor Imaging (DTI) Magnetic Resonance Imaging (MRI) dataset obtained in collaboration with the Pitié-Salpêtrière University Hospital. The objective of this application is to make a long-term prediction of the outcome of comatose patients after a traumatic brain injury.

Chapter 5 concludes this thesis with a discussion of the main results obtained in Chapters 3 and 4, together with some perspectives.


Introduction

Contents

1.1 Data structures, notations and definitions . . . 6

1.1.1 Multiway structures . . . 6

1.1.2 Unfolding, vectorization and reshaping a multiway array . . . 8

1.2 Essential concepts . . . 9

1.2.1 Supervised methods . . . 9

1.2.2 Regularization . . . 10

1.2.3 Kronecker product . . . 13

In this chapter, we introduce the concepts and notations that will be used throughout the thesis. We also introduce the multiway data structure that will be considered. This is followed by the different procedures by which such a structure can be rearranged when standard methods are applied.

The chapter continues with a general example of the statistical problems at hand, together with the complications encountered in high dimensionality. This is provided to ease the understanding of the algorithms developed in Chapters 2 and 3. We conclude this chapter with the description of the Kronecker product, which is an essential operator of our approach.


1.1 Data structures, notations and definitions

Because of the variety of applications of multiway data across different scientific areas, we fix the notations that will be used throughout this manuscript:

X : a multiway array is denoted by an underlined bold-face capital letter;

X : a two-dimensional array is denoted by a bold upper-case letter;

x : a vector is denoted by a bold lower-case letter;

x : a scalar is denoted by a lower-case italic letter.

1.1.1 Multiway structures

The notion of tensor covers arrays of different sizes: in general, an $I_1 \times I_2 \times \cdots \times I_N$ array is an N-way tensor in $\mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. In this work we restrict ourselves to third-order tensors. Figure 1.1 gives an example of a three-way array. The elements $x_{ijk}$ of a three-way array X are indexed by $i = 1, \ldots, I$, $j = 1, \ldots, J$ and $k = 1, \ldots, K$. The matrices obtained by fixing one index form the horizontal, lateral and frontal slices of the multiway array.

The entities along the vertical axis are indexed by the first index i, those along the horizontal axis by the second index j and those along the depth axis by the third index k. The three sets of entities define the three "ways" of the tensor, as in [Kiers 2000]. In the case of three-way arrays, the I horizontal slices pertain to the entities $\mathbf{X}_{i..}$, $i = 1, \ldots, I$ (mode A); the J lateral slices pertain to the entities $\mathbf{X}_{.j.}$, $j = 1, \ldots, J$ (mode B); and the K frontal slices pertain to the entities $\mathbf{X}_{..k}$, $k = 1, \ldots, K$ (mode C).



Figure 1.1: Horizontal, lateral and frontal slices of the three-way array X: $\mathbf{X}_{i..}$ is the i-th horizontal slice of X, of dimension J × K; $\mathbf{X}_{.j.}$ is the j-th lateral slice, of dimension I × K; $\mathbf{X}_{..k}$ is the k-th frontal slice, of dimension I × J. Diagram adapted from [Kiers 2000].


1.1.2 Unfolding, vectorization and reshaping a multiway array

Unfolding and reshaping

When the original structure of the data is presented as a multiway array, an unfolding process is required. For this, all the matrices forming the slices of the tensor are collected into a single two-way matrix: a "supermatrix" is formed by concatenating the frontal slices next to each other.

This process is also known as matricizing the tensor; an example is shown in Figure 1.2. Depending on the application, a tensor can be unfolded in different ways. For the purposes of this work, the tensor X of dimensions I × J × K will be unfolded into an I × JK matrix, with mode B entities (j = 1, ..., J) nested within mode C entities (k = 1, ..., K), and will be denoted by X. The reverse process is called reshaping, and re-stacks the columns of the unfolded matrix into the original tensor structure.


Figure 1.2: Unfolded three-way array: X, obtained from [Kiers 2000].

Vectorization

It can also be the case that the elements of a multiway array need to be represented as a vector. This is done by vectorizing the array. For matrices, vectorization is achieved by successively concatenating the columns of the matrix below each other in a single vector. The vectorization of the matrix U into a vector u is denoted by u = Vec(U). To vectorize a higher-way array, we simply vectorize its unfolded version, obtaining x = Vec(X) as the vectorized version of the multiway array.
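As an illustration, the following minimal numpy sketch (hypothetical sizes and variable names, assuming the tensor is stored with its axes ordered as I × J × K) unfolds a three-way array into an I × JK matrix with the mode B entities nested within the mode C entities, reshapes it back, and vectorizes it:

import numpy as np

I, J, K = 4, 3, 2
X_tensor = np.arange(I * J * K, dtype=float).reshape(I, J, K)

# Unfolding: concatenate the K frontal slices X[:, :, k] next to each other,
# so the J variables (mode B) are nested within the K modalities (mode C).
X_unfolded = np.concatenate([X_tensor[:, :, k] for k in range(K)], axis=1)  # shape (I, J*K)

# Reshaping: re-stack the unfolded matrix into the original tensor structure.
X_back = np.stack([X_unfolded[:, k * J:(k + 1) * J] for k in range(K)], axis=2)
assert np.allclose(X_back, X_tensor)

# Vectorization of a matrix: concatenate its columns below each other.
U = X_tensor[:, :, 0]
u = U.flatten(order="F")           # u = Vec(U)

# Vectorization of the multiway array: vectorize its unfolded version.
x = X_unfolded.flatten(order="F")  # x = Vec(X)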


1.2 Essential concepts

We begin this section with a brief description of the statistical methods considered in this thesis. We recall a set of useful concepts required for the understanding of our work, namely, the need for regularization in high dimensionality, the dual formulation and the Kronecker product.

1.2.1 Supervised methods

Throughout the manuscript we will work with a given set of observations divided into C classes through a qualitative variable y (the labels), and we let Y be the dummy variable matrix ($y_{i,c} = 1$ if individual i belongs to class c). We consider $\mathcal{X}$ the input/sample space and $\mathcal{Y}$ the output/target space.

Formally, our goal is to determine a prediction function $g : \mathcal{X} \to \mathcal{Y}$ such that, given $x \in \mathcal{X}$, $g(x)$ provides a good prediction of the output $y \in \mathcal{Y}$. Given a set of input/output pairs $\{(x_1, y_1), \ldots, (x_I, y_I)\}$, where each $x_i \in \mathbb{R}^J$ and $y_i \in \mathbb{R}$, we seek to build a linear predictive model of the form:

$$y_i = g(x_i; w) + \varepsilon_i, \qquad (1.2.1)$$

where g is a function of x and $\varepsilon_i$ is an error term measuring the distance between predicted and true values, i.e. how well $g(x; w)$ predicts y. The function $g(x; w)$ is parametrized by a weight vector $w \in \mathbb{R}^J$ that quantifies the contribution of the features to the prediction.

The task of finding w can be formulated as a "loss function" minimization problem by estimating the empirical error:

$$w^* = \arg\min_{w} \sum_{i=1}^{I} L(g(x_i; w), y_i). \qquad (1.2.2)$$

The simplest example is the linear regression problem, which seeks the best fit w for the pairs $(x_i, y_i)$ such that $w^\top x_i \approx y_i$, with error $\varepsilon_i = y_i - w^\top x_i$. For this, we minimize the sum of squared errors:
$$\sum_{i=1}^{I} \varepsilon_i^2 = \sum_{i=1}^{I} (y_i - w^\top x_i)^2 = \|y - Xw\|^2, \qquad (1.2.3)$$


where $\|\cdot\|^2$ denotes the square of the standard Euclidean norm. From (1.2.3) the objective function is:

$$w^* = \arg\min_{w} \|y - Xw\|^2. \qquad (1.2.4)$$

Solving optimization problem (1.2.4) boils down to taking the gradient with respect to w and equating to zero, leading to the solution:

$$w = (X^\top X)^{-1} X^\top y. \qquad (1.2.5)$$

The predicted vector is then given by:

$$\hat{y} = X (X^\top X)^{-1} X^\top y. \qquad (1.2.6)$$
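For illustration, a minimal numpy sketch of the least squares solution (1.2.5) on simulated data (all names hypothetical); in practice np.linalg.lstsq is preferable to forming an explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
I, J = 50, 5
X = rng.normal(size=(I, J))
w_true = rng.normal(size=J)
y = X @ w_true + 0.1 * rng.normal(size=I)

# Closed-form solution (1.2.5): w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predicted vector (1.2.6)
y_hat = X @ w_hat

# Numerically safer equivalent
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)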

However, there are cases in which this estimation may be hard to achieve or in which the performance of the model may decrease, for example when there are many correlated variables or when the number of variables is larger than the number of observations (J ≫ I). In the following we introduce regularization procedures to overcome such issues.

1.2.2 Regularization

Here we recall the regularization approaches that help solve the problems arising when J ≫ I. Such problems have become of great importance, especially in the field of bioinformatics, where datasets have an increasing number of features. The main concerns are the following:

over-fitting: the prediction function must generalize the features that can be learned from a given set of data while avoiding over-fitting, which happens when the function only learns specific details of the data set rather than its global properties, so that it is prone to errors on new observations;

computational stability: caused by proximity to singularities of various types, such as very small eigenvalues.

As presented in the previous section, most learning algorithms require the inversion of a matrix $X^\top X$ of dimension J × J, which can be singular. In order to overcome this issue and to reduce over-fitting, we must consider additional constraints, such as a regularization function [Tikhonov 1977].


Ridge Regularization

An example can be given with the regularized version of the linear regression problem, or ridge regression [Hoerl 1970]:

$$w^* = \arg\min_{w} \|y - Xw\|^2 + \lambda \|w\|^2, \qquad (1.2.7)$$

where λ is a regularization parameter that controls the amount of penalization of the function. The solution of equation (1.2.7) is obtained by taking the gradient with respect to w and equating to zero, so the solution for w leads to:

$$w = (X^\top X + \lambda I)^{-1} X^\top y, \qquad (1.2.8)$$

since $X^\top X$ is a symmetric matrix, it is positive semi-definite. As a consequence, the condition λ > 0 is sufficient to guarantee that $(X^\top X + \lambda I)$ is non-singular and the ridge estimator exists.
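A minimal sketch of the ridge estimator (1.2.8) with a hypothetical penalty value; note that it remains well defined even when J > I, where ordinary least squares breaks down:

import numpy as np

rng = np.random.default_rng(1)
I, J = 20, 100                       # more variables than observations (J >> I)
X = rng.normal(size=(I, J))
y = rng.normal(size=I)
lam = 1.0                            # regularization parameter lambda

# Ridge solution (1.2.8): w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(J), X.T @ y)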

The regularization matrix is usually equal to the identity but, as we will see later in the manuscript, it can also serve to impose an a priori on the "form" of the solution.

Dual Formulation

It is possible to simplify the optimization problem (1.2.8) through the use of a dual formulation [Shawe-Taylor 2004, Boyd 2004], considering the equality:
$$(X^\top X + \lambda I_J)^{-1} X^\top = X^\top (X X^\top + \lambda I_I)^{-1}, \qquad (1.2.9)$$

then w can be computed from (1.2.8) as:

$$w = X^\top (X X^\top + \lambda I)^{-1} y \qquad (1.2.10)$$
$$\;\;\, = X^\top (K + \lambda I)^{-1} y \qquad (1.2.11)$$
$$\;\;\, = X^\top \alpha. \qquad (1.2.12)$$

The elements of $\alpha$ are known as dual variables, and the matrix K is computed as $K_{ij} = \langle x_i, x_j \rangle$. The key difference between the primal formulation (1.2.7) and its dual in (1.2.12) is that the first involves inverting a J × J matrix, while the second requires inverting an I × I matrix, which is advantageous when I ≪ J.


Sparsity constraints

One of the goals of integrating sparsity is to avoid over-fitting, which may be associated with a large number of estimated coefficients in w. In regularization algorithms there are mainly two types of norms: the Euclidean norm ($\ell_2$, sum of squares) and sparsity-inducing norms ($\ell_1$, sum of absolute values). The latter performs model selection as well as regularization. Another goal of feature selection is to identify the subsets of features that are the most discriminant, since irrelevant variables may introduce noise and decrease classification accuracy.

Definition 1.2.1 (Sparsity) A vector $x \in \mathbb{R}^n$ is sparse if only a few of its entries are non-zero; the number of non-zero entries is called the $\ell_0$-norm of x: $\|x\|_0 = \#\{i : x_i \neq 0\}$.

The Least Absolute Shrinkage and Selection Operator (LASSO) is a special form of regression proposed by [Tibshirani 1996]. It regularizes ordinary least squares regression with an $\ell_1$ penalty and is an effective linear regression technique for feature selection. The LASSO leads to sparse solutions by shrinking the coefficients of the irrelevant or redundant features to zero.

[Ng 2004] showed that the LASSO is particularly effective when there exist many irrelevant features but only very few training examples. The use of an $\ell_1$ penalty to achieve sparsity has been studied extensively in the regression framework [Efron 2004, Zou 2006b, Zou 2006a].

If X is an I × J data matrix and y is an outcome vector of length I, then the lasso solves:

$$\min_{w \in \mathbb{R}^J} \|y - Xw\|^2 + \lambda \|w\|_1, \qquad (1.2.13)$$

where λ is a nonnegative tuning parameter that controls the amount of sparsity. The lasso forces the sum of the absolute values of the coefficients to be less than a certain value, setting some coefficients to zero and leading to a simpler model. But it does have some limitations: it is unable to perform grouped selection, and it might select one variable from a group while ignoring the others.

Following this, [Zou 2005] introduced the elastic net, which extends the lasso by adding an additional penalty term, solving:


$$\min_{w \in \mathbb{R}^J} \|y - Xw\|^2 + \lambda \|w\|_1 + \gamma \|w\|^2, \qquad (1.2.14)$$

where λ and γ are nonnegative tuning parameters. The $\ell_1$ penalty generates a sparse model, while the quadratic part allows for grouped selection.

Group sparsity. We can also consider structured sparsity-inducing norms. These yield estimated vectors that are sparse, as for the $\ell_1$-norm, but whose support also displays some structure known a priori, reflecting potential relationships between the variables. This a priori is paired with pre-specified disjoint blocks of variables to be selected or ignored simultaneously. In this context, $\mathcal{G}$ is a collection of groups of variables forming a partition of [1; J], and $d_g$ is a positive scalar weight indexed by the group g; we can then define $\Omega$ as:
$$\Omega(w) = \sum_{g \in \mathcal{G}} d_g \|w_g\|, \qquad (1.2.15)$$

where $w_g$ is the sub-vector associated with the variables of group g. This norm is usually referred to as a mixed $\ell_1/\ell_q$-norm; in practice, frequent choices for q are {2, ∞}. As desired, regularizing with $\Omega$ leads variables in the same group to be selected or set to zero simultaneously.

Choice of weights $d_g$. When the groups vary significantly in size, results can be improved, in particular under high-dimensional scaling, by choosing appropriate weights $d_g$ that compensate for the discrepancies in size between groups. It is difficult to provide a unique choice for the weights; in general, they depend on q and on the type of consistency desired.

1.2.3 Kronecker product

The Kronecker product supports a wide range of fast and practical algorithms. This matrix operation has an increasingly important role in many areas, such as signal and image processing [Van Loan 2000]. This operator is the centerpiece of the structural constraint imposed on the algorithms presented in Chapter 3.


Definition 1.2.2 (Kronecker product) Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$. The Kronecker product (denoted by the symbol ⊗) of A and B is defined as the matrix:
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix} \in \mathbb{R}^{mp \times nq}. \qquad (1.2.16)$$

The same definition holds if A and B are complex-valued matrices. The Kronecker product is also known as tensor product.

The Kronecker product has many interesting properties, most of them stated and proven in the standard literature on matrix analysis; see for example [Horn 2012, Chapter 4]. Some of the properties of the Kronecker product used throughout this manuscript are:

A ⊗ (B + C) = A ⊗ B + A ⊗ C, (1.2.17)

(A ⊗ B)(C ⊗ D) = AC ⊗ BD, (1.2.18)

A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C. (1.2.19)

These properties facilitate computations with tensors and constitute an important tool in tensor decompositions. Since we mainly use the Kronecker product between two vectors, we give its definition below.

Definition 1.2.3 (Vector Kronecker product) Let $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^p$. Then the Kronecker product of u and v is defined as the vector:
$$u \otimes v = \begin{pmatrix} u_1 v \\ u_2 v \\ \vdots \\ u_m v \end{pmatrix} \in \mathbb{R}^{mp}. \qquad (1.2.20)$$


Background Methods

Contents

2.1 Standard Methods . . . 16

2.1.1 Singular Value Decomposition. . . 16

2.1.2 Principal Component Analysis . . . 18

2.1.3 Partial Least Squares . . . 19

2.1.4 Fisher Discriminant Analysis. . . 20

2.1.5 Logistic Regression . . . 21

2.1.6 Cox Regression . . . 25

2.1.7 Regularized Generalized Canonical Correlation Analysis . . . 27

2.2 Existing multiway extensions . . . 32

2.2.1 Principal Component Analysis to Parallel Factor Analysis . . . 32

2.2.2 Partial Least Squares to N-way Partial Least Squares . . . 35

This chapter is divided into two sections. We begin with a reminder on the Singular Value Decomposition (SVD). We then introduce the methods for which a multiway extension has already been proposed, Principal Component Analysis (PCA) and Partial Least Squares regression (PLS), followed by the standard formulations of the methods that will be extended to the multiway framework: regularized Fisher Discriminant Analysis (FDA), Logistic Regression, Cox Regression and Regularized Generalized Canonical Correlation Analysis (RGCCA). In the second section, we present the existing multiway extensions of PCA [Harshman 1970, Carroll 1970] and PLS [Bro 1998], which constitute the starting point of our main contributions presented in Chapter 3.


2.1 Standard Methods

2.1.1 Singular Value Decomposition

Several data analysis techniques seek reduced-rank approximations of the data. The Singular Value Decomposition (SVD), also known as the Eckart-Young decomposition, plays an important role in this rank-reduction process. This framework can accommodate a wide range of multidimensional techniques and, in a way, unifies a series of different analyses, as we show in the following. The SVD states that every rectangular matrix X has a generalized eigen-decomposition: it is the factorization into three matrices U, S and V with a simple geometric interpretation [Abdi 2007, Meyer 2000].

Theorem 2.1.1 (Singular Value Decomposition) Any real matrix $X \in \mathbb{R}^{I \times J}$ with rank(X) = R can be decomposed as:
$$X = U S V^\top, \qquad (2.1.1)$$
where U is an I × I matrix with $U^\top U = U U^\top = I_I$, V is a J × J matrix with $V^\top V = V V^\top = I_J$, and S is a diagonal I × J matrix with positive elements $S = \mathrm{diag}(s_1, \ldots, s_R)$ corresponding to the singular values of X: $s_1 \geq \ldots \geq s_R > 0$. The columns of U and V are the left and right singular vectors of X, respectively.

Remark 2.1.1 From (2.1.1), the SVD of X can be written as the sum:
$$X = s_1 u_1 v_1^\top + \ldots + s_R u_R v_R^\top, \qquad (2.1.2)$$
with $u_j$ and $v_j$ the j-th columns of U and V.


Theorem 2.1.2 (Truncated SVD) [Hansen 1987] Consider the SVD of a matrix $X \in \mathbb{R}^{I \times J}$ with rank(X) > R. Let $U_R$ be the I × R matrix with $U_R^\top U_R = I_R$, $V_R$ the J × R matrix with $V_R^\top V_R = I_R$ and $S_R$ the diagonal R × R matrix with positive elements $S_R = \mathrm{diag}(s_1, \ldots, s_R)$, obtained by keeping the first R singular triplets, such that:
$$X \approx U_R S_R V_R^\top, \qquad (2.1.3)$$
with $U_R$ of dimension I × R, $S_R$ of dimension R × R and $V_R^\top$ of dimension R × J.

The matrices $U_R$ and $V_R$ are not unique; their columns are formed by eigenvectors of the symmetric matrices $X X^\top$ and $X^\top X$, respectively. Since eigenvectors of symmetric matrices are orthogonal, they can be used as basis vectors to span a multidimensional space. The absolute value of the determinant of an orthogonal matrix is one, so the matrix always has an inverse; furthermore, each column and row has unit norm. Since they come from a symmetric matrix, the eigenvalues are all real numbers.

Theorem 2.1.3 Let the rank-R matrix Y be given by the truncated SVD of X:
$$Y = s_1 u_1 v_1^\top + \ldots + s_R u_R v_R^\top = U_R S_R V_R^\top. \qquad (2.1.4)$$
Then Y is a best rank-R approximation of X (in the least squares sense):
$$Y = \arg\min_{\tilde{Y} \in \mathbb{R}^{I \times J}} \|X - \tilde{Y}\|^2 \quad \text{s.t.} \quad \mathrm{rank}(\tilde{Y}) = R. \qquad (2.1.5)$$
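A short numpy illustration of Theorem 2.1.3 with hypothetical sizes: the rank-R truncation of the SVD has rank R and its squared approximation error equals the sum of the squared discarded singular values:

import numpy as np

rng = np.random.default_rng(5)
I, J, R = 30, 20, 5
X = rng.normal(size=(I, J))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T
Y = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]          # truncated SVD U_R S_R V_R^T

assert np.linalg.matrix_rank(Y) == R
assert np.isclose(np.linalg.norm(X - Y)**2, np.sum(s[R:]**2))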

Unfortunately, there is no higher-order SVD for tensors that inherits all the properties of the matrix SVD. So it is not possible to compute both the orthonormal subspace of each mode and a minimal rank-R decomposition simultaneously. However, tensor decompositions that partly satisfy each of the above properties exist.


2.1.2 Principal Component Analysis

PCA was introduced in [Pearson 1901] and has since become a popular analysis technique. It fully relies on the SVD and is used to highlight patterns in high-dimensional datasets by emphasizing their variations. Another of its great advantages is that it allows for dimensionality reduction by finding the principal components. PCA transforms the data to a new linear basis such that the greatest variance is captured by the projection of the data [Smith 2002, Murphy 2012, Adachi 2016].

Definition 2.1.1 (Principal Component Analysis) Let $X \in \mathbb{R}^{I \times J}$ contain the measurements of I observations on J features. The PCA model is given by:
$$X = A B^\top + E \iff x_{ij} = \sum_{r=1}^{R} a_{ir} b_{jr} + e_{ij}. \qquad (2.1.6)$$

The PCA objective function is then:
$$\min_{A, B} \|X - A B^\top\|, \qquad (2.1.7)$$
where the columns of A are mutually orthogonal and standardized.

It appears that the PCA solution $A B^\top$ can be obtained from the truncated SVD of X:
$$A B^\top \longleftrightarrow U_R S_R V_R^\top, \qquad (2.1.8)$$
with $A = n^{1/2} U_R$ and $B^\top = n^{-1/2} S_R V_R^\top$; equation (2.1.8) gives the best rank-R approximation of X.

Remark 2.1.2 The factors A are linear combinations of the data X:
$$A = n^{1/2} U_R \qquad (2.1.9)$$
$$\;\;\, = (U S V^\top) V_R S_R^{-1} n^{1/2} \qquad (2.1.10)$$
$$\;\;\, = X V_R S_R^{-1} n^{1/2}, \qquad (2.1.11)$$
which are also uncorrelated:
$$\frac{A^\top A}{n} = U_R^\top U_R = I_R. \qquad (2.1.12)$$

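The following sketch illustrates Remark 2.1.2 numerically: computing A and B from the truncated SVD of a column-centered data matrix and checking that the factors are uncorrelated (variable names are hypothetical; here n denotes the number of rows of X, written I elsewhere in the chapter):

import numpy as np

rng = np.random.default_rng(6)
n, J, R = 100, 8, 3
X = rng.normal(size=(n, J))
X = X - X.mean(axis=0)                      # center the variables

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_R, S_R, V_R = U[:, :R], np.diag(s[:R]), Vt[:R, :].T

A = np.sqrt(n) * U_R                        # A = n^{1/2} U_R
Bt = S_R @ V_R.T / np.sqrt(n)               # B^T = n^{-1/2} S_R V_R^T

# A B^T equals the best rank-R approximation of X (2.1.8)
assert np.allclose(A @ Bt, U_R @ S_R @ V_R.T)
# The factors are uncorrelated: A^T A / n = I_R (2.1.12)
assert np.allclose(A.T @ A / n, np.eye(R))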

2.1.3 Partial Least Squares

In contrast to PCA, Partial Least Squares (PLS) [Wold 1983] finds components of X that are associated with the label y. The most important feature of PLS is that the decomposition (or deflation procedure) is accomplished by successively computing score vectors, whose main property is that they have maximum covariance with the unexplained part of the dependent variable. As explained in [Abdi 2003], PLS decomposes both X and y as a product of a common set of orthogonal factors and a set of specific loadings. We can relate PLS to PCA by performing a one-component PCA (R = 1), yielding the following minimization problem:

$$\min_{w,\, w^\top w = 1} \|X - X w w^\top\|^2. \qquad (2.1.13)$$

If we replace the score vector Xw by y in problem (2.1.13), we obtain the following optimization problem:
$$\min_{w,\, w^\top w = 1} \|X - y w^\top\|^2 = \min_{w,\, w^\top w = 1} \mathrm{tr}\left[ (X - y w^\top)(X - y w^\top)^\top \right], \qquad (2.1.14)$$

which boils down to considering:
$$\max_{w,\, w^\top w = 1} w^\top X^\top y = \max_{w,\, \|w\| = 1} \mathrm{cov}(Xw, y). \qquad (2.1.15)$$

Thus, PLS amounts to finding the weight vector w such that the covariance between y and the linear combination Xw is maximized.
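A minimal sketch of (2.1.15) on hypothetical centered data: the weight vector maximizing the covariance between Xw and y under the unit-norm constraint is proportional to $X^\top y$:

import numpy as np

rng = np.random.default_rng(7)
I, J = 60, 10
X = rng.normal(size=(I, J)); X -= X.mean(axis=0)
y = rng.normal(size=I); y -= y.mean()

# First PLS weight vector: maximizer of w^T X^T y subject to ||w|| = 1
w = X.T @ y
w /= np.linalg.norm(w)

t = X @ w                                   # score vector with maximal covariance with y
print("cov(Xw, y) =", (t @ y) / I)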


2.1.4 Fisher Discriminant Analysis.

Fisher Discriminant Analysis (FDA) [Fisher 1936, Welling 2005] is a discriminant method used to find a linear combination of features which characterizes or separates two or more classes of objects. The resulting combination may be used as a linear classifier or for dimensionality reduction. FDA consists in finding a projection vector w that maximizes the ratio of the between-class variance to the within-class variance. The goal is to reduce the data variation within the same class and to increase the separation between classes [Hastie 2009]. Regularized FDA is defined by the optimization problem:

$$w^* = \arg\max_{w,\, \|w\|=1} \frac{w^\top S_B w}{w^\top S_T w + \lambda w^\top R w}, \qquad (2.1.16)$$
with
$$S_B = \sum_{c=1}^{C} \sum_{i \in \mathcal{C}_c} (m_c - m)(m_c - m)^\top, \qquad (2.1.17)$$
$$S_T = \sum_{c=1}^{C} \sum_{i \in \mathcal{C}_c} (x_i - m)(x_i - m)^\top, \qquad (2.1.18)$$

where $S_B$ is the between-class covariance matrix and $S_T$ the total covariance matrix, $m_c$ is the mean of class c, m is the global mean and $x_i$ the vector of observed variables. A regularization term $\lambda w^\top R w$, where the scalar λ is known as the regularization parameter and R as the regularization matrix, is added to improve the numerical stability when computing the inverse of $S_T$ in the high-dimensional setting (I ≪ JK). We write $S_B$ and $S_T$ in matrix notation since we will use these expressions in Section 3.4.2:

$$S_B = X^\top \left( Y (Y^\top Y)^{-1} Y^\top - \frac{1}{I} \mathbf{1}_{I \times I} \right) X = X^\top H_B X, \qquad (2.1.19)$$
$$S_T = X^\top \left( I_I - \frac{1}{I} \mathbf{1}_{I \times I} \right) X = X^\top H_T X, \qquad (2.1.20)$$

where $I_l$ is the identity matrix of dimension l and $\mathbf{1}_{I \times I}$ is the matrix of ones of dimension I × I.
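A minimal sketch of regularized FDA (2.1.16) with R = I on two simulated classes (hypothetical sizes and λ): the maximizer of the ratio is the leading eigenvector of the generalized eigenvalue problem $S_B w = \eta (S_T + \lambda R) w$:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
I, J, lam = 100, 30, 1e-1
y = rng.integers(0, 2, size=I)                       # two classes
X = rng.normal(size=(I, J)) + 1.5 * y[:, None]       # class-dependent mean shift

Y = np.eye(2)[y]                                     # dummy variable matrix
HB = Y @ np.linalg.inv(Y.T @ Y) @ Y.T - np.ones((I, I)) / I
HT = np.eye(I) - np.ones((I, I)) / I
SB, ST = X.T @ HB @ X, X.T @ HT @ X                  # (2.1.19)-(2.1.20)

R = np.eye(J)
# Leading generalized eigenvector of S_B w = eta (S_T + lam R) w
eta, W = eigh(SB, ST + lam * R)
w = W[:, -1]                                         # eigenvalues returned in ascending order
print("discriminant ratio:", (w @ SB @ w) / (w @ (ST + lam * R) @ w))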

2.1.5 Logistic Regression

Logistic regression, also called the logit model, is a commonly used statistical tool for predictive analysis. It is used to model dichotomous outcome variables, where the log-odds of the outcome are modeled as a linear combination of the predictor variables. In other words, it is used to describe data and to explain the relationship between one dependent variable and one or more independent variables [David 1994].

Let x be the vector containing the set of observed values for a given individual and y its class membership (y = 1 or 0). Knowing P(y = 1 | X = x) determines P(y = 0 | X = x), so we seek to model $\pi(x) = P(y = 1 | X = x)$. Logistic regression relies on the maximization of the conditional log-likelihood $\sum_{i=1,\ldots,n} \log P(y_i | x_i)$. A model for the conditional probabilities is used, under the hypothesis that the log-ratio of the conditional probabilities is linear:

$$\log \frac{\pi(x)}{1 - \pi(x)} = w_0 + w^\top x, \qquad (2.1.21)$$

with $w_0$ and w the model parameters. Using this model, the expression of the log-likelihood is:
$$\mathcal{L}(w_0, w) = \sum_{i=1}^{I} \left[ y_i \left( w_0 + w^\top x_i \right) - \log\left( 1 + \exp\left( w_0 + w^\top x_i \right) \right) \right]. \qquad (2.1.22)$$

The goal of LR is to find $w_0$ and w by maximizing the log-likelihood:
$$(\hat{w}_0, \hat{w}) = \arg\max_{w_0, w} \mathcal{L}(w_0, w). \qquad (2.1.23)$$

Let us write $\pi_i = \pi(x_i)$, let X be the matrix formed by a first column of constants equal to 1 and the J columns corresponding to the observed variables of the I subjects, and let V be the diagonal matrix with entries $\pi_i (1 - \pi_i)$.

Setting the derivatives of the log-likelihood with respect to the $w_j$ to zero leads to the score equations, which have no analytic solution. If we denote by $\pi$ the probability vector whose i-th element equals $\pi_i$, the J + 1 score equations can be written as:
$$U(w) = \frac{\partial \mathcal{L}}{\partial w} = X^\top (y - \pi). \qquad (2.1.24)$$

The solution is computed by maximizing the log-likelihood through the Newton-Raphson algorithm, which requires the second derivatives of the log-likelihood. Let H be the Hessian matrix of the log-likelihood, defined by:
$$H(w) = \frac{\partial^2 \mathcal{L}}{\partial w \partial w^\top} = -\sum_{i=1}^{I} x_i x_i^\top \pi_i (1 - \pi_i) = -X^\top V X. \qquad (2.1.25)$$

Since H is negative definite, the log-likelihood is a concave function whose maximum is reached by canceling the score vector U(w) obtained from the first derivatives. The Newton-Raphson method enables the numerical solution of the score equations; it constructs a sequence of iterates $w^{(s)}$ (with s referring to the s-th iteration) converging towards the maximum likelihood estimate.

The concavity of the log-likelihood has an important consequence: it guarantees that the algorithm converges to the global maximum of the likelihood, regardless of the initialization. To describe the s-th step, we start with the second-order Taylor expansion of $\mathcal{L}(w)$ around $w^{(s)}$:

$$\mathcal{L}(w) \approx \mathcal{L}(w^{(s)}) + \left[ U(w^{(s)}) \right]^\top (w - w^{(s)}) + \frac{1}{2} (w - w^{(s)})^\top H(w^{(s)}) (w - w^{(s)}). \qquad (2.1.26)$$

For $w^{(s+1)}$ we choose the value of w maximizing the right-hand side of (2.1.26), obtained by canceling its derivative:
$$U(w^{(s)}) + H(w^{(s)}) (w^{(s+1)} - w^{(s)}) = 0, \qquad (2.1.27)$$
so we obtain:

$$w^{(s+1)} = w^{(s)} - \left[ H(w^{(s)}) \right]^{-1} U(w^{(s)}). \qquad (2.1.28)$$

The Newton-Raphson algorithm converges towards an estimate $\hat{w} = (\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_J)$ maximizing the log-likelihood. Using equations (2.1.24) and (2.1.25), the current step of the Newton-Raphson algorithm can be written as:

$$w^{(s+1)} = w^{(s)} + (X^\top V X)^{-1} X^\top (y - \pi) \qquad (2.1.29)$$
$$\;\;\, = (X^\top V X)^{-1} X^\top V \left( X w^{(s)} + V^{-1} (y - \pi) \right) \qquad (2.1.30)$$
$$\;\;\, = (X^\top V X)^{-1} X^\top V z. \qquad (2.1.31)$$

The previous steps allow the Newton-Raphson step to be formulated as a weighted regression with response:
$$z = X w^{(s)} + V^{-1} (y - \pi). \qquad (2.1.32)$$

In the case of logistic regression, the Newton-Raphson algorithm is called IRLS (Iteratively Reweighted Least Squares) [Nabney 2004, Cawley 2007a]. The previous equations are solved recursively, since $\pi$ evolves at each iteration, as do V and z.

When the number of variables is larger than the number of observations, some problems might occur during the computation (for instance, the matrix H may not be invertible). For such cases, a regularization term of the form $\lambda w^\top R w$ is added to the log-likelihood (2.1.22), where the matrix R allows the introduction of an a priori and avoids numerical issues.

Considering this new term leads to the following criterion being maximized:
$$C(w_0, w, X, y, \lambda, R) = \sum_{i=1}^{n} \left[ y_i \left( w_0 + w^\top x_i \right) - \log\left( 1 + \exp\left( w_0 + w^\top x_i \right) \right) \right] - \lambda w^\top R w. \qquad (2.1.33)$$

Following the previous procedure, we obtain the following expression for the Newton-Raphson step for the regularized log-likelihood:
$$w^{(s+1)} = w^{(s)} + (X^\top V X + \lambda R)^{-1} \left( X^\top (y - \pi) - \lambda R w^{(s)} \right), \qquad (2.1.34)$$

where R is a regularization matrix, here taken to be the identity. Since the previous equation requires the inversion of a matrix of size J × J, computational problems may arise. A dual form of logistic regression exists that reduces the computational cost of the optimization problem [Cawley 2007a].
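A minimal sketch of the regularized IRLS update, following (2.1.34) as written with R = I (the intercept is handled as the first column of X, as above; in practice one would typically not penalize it, and the data and λ below are hypothetical):

import numpy as np

rng = np.random.default_rng(9)
I, J, lam = 200, 5, 1.0
X = np.hstack([np.ones((I, 1)), rng.normal(size=(I, J))])   # first column of ones
w_true = rng.normal(size=J + 1)
y = (rng.random(I) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)

R = np.eye(J + 1)
w = np.zeros(J + 1)
for _ in range(50):
    pi = 1 / (1 + np.exp(-(X @ w)))                 # current probabilities
    V = np.diag(pi * (1 - pi))
    # Newton-Raphson / IRLS step (2.1.34)
    step = np.linalg.solve(X.T @ V @ X + lam * R, X.T @ (y - pi) - lam * R @ w)
    w = w + step
    if np.linalg.norm(step) < 1e-8:
        break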


Dual LR. If we consider (2.1.34) with R = I:
$$w^{(s+1)} = (X^\top V X + \lambda I)^{-1} \left[ (X^\top V X + \lambda I) w^{(s)} + X^\top (y - \pi) - \lambda w^{(s)} \right] \qquad (2.1.35)$$
$$\;\;\, = (X^\top V X + \lambda I)^{-1} \left[ X^\top V X w^{(s)} + X^\top (y - \pi) \right] \qquad (2.1.36)$$
$$\;\;\, = (X^\top V X + \lambda I)^{-1} X^\top V \left[ X w^{(s)} + V^{-1} (y - \pi) \right]. \qquad (2.1.37)$$

Considering the equality $(P^{-1} + B^\top R^{-1} B)^{-1} B^\top R^{-1} = P B^\top (B P B^\top + R)^{-1}$, we then have:
$$w^{(s+1)} = X^\top (X X^\top + \lambda V^{-1})^{-1} \left[ X w^{(s)} + V^{-1} (y - \pi) \right], \qquad (2.1.39)$$

and with $w^{(s)} = X^\top \alpha^{(s)}$ and $X X^\top$ invertible we obtain:
$$X^\top \alpha^{(s+1)} = X^\top (X X^\top + \lambda V^{-1})^{-1} \left[ X X^\top \alpha^{(s)} + V^{-1} (y - \pi) \right], \qquad (2.1.40)$$

obtaining the following dual form:

$$\alpha^{(s+1)} = (X X^\top + \lambda V^{-1})^{-1} \left[ X X^\top \alpha^{(s)} + V^{-1} (y - \pi) \right]. \qquad (2.1.41)$$

Remark 2.1.3 With this dual formulation, only one matrix of size I × I has to be inverted at each iteration.


2.1.6 Cox Regression

In the context of survival analysis, the data consist of failure times, for which we seek to explain the probability function through observed covariates. Here we recall the Cox regression model from [David 1972].

This model studies the probability of failure with respect to the characteristics of an individual at different times t. For an individual i, we denote by $T_i$ its failure time. We also introduce the notion of censoring, or loss, $C_i$ of a given individual i: before the end of the experiment we may no longer be able to observe individual i without a failure having occurred, or at the end of the experiment the failure time may not have been observed. A dummy variable indicates whether the observation i was censored:

$$\delta_i = \mathbf{1}_{\{T_i \leq C_i\}}. \qquad (2.1.42)$$

When using survival models, we are interested in the failure rate $\alpha(t)$, i.e. the probability of failure at time t knowing that no event occurred before. For an individual whose failure time follows a probability density function f, the failure rate is given by:

$$\alpha(t) = \frac{f(t)}{S(t)}, \quad \text{with } S(t) = 1 - F(t). \qquad (2.1.43)$$
Then, the Cox model is defined by:

$$\alpha(t \mid x_i) = \alpha_0(t) \exp(x_i^\top w), \qquad (2.1.44)$$

with $\alpha_0(t)$ the baseline hazard function in the absence of covariates (X = 0), $x_i$ the vector of variables for individual i and w the vector of coefficients describing the influence of each covariate. Cox regression is a proportional hazards model, meaning that the dependence of the failure rate on the explanatory variables is captured by the term $\exp(x_i^\top w)$, which does not depend on time. The use of the partial likelihood $L_p(w)$, rather than the standard likelihood, avoids the need to know $\alpha_0(t)$ when computing the coefficients w (see [Cox 1975]):

$$L_p(w) = \prod_{i=1}^{I} \left\{ \frac{\exp(x_i^\top w)}{\sum_{i'=1}^{I} Y_{i'}(T_i) \exp(x_{i'}^\top w)} \right\}^{\delta_i}, \qquad (2.1.45)$$
where:


$$Y_i(t) = \begin{cases} 1 & \text{if subject } i \text{ is still at risk at time } t \ (T_i \geq t) \\ 0 & \text{otherwise.} \end{cases} \qquad (2.1.46)$$

The maximization of the likelihood $L_p(w)$ with respect to w is generally carried out through Newton-Raphson algorithms, which consist in canceling the partial score vector $U_p(w)$:
$$U_p(w) = \frac{\partial \log L_p(w)}{\partial w} = \sum_{i=1}^{I} \delta_i \left( x_i - \frac{\sum_{i'=1}^{I} Y_{i'}(T_i)\, x_{i'} \exp(x_{i'}^\top w)}{\sum_{i'=1}^{I} Y_{i'}(T_i) \exp(x_{i'}^\top w)} \right). \qquad (2.1.47)$$

Using the Newton-Raphson iteration (see Algorithm 1):
$$\hat{w}^{(q+1)} = \hat{w}^{(q)} + \mathcal{I}_I^{-1}\left( \hat{w}^{(q)} \right) U_p\left( \hat{w}^{(q)} \right), \qquad (2.1.48)$$

where $\mathcal{I}_I(w)$ is the matrix of second derivatives:
$$\mathcal{I}_I(w) = -\frac{\partial^2 \log L_p(w)}{\partial w \partial w^\top}. \qquad (2.1.49)$$

Algorithm 1: Standard Cox Regression
Require: ε > 0, $\hat{w}^{(0)}$, T, X, $t_{\text{censoring}}$
  q ← 0
  repeat
    Compute $U_p(\hat{w}^{(q)})$ and $\mathcal{I}_I(\hat{w}^{(q)})$ from T, X, $t_{\text{censoring}}$
    $\hat{w}^{(q+1)} = \hat{w}^{(q)} + \mathcal{I}_I^{-1}(\hat{w}^{(q)})\, U_p(\hat{w}^{(q)})$
    q ← q + 1
  until $\|\hat{w}^{(q)} - \hat{w}^{(q-1)}\| < \varepsilon$
  return $\hat{w}^{(q)}$
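To make the quantities in Algorithm 1 concrete, the following sketch evaluates the log partial likelihood log L_p(w) of (2.1.45) for given data (no ties between event times are handled, and all variable names and data are hypothetical):

import numpy as np

def cox_partial_loglik(w, X, T, delta):
    """log L_p(w) for covariates X (I x J), observed times T and
    event indicators delta (1 = failure observed, 0 = censored)."""
    eta = X @ w                                   # linear predictors x_i^T w
    loglik = 0.0
    for i in range(len(T)):
        if delta[i] == 1:
            at_risk = T >= T[i]                   # risk set: Y_{i'}(T_i) = 1
            loglik += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return loglik

rng = np.random.default_rng(10)
I, J = 50, 4
X = rng.normal(size=(I, J))
T = rng.exponential(size=I)                        # observed times
delta = rng.integers(0, 2, size=I)                 # censoring indicators
print(cox_partial_loglik(rng.normal(size=J), X, T, delta))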


2.1.7 Regularized Generalized Canonical Correlation Analysis

Multiblock analysis concerns the analysis of data structured in blocks of variables and has been widely studied in the literature [Tenenhaus 2011 and references therein]. The number and nature of the variables may differ from one block to another, but the individuals must be the same across blocks. Since the blocks are related in some way, the main objective with this type of data is to explore their relationships. In this framework, let us consider L data blocks $X_1, \ldots, X_l, \ldots, X_L$. Each block $X_l$ is of dimension $I \times J_l$ and represents a set of $J_l$ variables observed on I individuals, as shown in Figure 2.1.


Figure 2.1: Example of multiblock data.

Regularized Generalized Canonical Correlation Analysis (RGCCA) [Tenenhaus 2011] constitutes a general framework for many multiblock data analysis methods (refer to Tables 2.1 and 2.2 at the end of this section).

The objective of RGCCA is to find block components $y_l = X_l w_l$, $l = 1, \ldots, L$ (where $w_l$ is a column vector with $J_l$ elements and all variables are assumed centered) summarizing the relevant information within and between the blocks. It is defined by the following optimization problem:
$$\underset{w_1, \ldots, w_L}{\text{maximize}} \sum_{l, l' = 1:\, l \neq l'}^{L} c_{ll'}\, g\left( \mathrm{cov}(X_l w_l, X_{l'} w_{l'}) \right) \quad \text{s.t.} \quad (1 - \tau_l)\, \mathrm{var}(X_l w_l) + \tau_l \|w_l\|_2^2 = 1,\ l = 1, \ldots, L, \qquad (2.1.50)$$
where:

• The scheme function g is any continuous convex function and allows different optimization criteria to be considered. Typical choices of g are the identity (leading to maximizing the sum of covariances between block components), the absolute value (yielding maximization of the sum of the absolute values of the covariances) or the square function (thereby maximizing the sum of squared covariances).

• The design matrix $C = \{c_{ll'}\}$ is a symmetric matrix of nonnegative elements describing the network of connections between blocks that the user wants to take into account. Usually, $c_{ll'} = 1$ for two connected blocks and 0 otherwise.

• The $\tau_l$ are called shrinkage parameters, ranging from 0 to 1. Setting $\tau_l$ to 0 forces the block components to unit variance ($\mathrm{var}(X_l w_l) = 1$), in which case the covariance criterion boils down to the correlation. The correlation criterion is better at explaining the correlated structure across datasets, but discards the variance within each individual dataset. Setting $\tau_l$ to 1 normalizes the block weight vectors ($w_l^\top w_l = 1$), which applies the covariance criterion. A value between 0 and 1 leads to a compromise between the two first options and corresponds to the constraint $w_l^\top \left( \tau_l I + (1 - \tau_l)(1/n) X_l^\top X_l \right) w_l = 1$ in (2.1.50). The choices $\tau_l = 1$, $\tau_l = 0$ and $0 < \tau_l < 1$ are respectively referred to as Modes A, B and Ridge.

In optimization problem (2.1.50), the term "generalized" in the acronym RGCCA embraces at least three notions. The first relates to the generalization of two-block methods, including Canonical Correlation Analysis [Hotelling 1936], Interbattery Factor Analysis [Tucker 1958] and Redundancy Analysis [Van Den Wollenberg 1977], to three or more sets of variables. The second relates to the ability to take into account hypotheses on between-block connections: the user decides which blocks are connected and which are not. The third relies on the choice of the shrinkage parameters, allowing both correlation- and covariance-based criteria to be captured.

We also associate with each matrix $X_l$ a symmetric positive definite matrix $M_l$ of dimension $J_l \times J_l$. By setting $a_l = M_l^{1/2} w_l$ and $P_l = X_l M_l^{-1/2}$, the optimization problem (2.1.50) becomes:
$$\underset{a_1, \ldots, a_L}{\text{maximize}} \sum_{j, k = 1:\, j \neq k}^{L} c_{jk}\, g\left( \mathrm{cov}(P_j a_j, P_k a_k) \right) \quad \text{s.t.} \quad a_l^\top a_l = 1,\ l = 1, \ldots, L. \qquad (2.1.51)$$

A monotonically convergent algorithm (i.e. the bounded criterion to be maximized increases at each step of the procedure) is proposed in [Tenenhaus 2011 and Tenenhaus 2017] to solve optimization problem (2.1.51).


Remark 2.1.4 RGCCA is a rich technique that encompasses several important multivariate analysis methods. Through the choice of the scheme function and of the shrinkage parameters, according to the nature of the data and the goal of the analysis, it captures the family of methods presented in Tables 2.1 and 2.2.


Method | Scheme function g(x) | Shrinkage constants (τ_l) | Design matrix (C)
Principal Component Analysis [Hotelling 1933] | x | τ_1 = 1 (two copies of X_1) | c_12 = c_21 = 1
Canonical Correlation Analysis [Hotelling 1936] | x | τ_1 = 0, τ_2 = 0 | c_12 = c_21 = 1
Interbattery Factor Analysis [Tucker 1958] (or Partial Least Squares Regression (PLS) [Wold 1983]) | x | τ_1 = 1, τ_2 = 1 | c_12 = c_21 = 1
Redundancy Analysis of X_1 with respect to X_2 (RR) [Van Den Wollenberg 1977] | x | τ_1 = 1, τ_2 = 0 | c_12 = c_21 = 1
Generalized CCA [Carroll 1968a] | x^2 | τ_l = 0, l = 1, ..., L+1 | each block X_l connected to the superblock X_{L+1}
Generalized CCA [Carroll 1968b] | x^2 | τ_l = 0 for l = 1, ..., L_1 and l = L+1; τ_l = 1 for l = L_1+1, ..., L | each block connected to X_{L+1}
Hierarchical PCA [Wold 1996a] | x^4 | τ_l = 1, l = 1, ..., L; τ_{L+1} = 0 | each block connected to X_{L+1}
Multiple Co-Inertia Analysis [Chessel 1996], Consensus PCA [Westerhuis 1998a], CPCA-W [Smilde 2003], Multiple Factor Analysis [Escofier 1994] | x^2 | τ_l = 1, l = 1, ..., L; τ_{L+1} = 0 | each block connected to X_{L+1}

Table 2.1: Special cases of RGCCA in a situation of L blocks. X_{L+1} is defined as the concatenation of the L blocks.


Method | Scheme function g(x) | Shrinkage constants (τ_l) | Design matrix (C)
Sum of Correlations method [Horst 1961] | x | τ_l = 0, l = 1, ..., L | all off-diagonal entries equal to 1
Sum of Squared Correlations method [Kettenring 1971] | x^2 | τ_l = 0 | all off-diagonal entries equal to 1
Sum of Absolute Value Correlations method [Hanafi 2007] | |x| | τ_l = 0 | all off-diagonal entries equal to 1
Sum of Covariances method [Van de Geer 1984] | x | τ_l = 1, l = 1, ..., L | all off-diagonal entries equal to 1
Sum of Squared Covariances method [Hanafi 2006] | x^2 | τ_l = 1 | all off-diagonal entries equal to 1
Sum of Absolute Value Covariances method [Krämer 2007, Tenenhaus 2011] | |x| | τ_l = 1 | all off-diagonal entries equal to 1
MAXBET [Van de Geer 1984] (the SUMCOV criterion is a "one component per block" version of MAXBET) | x | τ_l = 1, l = 1, ..., L | all entries equal to 1 (including the diagonal)
MAXBET B [Hanafi 2006] (the SSQCOV criterion is a "one component per block" version of MAXBET B) | x^2 | τ_l = 1 | all entries equal to 1
PLS Path Modelling [Wold 1982] | |x| | τ_l = 0 | c_ll' = 1 for two connected blocks, 0 otherwise

Table 2.2: Special cases of RGCCA in a situation of L blocks.

2.2 Existing multiway extensions

In the following, we present the extensions of PCA to PARAFAC and of Partial Least Squares to N-PLS. Table 2.3 gives a snapshot of the methods that already have a multiway extension and of the methods that have no multiway equivalent yet. For the latter, multiway extensions will be proposed in Chapters 3 and 4.

Standard Method | Multiway extension
PCA | PARAFAC
PLS | N-PLS
FDA | ?
LR | ?
Cox Regression | ?
RGCCA | ?

Table 2.3: Summary of the standard methods already extended to the multiway context and of the ones we will focus on in the following chapters.

2.2.1 Principal Component Analysis to Parallel Factor Analysis

The PARAFAC model can be seen as a generalization of the PCA decomposition. It relies on the maximization of a variance criterion but explicitly takes into account the multiway nature of the data by imposing structure on the weight vectors.

In the PARAFAC model, the data is decomposed into triads or trilinear components, where each component consists of one score vector and two loading vectors. A three-way PARAFAC model can be described with three loading matrices A, B and C. This trilinear model is found by minimizing the sum of squares of the residuals of the model [Bro 1997a]:

$$X = a_1 \circ b_1 \circ c_1 + \ldots + a_R \circ b_R \circ c_R + E, \qquad (2.2.1)$$

where $\circ$ denotes the outer product. Thus, the elements $x_{ijk}$ of X can be expressed as follows:
$$x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} + e_{ijk}, \qquad (2.2.2)$$

where $e_{ijk}$ is a residual term, $a_{ir}$ is an element of the I × R matrix A, $b_{jr}$ an element of the J × R matrix B and $c_{kr}$ an element of the K × R matrix C.


The PARAFAC decomposition can be visualized in Figure 2.2.


Figure 2.2: PARAFAC decomposition of the tensor X. The vectors $a_1$, $b_1$ and $c_1$ are columns of the matrices A, B and C. Diagram from [Bro 1998].

Remark 2.2.1 PARAFAC decomposes X into R rank-1 arrays and a residual array E by minimizing $\|E\|^2$.

The model can be written in matrix notation as:
$$X_k = X_{..k} = A C_k B^\top + E_k, \quad k = 1, \ldots, K, \qquad (2.2.3)$$
where
$$A = [a_1, \ldots, a_R] \in \mathbb{R}^{I \times R}, \quad B = [b_1, \ldots, b_R] \in \mathbb{R}^{J \times R}, \quad C = [c_1, \ldots, c_R] \in \mathbb{R}^{K \times R} \qquad (2.2.4)$$
are called the component matrices, $X_k$ and $E_k$ are the k-th frontal slices of X and E, and $C_k$ is the diagonal R × R matrix with the k-th row of C on its diagonal.

Then the objective of PARAFAC is to minimize:
$$\|E\|^2 = \sum_{ijk} \left( x_{ijk} - \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} \right)^2, \quad \text{or equivalently} \quad \|E\|^2 = \sum_{k=1}^{K} \|X_k - A C_k B^\top\|^2. \qquad (2.2.5)$$
Equation (2.2.3) appears as a generalized form of PCA when substituting $C_k B^\top$ for $B^\top$.


The solution to the PARAFAC model can be found through an Alternating Least Squares (ALS) procedure [Yates 1933], as shown in Algorithm 2. In this iterative procedure, the matrices A, B and C are estimated by fixing two of them at a time at each iteration. At each iteration $\|E\|^2$ decreases, and the PARAFAC algorithm converges to a local solution. This divide-and-conquer approach simplifies the estimation of the parameters by means of simpler least squares problems.

Algorithm 2: PARAFAC algorithm
Require: random starting values for A, B and C
  repeat
    Step 1: Find the best A for fixed B and C
    Step 2: Find the best B for fixed A and C
    Step 3: Find the best C for fixed A and B
  until convergence criterion: $\|E_{\text{old}}\|^2 - \|E_{\text{new}}\|^2 < \varepsilon$

Since all the parameter estimates are least squares estimates, the associated loss function is bounded below by zero; for this reason, it is monotonically decreasing over the iterations [De Leeuw 1976]. However, the ALS algorithm has no guarantee of converging to a global minimum, and several kinds of initialization have been proposed in order to increase the chances that it does. [Law 1984] advocates the use of random starting values and running the algorithm from several different starting points: if the same solution is reached several times, there is little chance that a local minimum was reached due to an unfortunate initial guess. Good starting values can potentially speed up the algorithm and help ensure that the global minimum is found.
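A compact numpy sketch of Algorithm 2, where each least squares sub-problem is solved through the usual normal equations written with einsum. This is a didactic sketch with random initialization and hypothetical data, not an optimized implementation (dedicated packages such as tensorly provide production versions):

import numpy as np

def parafac_als(X, R, n_iter=200, seed=0):
    """Rank-R PARAFAC/CANDECOMP of a three-way array X (I x J x K) by ALS."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.normal(size=(d, R)) for d in (I, J, K))
    for _ in range(n_iter):
        # Each update solves a least squares problem with the other two factors fixed
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Build a noiseless rank-2 tensor and check that the model is recovered
rng = np.random.default_rng(1)
A0, B0, C0 = rng.normal(size=(6, 2)), rng.normal(size=(5, 2)), rng.normal(size=(4, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

A, B, C = parafac_als(X, R=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))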


2.2.2 Partial Least Squares to N-way Partial Least Squares

Multi-linear Partial Least Squares, or N-PLS, is a generalization of the classic PLS regression method to multiway data and was proposed by [Bro 1996b]. It relies on the maximization of a covariance criterion but adds Kronecker constraints on the weight vectors.

As in standard PLS, this algorithm relies on a criterion that maximizes the covariance between the components associated with X and y. We present the three-way case, but the method can be applied to n-way arrays. The solution is easier to interpret than that of standard methods, where unfolding is required.

Similarly to PARAFAC, this algorithm seeks to decompose the tensor $X$ into a set of triads. Each triad consists of a score vector $t$ of size $I$ and two weight vectors $w_J$ and $w_K$ of sizes $J$ and $K$ such that $\|w_J\|_2 = \|w_K\|_2 = 1$. N-PLS consists in finding $w_J$ and $w_K$ that satisfy:

$$(w_J, w_K) = \arg\max_{w_J, w_K} \operatorname{cov}(t, y) \quad \text{with} \quad \|w_J\|_2 = \|w_K\|_2 = 1, \qquad (2.2.6)$$

where $t = X w$ with $w = w_K \otimes w_J$, $X$ denoting here the unfolded $I \times JK$ matrix.

The optimization problem (2.2.6) can be expressed as:

$$\max_{w_J, w_K} \sum_{i=1}^{I} t_i y_i \quad \text{with} \quad t_i = \sum_{j=1}^{J} \sum_{k=1}^{K} x_{ijk} \, w_{Jj} \, w_{Kk}. \qquad (2.2.7)$$

So:

$$\max_{w_J, w_K} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} y_i \, x_{ijk} \, w_{Jj} \, w_{Kk} \qquad (2.2.8)$$

$$\max_{w_J, w_K} \sum_{j=1}^{J} \sum_{k=1}^{K} z_{jk} \, w_{Jj} \, w_{Kk}, \qquad (2.2.9)$$

where $z_{jk} = \sum_{i=1}^{I} y_i \, x_{ijk}$.

If we formulate equation (2.2.9) in terms of matrices we have:

$$\max_{w_J, w_K} \; w_J^\top Z w_K, \qquad (2.2.10)$$

where $Z$ is a matrix of size $J \times K$ constructed as $Z = \left[ X_{..1}^\top y, \dots, X_{..K}^\top y \right]$.

Finding $w_J$ and $w_K$ is done by computing an SVD of $Z$; this is a fast method with a low computational cost. The Kronecker constraint imposed on the weight vectors constitutes the core of the multiway generalization of the methods considered in the next chapter.
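The computation of the first pair of N-PLS weight vectors described above can be sketched as follows. The data are random placeholders and the unfolding convention $t = X (w_K \otimes w_J)$ is an assumption made for the example.

```python
import numpy as np

# Illustrative data: random placeholders standing in for a centered tensor X and response y.
rng = np.random.default_rng(0)
I, J, K = 13, 100, 7
X = rng.standard_normal((I, J, K))
y = rng.standard_normal(I)

# Z = [X_..1^T y, ..., X_..K^T y], i.e. z_jk = sum_i y_i * x_ijk  (J x K matrix)
Z = np.einsum('ijk,i->jk', X, y)

# wJ and wK are the first left and right singular vectors of Z (unit norm by construction);
# they maximize wJ^T Z wK = sum_i t_i y_i, proportional to cov(t, y) for centered data.
U, s, Vt = np.linalg.svd(Z)
wJ, wK = U[:, 0], Vt[0, :]

# Score vector t = X (wK kron wJ), computed directly on the tensor.
t = np.einsum('ijk,j,k->i', X, wJ, wK)
print(t.shape, round(np.linalg.norm(wJ), 6), round(np.linalg.norm(wK), 6))  # (13,) 1.0 1.0
```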


Contributions

Contents

3.1 Overview
  3.1.1 Kronecker constraint
  3.1.2 Regularization for multiway data
  3.1.3 Dataset description
3.2 Multiway Logistic Regression
  3.2.1 Algorithm
  3.2.2 Experiments
3.3 Multiway Cox Regression
  3.3.1 Algorithm
  3.3.2 Experiments
3.4 Multiway Fisher Discriminant Analysis
  3.4.1 Algorithm
  3.4.2 Computational issues
  3.4.3 Sparse Multiway Fisher Discriminant Analysis
  3.4.4 Experiments
3.5 Multiway Generalized Canonical Correlation Analysis
  3.5.1 Algorithm
  3.5.2 Experiments


3.1 Overview

This chapter contains the main contributions of the manuscript, which can be summarized as key elements that have been plugged into the methods described in the previous chapters.

3.1.1 Kronecker constraint

In order to account for the three-way structure of the data, we impose a new structural constraint on the optimization problems reviewed in Chapter 2. The main objective resides in respecting the original structure of the data and separating the weight vectors with respect to the different dimensions of the tensor.

Then, the weight vector w is modeled as the Kronecker product between a weight vector wK associated with the K modalities and a weight vector wJ associated with the J variables:

$$w = w_K \otimes w_J. \qquad (3.1.1)$$

This structural constraint results in a more parsimonious model ($J + K$ instead of $J \times K$ parameters to estimate), with the following advantages:

• the effects and contributions of the variables and of the modalities to the discrimination can be interpreted separately;

• the optimization problem can be decomposed into iterative sub-problems that use matrices of size $I \times J$ and $I \times K$.

Remark 3.1.1 Because of the definition of the Kronecker product, this constraint allows the original problems of size $I \times JK$ to be decomposed into a succession of iterations between problems of size $I \times J$ and $I \times K$. We will show that this decomposition may help to reduce the computational cost.
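The sketch below illustrates constraint (3.1.1) on a hypothetical random tensor: the constrained weight vector has only $J + K$ free parameters, and the linear predictor can be computed either from the unfolded matrix or directly on the tensor. The unfolding convention (modality index varying slowest) is an assumption made so that it matches $w = w_K \otimes w_J$.

```python
import numpy as np

# Hypothetical tensor, used only to illustrate the constraint (3.1.1).
rng = np.random.default_rng(0)
I, J, K = 13, 750, 7
X = rng.standard_normal((I, J, K))
wJ = rng.standard_normal(J)
wK = rng.standard_normal(K)
print(J + K, "parameters instead of", J * K)       # 757 instead of 5250

# Unfold X so that column k*J + j contains x_ijk, matching w = wK kron wJ.
X_unf = X.transpose(0, 2, 1).reshape(I, K * J)
w = np.kron(wK, wJ)

# The constrained linear predictor can be computed on the matrix or on the tensor.
t_matrix = X_unf @ w
t_tensor = np.einsum('ijk,j,k->i', X, wJ, wK)
print(np.allclose(t_matrix, t_tensor))             # True
```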


3.1.2 Regularization for multiway data

When the number of parameters to be estimated exceeds the sample size, the problem of estimating $w_K$ and $w_J$ may be poorly posed and may lead to computational errors and over-fitting. Regularization techniques have become essential for the solution of such ill-posed problems. To address this problem, the regularization parameter $\lambda$ together with the regularization matrix $R$ is introduced, as in (2.1.16), to avoid numerical issues.

In the context of multiway data, it is useful to consider the following additional structural constraint on the $R$ matrix:

$$R = R_K \otimes R_J, \qquad (3.1.2)$$

where $R_K$ is of dimension $K \times K$ and $R_J$ is of dimension $J \times J$. This constraint allows the penalties on $w_K$ and $w_J$ to be controlled separately, since:

$$w^\top R w = \left( w_K^\top R_K w_K \right) \left( w_J^\top R_J w_J \right). \qquad (3.1.3)$$

Moreover, to some extent, this Kronecker structure for $R$ better fits the three-way structure of the data and applies separate constraints to $w_K$ and $w_J$. Furthermore, (3.1.3) shows that the penalty applied to $w_K$ and $w_J$ remains quadratic, and thus the constraint does not increase the computational cost.
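A small numerical check of identity (3.1.3) could look as follows; the blocks $R_J$ and $R_K$ are arbitrary symmetric positive semi-definite matrices chosen for the example, and the full $JK \times JK$ matrix $R$ is formed here only for the verification, not in practice.

```python
import numpy as np

# Numerical check of (3.1.3) with arbitrary symmetric positive semi-definite blocks.
rng = np.random.default_rng(0)
J, K = 20, 7
MJ, MK = rng.standard_normal((J, J)), rng.standard_normal((K, K))
RJ, RK = MJ @ MJ.T, MK @ MK.T            # any symmetric PSD choice works (identity included)
wJ, wK = rng.standard_normal(J), rng.standard_normal(K)

R = np.kron(RK, RJ)                       # (JK x JK); built here only for the check
w = np.kron(wK, wJ)

lhs = w @ R @ w
rhs = (wK @ RK @ wK) * (wJ @ RJ @ wJ)     # separated penalties, as in (3.1.3)
print(np.allclose(lhs, rhs))              # True
```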

The value of $\lambda$ can be chosen through a cross-validation procedure. Besides the identity, other choices can be made to construct $R$ in order to introduce a priori information. In the following paragraph, we propose a grouping matrix $R$.


R matrix grouping variables

As mentioned, regularization functions can also serve as a way of encoding a priori information about the data. In cases where the data is highly structured, such as in medical imaging where different regions can be considered, it is possible to promote spatially smooth weight vectors.

A variety of group penalties have been proposed in the literature, the most recent ones being based on $\ell_1/\ell_2$ norms in order to introduce group sparsity; see for example [Bach 2012] and [Jenatton 2012]. In this section we propose to integrate a priori spatial structures within the regularization matrix $R$, taking into account groups of variables inherently present in the data. Figure 3.1 shows an example of how these groups are organized in the data.

Figure 3.1: Tensor X structure representation of the groups considered for the proposed R matrix with grouping variables. Variables are structured in non-overlapping groups: each individual is characterized by G groups of variables observed through K modalities.

Based on these considerations we propose a regularization term constructed in such a way that:

1. the weights associated with the variables of the same group are homogenized, without restraining the variations between variables of different groups;

2. it separates the influence of the variables from that of the modalities, through a Kronecker decomposition.

The group structure can be taken into account by appropriately choosing the $R_J$ matrix: we impose a penalty on the variations between two spatially close variables, such that:

$$w_J^\top R_J w_J = \sum_{g=1}^{G} \; \sum_{\substack{j, j' \in J_g \\ j \text{ and } j' \text{ neighbors}}} \left( w_{Jj} - w_{Jj'} \right)^2, \qquad (3.1.4)$$

This penalty homogenizes the weights associated with the variables of the same group by penalizing the variations within the group, without penalizing differences between the variables of two different groups. This results in a smoother $w_J$, since related variables are grouped and easier to locate, hence easing interpretability.

We note that the $R_J$ matrix has a block-diagonal structure (two variables from two different groups are not penalized by their weight difference) and that each block is itself sparse (two distant variables are not penalized), which allows for easy manipulation of the matrices $R_J$ and $R_J^{-1}$. Specific choices for $R_K$ and $R_J$ are made for the brain imaging application in Chapter 4.
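A possible way to build such a block-diagonal $R_J$ is sketched below for hypothetical groups of variables, where "neighbors" is taken to mean consecutive indices within a group (an assumption for the example). The resulting matrix is the block-diagonal graph Laplacian of the within-group neighborhood graph, so that $w_J^\top R_J w_J$ reproduces the sum in (3.1.4).

```python
import numpy as np

# Hypothetical non-overlapping groups of variables (indices chosen for illustration).
groups = [range(0, 5), range(5, 12), range(12, 20)]
J = 20

RJ = np.zeros((J, J))
for g in groups:
    idx = list(g)
    for j, j2 in zip(idx[:-1], idx[1:]):       # consecutive variables taken as neighbors
        RJ[j, j] += 1;  RJ[j2, j2] += 1
        RJ[j, j2] -= 1; RJ[j2, j] -= 1         # graph-Laplacian block for the group

# Check that wJ^T RJ wJ equals the sum of squared within-group neighbor differences (3.1.4).
wJ = np.random.default_rng(0).standard_normal(J)
penalty = sum((wJ[j] - wJ[j2]) ** 2
              for g in groups for j, j2 in zip(list(g)[:-1], list(g)[1:]))
print(np.allclose(wJ @ RJ @ wJ, penalty))      # True
```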


3.1.3 Dataset description

In order to achieve a better understanding of the usefulness of the proposed methods and the improvements achieved, we compare them to the ones presented in the previous chapter. We used a spectroscopy data set, which measures the wavelengths and intensity of scattered light from molecules, allowing for the identification of different molecules. The objective of this analysis is to discern whether there is a significant difference in the water content between two groups and to identify the most discriminative variables. Spectroscopy allows for very specific chemical identification, with each compound having a specific wavelength. For this application we focus on the water molecules, which have a wavelength of around 450. Figure 3.2 gives a scheme of how the data is organized.

Figure 3.2: Spectroscopy data showing the layers (Layer 1 to Layer 7) at which the measurements were taken (same layers for all subjects).

The data consist of I = 13 subjects divided into two groups of equal size. The spectroscopy measures are taken at L = 5 different time instances and at K = 7 different layers (7 spectra). Each spectrum corresponding to one layer contains J = 750 wavelengths. Initially we will consider only one time instance, resulting in a tensor X of dimension 13 × 750 × 7, whereas the corresponding unfolded matrix is of dimension 13 × 5250. Figure 3.3 illustrates the data for a subject from each class.
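For illustration, the following sketch organizes a hypothetical array with the same dimensions as this dataset and unfolds it into the 13 × 5250 matrix. The ordering of the unfolded columns (wavelength index varying fastest within each layer) is an assumption chosen to be compatible with $w = w_K \otimes w_J$; the real data is not reproduced here.

```python
import numpy as np

# Hypothetical array standing in for the spectroscopy tensor (subjects x wavelengths x layers).
I, J, K = 13, 750, 7
X = np.random.default_rng(0).standard_normal((I, J, K))

# Unfolded matrix of size 13 x 5250, columns ordered as (layer, wavelength),
# so that it is compatible with the Kronecker-structured weight vector w = wK kron wJ.
X_unf = X.transpose(0, 2, 1).reshape(I, J * K)
print(X.shape, X_unf.shape)   # (13, 750, 7) (13, 5250)
```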
