
Sparse high dimensional regression in the presence of colored heteroscedastic noise: application to M/EEG source imaging

Mathurin Massias

To cite this version: Mathurin Massias. Sparse high dimensional regression in the presence of colored heteroscedastic noise: application to M/EEG source imaging. Machine Learning [stat.ML]. Telecom Paristech, 2019. English. tel-02401628v2.

HAL Id: tel-02401628
https://tel.archives-ouvertes.fr/tel-02401628v2
Submitted on 20 Jan 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

NNT: 2019SACLT053

Sparse high dimensional regression in the presence of colored heteroscedastic noise: application to M/EEG source imaging

Doctoral thesis of Université Paris-Saclay, prepared at Inria and Télécom Paris.
Doctoral school no. 580: Sciences et technologies de l'information et de la communication (STIC).
Doctoral specialty: mathematics and computer science.
Thesis presented and defended in Palaiseau, on 04/12/2019, by Mathurin Massias.

Composition of the jury:
Gabriel Peyré, Directeur de Recherche, École Normale Supérieure (President, Reviewer)
Mark Schmidt, Associate Professor, University of British Columbia (Reviewer)
Nelly Pustelnik, Chargée de Recherche, ENS de Lyon (Examiner)
Olivier Fercoq, Maître de Conférences, Télécom Paris (Examiner)
Julien Mairal, Directeur de Recherche, Inria (Examiner)
Joseph Salmon, Professeur, Université de Montpellier (Advisor)
Alexandre Gramfort, Directeur de Recherche, Inria (Co-advisor)


Contents

1 Motivation and contributions
    1.1 Optimization for statistical learning
    1.2 The bio-magnetic inverse problem
    1.3 Contributions
    1.4 Publications

I - Faster solvers for sparse Generalized Linear Models

2 Faster solvers for the Lasso: screening, working sets and dual extrapolation
    2.1 Introduction
    2.2 Duality for the Lasso
    2.3 Gap Safe screening
    2.4 Working sets with aggressive gap screening
    2.5 Experiments
    2.6 Conclusion

3 Duality improvements for sparse GLMs
    3.1 Introduction
    3.2 GLMs, Vector AutoRegressive sequences and sign identification
    3.3 Generalized linear models
    3.4 Working sets
    3.5 Experiments
    3.6 Conclusion

II - Concomitant noise estimation for the M/EEG inverse problem

4 Concomitant estimation
    4.1 Introduction
    4.2 Heteroscedastic concomitant estimation
    4.3 Optimization properties of CLaR and SGCL
    4.4 An analysis of CLaR through smoothing
    4.5 Conclusion

5 Experimental validation
    5.1 Alternative estimators
    5.2 CLaR
    5.3 Preprocessing steps for realistic and real data
    5.4 Time comparison

Conclusions and perspectives

Appendices
    A Choice of parameters in Celer
    B Concomitant estimation

Bibliography


Acknowledgments

Woe to the man who is alone!

I first wish to thank my advisors for offering me this topic and for giving me the freedom to explore it following my own ideas. In particular, thank you Alexandre for teaching me (by example!) efficiency and the scooter principle, for passing on that famous initial velocity, and for then placing such great trust in me. Joseph, thank you for your rigor, your availability, our chalk sessions in Montpellier, and the constant attention you paid to my future in academic research. As the saying goes: thank you both for giving me the luxury of complaining about being too closely supervised.

I thank Gabriel Peyré and Mark Schmidt for accepting to review this thesis. I also thank Nelly Pustelnik, Olivier Fercoq and Julien Mairal for accepting to be part of my jury. I am very grateful to you, and I hope that reading this manuscript will be a pleasant experience.

Among the companions who made these three years so enjoyable, Pierre "Forgui" Laforgue would deserve a chapter of his own: a shared passion for gibolin and mostiquos, unabated for almost ten years! Thank you for your unfailing taste for all that is kalos kagathos, for cycling, for Aymé le mytho, the IØ, the capotos bombatos, the front pages of L'Équipe on the misdeeds of the VAR, Lil'B's recipes and relaxed crossed calls. You are never the last to pitch in like a mule, and death to the ratagasses. I must also single out the gourmet Kévin Elgui (posso?) who brightened the last two years of my thesis with his sense (posso?) of style, of the bon mot, of the triple calendar with moon phase, and of the Hermès cap. What a pity that you wear a size 42. You are my two abominable bons vivants, my camaïeu of reds, and I say see you soon under the fig tree in Aix or the wild geese. Let us hope all the same that Laf will show neither one of the 19 signs of low morale, nor other fish to fry.

It was a pleasure to share an office, moments of grace between the RER and the 91.06B, and conference trips with Pierre Ablin, worthy heir of Jacques Mayol. "Where are you guys?", and since the Saclay trauma had to be endured, I am glad it was with you. Before witnesses, I acknowledge your superiority at Geoguessr: you alone can smell the West Coast within seconds. Still in E306, thanks to my predecessor in the study of the Lasso, the ever calm and benevolent Eugène Ndiaye, who tried to teach me patience. Part of the work in this thesis was carried out jointly with Quentin Bertrand. Qobra, thank you for all those always constructive exchanges which allowed us to move forward so much together. I hope we will keep exploring these ideas, and others, for a long time to come; good luck with the NeuroImage.

Thanks to Lambert and Jache, my favorite youngsters: good luck for the end, all that remains for you is to learn how to lose. Tragic! Less young, Quentin and Charles, between axe throwing and horses, have been the true impact players of this thesis: the fresh blood in the sixtieth minute that makes the difference. At Télécom, thanks to Simon, Adil the man through whom scandal arrives, Anas, Sholom even though he does not put his name on his lab assignments, Kiki "mi-figue mi-mangue", Lucho the bandido, and Guillaume "Big Bob" Papa for his unshakeable resilience in the face of coinche drubbings. Thanks to the FIAP old-timers, Romain, Anna, Nicolas and Maël, for their good tips, which we missed so much afterwards. A warm thank you goes to the whole Parietal team for welcoming me after a few months of PhD, with special thoughts for Patricio, Antonia, Hamza, Thomas, Marine, Pierre, Jérôme and the others with whom I got to spend a bit of time outside of work. Thanks to Thomas Moreau for making our lives easier by leading the way, and to the inseparable La Tour and Tom Dupré, who always knew how to put things in perspective with mischief. I enjoyed very good moments in Seattle with Evguenii Chzhen and Vincent Roulet; I look forward to more of them elsewhere. Thanks to Matthieu Durut, Alain Durmus and Michal Valko for their invaluable research advice, which has served me enormously and will continue to do so for a long time. I express my gratitude to Taiji Suzuki for hosting me for three months in his team in Tokyo, and to Akiko Takeda, Michael Metel and Pierre-Louis Poirion for welcoming me the way they did. It was a real pleasure to get to work with Boris Muzellec afterwards: I owe you a lot. I thank the GdR ISIS, MOA and MIA, the STIC doctoral school, and the European Research Council for their financial support, which allowed me to spend such a privileged PhD.

I have a small thought for Gaïus and Raymond, for the comfort they can bring me in adversity. Thanks to Orts my little craquotte, to Célius the Régis of the hosts of these woods, and to Professor Benhamou for his masterly lessons. Finally, merski to Maud, for whom it cannot have been easy every day, for your incisive little analyses, your risotto, and the saving opening to other worlds. Buona fortuna.


Abstract

Understanding the functioning of the brain under normal and pathological conditions is one of the challenges of the 21st century. In the last decades, neuroimaging has radically affected clinical and cognitive neurosciences. Amongst neuroimaging techniques, magneto- and electroencephalography (M/EEG) stand out for two reasons: their noninvasiveness and their excellent time resolution. Reconstructing the neural activity from the recordings of magnetic fields and electric potentials is the so-called bio-magnetic inverse problem. Because of the limited number of sensors, this inverse problem is severely ill-posed, and additional constraints must be imposed in order to solve it. A popular approach, considered in this manuscript, is to assume spatial sparsity of the solution: only a few brain regions are involved in a short and specific cognitive task. Solutions exhibiting such a neurophysiologically plausible sparsity pattern can be obtained through $\ell_{2,1}$-penalized regression approaches. However, this regularization requires solving time-consuming, high-dimensional and non-smooth optimization problems with iterative (block) proximal gradient solvers. Additionally, M/EEG recordings are usually corrupted by strong non-white noise, which breaks the classical statistical assumptions of inverse problems. To circumvent this, it is customary to whiten the data as a preprocessing step, and to average multiple repetitions of the same experiment to increase the signal-to-noise ratio. Averaging measurements has the drawback of removing brain responses which are not phase-locked, i.e., which do not happen at a fixed latency after the stimulus presentation onset.

In this work, we first propose speed improvements for the iterative solvers used for the $\ell_{2,1}$-regularized bio-magnetic inverse problem. Typical improvements, screening and working sets, exploit the sparsity of the solution: by identifying inactive brain sources, they reduce the dimensionality of the optimization problem. We introduce a new working set policy, derived from the state-of-the-art Gap Safe screening rules. In this framework, we also propose duality improvements, yielding a tighter control of optimality and improving feature identification techniques. This dual construction extrapolates on an asymptotic Vector AutoRegressive regularity of the dual iterates, which we connect to manifold identification of proximal algorithms. Beyond the $\ell_{2,1}$-regularized bio-magnetic inverse problem, the proposed methods apply to the whole class of sparse Generalized Linear Models.

Second, we introduce new concomitant estimators for multitask regression. Along with the neural source estimation, concomitant estimators jointly estimate the noise covariance matrix. We design them to handle non-white Gaussian noise, and to exploit the multiple-repetition nature of M/EEG experiments. Instead of averaging the observations, our proposed method, CLaR, uses them all for a better estimation of the noise.

The underlying optimization problem is jointly convex in the regression coefficients and the noise variable, with a "smooth + proximable" composite structure. It is therefore solvable via standard alternate minimization, to which we apply the improvements detailed in the first part. We provide a theoretical analysis of our objective function, linking it to the smoothing of Schatten norms. We demonstrate the benefits of the proposed approach for source localization on real M/EEG datasets. Our improved solvers and refined modeling of the noise pave the way for a faster and more statistically efficient processing of M/EEG recordings, allowing for interactive data analysis and for approaches that scale to ever larger M/EEG datasets.


Notation

$\triangleq$ : equal by definition
$[d]$ : set of integers from $1$ to $d$ included
$\mathcal{Y}^{\mathcal{X}}$ : set of functions from $\mathcal{X}$ to $\mathcal{Y}$
$\mathbb{R}^{d_1 \times d_2}$ : set of real matrices of size $d_1$ by $d_2$
$\mathrm{Id}_n$ : identity matrix in $\mathbb{R}^{n \times n}$
$A_{i:}$ : $i$th row of matrix $A$
$A_{:j}$ : $j$th column of matrix $A$
$\mathrm{Tr}\, A$ : trace of $A \in \mathbb{R}^{d \times d}$, $\mathrm{Tr}\, A = \sum_{i=1}^d A_{ii}$
$A^\top$ : transpose of matrix $A$
$A^\dagger$ : Moore-Penrose pseudo-inverse of matrix $A$
$\mathrm{supp}(x)$ : support of $x \in \mathbb{R}^d$, $\{j \in [d] : x_j \neq 0\}$
$\|\cdot\|$ : Euclidean norm on vectors and matrices
$\|\cdot\|_0$ : $\ell_0$ pseudo-norm on vectors, $\|x\|_0 = |\mathrm{supp}(x)|$
$\|\cdot\|_p$ : $\ell_p$-norm on vectors, for $p \in [1, +\infty]$
$\mathcal{B}_p$ : unit ball of the $\ell_p$-norm
$\mathcal{S}^n_{++}$ : positive definite matrices of size $n \times n$
$\mathcal{S}^n_{+}$ : positive semidefinite matrices of size $n \times n$
$\|\cdot\|_{\mathcal{S},p}$ : Schatten $p$-norm on matrices, for $p \in [1, +\infty]$
$\mathcal{B}_{\mathcal{S},p}$ : unit ball of the Schatten $p$-norm
$\|\cdot\|_{2,1}$ : row-wise $\ell_{2,1}$ mixed norm on matrices, $\|A\|_{2,1} = \sum_{j=1}^p \|A_{j:}\|$
$\|\cdot\|_{2,\infty}$ : row-wise $\ell_{2,\infty}$ mixed norm on matrices, $\|A\|_{2,\infty} = \max_{j \in [p]} \|A_{j:}\|$
$\langle \cdot, \cdot \rangle_S$ : vector scalar product weighted by $S \in \mathcal{S}^n_{++}$, $\langle x, y \rangle_S = x^\top S y$
$\|\cdot\|_S$ : Mahalanobis matrix norm induced by $S \in \mathcal{S}^n_{++}$, $\|A\|_S = \sqrt{\mathrm{Tr}(A^\top S A)}$
$\|\cdot\|_2$ : spectral norm on matrices
$a \vee b$ : maximum of real numbers $a$ and $b$
$a \wedge b$ : minimum of real numbers $a$ and $b$
$(a)_+$ : positive part of $a \in \mathbb{R}$, $(a)_+ = a \vee 0$
$\mathrm{sign}(x)$ : sign of $x \in \mathbb{R}$, with $\mathrm{sign}(0) = 0$
$\odot$ : entrywise product between vectors, $(x \odot y)_j = x_j y_j$
$0$ : vector or matrix of zeros
$1$ : vector or matrix of ones
$\mathrm{ST}(x, \tau)$ : soft-thresholding of $x \in \mathbb{R}^d$ at level $\tau > 0$, $(\mathrm{sign}(x_j)(|x_j| - \tau)_+)_{j \in [d]}$
$\mathrm{BST}(A, \tau)$ : block soft-thresholding of $A \in \mathbb{R}^{d \times d'}$ at level $\tau > 0$, $(1 - \tau/\|A\|)_+ \cdot A$
$\Pi_{\mathcal{C}}$ : Euclidean projection onto the convex set $\mathcal{C}$
$\iota_{\mathcal{C}}$ : indicator function of the set $\mathcal{C}$ (Definition 1.6)
$f \,\square\, g$ : infimal convolution of $f$ and $g$ (Definition 1.7)
$f^*$ : Fenchel-Legendre transform of $f$ (Definition 1.8)
$\partial f$ : subdifferential of $f$ (Definition 1.12)
$\mathrm{dom}\, f$ : domain of $f$, $\{x : f(x) < +\infty\}$

Model specific:
$X \in \mathbb{R}^{n \times p}$ : design matrix
$x_i \in \mathbb{R}^p$ : $i$th row of the design matrix
$y \in \mathbb{R}^n$ : observation vector
$Y \in \mathbb{R}^{n \times q}$ : observation matrix in the multitask framework
$\Sigma \in \mathcal{S}^n_{++}$ : noise covariance matrix
$S \in \mathcal{S}^n_{++}$ : square root of the noise covariance matrix

For two matrices $S_1$ and $S_2$ in $\mathbb{R}^{n \times n}$, we write $S_1 \succeq S_2$ (resp. $S_1 \succ S_2$) for $S_1 - S_2 \in \mathcal{S}^n_+$ (resp. $S_1 - S_2 \in \mathcal{S}^n_{++}$). When we write $S_1 \succeq S_2$, we implicitly assume that both matrices belong to $\mathcal{S}^n_+$.

There is an obvious notation clash: $p$ refers both to the number of features and to the index of $\ell_p$ or Schatten $p$-norms; we ask for the reader's forgiveness. As much as possible, exponents between parentheses (e.g., $\beta^{(t)}$) denote iterates and subscripts (e.g., $\beta_j$) denote vector entries. We extend the small-$o$ notation to vector-valued functions in the following way: for $f : \mathbb{R}^n \to \mathbb{R}^n$ and $g : \mathbb{R}^n \to \mathbb{R}^n$, $f = o(g)$ if and only if $\|f\| = o(\|g\|)$, i.e., $\|f\|/\|g\|$ tends to 0 when $\|g\|$ tends to 0.
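The two thresholding operators $\mathrm{ST}$ and $\mathrm{BST}$ defined above are the proximal operators of $\tau\|\cdot\|_1$ and $\tau\|\cdot\|$, respectively. As a quick illustration, here is a direct numpy transcription of the two definitions (an illustrative sketch following the conventions of this table, not code from the thesis):

```python
import numpy as np

def ST(x, tau):
    """Soft-thresholding of x at level tau: (sign(x_j) (|x_j| - tau)_+)_j."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def BST(A, tau):
    """Block soft-thresholding of A at level tau: (1 - tau / ||A||)_+ * A,
    with ||.|| the Euclidean (Frobenius) norm, as in the table above."""
    norm = np.linalg.norm(A)
    if norm == 0.0:
        return np.zeros_like(A)
    return max(1.0 - tau / norm, 0.0) * A
```

For instance, $\mathrm{ST}$ shrinks every entry towards 0 by $\tau$ and zeroes the small ones, while $\mathrm{BST}$ either downscales the whole block or zeroes it entirely.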


1 Motivation and contributions

    "Andrea, what was Mamma like?"
    "Don't you remember her?"
    "I used to; now, not so much."

Contents
    1.1 Optimization for statistical learning
        1.1.1 Statistical learning
        1.1.2 Regularization and sparsity
        1.1.3 Convex optimization tools
    1.2 The bio-magnetic inverse problem
        1.2.1 Basis of M/EEG
        1.2.2 Solving the inverse problem
    1.3 Contributions
    1.4 Publications

1.1 Optimization for statistical learning

1.1.1 Statistical learning

Let $Z$ be a random variable in a domain $\mathcal{Z}$. A statistical learning task is to find, in a certain set of models $\mathcal{H}$ called the hypothesis class, the most suitable one. Formally, for a loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$, the best model minimizes the expected loss:
\[
\arg\min_{\phi \in \mathcal{H}} \mathbb{E}[\ell(\phi, Z)] \,. \tag{1.1}
\]
This framework encompasses tasks such as dimensionality reduction, classification, regression, clustering or feature selection (Shalev-Shwartz and Ben-David, 2014). In this thesis, we are interested in the prediction learning task: we wish to infer the relationship between a random variable $X$ and a target random variable $Y$, taking values in sets $\mathcal{X}$ and $\mathcal{Y}$ respectively. In that case, $Z = (X, Y)$, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, and the set of models $\mathcal{H}$ is a subset of $\mathcal{Y}^{\mathcal{X}}$, reflecting a priori knowledge about this dependency (Hastie et al., 2009). As a loss function, we use $\ell(\phi, (x, y)) = L(\phi(x), y)$, where $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ measures the discrepancy between two values in $\mathcal{Y}$. Problem (1.1) then becomes:
\[
\arg\min_{\phi \in \mathcal{H}} \mathbb{E}[L(\phi(X), Y)] \,. \tag{1.2}
\]

Unfortunately, this expectation is generally impossible to compute, since the joint distribution of $X$ and $Y$ is unknown. The most common setting is when a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i \in [n]} \in (\mathcal{X} \times \mathcal{Y})^n$ is available, comprising $n$ independent samples drawn from the joint distribution of $X$ and $Y$. The unavailable $\mathbb{E}[L(\phi(X), Y)]$ can be approximated by the empirical mean $\frac{1}{n}\sum_{i=1}^n L(\phi(x_i), y_i)$, and a proxy for the best model can be obtained in the Empirical Risk Minimization (ERM) framework, by solving:
\[
\arg\min_{\phi \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(\phi(x_i), y_i) \,. \tag{1.3}
\]
We refer to the case where $\mathcal{H} = \{x \mapsto h(x; \beta) : \beta \in \mathbb{R}^d\}$, with $h : \mathcal{X} \times \mathbb{R}^d \to \mathcal{Y}$ fixed, as the (finite dimensional) parametric case. In this case, learning the optimal function $\phi$ reduces to learning the optimal parameter vector $\beta$. We will focus on a particular class of parametric statistical models, called Generalized Linear Models (GLMs, introduced in McCullagh and Nelder 1989). For simplicity of presentation, we now assume that $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} \subset \mathbb{R}$ (the multitask framework used later, $\mathcal{Y} = \mathbb{R}^q$, can be addressed easily at the price of heavier notation).

First, let us introduce an exponential family, that is, a family of parametric probability densities (or mass functions) taking the form:
\[
\{f(\upsilon; \theta) = c(\upsilon) \exp(\eta(\theta)^\top T(\upsilon) - \kappa(\theta)) : \theta \in \Theta\} \,, \tag{1.4}
\]
and such that the support of $f(\cdot; \theta)$ does not depend on $\theta$. The function $\eta$ is called the natural parameter of the family, $T$ is the sufficient statistic, $\kappa$ the cumulant function, $\Theta$ the parameter space, and $c$ reflects the integrating measure. Exponential families provide a convenient unifying framework to analyze a variety of commonly used distributions: Gaussian, Poisson, Bernoulli, multinomial, exponential, etc.

Example 1.1 (Real Gaussian). If $\Upsilon \sim \mathcal{N}(\mu, \sigma^2)$, its density at $\upsilon \in \mathbb{R}$ is:
\[
\frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(\upsilon - \mu)^2}{2\sigma^2}\Big) = \frac{1}{\sqrt{2\pi}} \exp\Big(\frac{\mu}{\sigma^2}\upsilon - \frac{1}{2\sigma^2}\upsilon^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma\Big) \,, \tag{1.5}
\]
and it is easy to check that, with:
\[
\begin{cases}
c(\upsilon) = 1/\sqrt{2\pi} \,, \\
\theta = (\theta_1, \theta_2) = \big(\tfrac{\mu}{\sigma^2}, -\tfrac{1}{2\sigma^2}\big) \,, \\
\eta(\theta) = \theta \,, \quad T(\upsilon) = (\upsilon, \upsilon^2) \,, \\
\kappa(\theta) = -\tfrac{\theta_1^2}{4\theta_2} - \tfrac{1}{2}\log(-2\theta_2) = \tfrac{\mu^2}{2\sigma^2} + \log\sigma \,, \\
\Theta = \mathbb{R} \times \left]-\infty, 0\right[ \,,
\end{cases} \tag{1.6}
\]
Equation (1.5) fits the form of (1.4). Moreover, notice that:
\[
\nabla\kappa(\theta) = \Big(-\frac{\theta_1}{2\theta_2}, \frac{\theta_1^2}{4\theta_2^2} - \frac{1}{2\theta_2}\Big) = \big(\mu, \mu^2 + \sigma^2\big) = \mathbb{E}[T(\Upsilon)] \,. \tag{1.7}
\]
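The identity (1.7) lends itself to a quick numerical sanity check: sampling $\Upsilon$ and averaging $T(\Upsilon) = (\Upsilon, \Upsilon^2)$ should recover $\nabla\kappa(\theta)$. A short Monte Carlo sketch (numpy assumed; the chosen values of $\mu$ and $\sigma$ are arbitrary):

```python
import numpy as np

# Example 1.1 with arbitrary values: theta = (mu / sigma^2, -1 / (2 sigma^2))
mu, sigma = 1.5, 2.0
theta1, theta2 = mu / sigma**2, -1.0 / (2 * sigma**2)

# gradient of the cumulant function, as computed in (1.7)
grad_kappa = (-theta1 / (2 * theta2),
              theta1**2 / (4 * theta2**2) - 1.0 / (2 * theta2))

# Monte Carlo estimate of E[T(Upsilon)] = E[(Upsilon, Upsilon^2)]
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=500_000)
mc = (samples.mean(), (samples**2).mean())
```

With these values, `grad_kappa` equals $(\mu, \mu^2 + \sigma^2) = (1.5, 6.25)$ exactly, and `mc` matches it up to Monte Carlo error.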

Now consider that $\sigma$ is known; we can change the parametrization to:
\[
\begin{cases}
c(\upsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\big(-\frac{1}{2\sigma^2}\upsilon^2\big) \,, \\
\theta = \frac{\mu}{\sigma^2} \,, \\
\eta(\theta) = \theta \,, \quad T(\upsilon) = \upsilon \,, \\
\kappa(\theta) = \frac{\sigma^2}{2}\theta^2 \,, \\
\Theta = \mathbb{R} \,.
\end{cases} \tag{1.8}
\]
In that case, $T$ is the identity, i.e., $\Upsilon$ itself is a sufficient statistic, and one can easily check that $\kappa'(\theta) = \mathbb{E}[\Upsilon]$.

From now on, we restrict ourselves to distributions for which $T$ is the identity. It turns out that we always have¹ $\nabla\kappa(\theta) = \kappa'(\theta) = \mathbb{E}[\Upsilon]$, and $\kappa''(\theta) = \mathrm{Var}[\Upsilon] > 0$. This means that the mapping $\theta \mapsto \mu \triangleq \mathbb{E}[\Upsilon]$ is one-to-one, which allows us to parametrize the distribution not with $\theta$, but with $\mu$ (moment parametrization). In this situation, to postulate that $(X, Y)$ follows a GLM is to assume that for every $x \in \mathbb{R}^p$ the distribution of $Y \mid X = x$ (emphasis: not of $Y$) belongs to a common exponential family, and that the parameter $\mu \triangleq \mathbb{E}[Y \mid X = x]$ is equal to $\psi^{-1}(\beta^\top x)$ (or $\theta = (\kappa')^{-1} \circ \psi^{-1}(x^\top \beta)$), for a fixed parameter $\beta$ and a response function $\psi$. The hypotheses of a GLM are summarized by:
\[
f_{Y\mid X=x}(y; \beta) = c(y) \exp\Big(\eta\big((\kappa')^{-1} \circ \psi^{-1}(x^\top \beta)\big)\, y - \kappa\big((\kappa')^{-1} \circ \psi^{-1}(x^\top \beta)\big)\Big) \,. \tag{1.9}
\]
Equation (1.9) evidences the two choices which characterize a GLM: the exponential family, via $c$, $\eta$ and $\kappa$, and the response function $\psi$. To model the data, the choice of the exponential family usually depends on the nature of $Y$: continuous unbounded data can be modeled by a Gaussian, count data by a Poisson distribution, intervals by an exponential distribution, etc. The response function $\psi$ is usually chosen so that it matches the constraints on the exponential family's mean, but different choices can be made for the same $Y$. Note that there exists a canonical choice of $\psi$: $\psi = (\kappa')^{-1}$, resulting in $\theta = x^\top \beta$.
Performing parameter inference in this setting leads to a variety of popular losses for ERM, with the key property that the Maximum Likelihood Estimation (MLE) problem is a convex one (Pitman, 1936), and therefore so is the corresponding ERM. For the previously introduced dataset $\mathcal{D}$, denoting by $\theta^{(i)}$ the parameter of the distribution of $Y \mid X = x_i$, independence of the samples leads to a log-likelihood equal to:
\[
\ell(\theta^{(1)}, \ldots, \theta^{(n)} \mid \mathcal{D}) = \sum_{i=1}^n \log c(y_i) + \sum_{i=1}^n \big(\eta(\theta^{(i)})\, y_i - \kappa(\theta^{(i)})\big) \,. \tag{1.10}
\]
We can write this as a function of $\beta$ only, and the parameter $\hat\beta$ leading to the MLE $(\hat\theta^{(1)}, \ldots, \hat\theta^{(n)})$ is:
\[
\arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \kappa \circ (\kappa')^{-1} \circ \psi^{-1}(x_i^\top \beta) - \eta \circ (\kappa')^{-1} \circ \psi^{-1}(x_i^\top \beta)\, y_i \,, \tag{1.11}
\]
which is a finite dimensional, parametric instance of the ERM Problem (1.3) for $L(\hat y, y) = \kappa \circ (\kappa')^{-1}(\hat y) - \eta \circ (\kappa')^{-1}(\hat y)\, y$ and $\mathcal{H} = \{x \mapsto \psi^{-1}(\beta^\top x) : \beta \in \mathbb{R}^p\}$.

¹The general formula is $\nabla\kappa(\theta) = \mathrm{Jac}_\eta(\theta)^\top \mathbb{E}[T(\Upsilon)]$, where $\mathrm{Jac}_\eta(\theta)$ is the Jacobian matrix of $\eta$ at $\theta$.

Example 1.2 (Bernoulli variable). In this example, $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} = \{0, 1\}$. The probability mass function of a Bernoulli variable of mean $\mu \in \left]0, 1\right[$ is:
\[
p(y; \mu) = \mu^y (1 - \mu)^{1-y} = \exp\Big(y \log\frac{\mu}{1-\mu} + \log(1 - \mu)\Big) \,, \tag{1.12}
\]
which belongs to an exponential family, with:
\[
\begin{cases}
c(y) = 1 \,, \\
\theta = \log\frac{\mu}{1-\mu} \,, \\
\eta(\theta) = \theta \,, \quad T(y) = y \,, \\
\kappa(\theta) = \log(1 + e^\theta) = -\log(1 - \mu) \,, \\
\Theta = \mathbb{R} \,.
\end{cases} \tag{1.13}
\]
So if we postulate that $Y \mid X = x$ is a Bernoulli random variable of mean $\mu = \kappa'(x^\top \beta) = 1/(1 + e^{-x^\top \beta})$, we have a GLM with canonical response function, and Problem (1.11) can be written as:
\[
\arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \kappa(\beta^\top x_i) - \eta(\beta^\top x_i)\, y_i = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log(1 + \exp(x_i^\top \beta)) - y_i x_i^\top \beta \,, \tag{1.14}
\]
which is the logistic regression ERM. Alternatively, we could postulate that $Y \mid X = x$ is a Bernoulli random variable of mean $\mu = \Phi(x^\top \beta)$, with $\Phi$ the cumulative distribution function of a standard Gaussian. This is the probit model, with MLE accessible via:
\[
\arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log\Big(1 + \frac{\Phi(x_i^\top \beta)}{1 - \Phi(x_i^\top \beta)}\Big) - y_i \log\Big(\frac{\Phi(x_i^\top \beta)}{1 - \Phi(x_i^\top \beta)}\Big)
= \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n -y_i \log(\Phi(x_i^\top \beta)) - (1 - y_i) \log(1 - \Phi(x_i^\top \beta)) \,. \tag{1.15}
\]
The examples derived above explain the ubiquity of problems of the form:
\[
\arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n f_i(x_i^\top \beta) \,, \tag{1.16}
\]
which we consider in Part I. Finally, the following example is of primary importance for the multitask case which arises in our application.

Example 1.3 (Multivariate Gaussian). Let $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} = \mathbb{R}^q$. Consider the density of a multivariate Gaussian of mean $\mu \in \mathbb{R}^q$ and covariance $\Sigma \in \mathcal{S}^q_{++}$, evaluated at $z \in \mathbb{R}^q$:
\[
\begin{aligned}
\frac{1}{(2\pi)^{q/2} (\det\Sigma)^{1/2}} \exp\Big(-\frac{1}{2}(z - \mu)^\top \Sigma^{-1} (z - \mu)\Big)
&= \frac{1}{(2\pi)^{q/2}} \exp\Big(z^\top \Sigma^{-1}\mu - \frac{1}{2} z^\top \Sigma^{-1} z - \frac{1}{2}\mu^\top \Sigma^{-1}\mu - \frac{1}{2}\log\det\Sigma\Big) \\
&= \frac{1}{(2\pi)^{q/2}} \exp\Big(z^\top \Sigma^{-1}\mu - \frac{1}{2}\mathrm{Tr}\,\Sigma^{-1} z z^\top - \frac{1}{2}\mu^\top \Sigma^{-1}\mu - \frac{1}{2}\log\det\Sigma\Big) \,.
\end{aligned} \tag{1.17}
\]
It belongs to an exponential family, with $\theta = (\Sigma^{-1}\mu, -\frac{1}{2}\mathrm{Vec}\,\Sigma^{-1})$ and $T(z) = (z, \mathrm{Vec}\, z z^\top)$, where $\mathrm{Vec}$ is the column-wise vectorization operator.
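As a concrete illustration of the GLM losses above, the logistic regression ERM (1.14) can be minimized by plain gradient descent, since its gradient is $X^\top(\sigma(X\beta) - y)$ with $\sigma$ the sigmoid. A minimal numpy sketch (illustrative names, not code from this manuscript):

```python
import numpy as np

def logistic_loss(beta, X, y):
    """ERM objective of (1.14): sum_i log(1 + exp(x_i^T beta)) - y_i x_i^T beta."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def fit_logistic(X, y, n_iter=2000):
    """Plain gradient descent on (1.14); the gradient is X^T (sigmoid(X beta) - y)."""
    n, p = X.shape
    # step = 1 / L, with L = ||X||_2^2 / 4 a Lipschitz constant of the gradient
    step = 4.0 / np.linalg.norm(X, ord=2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ beta))) - y)
        beta -= step * grad
    return beta
```

The step size $4/\|X\|_2^2$ is the inverse of the gradient's Lipschitz constant, which guarantees monotone decrease of the objective.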

Finally, going back to Example 1.1 leads to the most well-known instance of MLE: Ordinary Least Squares (OLS). For $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} = \mathbb{R}$, let us postulate that:
\[
Y = X^\top \beta^* + E \,, \tag{1.18}
\]
where $E$ is a real-valued Gaussian of law $\mathcal{N}(0, \sigma^2)$, independent from $X$, and $\beta^* \in \mathbb{R}^p$ is the true parameter vector. Then, following Example 1.1, we have a GLM with:
\[
\begin{cases}
\theta = \frac{x^\top \beta}{\sigma^2} \,, \\
\eta(\theta) = \theta \,, \\
\kappa(\theta) = \frac{\sigma^2 \theta^2}{2} \,, \\
\psi(u) = u \,,
\end{cases} \tag{1.19}
\]
and the MLE derived in Problem (1.11) reads:
\[
\begin{aligned}
\arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \kappa(\theta^{(i)}) - \eta(\theta^{(i)})\, y_i
&= \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{\sigma^2} \sum_{i=1}^n \frac{1}{2}(x_i^\top \beta)^2 - x_i^\top \beta\, y_i \\
&= \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^\top \beta)^2
= \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2 \,,
\end{aligned} \tag{1.20}
\]
where $\theta^{(i)} = x_i^\top\beta/\sigma^2$, and where we have introduced the design matrix $X \triangleq (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$ and the observation vector $y \triangleq (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$. Dating back to Legendre and Gauss (Legendre 1805; Gauss 1809; see also Plackett 1972 for a discussion of this discovery), least squares are extremely popular, and optimal amongst linear unbiased estimators (they have the lowest variance in this class²), but they also suffer from defects in certain settings, which we detail in the following.

1.1.2 Regularization and sparsity

Consider $n$ realizations of the linear model $Y = X^\top\beta^* + E$, written in vector form as above: $y = X\beta^* + \varepsilon$ ($\varepsilon \in \mathbb{R}^n$ with entries i.i.d. $\mathcal{N}(0, \sigma^2)$, and $\sigma$ known). Assuming that $p \leq n$ and $\mathrm{rank}\, X = p$, the OLS estimator is uniquely defined and reads:
\[
\hat\beta^{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y = \beta^* + (X^\top X)^{-1} X^\top \varepsilon \,. \tag{1.21}
\]
The expected (averaged) prediction error is:
\[
\mathbb{E}\big[\tfrac{1}{n}\|X\hat\beta^{\mathrm{OLS}} - X\beta^*\|^2\big] = \mathbb{E}\big[\tfrac{1}{n}\|X(X^\top X)^{-1}X^\top \varepsilon\|^2\big] = \frac{p}{n}\sigma^2 \,, \tag{1.22}
\]
which does not go to 0 when $n \to +\infty$ unless $p = o(n)$. This is problematic, as we may want to consider cases where $p$ and $n$ go to infinity together and at the same speed.
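The value in (1.22) can be checked by simulation: $X(X^\top X)^{-1}X^\top$ is the orthogonal projection onto the column span of $X$, so $\|X(X^\top X)^{-1}X^\top\varepsilon\|^2$ follows $\sigma^2 \chi^2_p$, with expectation $\sigma^2 p$. A small Monte Carlo sketch (numpy assumed, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 10, 1.0
X = rng.normal(size=(n, p))      # fixed design, full column rank a.s.
P = X @ np.linalg.pinv(X)        # orthogonal projection onto span(X)

risks = []
for _ in range(5000):
    eps = sigma * rng.normal(size=n)
    risks.append(np.sum((P @ eps) ** 2) / n)
mc_risk = np.mean(risks)         # should be close to sigma^2 * p / n = 0.2
```

Since $\mathrm{Tr}\,P = p$, the averaged risk stays at $\sigma^2 p/n$ no matter how large $n$ is, as long as $p$ grows proportionally.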
Other issues arise for MLE when the number of parameters $p$ outgrows the number of samples $n$: the solution of Problem (1.20) is not unique, meaning that multiple models exist, all of them perfectly fitting the training data when $\mathrm{rank}\, X = n$. In that case, which model must be chosen? In addition, perfectly fitting the data is not necessarily a desirable property, as it may lead to poor generalization performance on new data; recent analyses (Hastie et al., 2019) may qualify this interpretation.

²Even when the noise is not Gaussian, provided the noise variance is constant across observations.

When data is scarce, empirical risk minimization turns out to be insufficient, and we turn to regularization: instead of looking for the model minimizing the sole datafitting criterion, some constraint is added to the optimization problem, this constraint reflecting prior belief about the desired model. In the case of least squares, with $R : \mathbb{R}^p \to \mathbb{R}$, this takes the form:
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2 \quad \text{s.t.} \quad R(\beta) \leq 0 \,. \tag{1.23}
\]
A seminal choice for $R$ is $\|\cdot\| - \tau$, with $\tau > 0$. It is noteworthy that the three following problems are equivalent:
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2 \,, \tag{1.24}
\]
\[
\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\| \leq \tau \,, \tag{1.25}
\]
\[
\min_{\beta \in \mathbb{R}^p} \|\beta\|^2 \quad \text{s.t.} \quad \|y - X\beta\| \leq \epsilon \,. \tag{1.26}
\]
Problems (1.24), (1.25) and (1.26) are respectively known as Tikhonov, Ivanov and Morozov regularization (Tikhonov, 1943; Ivanov, 1976; Morozov, 1984), and also as Ridge regression (Hoerl and Kennard, 1970). They are equivalent in the sense that for each (positive) value of one parameter amongst $\lambda$, $\tau$ and $\epsilon$, there exist values of the remaining two such that the three problems share the same solution. Each of these provides a different view on $\ell_2$ regularization: formulation (1.26) looks for the approximate solution to a linear system with the minimum $\ell_2$ norm, while formulation (1.25) looks for least squares solutions with constrained $\ell_2$ norm.
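Of the three formulations, (1.24) is the easiest to use in practice: setting its gradient to zero gives the closed-form solution $(X^\top X + \lambda\,\mathrm{Id}_p)^{-1} X^\top y$. A numpy sketch (illustrative function name), which also shows the Ivanov side of the equivalence, namely that increasing $\lambda$ shrinks $\|\hat\beta\|$:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of (1.24): (X^T X + lam * Id_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

One can verify the first-order optimality condition $X^\top(X\hat\beta - y) + \lambda\hat\beta = 0$ directly, and observe that larger $\lambda$ values correspond to smaller constraint radii $\tau$ in (1.25).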
Formulation (1.24) is the most widely employed, but perhaps the least explicit one: while it is clear that in the other formulations the norms can be squared or not in the objective functions, or squared in the constraints provided $\tau$ or $\epsilon$ are squared, it is not easy to see that, up to another choice of $\lambda$, Problem (1.24) retains the same solution if one of the squared norms is replaced by a plain norm. In this sense, it is somewhat misleading to talk about squared $\ell_2$ regularization: up to a change of value for $\lambda$, plain $\ell_2$ regularization has the same effect. It is only the geometry of the level lines of the regularizer which matters, and those are not affected by squaring. The square in Tikhonov regularization is only used for practical reasons, as it makes the regularizer smooth.

Tikhonov regularization limits the magnitude of the estimate. The type of regularization considered in this work is different: for reasons detailed in Section 1.2, we want to favor simple and interpretable solutions. Generally speaking, the idea that this kind of models should be preferred can be dated back to Ockham's Razor (Ockham, 1319), and in a modern paradigm to Wrinch and Jeffreys' simplicity principle: "It is justifiable to prefer a simple law to a more complex one that fits our observations slightly better." (Wrinch and Jeffreys, 1921)

In the context of GLMs, an application of this principle is to perform variable (or feature) selection: search for models which do not include all the potential $p$ variables, but only a small subset of them. In mathematical terms, the estimator $\hat\beta$ should be sparse: $\|\hat\beta\|_0 \ll p$. Sparse solutions provide more interpretable models, since it is clear that variables whose coefficients are 0 have no effect on the target variable.

[Figure 1.1: values of $\|\cdot\|_0$ (black) and $\|\cdot\|_1$ (grey) on $\mathcal{B}_\infty$ in dimension 2. The $\ell_1$ norm is greater than any other convex minorant of $\|\cdot\|_0$ on this set.]

The idea of favoring sparse models has been applied in various fields: finance (portfolio selection, Markowitz 1952), image processing (wavelet thresholding, Donoho and Johnstone 1994) or statistics (under the name of best subset selection; see Miller 2002 for a review). In geophysics, the work of Santosa and Symes (1986) stands out as the first instance of $\ell_1$-penalized least squares, central to this manuscript (earlier, Claerbout and Muir (1973) and Taylor et al. (1979) had used the $\ell_1$ norm both as datafitting term and regularizer).

However desirable sparse models may be, solving the underlying optimization problems is nontrivial. For example, instead of looking for the minimum $\ell_2$ norm approximate solution to a linear system as in Problem (1.26), one may seek the sparsest one by solving:
\[
\min_{\beta \in \mathbb{R}^p} \|\beta\|_0 \quad \text{s.t.} \quad \|y - X\beta\| \leq \epsilon \,. \tag{1.27}
\]
Unfortunately, Natarajan (1995) showed that this problem is NP-hard, and it is hardly tractable when $p$ is large, because the objective function $\|\cdot\|_0$ is not convex. Approximate solutions can nevertheless be computed. Forward and backward feature selection approaches exist (Efroymson, 1960), either starting from a null vector and iteratively adding the feature which improves the datafit the most, or starting with an OLS solution and progressively removing the features contributing least to the model; bidirectional approaches can both add and remove features (Zhang, 2011). This is, for example, the spirit of the Matching Pursuit (Mallat and Zhang, 1993) and Orthogonal Matching Pursuit (Pati et al., 1993; Tropp, 2006) algorithms.
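For concreteness, Orthogonal Matching Pursuit can be sketched in a few lines: greedily pick the feature most correlated with the current residual, then refit by least squares on the selected support. An illustrative numpy version (not a reference implementation):

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: add the feature most correlated with
    the residual, refit by least squares on the support, stop after k steps."""
    n, p = X.shape
    support, beta = [], np.zeros(p)
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef
        residual = y - X @ beta
    return beta
```

The refitting step is what distinguishes OMP from plain Matching Pursuit: after refitting, the residual is orthogonal to all selected columns, so a feature is never selected twice.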
Still, stepwise selection suffers from issues: small changes in the data can result in large differences in models and the estimator is not guaranteed to be sparse (Chen et al., 1998, Section 2.3.2). Another massively followed route to sparsity has been the use of convex surrogates for the `0 pseudo-norm, amongst which the `1 norm holds a place of choice. Indeed, as Figure 1.1 illustrates in dimension 2, the `1 norm is the largest convex minorant of the `0 pseudo-norm on the unit ball of the `∞ norm. The seminal sparse convex estimator,.

the Lasso (Tibshirani, 1996) (independently proposed by Chen and Donoho (1995) as Basis Pursuit Denoising), solves, for τ ≥ 0:

arg min_{β∈Rp} (1/2) ky − Xβk2 s.t. kβk1 ≤ τ . (1.28)

A more commonly used equivalent form of Problem (1.28) is, for a regularization parameter λ ≥ 0:

arg min_{β∈Rp} (1/2) ky − Xβk2 + λkβk1 . (1.29)

As for Tikhonov regularization, Problems (1.28) and (1.29) are equivalent³ in the sense that for any value of λ, there exists a value of τ such that the two solutions coincide, and vice versa. This (data dependent) mapping is again not explicit in general, and we will therefore focus on the form (1.29), which is easier to solve in practice. The respective impacts of `1 and (squared) `2 regularization are well illustrated in the so-called orthogonal design case, i.e., when X>X = Idp. In that case, the OLS solution is X>y, the Lasso solution is ST(X>y, λ) ≜ (sign(X:j>y) · (|X:j>y| − λ)+)_{j∈[p]}, and the Tikhonov solution is (1/(1+λ)) X>y. It is clear that Tikhonov regularization produces a downscaled version of the OLS estimate, but does not set coefficients to 0, while the Lasso sets OLS coefficients below λ in absolute value to 0, and shrinks the others by λ. The Lasso gave birth to numerous approaches, such as the Elastic Net (Zou and Hastie, 2005), sparse logistic regression (Koh et al., 2007), the group Lasso (Yuan and Lin, 2006), the sparse-group Lasso (Simon et al., 2013), the graphical Lasso (Friedman et al., 2008), the multitask Lasso (Obozinski et al., 2010), the square-root Lasso (Belloni et al., 2011) or nuclear norm penalization for matrices (Fazel, 2002; Argyriou et al., 2006).
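The orthogonal design comparison above can be checked numerically; the following sketch (with illustrative data and λ) contrasts the two shrinkages.

```python
import numpy as np

def soft_threshold(z, lam):
    # ST(z, lam): entries below lam in magnitude are set to 0, others shrunk by lam
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(0)
# orthogonal design: X^T X = Id_p, obtained from a QR factorization
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
X = Q[:, :10]
beta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.1 * rng.standard_normal(30)

z = X.T @ y  # OLS solution under orthogonal design
lam = 1.0
lasso = soft_threshold(z, lam)  # exact zeros + shrinkage by lam
tikhonov = z / (1 + lam)        # uniform downscaling, no exact zeros

print("Lasso support size:", np.count_nonzero(lasso))
print("Tikhonov support size:", np.count_nonzero(tikhonov))
```

With this noise level, the Lasso keeps only the three large coefficients while Tikhonov regularization leaves every coefficient nonzero, merely halved.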
These Lasso-type problems all have a convex formulation, and can be solved via a multitude of well-studied optimization algorithms: primal-dual methods (Chambolle and Pock, 2011), forward-backward splitting (Beck and Teboulle, 2010; Combettes and Pesquet, 2011), the Alternating Direction Method of Multipliers (Boyd et al., 2011), accelerated proximal gradient descent (Nesterov, 1983; Beck and Teboulle, 2009), or proximal block coordinate descent (Wu and Lange, 2008; Tseng and Yun, 2009). The convex approach has two benefits: it leads to fast algorithms with global convergence guarantees, and allows for an analysis of estimation consistency, prediction performance (Bickel et al., 2009; Negahban et al., 2010) and model consistency (Zhao and Yu, 2006). In Compressed Sensing, under some conditions, the `1 relaxation allows exact recovery of the `0 solution (Candès et al., 2006; Donoho, 2006). On the other hand, a notorious drawback is that the resulting estimates are biased in amplitude (Fan and Li, 2001), a bias which is easy to see on an orthogonal design. Alternative substitutes to the `0 penalty were proposed, for instance the Smoothly Clipped Absolute Deviation (SCAD, Fan and Li 2001), the Minimax Concave Penalty (MCP, Zhang 2010), `p pseudo-norms with 0 < p < 1 (Frank and Friedman 1993; Chartrand 2007 in Compressed Sensing), the log penalty (Candès et al., 2008) or CEL0 (Soubies et al., 2015). These penalties are usually called folded concave penalties because, coordinatewise, they are concave on R+ and symmetric w.r.t. the origin. The interested reader may refer to Huang et al. (2012) for a review of convex and non-convex approaches to feature selection. An appealing property of SCAD and MCP is that, although they are not convex, their proximal operators can be computed in closed form.

³ Note that this equivalence does not hold for the `0 penalty (Nikolova, 2016).

Solving other non-convex penalties

(log, square root) can be done by iterative reweighted `1 approaches (Zou, 2006; Candès et al., 2008; Gasso et al., 2009; Ochs et al., 2015), hence it remains of high interest, even in the non-convex setting, to have fast solvers for `1-type regularized problems.

Figure 1.2 – Optimal value of the Lasso (left) and Concomitant Lasso (right) regularization parameter λ, determined by cross validation on prediction error, for a logarithmic grid of 100 values of λ between λmax and λmax/100, as a function of the noise level on simulated values of y. As indicated by theory, the Lasso's optimal λ grows linearly with the noise level, while it remains constant for the Concomitant Lasso.

Finally, although they perform well theoretically and in practice, the non-convexity of these approaches often makes it difficult or impossible to find the exact solution: algorithms are sensitive to initialization, multiple local minima exist, and a global convergence criterion is lacking.

In addition to interpretability, sparsity comes with statistical benefits. In the OLS example, let us assume that β∗ is sparse, that its support S∗ of size s is known, and that a sparse model is obtained by setting the entries of β̂OLS outside S∗ to 0. Then, Equation (1.22) can be greatly improved:

E[(1/n) kX:S∗ (β̂OLS)S∗ − X:S∗ (β∗)S∗ k2] = σ2 s/n , (1.30)

which goes to 0 if s = o(n), without constraint on p. Of course, in practice S∗ is not known, but Lounici (2009) showed that under suitable conditions and for λ = Aσ√((log p)/n) with A > 2√2, the Lasso estimator satisfies:

E[(1/n) kX β̂ − Xβ∗ k2] = σ2 (s log s)/n , (1.31)

i.e., it only suffers a factor log s from not knowing S∗, which is a very appealing statistical guarantee. Yet this approach requires σ to be known and the noise to be homoscedastic, a situation seldom encountered in practice.
Since this bound is valid for λ = Aσ√((log p)/n), it also suggests that the optimal λ depends linearly on the unknown noise level. This is visible on Figure 1.2, where for a fixed design X (10 000 first columns of the climate dataset), we simulate y = Xβ∗ + σε for various σ, and for each σ we compute the optimal λ by cross-validation. The dependency indeed appears to be linear in practice. It is worth mentioning that Meinshausen and Bühlmann (2006) showed that prediction and variable selection conflict for the Lasso: the statistically optimal λ for prediction gives inconsistent variable selection results (see also Leng et al. (2006) for the orthogonal design). The Concomitant Lasso, square-root Lasso and Scaled Lasso estimators (Owen, 2007; Belloni et al., 2011; Sun and Zhang, 2012) achieve the same bound as Equation (1.31), with a regularization parameter independent of σ: in this thesis, we aim at generalizing

this approach to correlated Gaussian noise in the multitask framework. The statistical and practical benefits of sparsity have led to its use in many applications: audio processing (Zibulevsky and Pearlmutter, 2001), astrophysics, sparse coding (Olshausen and Field, 1997), medical imaging through compressed sensing (Donoho, 2006; Candès et al., 2006), genomics (Bleakley and Vert, 2011), time series analysis (Nardi and Rinaldo, 2011), etc. We now introduce the optimization tools involved in the study of Lasso-type problems.

1.1.3 Convex optimization tools

Throughout this manuscript, we will make extensive use of a convenient class of functions, based on the framework of Bauschke and Combettes (2011).

Definition 1.4 (Proper, lower semicontinuous convex functions). We denote by Γ0(Rd) the set of functions f : Rd → (−∞, +∞] which are:
• proper: dom f ≜ {x ∈ Rd : f(x) < +∞} ≠ ∅ ,
• lower semicontinuous: ∀x ∈ Rd, lim inf_{y→x} f(y) ≥ f(x) ,
• convex: ∀x, y ∈ Rd, ∀α ∈ [0, 1], f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) .

Since non-proper functions are of limited interest, in the sequel functions are assumed to be proper even if not explicitly stated. Strong convexity and smoothness are two function properties used to derive convergence guarantees and rates for minimization algorithms:

Definition 1.5 (Strong convexity and smoothness). Let f be a differentiable⁴ function from Rd to (−∞, +∞]. For µ, M > 0, we say that f is µ-strongly convex if:

∀x, y ∈ Rd, f(x) ≥ f(y) + ∇f(y)>(x − y) + (µ/2) kx − yk2 , (1.32)

and that f is M-smooth if:

∀x, y ∈ Rd, f(x) ≤ f(y) + ∇f(y)>(x − y) + (M/2) kx − yk2 . (1.33)

If f is both µ-strongly convex and M-smooth, we have µ ≤ M, and the condition number M/µ ≥ 1 is a useful quantity, appearing in the convergence rate of many algorithms.
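For the least squares datafit f(β) = (1/2) ky − Xβk2, the constants are µ = λmin(X>X) and M = λmax(X>X), and inequalities (1.32)–(1.33) can be checked numerically; a small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

f = lambda b: 0.5 * np.linalg.norm(y - X @ b) ** 2
grad = lambda b: X.T @ (X @ b - y)

# for this quadratic, mu and M are the extreme eigenvalues of X^T X
eigvals = np.linalg.eigvalsh(X.T @ X)
mu, M = eigvals[0], eigvals[-1]

for _ in range(100):
    x, z = rng.standard_normal(p), rng.standard_normal(p)
    lower = f(z) + grad(z) @ (x - z) + 0.5 * mu * np.linalg.norm(x - z) ** 2
    upper = f(z) + grad(z) @ (x - z) + 0.5 * M * np.linalg.norm(x - z) ** 2
    assert lower - 1e-8 <= f(x) <= upper + 1e-8  # (1.32) and (1.33) hold

print("condition number M/mu:", M / mu)
```

For a quadratic both inequalities are tight up to the curvature of X>X along x − z, which is exactly sandwiched between µ and M.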
Convex indicators, infimal convolution, the Fenchel-Legendre transform and proximal operators are the workhorses of continuous convex optimization.

Definition 1.6 (Indicator function). Let C be a subset of Rd. The indicator function of C is:

ιC : Rd → (−∞, +∞]
x ↦ 0 if x ∈ C , +∞ otherwise . (1.34)

⁴ A more general definition exists for strong convexity, not requiring differentiability.

Table 1.1 – Useful Fenchel transforms

Function        Fenchel transform
h∗              h, ∀h ∈ Γ0(Rd)                      (1.37)
g □ h           g∗ + h∗                             (1.38)
ah              ah∗(·/a), ∀a > 0                    (1.39)
k·kp            ιBp∗ , where 1/p + 1/p∗ = 1         (1.40)
h + δ           h∗ − δ, ∀δ ∈ R                      (1.41)
(1/2)k·k2       (1/2)k·k2                           (1.42)

We have that ιC ∈ Γ0(Rd) if and only if C is non-empty, closed and convex (Bauschke and Combettes, 2011, Examples 1.25 and 8.3).

Definition 1.7 (Infimal convolution). Let f and g be two functions from Rd to (−∞, +∞]. The infimal convolution of f and g is:

f □ g : Rd → [−∞, +∞]
x ↦ inf_{u∈Rd} f(x − u) + g(u) . (1.35)

Definition 1.8 (Fenchel-Legendre transform). Let f : Rd → [−∞, +∞]. Its Fenchel-Legendre transform, or conjugate, f∗ is defined as:

f∗ : Rd → (−∞, +∞]
u ↦ sup_{x∈Rd} u>x − f(x) . (1.36)

Note that f need not be convex, but f∗ always is. Frequently used Fenchel-Legendre transforms are recalled in Table 1.1 (see Bauschke and Combettes 2011, Propositions 13.16, 13.20 and 13.21 and Example 13.24 (iv) for proofs).

Proposition 1.9 (Smoothness and strong convexity linked by the Fenchel transform, Hiriart-Urruty and Lemaréchal 1993, Thm 4.2.1). Let f ∈ Γ0(Rd). Then, for γ > 0, f is γ-smooth if and only if f∗ is 1/γ-strongly convex.

Proposition 1.9 provides a way to transform a function into a smooth one, which we will use in Chapter 4. Given a non-smooth function f ∈ Γ0(Rd), we can add a strongly convex function ω to f∗, thus making it strongly convex, then take the Fenchel transform again to obtain a smooth function. Formally, the smooth approximation of f is (f∗ + ω)∗, which is also equal to f □ ω∗ by Equation (1.38). As illustrated on Figure 1.3, this technique is a possible construction for the famous Huber function, a smooth approximation of the absolute value function.

Definition 1.10 (Proximal operator). Let f ∈ Γ0(Rd). The proximal operator of f, introduced in the seminal work of Moreau (1965), is:

proxf : Rd → Rd
x ↦ arg min_{y∈Rd} (1/2) kx − yk2 + f(y) . (1.43)
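To make the smoothing construction concrete: with f = |·| and ω = ρ(·)²/2, we get ω∗ = (·)²/(2ρ), so f □ ω∗ is the Huber function. A brute-force numerical check of this infimal convolution (the grid and the value of ρ are illustrative):

```python
import numpy as np

rho = 0.5

def huber(x, rho):
    # closed form of (f* + rho (.)^2/2)* = |.| inf-convolved with (.)^2 / (2 rho)
    return np.where(np.abs(x) <= rho, x ** 2 / (2 * rho), np.abs(x) - rho / 2)

def envelope(x, rho, grid):
    # brute-force infimal convolution: inf_u |u| + (x - u)^2 / (2 rho)
    return np.min(np.abs(grid) + (x - grid) ** 2 / (2 * rho))

grid = np.linspace(-5, 5, 200001)
for x in [-2.0, -0.3, 0.0, 0.7, 3.0]:
    assert abs(huber(np.array(x), rho) - envelope(x, rho, grid)) < 1e-4

print("Huber matches the infimal convolution of |.| on all test points")
```

The minimizer in the infimal convolution is the soft-thresholding of x at level ρ, which is why the Huber function is quadratic near 0 and affine (with slope ±1) beyond ρ.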

Figure 1.3 – Various ways to smooth the absolute value function f, by adding a strongly convex term to f∗ = ι[−1,1]. Taking the Fenchel transform of the strongly convex functions gρ : u ↦ f∗(u) + ρu2/2 and hρ : u ↦ f∗(u) + ρ(u2/2 − 1/2) yields smooth approximations of f. As ρ increases, the approximations get smoother, but further away from f.

The two following proximal operators are extensively used in our work.

Proposition 1.11 (Proximal operators of the `1 and Euclidean norms, Bach et al. 2012, Section 3.3, p. 45). Let x ∈ Rd, A ∈ Rd×d′ and τ > 0. The proximal operators of τk·k1 and τk·k are respectively the soft-thresholding and block soft-thresholding operators:

proxτk·k1(x) = ST(x, τ) ≜ (sign(xj)(|xj| − τ)+)_{j∈[d]} , (1.44)
proxτk·k(A) = BST(A, τ) ≜ (1 − τ/kAk)+ · A . (1.45)

Definition 1.12 (Subdifferential). Let f : Rd → (−∞, +∞]. The subdifferential of f at x ∈ dom(f) is:

∂f(x) ≜ {u ∈ Rd : ∀y ∈ Rd, f(y) ≥ f(x) + u>(y − x)} , (1.46)

i.e., the set of slopes of all affine minorants of f which are exact at x. Elements of the subdifferential are called subgradients, and for a convex function f, the subdifferential is non-empty at every point of the relative interior of dom f. In some sense, subgradients generalize gradients: for a convex differentiable function, the subdifferential at x ∈ dom f has only one element: ∇f(x). Subdifferentiability allows first order optimality conditions to be generalized to non-differentiable convex functions.

Proposition 1.13 (Fermat's rule). Let f be a proper convex function. Then, for all x̂ ∈ Rd:

x̂ ∈ arg min_{x∈Rd} f(x) ⇔ 0d ∈ ∂f(x̂) . (1.47)

The notion of strong duality is extensively used in both Part I and Part II.

Proposition 1.14 (Fenchel duality, Rockafellar 1997, Thm.
31.3). Let f ∈ Γ0(Rn) and g ∈ Γ0(Rp). Let X ∈ Rn×p and λ > 0. The following problems are called respectively primal and dual problems:

β̂ ∈ arg min_{β∈Rp} P(β) , with P(β) ≜ f(Xβ) + λg(β) , (1.48)

θ̂ ∈ arg max_{θ∈Rn} D(θ) , with D(θ) ≜ −f∗(−λθ) − λg∗(X>θ) . (1.49)

Given β ∈ Rp and θ ∈ Rn, the duality gap is P(β) − D(θ) ≥ 0. Strong duality, i.e., P(β̂) = D(θ̂), holds if and only if:

−λθ̂ ∈ ∂f(Xβ̂) , (1.50)
X>θ̂ ∈ ∂g(β̂) , (1.51)

or equivalently:

−Xβ̂ ∈ ∂f∗(−λθ̂) , (1.52)
β̂ ∈ ∂g∗(X>θ̂) . (1.53)

These conditions are called Kuhn-Tucker conditions. It is worth mentioning that a sufficient condition for strong duality to hold is that the relative interiors of the domains of f and g intersect, which happens for example if neither f nor g takes the value +∞.

Definition 1.15 (Block indexing). Let β ∈ Rp, B ∈ Rp×q and X ∈ Rn×p. Let I denote a partition of [p] and let I ∈ I. The vector βI ∈ R|I| is obtained by keeping only the entries of β whose indices are in I. In the multitask setting, BI: ∈ R|I|×q (resp. X:I ∈ Rn×|I|) is the matrix obtained by keeping only the rows of B (resp. columns of X) whose indices are in I. For an L-smooth function f ∈ Γ0(Rd), ∇I f ∈ R|I| is the gradient of f when only coordinates in I vary, and LI is the Lipschitz constant of this gradient (which exists because f is itself L-smooth).

Proposition 1.16 (Proximal operator of a separable function, Parikh et al. 2013, Sec. 2.1). Let I be a partition of [d] and g ∈ Γ0(Rd) be a function admitting a block decomposable structure: g(x) = Σ_{I∈I} gI(xI). Then, proxg(x) is equal to the vector obtained by concatenation of the vectors proxgI(xI) ∈ R|I|, for I ∈ I.

Definition 1.17 (“Smooth + proximable” composite problem). Let f ∈ Γ0(Rd) be L-smooth, and g ∈ Γ0(Rd) be such that proxg can be computed exactly. We call the optimization problem:

min_{x∈Rd} f(x) + g(x) , (1.54)

a “smooth + proximable” composite problem.
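Instantiating Proposition 1.14 for the Lasso (f = (1/2)ky − ·k2, g = k·k1) gives a computable suboptimality certificate: rescaling the residual yields a dual feasible point. A numpy sketch, with illustrative data and λ, using the dual form D(θ) = (1/2)kyk2 − (λ2/2)kθ − y/λk2 over {θ : kX>θk∞ ≤ 1}:

```python
import numpy as np

def lasso_gap(X, y, beta, lam):
    """Duality gap for P(beta) = 0.5 ||y - X beta||^2 + lam ||beta||_1,
    using the rescaled residual as a dual feasible point."""
    r = y - X @ beta
    primal = 0.5 * r @ r + lam * np.sum(np.abs(beta))
    # rescale the residual so that ||X^T theta||_inf <= 1 (dual feasibility)
    theta = r / max(lam, np.max(np.abs(X.T @ r)))
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.linalg.norm(theta - y / lam) ** 2
    return primal - dual

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 50))
y = rng.standard_normal(30)
lam = 0.5 * np.max(np.abs(X.T @ y))  # lam = lam_max / 2

# crude proximal gradient iterations shrink the gap toward 0
L = np.linalg.norm(X, ord=2) ** 2
beta = np.zeros(50)
for _ in range(500):
    z = beta - X.T @ (X @ beta - y) / L
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

gap = lasso_gap(X, y, beta, lam)
print("duality gap:", gap)
```

By weak duality the gap is nonnegative for any β, and it upper bounds P(β) − P(β̂), which makes it a rigorous stopping criterion even though β̂ is unknown.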
In general, non-smooth convex optimization is harder than smooth convex optimization, in the sense that the worst case convergence rate for first order (subgradient) methods is O(k−1/2) (Goffin, 1977), as opposed to O(k−1) and O(k−2) for (possibly accelerated) first order methods in the smooth case (Nesterov, 1983). But for problems presenting a smooth + proximable structure, which are legion in Machine Learning, one need not worry: they can be solved with proximal gradient methods, with the same optimal rates as gradient methods – up to linear rates when f is strongly convex (Beck and Teboulle, 2009). In this thesis, we have a particular interest in solving instances of Problem (1.54), using two algorithms: proximal gradient descent and proximal block coordinate descent (Combettes and Pesquet, 2011). We recall them in Algorithms 1.1 and 1.2. Although both algorithms have worst case convergence rates of O(1/k), Figure 1.4 illustrates that practical results can be very different. Amongst the reasons

explaining this disparity, the algorithms may take different times to identify the support (and they enjoy a linear convergence rate after support identification), or have different constants hidden in the O(·). This motivates our preference for coordinate descent throughout the manuscript.

Figure 1.4 – Convergence speed of proximal gradient descent and proximal coordinate descent on a Lasso problem (subsampled climate dataset (n = 857, p = 10 000), λ = λmax/20, resulting in kβ̂k0 = 220). Although both algorithms have the same convergence rates, proximal coordinate descent outperforms gradient descent by several orders of magnitude.

Algorithm 1.1 Proximal gradient descent for Problem (1.54)
  input: L, T
  init: β(0)
  for t = 1, . . . , T do
      β(t) = prox_{(λ/L)g}(β(t−1) − (1/L) ∇f(β(t−1)))
  return β(T)

Algorithm 1.2 Cyclic proximal block coordinate descent for Problem (1.54)
  input: {LI}I∈I, T
  init: β(0)
  for t = 1, . . . , T do
      β(t) = β(t−1)
      for I ∈ I do
          βI(t) = prox_{(λ/LI)gI}(βI(t) − (1/LI) ∇I f(β(t)))
  return β(T)

Equipped with this mathematical background, we now move to our application focus: the bio-magnetic inverse problem.

1.2 The bio-magnetic inverse problem

1.2.1 Basis of M/EEG

Brain imaging modalities can be divided into two categories: indirect and direct approaches. Indirect approaches, such as near infrared spectroscopy or positron emission tomography, detect brain activity by measuring a correlated physical quantity,

e.g., metabolic activity.

Figure 1.5 – Schematic view of a pyramidal neuron, by the author. The name pyramidal comes from the shape of the soma.

The best-known indirect brain imaging modality is functional magnetic resonance imaging (fMRI), which measures the hemodynamic response, i.e., the delivery of blood to active neuronal tissues. The main feature of fMRI analysis is its excellent spatial resolution, ranging from 4 mm to 0.5 mm for the most recent MRI scanners (Duyn, 2012; Huber et al., 2017). However, because the hemodynamic response is slow and lags in time, the time resolution of fMRI is only around 1 s, making it impractical for the study of dynamic brain processes. On the contrary, electroencephalography (EEG), magnetoencephalography (MEG), intracranial electroencephalography (iEEG, also known as electrocorticography) and stereotaxic electroencephalography (sEEG) are direct approaches, which record the electrical potentials or magnetic fields generated by the activity of the brain. Because they directly measure the quantity of interest, their temporal resolution is excellent: around 1 ms. For this reason, they are widely used to localize foci of epilepsy, or to map brain areas to be excluded from surgical removal, e.g., those associated with speech or movement. Though very accurate, iEEG and sEEG are also highly invasive: iEEG requires a craniotomy in order to insert a grid of electrodes inside the brain, and small openings must be drilled in the skull to insert sEEG electrodes in the brain. In this manuscript, we focus on magneto- and electroencephalography, which in contrast stand out by their low invasiveness.

What do M/EEG measure? A neuron consists of a cell body (the soma) and dendrites; neurons can be connected to other neurons via axons. As visible on Figure 1.5, a pyramidal neuron possesses an apical dendrite.
Each neuron maintains a varying electrical potential at its soma's membrane, due to ionic concentration differences within the cell. This potential can trigger an action potential, traveling along the axon to connected neighbors, with an excitatory or inhibitory effect on the receiving cell. When a neuron receives such a pulse, Excitatory Post-Synaptic Potentials (EPSPs) are generated at its apical dendritic tree (Gloor, 1985). The resulting potential difference between the apical dendritic tree on one side, and the membrane of the soma and basal dendrites on the other, causes primary electrical currents to travel intracellularly in the dendrite, from the former to the latter. As a consequence, secondary (or volume) currents travel extracellularly in the head tissue, closing the current loop. These currents can be modeled by a current dipole, oriented along the dendrite. The typical moment of such a dipole is very small: 20 fA.m, and neural activity is only made measurable

because of two phenomena.

Figure 1.6 – Left: sagittal MRI view of the brain. Right: pyramidal cells as drawn by Ramon y Cajal (1899), with dipoles modeling currents from apical dendrites to somas (orange). The parallel alignment of the dipoles results in constructive interference, modeled as an Equivalent Current Dipole of larger moment (red).

First, as shown on Figure 1.6, the columnar functional organization of the cortex causes large groups of pyramidal neurons to have parallel alignment of apical dendrites, and thus EPSP-associated currents traveling in the same direction. Second, the long duration of the post-synaptic potentials makes them likely to overlap over a synchronized group of neurons. The axonal action potentials, on the other hand, are not detected by M/EEG: their current flows are in opposite directions and are too brief to interfere constructively (Nunez and Srinivasan, 2006). This spatio-temporal superposition allows primary and secondary currents to interact constructively, adding up to 50 nA.m, a threshold high enough to be measured extracranially by M/EEG (Murakami and Okada (2015) estimate that the critical population comprises at least 10 000 neurons, corresponding to a cortical patch of 25 mm2). The currents in such a neuron population can be modeled at the macroscopic scale by an Equivalent Current Dipole (ECD), which is the sum of all the current dipoles in the patch of synchronized neurons. EEG measures the difference of potentials between electrodes and a reference: around 60 electrodes are used, positioned at standard locations on the scalp to allow for reproducibility of recordings. The acquired potentials are of the order of 10 µV. MEG is a somewhat more refined technique than EEG: the measured magnetic fields are of the order of 10 fT, seven or eight orders of magnitude smaller than the Earth's magnetic field.
Recording such small values is only made possible by the use of magnetic shielding and superconductivity-exploiting magnetometers (superconducting quantum interference devices, SQUIDs). Along with magnetometers, gradiometers measuring the spatial gradient of the magnetic field are used to reduce the sensitivity to interference. There are usually around 200 magnetometers and 100 gradiometers, isolated in a liquid helium cooled vacuum flask, which places MEG sensors farther away from the neural sources than EEG sensors.

Figure 1.7 – Patient undergoing a cognitive experiment in a MEG scanner. Courtesy of the National Institute of Mental Health.

These technical differences are reflected in historical landmarks: while the first EEG was recorded in 1924 by Hans Berger, the first MEG recording was performed by David Cohen in 1968 (Cohen, 1968), and it is only in the nineties that the first full head MEG devices were used. To this day, MEG is still more expensive to operate than EEG, because of the magnetic shielding and the liquid helium needed for superconductivity in the sensors. The two techniques are complementary: EEG is sensitive to both radial and tangential dipoles, whereas MEG is insensitive to radial sources, but has a higher signal-to-noise ratio and can use more sensors. Instead of using only MEG or EEG, pooling electrodes, magnetometers and gradiometers allows the origin of brain activity to be located more accurately, for example in the case of epilepsy (Aydin et al., 2015). In our experimental setup, the patient undergoes repetitions of the same simple stimulation (sensory, cognitive or motor) for a short period of time. Neuronal activity can then be divided into two categories: spontaneous and event-related activity. Event-related activity is triggered by the stimuli, and is either evoked if the response is phase-locked with respect to the stimuli, or induced otherwise. Despite the sophisticated sensors and shielding, M/EEG suffers from a poor signal-to-noise ratio (SNR); among corrupting factors are eye movements, heartbeats and other muscle activity, head movement, sensor drift and ambient electromagnetic noise (Gross et al., 2013). Various signal preprocessing techniques are used to increase the SNR (Parkonnen, 2010; Gross et al., 2013): spectral filtering, signal decomposition via Independent Component Analysis (Makeig et al., 1996; Ablin et al., 2018) or Signal Space Separation.
Another mandatory step to increase the SNR is to average several repetitions (called trials) of the experiment with the same patient. As shown in Figure 1.8, as more and more trials are averaged, the signal becomes smoother, and the brain response to the stimulus, at t = 0.1 s, appears once the SNR is high enough. The averaging procedure preserves phase-locked responses, but removes induced responses, hence the need for more refined solvers taking into account all the trials and not only their average.
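A purely synthetic illustration of why averaging reveals the evoked response: the same response plus independent noise in each trial makes the averaged noise shrink like 1/√(number of trials). All signal shapes and amplitudes below are illustrative, not real M/EEG data.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-0.05, 0.25, 300)
evoked = 5.0 * np.exp(-((t - 0.1) / 0.02) ** 2)  # evoked response near t = 0.1 s

def averaged_snr(n_trials):
    # each trial: identical evoked response + independent sensor noise
    trials = evoked + 10.0 * rng.standard_normal((n_trials, t.size))
    mean = trials.mean(axis=0)
    peak = np.abs(mean[np.argmin(np.abs(t - 0.1))])
    baseline_std = mean[t < 0].std()  # pre-stimulus segment as a noise estimate
    return peak / baseline_std

for n in (5, 50, 500):
    print(n, "trials -> SNR ~", round(averaged_snr(n), 1))
```

The pre-stimulus segment plays the role of the noise-only data mentioned below for covariance estimation.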

Figure 1.8 – Amplitude of 59 EEG signals, averaged across 5 (top), 10 (middle), and 50 (bottom) trials. As the number of averaged repetitions increases, the noise is reduced and the measurements become smoother, revealing the brain response around 0.1 s after the stimulus occurred. The stimulus is an auditory stimulation in the left ear.

Figure 1.9 – Covariance of the three types of sensors (left: magnetometers, middle: gradiometers, right: electrodes). The covariance matrices are clearly not scalar: the EEG covariance has a band diagonal structure, and the magnetometers covariance has a block structure.

Apart from averaging data, another critical preprocessing step is spatial noise whitening (Engemann et al., 2015). For the raw measurements, the noise is far from white, for example because brain noise correlations exist between neighboring sensors, as shown in Figure 1.9. To decorrelate the noise, a spatial whitening step is applied during preprocessing, based on an estimate of the noise covariance matrix. This covariance can be estimated in multiple ways: from empty room measurements if only MEG is used, or empirically using pre-stimulus data, considered as raw noise. When the analysis is performed over both EEG and MEG, spatial whitening also allows the different units (µV, fT and fT.m-1) to be harmonized. Empirical estimation being imperfect, various regularization techniques such as shrinkage (Ledoit and Wolf, 2004) have been proposed.
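A minimal sketch of the whitening step, assuming noise-only segments are available as an array; the simple diagonal shrinkage below is a crude stand-in for the Ledoit-Wolf estimator cited above, and is set to 0 in the demo so that whitening is exact on the training segments.

```python
import numpy as np

def spatial_whitener(noise_segments, shrinkage=0.1):
    """Build a whitening matrix W = C^{-1/2} from noise samples
    (n_samples, n_sensors), with optional diagonal shrinkage of the
    empirical covariance toward a scalar matrix."""
    cov = np.cov(noise_segments, rowvar=False)
    n_sensors = cov.shape[0]
    cov = (1 - shrinkage) * cov + shrinkage * (np.trace(cov) / n_sensors) * np.eye(n_sensors)
    eigval, eigvec = np.linalg.eigh(cov)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

rng = np.random.default_rng(0)
# synthetic correlated sensor noise: 5000 samples on 60 sensors
A = rng.standard_normal((60, 60))
noise = rng.standard_normal((5000, 60)) @ A.T
W = spatial_whitener(noise, shrinkage=0.0)
whitened = noise @ W.T
print(np.round(np.cov(whitened, rowvar=False)[:3, :3], 2))  # approximately identity
```

After whitening, the empirical covariance of the segments used to build W is the identity, so a homoscedastic white-noise model becomes a reasonable working assumption for the solvers downstream.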
In their extensive review, Engemann and Gramfort (2015b) showed that there is no single best approach, and devised an automatic way to select the best method on a case-by-case basis. The non-invasiveness of M/EEG comes at a price: the electrical activity is not measured directly at its location in the brain, but outside the scalp, and is thus transformed by the head tissues. Determining the causal factors (brain activity) from a set of observations they produced (the electromagnetic measurements outside the head) is an inverse problem, which can be solved in many ways..

1.2.2 Solving the inverse problem

Historically, three kinds of methods have emerged: parametric, scanning and imaging ones. They share the same goal: to determine which areas of the brain are involved in a cognitive task, and how these areas interact together.

Parametric methods. Dipole fitting (Scherg and von Cramon, 1985) models the brain activity by a fixed small number of ECDs, whose varying locations and amplitudes are estimated via gradient descent or simulated annealing (Uutela et al., 1998). Sequential dipole fitting estimates the dipole parameters one by one; for more than one dipole the optimization process – non-linear least squares – is generally non-convex and thus sensitive to initialization. It may also be difficult to correctly estimate the number of dipoles a priori (possible approaches are based on ICA, PCA or SVD, as in Kobayashi et al. (2002); Koles and Soong (1998); Huang et al. (1998)), and sequential approaches may fail in the presence of correlated or overlapping source activity.

Scanning methods. Beamforming techniques (Van Veen et al., 1997) and signal classification ones (MUSIC, RAP-MUSIC, Mosher et al. 1999; Mosher and Leahy 1999) use a predefined grid of potential locations. They apply spatial filters to evaluate the contribution of each source. Like dipole fitting techniques, beamforming fails when sources are correlated (Robinson and Vrba, 1999), and requires a covariance to be estimated from short signals. MUSIC and its derivatives are greedy approaches and as such suffer from a high sensitivity to the data.

Imaging methods. In distributed source imaging, dipoles are fitted simultaneously at a set of locations defined a priori. First, this requires solving the bio-electromagnetic forward problem: determining the sensor measurements given a distribution of internal currents. By Maxwell's equations, the measurements are a linear function of the dipoles' activities.
In an ideal noiseless setting, if we postulate a discrete grid of ECD locations in the brain (the source model), then the noiseless measurements Y∗ ∈ Rn×q and the true parameter matrix B∗ ∈ Rp×q are linked by:

Y∗ = XB∗ . (1.55)

Each row of Y∗ is the activity of a sensor – a time series of length q – while B∗ contains p time signals, each one corresponding to the activity of one neural source. Given the source model and a realistic geometrical model of the patient's head and of the conductivities of the tissues involved (the head model), solving the forward problem (computing X) is achieved with a numerical solver based on finite element or boundary element methods. Given muscle activity, spontaneous brain activity and sensor noise, a realistic model is the multitask regression one:

Y = Y∗ + E = XB∗ + E . (1.56)

The typical orders of magnitude for Model (1.56) are n ≈ 100 sensors, q ≈ 100 time instants, p ≈ 10 000 neural sources. This makes the problem ill-posed in the sense of Hadamard: it cannot be solved directly without further assumptions. For example, Ordinary Least Squares yields an infinity of solutions; it is still possible to use the one with minimal Frobenius norm, (X>X)†X>Y, but it is highly sensitive to noise. Using Tikhonov regularization leads to a unique and more stable solution, the Minimum Norm Estimate (Hämäläinen and Ilmoniemi, 1994), which is very fast to compute. Alas, it

produces dense neural estimates, with activity smeared over all sources, making it unfit to identify clearly localized brain activity. Other notable dense methods are dSPM (Dale et al., 2000) and sLORETA (Pascual-Marqui et al., 2002). All of these are linear: in various ways, each computes a kernel K ∈ Rp×n, and the estimate is KY. On the contrary, sparse methods are usually non-linear. Sparse Bayesian Learning is a notable sparse approach (Wipf et al., 2008; Haufe et al., 2008), with algorithms such as γ-MAP (Wipf and Nagarajan, 2009) and full MAP (Lucka et al., 2012). These rely heavily on covariance estimation and are therefore not really “parameter free”, contrary to what one could think. Straightforwardly applying the Lasso to the multitask framework (i.e., penalizing the sum of the absolute values of the coefficients of B) yields a sparse estimate, known in neuroscience as the Selective Minimum Norm or Minimum Current Estimate (Matsuura and Okabe, 1995; Uutela et al., 1999). It is easy to show that this approach amounts to solving q Lasso problems independently; as visible on Figure 1.10, the support of the estimate varies from one time instant to the next, which is not plausible.

Figure 1.10 – Sparsity patterns obtained by MCE/Lasso (left) and `2,1/MxNE/Multitask Lasso (right). MCE does not yield a consistent set of active sources over time.

To produce a consistent set of active sources over time, it has been proposed to use group penalties (Ou et al., 2009) imposing joint sparsity over time, yielding the Mixed Norm Estimate (MxNE) (Gramfort et al., 2012):

B̂ ∈ arg min_{B∈Rp×q} (1/2) kY − XBk2 + λkBk2,1 , (1.57)

a problem known in the optimization community as the Multitask Lasso (Obozinski et al., 2010), also an instance of the group Lasso (Yuan and Lin, 2006). More refined formulations based on MxNE have been proposed, for example using non-convex penalties (ir-MxNE, Strohmeier et al.
2016) or a sparse group Lasso formulation in the time-frequency domain to produce non-stationary activations (TF-MxNE, Gramfort et al. 2013); we refer the reader to Strohmeier (2016) for a very clear presentation of the topic. In this family of estimators, MxNE remains the building block of stable spatio-temporal source reconstruction.

The setting of Problem (1.57) assumes a fixed orientation: the orientation of each dipole is fixed across time (with a direction usually chosen normal to the cortical mantle). The only quantity to estimate for a dipole is then its magnitude. We may also consider free orientation, where the dipoles are allowed to rotate in time: at each time instant, a dipole is represented by its coordinates in a basis of 3 orthogonal

Figure 1.11 – Real (top) and simulated (bottom) magnetometer topographic maps (time instants from 0.050 s to 0.130 s; field values in fT). We simulate the activity of two dipoles in the left and right auditory cortices; the real topographic maps exhibit dipolar patterns similar to the simulated ones, justifying the dipolar assumption.

vectors. In this setting, X ∈ R^{n×3p} and B ∈ R^{3p×q}. Using a mixed ℓ2,1 penalty would bias the estimates towards the axes of the orthogonal basis used, which is arbitrary. A formulation leading to an orientation-unbiased solution is:

    B̂ ∈ arg min_{B ∈ R^{3p×q}} (1/2) ‖Y − XB‖² + λ Σ_{g=1}^{p} ‖B_{Gg:}‖ ,        (1.58)

where Gg = {3g, 3g + 1, 3g + 2}, and thus B_{Gg:} ∈ R^{3×q} contains the 3g-th, (3g + 1)-th and (3g + 2)-th rows of B. This penalty encourages B_{Gg:} to be zero, but is isotropic, as a change of orthonormal basis does not affect ‖B_{Gg:}‖.

1.3 Contributions

The organization of this manuscript is as follows. Each chapter can be read independently, and thus features a short introduction which may be redundant with this introductory chapter: the reader should feel free to skip them. Part I is devoted to the design of faster solvers for ℓ1-type regularized ERM:

• In Chapter 2, we design an efficient solver for the seminal Lasso estimator. We first describe the two main techniques used to speed up proximal gradient and coordinate descent solvers for sparse GLMs: screening rules and working set policies. Both remove non-significant variables from the optimization problem, making it smaller, hence faster to solve. In a backward fashion, screening rules prune the set of features, progressively reducing the number of variables. On the contrary, working sets are forward techniques which solve a sequence of growing subproblems, including more and more variables.
We show that screening rules can be used aggressively to design working set policies, and introduce aggressive screening, a working set policy based on the state-of-the-art Gap Safe screening rules. Screening and working sets both rely on duality. We exhibit the Vector AutoRegressive (VAR) behavior of the Lasso dual iterates, when the primal problem is solved with proximal coordinate descent or proximal gradient descent. We exploit.
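To make the screening idea above concrete, here is a minimal sketch of the Gap Safe test for the Lasso, min_w (1/2)‖y − Xw‖² + λ‖w‖₁: from any primal point, a dual feasible point is obtained by rescaling the residuals, the duality gap yields a safe sphere radius, and any feature whose score falls strictly below 1 is provably inactive at the optimum. This is an illustrative NumPy implementation on toy data, not the solvers developed in this thesis; all variable names are ours.

```python
import numpy as np

def gap_safe_screen(X, y, w, lam):
    """Gap Safe test for the Lasso: min_w 0.5 ||y - X w||^2 + lam ||w||_1.

    From the primal point w, a dual feasible point theta is built by
    rescaling the residuals; the duality gap gives a safe sphere radius r,
    and feature j can be safely discarded whenever
    |X_j^T theta| + r * ||X_j|| < 1.
    """
    R = y - X @ w                                   # residuals
    theta = R / max(lam, np.max(np.abs(X.T @ R)))   # dual feasible point
    primal = 0.5 * R @ R + lam * np.abs(w).sum()
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    radius = np.sqrt(2 * max(primal - dual, 0)) / lam
    scores = np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0)
    return scores < 1.                              # True -> discard feature

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = rng.standard_normal(50)
lam = 0.9 * np.max(np.abs(X.T @ y))   # close to lam_max: very sparse solution
inactive = gap_safe_screen(X, y, np.zeros(200), lam)
```

Run from w = 0 with λ close to λ_max, the test already removes most features; rerunning it along the iterations of a solver, as the gap shrinks, removes more and more.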

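Returning to the estimators discussed earlier in this chapter: the Minimum Norm Estimate for Model (1.56) has a closed form obtained from Tikhonov regularization, B̂ = X^⊤(XX^⊤ + λI_n)^{−1}Y, i.e., a linear kernel K ∈ R^{p×n} applied to Y. A minimal NumPy sketch, with illustrative toy sizes:

```python
import numpy as np

def minimum_norm_estimate(X, Y, lam):
    """Tikhonov-regularized solution of Y = X B.

    Closed form: B = X^T (X X^T + lam * I_n)^{-1} Y.  Solving with the
    n x n Gram matrix (n ~ 100 sensors) is much cheaper than the p x p
    alternative (p ~ 10 000 sources); the linear operator
    K = X^T (X X^T + lam * I_n)^{-1}, of shape (p, n), is the kernel
    mentioned in the text.
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

rng = np.random.default_rng(0)
n, p, q = 20, 100, 5                 # toy sizes: sensors, sources, times
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
B_hat = minimum_norm_estimate(X, Y, lam=1.0)   # shape (p, q)
```

One can check that B̂ satisfies the p×p normal equations (X^⊤X + λI_p)B̂ = X^⊤Y, so the cheap n×n formulation returns the same estimate; as discussed above, the solution is unique and stable but dense over all sources.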
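For completeness, the Multitask Lasso / MxNE problem (1.57) can be solved by block coordinate descent, each update being a block soft-thresholding of one row of B — the same operator that, applied to the row groups B_{Gg:}, handles the free-orientation penalty of (1.58). The following is a minimal NumPy sketch on synthetic data, not the solvers developed in this thesis; sizes and names are illustrative.

```python
import numpy as np

def block_soft_thresh(u, tau):
    """Proximal operator of tau * ||.||_2: shrink the row u toward 0,
    zeroing it entirely when its norm is below tau (joint sparsity)."""
    norm = np.linalg.norm(u)
    return np.zeros_like(u) if norm <= tau else (1. - tau / norm) * u

def multitask_lasso_bcd(X, Y, lam, n_iter=200):
    """Block coordinate descent for 0.5 ||Y - X B||_F^2 + lam ||B||_{2,1}.

    Each update exactly minimizes the objective over one row of B; the
    residual matrix R = Y - X B is kept up to date so that each feature
    update costs O(n q).
    """
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    R = Y.copy()
    norms2 = (X ** 2).sum(axis=0)          # ||X_j||^2 for each feature
    for _ in range(n_iter):
        for j in range(p):
            if norms2[j] == 0.:
                continue
            old = B[j].copy()
            B[j] = block_soft_thresh(old + X[:, j] @ R / norms2[j],
                                     lam / norms2[j])
            diff = old - B[j]
            if diff.any():
                R += np.outer(X[:, j], diff)
    return B

rng = np.random.default_rng(0)
n, p, q = 30, 50, 8
X = rng.standard_normal((n, p))
B_true = np.zeros((p, q))
B_true[:3] = rng.standard_normal((3, q))       # 3 jointly active sources
Y = X @ B_true + 0.01 * rng.standard_normal((n, q))
lam = 0.1 * np.max(np.linalg.norm(X.T @ Y, axis=1))   # fraction of lam_max
B_hat = multitask_lasso_bcd(X, Y, lam)
```

As expected from the ℓ2,1 penalty, the estimated active rows share a common support across all q time instants, in contrast with the per-instant supports of MCE shown in Figure 1.10.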