Sparse Support Recovery with Thresholded Basis Pursuit and Lasso-Zero, and an Extension to Handle Missing Data


Thesis

Reference


DESCLOUX, Pascaline

Abstract

This thesis addresses the problem of variable selection in the high-dimensional linear regression model. It complements the existing literature on the Thresholded Basis Pursuit (TBP) estimator and, more generally, on the underlying idea of overfitting the model and then thresholding the resulting coefficients. First, new theoretical guarantees for the recovery of the sign vector by TBP are proved.

Second, an extension of TBP, called Lasso-Zero, is introduced. Its novelty lies in the use of several noise dictionaries, concatenated to the design matrix in order to account for the presence of noise in the overfitting step. Finally, a robust extension of Lasso-Zero is proposed for variable selection in the presence of missing data.

DESCLOUX, Pascaline. Sparse Support Recovery with Thresholded Basis Pursuit and Lasso-Zero, and an Extension to Handle Missing Data. Doctoral thesis: Univ. Genève, 2019, no. Sc. 5430

DOI : 10.13097/archive-ouverte/unige:139014 URN : urn:nbn:ch:unige-1390143

Available at:

http://archive-ouverte.unige.ch/unige:139014

Disclaimer: layout of this document may differ from the published version.


Université de Genève, Faculté des Sciences

Section de mathématiques, Professor Sylvain Sardy

Sparse Support Recovery with Thresholded Basis Pursuit and Lasso-Zero, and an Extension to Handle Missing Data.

Thesis

presented to the Faculté des sciences of the Université de Genève for the degree of Docteur ès sciences, specialization in statistics,

by

Pascaline Privitera-Descloux

from Sâles (FR)

Thesis no. 5430

Genève 2020


Acknowledgements

The realization of this thesis would have been much harder without the help and support of many people.

First of all, I would like to express my deepest gratitude to my supervisor, Sylvain Sardy, for his precious advice and guidance, and yet for making me feel like a collaborator more than a student.

I also wish to thank Claire Boyer, Sebastian Engelke, Nick Hengartner and Julie Josse, for kindly agreeing to be part of my thesis committee and for their interesting comments and questions.

Thank you to Caroline Giacobino and Parisa Mamooler, for sharing their experience and for the nice time spent together during lunch breaks.

Many thanks to Jairo Diaz, for regularly sharing ideas and for his very fast answers when I asked for pieces of code.

I would like to thank the maintainers of the Baobab HPC cluster of the University of Geneva, for making this precious tool available and for their fast assistance when needed.

I acknowledge the CUSO (Conférence Universitaire de Suisse Occidentale), both for the Doctoral Program in Statistics, which allowed me to learn a lot and meet many peers from other universities, and the transversal program, in which I gained skills beyond mathematics and statistics.

Many thanks to the secretaries and librarians from the Section de Mathématiques for their kind help.

Thank you to all of my friends at the Section de Mathématiques for our happy breaks at the Z-bar.

More broadly, many thanks to all of my friends for the good times spent together and for making me completely disconnect from my research.

I also would like to thank my parents and my brothers, for their support and for being the greatest family.

Last but not least, thank you Giuseppe. For everything.


Résumé

This thesis addresses the problem of variable selection in the high-dimensional linear regression model, that is, identifying the set of important variables when the number of potential predictors exceeds the sample size. Mathematically, this amounts to identifying the locations of the nonzero coefficients of the regression vector, which is assumed to be sparse. Our contribution is closely related to the Thresholded Basis Pursuit (TBP) estimator, which looks for a solution to the system $y = X\beta$ with minimal $\ell_1$-norm, then sets the small coefficients of this solution to zero.

The main objective of this thesis is to complement the existing literature on TBP and, more generally, on the underlying idea of overfitting the model and then thresholding the resulting coefficients. Indeed, in previous works, TBP has been studied either under the assumption that the predictors are i.i.d. Gaussian, or in an asymptotic framework in which the dimensions of the dataset are fixed and the absolute values of the nonzero coefficients of the regression vector grow. Yet, even though independence of the predictors is often assumed (especially in the compressed sensing literature), statisticians prefer results providing guarantees when the predictors are dependent, which better reflects the datasets they encounter in practice. Moreover, one generally seeks to show that the proposed estimator is consistent. In the context of variable selection, this means that the estimator recovers the correct set of variables with probability converging to 1 as the sample size tends to infinity (and not with fixed dimensions of the dataset).

Concretely, we prove that the sign vector of the TBP solution is a consistent estimator of the signs of the true regression vector when the rows of the design matrix follow a multivariate normal distribution, in an asymptotic framework allowing both the sample size and the number of predictors to grow.

Then, we introduce an extension of TBP that we call Lasso-Zero. Its novelty lies in the use of several noise dictionaries, concatenated to the design matrix in order to account for the presence of noise in the overfitting step. We select the threshold by quantile universal thresholding (QUT), and propose to exploit the specificity of Lasso-Zero – more precisely, the coefficients associated with the noise dictionaries – to bypass the need to estimate the noise level. Simulation studies are carried out to demonstrate the efficiency of our estimator.

We also address the problem of variable selection when data are missing. When the number of predictors is large, the usual methods either result in a substantial loss of information or are difficult to apply because they require specifying a model for the predictors and/or the missing data mechanism. We avoid this difficulty by using a single naively imputed matrix and by showing a connection between the missing data problem and the sparse corruption model, which assumes that the signal is corrupted by two noise sources, one dense and the other sparse with potentially large entries. We propose an extension of Lasso-Zero, called Robust Lasso-Zero, for variable selection under sparse corruption, and apply it to the missing data problem in a simulation study. We also introduce Thresholded Justice Pursuit (TJP), the analogue of TBP for the sparse corruption model, and extend the results on TBP to TJP.


Summary

This thesis concerns the problem of variable selection in the high-dimensional linear regression model, that is, the problem of recovering the set of important variables in cases where the number of potential predictors exceeds the sample size. Mathematically, this corresponds to identifying the locations of the nonzero entries of the regression vector, which is assumed to be sparse. Our contribution is closely related to the Thresholded Basis Pursuit (TBP) estimator, which first looks for a solution to $y = X\beta$ with minimal $\ell_1$-norm, and then sets the small coefficients to zero.

The main objective of this thesis is to complement the existing literature on TBP and, more generally, on the underlying idea of overfitting the model and then thresholding the obtained coefficients. Indeed, in previous works, TBP has been studied either assuming that the predictors are i.i.d. and Gaussian, or in an asymptotic framework in which the dimensions of the dataset are fixed and the absolute values of the nonzero coefficients of the regression vector grow.

However, even though the i.i.d. case is often assumed (especially in the compressed sensing literature), statisticians prefer results providing guarantees in cases where the predictors are dependent, better reflecting the datasets they encounter in practice. Also, one generally seeks to demonstrate the consistency of the proposed estimator. In the context of variable selection, this means that the estimator recovers the correct set of variables with probability tending to 1 as the sample size tends to infinity (and not with fixed dimensions of the dataset).

Concretely, we demonstrate the sign consistency of TBP when the rows of the design matrix follow a multivariate Gaussian distribution, in an asymptotic framework allowing the sample size as well as the number of predictors to grow.

Then, we introduce an extension of TBP called Lasso-Zero. Its novelty resides in the use of several random noise dictionaries that are appended to the design matrix in order to take the presence of noise into account in the overfitting step. We select its tuning parameter by quantile universal thresholding, and propose to exploit the specificity of Lasso-Zero – more precisely, the coefficients associated with the noise dictionaries – to bypass the need to estimate the noise level. Numerical experiments are performed to demonstrate the efficiency of our estimator.


We also address the problem of model selection in the case of missing data. When the number of predictors is large, common methods either result in a substantial loss of information, or are difficult to apply because they require specifying a model for the variables and/or the missing data mechanism. We circumvent this issue by using a single naively imputed matrix and by showing a connection between the problem of missing data and the sparse corruption model, which assumes that the signal is corrupted by two noise sources, one dense, the other sparse with potentially large entries. We propose an extension of Lasso-Zero, called Robust Lasso-Zero, for variable selection in the case of sparse corruption, and apply it to the missing data problem in numerical experiments. We also introduce Thresholded Justice Pursuit (TJP), the analogue of TBP for the sparse corruption model, and extend theoretical results about TBP to TJP.


Chapter 1 Introduction

1.1 Motivation and contribution

Regression analysis – one of statisticians' most frequent tasks – estimates the relationships between variables and predicts the value of an outcome $y \in \mathbb{R}$ given a set of predictors. It is often assumed that the relation between variables is linear, i.e. $y = x^T \beta^0 + \epsilon$, where $x = (x_1, \dots, x_p)^T \in \mathbb{R}^p$ is the vector of predictors, $\epsilon$ is a zero-mean noise term and $\beta^0 \in \mathbb{R}^p$ is the vector of regression coefficients, which is generally estimated by the least-squares method.

But even though the linear model probably is the simplest one can assume, there are cases in which traditional least-squares estimation cannot be performed. Indeed, when the number of parameters exceeds the number of observations, the least-squares problem has infinitely many solutions and yields overfitted models that fail to predict future observations reliably.

Nowadays, these high-dimensional datasets often arise in practice. Genetics provides a typical modern example, with datasets containing expression levels of thousands of genes for only a few observational units. In such cases, additional assumptions about the structure of the problem need to be made. Aiming at making models simpler and easier to interpret, a common assumption is that only a few of the predictors (or covariates) at hand are actually relevant for predicting the outcome (or response). For example, it is generally assumed that only a few genes are related to some medical trait, such as the probability of developing a disease.

In this context, statisticians and practitioners are often interested in identifying the set of relevant covariates. This is the task of model selection, or variable selection. Mathematically, assuming that only a few variables are relevant in our problem corresponds to the assumption of sparsity, meaning that most entries of $\beta^0$ are zero, and the task of model selection corresponds to correctly identifying the support of $\beta^0$, the set of indices of the nonzero entries.

[Figure 1.1: Lasso coefficient paths $\hat\beta^{\mathrm{lasso}}_{\lambda,j}$ (vertical axis) plotted against $\lambda$ (horizontal axis), with horizontal lines at the thresholds $\tau$ and $-\tau$.]

Figure 1.1 – Path of all Lasso solutions for a signal generated according to setting (b) in Section 4.4.1. The support $S^0$ has size 10 and is represented by the red curves. None of the Lasso solutions recover $S^0$, whereas the BP solution – corresponding to the limit Lasso solution as $\lambda$ tends to 0 – recovers $S^0$ exactly if thresholded properly.

With a large number of covariates, traditional search strategies like best subset or stepwise selection methods (see e.g. Hastie et al. [2009]) are computationally expensive or even intractable. The use of $\ell_1$-regularization provides a very attractive alternative. More precisely, the Lasso [Tibshirani, 1996] looks for a trade-off between the fit of the model and the $\ell_1$-norm – that is, the sum of the absolute values of the coefficients – of the estimated coefficient vector. This has the effect of shrinking the estimated coefficients towards zero, and has the advantage of setting some to zero. The amount of shrinkage is governed by a positive regularization parameter $\lambda$, which consequently influences the size of the selected model. The simultaneous variable selection and estimation provided by the Lasso as well as the fast algorithms that have been developed to solve it [Friedman et al., 2010] contributed to the popularity of the Lasso and $\ell_1$-regularization in general.
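In symbols, the trade-off described above can be written as follows. The $\tfrac12$ scaling of the squared error is an assumption made here to match the best subset problem (2.10); the formal definition of the Lasso is deferred to Section 2.3 and may use a different normalization.

```latex
\hat\beta^{\mathrm{lasso}}_\lambda \in \arg\min_{\beta \in \mathbb{R}^p}
  \; \tfrac12 \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,
  \qquad \lambda > 0 .
```

A larger $\lambda$ puts more weight on the $\ell_1$-norm and therefore shrinks more coefficients to zero.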

Theoretical properties of the Lasso have been extensively studied. In terms of model selection, it is known to include all important variables with high probability under fairly weak conditions (see Bühlmann and van de Geer [2011] and references therein), but it tends to include (too many) false positives [Su et al., 2017], and exact support recovery can be achieved only under a strong condition, called the irrepresentable condition [Zhao and Yu, 2006]. This difficulty in recovering the correct model is generally attributed to an excessive amount of shrinkage. Some proposals have been made to solve this issue, such as the two-stage adaptive Lasso [Zou, 2006], or the nonconvex SCAD penalty [Fan and Peng, 2004].

Figure 1.1 illustrates a typical example¹ where the true support cannot be recovered by the Lasso. The curves represent all components $\hat\beta^{\mathrm{lasso}}_{\lambda,j}$ of the Lasso solution as $\lambda$ varies. The red curves correspond to indices $j$ belonging to the support of $\beta^0$, which we denote by $S^0$. Looking at the solution path, it is clear that the Lasso cannot recover the true support in this case, since there is no value of $\lambda$ for which the set of nonzero coefficients equals $S^0$.

¹ It is a realization of the simulation setting (b) described in Section 4.4.1, with 10 truly nonzero coefficients.

On the other hand, Figure 1.1 shows a clear separation between the red and black curves at the limit solution of the path as $\lambda$ tends to zero, that is, where shrinkage is the weakest.

Interestingly, in the high-dimensional case, the limit $\lim_{\lambda \to 0^+} \hat\beta^{\mathrm{lasso}}_\lambda$ of the Lasso path is a solution to the so-called Basis Pursuit (BP) problem [Fuchs, 2004]. BP has been well studied in the compressed sensing literature. It is known to exactly recover $\beta^0$ in the noiseless case, provided it is sparse enough (see Foucart and Rauhut [2013] and references therein). Setting the small coefficients of BP to zero corresponds to Thresholded Basis Pursuit (TBP), which was introduced by Saligrama and Zhao [2011]. Thus, Figure 1.1 illustrates an example where the Lasso does not achieve support recovery, but TBP does for an appropriate choice of threshold, e.g. the one indicated horizontally.

There is little work available on TBP. Saligrama and Zhao [2011] studied sign recovery, which is slightly stronger than support recovery, by a multistage procedure combining TBP and ordinary least-squares estimation. Their results hold for i.i.d. Gaussian covariates, a common assumption in the compressed sensing literature, in which the measurement matrix can be chosen. However, in statistical applications, covariates are typically correlated.

More recently, Tardivel and Bogdan [2018] showed that the condition of identifiability is necessary and sufficient for sign recovery by TBP, but only in an asymptotic setting where the matrix of covariates is fixed, which prevents the dimensions of the dataset from growing.

This thesis aims at complementing the literature on TBP and the “overfit, then threshold” paradigm underlying it. Firstly, we bring novel theoretical results about TBP, showing that it achieves sign recovery under the so-called stable nullspace property, and proving sign consistency for correlated Gaussian matrices. Compared to previous works on TBP, our assumptions are closer to the ones that are usually made in the statistical literature, permitting covariates to be correlated and the numbers of observations and covariates to grow.

Secondly, we introduce an extension of TBP, called Lasso-Zero. Apart from focusing on the limit Lasso solution at $\lambda = 0$ and the use of thresholding – which motivated its name – the novelty of Lasso-Zero resides in the use of several noise dictionaries in the overfitting step, followed by the aggregation of the corresponding estimates. It aims at improving TBP in cases where the signal-to-noise ratio is low, since TBP ignores the presence of noise in the first step.

Thirdly, we also propose an extension of Lasso-Zero that handles incomplete data, a recurring problem in practice. Causes of missing data are various: non-response to a survey question, manual error, failure of a measuring instrument, etc. Efficiently handling missing data is particularly difficult in high dimension, since common approaches either result in a substantial loss of information, or require specifying a model for the (very) large set of covariates. To bypass this issue, we rely on a single (possibly naively) imputed matrix, and propose a robust extension of Lasso-Zero that takes into account the error caused by imputation. This robust extension relies on the simple observation that when $\beta^0$ is sparse, then so is the error induced by imputation. This naturally establishes a relation between the problem of missing covariates in high-dimensional datasets and the sparse corruption problem. Given this relation, we will define Thresholded Justice Pursuit (TJP) – the analogue of TBP for the sparse corruption problem – and extend all our theoretical results to this estimator. Just as Lasso-Zero is obtained by solving TBP on augmented matrices, Robust Lasso-Zero is obtained by doing the same with TJP.

1.2 Organization of the thesis

Existing estimators and results that are relevant for the remainder of the thesis are collected in Chapter 2, as well as in the appendices. Chapters 3, 4 and 5 are dedicated to our own contributions.

We start by providing some background on the high-dimensional linear model and support recovery in Chapter 2. After introducing some generalities in Section 2.1, BP, the Lasso and TBP are presented in Sections 2.2, 2.3 and 2.4 respectively, together with some theoretical results. In Sections 2.5 and 2.6, we discuss the problems of missing data and sparse corruptions as well as some existing estimators.

Chapter 3 is dedicated to sign recovery properties of TBP and TJP. Section 3.1 formally defines TJP and proves that the notion of identifiability is necessary and sufficient for sign recovery, thus extending a result of Tardivel and Bogdan [2018] to TJP. In Section 3.2, TBP (TJP) is analyzed assuming the (extended) stable nullspace property, under which it is proved that sign recovery holds provided the absolute values of the nonzero coefficients are larger than a specified lower bound. These results are used in Section 3.3 to prove sign consistency for correlated Gaussian matrices.

In Chapter 4, we introduce our novel estimator, Lasso-Zero. The idea and formal definition are presented in Section 4.1. After a first small numerical experiment in Section 4.2, the crucial question of the threshold selection is discussed in Section 4.3. We opt for the quantile universal threshold [Giacobino et al., 2017], derived for Lasso-Zero in Section 4.3.1. Some guarantees about the family-wise error rate in the low-dimensional case are given in Section 4.3.2. Estimation of the quantile universal threshold using a GEV approximation is discussed in Section 4.3.3. Finally, numerical experiments are presented in Section 4.4.

Chapter 5 proposes a robust extension of Lasso-Zero for the sparse corruption model and focuses on missing data as a particular application. We start by formally defining Robust Lasso-Zero in Section 5.1. Section 5.2 then establishes a relation between the sparse corruption problem and the issue of missing covariates, and proposes to use Robust Lasso-Zero in case of incomplete data. Section 5.3 is dedicated to numerical experiments.


1.3 Notation

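$[p]$ - the set $\{1, \dots, p\}$ (where $p \in \mathbb{N}^*$)

$|S|$ - the cardinality of the set $S$

$\overline{S}$ - the complement of the set $S \subset [p]$

$S_1 \sqcup S_2$ - the union of two disjoint sets $S_1$ and $S_2$

$\mathbb{1}_S$ - the indicator function of the set $S$, i.e. $\mathbb{1}_S(x) = 1$ if $x \in S$, and $\mathbb{1}_S(x) = 0$ otherwise

$\beta_S$ - the restriction of the vector $\beta \in \mathbb{R}^p$ to the set of indices $S \subset [p]$, or the vector of $\mathbb{R}^p$ obtained by setting to zero the coefficients of $\beta$ indexed by $\overline{S}$ (will be clear depending on the context)

$\mathrm{supp}(\beta)$ - the support of the vector $\beta \in \mathbb{R}^p$, i.e. the set $\{j \in [p] \mid \beta_j \neq 0\}$

$\mathrm{sign}(\beta)$ - the vector of signs of $\beta \in \mathbb{R}^p$, i.e. $\mathrm{sign}(\beta)_j = \mathbb{1}_{\mathbb{R}_{>0}}(\beta_j) - \mathbb{1}_{\mathbb{R}_{<0}}(\beta_j)$

$\beta \overset{s}{=} \tilde\beta$ - if and only if $\mathrm{sign}(\beta) = \mathrm{sign}(\tilde\beta)$

$\|\beta\|_q$ - $\left(\sum_{j=1}^p |\beta_j|^q\right)^{1/q}$, the $\ell_q$-norm of $\beta \in \mathbb{R}^p$ ($q \in (0, \infty)$)

$\|\beta\|_\infty$ - $\max_{1 \le j \le p} |\beta_j|$, the sup-norm of $\beta \in \mathbb{R}^p$

$\|\beta\|_0$ - the number of nonzero coefficients of $\beta$ (the $\ell_0$-(pseudo-)norm)

$|\beta|_{\min}$ - $\min_{j \in \mathrm{supp}(\beta)} |\beta_j|$, the minimal nonzero absolute entry of $\beta$

$1_n$ - the vector of ones in $\mathbb{R}^n$

$I_n$ - the identity matrix of size $n \times n$

$X_S$ - the submatrix of $X \in \mathbb{R}^{n \times p}$ formed by its columns indexed by $S \subset [p]$

$\Sigma_{TS}$ - the submatrix of $\Sigma \in \mathbb{R}^{p \times p}$ formed by its rows indexed by $T \subset [p]$ and columns indexed by $S \subset [p]$

$A > 0$ - means that the symmetric matrix $A$ is positive definite

$\lambda_{\min}(A), \lambda_{\max}(A)$ - the smallest, respectively largest eigenvalue of the symmetric matrix $A$

$\sigma_{\min}(X), \sigma_{\max}(X)$ - the smallest, respectively largest singular value of the matrix $X$

$\mathrm{diag}(v)$ - the diagonal matrix in $\mathbb{R}^{p \times p}$ with diagonal $v \in \mathbb{R}^p$

$\mathrm{diag}(A)$ - for $A \in \mathbb{R}^{p \times p}$, the diagonal matrix whose diagonal is the one of $A$

$\eta_\tau$ - the hard thresholding function, defined by $\eta_\tau(\beta)_j := \beta_j \mathbb{1}_{(\tau, \infty)}(|\beta_j|)$ for every $j \in [p]$, where $\beta \in \mathbb{R}^p$

$\log x$ - the natural logarithm of $x$ ($\log = \log_e$)

$f(n) = \Omega(g(n))$ - there is a constant $K > 0$ such that $|f(n)| \ge K|g(n)|$ for large enough $n$

$f(n) = O(g(n))$ - there is a constant $k > 0$ such that $|f(n)| \le k|g(n)|$ for large enough $n$.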
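Three of the objects in this list, the hard thresholding function $\eta_\tau$, the support and the sign vector, are easy to make concrete. The following is a minimal NumPy sketch; the function names `hard_threshold` and `support` and the example vector are illustrative choices, not part of the thesis, and indices are 0-based in the code whereas the text uses $[p] = \{1, \dots, p\}$.

```python
import numpy as np

def hard_threshold(beta, tau):
    """eta_tau: keep entries with |beta_j| > tau, set the others to zero."""
    beta = np.asarray(beta, dtype=float)
    return np.where(np.abs(beta) > tau, beta, 0.0)

def support(beta):
    """supp(beta): indices of the nonzero entries (0-based here)."""
    return set(np.flatnonzero(np.asarray(beta)).tolist())

# Illustrative vector (hypothetical values).
beta = np.array([0.05, -1.2, 0.0, 0.3, -0.02])
thresholded = hard_threshold(beta, tau=0.1)   # small entries are set to zero
signs = np.sign(thresholded)                  # entries in {-1, 0, 1}
```

Note that the inequality in $\eta_\tau$ is strict: an entry with $|\beta_j| = \tau$ is set to zero.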


Chapter 2 Background on sparse vector recovery

This chapter introduces the problem of sparse vector recovery – with an emphasis on sparse support recovery – in the high-dimensional linear model, and provides an overview of some existing estimators. There is a large body of literature focusing on this problem, and it is impossible to provide an exhaustive review of existing methods and the corresponding theoretical results. We focus here only on estimators that are most closely related to our contribution, namely Basis Pursuit, the Lasso, and Thresholded Basis Pursuit – all of them making use of $\ell_1$-penalization. We restrict our attention to theoretical results that will be used in the next chapters or that are closely related to our own results.

We also present the problems of missing covariates and sparse corruptions in Sections 2.5 and 2.6 respectively, and some corresponding estimators. These problems will arise in Chapter 5.

2.1 The sparse high-dimensional linear model

The linear model assumes that a response vector $y \in \mathbb{R}^n$ depends on a design matrix $X \in \mathbb{R}^{n \times p}$ through the equation

$$y = X\beta^0 + \epsilon, \tag{2.1}$$

where $\beta^0 \in \mathbb{R}^p$ is an unknown vector of coefficients and $\epsilon \in \mathbb{R}^n$ is a noise term with expectation zero. We will make the classical assumption that $\epsilon \sim N(0, \sigma^2 I_n)$. The matrix $X$ can be fixed or random. Given $y$ and $X$, the statistician's task is to make inference about $\beta^0$.

When the number $p$ of parameters to estimate exceeds the number $n$ of observations, the linear model is said to be high-dimensional. High-dimensionality causes difficulties. In particular, traditional estimators like the least-squares estimator are no longer well-defined.

Indeed, it is well known that

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 \iff X^T X \hat\beta = X^T y,$$

so the least-squares problem has infinitely many solutions whenever the kernel of $X$ is nontrivial, which is necessarily the case when $p > n$.
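The non-uniqueness can be checked numerically. The sketch below is only an illustration under assumed dimensions ($n = 5$, $p = 8$) and simulated Gaussian data: it builds one least-squares solution with the pseudoinverse, shifts it along a vector of $\ker X$, and evaluates the residual of the normal equations for both vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                      # more parameters than observations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# One least-squares solution: the minimum-l2-norm one given by the pseudoinverse.
beta_a = np.linalg.pinv(X) @ y

# A nonzero vector of ker X, read off the last right-singular vector of X.
_, _, Vt = np.linalg.svd(X)      # full_matrices=True by default, so Vt is p x p
v = Vt[-1]

# Shifting by any multiple of v gives another solution of the normal equations.
beta_b = beta_a + 3.0 * v

def normal_eq_residual(beta):
    """Euclidean norm of X^T X beta - X^T y."""
    return np.linalg.norm(X.T @ X @ beta - X.T @ y)
```

Both `normal_eq_residual(beta_a)` and `normal_eq_residual(beta_b)` are zero up to rounding error, although the two vectors differ.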

The identifiability issue is an even more striking illustration of the problem we are facing when $p > n$: even if $\epsilon = 0$ in (2.1), the system $y = X\beta$ has infinitely many solutions, and it is therefore impossible to recover $\beta^0$ even in the noiseless case. The only way to circumvent this is to impose restrictions on $\beta^0$ by making further assumptions on its structure. The most common assumption made in the statistical literature is the sparsity assumption, namely that most entries of $\beta^0$ are zero. Under the sparsity assumption, estimation of $\beta^0$ can be made with three different goals in mind:

• prediction: minimizing the prediction error $E((\tilde y - \tilde x^T \hat\beta)^2)$, where $(\tilde y, \tilde x) \in \mathbb{R} \times \mathbb{R}^p$ represents a future observation,

• estimation: minimizing the estimation error $E(\|\hat\beta - \beta^0\|)$,

• support recovery: recovering the support of $\beta^0$, that is, the locations of its nonzero coefficients, denoted

$$S^0 := \mathrm{supp}(\beta^0) = \{j \in [p] \mid \beta^0_j \neq 0\}.$$

The cardinality of $S^0$ is denoted

$$s^0 := |S^0|$$

and is called the sparsity index of $\beta^0$. Sign recovery, i.e. $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta^0)$, clearly implies support recovery.

This thesis focuses on the problem of support recovery. In a statistical regression setting, where each column of $X$ corresponds to a predictor or covariate, support recovery corresponds to model selection or variable selection. But linear regression is not the only application where the sparse linear model is useful. For example, the segmentation problem can be formulated as a sparse (low-dimensional) linear model. In this context, $y = f^0 + \epsilon$, where $f^0 \in \mathbb{R}^n$ is a piecewise constant vector, that is, $f^0_{j+1} = f^0_j$ for most $j \in [n-1]$. It is desired to recover the jump locations, that is, all $j \in [n-1]$ for which $f^0_{j+1} \neq f^0_j$. The piecewise constant signal can be written

$$f^0 = f^0_1 1_n + X\beta^0,$$

where $X \in \mathbb{R}^{n \times (n-1)}$ is given by $X_{ij} := \mathbb{1}_{\{i > j\}}$ and $\beta^0_j := f^0_{j+1} - f^0_j$. Then the set of jump locations is precisely $\mathrm{supp}(\beta^0)$.
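As a small check of this reformulation, the following sketch builds the matrix $X$ and the jump vector $\beta^0$ for a made-up piecewise constant signal (the values of `f0` are purely illustrative) and verifies the decomposition $f^0 = f^0_1 1_n + X\beta^0$.

```python
import numpy as np

# Hypothetical piecewise constant signal with two jumps.
f0 = np.array([2.0, 2.0, 2.0, 5.0, 5.0, 1.0, 1.0, 1.0])
n = f0.size

# X_ij = 1 if i > j, for i in 1..n and j in 1..n-1 (1-based as in the text).
i = np.arange(1, n + 1)[:, None]
j = np.arange(1, n)[None, :]
X = (i > j).astype(float)

beta0 = np.diff(f0)                           # beta0_j = f0_{j+1} - f0_j
reconstructed = f0[0] * np.ones(n) + X @ beta0
jumps = (np.flatnonzero(beta0) + 1).tolist()  # 1-based jump locations
```

Row $i$ of $X\beta^0$ sums the first $i - 1$ increments, which telescopes to $f^0_i - f^0_1$; this is why the reconstruction is exact.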

The performance of a support estimator $\hat S \subset [p]$ can be measured by various criteria:

• probability of exact support recovery:

$$P(\hat S = S^0); \tag{2.2}$$

• true positive rate:

$$\mathrm{TPR} := E(\mathrm{TPP}), \tag{2.3}$$

where

$$\mathrm{TPP} := \frac{|S^0 \cap \hat S|}{|S^0|} \tag{2.4}$$

is the true positive proportion;

• family-wise error rate:

$$\mathrm{FWER} = P(|\hat S \cap \overline{S^0}| > 0); \tag{2.5}$$

• false discovery rate:

$$\mathrm{FDR} := E(\mathrm{FDP}), \tag{2.6}$$

where

$$\mathrm{FDP} := \frac{|\overline{S^0} \cap \hat S|}{\max\{|\hat S|, 1\}} \tag{2.7}$$

is the false discovery proportion.

The TPR is a measure of the power of the model selection procedure, since it corresponds to the average proportion of truly nonzero coefficients that are detected. The FWER and the FDR, on the other hand, are measures of the propensity to make false discoveries. The FWER is the probability of making at least one false discovery, whereas the FDR corresponds to the average proportion of false discoveries among all discoveries. Both quantities are related by the inequality

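$$\mathrm{FDR} \le \mathrm{FWER}. \tag{2.8}$$

Indeed, $\mathrm{FWER} = E(U)$, where $U := \mathbb{1}_{\{|\hat S \cap \overline{S^0}| > 0\}}$. By definition (2.7), $\mathrm{FDP} \le 1$, and $\mathrm{FDP} = 0$ when $U = 0$. So $\mathrm{FDP} \le U$, hence $\mathrm{FDR} = E(\mathrm{FDP}) \le E(U) = \mathrm{FWER}$. Moreover, equality holds in (2.8) when $S^0 = \emptyset$, since in this case $\mathrm{FDP} = \mathbb{1}_{\{|\hat S| > 0\}}$.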
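The proportions behind these criteria are simple set computations. The sketch below uses plain Python sets with made-up supports (the function names and the example sets are illustrative, not from the thesis) and evaluates TPP, FDP and the indicator $U$ of at least one false discovery for a few estimates, so that the pointwise inequality $\mathrm{FDP} \le U$ used in the argument above can be observed directly.

```python
def tpp(S_hat, S0):
    """True positive proportion |S0 & S_hat| / |S0| (S0 assumed nonempty)."""
    return len(S0 & S_hat) / len(S0)

def fdp(S_hat, S0):
    """False discovery proportion: selected indices outside S0, over max(|S_hat|, 1)."""
    return len(S_hat - S0) / max(len(S_hat), 1)

def any_false_discovery(S_hat, S0):
    """U = 1 if at least one selected index lies outside S0, else 0."""
    return int(len(S_hat - S0) > 0)

# Hypothetical true support and four candidate estimates.
S0 = {1, 4, 7}
estimates = [{1, 4, 7}, {1, 4, 9}, {2, 3}, set()]
summary = [(tpp(S, S0), fdp(S, S0), any_false_discovery(S, S0)) for S in estimates]
```

Averaging the second and third components of `summary` over repeated simulations would give Monte Carlo estimates of the FDR and the FWER respectively.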

Inequality (2.8) implies that any procedure controlling the FWER at level $\alpha$ also controls the FDR at level $\alpha$. Controlling the FWER generally leads to very conservative procedures, which is one of the reasons why Benjamini and Hochberg [1995] introduced the less conservative FDR. Some recent work on linear regression has aimed at controlling the FDR [Bogdan et al., 2015; Barber and Candès, 2015; Candès et al., 2018].

Under the sparsity assumption, it would be natural to look for an estimator $\hat\beta$ with small $\ell_0$-(pseudo-)norm, defined by

$$\|\beta\|_0 := |\mathrm{supp}(\beta)|.$$

In the noiseless case ($\epsilon = 0$), the problem

$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_0 \quad \text{s.t.} \quad y = X\beta \tag{2.9}$$

looks for the sparsest $\beta$ satisfying $y = X\beta$, whereas in the noisy case ($\epsilon \neq 0$), the so-called best subset problem,

$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_0, \qquad \lambda > 0, \tag{2.10}$$

looks for the best trade-off between the fit $\|y - X\beta\|_2^2$ and the model complexity. However, both problems are combinatorial by nature and are known to be NP-hard [Natarajan, 1995]. One of the greatest advances of statistics and signal processing has been to relax and convexify the $\ell_0$-penalty by replacing it with the $\ell_1$-norm. Convexity of the resulting optimization problems made it possible to rely on the well-developed theory of convex optimization to derive theoretical results. Moreover, the development of fast algorithms to solve this kind of problem undoubtedly contributed to the popularity of $\ell_1$ estimators.

The next two sections present the two best-known $\ell_1$ problems, namely Basis Pursuit and the Lasso, corresponding to the $\ell_1$ counterparts of problems (2.9) and (2.10) respectively. Let us first state here an assumption and a slight abuse of language that we will make throughout the thesis.

Remarks 2.1

a) In the remainder of the thesis we will assume that $\mathrm{rank}\, X = n$. This way, any system $y = X\beta$ admits at least one solution.

b) For ease of language, we will say the Basis Pursuit, or Lasso, solution, even though uniqueness is not always guaranteed. When uniqueness is crucial, it will be made explicit. Note that when $X$ is in general position (see Definition 2.2 below), the Basis Pursuit and Lasso solutions are unique.

Definition 2.2 A matrix $X \in \mathbb{R}^{n \times p}$ is in general position if for every $(u_1, \dots, u_p) \in \{-1, 1\}^p$ and for every $k < \min\{n, p\}$, all affine subspaces of $\mathbb{R}^n$ of dimension $k$ contain at most $k + 1$ points $u_j X_j$.

Examples 2.3

a) When $p \le n$, if the columns of $X$ are linearly independent, then $X$ is clearly in general position.

b) Any random matrix $X$ with an absolutely continuous distribution is almost surely in general position [Tibshirani, 2013].

2.2 Basis Pursuit

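This section focuses on the noiseless case, i.e. when $y = X\beta^0$. Chen et al. [2001] introduced the convex relaxation of (2.9),

$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad y = X\beta, \tag{2.11}$$

called the Basis Pursuit (BP) problem. BP can be recast as a linear program, as it is equivalent to solving

$$\min_{\beta^+, \beta^- \in \mathbb{R}^p} 1_p^T \beta^+ + 1_p^T \beta^- \quad \text{s.t.} \quad \begin{cases} y = \begin{bmatrix} X & -X \end{bmatrix} \begin{bmatrix} \beta^+ \\ \beta^- \end{bmatrix}, \\ \beta^+_j \ge 0 \ \ \forall j, \\ \beta^-_j \ge 0 \ \ \forall j. \end{cases}$$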

BP used as a proxy for (2.9) has been well studied in mathematical signal processing and it is known that under some conditions, the solutions of (2.11) are solutions to (2.9) as well [Donoho and Huo, 2001; Elad and Bruckstein, 2002; Donoho and Elad, 2003; Gribonval and Nielsen, 2003; Fuchs, 2004; Donoho, 2006; Candès et al., 2006a].
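The linear-program form of BP stated above can be handed to a generic LP solver. The following is a minimal sketch using `scipy.optimize.linprog` with the HiGHS backend; the helper name `basis_pursuit`, the dimensions and the simulated sparse vector are assumptions made for illustration, and this is not necessarily the solver used for the experiments of the thesis.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||beta||_1 s.t. X beta = y through its linear-program form."""
    n, p = X.shape
    c = np.ones(2 * p)                  # objective 1^T beta_plus + 1^T beta_minus
    A_eq = np.hstack([X, -X])           # [X, -X] [beta_plus; beta_minus] = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]        # beta = beta_plus - beta_minus

# Simulated noiseless example (hypothetical sizes and coefficients).
rng = np.random.default_rng(1)
n, p = 20, 40
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[[3, 17, 29]] = [1.5, -2.0, 1.0]   # sparse vector, noiseless response
y = X @ beta0
beta_bp = basis_pursuit(X, y)
```

By construction `beta_bp` satisfies the constraint $y = X\beta$ up to solver tolerance and its $\ell_1$-norm cannot exceed that of `beta0`, which is itself feasible. Whether `beta_bp` coincides with `beta0` depends on conditions such as those discussed below.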

Let us state the most important theoretical results about BP. The following result about uniqueness is due to Dossal [2012].

Theorem 2.4 If $X$ is in general position (see Definition 2.2), BP has a unique solution for every $y \in \mathbb{R}^n$.

We will denote the BP solution by $\hat\beta^{\mathrm{BP}}$.

All results and definitions below can be found in Foucart and Rauhut [2013]. First, the following theorem shows that the BP solution is sparse whenp > n.

Theorem 2.5 If the BP solution is unique, then kˆ^BPk0 nand the columns of X_{supp( ˆ}BP)

are linearly independent.

The following condition is the weakest guaranteeing exact recovery by BP of all vectors supported on a setS⁰⇢[p].

Definition 2.6 The matrix $X$ is said to satisfy the nullspace property (NSP) relative to $S^0$ if

\[
\beta \in \ker X \setminus \{0\} \implies \|\beta_{S^0}\|_1 < \|\beta_{\overline{S^0}}\|_1. \tag{2.12}
\]

Theorem 2.7 For a matrix $X \in \mathbb{R}^{n \times p}$, every vector $\beta^0 \in \mathbb{R}^p$ with $\operatorname{supp}(\beta^0) = S^0$ is the unique solution to BP (2.11) when $y = X\beta^0$ if and only if $X$ satisfies the NSP relative to $S^0$.

It is often assumed that the estimated vector is only approximately sparse. The following condition ensures stability of BP with respect to sparsity defect.

Definition 2.8 The matrix $X$ is said to satisfy the stable nullspace property (stable NSP) relative to $S^0$ with constant $\rho \in (0, 1)$ if

\[
\beta \in \ker X \implies \|\beta_{S^0}\|_1 \le \rho \|\beta_{\overline{S^0}}\|_1. \tag{2.13}
\]


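Theorem 2.9 If $X \in \mathbb{R}^{n \times p}$ satisfies the stable NSP relative to $S^0$ with constant $\rho \in (0, 1)$, then for every vector $\beta^0$ the solution $\hat\beta^{\mathrm{BP}}$ to BP (2.11) when $y = X\beta^0$ satisfies

\[
\|\hat\beta^{\mathrm{BP}} - \beta^0\|_1 \le \frac{2(1 + \rho)}{1 - \rho} \|\beta^0_{\overline{S^0}}\|_1.
\]

(Note that $\beta^0$ is not necessarily supported on $S^0$ here.)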

Consequently, if $\beta^0$ is close to being sparse, in the sense that it can be written as the sum of a vector supported on $S^0$ and a small perturbation, the stable NSP guarantees that the solution to BP is close to $\beta^0$.

Many works in compressed sensing assume a stronger condition than the NSP, called the restricted isometry property (RIP).

Definition 2.10 For a matrix $X \in \mathbb{R}^{n \times p}$ and $k \in [p]$, its restricted isometry constant of order $k$, denoted $\delta_k = \delta_k(X)$, is the smallest $\delta > 0$ such that for every vector $\beta \in \mathbb{R}^p$ with $\|\beta\|_0 \le k$, we have

\[
(1 - \delta)\|\beta\|_2^2 \le \|X\beta\|_2^2 \le (1 + \delta)\|\beta\|_2^2. \tag{2.14}
\]

In particular, for every set $S \subset [p]$ of cardinality $|S| \le k$, one has

\[
1 - \delta_k \le \lambda_{\min}(X_S^T X_S) \le \lambda_{\max}(X_S^T X_S) \le 1 + \delta_k.
\]

We say that $X$ satisfies a restricted isometry property (RIP) of order $k$ if $\delta_k$ is small enough (with an upper bound depending on the specific theoretical result).
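For very small matrices, the eigenvalue characterization above suggests a brute-force way to evaluate $\delta_k$. The sketch below is an illustration only; the function name `restricted_isometry_constant` and the test matrix are assumptions. It enumerates all supports of size exactly $k$, which is enough because the eigenvalues of any smaller principal submatrix are interlaced with those of a larger one. The cost grows combinatorially with $p$ and $k$.

```python
import itertools
import numpy as np

def restricted_isometry_constant(X, k):
    """Brute-force delta_k: largest deviation from 1 of the eigenvalues of X_S^T X_S, |S| = k."""
    p = X.shape[1]
    delta = 0.0
    for S in itertools.combinations(range(p), k):
        XS = X[:, list(S)]
        eig = np.linalg.eigvalsh(XS.T @ XS)      # eigenvalues in ascending order
        delta = max(delta, 1.0 - eig[0], eig[-1] - 1.0)
    return delta

# Hypothetical example: a small Gaussian matrix with N(0, 1/n) entries.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8)) / np.sqrt(30)
delta_1 = restricted_isometry_constant(X, 1)
delta_2 = restricted_isometry_constant(X, 2)
```

A matrix with orthonormal columns has $\delta_k = 0$ for every $k$, whereas two identical unit-norm columns already force $\delta_2 = 1$.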

The proof of Theorem 6.9 in Foucart and Rauhut [2013] shows the following.

Theorem 2.11 If $X$ satisfies

\[
\delta_{2s} < \frac{1}{3}, \tag{2.15}
\]

then it satisfies the stable NSP with constant $\rho = \dfrac{\delta_{2s}}{1 - 2\delta_{2s}}$ relative to every set $S \subset [p]$ such that $|S| = s$.

Since the stable NSP implies the NSP, Theorems 2.7 and 2.11 together imply that under the RIP (2.15), BP recovers any $s$-sparse vector $\beta^0$.

It is hard to construct fixed matrices satisfying the RIP. However, it is known that some random matrices satisfy it with high probability. In particular, the following holds¹.

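Theorem 2.12 If the entries of $X \in \mathbb{R}^{n \times p}$ are i.i.d. $N(0, \frac{1}{n})$, then there exists a constant $C > 0$ such that $X$ satisfies $\delta_s < \delta$ with probability at least $1 - 2e^{-\delta^2 n / (2C)}$, provided $n \ge 2C\delta^{-2} s \log(ep/s)$.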

¹ Theorem 2.12 actually holds more generally for subgaussian matrices.


2.3 The Lasso

In the noisy case, Tibshirani [1996] proposed the Lasso estimator

\[
\hat\beta^{\mathrm{lasso}}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \quad \lambda > 0, \tag{2.16}
\]

providing a trade-off between the fit $\|y - X\beta\|_2^2$ and the $\ell_1$-norm of $\beta$. Note that if $X$ is in general position (see Definition 2.2), the Lasso solution is unique [Tibshirani, 2013].
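To make the trade-off in (2.16) concrete, here is a minimal sketch of one standard way to compute a Lasso solution, proximal gradient descent (ISTA). It is not the solver used in the thesis; the names `soft_threshold` and `lasso_ista`, the iteration count and the simulated data are assumptions made for this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Proximal gradient (ISTA) for 0.5 * ||y - X b||_2^2 + lam * ||b||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient of the fit
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)        # gradient of the quadratic term
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def lasso_objective(X, y, beta, lam):
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

# Hypothetical example.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 15))
y = 2.0 * X[:, 0] - X[:, 5] + 0.1 * rng.standard_normal(40)
lam_max = np.max(np.abs(X.T @ y))          # for lam >= lam_max the zero vector is a solution
beta_hat = lasso_ista(X, y, 0.1 * lam_max)
```

For an orthonormal design the iteration reduces to soft-thresholding $X^T y$ at level $\lambda$, which shows how $\lambda$ shrinks coefficients and sets the small ones exactly to zero.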

The Lasso has been extensively studied in the literature (see Bühlmann and van de Geer [2011] and references therein). In particular, the following condition is crucial for variable selection.

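Definition 2.13 For $\theta \in (0, 1)$, we say that $X \in \mathbb{R}^{n \times p}$ satisfies the $\theta$-irrepresentable condition relative to $\beta^0$ if

\[
\|X_{\overline{S^0}}^T X_{S^0} (X_{S^0}^T X_{S^0})^{-1} \operatorname{sign}(\beta^0_{S^0})\|_\infty \le \theta, \tag{2.17}
\]

where $S^0 = \operatorname{supp}(\beta^0)$.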
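The left-hand side of (2.17) is straightforward to evaluate for a given design and sign pattern. The following sketch is illustrative only (the function name `irrepresentable_value` and the toy matrix are assumptions); it shows that the condition depends on the signs of $\beta^0$ and not only on its support.

```python
import numpy as np

def irrepresentable_value(X, beta0):
    """Left-hand side of (2.17): ||X_Sc^T X_S (X_S^T X_S)^{-1} sign(beta0_S)||_inf."""
    S = np.flatnonzero(beta0)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    if Sc.size == 0:
        return 0.0
    XS = X[:, S]
    w = np.linalg.solve(XS.T @ XS, np.sign(beta0[S]))
    return float(np.max(np.abs(X[:, Sc].T @ XS @ w)))

# Toy design: the third column is correlated with the first two.
X = np.array([[1.0, 0.0, 0.8],
              [0.0, 1.0, 0.8],
              [0.0, 0.0, 0.5]])
same_signs = irrepresentable_value(X, np.array([1.0, 2.0, 0.0]))       # 0.8 + 0.8 = 1.6
opposite_signs = irrepresentable_value(X, np.array([1.0, -2.0, 0.0]))  # 0.8 - 0.8 = 0
```

In this toy example the condition fails when both active coefficients are positive and holds when they have opposite signs.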

The irrepresentable condition is strong and typically excludes cases where variables are too strongly correlated.

It has been shown that the Lasso consistently recovers the support of $\beta^0$ in an asymptotic setting where both $p$ and $s^0$ are allowed to grow with $n$. These results essentially require the irrepresentable condition, as well as lower bounds on $\lambda_{\min}(\frac{1}{n} X_{S^0}^T X_{S^0})$ and $|\beta^0|_{\min}$, where $|\beta^0|_{\min}$ denotes the smallest nonzero absolute component of $\beta^0$. For example, the following has been proved by Zhao and Yu [2006].

Theorem 2.14 Assume that the noise components $\epsilon_i$, $i = 1, \dots, n$, are i.i.d. with $E(\epsilon_i^{2q}) < \infty$ for some $q > 0$ and that there exist constants $0 \le c_1 < c_2 \le 1$ and $M_1, M_2, M_3, M_4 > 0$ such that for every value of $n$,

• $\frac{1}{n}\|X_j\|_2^2 \le M_1$ for every $j$,

• $\lambda_{\min}(\frac{1}{n} X_{S^0}^T X_{S^0}) \ge M_2$,

• $s^0 = O(n^{c_1})$,

• $|\beta^0|_{\min} \ge \dfrac{M_3}{n^{(1 - c_2)/2}}$.

If the $\theta$-irrepresentable condition holds for every $n$, then there exists a sequence $(\lambda_n)_{n \in \mathbb{N}^*}$ such that

\[
\lim_{n \to \infty} P\big(\hat\beta^{\mathrm{lasso}}_{\lambda_n} \overset{s}{=} \beta^0\big) = 1.
\]


Theorem 2.14 concerns fixed design matrices. Wainwright [2009] proved a similar result for (correlated) Gaussian design matrices, i.e. matrices $X$ with i.i.d. $N(0, \Sigma)$ rows. Theorem 3 of that work implies consistent sign recovery when $\Sigma$ satisfies

\[
\|\Sigma_{\overline{S^0} S^0} (\Sigma_{S^0 S^0})^{-1}\|_\infty \le \theta, \quad \theta \in (0, 1), \tag{2.18}
\]

and

\[
\lambda_{\min}(\Sigma_{S^0 S^0}) \ge C_{\min} > 0, \tag{2.19}
\]

provided the sample size $n$ scales as $n = \Omega(s^0 \log(p - s^0))$.

The irrepresentable condition is thus sufficient for consistent support recovery by the Lasso. It turns out that it is also essentially necessary, where "essentially" means that $\theta$ in inequality (2.17) is replaced by 1. This is true even in the noiseless case, as shown by the following result [Bühlmann and van de Geer, 2011, Theorem 7.1].

Theorem 2.15 If $y = X\beta^0$ and $\operatorname{supp}(\hat\beta^{\mathrm{lasso}}_\lambda) \subset \operatorname{supp}(\beta^0)$ for some $\lambda < \dfrac{|\beta^0|_{\min}}{\|(X_{S^0}^T X_{S^0})^{-1}\|_\infty}$, then

\[
\|X_{\overline{S^0}}^T X_{S^0} (X_{S^0}^T X_{S^0})^{-1} \operatorname{sign}(\beta^0_{S^0})\|_\infty \le 1.
\]

As for any other penalized likelihood method, the choice of the tuning parameter $\lambda$ is crucial when applying the Lasso. Cross-validation and traditional information criteria like the Bayesian Information Criterion (BIC) [Schwarz, 1978] or the Akaike Information Criterion (AIC) [Akaike, 1998] are widely used, but they do not yield consistent support recovery in the high-dimensional case. For consistent support recovery when $p$ grows exponentially with $n$, Fan and Tang [2013] proposed the Generalized Information Criterion (GIC), corresponding to

\[
\mathrm{GIC}(\hat\beta) = \|y - X\hat\beta\|_2^2 + \sigma^2 \log(\log n) \log p \, \|\hat\beta\|_0 \tag{2.20}
\]

in the Gaussian linear model. Tuning the Lasso with GIC corresponds to choosing the value of $\lambda$ minimizing $\mathrm{GIC}(\hat\beta^{\mathrm{lasso}}_\lambda)$.

2.4 Thresholded Basis Pursuit

The Lasso is equivalent (see Proposition 3.2 in Foucart and Rauhut [2013]) to Basis Pursuit Denoising (BPD) [Candès et al., 2006b; Donoho et al., 2006],

\[
\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \theta. \tag{2.21}
\]

BPD is the most natural adaptation of BP to the noisy case, since it replaces the equality constraint $y = X\beta$ by $\|y - X\beta\|_2 \le \theta$. Therefore the Lasso can be seen as a simple adaptation of BP to the presence of noise.


Another way to adapt BP to the noisy case is to first solve BP, despite the presence of noise, and then set the small coefficients to zero. This is Thresholded Basis Pursuit (TBP), which was introduced and studied by Saligrama and Zhao [2011]. We denote it $\hat\beta^{\mathrm{TBP}}_\tau$:

\[
\hat\beta^{\mathrm{TBP}}_\tau = \eta_\tau(\hat\beta^{\mathrm{BP}}), \tag{2.22}
\]

where $\eta_\tau(x) = x \, 1_{(\tau, \infty)}(|x|)$, the hard-thresholding function, is applied componentwise.
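The thresholding step of TBP is easy to illustrate. In the sketch below, the function name `hard_threshold` and the vector standing in for a BP solution are hypothetical; only the operator $\eta_\tau$ itself comes from the definition above.

```python
import numpy as np

def hard_threshold(x, tau):
    """eta_tau(x) = x * 1{|x| > tau}, applied componentwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > tau, x, 0.0)

# Hypothetical BP output computed from noisy data: two large and two small nonzero entries.
beta_bp = np.array([2.1, -0.03, 0.0, -1.4, 0.08])
beta_tbp = hard_threshold(beta_bp, tau=0.1)   # keeps 2.1 and -1.4, zeroes the rest
signs = np.sign(beta_tbp)                     # estimated sign vector (1, 0, 0, -1, 0)
```

Note that the inequality is strict: an entry with $|x| = \tau$ is set to zero, in agreement with the indicator of the open interval $(\tau, \infty)$.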

The work of Saligrama and Zhao [2011] focuses on matrices $X \in \mathbb{R}^{n \times p}$ with i.i.d. $N(0, \frac{1}{n})$ entries, and relies on the fact that in this setting some RIP holds with high probability. One of their important intermediate results is the following.

Theorem 2.16 Assume that $y = X(\beta^0 + \xi)$, where $X \in \mathbb{R}^{n \times p}$ has i.i.d. $N(0, \frac{1}{n})$ entries, $\beta^0 \in \mathbb{R}^p$ is $s^0$-sparse, and $\xi \in \mathbb{R}^p$ is a deterministic error. Then there exist constants $c_1, c_2 > 0$ such that with probability at least $1 - e^{-c_1 n}$, the BP solution satisfies

\[
\|\hat\beta^{\mathrm{BP}} - (\beta^0 + \xi)\|_2 \le C \frac{\|\xi\|_1}{\sqrt{s^0}}, \tag{2.23}
\]

provided $n \ge c_2 \, 2s^0 \log(p / 2s^0)$, where the constant $C$ depends only on the restricted isometry constant $\delta_{2s^0}$ of $X$.

Actually, this intermediate result already yields a sign consistency result for TBP, although under a strong beta-min condition (see our discussion in Section 3.3.3). In order to reduce the required signal-to-noise ratio, Saligrama and Zhao [2011] consider a TBP + OLS multistage procedure, defined by the following algorithm.

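Algorithm 2.17 (TBP + OLS)

For $n = 3m$, we denote by $(y^{(1)}, X^{(1)})$ the dataset containing the first $m$ observations and by $(y^{(2)}, X^{(2)})$ the last $2m$ observations. Then for fixed $\tau > 0$,

1. apply the TBP algorithm to $(y^{(1)}, X^{(1)})$, i.e. compute $\hat\beta := \eta_\tau(\hat\beta^{\mathrm{BP}})$ where

\[
\hat\beta^{\mathrm{BP}} := \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad y^{(1)} = X^{(1)}\beta,
\]

2. let $\hat S := \{j \mid \hat\beta_j \ne 0\}$. Using the second dataset, compute the least-squares coefficients for model $\hat S$, i.e.

\[
\hat\beta_{\hat S} \leftarrow \big((X^{(2)}_{\hat S})^T X^{(2)}_{\hat S}\big)^{-1} (X^{(2)}_{\hat S})^T y^{(2)},
\]

3. finally, re-threshold the obtained coefficients at level $\tau$:

\[
\hat\beta^{\mathrm{TBP+OLS}}_\tau := \eta_\tau(\hat\beta).
\]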
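The three steps of the TBP + OLS procedure can be sketched as follows. This is a hedged illustration rather than the code of Saligrama and Zhao [2011]: the function names, the use of SciPy's `linprog` for the BP step, the use of `numpy.linalg.lstsq` in place of the explicit normal equations, and the simulated data are all assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """BP as a linear program in (beta_plus, beta_minus)."""
    p = X.shape[1]
    res = linprog(np.ones(2 * p), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]

def hard_threshold(x, tau):
    return np.where(np.abs(x) > tau, x, 0.0)

def tbp_ols(X, y, tau):
    m = X.shape[0] // 3
    X1, y1 = X[:m], y[:m]                    # first m observations
    X2, y2 = X[m:], y[m:]                    # remaining 2m observations
    beta = hard_threshold(basis_pursuit(X1, y1), tau)   # step 1: TBP on the first part
    S_hat = np.flatnonzero(beta)                         # step 2: selected model
    if S_hat.size:
        beta[S_hat] = np.linalg.lstsq(X2[:, S_hat], y2, rcond=None)[0]
    return hard_threshold(beta, tau)                     # step 3: re-threshold

# Hypothetical example with N(0, 1/n) entries and n = 3m.
rng = np.random.default_rng(1)
n, p, tau = 60, 40, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta0 = np.zeros(p)
beta0[[2, 11]] = [3.0, -3.0]
y = X @ beta0 + 0.05 * rng.standard_normal(n)
beta_hat = tbp_ols(X, y, tau)
```

The least-squares call coincides with the closed-form expression in step 2 whenever $X^{(2)}_{\hat S}$ has full column rank.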

Their main result implies the following.


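Theorem 2.18 Assume that $X \in \mathbb{R}^{n \times p}$ has i.i.d. $N(0, \frac{1}{n})$ entries, $\beta^0 \in \mathbb{R}^p$ is $s^0$-sparse with $\frac{\alpha}{2} p \le s^0 \le \alpha p$ for some $\alpha > 0$ and $|\beta^0|_{\min} = 1$, and $y = X\beta^0 + \epsilon$ where $\epsilon \sim N(0, \sigma^2 I_n)$ with $\sigma^2 \le \frac{C}{\log n}$ for some constant $C > 0$. Then provided $n \ge C'(2s^0) \log(p / 2s^0)$, the TBP + OLS algorithm satisfies

\[
\lim_{n \to \infty} P\big(\hat\beta^{\mathrm{TBP+OLS}}_\tau \overset{s}{=} \beta^0\big) = 1.
\]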

The scalings they obtain for $n$ and $\sigma^2$ are optimal, in the sense that they are (order-wise) necessary for support recovery by any algorithm in the setting they consider.

More recently, Tardivel and Bogdan [2018] proved that the notion of identifiability is necessary and sufficient for TBP to recover the signs of $\beta^0$ in an asymptotic setting where $X$ and $\operatorname{sign}(\beta^0)$ are fixed, but $|\beta^0|_{\min}$ tends to $+\infty$.

Definition 2.19 A vector $\beta^0 \in \mathbb{R}^p$ is said to be identifiable with respect to $X \in \mathbb{R}^{n \times p}$ if it is the unique solution to BP (2.11) when $y = X\beta^0$.

In other words, $\beta^0$ is identifiable if it is correctly and uniquely identified by BP. Actually, the following result of Daubechies et al. [2010] implies that identifiability only depends on the signs of $\beta^0$.

Lemma 2.20 A vector $\beta^0 \in \mathbb{R}^p$ is identifiable with respect to $X \in \mathbb{R}^{n \times p}$ if and only if for every $\beta \in \ker X \setminus \{0\}$,

\[
|\operatorname{sign}(\beta^0)^T \beta| < \|\beta_{\overline{S^0}}\|_1,
\]

where $S^0 = \operatorname{supp}(\beta^0)$.

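Tardivel and Bogdan [2018] considered a fixed matrix $X \in \mathbb{R}^{n \times p}$ and a sequence $\{\beta^{(r)}\}_{r \in \mathbb{N}^*}$ of vectors in $\mathbb{R}^p$ such that

i) there exists a sign vector $\theta \in \{1, -1, 0\}^p$ such that $\operatorname{sign}(\beta^{(r)}) = \theta$ for every $r \in \mathbb{N}^*$,

ii) $\lim_{r \to +\infty} |\beta^{(r)}|_{\min} = +\infty$,

iii) there exists $q > 0$ such that for every $r \in \mathbb{N}^*$, $\dfrac{|\beta^{(r)}|_{\min}}{\|\beta^{(r)}\|_\infty} \ge q$.

Let us denote by $\hat\beta^{\mathrm{BP}(r)}$ the BP solution when $y = X\beta^{(r)} + \epsilon$, and by $\hat\beta^{\mathrm{TBP}(r)}_\tau$ the corresponding TBP estimate. Their result is the following.

Theorem 2.21 Let $X \in \mathbb{R}^{n \times p}$ and let $\{\beta^{(r)}\}_{r \in \mathbb{N}^*}$ be a sequence satisfying assumptions i)-iii) above. If the sign vector $\theta$ is identifiable with respect to $X$, then for every $\epsilon \in \mathbb{R}^n$ there exists $R = R(\epsilon) > 0$ such that for every $r \ge R$ there is a threshold $\tau > 0$ for which

\[
\hat\beta^{\mathrm{TBP}(r)}_\tau \overset{s}{=} \theta. \tag{2.24}
\]

Conversely, if for some $\epsilon \in \mathbb{R}^n$ and $r \in \mathbb{N}^*$ there is a threshold $\tau > 0$ for which (2.24) holds, then $\theta$ (and therefore every $\beta^{(r)}$) is identifiable with respect to $X$.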


2.5 Missing covariates

In practice, data are often only partially observed. Knowing how to deal with incomplete data is important for good prediction, estimation and/or model selection. For simplicity we focus on missing values in $X$, even though our contribution in Chapter 5 directly extends to missing data in both $X$ and $y$.

The process that governs the probability of data points being missing is called the missing data mechanism. Below, missingness refers to the distribution of the missing data indicator matrix $M \in \mathbb{R}^{n \times p}$, where $M_{ij} = 1$ if $X_{ij}$ is missing and $M_{ij} = 0$ otherwise. One traditionally distinguishes three missing data mechanisms, originally introduced by Rubin [1976] (see also Little and Rubin [2002]):

• missing completely at random (MCAR): missingness does not depend on the values of the data, missing or observed;

• missing at random (MAR): missingness depends on the observed values, but not on the components that are missing;

• missing not at random (MNAR): missingness depends on the missing values in $X$.

Let us briefly review the most common methods to handle missing covariates.

Complete case analysis drops all incomplete rows of $X$ and then proceeds with a usual estimation method. In addition to yielding biased estimates under MAR or MNAR, this method has the major disadvantage of losing a lot of information, especially in the high-dimensional setting. Indeed, when the number $p$ of covariates is large, say $p = 1000$, one might have to drop an entire row because a single covariate is missing, even if the 999 other ones are observed.

The Expectation-Maximization (EM) algorithm [Dempster et al., 1977] is commonly used to compute a maximum likelihood estimator in the presence of missing values. It is an iterative method, alternating between an expectation step, which computes the expected log-likelihood function under the current parameter estimate, and a maximization step, which looks for the parameter maximizing this expected log-likelihood. EM requires specifying a model for the covariates, which is not an easy task when $p$ is large, and typically leads to nonconvex problems.

Imputation is the task of replacing missing values by an estimate before applying an estimation procedure. Mean imputation is arguably the simplest imputation technique: it replaces a missing entry by the mean of all observed values in the corresponding column. Regression imputation techniques allow a better imputation by taking into account the correlation between covariates, but again this requires specifying a model for the covariates. Multiple imputation procedures (see e.g. Van Buuren [2012]) make it possible to account for the variability of estimation procedures that use imputed data. They consist in building several imputed datasets, typically with stochastic regression imputation, and aggregating the different estimates obtained.
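As a small illustration of the simplest of these techniques, mean imputation can be written in a few lines. The sketch assumes that missing entries are encoded as NaN; the function name `mean_impute` and the toy matrix are hypothetical.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN by the mean of the observed values in its column."""
    X = np.array(X, dtype=float)              # work on a copy
    col_means = np.nanmean(X, axis=0)         # column means over observed entries only
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

# Toy example: one missing entry in each column.
X_obs = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])
X_imp = mean_impute(X_obs)   # the missing entries become 6.0 (column 2) and 2.0 (column 1)
```

A column with no observed entry has no mean to impute, so such columns would need separate handling.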


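In the case of high-dimensional data, Loh and Wainwright [2012] and Datta and Zou [2017] proposed substitutes for the Lasso based on the following observation. After rescaling the tuning parameter $\lambda$ to $\lambda/n$, the Lasso optimization problem (2.16) can be equivalently written

\[
\hat\beta^{\mathrm{lasso}}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \beta^T \Big(\frac{1}{n} X^T X\Big) \beta - \beta^T \Big(\frac{1}{n} X^T y\Big) + \lambda \|\beta\|_1.
\]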

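Assuming the rows of $X$ are i.i.d. with expectation zero and covariance matrix $\Sigma$ and $y$ is generated by the linear model (2.1), then $\frac{1}{n} X^T X$ and $\frac{1}{n} X^T y$ are consistent estimators of $\Sigma$ and $\gamma := \Sigma\beta^0$ respectively. In the case of noisy or missing covariates, Loh and Wainwright [2012] proposed to replace them by other appropriate estimates. In the case of MCAR data, where each entry is missing with probability $\rho \in [0, 1)$, they use

\[
\hat\Sigma := \frac{1}{n} \tilde X^T \tilde X - \rho \operatorname{diag}\Big(\frac{1}{n} \tilde X^T \tilde X\Big), \tag{2.25}
\]

and

\[
\hat\gamma := \frac{1}{n} \tilde X^T y, \tag{2.26}
\]

where

\[
\tilde X_{ij} :=
\begin{cases}
\dfrac{X_{ij}}{1 - \rho} & \text{if } X_{ij} \text{ is observed}, \\
0 & \text{otherwise}.
\end{cases}
\]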
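The surrogates (2.25) and (2.26) can be computed directly from the incomplete matrix. The sketch below assumes that missing entries are encoded as NaN and that $\rho$ is known; the function name `mcar_surrogates` and the simulated data are hypothetical.

```python
import numpy as np

def mcar_surrogates(X_obs, y, rho):
    """Sigma_hat (2.25) and gamma_hat (2.26) from a matrix with NaN for missing entries."""
    n = X_obs.shape[0]
    X_tilde = np.where(np.isnan(X_obs), 0.0, X_obs) / (1.0 - rho)   # zero-fill, then rescale
    G = X_tilde.T @ X_tilde / n
    Sigma_hat = G - rho * np.diag(np.diag(G))    # correct the inflated diagonal
    gamma_hat = X_tilde.T @ y / n
    return Sigma_hat, gamma_hat

# Hypothetical MCAR example with p > n.
rng = np.random.default_rng(0)
n, p, rho = 10, 20, 0.2
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.1 * rng.standard_normal(n)
mask = rng.random((n, p)) < rho                  # each entry missing with probability rho
X_obs = np.where(mask, np.nan, X)
Sigma_hat, gamma_hat = mcar_surrogates(X_obs, y, rho)
min_eig = np.linalg.eigvalsh(Sigma_hat).min()    # negative here: Sigma_hat is indefinite
```

When $p > n$ the matrix $\frac{1}{n}\tilde X^T \tilde X$ is singular, so subtracting a positive multiple of its diagonal produces a negative eigenvalue in this example.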

The obtained matrix $\hat\Sigma$ may not be positive semi-definite, and therefore the resulting problem may be nonconvex. This is why here and in Chapter 5 we call this estimator NClasso, for “nonconvex Lasso”:

\[
\hat\beta^{\mathrm{NClasso}} \in \arg\min_{\|\beta\|_1 \le R} \frac{1}{2} \beta^T \hat\Sigma \beta - \beta^T \hat\gamma + \lambda \|\beta\|_1. \tag{2.27}
\]

The additional constraint $\|\beta\|_1 \le R$ is made necessary by the nonconvexity, to ensure the existence of a solution. Surprisingly, Loh and Wainwright [2012] proved that the nonconvexity of (2.27) is not an issue, since all global minima lie in the same small neighborhood, and solving it with an algorithm based on projected gradient descent gives a solution that is close to this neighborhood. Their result holds provided $R = b_0 \sqrt{s^0}$, where $b_0 \ge \|\beta^0\|_2$, which cannot be checked in practice. Moreover, it holds only under the MCAR assumption and concerns only the estimation error.

Later, Datta and Zou [2017] convexified problem (2.27) by replacing $\hat\Sigma$ (2.25) by the nearest positive semi-definite matrix, as measured by the elementwise maximum norm. More precisely, they define

\[
(\hat\Sigma)_+ := \arg\min_{A \succeq 0} \max_{1 \le i, j \le p} |A_{ij} - \hat\Sigma_{ij}|. \tag{2.28}
\]

Their Convex Conditioned Lasso (CoCoLasso) is then defined by

\[
\hat\beta^{\mathrm{CoCoLasso}} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \beta^T (\hat\Sigma)_+ \beta - \beta^T \hat\gamma + \lambda \|\beta\|_1, \tag{2.29}
\]

where $(\hat\Sigma)_+$ and $\hat\gamma$ are defined by (2.28) and (2.26) respectively. In addition to bounding the estimation error, the authors prove sign consistency of CoCoLasso in the case of MCAR data, assuming that the entries of $X$ are bounded and under the irrepresentable condition (2.18) and the bound (2.19).

The sign consistency result of Datta and Zou [2017] seems to be the only existing result about support recovery in high dimension in the presence of missing data. Note however that the works of Loh and Wainwright [2012] and Datta and Zou [2017] deal more generally with the issue of errors-in-variables, and both works treat the missing data issue as a special case of multiplicative error. In particular, they do not provide any results of numerical experiments in the missing data setting.

2.6 The sparse corruption problem

In the linear model (2.1), one generally assumes that the error $\epsilon$ is dense, in the sense that all its entries are almost surely nonzero. However, in some applications the response is occasionally corrupted by a large magnitude error. This can be modeled by the following sparse corruption model,

\[
y = X\beta^0 + \sqrt{n}\,\omega^0 + \epsilon, \tag{2.30}
\]

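where $\omega^0 \in \mathbb{R}^n$ is a deterministic sparse vector of corruptions with arbitrarily large nonzero entries, and $\epsilon \sim N(0, \sigma^2 I_n)$ is the usual dense noise. The scaling $\sqrt{n}\,\omega^0$ is used since $X$ is generally standardized so that each column has Euclidean norm $\|X_j\|_2 = \sqrt{n}$. We will denote the support of $\omega^0$ by

\[
T^0 := \operatorname{supp}(\omega^0),
\]

and its sparsity index by

\[
k^0 := |T^0|.
\]

By concatenating $X$ with $\sqrt{n} I_n$, (2.30) can be rewritten

\[
y = \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix} \begin{bmatrix} \beta^0 \\ \omega^0 \end{bmatrix} + \epsilon,
\]

which is simply a sparse linear model with augmented design matrix $\begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix}$ and sparse vector $\begin{bmatrix} \beta^0 \\ \omega^0 \end{bmatrix}$. An important line of work has therefore followed the natural approach of applying standard $\ell_1$-techniques to the pair $(y, \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix})$ to jointly estimate $\beta^0$ and $\omega^0$. When $\epsilon = 0$, applying BP to the pair $(y, \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix})$ and using a tuning parameter $\lambda > 0$ to allow a different penalization for $\beta$ and $\omega$ gives the Justice Pursuit² (JP) estimator

\[
(\hat\beta^{\mathrm{JP}}, \hat\omega^{\mathrm{JP}}) = \arg\min_{\beta \in \mathbb{R}^p,\, \omega \in \mathbb{R}^n} \|\beta\|_1 + \lambda \|\omega\|_1 \quad \text{s.t.} \quad y = X\beta + \sqrt{n}\,\omega. \tag{2.31}
\]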
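Since JP is BP applied to the augmented design with a weighted $\ell_1$-norm, it can be sketched as a linear program in the same way as BP. The code below is an illustration under assumptions: the function name `justice_pursuit`, the use of SciPy's `linprog` and the simulated corrupted data are not taken from the cited works.

```python
import numpy as np
from scipy.optimize import linprog

def justice_pursuit(X, y, lam=1.0):
    """Solve min ||beta||_1 + lam * ||omega||_1 s.t. y = X beta + sqrt(n) omega."""
    n, p = X.shape
    A = np.hstack([X, np.sqrt(n) * np.eye(n)])            # augmented design [X, sqrt(n) I_n]
    w = np.concatenate([np.ones(p), lam * np.ones(n)])    # weight 1 on beta, lam on omega
    res = linprog(np.concatenate([w, w]), A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x[:p + n] - res.x[p + n:]                      # positive part minus negative part
    return z[:p], z[p:]

# Hypothetical noiseless example with a single corrupted response.
rng = np.random.default_rng(0)
n, p = 30, 10
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[[1, 6]] = [2.0, -1.0]
omega0 = np.zeros(n)
omega0[4] = 5.0
y = X @ beta0 + np.sqrt(n) * omega0
beta_jp, omega_jp = justice_pursuit(X, y, lam=1.0)
```

The program is always feasible because $\sqrt{n} I_n$ has full rank, and the returned pair has an objective value no larger than that of $(\beta^0, \omega^0)$.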

The first works on JP considered $\lambda = 1$. It was originally introduced by Wright et al. [2009] in the context of face recognition, and further analyzed in Wright and Ma [2010] for highly correlated dictionaries. Laska et al. [2009] and Li et al. [2010] proved that if the entries of $X$ are i.i.d. standard Gaussian, then $\frac{1}{\sqrt{n}} \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix}$ satisfies some RIP with high probability, implying exact recovery of both $\beta^0$ and $\omega^0$. However, the sparsity level assumed for $\omega^0$ in these works does not allow a large proportion of corruptions.

In order to asymptotically allow a positive fraction of corruptions (i.e. $k^0 = O(n)$), Li [2013] and Nguyen and Tran [2013a] introduced the tuning parameter $\lambda \ne 1$ in (2.31). Li [2013] deals with i.i.d. Gaussian designs, whereas the rows of the design matrix in Nguyen and Tran [2013a] are randomly drawn from the rows of an orthogonal matrix.

When $\epsilon \ne 0$, Nguyen and Tran [2013b] proposed the Robust Lasso,

\[
\big(\hat\beta^{\mathrm{Rlasso}}_{(\lambda_\beta, \lambda_\omega)}, \hat\omega^{\mathrm{Rlasso}}_{(\lambda_\beta, \lambda_\omega)}\big) = \arg\min_{\beta \in \mathbb{R}^p,\, \omega \in \mathbb{R}^n} \frac{1}{2} \|y - X\beta - \sqrt{n}\,\omega\|_2^2 + \lambda_\beta \|\beta\|_1 + \lambda_\omega \|\omega\|_1. \tag{2.32}
\]

Problem (2.32) is actually equivalent to $\ell_1$-penalized Huber loss regression (see e.g. Sardy et al. [2001] or Dalalyan and Thompson [2019]). Nguyen and Tran [2013b] extended the theoretical analysis of the Lasso to the Robust Lasso for estimation error and model selection. Regarding model selection, they proved consistent sign recovery for correlated Gaussian designs, under the conditions (2.18) and (2.19). When the number of large corruptions $k^0$ is arbitrarily close to $n$, their result requires a number of measurements $n$ satisfying $\frac{n}{\log n} = \Omega(s^0 \log(p - s^0))$.

² Name coined by Laska et al. [2009].
