Thesis
Reference
Sparse Support Recovery with Thresholded Basis Pursuit and Lasso-Zero, and an Extension to Handle Missing Data
DESCLOUX, Pascaline
Abstract
This thesis addresses the problem of variable selection in the high-dimensional linear regression model. It complements the existing literature on the Thresholded Basis Pursuit (TBP) estimator and, more generally, on the underlying idea of overfitting the model and then thresholding the resulting coefficients. First, new theoretical guarantees for the recovery of the sign vector by TBP are proved.
Second, an extension of TBP, called Lasso-Zero, is introduced. Its novelty lies in the use of several noise dictionaries, concatenated to the design matrix in order to account for the presence of noise in the overfitting step. Finally, a robust extension of Lasso-Zero is proposed for variable selection in the presence of missing data.
DESCLOUX, Pascaline. Sparse Support Recovery with Thresholded Basis Pursuit and Lasso-Zero, and an Extension to Handle Missing Data. Thèse de doctorat : Univ. Genève, 2019, no. Sc. 5430
DOI : 10.13097/archive-ouverte/unige:139014 URN : urn:nbn:ch:unige-1390143
Available at:
http://archive-ouverte.unige.ch/unige:139014
Disclaimer: layout of this document may differ from the published version.
Université de Genève Faculté des Sciences
Section de mathématiques Professeur Sylvain Sardy
Sparse Support Recovery with Thresholded Basis Pursuit and Lasso-Zero, and an Extension to Handle Missing Data.
Thesis
presented to the Faculty of Science of the University of Geneva for the degree of Doctor of Science, with a specialization in statistics,
by
Pascaline Privitera-Descloux
from
Sâles (FR)
Thesis no. 5430
Geneva, 2020
Acknowledgements
The realization of this thesis would have been much harder without the help and support of many people.
First of all, I would like to express my deepest gratitude to my supervisor, Sylvain Sardy, for his precious advice and guidance, and for making me feel like a collaborator more than a student.
I also wish to thank Claire Boyer, Sebastian Engelke, Nick Hengartner and Julie Josse, for kindly agreeing to be part of my thesis committee and for their interesting comments and questions.
Thank you to Caroline Giacobino and Parisa Mamooler, for sharing their experience and for the nice time spent together during lunch breaks.
Many thanks to Jairo Diaz, for regularly sharing ideas and for his very fast answers when I asked for pieces of code.
I would like to thank the maintainers of the Baobab HPC cluster of the University of Geneva, for making this precious tool available and for their fast assistance when needed.
I acknowledge the CUSO (Conférence Universitaire de Suisse Occidentale), both for the Doctoral Program in Statistics, which allowed me to learn a lot and meet many peers from other universities, and the transversal program, in which I gained skills beyond mathematics and statistics.
Many thanks to the secretaries and librarians from the Section de Mathématiques for their kind help.
Thank you to all of my friends at the Section de Mathématiques for our happy breaks at the Z-bar.
More broadly, many thanks to all of my friends for the good times spent together and for making me completely disconnect from my research.
I also would like to thank my parents and my brothers, for their support and for being the greatest family.
Last but not least, thank you Giuseppe. For everything.
Contents
Acknowledgements
Résumé
Summary
1 Introduction
1.1 Motivation and contribution
1.2 Organization of the thesis
1.3 Notation
2 Background on sparse vector recovery
2.1 The sparse high-dimensional linear model
2.2 Basis Pursuit
2.3 The Lasso
2.4 Thresholded Basis Pursuit
2.5 Missing covariates
2.6 The sparse corruption problem
3 Thresholded Basis Pursuit and Thresholded Justice Pursuit
3.1 Thresholded Justice Pursuit
3.2 Sign recovery under the stable nullspace property
3.2.1 Analysis of TBP
3.2.2 Analysis of TJP
3.3 Sign consistency for correlated Gaussian designs
3.3.1 Analysis of TBP
3.3.2 Analysis of TJP
3.3.3 Discussion
4 Lasso-Zero
4.1 Definition of Lasso-Zero
4.2 First numerical experiment
4.3 Choice of tuning parameter by QUT
4.3.1 Quantile universal thresholding
4.3.2 Guarantees in the low-dimensional case
4.3.3 Approximation by a GEV distribution
4.4 Numerical experiments
4.4.1 Simulation settings
4.4.2 Choice of competitors
4.4.3 Results
5 Handling Missing Data with Robust Lasso-Zero
5.1 Robust Lasso-Zero for the sparse corruption problem
5.2 Robust Lasso-Zero for missing covariates
5.2.1 The problem of missing data as a sparse corruption problem
5.2.2 Selection of tuning parameters
5.3 Numerical experiments
5.3.1 Choice of estimators and tuning parameters
5.3.2 With $s^0$-oracle tuning
5.3.3 With automatic tuning of parameters
5.4 Discussion
A Gaussian random matrices
A.1 Bounds for singular values
A.2 Restricted eigenvalue property
B Supplementary numerical results
Bibliography
Résumé
This thesis addresses the problem of variable selection in the high-dimensional linear regression model, that is, identifying the set of important variables when the number of potential predictors exceeds the sample size. Mathematically, this amounts to identifying the locations of the nonzero coefficients of the regression vector, which is assumed to be sparse. Our contribution is closely related to the Thresholded Basis Pursuit (TBP) estimator, which looks for a solution to the system $y = X\beta$ with minimal $\ell_1$-norm, then sets the small coefficients of this solution to zero.
The main objective of this thesis is to complement the existing literature on TBP and, more generally, on the underlying idea of overfitting the model and then thresholding the resulting coefficients. Indeed, in previous work, TBP has been studied either under the assumption that the predictors are i.i.d. Gaussian, or in an asymptotic framework in which the dimensions of the dataset are fixed and the absolute values of the nonzero coefficients of the regression vector grow. Yet, even though independence of the predictors is often assumed (in particular in the compressed sensing literature), statisticians prefer results that provide guarantees when the predictors are dependent, which better reflects the datasets encountered in practice. Moreover, one generally seeks to prove that the proposed estimator is consistent. In the context of variable selection, this means that the estimator recovers the correct set of variables with probability converging to 1 as the sample size tends to infinity (and not with fixed dataset dimensions).
Concretely, we prove that the sign vector of the TBP solution is a consistent estimator of the signs of the true regression vector when the rows of the design matrix follow a multivariate normal distribution, in an asymptotic framework allowing both the sample size and the number of predictors to grow.
Then, we introduce an extension of TBP that we call Lasso-Zero. Its novelty lies in the use of several noise dictionaries, concatenated to the design matrix in order to account for the presence of noise in the overfitting step.
We choose the threshold by quantile universal thresholding (QUT), and propose to exploit the specificity of Lasso-Zero, namely the coefficients associated with the noise dictionaries, to bypass the need to estimate the noise level. Simulation studies are carried out to demonstrate the effectiveness of our estimator.
We also address the problem of variable selection when data are missing. When the number of predictors is large, the usual methods either result in a substantial loss of information or are difficult to apply because they require specifying a model for the predictors and/or the missing data mechanism. We avoid this difficulty by using a single naively imputed matrix and by showing a connection between the missing data problem and the sparse corruption model, which assumes that the signal is corrupted by two noise sources, one dense and the other sparse with potentially large entries. We propose an extension of Lasso-Zero, called Robust Lasso-Zero, for variable selection under sparse corruption, and apply it to the missing data problem in a simulation study. We also introduce Thresholded Justice Pursuit (TJP), the analogue of TBP for the sparse corruption model, and extend the results about TBP to TJP.
Summary
This thesis concerns the problem of variable selection in the high-dimensional linear regression model, that is, the problem of recovering the set of important variables in cases where the number of potential predictors exceeds the sample size. Mathematically, this corresponds to identifying the locations of the nonzero entries of the regression vector, which is assumed to be sparse. Our contribution is closely related to the Thresholded Basis Pursuit (TBP) estimator, which first looks for a solution to $y = X\beta$ with minimal $\ell_1$-norm, and then sets the small coefficients to zero.
The main objective of this thesis is to complement the existing literature on TBP and, more generally, on the underlying idea of overfitting the model and then thresholding the obtained coefficients. Indeed, in previous works, TBP has been studied either assuming that the predictors are i.i.d. and Gaussian, or in an asymptotic framework in which the dimensions of the dataset are fixed and the absolute values of the nonzero coefficients of the regression vector grow.
However, even though the i.i.d. case is often assumed (especially in the compressed sensing literature), statisticians prefer results providing guarantees in cases where the predictors are dependent, better reflecting the datasets they encounter in practice. Also, one generally seeks to demonstrate the consistency of the proposed estimator. In the context of variable selection, this means that the estimator recovers the correct set of variables with probability tending to 1 as the sample size tends to infinity (and not with fixed dimensions of the dataset).
Concretely, we demonstrate the sign consistency of TBP when the rows of the design matrix follow a multivariate Gaussian distribution, in an asymptotic framework allowing the sample size as well as the number of predictors to grow.
Then, we introduce an extension of TBP called Lasso-Zero. Its novelty resides in the use of several random noise dictionaries that are appended to the design matrix in order to take the presence of noise into account in the overfitting step. We select its tuning parameter by quantile universal thresholding, and propose to exploit the specificity of Lasso-Zero, namely the coefficients associated with the noise dictionaries, to bypass the necessity of estimating the noise level. Numerical experiments are performed to demonstrate the efficiency of our estimator.
We also address the problem of model selection in case of missing data. When the number of predictors is large, common methods either result in a substantial loss of information, or are difficult to apply because they require specifying a model for the variables and/or the missing data mechanism. We circumvent this issue by using a single naively imputed matrix and by showing a connection between the problem of missing data and the sparse corruption model, which assumes that the signal is corrupted by two noise sources, one dense, the other sparse with potentially large entries. We propose an extension of Lasso-Zero, called Robust Lasso-Zero, for variable selection in case of sparse corruption, and apply it to the missing data problem in numerical experiments. We also introduce Thresholded Justice Pursuit (TJP), the analogue of TBP for the sparse corruption model, and extend theoretical results about TBP to TJP.
Chapter 1
Introduction
1.1 Motivation and contribution
Regression analysis, one of statisticians' most frequent tasks, estimates the relationships between variables and predicts the value of an outcome $y \in \mathbb{R}$ given a set of predictors.
It is often assumed that the relation between variables is linear, i.e. $y = x^T\beta^0 + \epsilon$, where $x = (x_1, \dots, x_p)^T \in \mathbb{R}^p$ is the vector of predictors, $\epsilon$ is a zero-mean noise term and $\beta^0 \in \mathbb{R}^p$ is the vector of regression coefficients, which is generally estimated by the least-squares method.
But even though the linear model probably is the simplest one can assume, there are cases in which traditional least-squares estimation cannot be performed. Indeed, when the number of parameters exceeds the number of observations, the least-squares problem has infinitely many solutions and yields overfitted models that fail to predict future observations reliably.
Nowadays, these high-dimensional datasets often arise in practice. Genetics provides a typical modern example, with datasets containing expression levels of thousands of genes for only a few observational units. In such cases, additional assumptions about the structure of the problem need to be made. Aiming at making models simpler and easier to interpret, a common assumption is that only a few of the predictors (or covariates) at hand are actually relevant for predicting the outcome (or response). For example, it is generally assumed that only a few genes are related to some medical trait, such as the probability of developing a disease.
In this context, statisticians and practitioners are often interested in identifying the set of relevant covariates. This is the task of model selection, or variable selection. Mathematically, assuming that only a few variables are relevant in our problem corresponds to the assumption of sparsity, meaning that most entries of $\beta^0$ are zero, and the task of model selection corresponds to correctly identifying the support of $\beta^0$, the set of indices of the nonzero entries.
With a large number of covariates, traditional search strategies like best subset or stepwise selection methods (see e.g. Hastie et al. [2009]) are computationally expensive or even intractable. The use of $\ell_1$-regularization provides a very attractive alternative. More precisely, the Lasso [Tibshirani, 1996] looks for a trade-off between the fit of the model and the $\ell_1$-norm of the coefficient vector, that is, the sum of the absolute values of its entries. This has the effect of shrinking the estimated coefficients towards zero, and has the advantage of setting some of them exactly to zero. The amount of shrinkage is governed by a positive regularization parameter $\lambda$, which consequently influences the size of the selected model. The simultaneous variable selection and estimation provided by the Lasso, as well as the fast algorithms that have been developed to solve it [Friedman et al., 2010], contributed to the popularity of the Lasso and of $\ell_1$-regularization in general.
Figure 1.1 – Path of all Lasso solutions $\hat\beta^{\mathrm{lasso}}_{\lambda,j}$ (vertical axis) as a function of $\lambda$ (horizontal axis), for a signal generated according to setting (b) in Section 4.4.1; horizontal lines mark the thresholds $\tau$ and $-\tau$. The support $S^0$ has size 10 and is represented by the red curves. None of the Lasso solutions recovers $S^0$, whereas the BP solution, corresponding to the limit Lasso solution as $\lambda$ tends to 0, recovers $S^0$ exactly if thresholded properly.
Theoretical properties of the Lasso have been extensively studied. In terms of model selection, it is known to include all important variables with high probability under fairly weak conditions (see Bühlmann and van de Geer [2011] and references therein), but it tends to include too many false positives [Su et al., 2017], and exact support recovery can be achieved only under a strong condition, called the irrepresentable condition [Zhao and Yu, 2006]. This difficulty in recovering the correct model is generally attributed to an excessive amount of shrinkage. Some proposals have been made to solve this issue, such as the two-stage adaptive Lasso [Zou, 2006], or the nonconvex SCAD penalty [Fan and Peng, 2004].
Figure 1.1 illustrates a typical example where the true support cannot be recovered by the Lasso (it is a realization of simulation setting (b) described in Section 4.4.1, with 10 truly nonzero coefficients). The curves represent all components $\hat\beta^{\mathrm{lasso}}_{\lambda,j}$ of the Lasso solution as $\lambda$ varies. The red curves correspond to indices $j$ belonging to the support of $\beta^0$, which we denote by $S^0$. Looking at the solution path, it is clear that the Lasso cannot recover the true support in this case, since there is no value of $\lambda$ for which the set of nonzero coefficients equals $S^0$.
On the other hand, Figure 1.1 shows a clear separation between the red and black curves at the limit solution of the path as $\lambda$ tends to zero, that is, where shrinkage is the weakest.
Interestingly, in the high-dimensional case, the limit $\lim_{\lambda \to 0^+} \hat\beta^{\mathrm{lasso}}_{\lambda}$ of the Lasso path is a solution to the so-called Basis Pursuit (BP) problem [Fuchs, 2004]. BP has been well studied in the compressed sensing literature. It is known to exactly recover $\beta^0$ in the noiseless case, provided it is sparse enough (see Foucart and Rauhut [2013] and references therein). Setting the small coefficients of BP to zero corresponds to Thresholded Basis Pursuit (TBP), which was introduced by Saligrama and Zhao [2011]. Thus, Figure 1.1 illustrates an example where the Lasso does not achieve support recovery, but TBP does for an appropriate choice of threshold, e.g. the one indicated horizontally.
Little work is available on TBP. Saligrama and Zhao [2011] studied sign recovery, which is slightly stronger than support recovery, by a multistage procedure combining TBP and ordinary least-squares estimation. Their results hold for i.i.d. Gaussian covariates, a common assumption in the compressed sensing literature, in which the measurement matrix can be chosen. However, in statistical applications, covariates are typically correlated.
More recently, Tardivel and Bogdan [2018] showed that the condition of identifiability is necessary and sufficient for sign recovery by TBP, but only in an asymptotic setting where the matrix of covariates is fixed, which prevents the dimensions of the dataset from growing.
This thesis aims at complementing the literature on TBP and the “overfit, then threshold” paradigm underlying it. Firstly, we bring novel theoretical results about TBP, showing that it achieves sign recovery under the so-called stable nullspace property, and proving sign consistency for correlated Gaussian matrices. Compared to previous works on TBP, our assumptions are closer to the ones that are usually made in the statistical literature, permitting covariates to be correlated and the numbers of observations and covariates to grow.
Secondly, we introduce an extension of TBP, called Lasso-Zero. Apart from focusing on the limit Lasso solution at $\lambda = 0$ and the use of thresholding (which motivated its name), the novelty of Lasso-Zero resides in the use of several noise dictionaries in the overfitting step, followed by the aggregation of the corresponding estimates. It aims at improving TBP in cases where the signal-to-noise ratio is low, since TBP ignores the presence of noise in the first step.
Thirdly, we also propose an extension of Lasso-Zero that handles incomplete data, a recurring problem in practice. Causes of missing data are various: non-response to a survey question, manual error, failure of a measuring instrument, etc. Efficiently handling missing data is particularly difficult in high dimension, since common approaches either result in a substantial loss of information, or require specifying a model for the (very) large set of covariates. To bypass this issue, we rely on a single (possibly naively) imputed matrix, and propose a robust extension of Lasso-Zero that takes into account the error caused by imputation. This robust extension relies on the simple observation that when $\beta^0$ is sparse, then so is the error induced by imputation. This naturally establishes a relation between the problem of missing covariates in high-dimensional datasets and the sparse corruption problem. Given this relation, we will define Thresholded Justice Pursuit (TJP), the analogue of TBP for the sparse corruption problem, and extend all our theoretical results to this estimator. Just as Lasso-Zero is obtained by solving TBP on augmented matrices, Robust Lasso-Zero is obtained by doing the same with TJP.
1.2 Organization of the thesis
Existing estimators and results that are relevant for the remainder of the thesis are collected in Chapter 2, as well as in the appendices. Chapters 3, 4 and 5 are dedicated to our own contributions.
We start by providing some background on the high-dimensional linear model and support recovery in Chapter 2. After introducing some generalities in Section 2.1, BP, the Lasso and TBP are presented in Sections 2.2, 2.3 and 2.4 respectively, together with some theoretical results. In Sections 2.5 and 2.6, we discuss the problems of missing data and sparse corruptions as well as some existing estimators.
Chapter 3 is dedicated to sign recovery properties of TBP and TJP. Section 3.1 formally defines TJP and proves that the notion of identifiability is necessary and sufficient for sign recovery, thus extending a result of Tardivel and Bogdan [2018] to TJP. In Section 3.2, TBP (TJP) is analyzed assuming the (extended) stable nullspace property, under which it is proved that sign recovery holds provided the absolute values of the nonzero coefficients are larger than a specified lower bound. These results are used in Section 3.3 to prove sign consistency for correlated Gaussian matrices.
In Chapter 4, we introduce our novel estimator, Lasso-Zero. The idea and formal definition are presented in Section 4.1. After a first small numerical experiment in Section 4.2, the crucial question of the threshold selection is discussed in Section 4.3. We opt for the quantile universal threshold [Giacobino et al., 2017], derived for Lasso-Zero in Section 4.3.1. Some guarantees about the family-wise error rate in the low-dimensional case are given in Section 4.3.2. Estimation of the quantile universal threshold using a GEV approximation is discussed in Section 4.3.3. Finally, numerical experiments are presented in Section 4.4.
Chapter 5 proposes a robust extension of Lasso-Zero for the sparse corruption model and focuses on missing data as a particular application. We start by formally defining Robust Lasso-Zero in Section 5.1. Section 5.2 then establishes a relation between the sparse corruption problem and the issue of missing covariates, and proposes to use Robust Lasso-Zero in case of incomplete data. Section 5.3 is dedicated to numerical experiments.
1.3 Notation
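$[p]$ - the set $\{1, \dots, p\}$ (where $p \in \mathbb{N}^*$)
$|S|$ - the cardinality of the set $S$
$\overline{S}$ - the complement of the set $S \subset [p]$
$S_1 \sqcup S_2$ - the union of two disjoint sets $S_1$ and $S_2$
$1_S$ - the indicator function of the set $S$, i.e. $1_S(x) = 1$ if $x \in S$, and $1_S(x) = 0$ otherwise
$\beta_S$ - the restriction of the vector $\beta \in \mathbb{R}^p$ to the set of indices $S \subset [p]$, or the vector of $\mathbb{R}^p$ obtained by setting to zero the coefficients of $\beta$ indexed by $\overline{S}$ (will be clear depending on the context)
$\mathrm{supp}(\beta)$ - the support of the vector $\beta \in \mathbb{R}^p$, i.e. the set $\{j \in [p] \mid \beta_j \neq 0\}$
$\mathrm{sign}(\beta)$ - the vector of signs of $\beta \in \mathbb{R}^p$, i.e. $\mathrm{sign}(\beta)_j = 1_{\mathbb{R}_{>0}}(\beta_j) - 1_{\mathbb{R}_{<0}}(\beta_j)$
$\beta \overset{s}{=} \tilde\beta$ - if and only if $\mathrm{sign}(\beta) = \mathrm{sign}(\tilde\beta)$
$\|\beta\|_q$ - $\left(\sum_{j=1}^p |\beta_j|^q\right)^{1/q}$, the $\ell_q$-norm of $\beta \in \mathbb{R}^p$ ($q \in (0, \infty)$)
$\|\beta\|_\infty$ - $\max_{1 \leq j \leq p} |\beta_j|$, the sup-norm of $\beta \in \mathbb{R}^p$
$\|\beta\|_0$ - the number of nonzero coefficients of $\beta$ (the $\ell_0$-(pseudo-)norm)
$|\beta|_{\min}$ - $\min_{j \in \mathrm{supp}(\beta)} |\beta_j|$, the minimal nonzero absolute entry of $\beta$
$1_n$ - the vector of ones in $\mathbb{R}^n$
$I_n$ - the identity matrix of size $n \times n$
$X_S$ - the submatrix of $X \in \mathbb{R}^{n \times p}$ formed by its columns indexed by $S \subset [p]$
$\Sigma_{TS}$ - the submatrix of $\Sigma \in \mathbb{R}^{p \times p}$ formed by its rows indexed by $T \subset [p]$ and columns indexed by $S \subset [p]$
$A > 0$ - means that the symmetric matrix $A$ is positive definite
$\lambda_{\min}(A), \lambda_{\max}(A)$ - the smallest, respectively largest eigenvalue of the symmetric matrix $A$
$\sigma_{\min}(X), \sigma_{\max}(X)$ - the smallest, respectively largest singular value of the matrix $X$
$\mathrm{diag}(v)$ - the diagonal matrix in $\mathbb{R}^{p \times p}$ with diagonal $v \in \mathbb{R}^p$
$\mathrm{diag}(A)$ - for $A \in \mathbb{R}^{p \times p}$, the diagonal matrix whose diagonal is the one of $A$
$\eta_\tau$ - the hard thresholding function, defined by $\eta_\tau(\beta)_j := \beta_j 1_{(\tau, \infty)}(|\beta_j|)$ for every $j \in [p]$, where $\beta \in \mathbb{R}^p$
$\log x$ - the natural logarithm of $x$ ($\log = \log_e$)
$f(n) = \Omega(g(n))$ - there is a constant $K > 0$ such that $|f(n)| \geq K|g(n)|$ for large enough $n$
$f(n) = O(g(n))$ - there is a constant $k > 0$ such that $|f(n)| \leq k|g(n)|$ for large enough $n$.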
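Three of the operators listed above (the support, the sign vector and the hard thresholding function) are used repeatedly in the thesis. The following minimal Python sketch shows how they act on a small vector. The function names and the example vector are illustrative assumptions, and indices are 0-based here, whereas the text uses $[p] = \{1, \dots, p\}$.

```python
def supp(beta):
    """Support of beta: set of (0-based) indices j with beta[j] != 0."""
    return {j for j, b in enumerate(beta) if b != 0}


def sign(beta):
    """Vector of signs, with entries in {-1, 0, 1}."""
    return [(b > 0) - (b < 0) for b in beta]


def hard_threshold(beta, tau):
    """eta_tau: keep the entries with |beta_j| > tau, set the others to zero."""
    return [b if abs(b) > tau else 0.0 for b in beta]


# Hypothetical example vector.
beta = [0.0, 2.5, -0.1, 0.0, -3.0]

print(supp(beta))                  # {1, 2, 4}
print(sign(beta))                  # [0, 1, -1, 0, -1]
print(hard_threshold(beta, 0.5))   # [0.0, 2.5, 0.0, 0.0, -3.0]
```

Thresholding at $\tau = 0.5$ removes the small entry at index 2, so the support of the thresholded vector shrinks from three indices to two.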
Chapter 2
Background on sparse vector recovery
This chapter introduces the problem of sparse vector recovery, with an emphasis on sparse support recovery, in the high-dimensional linear model, and provides an overview of some existing estimators. There is a large body of literature focusing on this problem, and it is impossible to provide an exhaustive review of existing methods and the corresponding theoretical results. We focus here only on estimators that are most closely related to our contribution, namely Basis Pursuit, the Lasso, and Thresholded Basis Pursuit, all of which make use of $\ell_1$-penalization. We restrict our attention to theoretical results that will be used in the next chapters or that are closely related to our own results.
We also present the problems of missing covariates and sparse corruptions in Sections 2.5 and 2.6 respectively, and some corresponding estimators. These problems will arise in Chapter 5.
2.1 The sparse high-dimensional linear model
The linear model assumes that a response vector $y \in \mathbb{R}^n$ depends on a design matrix $X \in \mathbb{R}^{n \times p}$ through the equation
$$y = X\beta^0 + \epsilon, \tag{2.1}$$
where $\beta^0 \in \mathbb{R}^p$ is an unknown vector of coefficients and $\epsilon \in \mathbb{R}^n$ is a noise term with expectation zero. We will make the classical assumption that $\epsilon \sim N(0, \sigma^2 I_n)$. The matrix $X$ can be fixed or random. Given $y$ and $X$, the statistician's task is to make inference about $\beta^0$.
When the number $p$ of parameters to estimate exceeds the number $n$ of observations, the linear model is said to be high-dimensional. High-dimensionality causes difficulties. In particular, traditional estimators like the least-squares estimator are no longer well-defined.
Indeed, it is well known that
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 \iff X^T X \hat\beta = X^T y,$$
so the least-squares problem has infinitely many solutions whenever the kernel of $X$ is nontrivial, which is necessarily the case when $p > n$.
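The non-uniqueness of least squares when $p > n$ can be checked on a toy example. The sketch below uses hypothetical data with $n = 2$ and $p = 3$: it takes one vector that interpolates $y$, adds multiples of a kernel vector of $X$, and confirms that the residual sum of squares and the normal equations are unaffected. All names and numbers are illustrative assumptions.

```python
# Toy example with n = 2 observations and p = 3 predictors (hypothetical data).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]


def matvec(A, b):
    """Matrix-vector product A b for a matrix stored as a list of rows."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, b)) for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def rss(beta):
    """Residual sum of squares ||y - X beta||_2^2."""
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, matvec(X, beta)))


def normal_equations_gap(beta):
    """X^T X beta - X^T y, which vanishes at every least-squares solution."""
    residual = [f_i - y_i for f_i, y_i in zip(matvec(X, beta), y)]
    return matvec(transpose(X), residual)


beta_a = [1.0, 2.0, 0.0]   # one solution: X beta_a = y
v = [1.0, 1.0, -1.0]       # kernel vector: X v = 0

# Every beta_a + t * v fits y exactly, so least squares cannot single one out.
for t in (0.0, 1.0, -3.5):
    beta_t = [b + t * v_j for b, v_j in zip(beta_a, v)]
    print(t, beta_t, rss(beta_t), normal_equations_gap(beta_t))
```

Each printed line shows a different coefficient vector with zero residual sum of squares, which is the identifiability problem in miniature.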
The identifiability issue is an even more striking illustration of the problem we are facing when $p > n$: even if $\epsilon = 0$ in (2.1), the system $y = X\beta$ has infinitely many solutions, and it is therefore impossible to recover $\beta^0$ even in the noiseless case. The only way to circumvent this is to impose restrictions on $\beta^0$ by making further assumptions on its structure. The most common assumption made in the statistical literature is the sparsity assumption, namely that most entries of $\beta^0$ are zero. Under the sparsity assumption, estimation of $\beta^0$ can be made with three different goals in mind:
• prediction: minimizing the prediction error $E((\tilde y - \tilde x^T \hat\beta)^2)$, where $(\tilde y, \tilde x) \in \mathbb{R} \times \mathbb{R}^p$ represents a future observation,
• estimation: minimizing the estimation error $E(\|\hat\beta - \beta^0\|)$,
• support recovery: recovering the support of $\beta^0$, that is, the locations of its nonzero coefficients, denoted
$$S^0 := \mathrm{supp}(\beta^0) = \{j \in [p] \mid \beta^0_j \neq 0\}.$$
The cardinality of $S^0$ is denoted
$$s^0 := |S^0|$$
and is called the sparsity index of $\beta^0$. Sign recovery, i.e. $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta^0)$, clearly implies support recovery.
This thesis focuses on the problem of support recovery. In a statistical regression setting, where each column of $X$ corresponds to a predictor or covariate, support recovery corresponds to model selection or variable selection. But linear regression is not the only application where the sparse linear model is useful. For example, the segmentation problem can be formulated as a sparse (low-dimensional) linear model. In this context, $y = f^0 + \epsilon$, where $f^0 \in \mathbb{R}^n$ is a piecewise constant vector, that is, $f^0_{j+1} = f^0_j$ for most $j \in [n-1]$. It is desired to recover the jump locations, that is, all $j \in [n-1]$ for which $f^0_{j+1} \neq f^0_j$. The piecewise constant signal can be written
$$f^0 = f^0_1 1_n + X\beta^0,$$
where $X \in \mathbb{R}^{n \times (n-1)}$ is given by $X_{ij} := 1_{\{i > j\}}$ and $\beta^0_j := f^0_{j+1} - f^0_j$. Then the set of jump locations is precisely $\mathrm{supp}(\beta^0)$.
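The segmentation example can be verified numerically. The following sketch builds the matrix $X_{ij} = 1_{\{i > j\}}$ for a hypothetical piecewise constant signal of length 6, computes the jump vector $\beta^0$, and checks that $f^0_1 1_n + X\beta^0$ reproduces the signal. The signal values and the helper name `jump_design` are illustrative assumptions.

```python
def jump_design(n):
    """n x (n-1) matrix with entry (i, j) equal to 1 if i > j, else 0 (1-based i, j)."""
    return [[1.0 if i > j else 0.0 for j in range(1, n)] for i in range(1, n + 1)]


# Hypothetical piecewise constant signal with two jumps.
f0 = [2.0, 2.0, 2.0, 5.0, 5.0, 1.0]
n = len(f0)

# beta0_j = f0_{j+1} - f0_j, stored 0-based.
beta0 = [f0[j + 1] - f0[j] for j in range(n - 1)]

X = jump_design(n)

# Reconstruct f0 as f0_1 * 1_n + X beta0.
recon = [f0[0] + sum(X[i][j] * beta0[j] for j in range(n - 1)) for i in range(n)]

# Jump locations (1-based, as in the text) are the support of beta0.
jumps = [j + 1 for j, b in enumerate(beta0) if b != 0]

print(beta0)   # [0.0, 0.0, 3.0, 0.0, -4.0]
print(recon)   # [2.0, 2.0, 2.0, 5.0, 5.0, 1.0]
print(jumps)   # [3, 5]
```

Only two of the five entries of $\beta^0$ are nonzero, so recovering the jumps is a support recovery problem for a sparse vector.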
The performance of a support estimator $\hat S \subset [p]$ can be measured by various criteria:
• probability of exact support recovery:
$$P(\hat S = S^0); \tag{2.2}$$
• true positive rate:
$$\mathrm{TPR} := E(\mathrm{TPP}), \tag{2.3}$$
where
$$\mathrm{TPP} := \frac{|S^0 \cap \hat S|}{|S^0|} \tag{2.4}$$
is the true positive proportion;
• family-wise error rate:
$$\mathrm{FWER} = P(|\hat S \cap \overline{S^0}| > 0); \tag{2.5}$$
• false discovery rate:
$$\mathrm{FDR} := E(\mathrm{FDP}), \tag{2.6}$$
where
$$\mathrm{FDP} := \frac{|\overline{S^0} \cap \hat S|}{\max\{|\hat S|, 1\}} \tag{2.7}$$
is the false discovery proportion.
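For a single realization, the proportions defined in (2.4) and (2.7) and the event underlying (2.5) reduce to simple set operations. The sketch below computes them for a hypothetical true support and a hypothetical estimate. Averaging these quantities over repeated simulations would give empirical estimates of the TPR, FDR and FWER. The function names are illustrative assumptions.

```python
def true_positive_proportion(S_hat, S0):
    """TPP = |S0 ∩ S_hat| / |S0| (requires a nonempty true support S0)."""
    return len(S0 & S_hat) / len(S0)


def false_discovery_proportion(S_hat, S0):
    """FDP = |S_hat minus S0| / max(|S_hat|, 1)."""
    return len(S_hat - S0) / max(len(S_hat), 1)


def any_false_discovery(S_hat, S0):
    """Indicator U of at least one false discovery; FWER is its expectation."""
    return int(len(S_hat - S0) > 0)


# Hypothetical true support and estimated support.
S0 = {1, 4, 7, 9}
S_hat = {1, 4, 5, 9, 12}

print(true_positive_proportion(S_hat, S0))     # 0.75
print(false_discovery_proportion(S_hat, S0))   # 0.4
print(any_false_discovery(S_hat, S0))          # 1
```

Here three of the four true variables are found, and two of the five selected variables are false discoveries.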
The TPR is a measure of the power of the model selection procedure, since it corresponds to the average proportion of truly nonzero coefficients that are detected. The FWER and the FDR, on the other hand, are measures of the propensity to make false discoveries. The FWER is the probability of making at least one false discovery, whereas the FDR corresponds to the average proportion of false discoveries among all discoveries. Both quantities are related by the inequality
$$\mathrm{FDR} \leq \mathrm{FWER}. \tag{2.8}$$
Indeed, $\mathrm{FWER} = E(U)$, where $U := 1_{\{|\hat S \cap \overline{S^0}| > 0\}}$. By definition (2.7), $\mathrm{FDP} \leq 1$, and $\mathrm{FDP} = 0$ when $U = 0$. So $\mathrm{FDP} \leq U$, hence $\mathrm{FDR} = E(\mathrm{FDP}) \leq E(U) = \mathrm{FWER}$. Moreover, equality holds in (2.8) when $S^0 = \emptyset$, since in this case $\mathrm{FDP} = 1_{\{|\hat S| > 0\}}$.
Inequality (2.8) implies that any procedure controlling the FWER at level $\alpha$ also controls the FDR at level $\alpha$. Controlling the FWER generally leads to very conservative procedures, which is one of the reasons why Benjamini and Hochberg [1995] introduced the less conservative FDR. Some recent work on linear regression has aimed at controlling the FDR [Bogdan et al., 2015; Barber and Candès, 2015; Candès et al., 2018].
Under the sparsity assumption, it would be natural to look for an estimator $\hat\beta$ with small $\ell_0$-(pseudo-)norm, defined by
$$\|\beta\|_0 := |\mathrm{supp}(\beta)|.$$
In the noiseless case ($\epsilon = 0$), the problem
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_0 \quad \text{s.t.} \quad y = X\beta \tag{2.9}$$
looks for the sparsest $\beta$ satisfying $y = X\beta$, whereas in the noisy case ($\epsilon \neq 0$), the so-called best subset problem,
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_0, \quad \lambda > 0, \tag{2.10}$$
looks for the best trade-off between the fit $\|y - X\beta\|_2^2$ and the model complexity. However, both problems are combinatorial by nature and are known to be NP-hard [Natarajan, 1995]. One of the greatest advances in statistics and signal processing has been to relax and convexify the $\ell_0$-penalty by replacing it with the $\ell_1$-norm. Convexity of the resulting optimization problems makes it possible to rely on the well-developed theory of convex optimization to derive theoretical results.
Moreover, the development of fast algorithms to solve this kind of problem undoubtedly contributed to the popularity of $\ell_1$ estimators.
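To give a rough sense of why exhaustive search over supports is impractical, the short sketch below counts the candidate supports of size at most $s$ among $p$ variables, which is the number of subsets a naive search for (2.9) or (2.10) would have to examine. The function name and the chosen values of $p$ and $s$ are illustrative assumptions; this counts candidates only and does not solve either problem.

```python
from math import comb


def n_candidate_supports(p, s_max):
    """Number of subsets of {1, ..., p} with at most s_max elements."""
    return sum(comb(p, k) for k in range(s_max + 1))


# The count grows very quickly with p, even for small supports.
for p in (10, 100, 1000):
    print(p, n_candidate_supports(p, 5))
```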
The next two sections present the two best-known $\ell_1$ problems, namely Basis Pursuit and the Lasso, corresponding to the $\ell_1$ counterparts of problems (2.9) and (2.10) respectively. Let us first state here an assumption and a slight abuse of language that we will make throughout the thesis.
Remarks 2.1
a) In the remaining of the thesis we will assume that rankX = n. This way, any system y=X admits at least one solution.
b) For ease of language, we will saythe Basis Pursuit, or Lasso, solution, even though unique- ness is not always guaranteed. When uniqueness is crucial, it will be made explicit. Note that whenX is in general position (see Definition 2.2below), the Basis Pursuit and Lasso solution are unique.
Definition 2.2 A matrix $X \in \mathbb{R}^{n \times p}$ is in general position if for every $(u_1, \dots, u_p) \in \{-1, 1\}^p$ and for every $k < \min\{n, p\}$, all affine subspaces of $\mathbb{R}^n$ of dimension $k$ contain at most $k + 1$ points $u_j X_j$.
Examples 2.3
a) When $p \le n$, if the columns of $X$ are linearly independent, then $X$ is clearly in general position.
b) Any random matrix $X$ with absolutely continuous density is almost surely in general position [Tibshirani, 2013].
2.2 Basis Pursuit
This section focuses on the noiseless case, i.e. when $y = X\beta^0$. Chen et al. [2001] introduced the convex relaxation of (2.9),
\[ \min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad y = X\beta, \tag{2.11} \]
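called the Basis Pursuit (BP) problem. BP can be recast as a linear program, as it is equivalent to solving
\[ \min_{\beta^+, \beta^- \in \mathbb{R}^p} 1_p^T \beta^+ + 1_p^T \beta^- \quad \text{s.t.} \quad \begin{cases} y = \begin{bmatrix} X & -X \end{bmatrix} \begin{bmatrix} \beta^+ \\ \beta^- \end{bmatrix}, \\ \beta_j^+ \ge 0 \quad \forall j, \\ \beta_j^- \ge 0 \quad \forall j. \end{cases} \]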
BP used as a proxy for (2.9) has been well studied in mathematical signal processing and it is known that under some conditions, the solutions of (2.11) are solutions to (2.9) as well [Donoho and Huo, 2001; Elad and Bruckstein, 2002; Donoho and Elad, 2003; Gribonval and Nielsen, 2003; Fuchs, 2004; Donoho, 2006; Candès et al., 2006a].
Let us state the most important theoretical results about BP. The following result about uniqueness is due to Dossal [2012].
Theorem 2.4 If $X$ is in general position (see Definition 2.2), BP has a unique solution for every $y \in \mathbb{R}^n$.
We will denote the BP solution by $\hat\beta^{\mathrm{BP}}$.
All results and definitions below can be found in Foucart and Rauhut [2013]. First, the following theorem shows that the BP solution is sparse when $p > n$.
Theorem 2.5 If the BP solution is unique, then $\|\hat\beta^{\mathrm{BP}}\|_0 \le n$ and the columns of $X_{\mathrm{supp}(\hat\beta^{\mathrm{BP}})}$ are linearly independent.
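As an illustration only, Theorem 2.5 suggests a brute-force way to solve BP on tiny instances: since BP is a linear program, an optimum is attained at a basic solution supported on $n$ linearly independent columns, so one can enumerate all such supports and keep the one with smallest $\ell_1$-norm. The sketch below assumes $p \ge n$ and $\operatorname{rank} X = n$; the function names `solve_square` and `basis_pursuit_bruteforce` are hypothetical, and the enumeration is exponential in $p$, so this is not a practical solver.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(A, b):
    """Solve A z = b exactly by Gauss-Jordan elimination; return None if A is singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def basis_pursuit_bruteforce(X, y):
    """Return a minimizer of ||beta||_1 subject to y = X beta (tiny problems only)."""
    n, p = len(X), len(X[0])
    best = None
    for S in combinations(range(p), n):           # candidate supports of size n
        XS = [[X[i][j] for j in S] for i in range(n)]
        z = solve_square(XS, y)
        if z is None:                             # columns not linearly independent
            continue
        beta = [Fraction(0)] * p
        for j, v in zip(S, z):
            beta[j] = v
        if best is None or sum(map(abs, beta)) < sum(map(abs, best)):
            best = beta
    return best

# Toy instance with n = 2, p = 3 (values chosen for illustration).
X = [[1, 0, 1], [0, 1, 1]]
y = [1, 1]
beta_bp = basis_pursuit_bruteforce(X, y)
```

On this toy instance the feasible vector $(1, 1, 0)$ has $\ell_1$-norm 2, whereas the sparser $(0, 0, 1)$ has $\ell_1$-norm 1 and is the one returned.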
The following condition is the weakest guaranteeing exact recovery by BP of all vectors supported on a set $S^0 \subset [p]$.
Definition 2.6 The matrix $X$ is said to satisfy the nullspace property (NSP) relative to $S^0$ if
\[ \beta \in \ker X \setminus \{0\} \implies \|\beta_{S^0}\|_1 < \|\beta_{\overline{S^0}}\|_1. \tag{2.12} \]
Theorem 2.7 For a matrix $X \in \mathbb{R}^{n \times p}$, every vector $\beta^0 \in \mathbb{R}^p$ with $\mathrm{supp}(\beta^0) = S^0$ is the unique solution to BP (2.11) when $y = X\beta^0$ if and only if $X$ satisfies the NSP relative to $S^0$.
It is often assumed that the estimated vector is only approximately sparse. The following condition will ensure stability of BP with respect to sparsity defect.
Definition 2.8 The matrix $X$ is said to satisfy the stable nullspace property (stable NSP) relative to $S^0$ with constant $\rho \in (0, 1)$ if
\[ \beta \in \ker X \implies \|\beta_{S^0}\|_1 \le \rho \|\beta_{\overline{S^0}}\|_1. \tag{2.13} \]
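Theorem 2.9 If $X \in \mathbb{R}^{n \times p}$ satisfies the stable NSP relative to $S^0$ with constant $\rho \in (0, 1)$, then for every vector $\beta^0$ the solution $\hat\beta^{\mathrm{BP}}$ to BP (2.11) when $y = X\beta^0$ satisfies
\[ \|\hat\beta^{\mathrm{BP}} - \beta^0\|_1 \le \frac{2(1 + \rho)}{1 - \rho} \|\beta^0_{\overline{S^0}}\|_1. \]
(Note that $\beta^0$ is not necessarily supported on $S^0$ here.)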
Consequently, if $\beta^0$ is close to being sparse, in the sense that it can be written as the sum of a vector supported on $S^0$ and a small perturbation, the stable NSP guarantees that the solution to BP is close to $\beta^0$.
Many works in compressed sensing assume a stronger condition than the NSP, called the restricted isometry property (RIP).
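Definition 2.10 For a matrix $X \in \mathbb{R}^{n \times p}$ and $k \in [p]$, its restricted isometry constant of order $k$, denoted $\delta_k = \delta_k(X)$, is the smallest $\delta > 0$ such that for every vector $\beta \in \mathbb{R}^p$ with $\|\beta\|_0 \le k$, we have
\[ (1 - \delta)\|\beta\|_2^2 \le \|X\beta\|_2^2 \le (1 + \delta)\|\beta\|_2^2. \tag{2.14} \]
In particular, for every set $S \subset [p]$ of cardinality $|S| \le k$, one has
\[ 1 - \delta_k \le \lambda_{\min}(X_S^T X_S) \le \lambda_{\max}(X_S^T X_S) \le 1 + \delta_k. \]
We say that $X$ satisfies a restricted isometry property (RIP) of order $k$ if $\delta_k$ is small enough (with an upper bound depending on specific theoretical results).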
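To make the eigenvalue characterization concrete, here is a minimal sketch that computes $\delta_2$ by brute force, using the closed-form eigenvalues of each $2 \times 2$ Gram matrix $X_S^T X_S$. It assumes $p \ge 2$; the function name `restricted_isometry_constant_2` and the toy matrix are illustrative choices. Enumerating pairs is enough because the eigenvalues of a single-column Gram matrix lie between those of any pair containing that column.

```python
from itertools import combinations
from math import sqrt

def restricted_isometry_constant_2(X):
    """Brute-force delta_2 of X (given as a list of rows), assuming p >= 2."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    delta = 0.0
    for j, k in combinations(range(p), 2):
        # Gram matrix [[a, c], [c, b]] of the column pair (j, k).
        a, b, c = dot(cols[j], cols[j]), dot(cols[k], cols[k]), dot(cols[j], cols[k])
        mid, rad = (a + b) / 2, sqrt(((a - b) / 2) ** 2 + c ** 2)
        lam_min, lam_max = mid - rad, mid + rad
        delta = max(delta, 1 - lam_min, lam_max - 1)
    return delta

# Toy matrix with unit-norm columns e1, e2 and (e1 + e2) / sqrt(2).
r = 1 / sqrt(2)
X = [[1.0, 0.0, r], [0.0, 1.0, r]]
delta_2 = restricted_isometry_constant_2(X)
```

Here the pair formed by $e_1$ and $(e_1 + e_2)/\sqrt{2}$ has Gram eigenvalues $1 \pm 1/\sqrt{2}$, so $\delta_2 = 1/\sqrt{2}$, which is too large for a condition such as $\delta_{2s} < 1/3$. For general $k$ the number of subsets grows combinatorially, which is one way to see why checking the RIP for a fixed matrix is hard.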
The proof of Theorem 6.9 in Foucart and Rauhut [2013] shows the following.
Theorem 2.11 If $X$ satisfies
\[ \delta_{2s} < \frac{1}{3}, \tag{2.15} \]
then it satisfies the stable NSP with constant $\rho = \frac{\delta_{2s}}{1 - 2\delta_{2s}}$ relative to every set $S \subset [p]$ such that $|S| = s$.
Since the stable NSP implies the NSP, Theorems 2.7 and 2.11 together imply that under the RIP (2.15), BP recovers any $s$-sparse vector $\beta^0$.
It is hard to construct fixed matrices satisfying the RIP. However, it is known that some random matrices satisfy it with high probability. In particular, the following holds¹.
Theorem 2.12 If the entries of $X \in \mathbb{R}^{n \times p}$ are i.i.d. $N(0, \frac{1}{n})$, then there exists a constant $C > 0$ such that $X$ satisfies $\delta_s < \delta$ with probability at least $1 - 2e^{-\frac{\delta^2 n}{2C}}$ provided $n \ge 2C\delta^{-2} s \log(ep/s)$.
¹ Theorem 2.12 actually holds more generally for subgaussian matrices.
2.3 The Lasso
In the noisy case, Tibshirani [1996] proposed the Lasso estimator
\[ \hat\beta^{\mathrm{lasso}}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \quad \lambda > 0, \tag{2.16} \]
providing a trade-off between the fit $\|y - X\beta\|_2^2$ and the $\ell_1$-norm of $\beta$. Note that if $X$ is in general position (see Definition 2.2), the Lasso solution is unique [Tibshirani, 2013].
The Lasso has been extensively studied in the literature (see Bühlmann and van de Geer [2011] and references therein). In particular, the following condition is crucial for variable selection.
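Definition 2.13 For $\theta \in (0, 1)$, we say that $X \in \mathbb{R}^{n \times p}$ satisfies the $\theta$-irrepresentable condition relative to $\beta^0$ if
\[ \|X_{\overline{S^0}}^T X_{S^0} (X_{S^0}^T X_{S^0})^{-1} \mathrm{sign}(\beta^0_{S^0})\|_\infty \le \theta, \tag{2.17} \]
where $S^0 = \mathrm{supp}(\beta^0)$.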
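The left-hand side of (2.17) can be evaluated directly on small examples. The following sketch is illustrative only: it uses exact rational arithmetic, assumes $X_{S^0}^T X_{S^0}$ is invertible, and the helper names `solve_exact` and `irrepresentable_value` are hypothetical.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Solve A w = b exactly by Gauss-Jordan elimination (A assumed invertible)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n] for row in M]

def irrepresentable_value(X, S0, signs):
    """Sup-norm of X_{S0^c}^T X_{S0} (X_{S0}^T X_{S0})^{-1} signs, as in (2.17)."""
    n, p = len(X), len(X[0])
    col = lambda j: [X[i][j] for i in range(n)]
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    gram = [[dot(col(j), col(k)) for k in S0] for j in S0]
    w = solve_exact(gram, signs)                       # (X_S0^T X_S0)^{-1} signs
    fitted = [sum(X[i][j] * wj for j, wj in zip(S0, w)) for i in range(n)]  # X_S0 w
    return max(abs(dot(col(j), fitted)) for j in range(p) if j not in S0)

# Toy design: the third column is correlated with the first two.
X = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]
same_signs = irrepresentable_value(X, [0, 1], [1, 1])
opposite_signs = irrepresentable_value(X, [0, 1], [1, -1])
```

With this toy design the value is 2 when both active coefficients are positive (the condition fails for every $\theta \in (0,1)$) and 0 when they have opposite signs, which shows that the condition depends on the sign pattern of $\beta^0$ and not only on its support.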
The irrepresentable condition is strong and typically excludes cases where variables are too strongly correlated.
It has been shown that the Lasso consistently recovers the support of $\beta^0$ in an asymptotic setting where both $p$ and $s^0$ are allowed to grow with $n$. These results essentially require the irrepresentable condition, as well as some lower bounds for $\lambda_{\min}(\frac{1}{n} X_{S^0}^T X_{S^0})$ and $|\beta^0|_{\min}$, where $|\beta^0|_{\min}$ denotes the smallest nonzero absolute component of $\beta^0$. For example, the following has been proved by Zhao and Yu [2006].
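Theorem 2.14 Assume that the noise components $\epsilon_i$, $i = 1, \dots, n$, are i.i.d. with $E(\epsilon_i^{2q}) < \infty$ for some $q > 0$ and that there exist constants $0 \le c_1, c_2 \le 1$, $M_1, M_2, M_3, M_4 > 0$ such that for every value of $n$,
• $\frac{1}{n}\|X_j\|_2^2 \le M_1$ for every $j$,
• $\lambda_{\min}(\frac{1}{n} X_{S^0}^T X_{S^0}) \ge M_2$,
• $s^0 = O(n^{c_1})$,
• $|\beta^0|_{\min} \ge \frac{M_3}{n^{(1 - c_2)/2}}$.
If the $\theta$-irrepresentable condition holds for every $n$, then there exists a sequence $(\lambda_n)_{n \in \mathbb{N}^*}$ such that
\[ \lim_{n \to \infty} P(\hat\beta^{\mathrm{lasso}}_{\lambda_n} \overset{s}{=} \beta^0) = 1. \]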
Theorem 2.14 concerns fixed design matrices. Wainwright [2009] proved a similar result for (correlated) Gaussian design matrices, i.e. matrices $X$ with i.i.d. $N(0, \Sigma)$ rows. Theorem 3 of that work implies consistent sign recovery when $\Sigma$ satisfies
\[ \|\Sigma_{\overline{S^0} S^0} (\Sigma_{S^0 S^0})^{-1}\|_\infty \le \theta, \quad \theta \in (0, 1), \tag{2.18} \]
and
\[ \lambda_{\min}(\Sigma_{S^0 S^0}) \ge C_{\min} > 0, \tag{2.19} \]
provided the sample size $n$ scales as $n = \Omega(s^0 \log(p - s^0))$.
The irrepresentable condition is thus sufficient for consistent support recovery by the Lasso. It turns out that it is also essentially necessary, where "essentially" means that $\theta$ in inequality (2.17) is replaced by 1. This is true even in the noiseless case, as shown by the following result [Bühlmann and van de Geer, 2011, Theorem 7.1].
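Theorem 2.15 If $y = X\beta^0$ and $\mathrm{supp}(\hat\beta^{\mathrm{lasso}}_\lambda) \subset \mathrm{supp}(\beta^0)$ for some $\lambda < \frac{|\beta^0|_{\min}}{\|(X_{S^0}^T X_{S^0})^{-1}\|_\infty}$, then
\[ \|X_{\overline{S^0}}^T X_{S^0} (X_{S^0}^T X_{S^0})^{-1} \mathrm{sign}(\beta^0_{S^0})\|_\infty \le 1. \]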
As for any other penalized likelihood method, the choice of the tuning parameter $\lambda$ is crucial when applying the Lasso. Cross-validation and traditional information criteria like the Bayes Information Criterion (BIC) [Schwarz, 1978] or Akaike Information Criterion (AIC) [Akaike, 1998] are widely used, but they do not yield consistent support recovery in the high-dimensional case. For consistent support recovery when $p$ grows exponentially with $n$, Fan and Tang [2013] proposed the Generalized Information Criterion (GIC), corresponding to
\[ \mathrm{GIC}(\hat\beta) = \|y - X\hat\beta\|_2^2 + \sigma^2 \log(\log n) \log p \, \|\hat\beta\|_0 \tag{2.20} \]
in the Gaussian linear model. Tuning the Lasso with GIC corresponds to choosing the value of $\lambda$ minimizing $\mathrm{GIC}(\hat\beta^{\mathrm{lasso}}_\lambda)$.
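As a small numerical illustration of (2.20), the sketch below scores a few candidate fits and picks the one with smallest GIC. The residual sums of squares and model sizes are made-up values standing in for points along a Lasso path, the noise variance $\sigma^2$ is assumed known, and the function name `gic` is hypothetical.

```python
from math import log

def gic(rss, k, n, p, sigma2):
    """GIC value (2.20): RSS plus sigma^2 * log(log n) * log p per nonzero coefficient."""
    return rss + sigma2 * log(log(n)) * log(p) * k

n, p, sigma2 = 100, 1000, 1.0
# Hypothetical (RSS, number of nonzero coefficients) pairs, one per value of lambda.
candidates = [(400.0, 0), (150.0, 2), (95.0, 3), (90.0, 6)]
scores = [gic(rss, k, n, p, sigma2) for rss, k in candidates]
best = min(range(len(candidates)), key=scores.__getitem__)
```

With these values each additional coefficient costs about 10.5, so the fit with three nonzero coefficients is selected: the gain in RSS from three to six coefficients does not pay for the extra penalty.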
2.4 Thresholded Basis Pursuit
The Lasso is equivalent (see Proposition 3.2 in Foucart and Rauhut [2013]) to Basis Pursuit Denoising (BPD) [Candès et al., 2006b; Donoho et al., 2006],
\[ \min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \theta. \tag{2.21} \]
BPD is the most natural adaptation of BP to the noisy case, since it replaces the equality constraint $y = X\beta$ by $\|y - X\beta\|_2 \le \theta$. Therefore the Lasso can be seen as a simple adaptation of BP to the presence of noise.
Another way to adapt BP to the noisy case is to first solve BP, despite the presence of noise, and then set the small coefficients to zero. This is Thresholded Basis Pursuit (TBP), which was introduced and studied by Saligrama and Zhao [2011]. We denote it $\hat\beta^{\mathrm{TBP}}_\tau$:
\[ \hat\beta^{\mathrm{TBP}}_\tau = \eta_\tau(\hat\beta^{\mathrm{BP}}), \tag{2.22} \]
where $\eta_\tau(x) = x 1_{(\tau, \infty)}(|x|)$, the hard-thresholding function, is applied componentwise.
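The thresholding step itself is elementary; a minimal sketch is given below, where the input vector is a made-up stand-in for a BP solution computed on noisy data and the names `hard_threshold` and `sign_vector` are illustrative.

```python
def hard_threshold(beta, tau):
    """Apply eta_tau componentwise: keep x if |x| > tau, otherwise set it to 0."""
    return [x if abs(x) > tau else 0.0 for x in beta]

def sign_vector(beta):
    """Componentwise sign, with values in {-1, 0, 1}."""
    return [(x > 0) - (x < 0) for x in beta]

# Hypothetical BP solution: three large coefficients and two small spurious ones.
beta_bp = [0.05, -2.0, 0.3, -0.1, 1.4]
beta_tbp = hard_threshold(beta_bp, tau=0.2)
signs = sign_vector(beta_tbp)
```

Note that the inequality is strict, matching the indicator of the open interval $(\tau, \infty)$: a coefficient with $|x| = \tau$ is set to zero.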
The work of Saligrama and Zhao [2011] focuses on matrices $X \in \mathbb{R}^{n \times p}$ with i.i.d. $N(0, \frac{1}{n})$ entries, and relies on the fact that in this setting some RIP holds with high probability. One of their important intermediate results is the following.
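Theorem 2.16 Assume that $y = X(\beta^0 + \xi)$, where $X \in \mathbb{R}^{n \times p}$ has i.i.d. $N(0, \frac{1}{n})$ entries, $\beta^0 \in \mathbb{R}^p$ is $s^0$-sparse, and $\xi \in \mathbb{R}^p$ is a deterministic error. Then there exist constants $c_1, c_2 > 0$ such that with probability at least $1 - e^{-c_1 n}$, the BP solution satisfies
\[ \|\hat\beta^{\mathrm{BP}} - (\beta^0 + \xi)\|_2 \le C \frac{\|\xi\|_1}{\sqrt{s^0}}, \tag{2.23} \]
provided $n \ge c_2 (2s^0) \log(p/2s^0)$, where the constant $C$ depends only on the restricted isometry constant $\delta_{2s^0}$ of $X$.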
Actually, this intermediate result already yields a sign consistency result for TBP, albeit under a strong beta-min condition (see our discussion in Section 3.3.3). In order to reduce the required signal-to-noise ratio, Saligrama and Zhao [2011] consider a TBP + OLS multistage procedure, defined by the following algorithm.
Algorithm 2.17 (TBP + OLS)
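For $n = 3m$, we denote by $(y^{(1)}, X^{(1)})$ the dataset containing the first $m$ observations and by $(y^{(2)}, X^{(2)})$ the last $2m$ observations. Then for fixed $\tau > 0$,
1. apply the TBP algorithm to $(y^{(1)}, X^{(1)})$, i.e. compute $\hat\beta := \eta_\tau(\hat\beta^{\mathrm{BP}})$ where
\[ \hat\beta^{\mathrm{BP}} := \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad y^{(1)} = X^{(1)}\beta, \]
2. let $\hat S := \{j \mid \hat\beta_j \neq 0\}$. Using the second dataset, compute the least-squares coefficients for model $\hat S$, i.e.
\[ \hat\beta_{\hat S} \leftarrow ((X^{(2)}_{\hat S})^T X^{(2)}_{\hat S})^{-1} (X^{(2)}_{\hat S})^T y^{(2)}, \]
3. finally, re-threshold the obtained coefficients at level $\tau$:
\[ \hat\beta^{\mathrm{TBP+OLS}}_\tau := \eta_\tau(\hat\beta). \]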
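The three steps of Algorithm 2.17 can be sketched on a toy problem as follows. This is not the authors' implementation: BP is solved here by brute-force enumeration of basic solutions (valid only for tiny instances with $p \ge m$ and full row rank), OLS is computed through the normal equations in exact rational arithmetic, and all function names and data values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def solve(A, b):
    """Exact Gauss-Jordan solve of a square system; returns None if A is singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] != 0), None)
        if piv is None:
            return None
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n] for row in M]

def basis_pursuit(X, y):
    """Brute-force BP: smallest l1 norm among all basic solutions of y = X beta."""
    n, p = len(X), len(X[0])
    best = None
    for S in combinations(range(p), n):
        z = solve([[X[i][j] for j in S] for i in range(n)], y)
        if z is None:
            continue
        beta = [Fraction(0)] * p
        for j, v in zip(S, z):
            beta[j] = v
        if best is None or sum(map(abs, beta)) < sum(map(abs, best)):
            best = beta
    return best

def least_squares(X, y, S):
    """OLS coefficients restricted to the columns in S, via the normal equations."""
    cols = [[row[j] for row in X] for j in S]
    gram = [[sum(a * b for a, b in zip(u, v)) for v in cols] for u in cols]
    rhs = [sum(a * b for a, b in zip(u, y)) for u in cols]
    return solve(gram, rhs)

def tbp_ols(X, y, tau):
    m = len(y) // 3
    beta_bp = basis_pursuit(X[:m], y[:m])                       # step 1: BP on first m rows
    S_hat = [j for j, b in enumerate(beta_bp) if abs(b) > tau]  # support after thresholding
    coef = least_squares(X[m:], y[m:], S_hat)                   # step 2: OLS on last 2m rows
    beta = [Fraction(0)] * len(X[0])
    for j, c in zip(S_hat, coef):
        if abs(c) > tau:                                        # step 3: re-threshold
            beta[j] = c
    return beta

# Toy data with n = 6 = 3m, p = 3, true vector (0, 0, 2) and small noise.
X = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [1, -1, 1], [0, 1, 1], [1, 0, 1]]
y = [Fraction(21, 10), Fraction(19, 10), Fraction(21, 10),
     Fraction(19, 10), Fraction(11, 5), Fraction(9, 5)]
beta_hat = tbp_ols(X, y, tau=Fraction(1, 2))
```

On this toy data the BP solution on the first two observations is $(0.2, 0, 1.9)$, thresholding at $\tau = 0.5$ keeps only the third variable, and the OLS refit on the remaining four observations gives the estimate $(0, 0, 2)$.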
Their main result implies the following.
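Theorem 2.18 Assume that $X \in \mathbb{R}^{n \times p}$ has i.i.d. $N(0, \frac{1}{n})$ entries, $\beta^0 \in \mathbb{R}^p$ is $s^0$-sparse with $\frac{\alpha}{2} p \le s^0 \le \alpha p$ for some $\alpha > 0$ and $|\beta^0|_{\min} = 1$, and $y = X\beta^0 + \epsilon$ where $\epsilon \sim N(0, \sigma^2 I_n)$ with $\sigma^2 \le \frac{C}{\log n}$ for some constant $C > 0$. Then provided $n \ge C'(2s^0) \log(p/2s^0)$, the TBP + OLS algorithm satisfies
\[ \lim_{n \to \infty} P(\hat\beta^{\mathrm{TBP+OLS}}_\tau \overset{s}{=} \beta^0) = 1. \]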
The scalings they obtain for $n$ and $\sigma^2$ are optimal, in the sense that they are (order-wise) necessary for support recovery by any algorithm in the setting they consider.
More recently, Tardivel and Bogdan [2018] proved that the notion of identifiability is necessary and sufficient for TBP to recover the signs of $\beta^0$ in an asymptotic setting where $X$ and $\mathrm{sign}(\beta^0)$ are fixed, but $|\beta^0|_{\min}$ tends to $+\infty$.
Definition 2.19 A vector $\beta^0 \in \mathbb{R}^p$ is said to be identifiable with respect to $X \in \mathbb{R}^{n \times p}$ if it is the unique solution to BP (2.11) when $y = X\beta^0$.
In other words, $\beta^0$ is identifiable if it is correctly and uniquely identified by BP. Actually, the following result of Daubechies et al. [2010] implies that identifiability only depends on the signs of $\beta^0$.
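Lemma 2.20 A vector $\beta^0 \in \mathbb{R}^p$ is identifiable with respect to $X \in \mathbb{R}^{n \times p}$ if and only if for every $\beta \in \ker X \setminus \{0\}$,
\[ |\mathrm{sign}(\beta^0)^T \beta| < \|\beta_{\overline{S^0}}\|_1, \]
where $S^0 = \mathrm{supp}(\beta^0)$.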
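The criterion of Lemma 2.20 is easy to check by hand when $\ker X$ is one-dimensional (for instance $p = n + 1$ with $\operatorname{rank} X = n$), because the inequality is invariant under rescaling of the kernel vector, so a single spanning vector suffices. The sketch below covers only this special case; the function name `is_identifiable` and the example are illustrative.

```python
def is_identifiable(beta0, kernel_vector):
    """Check |sign(beta0)^T h| < ||h restricted to the complement of supp(beta0)||_1.

    Sufficient only when ker X is spanned by the single vector h = kernel_vector.
    """
    sign = lambda x: (x > 0) - (x < 0)
    lhs = abs(sum(sign(b) * h for b, h in zip(beta0, kernel_vector)))
    rhs = sum(abs(h) for b, h in zip(beta0, kernel_vector) if b == 0)
    return lhs < rhs

# For X = [[1, 0, 1], [0, 1, 1]], the kernel is spanned by h = (1, 1, -1).
h = [1, 1, -1]
sparse_ok = is_identifiable([0, 0, 3], h)     # support {3}
dense_fails = is_identifiable([1, 1, 0], h)   # support {1, 2}, both positive
```

Here $(0, 0, 3)$ is identifiable, whereas $(1, 1, 0)$ is not, since BP prefers the sparser representation $(0, 0, 1)$ of the same response. Rescaling the nonzero entries of either vector does not change the outcome, in line with the fact that identifiability depends only on the sign pattern.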
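Tardivel and Bogdan [2018] considered a fixed matrix $X \in \mathbb{R}^{n \times p}$ and a sequence $\{\beta^{(r)}\}_{r \in \mathbb{N}^*}$ of vectors in $\mathbb{R}^p$ such that
i) there exists a sign vector $\theta \in \{1, -1, 0\}^p$ such that $\mathrm{sign}(\beta^{(r)}) = \theta$ for every $r \in \mathbb{N}^*$,
ii) $\lim_{r \to +\infty} |\beta^{(r)}|_{\min} = +\infty$,
iii) there exists $q > 0$ such that for every $r \in \mathbb{N}^*$, $\frac{|\beta^{(r)}|_{\min}}{\|\beta^{(r)}\|_\infty} \ge q$.
Let us denote by $\hat\beta^{\mathrm{BP}(r)}$ the BP solution when $y = X\beta^{(r)} + \epsilon$, and by $\hat\beta^{\mathrm{TBP}(r)}_\tau$ the corresponding TBP estimate. Their result is the following.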
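Theorem 2.21 Let $X \in \mathbb{R}^{n \times p}$ and $\{\beta^{(r)}\}_{r \in \mathbb{N}^*}$ be a sequence satisfying assumptions i)-iii) above. If the sign vector $\theta$ is identifiable with respect to $X$, then for every $\epsilon \in \mathbb{R}^n$ there exists $R = R(\epsilon) > 0$ such that for every $r \ge R$ there is a threshold $\tau > 0$ for which
\[ \hat\beta^{\mathrm{TBP}(r)}_\tau \overset{s}{=} \theta. \tag{2.24} \]
Conversely, if for some $\epsilon \in \mathbb{R}^n$ and $r \in \mathbb{N}^*$ there is a threshold $\tau > 0$ for which (2.24) holds, then $\theta$ (and therefore every $\beta^{(r)}$) is identifiable with respect to $X$.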
2.5 Missing covariates
In practice, data are often only partially observed. Knowing how to deal with incomplete data is important for good prediction, estimation and/or model selection. For simplicity we focus on missing values in $X$, even though our contribution in Chapter 5 directly extends to missing data in both $X$ and $y$.
The process that governs the probability of data points to be missing is called the missing data mechanism. Below, missingness refers to the distribution of the missing data indicator matrix $M \in \mathbb{R}^{n \times p}$ where $M_{ij} = 1$ if $X_{ij}$ is missing and $M_{ij} = 0$ otherwise. One traditionally distinguishes three missing data mechanisms, originally introduced by Rubin [1976] (see also Little and Rubin [2002]):
• missing completely at random (MCAR): missingness does not depend on the values of the data, missing or observed;
• missing at random (MAR): missingness depends on the observed values, but not on the components that are missing;
• missing not at random (MNAR): missingness depends on the missing values in $X$.
Let us briefly review the most common methods to handle missing covariates.
Complete case analysis drops all incomplete rows of $X$ and then proceeds with a usual estimation method. In addition to the biased estimation one gets under MAR or MNAR, this method has the big disadvantage of losing a lot of information, especially in the high-dimensional setting. Indeed, when the number $p$ of covariates is large, say $p = 1000$, one might have to drop an entire row because a single covariate is missing, even if the 999 other ones are observed.
The Expectation-Maximization (EM) algorithm [Dempster et al., 1977] is commonly used to compute a maximum likelihood estimator in the presence of missing values. It is an iterative method, alternating between an expectation step computing the expected log-likelihood function under the current parameter estimate, and a maximization step looking for the parameter maximizing this expected log-likelihood. EM requires specifying a model for the covariates, which is not an easy task when $p$ is large, and typically leads to nonconvex problems.
Imputation is the task of replacing missing values by an estimate, before applying an estimation procedure. Mean imputation is arguably the simplest imputation technique, replacing a missing entry by the mean of all observed values in the corresponding column. Regression imputation techniques allow a better imputation, taking into account the correlation between covariates, but again, this requires specifying a model for the covariates. Multiple imputation procedures (see e.g. Van Buuren [2012]) make it possible to take into account the variability of estimation procedures which use imputed data. They consist in building several imputed datasets (typically with stochastic regression imputation) and aggregating the different obtained estimates.
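The two simplest strategies above, complete case analysis and mean imputation, can be sketched in a few lines. In this illustrative snippet, missing entries are encoded as `None`, each column is assumed to contain at least one observed value, and the function names and data are hypothetical.

```python
def complete_cases(X):
    """Keep only the rows of X without any missing entry (None marks a missing value)."""
    return [row for row in X if None not in row]

def mean_impute(X):
    """Replace each missing entry by the mean of the observed values in its column."""
    p = len(X[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in X]

# Toy design with two missing entries out of nine.
X = [[1.0, None, 3.0],
     [2.0, 4.0, None],
     [3.0, 6.0, 9.0]]
X_cc = complete_cases(X)
X_imp = mean_impute(X)
```

Even in this small example, complete case analysis discards two of the three rows although only two of the nine entries are missing, whereas mean imputation keeps every row at the price of ignoring the correlation between covariates.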
In case of high-dimensional data, Loh and Wainwright [2012] and Datta and Zou [2017] proposed substitutes of the Lasso based on the following observation. With the parameter rescaling $\lambda/n$, the Lasso optimization problem (2.16) can be equivalently written
\[ \hat\beta^{\mathrm{lasso}}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \beta^T \left(\frac{1}{n} X^T X\right) \beta - \beta^T \left(\frac{1}{n} X^T y\right) + \lambda\|\beta\|_1. \]
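Assuming the rows of $X$ are i.i.d. with expectation zero and covariance matrix $\Sigma$ and $y$ is generated by the linear model (2.1), then $\frac{1}{n} X^T X$ and $\frac{1}{n} X^T y$ are consistent estimators of $\Sigma$ and $\gamma := \Sigma\beta^0$ respectively. In case of noisy or missing covariates, Loh and Wainwright [2012] proposed to replace them by other appropriate estimates. In case of MCAR data, where each entry is missing with probability $\rho \in (0, 1)$, they use
\[ \hat\Sigma := \frac{1}{n} \tilde X^T \tilde X - \rho \, \mathrm{diag}\left(\frac{1}{n} \tilde X^T \tilde X\right), \tag{2.25} \]
and
\[ \hat\gamma := \frac{1}{n} \tilde X^T y, \tag{2.26} \]
where
\[ \tilde X_{ij} := \begin{cases} \frac{X_{ij}}{1 - \rho} & \text{if } X_{ij} \text{ is observed}, \\ 0 & \text{otherwise}. \end{cases} \]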
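The surrogate estimates (2.25) and (2.26) are simple to compute. The following sketch does so in exact rational arithmetic on a tiny, made-up dataset, with missing entries encoded as `None` and the missingness probability $\rho$ assumed known; the function name `mcar_surrogates` is hypothetical.

```python
from fractions import Fraction

def mcar_surrogates(X, y, rho):
    """Return (Sigma_hat, gamma_hat) as in (2.25)-(2.26); None marks a missing entry."""
    n, p = len(X), len(X[0])
    # Zero-fill missing entries and rescale observed ones by 1 / (1 - rho).
    Xt = [[Fraction(0) if v is None else Fraction(v) / (1 - rho) for v in row] for row in X]
    # G = (1/n) Xt^T Xt
    G = [[sum(Xt[i][j] * Xt[i][k] for i in range(n)) / n for k in range(p)] for j in range(p)]
    # Sigma_hat = G - rho * diag(G)
    Sigma = [[G[j][k] - (rho * G[j][j] if j == k else 0) for k in range(p)] for j in range(p)]
    gamma = [sum(Xt[i][j] * y[i] for i in range(n)) / n for j in range(p)]
    return Sigma, gamma

# Toy data: n = 2, p = 2, one missing entry, rho = 1/2.
X = [[1, None], [1, 2]]
y = [1, 2]
Sigma_hat, gamma_hat = mcar_surrogates(X, y, Fraction(1, 2))
determinant = Sigma_hat[0][0] * Sigma_hat[1][1] - Sigma_hat[0][1] * Sigma_hat[1][0]
```

On this toy data $\hat\Sigma$ has diagonal entries 2 and 4 and off-diagonal entries 4, so its determinant is $-8$: the matrix is indefinite, which illustrates why plugging it into the Lasso objective can produce a nonconvex problem.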
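The obtained matrix $\hat\Sigma$ may not be positive semi-definite, and therefore the resulting problem may be nonconvex. This is why here and in Chapter 5, we call this estimator NClasso for "nonconvex Lasso":
\[ \hat\beta^{\mathrm{NClasso}}_\lambda \in \arg\min_{\|\beta\|_1 \le R} \frac{1}{2} \beta^T \hat\Sigma \beta - \beta^T \hat\gamma + \lambda\|\beta\|_1. \tag{2.27} \]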
The additional constraint $\|\beta\|_1 \le R$ is made necessary by the nonconvexity to ensure the existence of a solution. Surprisingly, Loh and Wainwright [2012] proved that nonconvexity of (2.27) is not an issue since all global minima lie in the same small neighborhood, and solving it with an algorithm based on projected gradient descent gives a solution that is close to this neighborhood. Their result holds provided $R = b_0 \sqrt{s^0}$, where $b_0 \ge \|\beta^0\|_2$, which cannot be checked in practice. Moreover, it holds only under the MCAR assumption and concerns only the estimation error.
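Later, Datta and Zou [2017] convexified problem (2.27) by replacing $\hat\Sigma$ (2.25) by the nearest positive semi-definite matrix, as measured by the elementwise maximum norm. More precisely, they define
\[ (\hat\Sigma)_+ := \arg\min_{A \succeq 0} \max_{1 \le i, j \le p} |A_{ij} - \hat\Sigma_{ij}|. \tag{2.28} \]
Their Convex Conditioned Lasso (CoCoLasso) is then defined by
\[ \hat\beta^{\mathrm{CoCoLasso}}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \beta^T (\hat\Sigma)_+ \beta - \beta^T \hat\gamma + \lambda\|\beta\|_1, \tag{2.29} \]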
where $(\hat\Sigma)_+$ and $\hat\gamma$ are defined by (2.28) and (2.26) respectively. In addition to bounding the estimation error, the authors prove sign consistency of CoCoLasso in case of MCAR data, assuming that the entries of $X$ are bounded and under the irrepresentable condition (2.18) and the bound (2.19).
The sign consistency result of Datta and Zou [2017] seems to be the only existing result about support recovery in high dimension in the presence of missing data. Note however that the works of Loh and Wainwright [2012] and Datta and Zou [2017] more generally deal with the issue of error-in-variables, and both works treat the missing data issue as a special case of multiplicative error. In particular, they do not report any numerical experiments in the missing data setting.
2.6 The sparse corruption problem
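In the linear model (2.1), one generally assumes that the error $\epsilon$ is dense, in the sense that all its entries are almost surely nonzero. However, in some applications the response is occasionally corrupted by a large magnitude error. This can be modeled by the following sparse corruption model,
\[ y = X\beta^0 + \sqrt{n}\,\omega^0 + \epsilon, \tag{2.30} \]
where $\omega^0 \in \mathbb{R}^n$ is a deterministic sparse vector of corruptions with arbitrarily large nonzero entries, and $\epsilon \sim N(0, \sigma^2 I_n)$ is the usual dense noise. The scaling $\sqrt{n}\,\omega^0$ is used since $X$ is generally standardized so that each column has Euclidean norm $\|X_j\|_2 = \sqrt{n}$. We will denote the support of $\omega^0$ by $T^0 := \mathrm{supp}(\omega^0)$, and its sparsity index by $k^0 := |T^0|$. By concatenating $X$ and $\sqrt{n} I_n$, (2.30) can be rewritten
\[ y = \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix} \begin{bmatrix} \beta^0 \\ \omega^0 \end{bmatrix} + \epsilon, \]
which is simply a sparse linear model with augmented design matrix $\begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix}$ and sparse vector $\begin{bmatrix} \beta^0 \\ \omega^0 \end{bmatrix}$. An important line of work has therefore followed the natural approach of applying standard $\ell_1$-techniques to the pair $(y, \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix})$ to jointly estimate $\beta^0$ and $\omega^0$.
When $\epsilon = 0$, applying BP to the pair $(y, \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix})$ and using a tuning parameter $\lambda > 0$ to allow a different penalization for $\beta$ and $\omega$ gives the Justice Pursuit² (JP) estimator
\[ (\hat\beta^{\mathrm{JP}}_\lambda, \hat\omega^{\mathrm{JP}}_\lambda) = \arg\min_{\beta \in \mathbb{R}^p, \, \omega \in \mathbb{R}^n} \|\beta\|_1 + \lambda\|\omega\|_1 \quad \text{s.t.} \quad y = X\beta + \sqrt{n}\,\omega. \tag{2.31} \]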
The first works on JP considered $\lambda = 1$. It was originally introduced by Wright et al. [2009] in the context of face recognition, and further analyzed in Wright and Ma [2010] for highly correlated dictionaries. Laska et al. [2009] and Li et al. [2010] proved that if the entries of $X$ are i.i.d. standard Gaussian, then $\frac{1}{\sqrt{n}} \begin{bmatrix} X & \sqrt{n} I_n \end{bmatrix}$ satisfies some RIP with high probability, implying exact recovery of both $\beta^0$ and $\omega^0$. However, the sparsity level assumed for $\omega^0$ in these works does not allow a large proportion of corruptions.
In order to asymptotically allow a positive fraction of corruptions (i.e. $k^0 = O(n)$), Li [2013] and Nguyen and Tran [2013a] introduced the tuning parameter $\lambda \neq 1$ in (2.31). Li [2013] deals with i.i.d. Gaussian designs, whereas the rows of the design matrix in Nguyen and Tran [2013a] are randomly drawn from the rows of an orthogonal matrix.
When $\epsilon \neq 0$, Nguyen and Tran [2013b] proposed the Robust Lasso,
\[ (\hat\beta^{\mathrm{Rlasso}}_{(\lambda_\beta, \lambda_\omega)}, \hat\omega^{\mathrm{Rlasso}}_{(\lambda_\beta, \lambda_\omega)}) = \arg\min_{\beta \in \mathbb{R}^p, \, \omega \in \mathbb{R}^n} \frac{1}{2}\|y - X\beta - \sqrt{n}\,\omega\|_2^2 + \lambda_\beta\|\beta\|_1 + \lambda_\omega\|\omega\|_1. \tag{2.32} \]
Problem (2.32) is actually equivalent to $\ell_1$-penalized Huber loss regression (see e.g. Sardy et al. [2001] or Dalalyan and Thompson [2019]). Nguyen and Tran [2013b] extended the theoretical analysis of the Lasso to the Robust Lasso for estimation error and model selection. Regarding model selection, they proved consistent sign recovery for correlated Gaussian designs, under the conditions (2.18) and (2.19). When the number of large corruptions $k^0$ is arbitrarily close to $n$, their result requires a number of measurements $n$ satisfying $\frac{n}{\log n} = \Omega(s^0 \log(p - s^0))$.
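The link with the Huber loss can be checked numerically for a single observation. Assuming the fit term of the Robust Lasso is $\frac{1}{2}\|y - X\beta - \sqrt{n}\,\omega\|_2^2$, consistent with model (2.30), minimizing over $\omega_i$ for a fixed residual $r_i = y_i - (X\beta)_i$ is a one-dimensional problem solved by soft-thresholding, and the minimal value is the Huber loss of $r_i$ with threshold $t = \lambda_\omega / \sqrt{n}$. The sketch below, with hypothetical function names, compares the two expressions.

```python
from math import sqrt

def soft_threshold(r, t):
    """Soft-thresholding: shrink r towards zero by t."""
    return max(abs(r) - t, 0.0) * (1 if r > 0 else -1)

def profiled_loss(r, lam_omega, n):
    """Minimum over omega of 0.5 * (r - sqrt(n) * omega)^2 + lam_omega * |omega|."""
    t = lam_omega / sqrt(n)
    u = soft_threshold(r, t)          # optimal value of sqrt(n) * omega
    return 0.5 * (r - u) ** 2 + t * abs(u)

def huber(r, t):
    """Huber loss: quadratic for |r| <= t, linear beyond."""
    return 0.5 * r ** 2 if abs(r) <= t else t * abs(r) - 0.5 * t ** 2

lam_omega, n = 2.0, 4                  # so that t = lam_omega / sqrt(n) = 1
residuals = [-3.0, -0.5, 0.0, 0.2, 4.0]
gaps = [abs(profiled_loss(r, lam_omega, n) - huber(r, 1.0)) for r in residuals]
```

All gaps are zero up to rounding, so after minimizing out $\omega$ the objective becomes a sum of Huber losses of the residuals plus the penalty $\lambda_\beta\|\beta\|_1$: small residuals are treated quadratically and large ones, absorbed by $\omega$, only linearly.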
² Name coined by Laska et al. [2009].