Lemma 6.9.1. Let $A$ be a matrix in $\mathbb{R}^{n\times n}$. Then
$$\|H(A)\|_{op} \le 2\|A\|_{op}.$$

Proof. Since $H(A) = A - \mathrm{diag}(A)$, the triangle inequality yields
$$\|H(A)\|_{op} \le \|A\|_{op} + \|\mathrm{diag}(A)\|_{op}, \qquad \text{where } \|\mathrm{diag}(A)\|_{op} = \max_i |A_{ii}| \le \|A\|_{op}.$$
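As a quick numerical sanity check (assuming, as used above, that $H$ denotes the operator zeroing out the diagonal; the helper names below are ours), the bound of Lemma 6.9.1 can be verified on random matrices:

```python
import numpy as np

def hollow(A):
    # H(A): the off-diagonal part of A, i.e. A with its diagonal zeroed
    # (assumed definition of the operator H used in this appendix)
    return A - np.diag(np.diag(A))

def op_norm(M):
    # spectral (operator) norm
    return np.linalg.norm(M, ord=2)

rng = np.random.default_rng(0)
for _ in range(200):
    A = rng.standard_normal((8, 8))
    # Lemma 6.9.1: ||H(A)||_op <= ||A||_op + max_i |A_ii| <= 2 ||A||_op
    assert op_norm(hollow(A)) <= 2 * op_norm(A) + 1e-10
```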

Lemma 6.9.2. For any random matrix $W$ with independent centered columns, we have
$$\|H(W^\top W)\|_{op} \le 2\,\big\|W^\top W - \mathbb{E}\big[W^\top W\big]\big\|_{op}.$$

Proof. Since $\mathbb{E}[W^\top W]$ is a diagonal matrix, it follows that
$$H(W^\top W) = H\big(W^\top W - \mathbb{E}[W^\top W]\big).$$
The result follows from Lemma 6.9.1.
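The reduction in Lemma 6.9.2 can likewise be checked numerically: for $W$ with i.i.d. $\mathcal{N}(0,\sigma^2)$ entries, $\mathbb{E}[W^\top W] = p\sigma^2 I_n$ is diagonal, and the claimed inequality holds surely. A sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma = 12, 6, 1.5
W = sigma * rng.standard_normal((p, n))     # independent centered columns
G = W.T @ W
EG = p * sigma**2 * np.eye(n)               # E[W^T W] for i.i.d. N(0, sigma^2) entries

H_G = G - np.diag(np.diag(G))               # H(G): off-diagonal part
op_norm = lambda M: np.linalg.norm(M, ord=2)

# Since EG is diagonal, H(G) = H(G - EG), and Lemma 6.9.1 gives the factor 2.
assert op_norm(H_G) <= 2 * op_norm(G - EG) + 1e-10
```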

Lemma 6.9.3. Let $u \in S^{p-1}$ and $v \in S^{n-1}$, and let $W \in \mathbb{R}^{p\times n}$ be a matrix with i.i.d. centered Gaussian entries of variance at most $\sigma^2$. Then, for some $c, C > 0$,
$$\forall t \ge 2\sigma^2, \qquad \mathbb{P}\left(\Big\|\frac{1}{\sqrt n}\,W^\top u v^\top\Big\|_{op}^2 \ge t\right) \le e^{-cnt/\sigma^2},$$
and
$$\mathbb{E}\left(\Big\|\frac{1}{\sqrt n}\,W^\top u v^\top\Big\|_{op}^2\right) \le C\sigma^2.$$
Proof. We can easily check that
$$\Big\|\frac{1}{\sqrt n}\,W^\top u v^\top\Big\|_{op} \le \frac{1}{\sqrt n}\,\|W^\top u\|_2.$$
Since $\|u\|_2 = 1$, the vector $W^\top u$ is Gaussian with mean $0$ and covariance matrix $\sigma^2 I_n$. We conclude using a tail inequality for quadratic forms of sub-Gaussian random variables, together with the fact that $t \ge 2\sigma^2$; see, e.g., Hsu et al. (2012). The inequality in expectation follows by integration of the tail bound.
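The first display in the proof rests on the rank-one identity $\|z v^\top\|_{op} = \|z\|_2\,\|v\|_2$, which here gives equality since $\|v\|_2 = 1$. A short numerical check (names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 9, 5
W = rng.standard_normal((p, n))
u = rng.standard_normal(p); u /= np.linalg.norm(u)   # u in S^{p-1}
v = rng.standard_normal(n); v /= np.linalg.norm(v)   # v in S^{n-1}

z = W.T @ u                      # Gaussian vector in R^n, since ||u||_2 = 1
M = np.outer(z, v) / np.sqrt(n)  # (1/sqrt(n)) W^T u v^T

# operator norm of the rank-one matrix z v^T equals ||z||_2 ||v||_2 = ||z||_2
assert np.isclose(np.linalg.norm(M, ord=2), np.linalg.norm(z) / np.sqrt(n))
```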

Lemma 6.9.4. Let $W \in \mathbb{R}^{p\times n}$ be a matrix with i.i.d. centered Gaussian entries of variance at most $\sigma^2$. For some $c, C, C' > 0$ we have
$$\forall t \ge C\sigma^2\Big(1 \vee \sqrt{\tfrac{p}{n}}\Big), \qquad \mathbb{P}\left(\frac{1}{n}\|H(W^\top W)\|_{op} \ge t\right) \le e^{-cnt/\sigma^2\,\left(1 \wedge tn/(p\sigma^2)\right)},$$
and
$$\mathbb{E}\left(\frac{1}{n^2}\|H(W^\top W)\|_{op}^2\right) \le C'\sigma^4\Big(1 + \frac{p}{n}\Big).$$

Proof. Using Lemma 6.9.2, we get
$$\mathbb{P}\left(\frac{1}{n}\|H(W^\top W)\|_{op} \ge t\right) \le \mathbb{P}\left(\frac{1}{n}\|W^\top W - \mathbb{E}(W^\top W)\|_{op} \ge t/2\right).$$
Now, based on Theorem 4.6.1 in Vershynin (2018), we get moreover that
$$\mathbb{P}\left(\frac{1}{n}\|H(W^\top W)\|_{op} \ge t\sigma^2\right) \le 9^n \cdot 2e^{-cnt(1 \wedge tn/p)},$$
for some $c > 0$. For $t \ge C\big(1 \vee \sqrt{p/n}\big)\sigma^2$ with $C$ large enough, we get $ct\big(1 \wedge tn/(p\sigma^2)\big) \ge 4\sigma^2 \log 9$, hence
$$\mathbb{P}\left(\frac{1}{n}\|H(W^\top W)\|_{op} \ge t\right) \le e^{-c'nt/\sigma^2\,\left(1 \wedge tn/(p\sigma^2)\right)},$$
for some $c' > 0$. The result in expectation is immediate by integration.

Lemma 6.9.5. For any $x \in \{-1, 1\}^n$ and $y \in \mathbb{R}^n$, we have
$$\frac{1}{n}\sum_{i=1}^n \big|x_i - \mathrm{sign}(y_i)\big| \le 2\,\Big\|\frac{x}{\sqrt n} - y\Big\|_2^2.$$

Proof. It is enough to observe that if $x_i \in \{-1, 1\}$, then
$$|x_i - \mathrm{sign}(y_i)| = 2\cdot\mathbf{1}\big(x_i \ne \mathrm{sign}(y_i)\big) \le 2\,|x_i - \sqrt n\, y_i|^2,$$
and to sum this bound over $i$.
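The pointwise bound behind Lemma 6.9.5 is easy to check by simulation (a sketch; for the coordinatewise identity to hold, no coordinate of $y$ should vanish, which happens almost surely for continuous noise):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.choice([-1.0, 1.0], size=n)
y = rng.standard_normal(n) / np.sqrt(n)

lhs = np.mean(np.abs(x - np.sign(y)))            # (1/n) sum_i |x_i - sign(y_i)|
rhs = 2.0 * np.sum((x / np.sqrt(n) - y) ** 2)    # 2 || x/sqrt(n) - y ||_2^2

# per coordinate: |x_i - sign(y_i)| = 2*1(x_i != sign(y_i)) <= 2 |x_i - sqrt(n) y_i|^2
assert lhs <= rhs + 1e-12
```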

### Part III

### Adaptive robust estimation


### Chapter 7

### Adaptive robust estimation in sparse vector model

For the sparse vector model, we consider estimation of the target vector, of its $\ell_2$-norm and of the noise variance. We construct adaptive estimators and establish the optimal rates of adaptive estimation when adaptation is considered with respect to the triplet

"noise level – noise distribution – sparsity". We consider classes of noise distributions with polynomially and exponentially decreasing tails as well as the case of Gaussian noise. The obtained rates turn out to be different from the minimax non-adaptive rates when the triplet is known. A crucial issue is the ignorance of the noise variance. Moreover, knowing or not knowing the noise distribution can also influence the rate. For example, the rates of estimation of the noise variance can differ depending on whether the noise is Gaussian or sub-Gaussian without a precise knowledge of the distribution.

Estimation of noise variance in our setting can be viewed as an adaptive variant of robust estimation of scale in the contamination model, where instead of fixing the "nominal"

distribution in advance we assume that it belongs to some class of distributions.

Based on Comminges et al. (2018): Comminges, L., Collier, O., Ndaoud, M., and Tsybakov, A. B. (2018). Adaptive robust estimation in sparse vector model. arXiv preprint arXiv:1802.04230v3.

### 7.1 Introduction

This chapter considers estimation of the unknown sparse vector, of its $\ell_2$-norm and of the noise level in the sparse sequence model. The focus is on construction of estimators that are optimally adaptive in a minimax sense with respect to the noise level, to the form of the noise distribution, and to the sparsity.

We consider the model defined as follows. Let the signal $\theta = (\theta_1, \ldots, \theta_d)$ be observed with noise of unknown magnitude $\sigma > 0$:
$$Y_i = \theta_i + \sigma\xi_i, \qquad i = 1, \ldots, d. \tag{7.1}$$
The noise random variables $\xi_1, \ldots, \xi_d$ are assumed to be i.i.d. and we denote by $P_\xi$ the unknown distribution of $\xi_1$. We assume throughout that the noise is zero-mean, $\mathbb{E}(\xi_1) = 0$, and that $\mathbb{E}(\xi_1^2) = 1$, since $\sigma$ needs to be identifiable. We denote by $P_{\theta, P_\xi, \sigma}$ the distribution of $Y = (Y_1, \ldots, Y_d)$ when the signal is $\theta$, the noise level is $\sigma$ and the distribution of the noise variables is $P_\xi$. We also denote by $\mathbb{E}_{\theta, P_\xi, \sigma}$ the expectation with respect to $P_{\theta, P_\xi, \sigma}$.

We assume that the signal $\theta$ is $s$-sparse, i.e.,
$$\|\theta\|_0 = \sum_{i=1}^d \mathbf{1}_{\theta_i \ne 0} \le s,$$
where $s \in \{1, \ldots, d\}$ is an integer. Set $\Theta_s = \{\theta \in \mathbb{R}^d : \|\theta\|_0 \le s\}$. We consider the problems of estimating $\theta$ under the $\ell_2$ loss, estimating the variance $\sigma^2$, and estimating the $\ell_2$-norm
$$\|\theta\|_2 = \Big(\sum_{i=1}^d \theta_i^2\Big)^{1/2}.$$
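To fix ideas, the observation model (7.1) with an $s$-sparse signal can be simulated as follows (a minimal sketch; the particular values of $d$, $s$, $\sigma$ and the nonzero entries are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(4)
d, s, sigma = 1000, 10, 2.0

# s-sparse signal theta in the class Theta_s
theta = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta[support] = 5.0 * rng.standard_normal(s)

# noise xi_1, ..., xi_d i.i.d. with E xi = 0 and E xi^2 = 1 (here: standard Gaussian)
xi = rng.standard_normal(d)
Y = theta + sigma * xi                    # observations, model (7.1)

assert np.count_nonzero(theta) <= s       # ||theta||_0 <= s
l2_norm = np.sqrt(np.sum(theta ** 2))     # ||theta||_2, one of the targets of estimation
```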

The classical Gaussian sequence model corresponds to the case where the noise $\xi_i$ is standard Gaussian ($P_\xi = \mathcal{N}(0,1)$) and the noise level $\sigma$ is known. Then, the optimal rate of estimation of $\theta$ under the $\ell_2$ loss in a minimax sense on the class $\Theta_s$ is $\sigma\sqrt{s\log(ed/s)}$, and it is attained by thresholding estimators (Donoho et al., 1992). Also, for the Gaussian sequence model with known $\sigma$, a minimax optimal estimator of the norm $\|\theta\|_2$ as well as the corresponding minimax rate are available from Collier et al. (2017) (see Table 7.1).
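The thresholding estimators mentioned above keep only the observations that stand out from the noise. A minimal sketch of hard thresholding (the threshold constant below is illustrative and ours, not the exact minimax-optimal choice):

```python
import numpy as np

def hard_threshold(Y, tau):
    # zero out every coordinate of Y whose magnitude is at most tau
    return np.where(np.abs(Y) > tau, Y, 0.0)

rng = np.random.default_rng(5)
d, s, sigma = 1000, 10, 1.0
theta = np.zeros(d)
theta[:s] = 10.0                                   # strong s-sparse signal
Y = theta + sigma * rng.standard_normal(d)

tau = sigma * np.sqrt(2.0 * np.log(d / s))         # threshold of order sigma * sqrt(log(d/s))
theta_hat = hard_threshold(Y, tau)
err = np.linalg.norm(theta_hat - theta)            # l2 loss of the estimate
```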

In this chapter, we study estimation of the three objects $\theta$, $\|\theta\|_2$, and $\sigma^2$ in the following two settings.

(a) The distribution of $\xi_i$ and the noise level $\sigma$ are both unknown. This is the main setting of our interest. For the unknown distribution of $\xi_i$, we consider two types of assumptions. Either $P_\xi$ belongs to a class $\mathcal{G}_{a,\tau}$, i.e., for some $a, \tau > 0$,
$$P_\xi \in \mathcal{G}_{a,\tau} \iff \mathbb{E}(\xi_1) = 0,\ \mathbb{E}(\xi_1^2) = 1 \ \text{and}\ \forall t \ge 2,\ \mathbb{P}\big(|\xi_1| > t\big) \le 2e^{-(t/\tau)^a}, \tag{7.2}$$
which includes for example sub-Gaussian distributions ($a = 2$), or to a class of distributions with polynomially decaying tails $\mathcal{P}_{a,\tau}$, i.e., for some $\tau > 0$ and $a \ge 2$,
$$P_\xi \in \mathcal{P}_{a,\tau} \iff \mathbb{E}(\xi_1) = 0,\ \mathbb{E}(\xi_1^2) = 1 \ \text{and}\ \forall t \ge 2,\ \mathbb{P}\big(|\xi_1| > t\big) \le \Big(\frac{\tau}{t}\Big)^a. \tag{7.3}$$
We propose estimators of $\theta$, $\|\theta\|_2$, and $\sigma^2$ that are optimal in a non-asymptotic minimax sense on these classes of distributions and the sparsity class $\Theta_s$. We establish the corresponding non-asymptotic minimax rates. They are given in the second and third columns of Table 7.1. We also provide the minimax optimal estimators.

(b) Gaussian noise $\xi_i$ and unknown $\sigma$. The results on the non-asymptotic minimax rates are summarized in the first column of Table 7.1. Notice an interesting effect: the rates of estimation of $\sigma^2$ and of the norm $\|\theta\|_2$ when the noise is Gaussian are faster than the optimal rates when the noise is sub-Gaussian. This can be seen by comparing the first column of Table 7.1 with the particular case $a = 2$ of the second column, corresponding to sub-Gaussian noise.
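As an illustration of definition (7.2), one can check numerically that the standard Gaussian distribution belongs to $\mathcal{G}_{2,\tau}$ for $\tau = \sqrt 2$, using the standard bound $\mathbb{P}(|\xi_1| > t) = 2(1 - \Phi(t)) \le e^{-t^2/2}$ (the specific choice of $\tau$ here is ours):

```python
from math import erf, exp, sqrt

def gaussian_two_sided_tail(t):
    # P(|xi| > t) for xi ~ N(0, 1), via the error function
    Phi = 0.5 * (1.0 + erf(t / sqrt(2.0)))
    return 2.0 * (1.0 - Phi)

a, tau = 2.0, sqrt(2.0)
for t in [2.0, 3.0, 5.0, 8.0]:
    # membership condition of the class G_{a,tau} in (7.2), checked pointwise
    assert gaussian_two_sided_tail(t) <= 2.0 * exp(-((t / tau) ** a))
```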

Some comments about Table 7.1 and additional details are in order.

| | Gaussian noise | Noise in class $\mathcal{G}_{a,\tau}$ | Noise in class $\mathcal{P}_{a,\tau}$ |
|---|---|---|---|
| $\theta$ | $\sqrt{s\log(ed/s)}$ | $\sqrt{s}\,\log^{1/a}(ed/s)$ | $\sqrt{s}\,(d/s)^{1/a}$ |
| | $\sigma$ known: Donoho et al. (1992); $\sigma$ unknown: Verzelen (2012) | $\sigma$ unknown | $\sigma$ unknown |
| $\|\theta\|_2$ | $\sqrt{s\log(1 + \sqrt{d}/s)} \wedge d^{1/4}$ | $\sqrt{s}\,\log^{1/a}(ed/s) \wedge d^{1/4}$ | $\sqrt{s}\,(d/s)^{1/a} \wedge d^{1/4}$ |
| | $\sigma$ known: Collier et al. (2017) | $\sigma$ known | $\sigma$ known |
| | $\sqrt{s\log(1 + \sqrt{d}/s)} \vee \sqrt{\dfrac{s}{1 + \log_+(s^2/d)}}$ | $\sqrt{s}\,\log^{1/a}(ed/s)$ | $\sqrt{s}\,(d/s)^{1/a}$ |
| | $\sigma$ unknown | $\sigma$ unknown | $\sigma$ unknown |
| $\sigma^2$ | $\dfrac{1}{\sqrt d} \vee \dfrac{s}{d(1 + \log_+(s^2/d))}$ | $\dfrac{1}{\sqrt d} \vee \dfrac{s}{d}\log^{2/a}\Big(\dfrac{ed}{s}\Big)$ | $\dfrac{1}{\sqrt d} \vee \Big(\dfrac{s}{d}\Big)^{1 - 2/a}$ |

Table 7.1: Optimal rates of convergence.

• The difference between the minimax rates for estimation of $\theta$ and estimation of the $\ell_2$-norm $\|\theta\|_2$ turns out to be specific to the pure Gaussian noise model. It disappears for the classes $\mathcal{G}_{a,\tau}$ and $\mathcal{P}_{a,\tau}$. This is somewhat unexpected since $\mathcal{G}_{2,\tau}$ is the class of sub-Gaussian distributions, and it turns out that $\|\theta\|_2$ is estimated optimally at different rates for sub-Gaussian and pure Gaussian noise. Another conclusion is that if the noise is not Gaussian and $\sigma$ is unknown, the minimax rate for $\|\theta\|_2$ does not have an elbow between the "dense" ($s > \sqrt d$) and the "sparse" ($s \le \sqrt d$) zones.

• For the problem of estimation of the variance $\sigma^2$ with known distribution of the noise $P_\xi$, we consider a more general setting than (b) mentioned above. We show that when the noise distribution is exactly known (and satisfies a rather general assumption, not necessarily Gaussian; it can have polynomial tails), then the rate of estimation of $\sigma^2$ can be as fast as $\max\Big(\dfrac{1}{\sqrt d}, \dfrac{s}{d}\Big)$, which is faster than the optimal rate $\max\Big(\dfrac{1}{\sqrt d}, \dfrac{s}{d}\log\dfrac{ed}{s}\Big)$ for the class of sub-Gaussian noise. In other words, the phenomenon of improved rate is not due to the Gaussian character of the noise but rather to the fact that the noise distribution is known.

• Our findings show that there is a dramatic difference between the behavior of optimal estimators of $\theta$ in the sparse sequence model and in the sparse linear regression model with "well spread" regressors. It is known from Gautier and Tsybakov (2013); Belloni et al. (2014) that in sparse linear regression with "well spread" regressors (that is, having positive variance), the rates of estimating $\theta$ are the same for noise with sub-Gaussian and polynomial tails. We show that the situation is quite different in the sparse sequence model, where the optimal rates are much slower and depend on the polynomial index of the noise.

• The rates shown in Table 7.1 for the classes $\mathcal{G}_{a,\tau}$ and $\mathcal{P}_{a,\tau}$ are achieved by estimators that are adaptive to the sparsity index $s$. Thus, knowing or not knowing $s$ does not influence the optimal rates of estimation when the distribution of $\xi$ and the noise level $\sigma$ are unknown.

We conclude this section with a discussion of related work. Chen et al. (2018) explore the problem of robust estimation of the variance and of the covariance matrix under Huber's contamination model. As explained in Section 7.4 below, this problem has similarities with estimation of the noise level in our setting. The main difference is that instead of fixing in advance the Gaussian nominal distribution of the contamination model, we assume that it belongs to a class of distributions, such as (7.2) or (7.3). Therefore, the corresponding results in Section 7.4 can be viewed as results on robust estimation of scale where, in contrast to the classical setting, we are interested in adaptation to the unknown nominal law. Another aspect of robust estimation of scale is analyzed by Wei and Minsker (2017), who consider classes of distributions similar to $\mathcal{P}_{a,\tau}$ rather than the contamination model. The main aim in Wei and Minsker (2017) is to construct estimators having sub-Gaussian deviations under weak moment assumptions. Our setting is different in that we consider the sparsity class $\Theta_s$ of vectors $\theta$, and the rates that we obtain depend on $s$. Estimation of the variance in the sparse linear model is discussed in Sun and Zhang (2012), where some upper bounds for the rates are given. We also mention the recent paper Golubev and Krymova (2017), which deals with estimation of the variance in linear regression in a framework that does not involve sparsity, as well as the work on estimation of signal-to-noise ratio functionals in settings involving sparsity (Verzelen and Gassiat, 2018; Guo et al., 2018) and not involving sparsity (Janson et al., 2017).

Papers by Collier et al. (2018) and Carpentier and Verzelen (2019) discuss estimation of functionals other than the $\ell_2$-norm $\|\theta\|_2$ in the sparse vector model when the noise is Gaussian with unknown variance.

Notation. For $x > 0$, let $\lfloor x \rfloor$ denote the maximal integer smaller than $x$. For a finite set $A$, we denote by $|A|$ its cardinality. Let $\inf_{\hat T}$ denote the infimum over all estimators. The notation $C, C', c, c'$ will be used for positive constants that can depend only on $a$ and $\tau$ and can vary from line to line.