In order to prove Theorem 4.2.1, we use the following result from Butucea et al. (2018).

Consider the set of binary vectors
$$\mathcal{A} = \{\eta \in \{0,1\}^p : |\eta|_0 \le s\}$$

and assume that we are given a family $\{P_\eta,\ \eta \in \mathcal{A}\}$, where each $P_\eta$ is a probability distribution on a measurable space $(\mathcal{X},\mathcal{U})$. We observe $X$ drawn from $P_\eta$ with some unknown $\eta = (\eta_1,\ldots,\eta_p) \in \mathcal{A}$, and we consider the Hamming risk of a selector $\hat\eta = \hat\eta(X)$:

$$\sup_{\eta\in\mathcal{A}} E_\eta|\hat\eta - \eta|,$$
where $E_\eta$ is the expectation with respect to $P_\eta$. We call a selector any estimator with values in $\{0,1\}^p$. Let $\pi$ be a probability measure on $\{0,1\}^p$ (a prior on $\eta$), and denote by $E^\pi$ the expectation with respect to $\pi$. Then the following result is proved in Butucea et al. (2018).

Theorem 4.8.1 (Butucea et al. (2018)). Let $\pi$ be the product of $p$ Bernoulli measures with parameter $s'/p$, where $s' \in (0, s]$. Then
$$\inf_{\hat\eta}\, \sup_{\eta\in\mathcal{A}} E_\eta|\hat\eta - \eta| \ \ge\ \inf_{\hat T\in[0,1]^p} E^\pi E_\eta \sum_{i=1}^p |\hat T_i - \eta_i| \ -\ 4s'\exp\Big(-\frac{(s-s')^2}{2s}\Big), \qquad (4.29)$$
where $\inf_{\hat\eta}$ is the infimum over all selectors and $\inf_{\hat T\in[0,1]^p}$ is the infimum over all estimators $\hat T = (\hat T_1,\ldots,\hat T_p)$ with values in $[0,1]^p$.
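To get a feel for the size of the remainder term in (4.29), here is a short Python sketch (illustrative only; the function name is ours). It evaluates $4s'\exp(-(s-s')^2/(2s))$ and checks that the choice $s' = s/2$, used later in the proof of Theorem 4.3.1, makes the remainder equal to $2s\,e^{-s/8}$, exponentially small in $s$.

```python
import math

def remainder(s, s_prime):
    """Remainder term 4*s' * exp(-(s - s')^2 / (2*s)) from the bound (4.29)."""
    return 4.0 * s_prime * math.exp(-((s - s_prime) ** 2) / (2.0 * s))

# With the choice s' = s/2, the remainder equals 2*s*exp(-s/8):
for s in (8, 40, 200):
    assert math.isclose(remainder(s, s / 2), 2.0 * s * math.exp(-s / 8.0))
print(remainder(40, 20))
```

For moderate sparsity the term is already negligible: at $s = 200$ it is below $10^{-8}$.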

Proof of Theorem 4.2.1. Let $\Theta(p,s,a)$ be the subset of $\Omega^p_{s,a}$ defined as
$$\Theta(p,s,a) = \{\theta \in \Omega^p_{s,a} : \theta_i = a,\ \forall i \in S_\theta\}.$$
Since any $\theta \in \Theta(p,s,a)$ can be written as $\theta = a\eta_\theta$, where $\eta_\theta \in \mathcal{A}$ is the indicator of the support of $\theta$, there is a one-to-one correspondence between $\mathcal{A}$ and $\Theta(p,s,a)$. Hence
$$\inf_{\hat\eta}\, \sup_{\eta\in\mathcal{A}} E_\eta|\hat\eta - \eta| = \inf_{\hat\eta}\, \sup_{\theta\in\Theta(p,s,a)} E_\theta|\hat\eta - \eta_\theta|.$$

4.8. APPENDIX: PROOFS 89
Using this remark and Theorem 4.8.1 we obtain that, for all $s' \in (0, s]$,
$$\inf_{\hat\eta}\, \sup_{\theta\in\Theta(p,s,a)} E_\theta|\hat\eta - \eta_\theta| \ \ge\ \inf_{\hat T\in[0,1]^p} E^\pi E_\eta \sum_{i=1}^p |\hat T_i - \eta_i| \ -\ 4s'\exp\Big(-\frac{(s-s')^2}{2s}\Big),$$

where $\pi$ is the product of $p$ Bernoulli measures with parameter $s'/p$. Thus, to finish the proof it remains to show that
$$\inf_{\hat T\in[0,1]^p} E^\pi E_\eta \sum_{i=1}^p |\hat T_i - \eta_i| \ \ge\ \Psi^+(n,p,s',a,\sigma). \qquad (4.30)$$

Here, $\varphi$ is the density of the Gaussian distribution in $\mathbb{R}^n$ with i.i.d. zero-mean components of variance $\sigma^2$. By the Bayesian version of the Neyman–Pearson lemma, the infimum in (4.30) is attained at the Bayes rule $\tilde T_i = T^*_i$, where $\xi$ is a standard Gaussian random vector in $\mathbb{R}^n$ independent of $X_i$. Notice now that $\varepsilon := X_i^\top \xi / \|X_i\|$ is a standard Gaussian random variable, and it is independent of $\|X_i\|$ since $X_i \sim \mathcal{N}(0, I_n)$. Combining the above arguments, we arrive at (4.30).

We conclude the proof by using the fact that the function $u \mapsto \Psi^+(n,p,u,a,\sigma)/u$ is decreasing for $u > 0$ (cf. Butucea et al. (2018)), so that $\Psi^+(n,p,s',a,\sigma) \ge \frac{s'}{s}\,\Psi^+(n,p,s,a,\sigma)$.

Proof of Theorem 4.3.1. In view of Theorem 4.2.1 with $s' = s/2$, it is sufficient to bound $\Psi^+ = \Psi^+(n,p,s,a,\sigma)$ from below. We have
$$\Psi^+ \ \ge\ (p-s)\,P\big(\sigma\varepsilon \ge t(\zeta)\big).$$

We will use the following bound for the tails of the standard Gaussian distribution: for some $c' > 0$,
$$\forall y \ge 2/3, \qquad P(\varepsilon \ge y) \ \ge\ \frac{c'\exp(-y^2/2)}{y}.$$
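As a numerical sanity check of the two Gaussian tail estimates used in these proofs (the lower bound above, and the upper bound $P(\varepsilon \ge y) \le e^{-y^2/2}$ used later for the upper bounds), here is a short Python sketch based on the complementary error function; the grid and the resulting constant are ours.

```python
import math

def gauss_tail(y):
    """P(eps >= y) for a standard Gaussian eps, via the complementary error function."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))

ratios = []
for k in range(300):
    y = 2.0 / 3.0 + 0.02 * k              # grid over [2/3, 6.64]
    # upper bound P(eps >= y) <= exp(-y^2/2), valid for all y > 0
    assert gauss_tail(y) <= math.exp(-y * y / 2.0)
    # ratio P(eps >= y) * y / exp(-y^2/2): its minimum is an admissible c'
    ratios.append(gauss_tail(y) * y / math.exp(-y * y / 2.0))
print(min(ratios))  # ≈ 0.21, so c' = 1/5 works on this grid
```

The ratio is smallest at $y = 2/3$ and increases towards $1/\sqrt{2\pi} \approx 0.399$ as $y$ grows, consistent with the Mills-ratio asymptotics.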

We also recall that the density $f_n$ of a chi-squared distribution with $n$ degrees of freedom has the form
$$f_n(u) = b_n\, u^{n/2-1} e^{-u/2}, \quad u > 0, \qquad (4.31)$$
where $b_n = \big(2^{n/2}\Gamma(n/2)\big)^{-1}$ is the normalizing constant. Combining the above remarks we get
$$\Psi^+ \ \ge\ (p-s)\,\cdots$$

Using the change of variables $v = u\big(1 + \frac{a^2}{4\sigma^2}\big)$ and the assumptions of the theorem we get
$$\cdots$$
where the second inequality uses the condition $a \le \sigma\sqrt{2}$ to guarantee that the argument of the Gaussian tail bound above is at least $2/3$.

Proposition 3.1 from Inglot (2010) implies that, for some absolute constant $c > 0$,
$$\int_n^\infty f_{n-1}(u)\,du > c$$
(indeed, $n$ is very close to the median of a chi-squared random variable with $n-1$ degrees of freedom). Combining the above inequalities we obtain
$$\Psi^+ \ \ge\ C\,\cdots$$
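The median fact invoked above is easy to check numerically. The following Python sketch (a Monte Carlo estimate, with sample sizes of our choosing) approximates $P(\chi^2_{n-1} \ge n) = \int_n^\infty f_{n-1}(u)\,du$ and shows that it stays bounded away from zero.

```python
import random

random.seed(0)

def chi2_tail_at_n(n, trials=4000):
    """Monte Carlo estimate of P(chi^2_{n-1} >= n) = int_n^infty f_{n-1}(u) du."""
    hits = 0
    for _ in range(trials):
        u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n - 1))
        if u >= n:
            hits += 1
    return hits / trials

# The median of chi^2_{n-1} is close to n - 1 (approximately (n-1)(1 - 2/(9(n-1)))^3
# by the Wilson-Hilferty approximation), so this probability is bounded away from 0.
for n in (5, 20, 100):
    print(n, chi2_tail_at_n(n))
```

The estimates increase towards $1/2$ as $n$ grows, in line with the central limit theorem for $\chi^2_{n-1}$.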

Proof of Theorem 4.3.2. In view of Theorem 4.2.2, it is sufficient to bound from above the expression
$$\Psi(n,p,s,a,\sigma) = (p-s)\,P\big(\sigma\varepsilon \ge t(\zeta)\big) + s\,P\big(\sigma\varepsilon \ge (a\|\zeta\| - t(\zeta))_+\big).$$
Introducing the event $D = \{a\|\zeta\| \ge t(\zeta)\}$ we get
$$P\big(\sigma\varepsilon \ge (a\|\zeta\| - t(\zeta))_+\big) \ \le\ P\big(\{\sigma\varepsilon \ge a\|\zeta\| - t(\zeta)\}\cap D\big) + \tfrac{1}{2}\,P(D^c),$$
since on $D^c$ the threshold $(a\|\zeta\| - t(\zeta))_+$ equals zero and $P(\sigma\varepsilon \ge 0) = \tfrac12$.
Using the assumption on $n_2$ we obtain
$$P(D^c) = P\big(a\|\zeta\| < t(\zeta)\big) = \cdots$$
Here, $\|\zeta\|^2$ is a chi-squared random variable with $n_2$ degrees of freedom. Lemma 4.4.2 implies
$$\tfrac{1}{2}\,P(D^c) \ \le\ e^{-n_2/24}.$$
Thus, to finish the proof it remains to show that
$$(p-s)\,P\big(\sigma\varepsilon \ge t(\zeta)\big) + s\,P\big(\{\sigma\varepsilon \ge a\|\zeta\| - t(\zeta)\}\cap D\big) \ \le\ 2\sqrt{\cdots}\,,$$
where $f_{n_2}(\cdot)$ is the density of the chi-squared distribution with $n_2$ degrees of freedom and $b_{n_2}$ is the corresponding normalizing constant, cf. (4.31). The claim follows by using again the bound $P(\varepsilon \ge y) \le e^{-y^2/2}$, $\forall y > 0$, together with the inequality $\cdots$
Proof of Lemma 4.4.1. Recall that the density of a Student random variable $Z$ with $k$ degrees of freedom is given by
$$f_Z(t) = c^*_k \Big(1 + \frac{t^2}{k}\Big)^{-(k+1)/2}, \qquad c^*_k = \frac{\Gamma\big((k+1)/2\big)}{\sqrt{k\pi}\,\Gamma(k/2)}.$$
It is easy to check that the derivative of $g$ has the form
$$g'(t) = \cdots$$
The lemma follows since, in view of (4.33), there exist two positive constants $c$ and $C$ such that $c \le c^*_k \le C$ for all $k \ge 1$.

Proof of Lemma 4.5.1. It is not hard to check that the random variable $|u^\top V|/\|u\|$ is $\sigma$-sub-Gaussian for any fixed $u \in \mathbb{R}^n$. Also, any $\sigma$-sub-Gaussian random variable $\zeta$ satisfies $P(|\zeta| \ge t) \le 2e^{-t^2/(2\sigma^2)}$ for all $t > 0$. Therefore, we have the following bound for the conditional probability:
$$\cdots$$
To bound the last probability, we apply the following inequality (Wegkamp, 2003, Proposition 2.6). Using this lemma with $Z_i = U_i^2$, $\mu_i \equiv 1$, $x = 3/4$, and $v^2 = 4$ we find
$$P\big(\|U\| \le \sqrt{n}/2\big) \ \le\ e^{-9n/128},$$
which together with (4.34) proves the lemma.

Proof of Proposition 4.5.1. Under the assumptions of the proposition, the columns of the matrix $X$ have covariance matrix $I_p$. Without loss of generality, we may assume that this covariance matrix is $\frac{1}{2}I_p$ and replace $\sigma$ by $\sigma/\sqrt{2}$. We next define the event
$$\mathcal{A} = \{\text{the design matrix } X \text{ satisfies the } WRE(s, 20) \text{ condition}\},$$
where the $WRE$ condition is defined in Bellec et al. (2018). It is easy to check that the assumptions of Theorem 8.3 in Bellec et al. (2018) are fulfilled, with $\Sigma = \frac{1}{2}I_p$ and $n_1 \ge C_0 s\log(2p/s)$ for some $C_0 > 0$ large enough. Using Theorem 8.3 in Bellec et al. (2018) we get
$$P(\mathcal{A}^c) \ \le\ 3e^{-C' s\log(2p/s)}$$

for some $C' > 0$. Now, in order to prove the proposition, we use the bound
$$P\big(\|\hat\theta - \theta\|^2 \ge \delta^2\sigma^2\big) \ \le\ P\big(\{\|\hat\theta - \theta\|^2 \ge \delta^2\sigma^2\}\cap\mathcal{A}\big) + P(\mathcal{A}^c).$$
Under the assumption $n_1 \ge C_0 s\log(ep/s)/\delta^2$, we have
$$P\big(\{\|\hat\theta - \theta\|^2 \ge \delta^2\sigma^2\}\cap\mathcal{A}\big) \ \le\ P\Big(\Big\{\|\hat\theta - \theta\|^2 \ge \frac{C_0\sigma^2 s\log(ep/s)}{n_1}\Big\}\cap\mathcal{A}\Big).$$
By choosing $C_0$ large enough, and using Proposition 4 from Comminges et al. (2018), we get that, for some $C'' > 0$,
$$P\big(\{\|\hat\theta - \theta\|^2 \ge \delta^2\sigma^2\}\cap\mathcal{A}\big) \ \le\ C''\big(e^{-s\log(2p/s)/C''} + e^{-n_1/C''}\big).$$
Recalling that $n_1 \ge C_0 s\log(2p/s)$ and combining the above inequalities, we obtain the result of the proposition with $C_1 = 2C'' + 3$ and $C_2 = C' \wedge 1/C'' \wedge C_0/C''$.

Proof of Proposition 4.6.1. We apply Theorem 6 in Lecué and Lerasle (2017). Thus, it is enough to check that items 1–5 of Assumption 6 in Lecué and Lerasle (2017) are satisfied. Item 1 is immediate since $|I| = n_1 - |O| \ge n_1/2$ and $|O| \le c_0 s\log(ep/s)$.

To check item 2, we first note that the random variable $x_1^\top t$ is sub-Gaussian with constant proportional to $\|t\|$, for any $t \in \mathbb{R}^p$. It follows from the standard properties of sub-Gaussian random variables (Vershynin, 2012, Lemma 5.5) that, for some $C > 0$,
$$\big(E|x_1^\top t|^d\big)^{1/d} \ \le\ C\|t\|\sqrt{d}, \qquad \forall t \in \mathbb{R}^p,\ \forall d \ge 1.$$

On the other hand, since the elements of $x_1$ are centered random variables with variance 1,
$$\big(E|x_1^\top t|^2\big)^{1/2} = \|t\|, \qquad \forall t \in \mathbb{R}^p. \qquad (4.35)$$
Combining the last two displays proves item 2. Item 3 holds since we assume that $E(|\xi_i|^{q'}) \le \sigma^{q'}$, $i \in I$, with $q' = 2 + q$. To prove item 4, we use (4.35) and the fact that, for some $C > 0$,
$$E|x_1^\top t| \ \ge\ C\|t\|, \qquad \forall t \in \mathbb{R}^p,$$
due to the Marcinkiewicz–Zygmund inequality (Petrov, 1995, page 82). Finally, we have that, for some $c > 0$,
$$\cdots$$
Thus, all conditions of Theorem 6 in Lecué and Lerasle (2017) are satisfied, and an application of this theorem yields the result.

Proof of Lemma 4.6.1. We first prove that, for all $i \in I$ and $1 \le j \le p$,
$$E\big[(\varepsilon_j^{(i)})^2\,\mathbf{1}\{\mathcal{A}_*\}\big] \ \le\ \frac{CK\sigma^2}{n}, \qquad (4.36)$$
where $C > 0$ depends only on the sub-Gaussian constant of the design. Indeed, the components of $\varepsilon^{(i)}$ have the form
$$\cdots$$
where $\bar C$ depends only on the same sub-Gaussian constant. Using these remarks we obtain from the last display that
$$E\big[(\varepsilon_j^{(i)})^2\,\mathbf{1}\{\mathcal{A}_*\}\big] \ \le\ \frac{2(\bar C + 2)\sigma^2}{q}.$$
As $q = \lfloor n/K \rfloor$, this yields (4.36).
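The last step uses only the elementary inequality $\lfloor n/K \rfloor \ge n/(2K)$ for $1 \le K \le n$, which turns the bound $2(\bar C + 2)\sigma^2/q$ into one of the form $CK\sigma^2/n$. A one-line exhaustive Python check over a small range:

```python
# Exhaustive check of floor(n/K) >= n/(2K) for all 1 <= K <= n <= 200.
# Proof sketch: n = q*K + r with q = n//K >= 1 and r < K <= q*K, so 2*q*K >= n.
ok = all(n // K >= n / (2 * K) for n in range(1, 201) for K in range(1, n + 1))
print(ok)  # → True
```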

Next, the definition of the median immediately implies that
$$\{|\mathrm{Med}(\varepsilon_j)| \ge t\} \ \subseteq\ \Big\{\sum_{i=1}^{K} \mathbf{1}_{\{|\varepsilon_j^{(i)}| \ge t\}} \ge \frac{K}{2}\Big\}.$$

Since the number of outliers $|O|$ does not exceed $\lfloor K/4 \rfloor$, there are at least $K' := K - \lfloor K/4 \rfloor$ blocks that contain only observations from $I$. Without loss of generality, assume that these blocks are indexed by $1, \ldots, K'$. Hence
$$P\big(|\mathrm{Med}(\varepsilon_j)| \ge t\big) \ \le\ P\Big(\sum_{i=1}^{K'} \mathbf{1}_{\{|\varepsilon_j^{(i)}| \ge t\}\cap\mathcal{A}_*} \ge \frac{K}{4}\Big) + P(\mathcal{A}_*^c). \qquad (4.37)$$
Note that, using (4.36), we have, for all $i = 1, \ldots, K'$,
$$P\big(\{|\varepsilon_j^{(i)}| \ge t\}\cap\mathcal{A}_*\big) \ \le\ \frac{E\big[(\varepsilon_j^{(i)})^2\,\mathbf{1}\{\mathcal{A}_*\}\big]}{t^2} \ \le\ \frac{CK\sigma^2}{t^2 n} \ \le\ \frac{1}{5}.$$

The last inequality is granted by choosing a large enough constant $c_4$ in the definition of $t$. Thus, introducing the notation $\zeta_i = \mathbf{1}_{\{|\varepsilon_j^{(i)}| \ge t\}\cap\mathcal{A}_*}$, we obtain
$$P\Big(\sum_{i=1}^{K'} \mathbf{1}_{\{|\varepsilon_j^{(i)}| \ge t\}\cap\mathcal{A}_*} \ge \frac{K}{4}\Big) \ \le\ P\Big(\sum_{i=1}^{K'} \big(\zeta_i - E(\zeta_i)\big) \ge \frac{K}{4} - \frac{K'}{5}\Big) \ \le\ P\Big(\sum_{i=1}^{K'} \big(\zeta_i - E(\zeta_i)\big) \ge \frac{K}{20}\Big) \ \le\ e^{-c_5 K}, \qquad (4.38)$$
where the last inequality is an application of Hoeffding's inequality. Combining (4.37) and (4.38) proves the lemma.
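To make the final Hoeffding step concrete, the sketch below compares the one-sided Hoeffding bound $\exp(-2x^2/K')$ for sums of centered $\{0,1\}$-valued variables, with $x = K/20$, against a Monte Carlo estimate. The specific values $K = 80$ and success probability $0.2$ (mirroring the bound $E(\zeta_i) \le 1/5$ established above) are our illustrative choices.

```python
import math
import random

random.seed(1)

def hoeffding_bound(K, K_prime):
    """Hoeffding bound exp(-2*x^2/K') with x = K/20, for K' variables in [0, 1]."""
    return math.exp(-2.0 * (K / 20.0) ** 2 / K_prime)

def empirical_tail(K_prime, p, x, trials=20000):
    """Monte Carlo estimate of P(sum_i (zeta_i - p) >= x), zeta_i ~ Bernoulli(p)."""
    hits = 0
    for _ in range(trials):
        s = sum((1.0 if random.random() < p else 0.0) - p for _ in range(K_prime))
        if s >= x:
            hits += 1
    return hits / trials

K = 80
K_prime = K - K // 4          # K' = K - floor(K/4) clean blocks, as in the proof
print(empirical_tail(K_prime, 0.2, K / 20.0), hoeffding_bound(K, K_prime))
# Since K' <= K, the bound is at most exp(-2*(K/20)^2/K) = exp(-K/200),
# so c_5 = 1/200 is admissible in (4.38).
```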

### Chapter 5

### Interplay of minimax estimation and minimax support recovery under sparsity

In this chapter, we study a new notion of scaled minimaxity for sparse estimation in the high-dimensional linear regression model. We present more optimistic lower bounds than those given by classical minimax theory and hence improve on existing results. We recover sharp results for global minimaxity as a consequence of our study. Fixing the scale of the signal-to-noise ratio, we prove that the estimation error can be much smaller than the global minimax error. We construct a new optimal estimator for scaled minimax sparse estimation. An optimal adaptive procedure is also described.

Based on: Ndaoud, M. (2019). Interplay of minimax estimation and minimax support recovery under sparsity. ALT 2019.