brevity $\mathrm{E}=\mathrm{E}_{\theta,P_\xi}$, we find

$d$. Finally, using again (7.28) we get, for $k=1,2$,

where the last inequality follows from the same argument as in the proof of Theorem 7.2.1. These remarks together with (7.56) imply

$$\mathrm{E}\,\big|\hat\sigma^2-\sigma^2\big|\ \le\ C\,\cdots$$

We conclude the proof by bounding $\sum_{j=1}^{s} z_{(d-j+1)}^{2}$ in the same way as at the end of the proof of Theorem 7.2.1.

### 7.6 Appendix: Proofs of the lower bounds

### Proof of Theorems 7.4.3 and 7.4.4 and part (ii) of Proposition 7.4.2

Since $\ell(t)\ge\ell(A)\,\mathbf 1_{t>A}$ for any $A>0$, it is enough to prove the theorems for the indicator loss $\ell(t)=\mathbf 1_{t>A}$. This remark is valid for all the proofs of this section and will not be repeated further.

(i) We first prove the lower bounds with the rate $1/\sqrt d$ in Theorems 7.4.3 and 7.4.4.

Let $f_0:\mathbb R\to[0,\infty)$ be a probability density with the following properties: $f_0$ is continuously differentiable, symmetric about 0, supported on $[-3/2,3/2]$, with variance 1 and finite Fisher information $I_{f_0}=\int(f_0'(x))^2(f_0(x))^{-1}\,dx$. The existence of such an $f_0$ is shown in Lemma 7.7.7. Denote by $F_0$ the probability distribution corresponding to $f_0$. Since $F_0$ is zero-mean, with variance 1 and supported on $[-3/2,3/2]$, it belongs to $\mathcal G_{a,\tau}$ for any $\tau>0$, $a>0$, and to $\mathcal P_{a,\tau}$ for any $\tau>0$, $a\ge2$. Define $P_0=P_{0,F_0,1}$, $P_1=P_{0,F_0,\sigma_1}$,

where $I(t)$ is the Fisher information corresponding to the density $f_0(x/t)/t$, that is, $I(t)=t^{-2}I_{f_0}$. It follows that $h^2\le\bar c\,c_0^2/d$, where $\bar c>0$ is a constant. This and (7.57) imply that for $c_0$ small enough we have $H(P_1,P_0)\le1/2$. Finally, choosing such a small $c_0$ and using Theorem 2.2(ii) in Tsybakov (2008), we obtain

$$\inf_{\hat T}\ \cdots$$
(ii) We now prove the lower bound with the rate $\frac sd\log^{2/a}(ed/s)$ in Theorem 7.4.3. It is enough to conduct the proof for $s\ge s_0$, where $s_0>0$ is an arbitrary absolute constant. Indeed, for $s\le s_0$ we have $\frac sd\log^{2/a}(ed/s)\le C/\sqrt d$, where $C>0$ is an absolute constant, and thus Theorem 7.4.3 follows already from the lower bound with the rate $1/\sqrt d$ proved in item (i). Therefore, in the rest of this proof we assume without loss of generality that $s\ge32$.

We take $P_\xi=U$, where $U$ is the Rademacher distribution, that is, the uniform distribution on $\{-1,1\}$. Clearly, $U\in\mathcal G_{a,\tau}$. Let $\delta_1,\dots,\delta_d$ be i.i.d. Bernoulli random variables with probability of success $P(\delta_1=1)=\frac s{2d}$, and let $\epsilon_1,\dots,\epsilon_d$ be i.i.d. Rademacher random variables that are independent of $(\delta_1,\dots,\delta_d)$. Denote by $\mu$ the distribution of $(\alpha\delta_1\epsilon_1,\dots,\alpha\delta_d\epsilon_d)$, where $\alpha=(\tau/2)\log^{1/a}(ed/s)$. Note that $\mu$ is not necessarily supported on $\Theta_s=\{\theta\in\mathbb R^d:\|\theta\|_0\le s\}$, as the number of nonzero components of a vector drawn from $\mu$ can be larger than $s$. Therefore, we consider the restriction of $\mu$ to $\Theta_s$, defined by

$$\bar\mu(A)=\frac{\mu(A\cap\Theta_s)}{\mu(\Theta_s)}\qquad(7.58)$$

for all Borel subsets $A$ of $\mathbb R^d$. Finally, we introduce two mixture probability measures

$$P^\mu=\int P_{\theta,U,1}\,\mu(d\theta)\qquad\text{and}\qquad P^{\bar\mu}=\int P_{\theta,U,1}\,\bar\mu(d\theta).\qquad(7.59)$$

Notice that there exists a probability measure $\tilde P\in\mathcal G_{a,\tau}$ such that

$$P^\mu=P_{0,\tilde P,\sigma_0}.\qquad(7.60)$$
But this inequality immediately follows from the fact that, for $t\ge2$, the probability in (7.62) is smaller than

$$P(\epsilon=1,\delta=1)\,\mathbf 1\big((\tau/2)\log^{1/a}(ed/s)>t\big)\ \le\ \frac s{4d}\,\mathbf 1\big(\tau\log^{1/a}(ed/s)>t\big)\ \le\ e^{-(t/\tau)^a}.\qquad(7.63)$$
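The final bound in (7.63) reduces to elementary arithmetic; a verification sketch in the same notation:

```latex
\[
\tau\log^{1/a}(ed/s) > t
\;\Longrightarrow\; (t/\tau)^a < \log(ed/s)
\;\Longrightarrow\; e^{-(t/\tau)^a} > \frac{s}{ed} \ge \frac{s}{4d},
\]
% using e < 4; if the indicator vanishes, the left-hand side of (7.63) is zero
% and the bound is trivial.
```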

Now, for any estimator $\hat T$ and any $u>0$ we have

$$\begin{aligned}
\sup_{P_\xi\in\mathcal G_{a,\tau}}\ \sup_{\sigma>0}\ \sup_{\|\theta\|_0\le s} P_{\theta,P_\xi,\sigma}\Big(\big|\hat T\sigma^{-2}-1\big|\ge u\Big)
&\ge \max\Big\{P_{0,\tilde P,\sigma_0}\big(|\hat T-\sigma_0^2|\ge\sigma_0^2u\big),\ \int P_{\theta,U,1}\big(|\hat T-1|\ge u\big)\,\bar\mu(d\theta)\Big\}\\
&\ge \max\Big\{P^{\mu}\big(|\hat T-\sigma_0^2|\ge\sigma_0^2u\big),\ P^{\bar\mu}\big(|\hat T-1|\ge\sigma_0^2u\big)\Big\}
\qquad(7.64)
\end{aligned}$$

where the last inequality uses (7.60). Write $\sigma_0^2=1+2\gamma$, where $\gamma=\frac{\tau^2s}{16d}\log^{2/a}(ed/s)$, and choose $u=\gamma/\sigma_0^2\ge\gamma/(1+\tau^2/8)$. Then, the expression in (7.64) is bounded from below by the probability of error in the problem of distinguishing between the two simple hypotheses $P^\mu$ and $P^{\bar\mu}$, for which Theorem 2.2 in Tsybakov (2008) yields

$$\max\Big\{P^\mu\big(|\hat T-\sigma_0^2|\ge\gamma\big),\ P^{\bar\mu}\big(|\hat T-1|\ge\gamma\big)\Big\}\ \ge\ \frac{1-V(P^\mu,P^{\bar\mu})}{2},\qquad(7.65)$$

where $V(P^\mu,P^{\bar\mu})$ is the total variation distance between $P^\mu$ and $P^{\bar\mu}$. The desired lower bound follows from (7.65) and Lemma 7.7.5 for any $s\ge32$.

(iii) Finally, we prove the lower bound with the rate $\tau^2(s/d)^{1-2/a}$ in Theorem 7.4.4. Again, we do not consider the case $s\le32$, since in this case the rate $1/\sqrt d$ is dominating and Theorem 7.4.4 follows from item (i) above. For $s\ge32$, the proof uses the same argument as in item (ii) above, but we choose $\alpha=(\tau/2)(d/s)^{1/a}$. Then the variance of $\alpha\delta\epsilon+\xi$ is equal to

$$\sigma_0^2\ =\ 1+\frac{\tau^2(s/d)^{1-2/a}}{8}.$$
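The value of $\sigma_0^2$ is a one-line computation, using that $\alpha=(\tau/2)(d/s)^{1/a}$, $P(\delta=1)=s/(2d)$, and that $\delta$, $\epsilon$, $\xi$ are independent with $E\xi^2=1$ and $E\epsilon=0$:

```latex
\[
E(\alpha\delta\epsilon+\xi)^2
= \alpha^2\,E(\delta\epsilon)^2 + E\xi^2
= \frac{\tau^2}{4}\Big(\frac{d}{s}\Big)^{2/a}\cdot\frac{s}{2d} + 1
= 1 + \frac{\tau^2(s/d)^{1-2/a}}{8},
\]
% the cross term vanishing since \epsilon is centered and independent of (\delta,\xi).
```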

Furthermore, with this definition of $\sigma_0^2$, there exists $\tilde P\in\mathcal P_{a,\tau}$ such that (7.60) holds. Indeed, analogously to (7.62) we now have, for all $t\ge2$,

$$P\big(\alpha\delta\epsilon+\xi>t\sigma_0\big)\ \le\ P(\epsilon=1,\delta=1)\,\mathbf 1\big((\tau/2)(d/s)^{1/a}>t\big)\ \le\ \frac s{4d}\,\mathbf 1\big(\tau(d/s)^{1/a}>t\big)\ \le\ (t/\tau)^{-a}.\qquad(7.66)$$
To finish the proof, it remains to repeat the argument of (7.64) and (7.65) with $\gamma=\frac{\tau^2(s/d)^{1-2/a}}{16}$.

### Proof of Theorem 7.3.1

We argue similarly to the proof of Theorems 7.4.3 and 7.4.4; in particular, we set $\alpha=(\tau/2)\log^{1/a}(ed/s)$ when proving the bound on the class $\mathcal G_{a,\tau}$, and $\alpha=(\tau/2)(d/s)^{1/a}$ when proving the bound on $\mathcal P_{a,\tau}$. In what follows, we only deal with the class $\mathcal G_{a,\tau}$, since the proof for $\mathcal P_{a,\tau}$ is analogous. Consider the measures $\mu$, $\bar\mu$, $P^\mu$, $P^{\bar\mu}$ and $\tilde P$ defined in Section 7.6. Similarly to (7.64), for any estimator $\hat T$ and any $u>0$ we have

$$\begin{aligned}
\sup_{P_\xi\in\mathcal G_{a,\tau}}\ \sup_{\sigma>0}\ \sup_{\|\theta\|_0\le s} P_{\theta,P_\xi,\sigma}\big(|\hat T-\|\theta\|_2|\ge\sigma u\big)
&\ge \max\Big\{P_{0,\tilde P,\sigma_0}\big(|\hat T|\ge\sigma_0u\big),\ \int P_{\theta,U,1}\big(|\hat T-\|\theta\|_2|\ge u\big)\,\bar\mu(d\theta)\Big\}\\
&\ge \max\Big\{P^\mu\big(|\hat T|\ge\sigma_0u\big),\ P^{\bar\mu}\big(|\hat T-\|\theta\|_2|\ge\sigma_0u\big)\Big\}\\
&\ge \max\Big\{P^\mu\big(|\hat T|\ge\sigma_0u\big),\ P^{\bar\mu}\big(|\hat T|<\sigma_0u,\ \|\theta\|_2\ge2\sigma_0u\big)\Big\}\\
&\ge \min_B\max\Big\{P^\mu(B),\ P^{\bar\mu}(B^c)-\bar\mu\big(\|\theta\|_2<2\sigma_0u\big)\Big\}\\
&\ge \min_B\frac{P^\mu(B)+P^{\bar\mu}(B^c)}{2}-\frac{\bar\mu\big(\|\theta\|_2<2\sigma_0u\big)}{2}
\qquad(7.67)
\end{aligned}$$

where $\sigma_0$ is defined in (7.61), $U$ denotes the Rademacher law, and $\min_B$ is the minimum over all Borel sets. The third line in the last display is due to (7.60) and to the inequality $\sigma_0\ge1$. Since $\min_B\big\{P^\mu(B)+P^{\bar\mu}(B^c)\big\}=1-V(P^\mu,P^{\bar\mu})$, we get
$$\sup_{P_\xi\in\mathcal G_{a,\tau}}\ \sup_{\sigma>0}\ \sup_{\|\theta\|_0\le s} P_{\theta,P_\xi,\sigma}\big(|\hat T-\|\theta\|_2|/\sigma\ge u\big)\ \ge\ \frac{1-V(P^\mu,P^{\bar\mu})-\bar\mu\big(\|\theta\|_2<2\sigma_0u\big)}{2}.\qquad(7.68)$$

Consider first the case $s\ge32$. Set $u=\frac{\alpha\sqrt s}{4\sigma_0}$. Then (7.94) and (7.97) imply that

$$V(P^\mu,P^{\bar\mu})\ \le\ e^{-3s/16},\qquad \bar\mu\big(\|\theta\|_2<2\sigma_0u\big)\ \le\ 2e^{-s/16},$$

which, together with (7.68) and the fact that $s\ge32$, yields the result.
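To see why $s\ge32$ suffices here, one can plug these two bounds into (7.68), noting that $2\sigma_0u=\alpha\sqrt s/2$; a verification sketch:

```latex
\[
\frac{1 - V(P^{\mu},P^{\bar\mu}) - \bar\mu(\|\theta\|_2<2\sigma_0 u)}{2}
\;\ge\; \frac{1 - e^{-3s/16} - 2e^{-s/16}}{2}
\;\ge\; \frac{1 - e^{-6} - 2e^{-2}}{2} \;>\; \frac{1}{4},
\]
% since s \ge 32 gives 3s/16 \ge 6 and s/16 \ge 2.
```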

Let now $s<32$. Then we set $u=\frac{\alpha\sqrt s}{8\sqrt2\,\sigma_0}$. It follows from (7.95) and (7.98) that

$$1-V(P^\mu,P^{\bar\mu})-\bar\mu\big(\|\theta\|_2<2\sigma_0u\big)\ \ge\ P\Big(B\big(d,\tfrac s{2d}\big)=1\Big)\ =\ \frac s2\Big(1-\frac s{2d}\Big)^{d-1}.$$

It is not hard to check that the minimum of the last expression over all integers $s,d$ such that $1\le s<32$, $s\le d$, is bounded from below by a positive number independent of $d$. We conclude by combining these remarks with (7.68).

### Proof of part (ii) of Proposition 7.3.2

The lower bound corresponding to the sparse regime (i.e., $s\le\sqrt d$) is already proven in Collier et al. (2017) for known variance $\sigma$. Hence, we only focus on the dense regime, where we may assume without loss of generality that $s\ge\sqrt d$ for $d$ large enough. The proof is inspired by ideas from Cai and Jin (2010), even if their original proof does not apply in our setting. In what follows, we will use the Fourier transform defined for any integrable function $f$ by

$$\hat f(t)=\int_{\mathbb R}e^{-itx}f(x)\,dx.\qquad(7.69)$$

In the following, $C$ is an absolute constant whose value may change from line to line. We denote by $\phi_{\sigma^2}$ the density of $\mathcal N(0,\sigma^2)$. Moreover, we set $\epsilon=\frac s{2d}\le1/2$, $\tau=\sqrt{\alpha\log(es^2/d)}$ with $\alpha$ large enough, and $\varphi$, $c_0$ are defined in Lemma 7.7.9.

First, we build some probability distributions on $\Theta_s$. If $\delta_1,\dots,\delta_d\overset{iid}{\sim}\mathcal B(\epsilon)$, we define $\mu_i$, $i=1,2$, respectively as the distribution of $(\delta_1X_1^{(i)},\dots,\delta_dX_d^{(i)})$, where

$$X_1^{(1)},\dots,X_d^{(1)}\overset{iid}{\sim}(\phi_\varphi\ast g_1)\,d\lambda,\qquad X_1^{(2)},\dots,X_d^{(2)}\overset{iid}{\sim}g_2\,d\lambda,\qquad(7.70)$$

where $g_1,g_2$ are the density functions given by Lemma 7.7.9 and $\lambda$ is the Lebesgue measure. Then, we consider the probability distributions $\mathbf P_1:=P_{\mu_1,\mathcal N(0,1),1}$ and $\mathbf P_2:=P_{\mu_2,\mathcal N(0,1),\sqrt{1+\varphi}}$, whose coordinate density functions are respectively

$$f_1=(1-\epsilon)\phi_1+\epsilon\,\phi_{1+\varphi}\ast g_1,\qquad f_2=(1-\epsilon)\phi_{1+\varphi}+\epsilon\,\phi_{1+\varphi}\ast g_2.\qquad(7.71)$$

Then, $\bar\mu_1$ and $\bar\mu_2$ defined by

$$\bar\mu_i(A)=\frac{\mu_i(A\cap\Theta_s)}{\mu_i(\Theta_s)}\qquad(7.72)$$

are supported on $\Theta_s$. Now, using Theorem 2.15 in Tsybakov (2008) and the fact that $\ell(t)\ge\ell(a)\mathbf 1_{t>a}$ for any $a>0$, we get
$$\inf_{\hat T}\ \sup_{\theta\in\Theta_s,\,\sigma>0} E_{\theta,\mathcal N(0,1),\sigma}\,\ell\Big(\big(\sigma\,\psi^*_{\mathcal N(0,1)}(s,d)\big)^{-1}\big|\hat T-\|\theta\|_2\big|\Big)\ \ge\ \frac{\ell(v)}{2}\,(1-V'),\qquad(7.73)$$

where, for some $v,w\in\mathbb R$,

$$V'=V(\mathbf P_1,\mathbf P_2)+\bar\mu_1\big(\|\theta\|_2<w+2\psi^*_{\mathcal N(0,1)}v\big)+\bar\mu_2\big(\|\theta\|_2>w\big).\qquad(7.74)$$
Decomposing in particular the total variation distance, we have

$$V'\ \le\ 2V(\bar\mu_1,\mu_1)+2V(\bar\mu_2,\mu_2)+\sqrt{\chi^2(\mathbf P_1,\mathbf P_2)}\qquad(7.75)$$
$$\qquad+\ \mu_1\big[\|\theta\|_2<w+2\psi^*_{\mathcal N(0,1)}v\big]+\mu_2\big[\|\theta\|_2>w\big].\qquad(7.76)$$

The first two terms on the right-hand side are upper bounded, using Lemma 7.7.5, by $4e^{-3s/16}$. Then, we choose

$$v=\frac{\sqrt{c+2u}-\sqrt c}{2\psi^*_{\mathcal N(0,1)}},\qquad w=\sqrt c,\qquad(7.77)$$

where

$$c=m_2+\frac{m_1-m_2}{4},\qquad u=\frac{m_1-m_2}{4},\qquad m_i=\mu_i\big(\|\theta\|_2^2\big).\qquad(7.78)$$
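The point of the choices in (7.77)-(7.78) is that the two thresholds sit strictly between the means $m_2<m_1$ (here $\psi^*_{\mathcal N(0,1)}$ denotes the normalizing factor from (7.73)); indeed,

```latex
\[
w^2 = c = m_2 + u,
\qquad
\big(w + 2\psi^*_{\mathcal N(0,1)}v\big)^2 = \big(\sqrt{c+2u}\big)^2 = c + 2u = m_1 - u,
\]
% so each prior \mu_i violates its threshold only when \|\theta\|_2^2 deviates
% from its mean m_i by at least u.
```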
Moreover, since by definition

$$m_1=\frac s2\int x^2\,(g_1\ast\phi_\varphi)(x)\,dx,\qquad m_2=\frac s2\int x^2g_2(x)\,dx,\qquad(7.79)$$

Lemma 7.7.9 implies that

$$m_1+m_2\ \le\ \frac{Cs}{\tau^2},\qquad m_1-m_2\ =\ \frac{c_0s}{2\tau^2},\qquad(7.80)$$

so that

pc+ 2u p

c= 2u

pc+ 2u+p

c C m1 m2

pm1+m2

C ps

⌧ , (7.81)

and v, and thus `(v), is lower bounded by a positive absolute constant. Finally, by Markov’s and von Bahr-Esseen’s inequalities von Bahr and Esseen (1965), we have

$$\mu_1\big(\|\theta\|_2^2\le c+2u\big)\ =\ \mu_1\big(\|\theta\|_2^2-m_1\le-u\big)\ \le\ \frac{\mu_1\big|\|\theta\|_2^2-m_1\big|^{5/4}}{u^{5/4}}\qquad(7.82)$$
$$\le\ \frac{s}{2u^{5/4}}\int\Big|x^2-\epsilon\int y^2\,(g_1\ast\phi_\varphi)(y)\,dy\Big|^{5/4}(g_1\ast\phi_\varphi)(x)\,dx\qquad(7.83)$$
$$\le\ \frac{Cs}{u^{5/4}}\bigg[\int|x|^{5/2}(g_1\ast\phi_\varphi)(x)\,dx+\Big(\epsilon\int x^2\,(g_1\ast\phi_\varphi)(x)\,dx\Big)^{5/4}\bigg],\qquad(7.84)$$

and using Lemma 7.7.9 again, this is smaller than $Cs/(\tau^{3/4}u^{5/4})\le C\tau^{7/4}s^{-1/4}$.
Applying similar arguments to the second probability, and because $s\ge\sqrt d$ with $d$ large enough, we get that

$$\mu_1\big(\|\theta\|_2<w+2\psi^*_{\mathcal N(0,1)}v\big)+\mu_2\big(\|\theta\|_2>w\big)\ \le\ \frac{C}{s^{1/5}}.\qquad(7.85)$$

We conclude by applying Lemma 7.7.10 to bound $\chi^2(\mathbf P_1,\mathbf P_2)=\big(1+\chi^2(f_1,f_2)\big)^d-1$, so that for $d$ large enough, $V'\le1/2$.

### Proof of part (ii) of Proposition 7.3.3 and part (ii) of Proposition 7.3.4

We argue similarly to the proof of Theorems 7.4.3 and 7.4.4; in particular, we set $\alpha=(\tau/2)\log^{1/a}(ed/s)$ when proving the bound on the class $\mathcal G_{a,\tau}$, and $\alpha=(\tau/2)(d/s)^{1/a}$ when proving the bound on $\mathcal P_{a,\tau}$. In what follows, we only deal with the class $\mathcal G_{a,\tau}$, since the proof for $\mathcal P_{a,\tau}$ is analogous. Without loss of generality we assume that $\sigma=1$.

To prove the lower bound with the rate $\phi_{\exp}(s,d)$, we only need to prove it for $s$ such that $(\phi_{\exp}(s,d))^2\le c_0\sqrt d/\log^{2/a}(ed)$ with any small absolute constant $c_0>0$, since the rate is increasing with $s$.

Consider the measures $\mu$, $\bar\mu$, $P^\mu$, $P^{\bar\mu}$ defined in Section 7.6 with $\sigma_0=1$. Let $\xi_1$ be distributed with the c.d.f. $F_0$ defined in item (i) of the proof of Theorems 7.4.3 and 7.4.4. Using the same notation as in that proof, we define $\tilde P$ as the distribution of $\tilde\xi_1=\sigma_1\xi_1+\alpha\delta_1\epsilon_1$ with $\sigma_1^2=(1+\alpha^2s/(2d))^{-1}$, where now $\delta_1$ is a Bernoulli random variable with $P(\delta_1=1)=\frac s{2d}(1+\alpha^2s/(2d))^{-1}$. By construction, $E\tilde\xi_1=0$ and $E\tilde\xi_1^2=1$. Since the support of $F_0$ is in $[-3/2,3/2]$, one can check as in item (ii) of the proof of Theorems 7.4.3 and 7.4.4 that $\tilde P\in\mathcal G_{a,\tau}$. Next, analogously to (7.67)-(7.68), we obtain that, for any $u>0$,

$$\sup_{P_\xi\in\mathcal G_{a,\tau}}\ \sup_{\|\theta\|_0\le s} P_{\theta,P_\xi,1}\big(|\hat T-\|\theta\|_2|\ge u\big)\ \ge\ \frac{1-V\big(P^{\bar\mu},P_{0,\tilde P,1}\big)-\bar\mu\big(\|\theta\|_2<2u\big)}{2}.$$
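The normalization of $\tilde\xi_1$ claimed above is a direct computation, using that $\delta_1$, $\epsilon_1$, $\xi_1$ are independent with $E\xi_1^2=1$ and $E\epsilon_1=0$:

```latex
\[
E\tilde\xi_1^2
= \sigma_1^2\,E\xi_1^2 + \alpha^2\,P(\delta_1=1)
= \Big(1+\frac{\alpha^2 s}{2d}\Big)^{-1}\Big(1+\frac{\alpha^2 s}{2d}\Big)
= 1,
\]
% and E\tilde\xi_1 = \sigma_1 E\xi_1 + \alpha\,E\delta_1\,E\epsilon_1 = 0.
```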

Let $\mathbf P_0$ and $\mathbf P_1$ denote the distributions of $(\xi_1,\dots,\xi_d)$ and of $(\sigma_1\xi_1,\dots,\sigma_1\xi_d)$, respectively. Acting as in item (i) of the proof of Theorems 7.4.3 and 7.4.4, and using the bound

$$|1-\sigma_1|\ \le\ \frac{\alpha^2s}{d}\ =\ \frac{\tau^2}{4}\,\frac sd\,\log^{2/a}(ed/s)\ \le\ \frac{Cc_0}{\sqrt d},$$

we conclude by repeating the argument after (7.68) in the proof of Theorem 7.3.1 and choosing $c_0>0$ small enough to guarantee that the right-hand side of the last display is positive.

### Proof of part (ii) of Proposition 7.4.2

The lower bound with the rate $1/\sqrt d$ follows from the same argument as in item (i) of the proof of Theorems 7.4.3 and 7.4.4 if we replace $F_0$ there by the standard Gaussian distribution. The lower bound with the rate $\frac{s}{d(1+\log_+(s^2/d))}$ follows from Lemma 7.7.8 and the lower bound for estimation of $\|\theta\|_2$ in Proposition 7.3.2.

### Proof of Proposition 7.4.3

Assume that $\theta=0$, $\sigma=1$, and set

$$\xi_i=\sqrt3\,\epsilon_iu_i,$$

where the $\epsilon_i$'s and the $u_i$'s are independent, with Rademacher and uniform-on-$[0,1]$ distributions respectively. Then note that

$$E_{0,P_\xi,1}\big(\hat\sigma_*^2-1\big)^2\ \ge\ \big(E_{0,P_\xi,1}(\hat\sigma_*^2)-1\big)^2\ =\ \cdots$$

This and (7.86) prove the proposition.
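That this noise distribution is admissible (zero mean, unit variance) is immediate from the independence of $\epsilon_i$ and $u_i$:

```latex
\[
E\xi_i = \sqrt3\,E\epsilon_i\,Eu_i = 0,
\qquad
E\xi_i^2 = 3\,E\epsilon_i^2\,Eu_i^2 = 3\cdot 1\cdot\frac13 = 1.
\]
```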

### 7.7 Appendix: Technical lemmas

### Lemmas for the upper bounds

Lemma 7.7.1. Let $z_1,\dots,z_d$ ... implying (7.87). Finally, (7.88) follows by integrating (7.89).

Lemma 7.7.2. Let $z_1,\dots,z_d$ be i.i.d. ...

Proof. Using the definition of $\mathcal P_{a,\tau}$, we get that, for any $t\ge2$, $P\big(z_{(d-j+1)}\ge t\big)\le\big(\tfrac{ed}{j}\big)^{j}\cdots$ This proves (7.90). The proof of (7.91) is analogous to that of (7.88).


The proof is simple and we omit it.

### Lemmas for the lower bounds

For two probability measures $P_1$ and $P_2$ on a measurable space $(\Omega,\mathcal U)$, we denote by $V(P_1,P_2)$ the total variation distance between $P_1$ and $P_2$:

$$V(P_1,P_2)=\sup_{B\in\mathcal U}\big|P_1(B)-P_2(B)\big|.$$

Lemma 7.7.4 (Deviations of the binomial distribution). Let $B(d,p)$ denote a binomial random variable with parameters $d$ and $p\in(0,1)$. Then, for any $\lambda>0$, the deviation probabilities of $B(d,p)$ around its mean $dp$ satisfy (7.92) and (7.93).

Inequality (7.92) is a combination of formulas (3) and (10) on pages 440-441 in Shorack and Wellner (2009). Inequality (7.93) is formula (6) on page 440 in Shorack and Wellner (2009).

Lemma 7.7.5. Let $P^\mu$ and $P^{\bar\mu}$ be the probability measures defined in (7.59). The total variation distance between these two measures satisfies

$$V\big(P^\mu,P^{\bar\mu}\big)\ \le\ P\Big(B\big(d,\tfrac s{2d}\big)>s\Big).\qquad(7.96)$$

Combining this inequality with (7.92), we obtain (7.94). To prove (7.95), we use again (7.96) and bound the binomial probability appearing there.

Lemma 7.7.6. Let $\bar\mu$ be defined in (7.58) with some $\alpha>0$. Then

where the last inequality follows from (7.93). Next, inspection of the proof of Lemma 7.7.5 yields that $\bar\mu(B)\le\mu(B)+e^{-3s/16}$ for any Borel set $B$. Taking here $B=\{\|\theta\|_2<\alpha\sqrt s/2\}$ and using (7.99) proves (7.97). To prove (7.98), it suffices to note that

$$\mu\Big(\|\theta\|_2<\frac{\alpha\sqrt s}{4\sqrt2}\Big)\ =\ P\Big(B\big(d,\tfrac s{2d}\big)<\tfrac s{32}\Big).$$
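The identity in the last display is just a change of variable: under $\mu$, $\|\theta\|_2=\alpha\sqrt N$, where $N\sim B\big(d,\frac s{2d}\big)$ counts the nonzero coordinates, so

```latex
\[
\Big\{\|\theta\|_2 < \frac{\alpha\sqrt s}{4\sqrt2}\Big\}
= \Big\{\alpha\sqrt N < \frac{\alpha\sqrt s}{4\sqrt2}\Big\}
= \Big\{N < \frac{s}{32}\Big\}.
\]
```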

Lemma 7.7.7. There exists a probability density $f_0:\mathbb R\to[0,\infty)$ with the following properties: $f_0$ is continuously differentiable, symmetric about 0, supported on $[-3/2,3/2]$, with variance 1 and finite Fisher information $I_{f_0}=\int(f_0'(x))^2(f_0(x))^{-1}\,dx$.

Proof. Let $K:\mathbb R\to[0,\infty)$ be any probability density which is continuously differentiable, symmetric about 0, supported on $[-1,1]$, and has finite Fisher information $I_K$; for example, the density $K(x)=\cos^2(\pi x/2)\mathbf 1_{|x|\le1}$. Define $f_0(x)=\big[K_h(x+(1-\varepsilon))+K_h(x-(1-\varepsilon))\big]/2$, where $h>0$ and $\varepsilon\in(0,1)$ are constants to be chosen, and $K_h(u)=K(u/h)/h$. Clearly, we have $I_{f_0}<\infty$ since $I_K<\infty$. It is straightforward to check that the variance of $f_0$ satisfies $\int x^2f_0(x)\,dx=(1-\varepsilon)^2+h^2\sigma_K^2$, where $\sigma_K^2=\int x^2K(x)\,dx$.


Lemma 7.7.9. If $c_0$ is small enough, then there exist two density functions $g_1,g_2$ such that:

1. $\max\big(\int\cdots$

Proof. In the following, $C$ denotes an absolute constant whose value may change from line to line. We define

$$g_1(x)=\cdots$$

where

$$c_{1,n}=2n^2+5n+6,\qquad c_{2,n}=4n^2-8n-8,\qquad c_{3,n}=2n^2+3n+3.\qquad(7.105)$$

Direct computations show that $g_1$ is a density function, and the first part of this proof is dedicated to proving that $g_2$ is a density too, if $c_0$ is small enough.

First note that $j$ is bounded on $[-2\tau,2\tau]$, so that $\hat h$ is well defined. Then, we can ... Using integration by parts, we have

$$\int_{-2\tau}^{2\tau}\cdots$$

The choices of the $c_{i,n}$'s make the first three terms vanish, hence

$$|g(x)|\ \le\ Cc_0\cdots$$

for $c_0$ small enough. Finally, combining (7.114), (7.117) and (7.119) yields that $g_2$ is positive on $\mathbb R$. Furthermore, ... which yields the first desired property. From the same computations, we get

$$\int\cdots$$

which is exactly the second property. Furthermore, we have, in particular from (7.114) and the fact that $g\ge C\tau$ on $[-\tau^{-1},\tau^{-1}]$, that