
In the document The DART-Europe E-theses Portal (Page 188-0)

where for brevity $\mathbf{E} = \mathbf{E}_{\theta,P,\sigma}$, we find

Finally, using again (7.28) we get, for $k = 1,2$,

where the last inequality follows from the same argument as in the proof of Theorem 7.2.1. These remarks together with (7.56) imply

$$\mathbf{E}\,\bigl|\hat\sigma^2 - \sigma^2\bigr| \le C \cdots$$

We conclude the proof by bounding $\sum_{j=1}^{s}\theta_j^2$ in the same way as at the end of the proof of Theorem 7.2.1.

7.6 Appendix: Proofs of the lower bounds

Proof of Theorems 7.4.3 and 7.4.4 and part (ii) of Proposition 7.4.2

Since $\ell(t) \ge \ell(A)\,\mathbf{1}_{t>A}$ for any $A > 0$, it is enough to prove the theorems for the indicator loss $\ell(t) = \mathbf{1}_{t>A}$. This remark is valid for all the proofs of this section and will not be repeated further.
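This standard reduction can be recorded in one line; here $r > 0$ denotes a generic normalization and $F(\theta)$ the target quantity (placeholder notation, not from the chapter). Since $\ell$ is nondecreasing,

```latex
% Reduction to the indicator loss: for any estimator \hat T and any A > 0,
\mathbf{E}\,\ell\Bigl(\frac{|\hat T - F(\theta)|}{r}\Bigr)
\;\ge\; \ell(A)\,\mathbf{E}\,\mathbf{1}_{|\hat T - F(\theta)| > A r}
\;=\; \ell(A)\,\mathbf{P}\bigl(|\hat T - F(\theta)| > A r\bigr),
```

so any lower bound proved for the indicator loss transfers to $\ell$ up to the constant factor $\ell(A)$.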

(i) We first prove the lower bounds with the rate $1/\sqrt{d}$ in Theorems 7.4.3 and 7.4.4.

Let $f_0 : \mathbb{R} \to [0,\infty)$ be a probability density with the following properties: $f_0$ is continuously differentiable, symmetric about 0, supported on $[-3/2,3/2]$, with variance 1 and finite Fisher information $I_{f_0} = \int (f_0'(x))^2 (f_0(x))^{-1}\,dx$. The existence of such an $f_0$ is shown in Lemma 7.7.7. Denote by $F_0$ the probability distribution corresponding to $f_0$. Since $F_0$ is zero-mean, with variance 1 and supported on $[-3/2,3/2]$, it belongs to $\mathcal{G}_{a,\tau}$ for any $\tau > 0$, $a > 0$, and to $\mathcal{P}_{a,\tau}$ for any $\tau > 0$, $a \ge 2$. Define $P_0 = P_{0,F_0,1}$, $P_1 = P_{0,F_0,\sigma_1}$,

where $I(t)$ is the Fisher information corresponding to the density $f_0(x/t)/t$, that is, $I(t) = t^{-2} I_{f_0}$. It follows that $h^2 \le \bar c\, c_0^2/d$ where $\bar c > 0$ is a constant. This and (7.57) imply that for $c_0$ small enough we have $H(P_1,P_0) \le 1/2$. Finally, choosing such a small $c_0$ and using Theorem 2.2(ii) in Tsybakov (2008) we obtain

$$\inf_{\hat T}\ \cdots$$

(ii) We now prove the lower bound with the rate $\frac{s}{d}\log^{2/a}(ed/s)$ in Theorem 7.4.3. It is enough to conduct the proof for $s \ge s_0$ where $s_0 > 0$ is an arbitrary absolute constant. Indeed, for $s \le s_0$ we have $\frac{s}{d}\log^{2/a}(ed/s) \le C/\sqrt{d}$ where $C > 0$ is an absolute constant, and thus Theorem 7.4.3 follows already from the lower bound with the rate $1/\sqrt{d}$ proved in item (i). Therefore, in the rest of this proof we assume without loss of generality that $s \ge 32$.
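Item (i) above rests on the standard two-point scheme; for reference, these are the classical facts used (cf. Tsybakov, 2008, Theorem 2.2, and the tensorization of the Hellinger distance), stated in generic notation:

```latex
% Two-point lower bound: for any test between hypotheses P_0, P_1
% with disjoint acceptance events A_0, A_1,
\inf_{\hat T}\ \max_{j=0,1}\ \mathbf{P}_j(A_j)\ \ge\ \frac{1 - V(\mathbf{P}_0,\mathbf{P}_1)}{2},
\qquad
V(\mathbf{P}_0,\mathbf{P}_1)\ \le\ H(\mathbf{P}_0,\mathbf{P}_1)\sqrt{1 - H^2(\mathbf{P}_0,\mathbf{P}_1)/4}.
% Tensorization for product measures with per-coordinate Hellinger distance h:
H^2\bigl(P^{\otimes d}, Q^{\otimes d}\bigr) \;=\; 2 - 2\bigl(1 - h^2(P,Q)/2\bigr)^{d}
\;\le\; d\,h^2(P,Q).
```

In particular, a per-coordinate bound $h^2 \le \bar c\, c_0^2/d$ keeps $H(P_1,P_0) \le 1/2$ once $c_0$ is chosen small enough, which is exactly how the constant $c_0$ is used in item (i).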

We take $P = U$ where $U$ is the Rademacher distribution, that is, the uniform distribution on $\{-1,1\}$. Clearly, $U \in \mathcal{G}_{a,\tau}$. Let $\delta_1,\dots,\delta_d$ be i.i.d. Bernoulli random variables with probability of success $\mathbf{P}(\delta_1 = 1) = \frac{s}{2d}$, and let $\epsilon_1,\dots,\epsilon_d$ be i.i.d. Rademacher random variables that are independent of $(\delta_1,\dots,\delta_d)$. Denote by $\mu$ the distribution of $(\alpha\delta_1\epsilon_1,\dots,\alpha\delta_d\epsilon_d)$ where $\alpha = (\tau/2)\log^{1/a}(ed/s)$. Note that $\mu$ is not necessarily supported on $\Theta_s = \{\theta \in \mathbb{R}^d : \|\theta\|_0 \le s\}$, as the number of nonzero components of a vector drawn from $\mu$ can be larger than $s$. Therefore, we consider the restriction of $\mu$ to $\Theta_s$ defined by

$$\bar\mu(A) = \frac{\mu(A\cap\Theta_s)}{\mu(\Theta_s)} \qquad (7.58)$$

for all Borel subsets $A$ of $\mathbb{R}^d$. Finally, we introduce two mixture probability measures

$$\mathbf{P}_\mu = \int P_{\theta,U,1}\,\mu(d\theta) \quad\text{and}\quad \mathbf{P}_{\bar\mu} = \int P_{\theta,U,1}\,\bar\mu(d\theta). \qquad (7.59)$$

Notice that there exists a probability measure $\tilde P \in \mathcal{G}_{a,\tau}$ such that

$$\mathbf{P}_\mu = P_{0,\tilde P,\sigma_0}. \qquad (7.60)$$

But this inequality immediately follows from the fact that for $t \ge 2$ the probability in (7.62) is smaller than

$$\mathbf{P}(\epsilon_1 = 1,\ \delta_1 = 1)\,\mathbf{1}_{(\tau/2)\log^{1/a}(ed/s) > t} \;\le\; \frac{s}{4d}\,\mathbf{1}_{\tau\log^{1/a}(ed/s) > t} \;\le\; e^{-(t/\tau)^a}. \qquad (7.63)$$
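Inequality (7.63) can be sanity-checked as follows (reading the garbled display as $\mathbf{P}(\epsilon_1{=}1,\delta_1{=}1)\,\mathbf{1}_{\cdot} \le \frac{s}{4d}\,\mathbf{1}_{\cdot} \le e^{-(t/\tau)^a}$): on the event where the second indicator is nonzero, the exponential bound indeed dominates $s/(4d)$.

```latex
% If \tau\log^{1/a}(ed/s) > t, then (t/\tau)^a < \log(ed/s), hence
e^{-(t/\tau)^a} \;>\; e^{-\log(ed/s)} \;=\; \frac{s}{ed} \;\ge\; \frac{s}{4d}
\qquad (\text{since } e < 4),
% while \mathbf{P}(\epsilon_1 = 1,\ \delta_1 = 1) = \frac{1}{2}\cdot\frac{s}{2d} = \frac{s}{4d}.
```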

Now, for any estimator $\hat T$ and any $u > 0$ we have

$$\begin{aligned}
\sup_{P\in\mathcal{G}_{a,\tau}}\ \sup_{\sigma>0}\ \sup_{\|\theta\|_0\le s} P_{\theta,P,\sigma}\Bigl(\Bigl|\frac{\hat T}{\sigma^2} - 1\Bigr| \ge u\Bigr)
&\ge \max\Bigl\{ P_{0,\tilde P,\sigma_0}\bigl(|\hat T - \sigma_0^2| \ge \sigma_0^2 u\bigr),\ \int P_{\theta,U,1}\bigl(|\hat T - 1| \ge u\bigr)\,\bar\mu(d\theta) \Bigr\} \\
&\ge \max\Bigl\{ \mathbf{P}_\mu\bigl(|\hat T - \sigma_0^2| \ge \sigma_0^2 u\bigr),\ \mathbf{P}_{\bar\mu}\bigl(|\hat T - 1| \ge \sigma_0^2 u\bigr) \Bigr\}
\end{aligned} \qquad (7.64)$$

where the last inequality uses (7.60). Write $\sigma_0^2 = 1 + 2\Delta$ where $\Delta = \frac{\tau^2 s}{16 d}\log^{2/a}(ed/s)$, and choose $u = \Delta/\sigma_0^2 \ge \Delta/(1+\tau^2/8)$. Then, the expression in (7.64) is bounded from below by the probability of error in the problem of distinguishing between the two simple hypotheses $\mathbf{P}_\mu$ and $\mathbf{P}_{\bar\mu}$, for which Theorem 2.2 in Tsybakov (2008) yields

$$\max\Bigl\{ \mathbf{P}_\mu\bigl(|\hat T - \sigma_0^2| \ge \Delta\bigr),\ \mathbf{P}_{\bar\mu}\bigl(|\hat T - 1| \ge \Delta\bigr) \Bigr\} \;\ge\; \frac{1 - V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu})}{2} \qquad (7.65)$$

where $V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu})$ is the total variation distance between $\mathbf{P}_\mu$ and $\mathbf{P}_{\bar\mu}$. The desired lower bound follows from (7.65) and Lemma 7.7.5 for any $s \ge 32$.
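The relation between $\sigma_0^2$ and the separation $2\Delta$ used in this item is a direct variance computation, with $\alpha = (\tau/2)\log^{1/a}(ed/s)$ as in the construction above:

```latex
% Variance of one coordinate \alpha\delta\epsilon + \xi, with \xi of variance 1
% independent of (\delta,\epsilon):
\sigma_0^2 \;=\; 1 + \alpha^2\,\mathbf{P}(\delta = 1)
\;=\; 1 + \frac{\tau^2}{4}\log^{2/a}(ed/s)\cdot\frac{s}{2d}
\;=\; 1 + \frac{\tau^2 s}{8d}\log^{2/a}(ed/s)
\;=\; 1 + 2\Delta,
% so 2\Delta = \sigma_0^2 - 1 is exactly the gap between the two hypotheses.
```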

(iii) Finally, we prove the lower bound with the rate $\tau^2(s/d)^{1-2/a}$ in Theorem 7.4.4. Again, we do not consider the case $s \le 32$ since in this case the rate $1/\sqrt{d}$ is dominating and Theorem 7.4.4 follows from item (i) above. For $s \ge 32$, the proof uses the same argument as in item (ii) above, but we choose $\alpha = (\tau/2)(d/s)^{1/a}$. Then the variance of $\alpha\delta\epsilon + \xi$ is equal to

$$\sigma_0^2 = 1 + \frac{\tau^2 (s/d)^{1-2/a}}{8}.$$

Furthermore, with this definition of $\sigma_0^2$ there exists $\tilde P \in \mathcal{P}_{a,\tau}$ such that (7.60) holds. Indeed, analogously to (7.62) we now have, for all $t \ge 2$,

$$\mathbf{P}\bigl(\alpha\delta\epsilon + \xi > t\sigma_0\bigr) \;\le\; \mathbf{P}(\epsilon = 1,\ \delta = 1)\,\mathbf{1}_{(\tau/2)(d/s)^{1/a} > t} \;\le\; \frac{s}{4d}\,\mathbf{1}_{\tau(d/s)^{1/a} > t} \;\le\; (t/\tau)^{-a}. \qquad (7.66)$$

To finish the proof, it remains to repeat the argument of (7.64) and (7.65) with

$$\Delta = \frac{\tau^2(s/d)^{1-2/a}}{16}.$$

Proof of Theorem 7.3.1

We argue similarly to the proof of Theorems 7.4.3 and 7.4.4; in particular, we set $\alpha = (\tau/2)\log^{1/a}(ed/s)$ when proving the bound on the class $\mathcal{G}_{a,\tau}$, and $\alpha = (\tau/2)(d/s)^{1/a}$ when proving the bound on $\mathcal{P}_{a,\tau}$. In what follows, we only deal with the class $\mathcal{G}_{a,\tau}$ since the proof for $\mathcal{P}_{a,\tau}$ is analogous. Consider the measures $\mu$, $\bar\mu$, $\mathbf{P}_\mu$, $\mathbf{P}_{\bar\mu}$ and $\tilde P$ defined in

Section 7.6. Similarly to (7.64), for any estimator $\hat T$ and any $u > 0$ we have

$$\begin{aligned}
\sup_{P\in\mathcal{G}_{a,\tau}}\ \sup_{\sigma>0}\ \sup_{\|\theta\|_0\le s} P_{\theta,P,\sigma}\bigl(|\hat T - \|\theta\|_2| \ge \sigma u\bigr)
&\ge \max\Bigl\{ P_{0,\tilde P,\sigma_0}\bigl(|\hat T| \ge \sigma_0 u\bigr),\ \int P_{\theta,U,1}\bigl(|\hat T - \|\theta\|_2| \ge u\bigr)\,\bar\mu(d\theta) \Bigr\} \\
&\ge \max\Bigl\{ \mathbf{P}_\mu\bigl(|\hat T| \ge \sigma_0 u\bigr),\ \mathbf{P}_{\bar\mu}\bigl(|\hat T - \|\theta\|_2| \ge \sigma_0 u\bigr) \Bigr\} \\
&\ge \max\Bigl\{ \mathbf{P}_\mu\bigl(|\hat T| \ge \sigma_0 u\bigr),\ \mathbf{P}_{\bar\mu}\bigl(|\hat T| < \sigma_0 u,\ \|\theta\|_2 \ge 2\sigma_0 u\bigr) \Bigr\} \\
&\ge \min_B \max\bigl\{ \mathbf{P}_\mu(B),\ \mathbf{P}_{\bar\mu}(B^c) - \bar\mu(\|\theta\|_2 < 2\sigma_0 u) \bigr\} \\
&\ge \min_B \frac{\mathbf{P}_\mu(B) + \mathbf{P}_{\bar\mu}(B^c) - \bar\mu(\|\theta\|_2 < 2\sigma_0 u)}{2}
\end{aligned} \qquad (7.67)$$

where $\sigma_0$ is defined in (7.61), $U$ denotes the Rademacher law, and $\min_B$ is the minimum over all Borel sets. The third line in the last display is due to (7.60) and to the inequality $\sigma_0 \ge 1$. Since $\min_B\bigl\{\mathbf{P}_\mu(B) + \mathbf{P}_{\bar\mu}(B^c)\bigr\} = 1 - V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu})$, we get

$$\sup_{P\in\mathcal{G}_{a,\tau}}\ \sup_{\sigma>0}\ \sup_{\|\theta\|_0\le s} P_{\theta,P,\sigma}\bigl(|\hat T - \|\theta\|_2|/\sigma \ge u\bigr) \;\ge\; \frac{1 - V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu}) - \bar\mu(\|\theta\|_2 < 2\sigma_0 u)}{2}. \qquad (7.68)$$

Consider first the case $s \ge 32$. Set $u = \frac{\alpha\sqrt{s}}{4\sigma_0}$. Then (7.94) and (7.97) imply that

$$V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu}) \le e^{-3s/16}, \qquad \bar\mu\bigl(\|\theta\|_2 < 2\sigma_0 u\bigr) \le 2e^{-s/16},$$

which, together with (7.68) and the fact that $s \ge 32$, yields the result.
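Under this reconstruction, the conclusion for $s \ge 32$ can be checked numerically: plugging the two bounds into (7.68) gives

```latex
\frac{1 - V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu}) - \bar\mu(\|\theta\|_2 < 2\sigma_0 u)}{2}
\;\ge\; \frac{1 - e^{-3s/16} - 2e^{-s/16}}{2}
\;\ge\; \frac{1 - e^{-6} - 2e^{-2}}{2} \;>\; 0.36,
% using s \ge 32, so that 3s/16 \ge 6 and s/16 \ge 2.
```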

Let now $s < 32$. Then we set $u = \frac{\alpha}{8\sigma_0}\sqrt{s/2}$. It follows from (7.95) and (7.98) that

$$1 - V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu}) - \bar\mu\bigl(\|\theta\|_2 < 2\sigma_0 u\bigr) \;\ge\; \mathbf{P}\Bigl(B\bigl(d, \tfrac{s}{2d}\bigr) = 1\Bigr) \;=\; \frac{s}{2}\Bigl(1 - \frac{s}{2d}\Bigr)^{d-1}.$$

It is not hard to check that the minimum of the last expression over all integers $s, d$ such that $1 \le s < 32$, $s \le d$, is bounded from below by a positive number independent of $d$. We conclude by combining these remarks with (7.68).
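A short way to see that this minimum is dimension-free (a sketch, using the elementary bound $1 - x \ge 4^{-x}$ for $x \in [0,1/2]$):

```latex
% For 1 \le s < 32 and s \le d, one has s/(2d) \le 1/2, hence
\frac{s}{2}\Bigl(1 - \frac{s}{2d}\Bigr)^{d-1}
\;\ge\; \frac{1}{2}\,4^{-\frac{s}{2d}(d-1)}
\;\ge\; \frac{1}{2}\,4^{-s/2}
\;\ge\; \frac{1}{2}\,4^{-16},
% a positive bound that does not depend on d.
```

The inequality $1 - x \ge 4^{-x}$ on $[0,1/2]$ holds because $4^{-x}$ is convex and $1 - x$ is its chord between $x = 0$ and $x = 1/2$.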

Proof of part (ii) of Proposition 7.3.2

The lower bound corresponding to the sparse regime (i.e., $s \le \sqrt{d}$) is already proven in Collier et al. (2017) for known variance $\sigma$. Hence, we only focus on the dense regime, where we may assume without loss of generality that $s \ge \sqrt{d}$ for $d$ large enough. The proof is inspired by ideas from Cai and Jin (2010), even though their original proof does not apply in our setting. In what follows, we will use the Fourier transform defined for any integrable function $f$ by

$$\hat f(t) = \int_{\mathbb{R}} e^{-itx} f(x)\,dx. \qquad (7.69)$$

In the following, $C$ is an absolute constant whose value may change from line to line. We denote by $\phi_{\sigma^2}$ the density of $\mathcal{N}(0,\sigma^2)$. Moreover, we set $\epsilon = \frac{s}{2d} \le 1/2$, $\tau = \sqrt{\alpha\log(es^2/d)}$ with $\alpha$ large enough, and $\varphi$, $c_0$ are defined in Lemma 7.7.9.

First, we build some probability distributions on $\Theta_s$. If $\delta_1,\dots,\delta_d \overset{\text{iid}}{\sim} \mathcal{B}(\epsilon)$, we define $\mu_i$, $i = 1,2$, respectively as the distributions of $(\delta_1 X_1^{(i)},\dots,\delta_d X_d^{(i)})$ where

$$X_1^{(1)},\dots,X_d^{(1)} \overset{\text{iid}}{\sim} (\varphi * g_1)\,d\lambda, \qquad X_1^{(2)},\dots,X_d^{(2)} \overset{\text{iid}}{\sim} g_2\,d\lambda, \qquad (7.70)$$

where $g_1, g_2$ are the density functions given by Lemma 7.7.9 and $\lambda$ is the Lebesgue measure. Then, we consider the probability distributions $\mathbf{P}_1 := \mathbf{P}_{\mu_1,\mathcal{N}(0,1),1}$ and $\mathbf{P}_2 := \mathbf{P}_{\mu_2,\mathcal{N}(0,1),\sqrt{1+\varphi}}$, whose density functions are respectively

$$f_1 = (1-\epsilon)\,\phi_1 + \epsilon\,\phi_{1+\varphi} * g_1, \qquad f_2 = (1-\epsilon)\,\phi_{1+\varphi} + \epsilon\,\phi_{1+\varphi} * g_2. \qquad (7.71)$$

Then, $\bar\mu_1$ and $\bar\mu_2$ defined by

$$\bar\mu_i(A) = \frac{\mu_i(A\cap\Theta_s)}{\mu_i(\Theta_s)} \qquad (7.72)$$

are supported on $\Theta_s$. Now, using Theorem 2.15 in Tsybakov (2008) and the fact that

$\ell(t) \ge \ell(a)\,\mathbf{1}_{t>a}$ for any $a > 0$, we get

$$\inf_{\hat T}\ \sup_{\theta\in\Theta_s,\ \sigma>0} \mathbf{E}_{\theta,\mathcal{N}(0,1),\sigma}\,\ell\Bigl(\lambda_{\mathcal{N}(0,1)}(s,d)^{-1}\bigl|\hat T - \|\theta\|_2\bigr|\Bigr) \;\ge\; \frac{\ell(v)}{2}\,(1 - V_0), \qquad (7.73)$$

where for some $v, w \in \mathbb{R}$,

$$V_0 = V(\mathbf{P}_1,\mathbf{P}_2) + \bar\mu_1\bigl(\|\theta\|_2 \le w + 2\lambda_{\mathcal{N}(0,1)}\,v\bigr) + \bar\mu_2\bigl(\|\theta\|_2 \ge w\bigr). \qquad (7.74)$$

Decomposing in particular the total variation distance, we have

V0  2V(¯µ1, µ1) + 2V(¯µ2, µ2) +p

2(P1,P2) (7.75) +µ1

⇥k✓k2 w+ 2 N(0,1)v⇤ +µ2

⇥k✓k2 w⇤

. (7.76)

The first two terms on the right-hand side are bounded from above, using Lemma 7.7.5, by $4e^{-3s/16}$. Then, we choose

$$v = \frac{\sqrt{c+2u} - \sqrt{c}}{2\lambda_{\mathcal{N}(0,1)}}, \qquad w = \sqrt{c}, \qquad (7.77)$$

where

$$c = m_2 + \frac{m_1 - m_2}{4}, \qquad u = \frac{m_1 - m_2}{4}, \qquad m_i = \mathbf{E}_{\mu_i}\|\theta\|_2^2. \qquad (7.78)$$

Moreover, since by definition

$$m_1 = \frac{s}{2}\int x^2 (g_1 * \varphi)(x)\,dx, \qquad m_2 = \frac{s}{2}\int x^2 g_2(x)\,dx, \qquad (7.79)$$

Lemma 7.7.9 implies that

$$m_1 + m_2 \le \frac{s}{2}, \qquad m_1 - m_2 = \frac{c_0 s}{2\tau^2}, \qquad (7.80)$$

so that

$$\sqrt{c+2u} - \sqrt{c} \;=\; \frac{2u}{\sqrt{c+2u} + \sqrt{c}} \;\ge\; C\,\frac{m_1 - m_2}{\sqrt{m_1 + m_2}} \;\ge\; C\,\frac{\sqrt{s}}{\tau^2}, \qquad (7.81)$$

and $v$, and thus $\ell(v)$, is bounded from below by a positive absolute constant. Finally, by Markov's inequality and the von Bahr–Esseen inequality (von Bahr and Esseen, 1965), we have

µ1 k✓k22 c+ 2u =µ1 k✓k22 m1  u  µ1 k✓k22 m1 5/4

u5/4 (7.82)

 s 2u5/4

Z

x2 ✏ Z

x2g1'(x)dx 5/4g1'(x)dx (7.83)

 Cs u5/4

h Z |x|5/2g1'(x)dx+⇣

✏ Z

x2g1'(x)dx⌘5/4i

, (7.84) and using Lemma 7.7.9 again, this is smaller than C 2s/(⌧3/4u5/4)  C⌧7/4s 1/4. Ap-plying similar arguments for the second probability and because s p

d with d large enough, we get that

µ1 k✓k2 w+ 2 N(0,1)v +µ2 k✓k2 w  C

s1/5. (7.85)

We conclude by applying Lemma 7.7.10 to bound 2(Pµ¯1,Pµ¯2) = (1 + 2(f1, f2))d 1, so that ford large enough, V0 1/2.
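The moment inequality invoked in (7.82)–(7.84) is the von Bahr–Esseen bound; for completeness, in generic notation:

```latex
% von Bahr--Esseen (1965): for independent, zero-mean Z_1,\dots,Z_n and 1 \le p \le 2,
\mathbf{E}\Bigl|\sum_{i=1}^{n} Z_i\Bigr|^{p} \;\le\; 2\sum_{i=1}^{n} \mathbf{E}|Z_i|^{p}.
% Here it is applied with p = 5/4 and Z_i = \delta_i X_i^2 - \mathbf{E}[\delta_i X_i^2],
% and d\,\epsilon = s/2 produces the factor s in (7.83). Note also that
% c + 2u = m_1 - u by (7.78), so the event in (7.82) is a deviation of size u.
```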

Proof of part (ii) of Proposition 7.3.3 and part (ii) of Proposition 7.3.4

We argue similarly to the proof of Theorems 7.4.3 and 7.4.4; in particular, we set $\alpha = (\tau/2)\log^{1/a}(ed/s)$ when proving the bound on the class $\mathcal{G}_{a,\tau}$, and $\alpha = (\tau/2)(d/s)^{1/a}$ when proving the bound on $\mathcal{P}_{a,\tau}$. In what follows, we only deal with the class $\mathcal{G}_{a,\tau}$ since the proof for $\mathcal{P}_{a,\tau}$ is analogous. Without loss of generality we assume that $\sigma = 1$.

To prove the lower bound with the rate $\lambda_{\exp}(s,d)$, we only need to prove it for $s$ such that $(\lambda_{\exp}(s,d))^2 \le c_0\sqrt{d}/\log^{2/a}(ed)$ with any small absolute constant $c_0 > 0$, since the rate is increasing in $s$.

Consider the measures $\mu$, $\bar\mu$, $\mathbf{P}_\mu$, $\mathbf{P}_{\bar\mu}$ defined in Section 7.6 with $\sigma_0 = 1$. Let $\xi_1$ be distributed with c.d.f. $F_0$ defined in item (i) of the proof of Theorems 7.4.3 and 7.4.4.

Using the same notation as in the proof of Theorems 7.4.3 and 7.4.4, we define $\tilde P$ as the distribution of $\tilde\xi_1 = \sigma_1\xi_1 + \alpha\delta_1\epsilon_1$ with $\sigma_1^2 = (1 + \alpha^2 s/(2d))^{-1}$, where now $\delta_1$ is a Bernoulli random variable with $\mathbf{P}(\delta_1 = 1) = \frac{s}{2d}(1 + \alpha^2 s/(2d))^{-1}$. By construction, $\mathbf{E}\tilde\xi_1 = 0$ and $\mathbf{E}\tilde\xi_1^2 = 1$. Since the support of $F_0$ is in $[-3/2,3/2]$, one can check as in item (ii) of the proof of Theorems 7.4.3 and 7.4.4 that $\tilde P \in \mathcal{G}_{a,\tau}$. Next, analogously to (7.67)–(7.68), we obtain that, for any $u > 0$,

$$\sup_{P\in\mathcal{G}_{a,\tau}}\ \sup_{\|\theta\|_0\le s} P_{\theta,P,1}\bigl(|\hat T - \|\theta\|_2| \ge u\bigr) \;\ge\; \frac{1 - V\bigl(\mathbf{P}_{\bar\mu},\, P_{0,\tilde P,1}\bigr) - \bar\mu(\|\theta\|_2 < 2u)}{2}.$$
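The normalization of $\tilde\xi_1$ can be verified directly, with the reconstructed definitions $\sigma_1^2 = (1+\alpha^2 s/(2d))^{-1}$ and $\mathbf{P}(\delta_1 = 1) = \frac{s}{2d}\sigma_1^2$:

```latex
% \xi_1 has mean 0 and variance 1; \epsilon_1 is Rademacher; all are independent, so
\mathbf{E}\tilde\xi_1 = 0,
\qquad
\mathbf{E}\tilde\xi_1^2
\;=\; \sigma_1^2\,\mathbf{E}\xi_1^2 + \alpha^2\,\mathbf{P}(\delta_1 = 1)
\;=\; \sigma_1^2\Bigl(1 + \frac{\alpha^2 s}{2d}\Bigr)
\;=\; 1.
```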

Let $\mathbf{P}_0$ and $\mathbf{P}_1$ denote the distributions of $(\xi_1,\dots,\xi_d)$ and of $(\sigma_1\xi_1,\dots,\sigma_1\xi_d)$, respectively. Acting as in item (i) of the proof of Theorems 7.4.3 and 7.4.4 and using the bound

$$|1 - \sigma_1| \;\le\; \alpha^2 s/d \;=\; \frac{\tau^2}{4}\,\frac{s}{d}\log^{2/a}(ed/s) \;\le\; C c_0/\sqrt{d},$$


we conclude by repeating the argument after (7.68) in the proof of Theorem 7.3.1 and choosing $c_0 > 0$ small enough to guarantee that the right-hand side of the last display is positive.

Proof of part (ii) of Proposition 7.4.2

The lower bound with the rate $1/\sqrt{d}$ follows from the same argument as in item (i) of the proof of Theorems 7.4.3 and 7.4.4 if we replace there $F_0$ by the standard Gaussian distribution. The lower bound with the rate $\frac{s}{d}\bigl(1 + \log_+(s^2/d)\bigr)$ follows from Lemma 7.7.8 and the lower bound for the estimation of $\|\theta\|_2$ in Proposition 7.3.2.

Proof of Proposition 7.4.3

Assume that $\theta = 0$, $\sigma = 1$, and set

$$\xi_i = \sqrt{3}\,\epsilon_i u_i,$$

where the $\epsilon_i$'s and the $u_i$'s are independent, with Rademacher and uniform on $[0,1]$ distributions respectively. Then note that

$$\mathbf{E}_{0,P,1}\bigl|\hat\sigma^2 - 1\bigr| \;\ge\; \cdots$$

This and (7.86) prove the proposition.

7.7 Appendix: Technical lemmas

Lemmas for the upper bounds

Lemma 7.7.1. Let $z_1,\dots,z_d$ … implying (7.87). Finally, (7.88) follows by integrating (7.89).

Lemma 7.7.2. Let $z_1,\dots,z_d$ be i.i.d. …

Proof. Using the definition of $\mathcal{P}_{a,\tau}$ we get that, for any $t \ge 2$, $\mathbf{P}\bigl(z_{(d-j+1)} \ge t\bigl(\tfrac{ed}{j}\bigr)^{1/a}\cdots\bigr) \le \cdots$ This proves (7.90). The proof of (7.91) is analogous to that of (7.88).


The proof is simple and we omit it.

Lemmas for the lower bounds

For two probability measures $\mathbf{P}_1$ and $\mathbf{P}_2$ on a measurable space $(\Omega,\mathcal{U})$, we denote by $V(\mathbf{P}_1,\mathbf{P}_2)$ the total variation distance between $\mathbf{P}_1$ and $\mathbf{P}_2$:

$$V(\mathbf{P}_1,\mathbf{P}_2) = \sup_{B\in\mathcal{U}} \bigl|\mathbf{P}_1(B) - \mathbf{P}_2(B)\bigr|.$$

Lemma 7.7.4 (Deviations of the binomial distribution). Let $B(d,p)$ denote a binomial random variable with parameters $d$ and $p \in (0,1)$. Then, for any $\lambda > 0$, $\mathbf{P}\bigl(B(d,p) - dp \ge \cdots\bigr) \le \cdots$

Inequality (7.92) is a combination of formulas (3) and (10) on pages 440–441 in Shorack and Wellner (2009). Inequality (7.93) is formula (6) on page 440 in Shorack and Wellner (2009).

Lemma 7.7.5. Let $\mathbf{P}_\mu$ and $\mathbf{P}_{\bar\mu}$ be the probability measures defined in (7.59). The total variation distance between these two measures satisfies $V(\mathbf{P}_\mu,\mathbf{P}_{\bar\mu}) \le \mathbf{P}\bigl(\cdots\bigr)$ …

Combining this inequality with (7.92) we obtain (7.94). To prove (7.95), we use again (7.96) and notice that $\mathbf{P}\bigl(\cdots\bigr)$ …

Lemma 7.7.6. Let $\bar\mu$ be defined in (7.58) with some $\alpha > 0$. Then

where the last inequality follows from (7.93). Next, inspection of the proof of Lemma 7.7.5 yields that $\bar\mu(B) \le \mu(B) + e^{-3s/16}$ for any Borel set $B$. Taking here $B = \{\|\theta\|_2 \le \alpha\sqrt{s/2}\}$ and using (7.99) proves (7.97). To prove (7.98), it suffices to note that

$$\mu\Bigl(\|\theta\|_2 < \frac{\alpha}{4}\sqrt{s/2}\Bigr) = \mathbf{P}\Bigl(B\bigl(d, \tfrac{s}{2d}\bigr) < \frac{s}{32}\Bigr).$$

Lemma 7.7.7. There exists a probability density $f_0 : \mathbb{R} \to [0,\infty)$ with the following properties: $f_0$ is continuously differentiable, symmetric about 0, supported on $[-3/2,3/2]$, with variance 1 and finite Fisher information $I_{f_0} = \int (f_0'(x))^2 (f_0(x))^{-1}\,dx$.

Proof. Let $K : \mathbb{R} \to [0,\infty)$ be any probability density which is continuously differentiable, symmetric about 0, supported on $[-1,1]$, and has finite Fisher information $I_K$; for example, the density $K(x) = \cos^2(\pi x/2)\,\mathbf{1}_{|x|\le 1}$. Define $f_0(x) = \bigl[K_h(x + (1-\varepsilon)) + K_h(x - (1-\varepsilon))\bigr]/2$, where $h > 0$ and $\varepsilon \in (0,1)$ are constants to be chosen, and $K_h(u) = K(u/h)/h$. Clearly, we have $I_{f_0} < \infty$ since $I_K < \infty$. It is straightforward to check that the variance of $f_0$ satisfies $\int x^2 f_0(x)\,dx = (1-\varepsilon)^2 + h^2\sigma_K^2$, where $\sigma_K^2 = \int x^2 K(x)\,dx$.
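The second-moment identity can be checked by a change of variables (a sketch; the original choice of $h$ and $\varepsilon$ is lost in the source, so the one below is an assumption):

```latex
% For symmetric K, the density K_h(\cdot - m) has mean m and variance h^2\sigma_K^2, hence
\int x^2 K_h(x - m)\,dx \;=\; m^2 + h^2\sigma_K^2 .
% Averaging over the two shifts m = \pm(1-\varepsilon) gives
\int x^2 f_0(x)\,dx \;=\; (1-\varepsilon)^2 + h^2\sigma_K^2 ,
% so choosing h^2 = (1 - (1-\varepsilon)^2)/\sigma_K^2 (hypothetical choice) makes the
% variance equal to 1, while the support [-(1-\varepsilon)-h,\ (1-\varepsilon)+h]
% stays inside [-3/2, 3/2] for \varepsilon and h small enough.
```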


Lemma 7.7.9. If $c_0$ is small enough, then there exist two density functions such that 1. $\max \int \cdots$

Proof. In the following, $C$ denotes an absolute constant whose value may change from line to line. We define

$$g_1(x) = \cdots$$

where

$$c_{1,n} = 2n^2 + 5n + 6, \qquad c_{2,n} = -4n^2 - 8n - 8, \qquad c_{3,n} = 2n^2 + 3n + 3. \qquad (7.105)$$

Direct computations show that $g_1$ is a density function, and the first part of this proof is dedicated to proving that $g_2$ is a density too if $c_0$ is small enough.

First note that $j$ is bounded on $[-2\tau, 2\tau]$, so that $\hat h$ is well defined. Then, we can … Using integration by parts, we have

$$\int_{-2\tau}^{2\tau} \cdots$$

The choices of the $c_{i,n}$'s make the first three parts vanish, hence

$$|g(x)| \le C c_0 \cdots$$

7.7. APPENDIX: TECHNICAL LEMMAS 187

for $c_0$ small enough. Finally, combining (7.114), (7.117) and (7.119) yields that $g_2$ is positive on $\mathbb{R}$. Furthermore, … which yields the first desired property. From the same computations, we get

$$\int \cdots$$

which is exactly the second property. Furthermore, we have, in particular from (7.114) and the fact that $g \le C\tau$ on $[-\tau^{-1}, \tau^{-1}]$, that

Dans le document The DART-Europe E-theses Portal (Page 188-0)