Supplementary Material for “LASSO-type estimators for Semiparametric Nonlinear Mixed-Effects Models Estimation”
Ana Arribas-Gil · Karine Bertin · Cristian Meza · Vincent Rivoirard
1 Theoretical results for the LASSO-type estimator
1.1 Assumptions
As usual, assumptions on the dictionary are necessary to obtain oracle results for LASSO-type procedures. We refer the reader to van de Geer and Bühlmann (2009) for a good review of the different assumptions considered in the literature for LASSO-type estimators and of the connections between them. The dictionary approach aims at extending results for orthonormal bases; indeed, our assumptions express a relaxation of the orthonormality property. To describe them, we introduce the following notation. For $l\in\mathbb{N}$, we denote
$$\nu_{\min}(l)=\min_{|J|\le l}\ \min_{\substack{\lambda\in\mathbb{R}^M\\ \lambda_J\ne 0}}\frac{\|f_{\lambda_J}\|_n^2}{\|\lambda_J\|_{\ell_2}^2}
\qquad\text{and}\qquad
\nu_{\max}(l)=\max_{|J|\le l}\ \max_{\substack{\lambda\in\mathbb{R}^M\\ \lambda_J\ne 0}}\frac{\|f_{\lambda_J}\|_n^2}{\|\lambda_J\|_{\ell_2}^2},$$
where $\|\cdot\|_{\ell_2}$ is the $\ell_2$-norm in $\mathbb{R}^M$. The notation $\lambda_J$ means that for any $k\in\{1,\dots,M\}$, $(\lambda_J)_k=\lambda_k$ if $k\in J$ and $(\lambda_J)_k=0$ otherwise. The previous quantities correspond to the "restricted" eigenvalues of the Gram matrix $G=(G_{j,j'})$ with coefficients
$$G_{j,j'}=\frac{1}{n}\sum_{i=1}^n b_i^2\varphi_j(x_i)\varphi_{j'}(x_i).$$
Ana Arribas-Gil: Departamento de Estadística, Universidad Carlos III de Madrid. E-mail: ana.arribas@uc3m.es
Karine Bertin, Cristian Meza: CIMFAV-Facultad de Ingeniería, Universidad de Valparaíso. E-mail: karine.bertin@uv.cl, cristian.meza@uv.cl
Vincent Rivoirard: CEREMADE, CNRS-UMR 7534, Université Paris Dauphine. E-mail: vincent.rivoirard@dauphine.fr

Assuming that $\nu_{\min}(l)$ and $\nu_{\max}(l)$ are close to 1 means that every set of columns of $G$ with cardinality less than $l$ behaves like an orthonormal system. We also consider the restricted correlations
$$\delta_{l,l'}=\max_{\substack{|J|\le l,\ |J'|\le l'\\ J\cap J'=\emptyset}}\ \max_{\substack{\lambda,\lambda'\in\mathbb{R}^M\\ \lambda_J\ne 0,\ \lambda'_{J'}\ne 0}}\frac{\langle f_{\lambda_J},f_{\lambda'_{J'}}\rangle}{\|\lambda_J\|_{\ell_2}\,\|\lambda'_{J'}\|_{\ell_2}},$$
where $\langle f,g\rangle=\frac{1}{n}\sum_{i=1}^n b_i^2 f(x_i)g(x_i)$. Small values of $\delta_{l,l'}$ mean that two disjoint sets of columns of $G$ with cardinality less than $l$ and $l'$ span nearly orthogonal spaces. We will use the following assumption, considered in Bickel et al (2009).
Assumption 1 For some integer $1\le s\le M/2$, we have
$$\nu_{\min}(2s)>\delta_{s,2s}.\tag{A1(s)}$$
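For small dictionaries, the quantities appearing in Assumption 1 can be computed exactly by brute force over index sets. The following sketch does so on an illustrative design (a cosine dictionary on a uniform grid with weights $b_i=1$; all sizes and constants are arbitrary choices, not taken from the paper).

```python
import itertools
import numpy as np

# Illustrative design: cosine dictionary on a uniform grid, weights b_i = 1.
n, M = 200, 8
x = np.linspace(0.0, 1.0, n)
b = np.ones(n)
Phi = np.column_stack([np.cos(np.pi * j * x) for j in range(1, M + 1)])

# Gram matrix G_{j,k} = (1/n) sum_i b_i^2 phi_j(x_i) phi_k(x_i)
G = (Phi * b[:, None] ** 2).T @ Phi / n

def nu(G, l):
    """Restricted eigenvalues nu_min(l), nu_max(l): extreme eigenvalues
    of all principal submatrices G_{J,J} with |J| <= l."""
    lo, hi = np.inf, -np.inf
    for size in range(1, l + 1):
        for J in itertools.combinations(range(G.shape[0]), size):
            w = np.linalg.eigvalsh(G[np.ix_(J, J)])  # ascending order
            lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi

def delta(G, l, lp):
    """Restricted correlation delta_{l,l'}: largest singular value of the
    off-diagonal blocks G_{J,J'} over disjoint J, J'. Sub-blocks have smaller
    top singular values, so scanning exact sizes l and l' suffices."""
    best = 0.0
    for J in itertools.combinations(range(G.shape[0]), l):
        rest = [k for k in range(G.shape[0]) if k not in J]
        for Jp in itertools.combinations(rest, lp):
            best = max(best, np.linalg.svd(G[np.ix_(J, Jp)], compute_uv=False)[0])
    return best

s = 2
nu_min_2s, _ = nu(G, 2 * s)
d = delta(G, s, 2 * s)
print(f"nu_min(2s) = {nu_min_2s:.3f}, delta_(s,2s) = {d:.3f}, A1(s): {nu_min_2s > d}")
```

On this grid each $\|\varphi_j\|_n^2\approx 1/2$, so every principal submatrix of $G$ is close to $\tfrac12 I$ and (A1(s)) holds with a wide margin.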
Oracle inequalities for the Dantzig selector were established under this assumption in the parametric linear model by Candès and Tao (2007) and for density estimation by Bertin et al (2011). It was also considered by Bickel et al (2009) for nonparametric regression and for the LASSO estimate.
Let us denote
$$\kappa_s=\sqrt{\nu_{\min}(2s)}\left(1-\frac{\delta_{s,2s}}{\nu_{\min}(2s)}\right)>0,
\qquad
\mu_s=\frac{\delta_{s,2s}}{\sqrt{\nu_{\min}(2s)}}.$$
We will say that $\lambda\in\mathbb{R}^M$ satisfies the Dantzig constraint if for all $j=1,\dots,M$,
$$\big|(G\lambda)_j-\hat\beta_j\big|\le r_{n,j},\tag{1}$$
where
$$\hat\beta_j=\frac{1}{n}\sum_{i=1}^n b_i\varphi_j(x_i)Y_i.$$
We denote by $\mathcal{D}$ the set of $\lambda$ that satisfy (1). The classical use of the Karush-Kuhn-Tucker conditions shows that the LASSO estimator $\hat\lambda$ belongs to $\mathcal{D}$, so it satisfies the Dantzig constraint. Finally, we assume in the sequel that
$$M\le\exp(n^{\delta}),$$
for $\delta<1$. Therefore, if $\|\varphi_j\|_n$ is bounded by a constant independent of $n$ and $M$, then $r_{n,j}=o(1)$ and the oracle inequalities established below are meaningful.
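The KKT fact above is easy to illustrate numerically: fit a weighted LASSO by coordinate descent and check the Dantzig constraint (1) directly. The sketch below assumes an illustrative data-generating setup (cosine dictionary, random weights $b_i$, thresholds $r_{n,j}=\sigma\|\varphi_j\|_n\sqrt{\tau\log M/n}$ with $\tau=3$); none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative weighted-regression data: Y_i = b_i f(x_i) + eps_i.
n, M, sigma = 400, 12, 0.5
x = np.linspace(0.0, 1.0, n)
b = rng.uniform(0.5, 1.5, n)
Phi = np.column_stack([np.cos(np.pi * j * x) for j in range(1, M + 1)])
lam_star = np.zeros(M)
lam_star[[1, 3]] = [2.0, -1.5]                     # sparse truth
Y = b * (Phi @ lam_star) + sigma * rng.standard_normal(n)

# Gram matrix, empirical coefficients beta_hat_j and thresholds r_{n,j}
G = (Phi * b[:, None] ** 2).T @ Phi / n
beta_hat = Phi.T @ (b * Y) / n
tau = 3.0
r = sigma * np.sqrt(np.diag(G)) * np.sqrt(tau * np.log(M) / n)

# Coordinate descent for crit(lam) = (1/n)||Y - b*(Phi lam)||^2 + 2 sum_j r_j |lam_j|;
# each coordinate update is an exact soft-thresholding step.
lam_hat = np.zeros(M)
for _ in range(1000):
    for j in range(M):
        c = G[j] @ lam_hat - G[j, j] * lam_hat[j] - beta_hat[j]
        lam_hat[j] = np.sign(-c) * max(abs(c) - r[j], 0.0) / G[j, j]

# By the KKT conditions the LASSO solution satisfies |(G lam)_j - beta_hat_j| <= r_j.
slack = np.abs(G @ lam_hat - beta_hat) - r
print("max Dantzig violation:", slack.max())       # non-positive up to numerical error
```

The maximal slack is non-positive at convergence, which is exactly the statement that $\hat\lambda\in\mathcal{D}$.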
1.2 Oracle inequalities
We obtain the following oracle inequalities.
Theorem 1 Let $\tau>2$. With probability at least $1-M^{1-\tau/2}$, for any integer $s<n/2$ such that (A1(s)) holds, we have for any $\alpha>0$,
$$\|\hat f-f\|_n^2\le\inf_{\lambda\in\mathbb{R}^M}\ \inf_{\substack{J_0\subset\{1,\dots,M\}\\ |J_0|=s}}\left\{\|f_\lambda-f\|_n^2+\alpha\left(1+\frac{2\mu_s}{\kappa_s}\right)^2\frac{\Lambda(\lambda,J_0^c)^2}{s}+16s\left(\frac{1}{\alpha}+\frac{1}{\kappa_s^2}\right)r_n^2\right\}\tag{2}$$
where
$$r_n=\sup_{j=1,\dots,M}r_{n,j},
\qquad
\Lambda(\lambda,J_0^c)=\|\lambda_{J_0^c}\|_{\ell_1}+\frac{\big(\|\hat\lambda\|_{\ell_1}-\|\lambda\|_{\ell_1}\big)_+}{2},$$
for any $x\in\mathbb{R}$, $x_+:=\max(x,0)$, and $\|\cdot\|_{\ell_1}$ is the $\ell_1$-norm in $\mathbb{R}^M$.
Theorem 2 Let $\tau>2$. With probability at least $1-M^{1-\tau/2}$, for any integer $s<n/2$ such that (A1(s)) holds, we have for any $\alpha>0$,
$$\|\hat f-f\|_n^2\le\inf_{\lambda\in\mathcal{D}}\ \inf_{\substack{J_0\subset\{1,\dots,M\}\\ |J_0|=s}}\left\{\|f_\lambda-f\|_n^2+\alpha\left(1+\frac{2\mu_s}{\kappa_s}\right)^2\frac{\|\lambda_{J_0^c}\|_{\ell_1}^2+\|\hat\lambda_{J_0^c}\|_{\ell_1}^2}{s}+32s\left(\frac{1}{\alpha}+\frac{1}{\kappa_s^2}\right)r_n^2\right\}.\tag{3}$$

Similar oracle inequalities were established by Bunea et al (2006), Bunea et al (2007a), Bunea et al (2007b) and van de Geer (2010). But in these works the functions of the dictionary are assumed to be bounded by a constant independent of $M$ and $n$. Let us comment on the right-hand sides of inequalities (2) and (3) of Theorems 1 and 2. The first term is an approximation term, which measures the closeness between $f$ and $f_\lambda$ and can vanish if $f$ is a linear combination of the functions of the dictionary. The second term can be considered as a bias term. In both theorems, the term $\|\lambda_{J_0^c}\|_{\ell_1}$ corresponds to the cost of having $\lambda$ with a support different from $J_0$. For a given $\lambda$, this term can be minimized by choosing $J_0$ as the set of the largest coordinates of $\lambda$. Note that if the function $f$ has a sparse expansion on the dictionary, that is, $f=f_\lambda$ where $\lambda$ is a vector with $s$ non-zero coordinates, then by choosing $J_0$ as the set of the $s$ non-zero coordinates, both the approximation term and the term $\|\lambda_{J_0^c}\|_{\ell_1}$ vanish.
In Theorem 1, the term $\big(\|\hat\lambda\|_{\ell_1}-\|\lambda\|_{\ell_1}\big)_+$ is small when the $\ell_1$-norm of the LASSO estimator is small, and it equals 0 if $\|\hat\lambda\|_{\ell_1}\le\|\lambda\|_{\ell_1}$, which is frequently the case.
In Theorem 2, given a vector $\lambda$ such that $f_\lambda$ approximates $f$ well, the term $\|\hat\lambda_{J_0^c}\|_{\ell_1}$ is small if the LASSO estimator selects the largest coordinates of $\lambda$. The last term can be viewed as a variance term corresponding to the estimation of $f$ as a linear combination of $s$ functions of the dictionary (see Bertin et al (2011) for more details). Finally, the parameter $\alpha$ calibrates the weights given to the bias and variance terms.
The following section deals with estimation of sparse functions.
1.3 The support property of the LASSO estimate
Let $\tau>2$. In this section, we apply the LASSO procedure with $\tilde r_{n,j}$ instead of $r_{n,j}$, where
$$\tilde r_{n,j}=\sigma\|\varphi_j\|_n\sqrt{\frac{\tilde\tau\log M}{n}},\qquad \tilde\tau>\tau.$$
We assume that the regression function $f$ can be decomposed on the dictionary: there exists $\lambda^*\in\mathbb{R}^M$ such that
$$f=\sum_{j=1}^M\lambda_j^*\varphi_j.$$
We denote by $S^*$ the support of $\lambda^*$:
$$S^*=\big\{j\in\{1,\dots,M\}:\ \lambda_j^*\ne 0\big\},$$
and by $s^*$ the cardinality of $S^*$. We still consider the LASSO estimate $\hat\lambda$ and, similarly, we denote by $\hat S$ the support of $\hat\lambda$:
$$\hat S=\big\{j\in\{1,\dots,M\}:\ \hat\lambda_j\ne 0\big\}.$$
One goal of this section is to show that, with high probability, we have $\hat S\subset S^*$. We have the following result.
Theorem 3 We define
$$\rho(S^*)=\max_{k\in S^*}\max_{j\ne k}\frac{|\langle\varphi_j,\varphi_k\rangle|}{\|\varphi_j\|_n\|\varphi_k\|_n}$$
and we assume that there exists $c\in(0,1/3)$ such that
$$s^*\rho(S^*)\le c.$$
If we have
$$\frac{\sqrt{\tilde\tau}+\sqrt{\tau}}{\sqrt{\tilde\tau}-\sqrt{\tau}}\le\frac{1-c}{2c},$$
then
$$\mathbb{P}\big(\hat S\subset S^*\big)\ge 1-2M^{1-\tau/2}.$$
A similar result was established by Bunea (2008) in a slightly less general model. However, her result is based on strong assumptions on the dictionary, namely that each function is bounded by a constant $L$ (see Assumption (A2)(a) in Bunea (2008)). This assumption is mild when considering dictionaries based only on Fourier bases. It is no longer the case when wavelets are considered: Bunea's assumption is then satisfied only when $L$ depends on $M$ and $n$ and is very large. Since $L$ plays a main role in the definition of the tuning parameters of the method, with too rough values for $L$ the procedure cannot achieve satisfying numerical results for moderate values of $n$, even if the asymptotic theoretical properties of the procedure are good. In the setting of this paper, where we aim at providing calibrated statistical procedures, we avoid such assumptions.
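The support property is easy to probe numerically. The sketch below simulates a sparse $f$ on an illustrative, near-orthogonal cosine dictionary, checks that $s^*\rho(S^*)\le c<1/3$ and that the condition relating $\tilde\tau$ and $\tau$ holds, and verifies $\hat S\subset S^*$ for a LASSO fitted by coordinate descent with thresholds $\tilde r_{n,j}$; all constants ($\sigma$, $\tau$, $\tilde\tau$, sizes) are arbitrary choices for the experiment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: near-orthogonal cosine dictionary, b_i = 1, sparse truth.
n, M, sigma = 500, 20, 0.3
x = np.linspace(0.0, 1.0, n)
b = np.ones(n)
Phi = np.column_stack([np.cos(np.pi * j * x) for j in range(1, M + 1)])
S_star = [0, 4]
lam_star = np.zeros(M)
lam_star[S_star] = [3.0, -2.0]
Y = b * (Phi @ lam_star) + sigma * rng.standard_normal(n)

G = (Phi * b[:, None] ** 2).T @ Phi / n
beta_hat = Phi.T @ (b * Y) / n
norms = np.sqrt(np.diag(G))

# rho(S*) = max_{k in S*} max_{j != k} |<phi_j, phi_k>| / (||phi_j||_n ||phi_k||_n)
C = np.abs(G) / np.outer(norms, norms)
np.fill_diagonal(C, 0.0)
rho = C[S_star].max()
c = len(S_star) * rho                        # Theorem 3 needs s* rho(S*) <= c < 1/3

# Condition on the tuning parameters of Theorem 3
tau, tau_t = 3.0, 40.0
lhs = (np.sqrt(tau_t) + np.sqrt(tau)) / (np.sqrt(tau_t) - np.sqrt(tau))

# LASSO with enlarged thresholds r~_{n,j}, fitted by coordinate descent
r_t = sigma * norms * np.sqrt(tau_t * np.log(M) / n)
lam_hat = np.zeros(M)
for _ in range(1000):
    for j in range(M):
        cj = G[j] @ lam_hat - G[j, j] * lam_hat[j] - beta_hat[j]
        lam_hat[j] = np.sign(-cj) * max(abs(cj) - r_t[j], 0.0) / G[j, j]

S_hat = set(np.flatnonzero(lam_hat != 0.0))  # soft-thresholding yields exact zeros
print("s* rho(S*) =", c, "| parameter condition holds:", lhs <= (1 - c) / (2 * c),
      "| S_hat =", S_hat)
```

With this near-orthogonal design $\rho(S^*)$ is tiny, so both conditions of Theorem 3 hold with large margins and the estimated support coincides with $S^*$.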
Finally, we have the following corollary.
Corollary 1 We suppose that (A1(s*)) is satisfied and that there exists $c\in(0,1/3)$ such that $s^*\rho(S^*)\le c$. If we have
$$\frac{\sqrt{\tilde\tau}+\sqrt{\tau}}{\sqrt{\tilde\tau}-\sqrt{\tau}}\le\frac{1-c}{2c},$$
then, with probability at least $1-4M^{1-\tau/2}$,
$$\|\hat f-f\|_n^2\le\frac{32\,s^*\,\tilde r_n^2}{\kappa_{s^*}^2},$$
where
$$\tilde r_n=\sup_{j=1,\dots,M}\tilde r_{n,j}.$$
This corollary is a simple consequence of Theorem 2 with $\lambda=\lambda^*$ and $J_0=S^*$. Taking $\lambda=\lambda^*$ implies that the approximation term vanishes. Taking $J_0=S^*$ implies that the bias term vanishes, since the support of the LASSO estimator is included in the support of $\lambda^*$. In this case, assuming that $\sup_j\|\varphi_j\|_n<\infty$, the rate of convergence is the classical rate
$$\frac{s^*\log M}{n}.$$
2 The proofs
2.1 Preliminary lemma
Lemma 1 For $1\le j\le M$, we consider the event $\mathcal{A}_j=\big\{|V_j|<r_{n,j}\big\}$, where $V_j=\frac{1}{n}\sum_{i=1}^n b_i\varphi_j(x_i)\varepsilon_i$. Then,
$$\mathbb{P}(\mathcal{A}_j)\ge 1-M^{-\tau/2}.$$
Proof of Lemma 1: We have
$$\mathbb{P}\big(\mathcal{A}_j^c\big)\le\mathbb{P}\Big(\sqrt{n}\,|V_j|/(\sigma\|\varphi_j\|_n)\ge\sqrt{n}\,r_{n,j}/(\sigma\|\varphi_j\|_n)\Big)\le\mathbb{P}\Big(|Z|\ge\sqrt{\tau\log M}\Big)\le M^{-\tau/2},$$
where $Z$ is a standard normal variable.
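The last two inequalities rest on the standard Gaussian tail bound $\mathbb{P}(|Z|\ge t)\le e^{-t^2/2}$ applied with $t=\sqrt{\tau\log M}$, for which $e^{-t^2/2}=M^{-\tau/2}$. A quick numerical check, with arbitrary illustrative values $\tau=3$ and $M=50$:

```python
import math

# P(|Z| >= t) for a standard normal Z equals erfc(t / sqrt(2));
# with t = sqrt(tau log M), the bound e^{-t^2/2} is exactly M^{-tau/2}.
tau, M = 3.0, 50
t = math.sqrt(tau * math.log(M))
exact = math.erfc(t / math.sqrt(2.0))
bound = M ** (-tau / 2.0)
print(f"P(|Z| >= {t:.2f}) = {exact:.2e} <= M^(-tau/2) = {bound:.2e}: {exact <= bound}")
```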
2.2 Proof of Theorem 1
Let $\lambda\in\mathbb{R}^M$ and $J_0$ such that $|J_0|=s$. We have
$$\|f_\lambda-f\|_n^2=\|\hat f-f\|_n^2+\|f_\lambda-\hat f\|_n^2+\frac{2}{n}\sum_{i=1}^n b_i^2\big(\hat f(x_i)-f(x_i)\big)\big(f_\lambda(x_i)-\hat f(x_i)\big).$$
We have $\|f_\lambda-\hat f\|_n^2=\|f_\Delta\|_n^2$, where $\Delta=\lambda-\hat\lambda$. Moreover,
$$A=\frac{2}{n}\sum_{i=1}^n b_i^2\big(\hat f(x_i)-f(x_i)\big)\big(f_\lambda(x_i)-\hat f(x_i)\big)=2\sum_{j=1}^M(\lambda_j-\hat\lambda_j)\big[(G\hat\lambda)_j-\beta_j\big],$$
where
$$\beta_j=\frac{1}{n}\sum_{i=1}^n b_i^2\varphi_j(x_i)f(x_i).$$
Since $\hat\lambda$ satisfies the Dantzig constraint, we have with probability at least $1-M^{1-\tau/2}$, for any $j\in\{1,\dots,M\}$,
$$\big|(G\hat\lambda)_j-\beta_j\big|\le\big|(G\hat\lambda)_j-\hat\beta_j\big|+\big|\hat\beta_j-\beta_j\big|\le 2r_{n,j},$$
and $|A|\le 4r_n\|\Delta\|_{\ell_1}$. This implies that
$$\|\hat f-f\|_n^2\le\|f_\lambda-f\|_n^2+4r_n\|\Delta\|_{\ell_1}-\|f_\Delta\|_n^2.$$
Moreover, using Lemma 1 and Proposition 1 of Bertin et al (2011) (where the norm $\|\cdot\|_2$ is replaced by $\|\cdot\|_n$), we obtain that
$$\big(\|\Delta_{J_0^c}\|_{\ell_1}-\|\Delta_{J_0}\|_{\ell_1}\big)_+\le 2\|\lambda_{J_0^c}\|_{\ell_1}+\big(\|\hat\lambda\|_{\ell_1}-\|\lambda\|_{\ell_1}\big)_+\tag{4}$$
and
$$\|f_\Delta\|_n\ge\kappa_s\|\Delta_{J_0}\|_{\ell_2}-\frac{\mu_s}{\sqrt{|J_0|}}\big(\|\Delta_{J_0^c}\|_{\ell_1}-\|\Delta_{J_0}\|_{\ell_1}\big)_+\ge\kappa_s\|\Delta_{J_0}\|_{\ell_2}-\frac{2\mu_s}{\sqrt{|J_0|}}\Lambda(\lambda,J_0^c).$$
Note that Proposition 1 of Bertin et al (2011) is obtained using Lemma 2 and Lemma 3 of Bertin et al (2011). In our context, Lemma 2 and Lemma 3 can be proved in the same way by replacing the norm $\|\cdot\|_2$ by $\|\cdot\|_n$ and by considering $P_{J_{01}}$ as the projector on the linear space spanned by $(\varphi_j(x_1),\dots,\varphi_j(x_n))_{j\in J_{01}}$.
Now, following the same lines as Theorem 2 of Bertin et al (2011), replacing $\kappa_{J_0}$ by $\kappa_s$ and $\mu_{J_0}$ by $\mu_s$, we obtain the result of the theorem.
2.3 Proof of Theorem 2

We consider $\hat\lambda^D$ defined by
$$\hat\lambda^D=\mathop{\mathrm{argmin}}_{\lambda\in\mathbb{R}^M}\|\lambda\|_{\ell_1}\quad\text{such that }\lambda\text{ satisfies the Dantzig constraint (1)}.$$
Denote by $\hat f^D$ the estimator $f_{\hat\lambda^D}$. Following the same lines as in the proof of Theorem 1, it can be obtained that, with probability at least $1-M^{1-\tau/2}$, for any integer $s<n/2$ such that (A1(s)) holds, we have for any $\alpha>0$,
$$\|\hat f^D-f\|_n^2\le\inf_{\lambda\in\mathbb{R}^M}\ \inf_{\substack{J_0\subset\{1,\dots,M\}\\ |J_0|=s}}\left\{\|f_\lambda-f\|_n^2+\alpha\left(1+\frac{2\mu_s}{\kappa_s}\right)^2\frac{\Lambda(\lambda,J_0^c)^2}{s}+16s\left(\frac{1}{\alpha}+\frac{1}{\kappa_s^2}\right)r_n^2\right\},$$
where here
$$\Lambda(\lambda,J_0^c)=\|\lambda_{J_0^c}\|_{\ell_1}+\frac{\big(\|\hat\lambda^D\|_{\ell_1}-\|\lambda\|_{\ell_1}\big)_+}{2}.$$
If the infimum is taken only over the vectors $\lambda$ that satisfy the Dantzig constraint, then, with the same probability, we have
$$\|\hat f^D-f\|_n^2\le\inf_{\lambda\in\mathcal{D}}\ \inf_{\substack{J_0\subset\{1,\dots,M\}\\ |J_0|=s}}\left\{\|f_\lambda-f\|_n^2+\alpha\left(1+\frac{2\mu_s}{\kappa_s}\right)^2\frac{\|\lambda_{J_0^c}\|_{\ell_1}^2}{s}+16s\left(\frac{1}{\alpha}+\frac{1}{\kappa_s^2}\right)r_n^2\right\}.\tag{5}$$
Following the same lines as the proof of Theorem 1, replacing $\lambda$ by $\hat\lambda^D$, we obtain, with probability at least $1-M^{1-\tau/2}$,
$$\|\hat f-f\|_n^2\le\|\hat f^D-f\|_n^2+4r_n\|\Delta\|_{\ell_1}-\|f_\Delta\|_n^2,$$
with $\Delta=\hat\lambda-\hat\lambda^D$. Applying (4), where $\hat\lambda$ plays the role of $\lambda$ and $\hat\lambda^D$ the role of $\hat\lambda$, the vector $\Delta$ satisfies
$$\big(\|\Delta_{J_0^c}\|_{\ell_1}-\|\Delta_{J_0}\|_{\ell_1}\big)_+\le 2\|\hat\lambda_{J_0^c}\|_{\ell_1}.$$
Following the same lines as in the proof of Theorem 1, we obtain that for each $J_0\subset\{1,\dots,M\}$ such that $|J_0|=s$,
$$\|\hat f-f\|_n^2\le\|\hat f^D-f\|_n^2+\alpha\left(1+\frac{2\mu_s}{\kappa_s}\right)^2\frac{\|\hat\lambda_{J_0^c}\|_{\ell_1}^2}{s}+16s\left(\frac{1}{\alpha}+\frac{1}{\kappa_s^2}\right)r_n^2.\tag{6}$$
Finally, (5) and (6) imply the theorem.
2.4 Proof of Theorem 3
We first state the following lemma.
Lemma 2 We have for any $u\in\mathbb{R}^M$,
$$\operatorname{crit}(\hat\lambda+u)-\operatorname{crit}(\hat\lambda)\ge\Big\|\sum_{k=1}^M u_k\varphi_k\Big\|_n^2.$$
Proof of Lemma 2: Since for any $\lambda$,
$$\operatorname{crit}(\lambda)=\frac{1}{n}\sum_{i=1}^n\big(y_i-b_if_\lambda(x_i)\big)^2+2\sum_{j=1}^M\tilde r_{n,j}|\lambda_j|,$$
we have
$$\begin{aligned}
\operatorname{crit}(\hat\lambda+u)-\operatorname{crit}(\hat\lambda)
&=\frac{1}{n}\sum_{i=1}^n\Big(y_i-b_i\sum_{k=1}^M\hat\lambda_k\varphi_k(x_i)-b_i\sum_{k=1}^M u_k\varphi_k(x_i)\Big)^2+2\sum_{j=1}^M\tilde r_{n,j}|\hat\lambda_j+u_j|\\
&\quad-\frac{1}{n}\sum_{i=1}^n\Big(y_i-b_i\sum_{k=1}^M\hat\lambda_k\varphi_k(x_i)\Big)^2-2\sum_{j=1}^M\tilde r_{n,j}|\hat\lambda_j|\\
&=\frac{1}{n}\sum_{i=1}^n b_i^2\Big(\sum_{k=1}^M u_k\varphi_k(x_i)\Big)^2+2\sum_{j=1}^M\tilde r_{n,j}\big(|\hat\lambda_j+u_j|-|\hat\lambda_j|\big)\\
&\quad-\frac{2}{n}\sum_{i=1}^n\Big(y_i-b_i\sum_{k=1}^M\hat\lambda_k\varphi_k(x_i)\Big)b_i\sum_{k=1}^M u_k\varphi_k(x_i)\\
&=\frac{1}{n}\sum_{i=1}^n b_i^2\Big(\sum_{k=1}^M u_k\varphi_k(x_i)\Big)^2+2\sum_{j=1}^M\tilde r_{n,j}\big(|\hat\lambda_j+u_j|-|\hat\lambda_j|\big)\\
&\quad+\frac{2}{n}\sum_{i=1}^n\sum_{k=1}^M u_k\varphi_k(x_i)\Big(b_i^2\sum_{j=1}^M\hat\lambda_j\varphi_j(x_i)-b_iy_i\Big).
\end{aligned}$$
Since $\hat\lambda$ minimizes $\lambda\mapsto\operatorname{crit}(\lambda)$, we have for any $k$,
$$0=\frac{2}{n}\sum_{i=1}^n\varphi_k(x_i)\Big(b_i^2\sum_{j=1}^M\hat\lambda_j\varphi_j(x_i)-b_iy_i\Big)+2\tilde r_{n,k}\,s(\hat\lambda_k),$$
where $|s(\hat\lambda_k)|\le 1$ and $s(\hat\lambda_k)=\operatorname{sign}(\hat\lambda_k)$ if $\hat\lambda_k\ne 0$. So,
$$\frac{2}{n}\sum_{i=1}^n\sum_{k=1}^M u_k\varphi_k(x_i)\Big(b_i^2\sum_{j=1}^M\hat\lambda_j\varphi_j(x_i)-b_iy_i\Big)=-2\sum_{k=1}^M u_k\tilde r_{n,k}\,s(\hat\lambda_k)$$
and
$$\begin{aligned}
\operatorname{crit}(\hat\lambda+u)-\operatorname{crit}(\hat\lambda)
&=\frac{1}{n}\sum_{i=1}^n b_i^2\Big(\sum_{k=1}^M u_k\varphi_k(x_i)\Big)^2+2\sum_{j=1}^M\tilde r_{n,j}\big(|\hat\lambda_j+u_j|-|\hat\lambda_j|\big)-2\sum_{k=1}^M u_k\tilde r_{n,k}\,s(\hat\lambda_k)\\
&=\frac{1}{n}\sum_{i=1}^n b_i^2\Big(\sum_{k=1}^M u_k\varphi_k(x_i)\Big)^2+2\sum_{j=1}^M\tilde r_{n,j}\big(|\hat\lambda_j+u_j|-|\hat\lambda_j|-u_j\,s(\hat\lambda_j)\big)\\
&\ge\frac{1}{n}\sum_{i=1}^n b_i^2\Big(\sum_{k=1}^M u_k\varphi_k(x_i)\Big)^2,
\end{aligned}$$
which proves the result.
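Lemma 2 can be sanity-checked numerically: compute the LASSO minimizer on simulated data (an illustrative toy design, as in the earlier sketches) and verify that $\operatorname{crit}(\hat\lambda+u)-\operatorname{crit}(\hat\lambda)\ge\|\sum_k u_k\varphi_k\|_n^2$ for random directions $u$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative toy data: Y_i = b_i f(x_i) + eps_i with a sparse f.
n, M, sigma = 300, 10, 0.4
x = np.linspace(0.0, 1.0, n)
b = rng.uniform(0.5, 1.5, n)
Phi = np.column_stack([np.cos(np.pi * j * x) for j in range(1, M + 1)])
lam_star = np.zeros(M)
lam_star[2] = 2.0
Y = b * (Phi @ lam_star) + sigma * rng.standard_normal(n)

G = (Phi * b[:, None] ** 2).T @ Phi / n
beta_hat = Phi.T @ (b * Y) / n
r_t = sigma * np.sqrt(np.diag(G)) * np.sqrt(3.0 * np.log(M) / n)

def crit(lam):
    """crit(lam) = (1/n) sum_i (Y_i - b_i f_lam(x_i))^2 + 2 sum_j r~_{n,j} |lam_j|."""
    return np.mean((Y - b * (Phi @ lam)) ** 2) + 2.0 * np.sum(r_t * np.abs(lam))

# Coordinate descent for the LASSO minimizer lam_hat.
lam_hat = np.zeros(M)
for _ in range(1000):
    for j in range(M):
        c = G[j] @ lam_hat - G[j, j] * lam_hat[j] - beta_hat[j]
        lam_hat[j] = np.sign(-c) * max(abs(c) - r_t[j], 0.0) / G[j, j]

# Lemma 2: crit(lam_hat + u) - crit(lam_hat) >= ||sum_k u_k phi_k||_n^2 for any u.
ok = True
for _ in range(100):
    u = rng.standard_normal(M)
    lhs = crit(lam_hat + u) - crit(lam_hat)
    rhs = np.mean(b ** 2 * (Phi @ u) ** 2)   # squared ||.||_n norm of f_u
    ok = ok and bool(lhs >= rhs - 1e-9)
print("Lemma 2 inequality holds on all draws:", ok)
```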
Now, still with $s^*=\operatorname{card}(S^*)$, we consider for $\mu\in\mathbb{R}^{s^*}$
$$\operatorname{crit}_{S^*}(\mu)=\frac{1}{n}\sum_{i=1}^n\Big(y_i-b_i\sum_{j\in S^*}\mu_j\varphi_j(x_i)\Big)^2+2\sum_{j\in S^*}\tilde r_{n,j}|\mu_j|,$$
and
$$\tilde\mu=\mathop{\mathrm{arg\,min}}_{\mu\in\mathbb{R}^{s^*}}\operatorname{crit}_{S^*}(\mu).$$
Then we set
$$\mathcal{S}=\bigcap_{j\notin S^*}\left\{\Big|\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle\Big|<\tilde r_{n,j}\right\}$$
and we state the following lemma.
Lemma 3 On the set $\mathcal{S}$, the non-zero coordinates of $\hat\lambda$ are included in $S^*$.
Proof of Lemma 3: Recall that $\hat\lambda$ is a minimizer of $\lambda\mapsto\operatorname{crit}(\lambda)$. Using standard convex analysis arguments, this is equivalent to saying that for any $1\le j\le M$,
$$\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k=1}^M\hat\lambda_k\langle\varphi_j,\varphi_k\rangle=\tilde r_{n,j}\operatorname{sign}(\hat\lambda_j)\quad\text{if }\hat\lambda_j\ne 0,$$
$$\Big|\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k=1}^M\hat\lambda_k\langle\varphi_j,\varphi_k\rangle\Big|\le\tilde r_{n,j}\quad\text{if }\hat\lambda_j=0.$$
Similarly, on $\mathcal{S}$, we have
$$\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle=\tilde r_{n,j}\operatorname{sign}(\tilde\mu_j)\quad\text{if }j\in S^*\text{ and }\tilde\mu_j\ne 0,$$
$$\Big|\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle\Big|\le\tilde r_{n,j}\quad\text{if }j\in S^*\text{ and }\tilde\mu_j=0,$$
$$\Big|\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle\Big|<\tilde r_{n,j}\quad\text{if }j\notin S^*.$$
So, on $\mathcal{S}$, the vector $\hat\mu$ such that $\hat\mu_j=\tilde\mu_j$ if $j\in S^*$ and $\hat\mu_j=0$ if $j\notin S^*$ is also a minimizer of $\lambda\mapsto\operatorname{crit}(\lambda)$. Using Lemma 2, we have for any $1\le i\le n$:
$$\sum_{k=1}^M(\hat\lambda_k-\hat\mu_k)\varphi_k(x_i)=0.$$
So, for $j\notin S^*$,
$$\Big|\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k=1}^M\hat\lambda_k\langle\varphi_j,\varphi_k\rangle\Big|<\tilde r_{n,j}.$$
Therefore, on $\mathcal{S}$, the non-zero coordinates of $\hat\lambda$ are included in $S^*$.
Lemma 3 shows that we just need to prove that
$$\mathbb{P}(\mathcal{S})\ge 1-2M^{1-\tau/2}.$$
We have
$$\mathbb{P}(\mathcal{S}^c)\le\sum_{j\notin S^*}\mathbb{P}\left\{\Big|\frac{1}{n}\sum_{i=1}^n y_ib_i\varphi_j(x_i)-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle\Big|\ge\tilde r_{n,j}\right\}\le A+B,$$
with
$$A=\sum_{j\notin S^*}\mathbb{P}\left\{\Big|\frac{1}{n}\sum_{i=1}^n\big[y_ib_i\varphi_j(x_i)-\mathbb{E}\big(y_ib_i\varphi_j(x_i)\big)\big]\Big|\ge r_{n,j}\right\}=\sum_{j\notin S^*}\mathbb{P}\left\{\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_ib_i\varphi_j(x_i)\Big|\ge r_{n,j}\right\}=\sum_{j\notin S^*}\mathbb{P}\big\{|V_j|\ge r_{n,j}\big\}$$
(see Lemma 1) and
$$\begin{aligned}
B&=\mathbb{P}\left(\bigcup_{j\notin S^*}\left\{\Big|\frac{1}{n}\sum_{i=1}^n\mathbb{E}\big(y_ib_i\varphi_j(x_i)\big)-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle\Big|\ge\tilde r_{n,j}-r_{n,j}\right\}\right)\\
&=\mathbb{P}\left(\bigcup_{j\notin S^*}\left\{\Big|\langle\varphi_j,f_{\lambda^*}\rangle-\sum_{k\in S^*}\tilde\mu_k\langle\varphi_j,\varphi_k\rangle\Big|\ge\tilde r_{n,j}-r_{n,j}\right\}\right)\\
&=\mathbb{P}\left(\bigcup_{j\notin S^*}\left\{\Big|\sum_{k\in S^*}(\lambda_k^*-\tilde\mu_k)\langle\varphi_j,\varphi_k\rangle\Big|\ge\tilde r_{n,j}-r_{n,j}\right\}\right)\\
&\le\mathbb{P}\left(\bigcup_{j\notin S^*}\left\{\rho(S^*)\|\varphi_j\|_n\sum_{k\in S^*}|\lambda_k^*-\tilde\mu_k|\,\|\varphi_k\|_n\ge\tilde r_{n,j}-r_{n,j}\right\}\right),
\end{aligned}$$
since
$$\rho(S^*)=\max_{k\in S^*}\max_{j\ne k}\frac{|\langle\varphi_j,\varphi_k\rangle|}{\|\varphi_j\|_n\|\varphi_k\|_n}.$$
Using the notation of Lemma 3, we have:
$$\begin{aligned}
\|f_{\lambda^*}-f_{\hat\mu}\|_n^2&=\Big\|\sum_{k\in S^*}(\lambda_k^*-\hat\mu_k)\varphi_k\Big\|_n^2\\
&=\sum_{k\in S^*}(\lambda_k^*-\hat\mu_k)^2\|\varphi_k\|_n^2+\sum_{k\in S^*}\sum_{j\in S^*,\,j\ne k}(\lambda_k^*-\hat\mu_k)(\lambda_j^*-\hat\mu_j)\langle\varphi_j,\varphi_k\rangle,
\end{aligned}$$
and
$$\begin{aligned}
\sum_{k\in S^*}(\lambda_k^*-\hat\mu_k)^2\|\varphi_k\|_n^2&\le\|f_{\lambda^*}-f_{\hat\mu}\|_n^2+\rho(S^*)\sum_{k\in S^*}\sum_{j\in S^*,\,j\ne k}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n\times|\lambda_j^*-\hat\mu_j|\,\|\varphi_j\|_n\\
&\le\|f_{\lambda^*}-f_{\hat\mu}\|_n^2+\rho(S^*)\Big(\sum_{k\in S^*}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n\Big)^2.
\end{aligned}$$
Finally,
$$\Big(\sum_{k\in S^*}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n\Big)^2\le s^*\sum_{k\in S^*}(\lambda_k^*-\hat\mu_k)^2\|\varphi_k\|_n^2\le s^*\left(\|f_{\lambda^*}-f_{\hat\mu}\|_n^2+\rho(S^*)\Big(\sum_{k\in S^*}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n\Big)^2\right),$$
which shows that
$$\Big(\sum_{k\in S^*}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n\Big)^2\le\frac{s^*}{1-\rho(S^*)s^*}\,\|f_{\lambda^*}-f_{\hat\mu}\|_n^2.$$
Now,
$$\frac{1}{n}\sum_{i=1}^n\Big(y_i-b_i\sum_{j\in S^*}\tilde\mu_j\varphi_j(x_i)\Big)^2+2\sum_{j\in S^*}\tilde r_{n,j}|\tilde\mu_j|\le\frac{1}{n}\sum_{i=1}^n\Big(y_i-b_i\sum_{j\in S^*}\lambda_j^*\varphi_j(x_i)\Big)^2+2\sum_{j\in S^*}\tilde r_{n,j}|\lambda_j^*|.$$
So,
$$\Big\|\sum_{j\in S^*}\tilde\mu_j\varphi_j\Big\|_n^2-\frac{2}{n}\sum_{i=1}^n b_iy_i\sum_{j\in S^*}\tilde\mu_j\varphi_j(x_i)+2\sum_{j\in S^*}\tilde r_{n,j}|\tilde\mu_j|\le\Big\|\sum_{j\in S^*}\lambda_j^*\varphi_j\Big\|_n^2-\frac{2}{n}\sum_{i=1}^n b_iy_i\sum_{j\in S^*}\lambda_j^*\varphi_j(x_i)+2\sum_{j\in S^*}\tilde r_{n,j}|\lambda_j^*|,$$
and, using previous notation,
$$\|f_{\hat\mu}\|_n^2-\frac{2}{n}\sum_{i=1}^n b_iy_i\sum_{j\in S^*}\tilde\mu_j\varphi_j(x_i)+2\sum_{j\in S^*}\tilde r_{n,j}|\tilde\mu_j|\le\|f_{\lambda^*}\|_n^2-\frac{2}{n}\sum_{i=1}^n b_iy_i\sum_{j\in S^*}\lambda_j^*\varphi_j(x_i)+2\sum_{j\in S^*}\tilde r_{n,j}|\lambda_j^*|.$$
Therefore,
$$\begin{aligned}
\|f_{\lambda^*}-f_{\hat\mu}\|_n^2&=\|f_{\hat\mu}\|_n^2+\|f_{\lambda^*}\|_n^2-2\langle f_{\hat\mu},f_{\lambda^*}\rangle\\
&\le 2\|f_{\lambda^*}\|_n^2-2\langle f_{\hat\mu},f_{\lambda^*}\rangle+\frac{2}{n}\sum_{i=1}^n b_iy_i\sum_{j\in S^*}(\tilde\mu_j-\lambda_j^*)\varphi_j(x_i)+2\sum_{j\in S^*}\tilde r_{n,j}\big(|\lambda_j^*|-|\tilde\mu_j|\big)\\
&=\frac{2}{n}\sum_{i=1}^n b_iy_i\big(f_{\hat\mu}(x_i)-f_{\lambda^*}(x_i)\big)-\frac{2}{n}\sum_{i=1}^n b_i^2 f_{\lambda^*}(x_i)\big(f_{\hat\mu}(x_i)-f_{\lambda^*}(x_i)\big)+2\sum_{j\in S^*}\tilde r_{n,j}\big(|\lambda_j^*|-|\tilde\mu_j|\big)\\
&=\frac{2}{n}\sum_{i=1}^n b_i\big(y_i-\mathbb{E}(y_i)\big)\big(f_{\hat\mu}(x_i)-f_{\lambda^*}(x_i)\big)+2\sum_{j\in S^*}\tilde r_{n,j}\big(|\lambda_j^*|-|\tilde\mu_j|\big)\\
&=\frac{2}{n}\sum_{i=1}^n b_i\varepsilon_i\big(f_{\hat\mu}(x_i)-f_{\lambda^*}(x_i)\big)+2\sum_{j\in S^*}\tilde r_{n,j}\big(|\lambda_j^*|-|\tilde\mu_j|\big)\\
&=2\sum_{j=1}^M V_j(\hat\mu_j-\lambda_j^*)+2\sum_{j\in S^*}\tilde r_{n,j}\big(|\lambda_j^*|-|\tilde\mu_j|\big).
\end{aligned}$$
Now let us assume that for any $j\in S^*$, $|V_j|<r_{n,j}$. Then,
$$\|f_{\lambda^*}-f_{\hat\mu}\|_n^2<2\sum_{j\in S^*}(r_{n,j}+\tilde r_{n,j})\,|\hat\mu_j-\lambda_j^*|=2\sigma\sqrt{\frac{\log M}{n}}\,\big(\sqrt{\tau}+\sqrt{\tilde\tau}\big)\sum_{j\in S^*}\|\varphi_j\|_n\,|\hat\mu_j-\lambda_j^*|.$$
So,
$$\sum_{k\in S^*}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n<2\sigma\sqrt{\frac{\log M}{n}}\,\big(\sqrt{\tau}+\sqrt{\tilde\tau}\big)\,\frac{s^*}{1-\rho(S^*)s^*},$$
and for any $j\notin S^*$,
$$\begin{aligned}
\rho(S^*)\|\varphi_j\|_n\sum_{k\in S^*}|\lambda_k^*-\hat\mu_k|\,\|\varphi_k\|_n&<2\sigma\sqrt{\frac{\log M}{n}}\,\|\varphi_j\|_n\big(\sqrt{\tau}+\sqrt{\tilde\tau}\big)\,\frac{\rho(S^*)s^*}{1-\rho(S^*)s^*}\\
&\le\frac{2\sigma c\big(\sqrt{\tau}+\sqrt{\tilde\tau}\big)}{1-c}\sqrt{\frac{\log M}{n}}\,\|\varphi_j\|_n\\
&\le\big(\sqrt{\tilde\tau}-\sqrt{\tau}\big)\,\sigma\sqrt{\frac{\log M}{n}}\,\|\varphi_j\|_n\\
&=\tilde r_{n,j}-r_{n,j}.
\end{aligned}$$
Therefore,
$$B\le\sum_{j\in S^*}\mathbb{P}\big\{|V_j|\ge r_{n,j}\big\},$$
and using Lemma 1, since $\mathbb{P}(\mathcal{S}^c)\le A+B$,
$$\mathbb{P}(\mathcal{S})\ge 1-2M^{1-\tau/2}.$$
2.5 Proof of Corollary 1
First note that $\lambda^*$ satisfies the Dantzig constraint (1), where $r_{n,j}$ is replaced by $\tilde r_{n,j}$, with probability larger than $1-M^{1-\tilde\tau/2}$. On the event $\hat S\subset S^*$, we have $\lambda^*_{(S^*)^c}=\hat\lambda_{(S^*)^c}=0$; then, applying Theorem 2, we obtain that for any $\alpha>0$,
$$\|\hat f-f\|_n^2\le 32\,s^*\left(\frac{1}{\alpha}+\frac{1}{\kappa_{s^*}^2}\right)\tilde r_n^2,$$
which implies the result of the corollary.
References
Bertin K, Le Pennec E, Rivoirard V (2011) Adaptive Dantzig density estimation. Annales de l'Institut Henri Poincaré 47:43–74
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Statist 37(4):1705–1732
Bunea F (2008) Consistent selection via the Lasso for high dimensional approximating regression models. In: Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, Inst. Math. Stat. Collect., vol 3, Inst. Math. Statist., Beachwood, OH, pp 122–137
Bunea F, Tsybakov AB, Wegkamp MH (2006) Aggregation and sparsity via $\ell_1$ penalized least squares. In: Learning theory, Lecture Notes in Comput. Sci., vol 4005, Springer, Berlin, pp 379–391
Bunea F, Tsybakov A, Wegkamp M (2007a) Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics 1:169–194
Bunea F, Tsybakov AB, Wegkamp MH (2007b) Aggregation for Gaussian regression. The Annals of Statistics 35(4):1674–1697
Candès EJ, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics 35(6):2313–2351
van de Geer S (2010) $\ell_1$-regularization in high-dimensional statistical models. In: Proceedings of the International Congress of Mathematicians, Volume IV, Hindustan Book Agency, New Delhi, pp 2351–2369
van de Geer SA, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics 3:1360–1392