Online Appendix to the Article “Choice of V for V -Fold Cross-Validation in Least-Squares Density Estimation”


Sylvain Arlot sylvain.arlot@math.u-psud.fr

Laboratoire de Mathématiques d’Orsay

Univ. Paris-Sud, CNRS, Université Paris-Saclay 91405 Orsay, France

Matthieu Lerasle mlerasle@unice.fr

CNRS

Univ. Nice Sophia Antipolis LJAD CNRS UMR 7351 06100 Nice France

Editor: Xiaotong Shen

This appendix is organized as follows. The first section (called Section B, for consistency with the numbering of the article) gives complementary computations of variances. Then, results concerning hold-out penalization are detailed in Section D, with the proof of the oracle inequality stated in Section 8.2 (Theorem 12) and an exact computation of the variance. Section E provides complements on the computational aspects stated in Section 7.

In particular, we state and analyse the basic algorithm for computing the V-fold criteria and we give the proof of Proposition 8. A useful concentration inequality is recalled in Section F. Finally, some simulation results are detailed in Section G, as a supplement to the ones of Section 6.

Appendix B. Additional Variance Computations

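Proposition 17 Let $(\psi_\lambda)_{\lambda\in\Lambda_{m_1}}$ and $(\psi_\lambda)_{\lambda\in\Lambda_{m_2}}$ be two finite orthonormal families of vectors of $L^4(\mu)$. Assume that $\mathcal{B}$ satisfies (Reg) and, for any $m\in\{m_1,m_2\}$, let

\[
\mathcal{C}_{\mathrm{id}}(m) = P_n\gamma(\widehat{s}_m) + \mathbb{E}\bigl[\mathrm{pen}_{\mathrm{id}}(m)\bigr] .
\]

Then, with the notation of Theorem 6,

\[
\operatorname{Var}\bigl(\mathcal{C}_{\mathrm{id}}(m_1)\bigr)
= \frac{2(n-1)}{n^3}\,\beta(m_1,m_1)
+ \frac{4}{n}\operatorname{Var}\Bigl(\Bigl(1-\frac{1}{n}\Bigr)s_{m_1}(\xi) + \frac{1}{2n}\Psi_{m_1}(\xi)\Bigr) .
\]

We also have

\[
\operatorname{Var}\bigl(\mathcal{C}_{\mathrm{id}}(m_1)-\mathcal{C}_{\mathrm{id}}(m_2)\bigr)
= \frac{2(n-1)}{n^3}\,B(m_1,m_2)
+ \frac{4}{n}\operatorname{Var}\Bigl(\Bigl(1-\frac{1}{n}\Bigr)\bigl(s_{m_1}(\xi)-s_{m_2}(\xi)\bigr) + \frac{1}{2n}\bigl(\Psi_{m_1}(\xi)-\Psi_{m_2}(\xi)\bigr)\Bigr) .
\]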

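Proof Simply notice that

\[
\operatorname{Var}\bigl(\mathcal{C}_{\mathrm{id}}(m_1)\bigr) = \operatorname{Var}\bigl(P_n\gamma(\widehat{s}_{m_1})\bigr) .
\]

Therefore, from (57), the variance of $\mathcal{C}_{\mathrm{id}}(m_1)$ is the one of

\[
-\frac{1}{n^2}\sum_{1\le i,j\le n} U_{m_1}(\xi_i,\xi_j) - \sum_{i=1}^{n}\frac{2\,s_{m_1}(\xi_i)}{n} ,
\]

so that, by Lemma 16,

\[
\begin{aligned}
\operatorname{Var}\bigl(\mathcal{C}_{\mathrm{id}}(m_1)\bigr)
&= \frac{2(n-1)}{n^3}\,\beta(m_1,m_1)
+ \frac{1}{n^3}\operatorname{Var}\bigl(\Psi_{m_1}(\xi)-2s_{m_1}(\xi)\bigr)
+ \frac{4}{n^2}\operatorname{Cov}\bigl(\Psi_{m_1}(\xi)-2s_{m_1}(\xi),\, s_{m_1}(\xi)\bigr)
+ \frac{4}{n}\operatorname{Var}\bigl(s_{m_1}(\xi)\bigr) \\
&= \frac{2(n-1)}{n^3}\,\beta(m_1,m_1)
+ \frac{4}{n}\operatorname{Var}\Bigl(\Bigl(1-\frac{1}{n}\Bigr)s_{m_1}(\xi) + \frac{1}{2n}\Psi_{m_1}(\xi)\Bigr) ,
\end{aligned}
\]

where the last equality uses that the three variance and covariance terms are the expansion of $\frac{1}{n}\operatorname{Var}\bigl(\frac{1}{n}(\Psi_{m_1}(\xi)-2s_{m_1}(\xi)) + 2s_{m_1}(\xi)\bigr)$. The variance of the increments follows from the same computations.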

B.1 Evaluation of the Terms in the Variance Formula

The following proposition gives a formula for the terms appearing in Theorem 6 and Proposition 17 which does not depend on the basis $(\psi_\lambda)_{\lambda\in\Lambda_m}$.

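Proposition 18 For any $m_1, m_2\in\mathcal{M}_n$, we have

\[
\begin{aligned}
\beta(m_1,m_2) &= n\operatorname{Cov}\bigl(\widehat{s}_{m_1}(\xi),\widehat{s}_{m_2}(\xi)\bigr) - (n+1)\operatorname{Cov}\bigl(s_{m_1}(\xi),s_{m_2}(\xi)\bigr) \\
B(m_1,m_2) &= n\operatorname{Var}\bigl((\widehat{s}_{m_1}-\widehat{s}_{m_2})(\xi)\bigr) - (n+1)\operatorname{Var}\bigl((s_{m_1}-s_{m_2})(\xi)\bigr) ,
\end{aligned}
\tag{59}
\]

where $\xi$ denotes a copy of $\xi_1$, independent of $\xi_{\llbracket n\rrbracket}$.

Proof By definition, we have

\[
\begin{aligned}
\beta(m_1,m_2)
&= \sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \operatorname{Cov}\bigl(\psi_\lambda(\xi_1),\psi_{\lambda'}(\xi_1)\bigr)^2 \\
&= \sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P(\psi_\lambda\psi_{\lambda'}) - P\psi_\lambda\, P\psi_{\lambda'}\bigr)^2 \\
&= \sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2
 - 2\sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} P\psi_\lambda\, P\psi_{\lambda'}\, P(\psi_\lambda\psi_{\lambda'})
 + \sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P\psi_\lambda\, P\psi_{\lambda'}\bigr)^2 \\
&= \sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2
 - 2P(s_{m_1}s_{m_2}) + \|s_{m_1}\|^2\|s_{m_2}\|^2 .
\end{aligned}
\]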

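Now, by Eq. (31), we have

\[
\begin{aligned}
\operatorname{Cov}\bigl(\widehat{s}_{m_1}(\xi),\widehat{s}_{m_2}(\xi)\bigr)
&= \frac{1}{n^2}\sum_{1\le i,j\le n}\ \sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \operatorname{Cov}\bigl(\psi_\lambda(\xi_i)\psi_\lambda(\xi),\ \psi_{\lambda'}(\xi_j)\psi_{\lambda'}(\xi)\bigr) \\
&= \frac{1}{n}\sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \Bigl[\bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2 - \bigl(P\psi_\lambda\, P\psi_{\lambda'}\bigr)^2\Bigr]
 + \frac{n-1}{n}\sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P(\psi_\lambda\psi_{\lambda'}) - P\psi_\lambda\, P\psi_{\lambda'}\bigr)\, P\psi_\lambda\, P\psi_{\lambda'} \\
&= \frac{1}{n}\sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2
 - \frac{1}{n}\|s_{m_1}\|^2\|s_{m_2}\|^2
 + \frac{n-1}{n}\operatorname{Cov}\bigl(s_{m_1}(\xi),s_{m_2}(\xi)\bigr) .
\end{aligned}
\]

It follows that

\[
\sum_{\lambda\in\Lambda_{m_1}}\sum_{\lambda'\in\Lambda_{m_2}} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2
= n\operatorname{Cov}\bigl(\widehat{s}_{m_1}(\xi),\widehat{s}_{m_2}(\xi)\bigr) + \|s_{m_1}\|^2\|s_{m_2}\|^2 - (n-1)\operatorname{Cov}\bigl(s_{m_1}(\xi),s_{m_2}(\xi)\bigr) .
\]

Thus, since $P(s_{m_1}s_{m_2}) - \|s_{m_1}\|^2\|s_{m_2}\|^2 = \operatorname{Cov}\bigl(s_{m_1}(\xi),s_{m_2}(\xi)\bigr)$,

\[
\beta(m_1,m_2) = n\operatorname{Cov}\bigl(\widehat{s}_{m_1}(\xi),\widehat{s}_{m_2}(\xi)\bigr) - (n+1)\operatorname{Cov}\bigl(s_{m_1}(\xi),s_{m_2}(\xi)\bigr) .
\]

Eq. (59) follows.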

B.2 Evaluation of the Variance in the Regular Histogram Case

The following lemma gives the value of the terms appearing in Theorem 6 for two nested regular histogram models.

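Lemma 19 Let $m_1=\Lambda_{m_1}$ and $m_2=\Lambda_{m_2}$ be two regular partitions of $\mathbb{R}$, as defined by Example 1 in Section 3.2, so that for $i\in\{1,2\}$, for any $\lambda\in m_i$, $\mu(\lambda)=d_{m_i}^{-1}$. We assume that $m_2$ is a subpartition of $m_1$, that is, any element of $m_2$ is a subset of an element of $m_1$. For any $m^\star\in\{m_1,m_2\}$, we define

\[
T_{m^\star}(x) = \sum_{\lambda\in m^\star}\bigl(\psi_\lambda(x)-P\psi_\lambda\bigr)^2 = \sup_{t\in B_{m^\star}}\bigl(t(x)-Pt\bigr)^2 ,
\]

where we recall that $B_{m^\star}=\{t\in S_{m^\star} \,/\, \|t\|\le 1\}$ and for any $\lambda\in m_1\cup m_2$, $\psi_\lambda=(\mu(\lambda))^{-1/2}\mathbf{1}_\lambda$. Then, we have

\[
\beta(m_1,m_2) = d_{m_1}\|s_{m_2}\|^2 - 2P(s_{m_1}s_{m_2}) + \|s_{m_1}\|^2\|s_{m_2}\|^2 = P(T_{m_1}s_{m_2})
\tag{60}
\]

and

\[
\begin{aligned}
B(m_1,m_2) &= P\bigl(T_{m_1}(s_{m_1}-s_{m_2}) + (T_{m_2}-T_{m_1})s_{m_2}\bigr) \\
&= (d_{m_2}-d_{m_1})\|s_{m_2}\|^2 - d_{m_1}\|s_{m_1}-s_{m_2}\|^2 - 2\operatorname{Var}_P(s_{m_1}-s_{m_2}) - \|s_{m_1}-s_{m_2}\|^4 .
\end{aligned}
\]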

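Proof On the one hand, by definition,

\[
\begin{aligned}
\beta(m_1,m_2)
&= \sum_{\lambda\in m_1}\sum_{\lambda'\in m_2} \Bigl(\mathbb{E}\bigl[(\psi_\lambda(\xi_1)-P\psi_\lambda)(\psi_{\lambda'}(\xi_1)-P\psi_{\lambda'})\bigr]\Bigr)^2 \\
&= \sum_{\lambda\in m_1}\sum_{\lambda'\in m_2} \Bigl[\bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2 - 2P(\psi_\lambda\psi_{\lambda'})\,P\psi_\lambda\,P\psi_{\lambda'} + (P\psi_\lambda)^2(P\psi_{\lambda'})^2\Bigr] \\
&= \sum_{\lambda\in m_1}\sum_{\lambda'\in m_2} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2
 - 2P\Bigl(\underbrace{\sum_{\lambda\in m_1}(P\psi_\lambda)\psi_\lambda}_{=s_{m_1}}\ \underbrace{\sum_{\lambda\in m_2}(P\psi_\lambda)\psi_\lambda}_{=s_{m_2}}\Bigr)
 + \underbrace{\sum_{\lambda\in m_1}(P\psi_\lambda)^2}_{=\|s_{m_1}\|^2}\ \underbrace{\sum_{\lambda\in m_2}(P\psi_\lambda)^2}_{=\|s_{m_2}\|^2} .
\end{aligned}
\]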

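For computing the first term, we use that $\psi_\lambda\psi_{\lambda'}=0$ if $\lambda\cap\lambda'=\emptyset$ and that $m_2$ is a subpartition of $m_1$, so that

\[
\begin{aligned}
\sum_{\lambda\in m_1}\sum_{\lambda'\in m_2} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2
&= \sum_{\lambda\in m_1}\ \sum_{\lambda'\in m_2,\ \lambda'\subset\lambda} \bigl(P(\psi_\lambda\psi_{\lambda'})\bigr)^2 \\
&= \sum_{\lambda\in m_1}\frac{1}{\mu(\lambda)} \sum_{\lambda'\in m_2,\ \lambda'\subset\lambda} (P\psi_{\lambda'})^2
 = d_{m_1}\sum_{\lambda'\in m_2}(P\psi_{\lambda'})^2 = d_{m_1}\|s_{m_2}\|^2 ,
\end{aligned}
\]

hence

\[
\beta(m_1,m_2) = d_{m_1}\|s_{m_2}\|^2 - 2P(s_{m_1}s_{m_2}) + \|s_{m_1}\|^2\|s_{m_2}\|^2 .
\]

On the other hand, by definition of $T_m$,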

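\[
\begin{aligned}
P(T_{m_1}s_{m_2})
&= \sum_{\lambda\in m_1}\sum_{\lambda'\in m_2} P\bigl((\psi_\lambda-P\psi_\lambda)^2\psi_{\lambda'}\bigr)\,P(\psi_{\lambda'}) \\
&= \sum_{\lambda\in m_1}\sum_{\lambda'\in m_2} \Bigl[P(\psi_\lambda^2\psi_{\lambda'})(P\psi_{\lambda'}) - 2P(\psi_\lambda\psi_{\lambda'})(P\psi_\lambda)(P\psi_{\lambda'}) + (P\psi_\lambda)^2(P\psi_{\lambda'})^2\Bigr] \\
&= P\Bigl(\underbrace{\sum_{\lambda\in m_1}\psi_\lambda^2}_{=d_{m_1}}\ \underbrace{\sum_{\lambda'\in m_2}(P\psi_{\lambda'})\psi_{\lambda'}}_{=s_{m_2}}\Bigr) - 2P(s_{m_1}s_{m_2}) + \|s_{m_1}\|^2\|s_{m_2}\|^2 ,
\end{aligned}
\]

which proves Eq. (60) since $P(s_{m_2})=\|s_{m_2}\|^2$.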

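Now, we remark that Eq. (60) also gives formulas for $\beta(m_i,m_i)$, $i\in\{1,2\}$, since $m_i$ is a subpartition of itself. So, the second formula for $\beta(m_i,m_j)$ in Eq. (60) yields

\[
\begin{aligned}
B(m_1,m_2) &= P\bigl(T_{m_1}s_{m_1} + T_{m_2}s_{m_2} - 2T_{m_1}s_{m_2}\bigr) \\
&= P\bigl(T_{m_1}(s_{m_1}-s_{m_2}) + (T_{m_2}-T_{m_1})s_{m_2}\bigr) .
\end{aligned}
\]

Similarly, the first formula for $\beta(m_i,m_j)$ in Eq. (60) gives

\[
\begin{aligned}
B(m_1,m_2)
&= d_{m_1}\bigl(\|s_{m_1}\|^2-\|s_{m_2}\|^2\bigr) + (d_{m_2}-d_{m_1})\|s_{m_2}\|^2 - 2P\bigl((s_{m_1}-s_{m_2})^2\bigr) + \bigl(\|s_{m_1}\|^2-\|s_{m_2}\|^2\bigr)^2 \\
&= (d_{m_2}-d_{m_1})\|s_{m_2}\|^2 - d_{m_1}\|s_{m_1}-s_{m_2}\|^2 - 2\operatorname{Var}_P(s_{m_1}-s_{m_2}) - \|s_{m_1}-s_{m_2}\|^4 ,
\end{aligned}
\]

where we used that $P(s_m)=\|s_m\|^2$ and that $\|s_{m_1}-s_{m_2}\|^2=\|s_{m_2}\|^2-\|s_{m_1}\|^2$, which holds by the Pythagorean theorem since $S_{m_1}\subset S_{m_2}$ when $m_2$ is a subpartition of $m_1$.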
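As a sanity check of Eq. (60), the two closed-form expressions can be compared with the definition of $\beta(m_1,m_2)$ on a toy example. The sketch below is only an illustration under stated assumptions: $\mu$ is the Lebesgue measure on $[0,1)$, the density $s$ is piecewise constant on a fine regular grid, and all function names (`bin_probs`, `beta_direct`, `beta_closed_form`, `beta_via_T`) are hypothetical helpers, not notation from the article. Exact rational arithmetic is used so that the comparison is an equality test.

```python
from fractions import Fraction

def bin_probs(cell_probs, d):
    """Mass P(l) of each of the d regular bins, from the masses of finer cells."""
    width = len(cell_probs) // d
    return [sum(cell_probs[k * width:(k + 1) * width]) for k in range(d)]

def beta_direct(cell_probs, d1, d2):
    """beta(m1, m2) from its definition: sum of squared covariances of basis functions."""
    g = len(cell_probs)
    p1, p2 = bin_probs(cell_probs, d1), bin_probs(cell_probs, d2)
    total = Fraction(0)
    for a in range(d1):
        for b in range(d2):
            # mass of the intersection of bin a of m1 and bin b of m2
            joint = sum(q for c, q in enumerate(cell_probs)
                        if c * d1 // g == a and c * d2 // g == b)
            # psi_l = sqrt(d) * 1_l, hence Cov^2 = d1 * d2 * (joint - P(l) P(l'))^2
            total += d1 * d2 * (joint - p1[a] * p2[b]) ** 2
    return total

def beta_closed_form(cell_probs, d1, d2):
    """First expression of Eq. (60); requires that d1 divides d2 (nested partitions)."""
    g = len(cell_probs)
    p1, p2 = bin_probs(cell_probs, d1), bin_probs(cell_probs, d2)
    norm1 = d1 * sum(q * q for q in p1)          # ||s_m1||^2
    norm2 = d2 * sum(q * q for q in p2)          # ||s_m2||^2
    cross = sum(q * (d1 * p1[c * d1 // g]) * (d2 * p2[c * d2 // g])
                for c, q in enumerate(cell_probs))  # P(s_m1 s_m2)
    return d1 * norm2 - 2 * cross + norm1 * norm2

def beta_via_T(cell_probs, d1, d2):
    """Second expression of Eq. (60): P(T_m1 s_m2)."""
    g = len(cell_probs)
    p1, p2 = bin_probs(cell_probs, d1), bin_probs(cell_probs, d2)
    sum_sq = sum(q * q for q in p1)
    total = Fraction(0)
    for c, q in enumerate(cell_probs):
        t1 = d1 * (1 - 2 * p1[c * d1 // g] + sum_sq)  # T_m1 on cell c
        s2 = d2 * p2[c * d2 // g]                     # s_m2 on cell c
        total += q * t1 * s2
    return total

# Toy density: 8 cells of [0, 1) with masses proportional to 1, ..., 8.
cells = [Fraction(k, 36) for k in range(1, 9)]
print(beta_direct(cells, 2, 4), beta_closed_form(cells, 2, 4), beta_via_T(cells, 2, 4))
```

The three printed values coincide when the second partition refines the first one (here $d_{m_1}=2$ divides $d_{m_2}=4$); the closed forms are not expected to match the definition for non-nested partitions.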

Appendix C. Results on MCCV and Some Other Cross-Validation Criteria

We prove here the results stated in Section 8.1. Note that we here prove slightly more general results (Theorems 23 and 24), from which Theorems 9 and 10 are corollaries. In particular, we do not always restrict to MCCV criteria: we always assume (SameSize) and (Ind) hold true, but we sometimes do not need to have (MCCV) satisfied.

C.1 Preliminary Computations

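Our proofs rely on a simple closed-form formula for cross-validation criteria. Let us start with the hold-out criterion. Let $T\subset\llbracket n\rrbracket$ with $|T|=n-p$, independent from $D_n$. Then,

\[
\begin{aligned}
\mathrm{crit}_{\mathrm{HO}}(m,T)
&= P_n^{(T^c)}\gamma\bigl(\widehat{s}_m^{(T)}\bigr)
 = \bigl\|\widehat{s}_m^{(T)}\bigr\|^2 - 2P_n^{(T^c)}\bigl(\widehat{s}_m^{(T)}\bigr) \\
&= \bigl\|\widehat{s}_m^{(T)}-s_m\bigr\|^2 + \|s_m\|^2 + 2\bigl\langle \widehat{s}_m^{(T)}-s_m,\ s_m\bigr\rangle
 - 2\bigl(P_n^{(T^c)}-P\bigr)\bigl(\widehat{s}_m^{(T)}-s_m\bigr)
 - 2P\bigl(\widehat{s}_m^{(T)}-s_m\bigr) - 2P_n^{(T^c)}(s_m) \\
&= \bigl\|\widehat{s}_m^{(T)}-s_m\bigr\|^2 - 2\bigl(P_n^{(T^c)}-P\bigr)\bigl(\widehat{s}_m^{(T)}-s_m\bigr) - 2P_n^{(T^c)}(s_m) + \|s_m\|^2 ,
\end{aligned}
\tag{61}
\]

where the last equality uses that

\[
P\bigl(\widehat{s}_m^{(T)}-s_m\bigr) = \bigl\langle \widehat{s}_m^{(T)}-s_m,\ s\bigr\rangle = \bigl\langle \widehat{s}_m^{(T)}-s_m,\ s_m\bigr\rangle
\]

since $s_m$ is the orthogonal projection in $L^2(\mu)$ of $s$ onto $S_m$ and $\widehat{s}_m^{(T)}-s_m\in S_m$. The last two terms in the right-hand side of Eq. (61) can be rewritten as

\[
-2P_n^{(T^c)}(s_m) + \|s_m\|^2
= -2\bigl(P_n^{(T^c)}-P\bigr)(s_m) - 2P(s_m) + \|s_m\|^2
= -2\bigl(P_n^{(T^c)}-P\bigr)(s_m) - \|s_m\|^2
\]

since $\|s_m\|^2=P(s_m)$. For the first two terms, we write that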

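\[
\begin{aligned}
&\bigl\|\widehat{s}_m^{(T)}-s_m\bigr\|^2 - 2\bigl(P_n^{(T^c)}-P\bigr)\bigl(\widehat{s}_m^{(T)}-s_m\bigr) \\
&\quad= \sum_{\lambda\in\Lambda_m}\Bigl[\bigl((P_n^{(T)}-P)(\psi_\lambda)\bigr)^2 - 2\bigl(P_n^{(T^c)}-P\bigr)(\psi_\lambda)\,\bigl(P_n^{(T)}-P\bigr)(\psi_\lambda)\Bigr] \\
&\quad= \sum_{\lambda\in\Lambda_m}\Bigl[\frac{1}{(n-p)^2}\sum_{1\le i,j\le n}\mathbf{1}_{i\in T,\,j\in T}\bigl(\psi_\lambda(\xi_i)-P\psi_\lambda\bigr)\bigl(\psi_\lambda(\xi_j)-P\psi_\lambda\bigr)
 - \frac{2}{p(n-p)}\sum_{1\le i,j\le n}\mathbf{1}_{i\in T^c,\,j\in T}\bigl(\psi_\lambda(\xi_i)-P\psi_\lambda\bigr)\bigl(\psi_\lambda(\xi_j)-P\psi_\lambda\bigr)\Bigr] \\
&\quad= \sum_{1\le i,j\le n}\frac{\mathbf{1}_{j\in T}}{n-p}\Bigl(\frac{\mathbf{1}_{i\in T}}{n-p} - \frac{2\,\mathbf{1}_{i\in T^c}}{p}\Bigr)U_m(\xi_i,\xi_j) ,
\end{aligned}
\]

where we recall that for any $x,y\in\mathcal{X}$,

\[
U_m(x,y) = \sum_{\lambda\in\Lambda_m}\bigl(\psi_\lambda(x)-P\psi_\lambda\bigr)\bigl(\psi_\lambda(y)-P\psi_\lambda\bigr)
= \sum_{\lambda\in\Lambda_m}\psi_\lambda(x)\psi_\lambda(y) - s_m(x) - s_m(y) + \|s_m\|^2
\]

is defined by Eq. (56), and that $U_m(x,x)=\Psi_m(x)-2s_m(x)+\|s_m\|^2$.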

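Therefore, Eq. (61) can be rewritten as

\[
\begin{aligned}
\mathrm{crit}_{\mathrm{HO}}(m,T)
&= \sum_{1\le i,j\le n}\frac{\mathbf{1}_{j\in T}}{n-p}\Bigl(\frac{\mathbf{1}_{i\in T}}{n-p} - \frac{2\,\mathbf{1}_{i\in T^c}}{p}\Bigr)U_m(\xi_i,\xi_j) - 2\bigl(P_n^{(T^c)}-P\bigr)(s_m) - \|s_m\|^2 \\
&= \sum_{1\le i,j\le n}\omega^{\mathrm{HO}}_{i,j}(T)\,U_m(\xi_i,\xi_j) + \sum_{i=1}^{n}\sigma^{\mathrm{HO}}_i(T)\bigl(s_m(\xi_i)-P(s_m)\bigr) - \|s_m\|^2
\end{aligned}
\tag{62}
\]

with

\[
\omega^{\mathrm{HO}}_{i,j}(T) = \frac{\mathbf{1}_{j\in T}}{n-p}\Bigl(\frac{\mathbf{1}_{i\in T}}{n-p} - \frac{2\,\mathbf{1}_{i\in T^c}}{p}\Bigr)
\qquad\text{and}\qquad
\sigma^{\mathrm{HO}}_i(T) = -\frac{2}{p}\,\mathbf{1}_{i\in T^c} .
\]

As a consequence, under assumption (SameSize),

\[
\mathrm{crit}_{\mathrm{CV}}\bigl(m,(T_K)_{1\le K\le B}\bigr)
= \sum_{1\le i,j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j) + \sum_{i=1}^{n}\sigma_i\bigl(s_m(\xi_i)-P(s_m)\bigr) - \|s_m\|^2
\tag{63}
\]

with

\[
\omega_{i,j} = \frac{1}{B}\sum_{K=1}^{B}\frac{\mathbf{1}_{j\in T_K}}{n-p}\Bigl(\frac{\mathbf{1}_{i\in T_K}}{n-p} - \frac{2\,\mathbf{1}_{i\in T_K^c}}{p}\Bigr)
\qquad\text{and}\qquad
\sigma_i = -\frac{2}{pB}\sum_{K=1}^{B}\mathbf{1}_{i\in T_K^c} .
\]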
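The hold-out identity (62) is purely algebraic, so it can be checked exactly on a small example. The sketch below is a minimal illustration under stated assumptions: the model is a regular histogram with `d` bins on $[0,1)$ with $\mu$ the Lebesgue measure, the true bin masses `probs` are known, and each observation is represented by its bin index only. The names `crit_ho_direct` and `crit_ho_closed_form` are hypothetical.

```python
from fractions import Fraction

def crit_ho_direct(bins, train, d):
    """||s_hat^(T)||^2 - 2 P_n^(T^c)(s_hat^(T)) for a regular histogram with d bins."""
    n = len(bins)
    test = [i for i in range(n) if i not in train]
    total = Fraction(0)
    for lam in range(d):
        f_train = Fraction(sum(bins[i] == lam for i in train), len(train))
        f_test = Fraction(sum(bins[i] == lam for i in test), len(test))
        # psi_lam = sqrt(d) * 1_lam, so products of two empirical means carry a factor d
        total += d * (f_train ** 2 - 2 * f_train * f_test)
    return total

def crit_ho_closed_form(bins, train, d, probs):
    """Right-hand side of Eq. (62), with the weights omega^HO and sigma^HO."""
    n = len(bins)
    n_train = len(train)
    p = n - n_train
    norm_sq = d * sum(q * q for q in probs)       # ||s_m||^2 = P(s_m)

    def s_m(i):
        return d * probs[bins[i]]                 # s_m evaluated at observation i

    def u_m(i, j):
        return d * (bins[i] == bins[j]) - s_m(i) - s_m(j) + norm_sq

    total = -norm_sq
    for i in range(n):
        in_train = i in train
        if not in_train:
            total += Fraction(-2, p) * (s_m(i) - norm_sq)       # sigma^HO term
        w_i = Fraction(1, n_train) if in_train else Fraction(-2, p)
        for j in train:                                         # omega^HO vanishes if j is not in T
            total += Fraction(1, n_train) * w_i * u_m(i, j)
    return total

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
bins = [0, 2, 1, 0, 3, 1, 0, 2]      # bin index of each of the n = 8 observations
train = {0, 1, 2, 3, 4}              # T, so that n - p = 5 and p = 3
print(crit_ho_direct(bins, train, 4), crit_ho_closed_form(bins, train, 4, probs))
```

Both printed values agree for any data set and any choice of `train`, since the identity does not rely on the observations being drawn from `probs`.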

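Note that Eq. (63) is consistent with previously obtained formulas. For V-fold cross-validation, under assumption (Reg), Eq. (63) holds with

\[
\omega_{i,j} = \omega^{\mathrm{VF}}_{i,j} := \frac{1}{n^2}\times
\begin{cases}
\dfrac{V}{V-1} & \text{if $i$ and $j$ belong to the same block} \\[2mm]
-\Bigl(\dfrac{V}{V-1}\Bigr)^2 & \text{otherwise}
\end{cases}
\qquad\text{and}\qquad
\sigma_i = \sigma_i^{\mathrm{VF}} := -\frac{2}{n} ,
\]

which can also be obtained from the combination of Eq. (8) in Lemma 1 and Eq. (58). For the leave-p-out, Eq. (63) holds with

\[
\omega_{i,j} = \omega^{\mathrm{LPO}}_{i,j} :=
\begin{cases}
\dfrac{1}{n(n-p)} & \text{if } i=j \\[2mm]
\dfrac{-(n-p+1)}{n(n-1)(n-p)} & \text{otherwise}
\end{cases}
\qquad\text{and}\qquad
\sigma_i = \sigma_i^{\mathrm{LPO}} := -\frac{2}{n} ,
\]

which can also be obtained from Eq. (10) in Lemma 1 and Eq. (58).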
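The weights of Eq. (63) can be computed directly from their definition for any finite family of training sets, which gives a quick way to check special cases. The sketch below is a minimal illustration with a hypothetical helper `cv_weights`; it treats V-fold cross-validation with consecutive blocks and the leave-p-out by enumerating all training sets of size $n-p$, so it is only practical for small $n$.

```python
from fractions import Fraction
from itertools import combinations

def cv_weights(n, train_sets):
    """omega[i][j] and sigma[i] of Eq. (63) for training sets of common size n - p."""
    n_sets = len(train_sets)
    n_train = len(train_sets[0])
    p = n - n_train
    omega = [[Fraction(0)] * n for _ in range(n)]
    sigma = [Fraction(0)] * n
    for train in train_sets:
        for i in range(n):
            if i in train:
                w_i = Fraction(1, n_train)
            else:
                w_i = Fraction(-2, p)
                sigma[i] += Fraction(-2, p * n_sets)
            for j in train:
                omega[i][j] += w_i / (n_train * n_sets)
    return omega, sigma

# V-fold with n = 6 and V = 3 consecutive blocks of size 2.
n, V = 6, 3
blocks = [set(range(k * n // V, (k + 1) * n // V)) for k in range(V)]
vf_omega, vf_sigma = cv_weights(n, [set(range(n)) - b for b in blocks])
print(vf_omega[0][1], Fraction(V, (V - 1) * n ** 2))           # same block
print(vf_omega[0][2], -Fraction(V, V - 1) ** 2 / n ** 2)       # different blocks
print(vf_sigma[0], Fraction(-2, n))

# Leave-p-out with n = 5 and p = 2: every training set of size n - p = 3.
lpo_omega, lpo_sigma = cv_weights(5, [set(c) for c in combinations(range(5), 3)])
print(lpo_omega[0][0], lpo_omega[0][1], lpo_sigma[0])
```

In the V-fold case each printed pair coincides, in line with the closed form stated above; the last line shows the diagonal weight, the off-diagonal weight and $\sigma_i$ obtained by enumeration for the leave-p-out.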

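Using that $U_m(x,x)=\Psi_m(x)-2s_m(x)+\|s_m\|^2$, Eq. (63) can be rewritten as

\[
\begin{aligned}
\mathrm{crit}_{\mathrm{CV}}\bigl(m,(T_K)_{1\le K\le B}\bigr)
&= \Bigl(\sum_{i=1}^{n}\omega_{i,i}\Bigr)\bigl(\mathcal{D}_m-\|s_m\|^2\bigr) - \|s_m\|^2
 + \sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr) \\
&\quad + \sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\bigl(s_m(\xi_i)-Ps_m\bigr)
 + \sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j) .
\end{aligned}
\]

Using (SameSize) we have

\[
\sum_{i=1}^{n}\omega_{i,i} = \frac{1}{B(n-p)^2}\sum_{K=1}^{B}\sum_{i=1}^{n}\mathbf{1}_{i\in T_K} = \frac{1}{n-p} ,
\]

and we get

\[
\begin{aligned}
\mathrm{crit}_{\mathrm{CV}}\bigl(m,(T_K)_{1\le K\le B}\bigr)
&= \frac{\mathcal{D}_m-\|s_m\|^2}{n-p} - \|s_m\|^2
 + \sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr) \\
&\quad + \sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\bigl(s_m(\xi_i)-Ps_m\bigr)
 + \sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j) .
\end{aligned}
\tag{64}
\]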

C.2 Concentration Inequalities

In the proof of Theorem 9 in Section C.3, given formula (64) for the cross-validation criterion, we need concentration inequalities for the three random sums appearing in Eq. (64).

These are stated and proved in three lemmas below.

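Concentration of $\sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr)$.

Lemma 20 Assume that (SameSize), (Ind) and (H1) hold true. Then, for any $x>0$, an event of probability at least $1-2e^{-x}$ exists on which the following holds true: for any $\varepsilon\in(0,1]$,

\[
\Bigl|\sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr)\Bigr|
\le \varepsilon\,\frac{\mathcal{D}_m}{n-p} + \frac{5x(n+A)}{3\varepsilon(n-p)^2} .
\]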

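Proof By (Ind), conditionally on $(\omega_{i,i})_{1\le i\le n}$, $\sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr)$ is a sum of independent real-valued random variables. So, we can apply Bernstein’s inequality.

First, for any $i\in\llbracket n\rrbracket$, using (SameSize),

\[
\omega_{i,i} = \frac{1}{B}\sum_{K=1}^{B}\frac{\mathbf{1}_{i\in T_K}}{(n-p)^2} \le \frac{1}{(n-p)^2}
\]

and using Eq. (49),

\[
\|\Psi_m\|_\infty \le \|U_m\|_\infty \le 2\bigl(b_m^2+\|s_m\|^2\bigr) ,
\]

so that

\[
\omega_{i,i}\Psi_m(\xi_i) \le \max_{1\le i\le n}\omega_{i,i}\times\|\Psi_m\|_\infty \le \frac{2\bigl(b_m^2+\|s_m\|^2\bigr)}{(n-p)^2} \quad\text{almost surely.}
\]

Second, using (SameSize), we have

\[
\sum_{i=1}^{n}\omega_{i,i}^2 \le \max_{1\le i\le n}\omega_{i,i}\times\sum_{i=1}^{n}\omega_{i,i} \le \frac{1}{(n-p)^3}
\]

and using Eq. (49) again,

\[
\mathbb{E}\bigl[\Psi_m(\xi_i)^2\bigr] \le \|\Psi_m\|_\infty\times P(\Psi_m) = \|\Psi_m\|_\infty\times\mathcal{D}_m \le 2\bigl(b_m^2+\|s_m\|^2\bigr)\mathcal{D}_m ,
\]

so that

\[
\sum_{i=1}^{n}\omega_{i,i}^2\,\mathbb{E}\bigl[\Psi_m(\xi_i)^2\bigr] \le \frac{2\bigl(b_m^2+\|s_m\|^2\bigr)\mathcal{D}_m}{(n-p)^3} .
\]

Then, by Bernstein’s inequality (Boucheron et al., 2013, Theorem 2.10), conditionally on $(\omega_{i,i})_{1\le i\le n}$, an event of probability at least $1-2e^{-x}$ exists on which

\[
\begin{aligned}
\Bigl|\sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr)\Bigr|
&\le 2\sqrt{\frac{x\bigl(b_m^2+\|s_m\|^2\bigr)\mathcal{D}_m}{(n-p)^3}} + \frac{2\bigl(b_m^2+\|s_m\|^2\bigr)}{(n-p)^2}\,\frac{x}{3} \\
&\le \varepsilon\,\frac{\mathcal{D}_m}{n-p} + \Bigl(\frac{2}{3}+\frac{1}{\varepsilon}\Bigr)\frac{x\bigl(b_m^2+\|s_m\|^2\bigr)}{(n-p)^2} \\
&\le \varepsilon\,\frac{\mathcal{D}_m}{n-p} + \frac{5}{3\varepsilon}\,\frac{x(n+A)}{(n-p)^2}
\end{aligned}
\]

for any $\varepsilon\in(0,1]$, where we used the inequality $2\sqrt{ab}\le\varepsilon a+b/\varepsilon$, that $b_m^2\le n$ by (H1), and that $\|s_m\|^2\le\|s\|^2\le\|s\|_\infty\le A$.

The result follows by integrating this conditional concentration inequality with respect to $(\omega_{i,i})_{1\le i\le n}$.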

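Concentration of $\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\bigl(s_m(\xi_i)-Ps_m\bigr)$.

Lemma 21 Assume that (SameSize) and (Ind) hold true. Then, for any $x>0$, an event of probability at least $1-e^{-x}$ exists on which the following holds true: for any $\varepsilon\in(0,1]$,

\[
\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\Bigl[\bigl(s_m(\xi_i)-Ps_m\bigr)-\bigl(s_{m'}(\xi_i)-Ps_{m'}\bigr)\Bigr]
\le \varepsilon\|s_m-s_{m'}\|^2 + R_n^{21}(x,\varepsilon,\pi^*,A)
\tag{65}
\]

where the remainder term depends on the additional assumption that we make. If (H2) holds true, then

\[
R_n^{21}(x,\varepsilon,\pi^*,A) := \frac{16Ax}{3\varepsilon}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr) .
\]

If (H1) and (H2′) hold true, then some numerical constant $\kappa>0$ exists such that

\[
R_n^{21}(x,\varepsilon,\pi^*,A) := \kappa\Bigl[\frac{Ax}{\varepsilon}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr) + \frac{x^2 n}{\varepsilon}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)^2\Bigr] .
\]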

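Before proving Lemma 21, let us introduce some useful notation: given a sequence $T_1,\ldots,T_B$ of subsets of $\llbracket n\rrbracket$, for every $i,j\in\llbracket n\rrbracket$, we define

\[
\pi_i = \frac{1}{B}\sum_{K=1}^{B}\mathbf{1}_{i\in T_K^c} , \qquad
\pi_{i,j} = \frac{1}{B}\sum_{K=1}^{B}\mathbf{1}_{i\in T_K^c}\mathbf{1}_{j\in T_K^c} \qquad\text{and}\qquad
\pi^* = \max_{i=1,\ldots,n}\pi_i .
\]

Note that, assuming (SameSize), we have

\[
0\le\pi_{i,j}\le\min(\pi_i,\pi_j)\le\pi^*\le 1 , \qquad
\sum_{i=1}^{n}\pi_i = p , \qquad
\sum_{i=1}^{n}\pi_{i,j} = p\,\pi_j \le p\,\pi^* \qquad\text{and}\qquad
\sum_{1\le i,j\le n}\pi_{i,j} = p^2 .
\tag{66}
\]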
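The identities in Eq. (66) only use that every $T_K^c$ has cardinality $p$, so they can be checked exactly on arbitrary training sets of a common size. The sketch below is an illustration with a hypothetical helper `pi_stats`; the random training sets are an arbitrary choice made for the example, not a requirement of the statement.

```python
import random
from fractions import Fraction

def pi_stats(n, train_sets):
    """Return (pi, pi2, pi_star) computed from the validation sets T_K^c."""
    n_sets = len(train_sets)
    pi = [Fraction(sum(i not in t for t in train_sets), n_sets) for i in range(n)]
    pi2 = [[Fraction(sum(i not in t and j not in t for t in train_sets), n_sets)
            for j in range(n)] for i in range(n)]
    return pi, pi2, max(pi)

def check_eq_66(n, p, n_sets, seed=0):
    """True if all relations of Eq. (66) hold for randomly drawn training sets."""
    rng = random.Random(seed)
    train_sets = [set(rng.sample(range(n), n - p)) for _ in range(n_sets)]
    pi, pi2, pi_star = pi_stats(n, train_sets)
    ordered = all(0 <= pi2[i][j] <= min(pi[i], pi[j]) <= pi_star <= 1
                  for i in range(n) for j in range(n))
    sum_pi = sum(pi) == p
    col_sums = all(sum(pi2[i][j] for i in range(n)) == p * pi[j] for j in range(n))
    total = sum(sum(row) for row in pi2) == p ** 2
    return ordered and sum_pi and col_sums and total

print(check_eq_66(n=10, p=3, n_sets=7))
```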

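Proof of Lemma 21 By (Ind), conditionally on $(-2\omega_{i,i}+\sigma_i)_{1\le i\le n}$,

\[
\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\Bigl[\bigl(s_m(\xi_i)-Ps_m\bigr)-\bigl(s_{m'}(\xi_i)-Ps_{m'}\bigr)\Bigr]
\]

is a sum of independent real-valued random variables. So, we can apply Bernstein’s inequality.

First, we notice that for every $i\in\llbracket n\rrbracket$,

\[
-2\omega_{i,i}+\sigma_i = \frac{1}{B}\sum_{K=1}^{B}\Bigl(\frac{-2}{(n-p)^2}\mathbf{1}_{i\in T_K} - \frac{2}{p}\mathbf{1}_{i\notin T_K}\Bigr)
= -2\Bigl(\frac{1-\pi_i}{(n-p)^2}+\frac{\pi_i}{p}\Bigr) ,
\]

hence

\[
|-2\omega_{i,i}+\sigma_i| = 2\Bigl(\frac{1-\pi_i}{(n-p)^2}+\frac{\pi_i}{p}\Bigr) \le 2\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)
\]

since $0\le\pi_i\le\pi^*\le 1$. So, for every $i\in\llbracket n\rrbracket$,

\[
\bigl|(-2\omega_{i,i}+\sigma_i)\bigl(s_m(\xi_i)-s_{m'}(\xi_i)\bigr)\bigr| \le 2\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\|s_m-s_{m'}\|_\infty \quad\text{almost surely.}
\]

Second,

\[
\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)^2
\le 2\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\sum_{i=1}^{n}|-2\omega_{i,i}+\sigma_i|
= 2\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\times 2\Bigl(\frac{1}{n-p}+1\Bigr)
\le 8\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)
\]

and

\[
\mathbb{E}\Bigl[\bigl(s_m(\xi)-s_{m'}(\xi)\bigr)^2\Bigr] \le \|s\|_\infty\|s_m-s_{m'}\|^2 \le A\|s_m-s_{m'}\|^2
\]

so that

\[
\sum_{i=1}^{n}\mathbb{E}\Bigl[\Bigl((-2\omega_{i,i}+\sigma_i)\bigl(s_m(\xi)-s_{m'}(\xi)\bigr)\Bigr)^2\Bigr]
\le 8A\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\|s_m-s_{m'}\|^2 .
\]

Then, by Bernstein’s inequality (Boucheron et al., 2013, Theorem 2.10), conditionally on $(-2\omega_{i,i}+\sigma_i)_{1\le i\le n}$, an event of probability at least $1-e^{-x}$ exists on which

\[
\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\Bigl[\bigl(s_m(\xi_i)-Ps_m\bigr)-\bigl(s_{m'}(\xi_i)-Ps_{m'}\bigr)\Bigr] \le R'(m,m') ,
\]

\[
R'(m,m') := \sqrt{16xA\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\|s_m-s_{m'}\|^2} + \frac{2x\|s_m-s_{m'}\|_\infty}{3}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr) .
\]

Since $1-e^{-x}$ is deterministic, the same inequality holds unconditionally on an event of probability at least $1-e^{-x}$.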

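We now upper bound $R'(m,m')$, differently depending on the assumption we make. On the one hand, if (H2) holds true,

\[
\|s_m-s_{m'}\|_\infty \le \|s_m\|_\infty+\|s_{m'}\|_\infty \le 2A
\]

and we get

\[
\begin{aligned}
R'(m,m') &\le \sqrt{16Ax\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\|s_m-s_{m'}\|^2} + \frac{4Ax}{3}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr) \\
&\le \varepsilon\|s_m-s_{m'}\|^2 + \frac{16Ax}{3\varepsilon}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)
\end{aligned}
\]

for any $\varepsilon\in(0,1]$, which proves Eq. (65). On the other hand, if (H1) and (H2′) hold true, $s_m-s_{m'}\in S_{m''}$ with $m''\in\{m,m'\}$, so that

\[
\|s_m-s_{m'}\|_\infty \le b_{m''}\|s_m-s_{m'}\| \le \sqrt{n}\,\|s_m-s_{m'}\|
\]

and we get

\[
\begin{aligned}
R'(m,m') &\le \sqrt{16xA\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)\|s_m-s_{m'}\|^2} + \frac{2x\sqrt{n}\,\|s_m-s_{m'}\|}{3}\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr) \\
&\le \varepsilon\|s_m-s_{m'}\|^2 + \frac{1}{\varepsilon}\Bigl[8Ax\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr) + \frac{2}{9}\,x^2 n\Bigl(\frac{1}{(n-p)^2}+\frac{\pi^*}{p}\Bigr)^2\Bigr]
\end{aligned}
\]

for any $\varepsilon\in(0,1]$, which proves Eq. (65) with $\kappa=8$.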

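Concentration of $\sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j)$.

Lemma 22 Suppose that assumptions (SameSize), (Ind) and (H1) hold true. Then, an absolute constant $\kappa>0$ exists such that, for any $x>1$, with probability larger than $1-6e^{-x}$, for any $\varepsilon\in(0,1]$,

\[
\Bigl|\sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j)\Bigr|
\le \varepsilon\,\frac{\mathcal{D}_m}{n-p} + \frac{\kappa n}{(n-p)^2}\Bigl(1+\frac{n\pi^*}{p}\Bigr)\Bigl(\frac{nAx}{\varepsilon(n-p)} + \Bigl(1+\frac{A}{n}\Bigr)\frac{x^2}{\varepsilon^3}\Bigr) .
\tag{67}
\]

Proof We start with the following symmetrization trick

\[
\begin{aligned}
\sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j)
&= \sum_{1\le i<j\le n}\bigl(\omega_{i,j}\,U_m(\xi_i,\xi_j)+\omega_{j,i}\,U_m(\xi_j,\xi_i)\bigr) \\
&= \sum_{1\le i<j\le n}(\omega_{i,j}+\omega_{j,i})\,U_m(\xi_i,\xi_j)
 = \sum_{1\le i\ne j\le n}\omega'_{i,j}\,U_m(\xi_i,\xi_j) ,
\end{aligned}
\]

where

\[
\begin{aligned}
\omega'_{i,j} = \frac{\omega_{i,j}+\omega_{j,i}}{2}
&= \frac{1}{(n-p)^2}\Bigl[1-(\pi_i+\pi_j)\frac{n}{p}+\Bigl(\frac{2n}{p}-1\Bigr)\pi_{i,j}\Bigr] \\
&= \frac{1}{(n-p)^2}\Bigl[(1-\pi_{i,j})+\frac{n}{p}(\pi_{i,j}-\pi_i)+\frac{n}{p}(\pi_{i,j}-\pi_j)\Bigr] .
\end{aligned}
\]

From the last formula for $\omega'_{i,j}$, using Eq. (66), we get that

\[
(\omega'_{i,j})^2 \le \frac{1}{(n-p)^4}\Bigl(1+\frac{n^2}{p^2}(\pi_i+\pi_j)^2\Bigr)
\tag{68}
\]

and

\[
\max_{i,j\in\llbracket n\rrbracket}|\omega'_{i,j}| \le \frac{1}{(n-p)^2}\Bigl(1+\frac{2n}{p}\pi^*\Bigr) .
\tag{69}
\]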
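The closed form of the symmetrized weights $\omega'_{i,j}$ and the bounds (68) and (69) can be checked exactly on small examples. The sketch below is only an illustration: the helper names are hypothetical, the training sets are drawn at random for convenience, and the weights $\omega_{i,j}$ are recomputed from their definition in Eq. (63).

```python
import random
from fractions import Fraction

def symmetrized_weights(n, train_sets):
    """Return (omega', pi, pi2), with omega' = (omega + omega^T) / 2 from the definition."""
    n_sets = len(train_sets)
    n_train = len(train_sets[0])
    p = n - n_train
    omega = [[Fraction(0)] * n for _ in range(n)]
    for train in train_sets:
        for i in range(n):
            w_i = Fraction(1, n_train) if i in train else Fraction(-2, p)
            for j in train:
                omega[i][j] += w_i / (n_train * n_sets)
    sym = [[(omega[i][j] + omega[j][i]) / 2 for j in range(n)] for i in range(n)]
    pi = [Fraction(sum(i not in t for t in train_sets), n_sets) for i in range(n)]
    pi2 = [[Fraction(sum(i not in t and j not in t for t in train_sets), n_sets)
            for j in range(n)] for i in range(n)]
    return sym, pi, pi2

def check_symmetrized(n, p, n_sets, seed=0):
    """True if the closed form of omega' and the bounds (68)-(69) hold for all i, j."""
    rng = random.Random(seed)
    train_sets = [set(rng.sample(range(n), n - p)) for _ in range(n_sets)]
    sym, pi, pi2 = symmetrized_weights(n, train_sets)
    ratio = Fraction(n, p)
    pi_star = max(pi)
    for i in range(n):
        for j in range(n):
            closed = (1 - (pi[i] + pi[j]) * ratio + (2 * ratio - 1) * pi2[i][j]) / (n - p) ** 2
            if sym[i][j] != closed:
                return False
            if sym[i][j] ** 2 > (1 + ratio ** 2 * (pi[i] + pi[j]) ** 2) / (n - p) ** 4:
                return False   # Eq. (68) violated
            if abs(sym[i][j]) > (1 + 2 * ratio * pi_star) / (n - p) ** 2:
                return False   # Eq. (69) violated
    return True

print(check_symmetrized(n=8, p=3, n_sets=5))
```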

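The concentration of the U-statistics follows from Houdré and Reynaud-Bouret (2003, Theorem 3.4), that is Eq. (44) with $g_{i,j}(\xi_i,\xi_j)=\omega'_{i,j}\,U_m(\xi_i,\xi_j)$. To apply this result, it remains to compute the terms $A$, $B$, $C$, $D$. First,

\[
2A^2 = \sum_{1\le i\ne j\le n}(\omega'_{i,j})^2\,\mathbb{E}\bigl[U_m(\xi_i,\xi_j)^2\bigr] \le \|s\|_\infty\mathcal{D}_m\sum_{1\le i\ne j\le n}(\omega'_{i,j})^2
\]

by Eq. (45). Algebraic computations and Eq. (68) and (66) show that

\[
\begin{aligned}
\sum_{1\le i\ne j\le n}(\omega'_{i,j})^2
&\le \frac{1}{(n-p)^4}\sum_{1\le i\ne j\le n}\Bigl(1+\frac{n^2}{p^2}(\pi_i+\pi_j)^2\Bigr) \\
&\le \frac{1}{(n-p)^4}\sum_{1\le i,j\le n}\Bigl(1+\frac{n^2}{p^2}\bigl(\pi^*\pi_i+\pi^*\pi_j+2\pi_i\pi_j\bigr)\Bigr)
 = \frac{n^2}{(n-p)^4}\Bigl(3+\frac{2\pi^* n}{p}\Bigr) .
\end{aligned}
\]

Hence,

\[
A \le \frac{n}{(n-p)^2}\sqrt{\Bigl(\frac{3}{2}+\frac{\pi^* n}{p}\Bigr)\|s\|_\infty\mathcal{D}_m} .
\]

Second, let $a_i$ and $b_j$ be functions such that $\sum_{i=1}^{n}\mathbb{E}\bigl[a_i(\xi)^2\bigr]\le 1$ and $\sum_{i=1}^{n}\mathbb{E}\bigl[b_i(\xi)^2\bigr]\le 1$.

Eq. (47) shows that

\[
\mathbb{E}\bigl[a_i(\xi)\,b_j(\xi')\,U_m(\xi,\xi')\bigr] \le \frac{\|s\|_\infty}{2}\Bigl(\mathbb{E}\bigl[a_i(\xi)^2\bigr]+\mathbb{E}\bigl[b_j(\xi)^2\bigr]\Bigr) ,
\]

hence, using Eq. (69),

\[
\begin{aligned}
B = \sum_{1\le i\ne j\le n}\omega'_{i,j}\,\mathbb{E}\bigl[a_i(\xi)\,b_j(\xi')\,U_m(\xi,\xi')\bigr]
&\le \max_{1\le i\ne j\le n}|\omega'_{i,j}|\ \frac{\|s\|_\infty}{2}\sum_{1\le i\ne j\le n}\Bigl(\mathbb{E}\bigl[a_i(\xi)^2\bigr]+\mathbb{E}\bigl[b_j(\xi)^2\bigr]\Bigr) \\
&\le \frac{n\|s\|_\infty}{(n-p)^2}\Bigl(1+\frac{2n}{p}\pi^*\Bigr) .
\end{aligned}
\]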

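Third, Eq. (48) shows that, for any $x\in\mathcal{X}$,

\[
\mathbb{E}\bigl[U_m(\xi,x)^2\bigr] \le 2\bigl(b_m^2+\|s_m\|^2\bigr)\|s\|_\infty
\]

and by Eq. (68) we have

\[
\sum_{i=2}^{n}(\omega'_{i,1})^2 \le \frac{1}{(n-p)^4}\sum_{i=2}^{n}\Bigl(1+(\pi_i+\pi_1)^2\frac{n^2}{p^2}\Bigr) \le \frac{n}{(n-p)^4}\Bigl(1+\frac{2\pi^* n}{p}\Bigr)^2 .
\]

So, for any $x\in\mathcal{X}$,

\[
\sum_{i=2}^{n}(\omega'_{i,1})^2\,\mathbb{E}\bigl[U_m(\xi,x)^2\bigr] \le 2\bigl(b_m^2+\|s_m\|^2\bigr)\|s\|_\infty\times\frac{n}{(n-p)^4}\Bigl(1+\frac{2\pi^* n}{p}\Bigr)^2 ,
\]

hence

\[
C \le \Bigl(1+\frac{2\pi^* n}{p}\Bigr)\frac{n}{(n-p)^2}\sqrt{\frac{2\bigl(b_m^2+\|s_m\|^2\bigr)\|s\|_\infty}{n}} .
\]

Fourth, using Eq. (49) and (69),

\[
D \le \max_{i,j\in\llbracket n\rrbracket}|\omega'_{i,j}|\ \sup_{x,y}|U_m(x,y)| \le \Bigl(1+\frac{2n}{p}\pi^*\Bigr)\frac{n}{(n-p)^2}\,\frac{2\bigl(b_m^2+\|s_m\|^2\bigr)}{n} .
\]

Now, we remark that $b_m^2\le n$ by (H1), and $\|s_m\|^2\le\|s\|^2\le\|s\|_\infty\le A$, and we can plug these two inequalities in the upper bounds above. By (Ind), we can apply Houdré and Reynaud-Bouret (2003, Theorem 3.4), conditionally on the weights $\omega_{i,j}$. We obtain that an absolute constant $\kappa>0$ exists such that, for any $x>1$, with probability larger than $1-6e^{-x}$, for any $\varepsilon\in(0,1]$, Eq. (67) holds true.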

C.3 Oracle Inequality (Proof of Theorem 9)

Theorem 9 actually is a corollary of the following general result.

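Theorem 23 Let $\xi_{\llbracket n\rrbracket}$ be i.i.d. real-valued random variables with common density $s$ with respect to $\mu$, such that $s\in L^\infty(\mu)$, $(T_K)_{1\le K\le B}$ be some sequence of subsets of $\llbracket n\rrbracket$ satisfying (SameSize) and (Ind), and $(S_m)_{m\in\mathcal{M}_n}$ be a collection of separable linear spaces satisfying (H1). Assume that either (H2) or (H2′) holds true. For every $m\in\mathcal{M}_n$, let $\widehat{s}_m$ be the estimator defined by Eq. (1), and $\widetilde{s}=\widehat{s}_{\widehat{m}}$ where

\[
\widehat{m}\in\operatorname*{argmin}_{m\in\mathcal{M}_n}\Bigl\{\mathrm{crit}_{\mathrm{CV}}\bigl(m,(T_K)_{1\le K\le B}\bigr)\Bigr\}
\]

and $\mathrm{crit}_{\mathrm{CV}}$ is defined by Eq. (25). Define $\pi^*=\max_{i=1,\ldots,n}\frac{1}{B}\sum_{K=1}^{B}\mathbf{1}_{i\in T_K^c}$ and for any $x,\varepsilon,\kappa>0$,

\[
\rho_4(\varepsilon,x,\kappa,n,\tau_n,\pi^*,A) := \frac{\kappa}{n\tau_n^2}\Bigl(1+\frac{\pi^*}{1-\tau_n}\Bigr)^{\alpha}\Bigl(\frac{Ax}{\varepsilon\tau_n}+\frac{(A\vee 1)x^2}{\varepsilon^3}\Bigr)
\]

with $\alpha=1$ under assumption (H2) and $\alpha=2$ under assumption (H2′). Then, an absolute constant $\kappa>0$ exists such that, for any $x>0$, with probability at least $1-12|\mathcal{M}_n|^2e^{-x}$, for any $\varepsilon\in(0,\kappa^{-1})$,

\[
\Bigl(1-\frac{\varepsilon}{\tau_n}\Bigr)\|\widetilde{s}-s\|^2 \le \frac{1+\varepsilon}{\tau_n}\inf_{m\in\mathcal{M}_n}\bigl\{\|\widehat{s}_m-s\|^2\bigr\} + \rho_4(\varepsilon,x,\kappa,n,\tau_n,\pi^*,A) .
\]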

The oracle inequality of Theorem 23 is similar to the one of Theorem 5, with $\delta$ replaced by $1/\tau_n-1$ (both quantities correspond to the bias of the criterion as an estimator of the risk) and a slightly different remainder term. In addition to the remarks already made about Theorem 5, we can make the following comments.

• The remainder term $\rho_4$ is of order $x^2/n$, as in Theorem 5, under the following sufficient conditions: (i) $\tau_n$ stays away from 0, (ii) $\pi^*/(1-\tau_n)$ is bounded.

• For V-fold criteria, $\tau_n=(V-1)/V\ge 1/2$ and $\pi^*/(1-\tau_n)=1$, so conditions (i) and (ii) are satisfied and we recover an oracle inequality for V-fold cross-validation similar to Theorem 5.

• The leading constant in front of the oracle inequality of Theorem 23 is of order $1/\tau_n$, so we can get asymptotic optimality only if $\tau_n\to 1$, that is, $p\ll n$. This is consistent with the fact that the bias of the cross-validation criterion is negligible at first order if and only if $\tau_n\to 1$.

• For hold-out criteria, $\pi^*=1$ so the remainder term is of order $x^2/(n(1-\tau_n)^{\alpha})\ge x^2/p$, which is large when $\tau_n$ is close to 1, that is, when $p$ is small. Hence, for such criteria, we cannot get both a leading constant close to 1 and a “small” remainder term.

Let us now explain why Theorem 9 is also a corollary of Theorem 23.

Proof of Theorem 9 We only have to prove some upper bound on $\pi^*$ under assumption (MCCV), thanks to which Theorem 9 is a straightforward corollary of Theorem 23.

By (SameSize) and (MCCV), for any $i\in\llbracket n\rrbracket$, $\pi_i$ is the empirical mean of $B$ independent Bernoulli random variables with common parameter $\mathbb{P}(i\in T_K^c)=p/n$. Then, by Bernstein’s inequality (Boucheron et al., 2013, Theorem 2.10),

\[
\forall y>0,\ \forall i\in\llbracket n\rrbracket, \qquad
\mathbb{P}\Bigl(\pi_i-\frac{p}{n} > \sqrt{\frac{2p(n-p)y}{n^2B}}+\frac{y}{3B}\Bigr) \le e^{-y} .
\]

A union bound over $i\in\llbracket n\rrbracket$, applied with $y=\log n+x$, yields that for any $x>0$,

\[
\mathbb{P}\Bigl(\pi^*\le 1\wedge\Bigl(\frac{2p}{n}+\frac{\log n+x}{B}\Bigr)\Bigr) \ge 1-e^{-x} ,
\]

where we used also that $\pi^*\le 1$ almost surely. Theorem 9 follows.
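The bound on $\pi^*$ can be illustrated by simulation. The sketch below is a rough illustration, not part of the proof: it assumes that (MCCV) means the validation sets $T_K^c$ are drawn independently and uniformly among the subsets of $\llbracket n\rrbracket$ of size $p$, and the function names (`mccv_pi_star`, `violation_rate`) are hypothetical. It estimates how often $\pi^*$ exceeds $1\wedge(2p/n+(\log n+x)/B)$, a frequency that should stay below $e^{-x}$.

```python
import math
import random

def mccv_pi_star(n, p, n_splits, rng):
    """pi* = max_i of the fraction of splits in which i falls in the validation set."""
    counts = [0] * n
    for _ in range(n_splits):
        for i in rng.sample(range(n), p):   # validation set T_K^c, uniform of size p
            counts[i] += 1
    return max(counts) / n_splits

def violation_rate(n, p, n_splits, x, reps, seed=0):
    """Empirical frequency of the event {pi* > bound}, together with the bound itself."""
    rng = random.Random(seed)
    bound = min(1.0, 2 * p / n + (math.log(n) + x) / n_splits)
    hits = sum(mccv_pi_star(n, p, n_splits, rng) > bound for _ in range(reps))
    return hits / reps, bound

rate, bound = violation_rate(n=20, p=5, n_splits=50, x=1.0, reps=200)
print(rate, bound, math.exp(-1.0))
```

Because the proof goes through a union bound, the observed violation frequency is typically much smaller than $e^{-x}$.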

We finally prove Theorem 23.

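Proof of Theorem 23 Throughout the proof, $L$ denotes some positive numerical constant, whose value may change from line to line. Given Eq. (64), the proof relies on concentration inequalities that are detailed in Section C.2. Let us fix $x>0$ and define for every $\kappa>1$ the event $\Omega_{\mathrm{good}}(\kappa,x)$ where all the following inequalities hold for any $m,m'\in\mathcal{M}_n$ and any $\varepsilon\in(0,1]$:

\[
\begin{aligned}
\Bigl|\sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr)\Bigr| &\le \varepsilon\,\frac{\mathcal{D}_m}{n-p}+\frac{\kappa(n+A)x}{\varepsilon(n-p)^2} \\
\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\Bigl[\bigl(s_m(\xi_i)-Ps_m\bigr)-\bigl(s_{m'}(\xi_i)-Ps_{m'}\bigr)\Bigr] &\le \varepsilon\|s_m-s_{m'}\|^2+R_n^{21}(x,\varepsilon,\pi^*,A) \\
\Bigl|\sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j)\Bigr| &\le \varepsilon\,\frac{\mathcal{D}_m}{n-p}+\frac{\kappa n}{(n-p)^2}\Bigl(1+\frac{\pi^* n}{p}\Bigr)\Bigl(\frac{nAx}{\varepsilon(n-p)}+\frac{(n+A)x^2}{\varepsilon^3 n}\Bigr) \\
\Bigl|\|\widehat{s}_m-s_m\|^2-\frac{\mathcal{D}_m}{n}\Bigr| &\le \varepsilon\,\frac{\mathcal{D}_m}{n}+\frac{\kappa Ax^2}{\varepsilon^3 n} .
\end{aligned}
\]

It follows from Lemmas 14, 20, 21 and 22 that an absolute constant $\kappa>0$ exists such that $\mathbb{P}\bigl(\Omega_{\mathrm{good}}(\kappa,x)\bigr)\ge 1-|\mathcal{M}_n|^2e^{-x}-10|\mathcal{M}_n|e^{-x}$. Let us remark that we can assume $x>\log(11)>1$ in the following, since otherwise the above probability bound is negative.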

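On $\Omega_{\mathrm{good}}(\kappa,x)$, for every $m\in\mathcal{M}_n$ and $\varepsilon\in(0,1)$,

\[
\frac{\mathcal{D}_m}{n} \le \frac{1}{1-\varepsilon}\|\widehat{s}_m-s_m\|^2+\frac{LAx^2}{\varepsilon^3(1-\varepsilon)n} .
\tag{70}
\]

By definition of $\widehat{m}$, for every $m\in\mathcal{M}_n$,

\[
\|\widehat{s}_{\widehat{m}}-s\|^2 \le \|\widehat{s}_m-s\|^2+\bigl(\mathrm{crit}_{\mathrm{CV}}(m)-\|\widehat{s}_m-s\|^2\bigr)-\bigl(\mathrm{crit}_{\mathrm{CV}}(\widehat{m})-\|\widehat{s}_{\widehat{m}}-s\|^2\bigr) .
\tag{71}
\]

In addition, by Eq. (64),

\[
\begin{aligned}
\mathrm{crit}_{\mathrm{CV}}(m)-\|\widehat{s}_m-s\|^2
&= \frac{\mathcal{D}_m-\|s_m\|^2}{n-p}-\underbrace{\bigl(\|s_m\|^2+\|s_m-s\|^2\bigr)}_{=\|s\|^2}-\frac{\mathcal{D}_m}{n}
 +\sum_{i=1}^{n}\omega_{i,i}\bigl(\Psi_m(\xi_i)-\mathcal{D}_m\bigr) \\
&\quad+\sum_{i=1}^{n}(-2\omega_{i,i}+\sigma_i)\bigl(s_m(\xi_i)-Ps_m\bigr)
 +\sum_{1\le i\ne j\le n}\omega_{i,j}\,U_m(\xi_i,\xi_j)
 -\Bigl(\|\widehat{s}_m-s_m\|^2-\frac{\mathcal{D}_m}{n}\Bigr) .
\end{aligned}
\]

So, on $\Omega_{\mathrm{good}}(\kappa,x)$, for every $m,m'\in\mathcal{M}_n$ and $\varepsilon\in(0,1/5)$,

\[
\begin{aligned}
&\bigl(\mathrm{crit}_{\mathrm{CV}}(m)-\|\widehat{s}_m-s\|^2\bigr)-\bigl(\mathrm{crit}_{\mathrm{CV}}(m')-\|\widehat{s}_{m'}-s\|^2\bigr) \\
&\quad\le \mathcal{D}_m\Bigl[\frac{1+2\varepsilon}{n-p}-\frac{1-\varepsilon}{n}\Bigr]_+ +\mathcal{D}_{m'}\Bigl[\frac{1+\varepsilon}{n}-\frac{1-2\varepsilon}{n-p}\Bigr]_+ +\varepsilon\|s_m-s_{m'}\|^2+R_n^{21}(x,\varepsilon,\pi^*,A) \\
&\qquad+\frac{Ln}{(n-p)^2}\Bigl(1+\frac{\pi^* n}{p}\Bigr)\Bigl(\frac{nAx}{\varepsilon(n-p)}+\frac{(A\vee 1)x^2}{\varepsilon^3}\Bigr)+\frac{\|s_{m'}\|^2}{n-p} \\
&\quad\le \frac{n}{1-\varepsilon}\Bigl[\frac{1+2\varepsilon}{n-p}-\frac{1-\varepsilon}{n}\Bigr]_+\|\widehat{s}_m-s_m\|^2+\frac{n}{1-\varepsilon}\Bigl[\frac{1+\varepsilon}{n}-\frac{1-2\varepsilon}{n-p}\Bigr]_+\|\widehat{s}_{m'}-s_{m'}\|^2 \\
&\qquad+2\varepsilon\|s_m-s\|^2+2\varepsilon\|s_{m'}-s\|^2+R_n^{21}(x,\varepsilon,\pi^*,A)+\frac{Ln}{(n-p)^2}\Bigl(1+\frac{\pi^* n}{p}\Bigr)\Bigl(\frac{nAx}{\varepsilon(n-p)}+\frac{(A\vee 1)x^2}{\varepsilon^3}\Bigr) \\
&\quad\le \max\Bigl\{\frac{1}{1-\varepsilon}\Bigl[\frac{1}{\tau_n}-1+\varepsilon+\frac{2\varepsilon}{\tau_n}\Bigr]_+,\ 2\varepsilon\Bigr\}\|\widehat{s}_m-s\|^2
 +\max\Bigl\{\frac{1}{1-\varepsilon}\Bigl[1-\frac{1}{\tau_n}+\varepsilon+\frac{2\varepsilon}{\tau_n}\Bigr]_+,\ 2\varepsilon\Bigr\}\|\widehat{s}_{m'}-s\|^2 \\
&\qquad+R_n^{21}(x,\varepsilon,\pi^*,A)+\frac{L}{n\tau_n^2}\Bigl(1+\frac{\pi^*}{1-\tau_n}\Bigr)\Bigl(\frac{Ax}{\varepsilon\tau_n}+\frac{(A\vee 1)x^2}{\varepsilon^3}\Bigr) \\
&\quad\le \Bigl(\frac{1}{\tau_n}-1+\frac{L\varepsilon}{\tau_n}\Bigr)\|\widehat{s}_m-s\|^2+\frac{4\varepsilon}{\tau_n}\|\widehat{s}_{m'}-s\|^2
 +R_n^{21}(x,\varepsilon,\pi^*,A)+\frac{L}{n\tau_n^2}\Bigl(1+\frac{\pi^*}{1-\tau_n}\Bigr)\Bigl(\frac{Ax}{\varepsilon\tau_n}+\frac{(A\vee 1)x^2}{\varepsilon^3}\Bigr) ,
\end{aligned}
\]

where the second inequality uses Eq. (70) and $\|s_m-s_{m'}\|^2\le 2\|s_m-s\|^2+2\|s_{m'}-s\|^2$, and the third one uses that $\tau_n=(n-p)/n$ and $\|\widehat{s}_m-s\|^2=\|\widehat{s}_m-s_m\|^2+\|s_m-s\|^2$.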