Annales de l'Institut Henri Poincaré – Probabilités et Statistiques, 2012, Vol. 48, No. 3, 884–908
www.imstat.org/aihp
DOI:10.1214/11-AIHP425
© Association des Publications de l’Institut Henri Poincaré, 2012
Optimal model selection in density estimation
Matthieu Lerasle
1 INSA Toulouse, Institut de Mathématiques de Toulouse UMR CNRS 5219, IME-USP, São Paulo, Brasil. E-mail: lerasle@gmail.com. Received 7 October 2009; revised 24 January 2011; accepted 15 March 2011.
Abstract. In order to calibrate a penalization procedure for model selection, the statistician has to choose a shape for the penalty and a leading constant. In this paper, we study, for the marginal density estimation problem, the resampling penalties as general estimators of the shape of an ideal penalty. We prove that the selected estimator satisfies sharp oracle inequalities without remainder terms under a few assumptions on the marginal density $s$ and the collection of models. We also study the slope heuristic, which yields a data-driven choice of the leading constant in front of the penalty when the complexity of the models is well chosen.
Résumé. A penalization procedure for model selection rests on the construction of a shape for the penalty and on the choice of a calibration constant. In this article, we study, for the density estimation problem, the penalties obtained by resampling ideal penalties. We show the efficiency of these procedures for estimating the shape of the penalties by proving, for the selected estimators, sharp oracle inequalities without remainder terms; the results hold under weak assumptions on both the unknown density $s$ and the collections of models. These penalties are moreover easy to calibrate, since the asymptotically optimal constant can be computed as a function of the resampling weights.
In practice the number of observations is always finite, so we also study the slope heuristic and justify the slope algorithm, which calibrates the leading constant from the data.
MSC: 62G07; 62G09
Keywords: Density estimation; Optimal model selection; Resampling methods; Slope heuristic
1. Introduction
Model selection by penalization of an empirical loss is a general approach that includes famous procedures such as AIC [1,2], BIC [24], Mallows' $C_p$ [19], cross-validation [23] and hard thresholding [14], as shown by [6]. The objective is to select an estimator satisfying an oracle inequality. In order to achieve this goal, the statistician has to evaluate the shape of a good penalty and the constant in front of it.
In this paper, we study theoretically, in a density estimation framework, the slope heuristic of Birgé and Massart [10]. There are two main reasons for this. First, it provides a general shape of the penalty under few restrictions on the density $s$ and the collection of models; for example, these models can be of infinite dimension. Second, it gives the precise behavior of the selected model when the leading constant increases. A remarkable fact is that the complexity of the selected model is as large as possible until the leading constant reaches some particular point $K_{\min}$. The complexity then decreases sharply, the selected model becomes more reasonable and the estimator avoids overfitting the data. Another remarkable fact is that the model selected with a leading constant equal to $2K_{\min}$ is a sharp oracle.
The heuristic can then be used to justify the slope algorithm, introduced in [5], which makes it possible to evaluate in practice the leading constant in front of a penalty (see Section 2.1 for details). The calibration of this constant is usually an issue
1 Supported in part by FAPESP Processo 2009/09494-0.
for the statistician. The upper bounds given in theoretical theorems, when computable, are in general too pessimistic, and cross-validation methods have been used to overcome this problem in simulations (see for example [16]).
The slope algorithm chooses in general a constant more reasonable than the theoretical one and ensures that the chosen model is of reasonable size. The first main contribution of the paper is a proof of the slope heuristic for density estimation, for general collections of models. Theorems 3.1 and 3.2 extend the results of [10] in Gaussian regression and those of [5] in non-Gaussian heteroscedastic regression over histograms.
The penalty shape obtained in the slope heuristic is not computable in general by the statistician. This is why we also study in this paper the resampling penalties. These penalties were defined by Arlot [4], following Efron's heuristic [15], as natural estimators of the "ideal penalty." We prove that the selected estimators satisfy sharp oracle inequalities, without extra assumptions on $s$ or the collection of models. This extends the results of Arlot [4] in non-Gaussian heteroscedastic regression over histograms. We also prove that they provide sharp estimators of the penalty shape proposed in the slope theorems. Hence, they can be used together with the slope algorithm in the general framework presented in Section 2.
Resampling penalties and the slope heuristic can be defined in a more general statistical learning framework, including the problems of classification and regression (see [4,5]). Our results are therefore contributions to the theoretical understanding of these generic methods. To the best of our knowledge, they are the first obtained in density estimation.
The oracle approach is now classical in statistical learning in general and in density estimation in particular. Oracle inequalities can be derived, for example, from $\ell_1$-penalization methods [12], aggregation procedures [22], the blockwise Stein method [21] or using $T$-estimators [8]. To the best of our knowledge, none of these methods yields oracle inequalities without remainder terms and with a leading constant asymptotically equal to one at the level of generality presented in this paper. For example, our results are valid for data taking values in any metric space and the models can be of infinite dimension. The results of [8] hold for infinite-dimensional models, but the estimators are not computable in practice.
The paper is organized as follows. Section 2 presents the notations and the main definitions. In Section 3, we state our main results, that is, the slope heuristic and the oracle inequality satisfied by the estimator selected by resampling penalties. In Section 4, we compute the rates of convergence in the oracle inequalities for classical collections of models. The proofs of the main results are postponed to Section 5. In the Appendix, we prove technical lemmas, in particular all the concentration inequalities required in the main proofs.
2. Notations
Let $X_1,\dots,X_n$ be i.i.d. random variables, defined on a probability space $(\Omega,\mathcal A,\mathbb P)$, taking values in a measurable space $(\mathbb X,\mathcal X)$, with common law $P$. Let $\mu$ be a known measure on $(\mathbb X,\mathcal X)$ and let $L^2(\mu)$ be the space of square-integrable real-valued functions defined on $\mathbb X$. The space $L^2(\mu)$ is endowed with the scalar product defined for all $t,t'$ in $L^2(\mu)$ by
$$\langle t,t'\rangle=\int_{\mathbb X}t(x)t'(x)\,\mathrm d\mu(x)$$
and the associated $L^2$-norm $\|\cdot\|$, defined for all $t$ in $L^2(\mu)$ by $\|t\|=\sqrt{\langle t,t\rangle}$. We assume that $P$ is absolutely continuous with respect to $\mu$ and we want to estimate the density $s$ of $P$ with respect to $\mu$. We assume that $s$ belongs to $L^2(\mu)$ and we measure the risk of an estimator $\hat s$ of $s$ by the $L^2$-loss $\|s-\hat s\|^2$. For all functions $t$ in $L^1(P)$, let
$$Pt=\int_{\mathbb X}t\,\mathrm dP=\int_{\mathbb X}ts\,\mathrm d\mu=\langle t,s\rangle,\qquad P_nt=\frac1n\sum_{i=1}^nt(X_i),\qquad\nu_nt=(P_n-P)t.$$
We estimate $s$ by minimizing a penalized empirical loss. Let $(S_m,\ m\in\mathcal M_n)$ be a finite collection of linear spaces of measurable, real-valued functions. For all $m$ in $\mathcal M_n$, let $\operatorname{pen}(m)$ be a positive number and let $\hat s_m$ be the least-squares estimator, defined by
$$\hat s_m=\arg\min_{t\in S_m}\bigl\{\|t\|^2-2P_nt\bigr\}.\tag{1}$$
The final estimator is given by $\tilde s=\hat s_{\hat m}$, where
$$\hat m\in\arg\min_{m\in\mathcal M_n}\bigl\{\|\hat s_m\|^2-2P_n\hat s_m+\operatorname{pen}(m)\bigr\}.\tag{2}$$
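For models spanned by an orthonormal basis, such as regular histograms, the minimizer in (1) has explicit coefficients $P_n\psi_\lambda$, so the selection rule (2) is directly computable. The following sketch (not from the paper: the Beta-distributed sample, the grid of dimensions and the toy penalty $2d/n$ are illustrative assumptions) shows the procedure on regular histogram models over $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=500)  # i.i.d. sample; its density plays the role of s

def ls_criterion(X, d):
    """Empirical criterion ||s_hat_m||^2 - 2 P_n s_hat_m from (1) for the
    regular histogram model with d pieces on [0, 1] (mu = Lebesgue).
    With an orthonormal basis, the minimizer has coefficients P_n psi_lambda
    and the criterion equals -sum_lambda (P_n psi_lambda)^2."""
    counts, _ = np.histogram(X, bins=d, range=(0.0, 1.0))
    coeffs = counts / len(X) * np.sqrt(d)  # P_n psi_lambda, psi = sqrt(d) 1_I
    return -np.sum(coeffs ** 2)

def select_model(X, dims, pen):
    """Selection rule (2): minimize the criterion plus the penalty."""
    return min(dims, key=lambda d: ls_criterion(X, d) + pen(d))

# Toy penalty 2 d / n, the shape suggested later by the slope heuristic.
d_hat = select_model(X, range(1, 51), lambda d: 2.0 * d / len(X))
print(d_hat)
```

The criterion only requires bin counts, so the whole selection runs in $O(n\,|\mathcal M_n|)$ time here.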
We say that $\tilde s$ satisfies an oracle inequality without remainder term when there exists a bounded sequence $C_n$ such that
$$\|\tilde s-s\|^2\le C_n\inf_{m\in\mathcal M_n}\|\hat s_m-s\|^2.$$
We say that the oracle inequality is sharp, or optimal, when moreover $C_n\to1$ as $n\to\infty$. An oracle minimizes over $\mathcal M_n$ the quantity
$$\|\hat s_m-s\|^2-\|s\|^2=\|\hat s_m\|^2-2P\hat s_m=\|\hat s_m\|^2-2P_n\hat s_m+\operatorname{pen}_{\mathrm{id}}(m).$$
In the previous equality, the "ideal" penalty (see [4]) is defined by $\operatorname{pen}_{\mathrm{id}}(m)=2\nu_n\hat s_m$.
The ideal penalty is the central object in this paper. We will prove that the slope algorithm can be used with a penalty shape proportional to its expectation and that resampling penalties provide sharp estimators of this expectation.
2.1. The slope heuristic
The "slope heuristic" was introduced by Birgé and Massart [10] in the Gaussian regression framework and developed into a general algorithm by Arlot and Massart [5]. Let $(\Delta_m)_{m\in\mathcal M_n}$ be a collection of complexity measures of the models. The heuristic states that there exists a constant $K_{\min}$ satisfying the following properties.
SH1 When $\operatorname{pen}(m)<K_{\min}\Delta_m$, then $\Delta_{\hat m}$ is too large, typically $\Delta_{\hat m}\ge C\max_{m\in\mathcal M_n}\Delta_m$.
SH2 When $\operatorname{pen}(m)\ge(K_{\min}+\delta)\Delta_m$ for some $\delta>0$, then $\Delta_{\hat m}$ is much smaller.
SH3 When $\operatorname{pen}(m)=2K_{\min}\Delta_m$, the selected estimator is optimal.
In this paper, we prove SH1–SH3 for $\Delta_m=\mathbb E(\operatorname{pen}_{\mathrm{id}}(m))$ and $K_{\min}=1$. Besides the theoretical implications that we discuss later, the heuristic is of particular interest in the following situation. Imagine that there exist, for all $m$ in $\mathcal M_n$, a quantity $\Delta_m$ computable by the statistician and an unknown constant $K_u$ such that, for some $\delta<K_u$,
$$(K_u-\delta)\Delta_m\le\mathbb E\bigl(\operatorname{pen}_{\mathrm{id}}(m)\bigr)\le(K_u+\delta)\Delta_m.$$
In that case, it follows from SH3 that a penalty of the form $K\Delta_m$ yields a good procedure if $K$ is large enough. In order to choose $K$ in a data-driven way, we can use the following algorithm (see Arlot and Massart [5]).
Slope algorithm
SA1 For all $K>0$, compute the selected model $\hat m(K)$ given by (2) with the penalty $\operatorname{pen}(m)=K\Delta_m$ and the associated complexity $\Delta_{\hat m(K)}$.
SA2 Find a constant $\tilde K_{\min}$ such that $\Delta_{\hat m(K)}$ is large when $K<\tilde K_{\min}$ and "much smaller" when $K>\tilde K_{\min}$.
SA3 Take the final $\hat m=\hat m(2\tilde K_{\min})$.
The idea is that $\tilde K_{\min}\simeq K_{\min}$ because we observe a jump in the complexity of the selected model at this point (SH1 and SH2). Hence, the model selected by $2\tilde K_{\min}\Delta_m\simeq2K_{\min}\Delta_m$ should be optimal thanks to SH3. By construction, the slope algorithm ensures that the selected model has a reasonable size. It prevents the statistician from choosing too large a model, which could have terrible consequences (see Theorem 3.1). The words "much smaller," borrowed from [5,10], are not very precise here. We refer to [5], Section 3.3, for a detailed discussion of what "much smaller" means in this context and for precise suggestions on the implementation of the slope algorithm. We also refer to [7] for a practical implementation of the slope algorithm.
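As an illustration only (the data, the grid of constants and the jump-detection rule below are our own simplifying assumptions, not the paper's prescriptions; see [5] and [7] for serious implementations), steps SA1–SA3 can be sketched as follows for regular histogram models with $\Delta_m=d_m/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.beta(2.0, 5.0, size=500)
n, dims = len(X), np.arange(1, 101)

def crit(X, d):
    # Least-squares criterion -sum_lambda (P_n psi_lambda)^2, regular d-bin model.
    counts, _ = np.histogram(X, bins=d, range=(0.0, 1.0))
    return -np.sum((counts / len(X)) ** 2) * d

base = np.array([crit(X, d) for d in dims])

def d_hat(K):
    # SA1: model selected by (2) with penalty K * Delta_m, here Delta_m = d / n.
    return int(dims[np.argmin(base + K * dims / n)])

# SA2: scan K and locate the largest jump of the selected complexity.
Ks = np.linspace(0.01, 4.0, 400)
sel = np.array([d_hat(K) for K in Ks], dtype=float)
K_min = Ks[np.argmax(np.abs(np.diff(sel))) + 1]

# SA3: the final model uses twice the estimated minimal constant.
d_final = d_hat(2.0 * K_min)
print(K_min, d_final)
```

Taking the single largest jump is one crude way to formalize "much smaller"; [5], Section 3.3, discusses more robust rules.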
2.2. Resampling penalties
Data-driven penalties have already been used in density estimation, in particular cross-validation methods as in Stone [25], Rudemo [23] or Celisse [13]. We are interested here in the resampling penalties introduced by Arlot [4]. Let $(W_1,\dots,W_n)$ be a resampling scheme, i.e. a vector of random variables independent of $X,X_1,\dots,X_n$ and exchangeable, that is, for all permutations $\tau$ of $(1,\dots,n)$,
$$(W_1,\dots,W_n)\text{ has the same law as }(W_{\tau(1)},\dots,W_{\tau(n)}).$$
Hereafter, we denote by $\bar W_n=\frac1n\sum_{i=1}^nW_i$ and by $\mathbb E_W$ and $\mathcal L_W$ respectively the expectation and the law conditionally on the data $X,X_1,\dots,X_n$. Let
$$P_n^W=\frac1n\sum_{i=1}^nW_i\delta_{X_i},\qquad\nu_n^W=P_n^W-\bar W_nP_n$$
be the resampled empirical processes.
Arlot's procedure is based on the resampling heuristic of Efron (see Efron [15]), which states that the law of a functional $F(P,P_n)$ is close to the conditional law $\mathcal L_W(C_WF(\bar W_nP_n,P_n^W))$, where $C_W$ is a renormalizing constant that depends only on the resampling scheme and on $F$. Following this heuristic, Arlot defines the resampling penalties as resampling estimates of $\mathbb E(\operatorname{pen}_{\mathrm{id}}(m))$, that is,
$$\operatorname{pen}(m)=2C_W\mathbb E_W\bigl[\nu_n^W\bigl(\hat s_m^W\bigr)\bigr],\quad\text{where }\hat s_m^W=\arg\min_{t\in S_m}\bigl\{\|t\|^2-2P_n^Wt\bigr\}.\tag{3}$$
We prove concentration inequalities for $\operatorname{pen}(m)$ and we compute the value of $C_W$ for which $\operatorname{pen}(m)$ is a sharp estimator of $\operatorname{pen}_{\mathrm{id}}(m)$; we deduce that $\operatorname{pen}(m)$ provides an optimal model selection procedure (see Theorem 3.3).
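To make (3) concrete, here is a Monte Carlo sketch for a single regular histogram model, assuming Efron bootstrap weights (so that $\bar W_n=1$ and $v_W^2=\operatorname{Var}(W_1)=1-1/n$) and taking $C_W=2(v_W^2)^{-1}$ as in Theorem 3.3; the sample, the number of replicates and the model dimension are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.beta(2.0, 5.0, size=300)

def resampling_penalty(X, d, n_rep=500):
    """Monte Carlo approximation of pen(m) in (3) for the regular histogram
    model with d bins on [0, 1], with Efron bootstrap weights
    W ~ Multinomial(n, (1/n, ..., 1/n)), so that W_bar_n = 1 and
    v_W^2 = Var(W_1) = 1 - 1/n; C_W = 2 / v_W^2 as in Theorem 3.3."""
    n = len(X)
    cells = np.clip((X * d).astype(int), 0, d - 1)            # cell of each X_i
    pn = np.bincount(cells, minlength=d) / n * np.sqrt(d)     # P_n psi_lambda
    acc = 0.0
    for _ in range(n_rep):
        W = rng.multinomial(n, np.full(n, 1.0 / n)).astype(float)
        pw = np.bincount(cells, weights=W, minlength=d) / n * np.sqrt(d)
        acc += np.sum(pw * (pw - pn))                         # nu_n^W(s_hat_m^W)
    v2 = 1.0 - 1.0 / n
    return 2.0 * (2.0 / v2) * acc / n_rep

pen = resampling_penalty(X, d=10)
print(pen)
```

On average over the weights, $\nu_n^W(\hat s_m^W)=\sum_\lambda P_n^W\psi_\lambda(P_n^W\psi_\lambda-P_n\psi_\lambda)$ estimates the conditional variance of the resampled coefficients, which is why the output is positive.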
2.3. Main assumptions
For all $m,m'$ in $\mathcal M_n$, let $D_m=n\mathbb E(\|s_m-\hat s_m\|^2)$,
$$\frac{R_m}n=\mathbb E\bigl(\|s-\hat s_m\|^2\bigr)=\|s-s_m\|^2+\frac{D_m}n,\qquad v^2_{m,m'}=\sup_{t\in S_m+S_{m'},\,\|t\|\le1}\operatorname{Var}\bigl(t(X)\bigr),$$
$$e_{m,m'}=\frac1n\sup_{t\in S_m+S_{m'},\,\|t\|\le1}\|t\|_\infty^2.$$
For all $k\in\mathbb N$, let $\mathcal M_n^k=\{m\in\mathcal M_n,\ R_m\in[k,k+1)\}$. For all $n$ in $\mathbb N$, for all $k>0$, $k'>0$ and $\gamma\ge0$, let $[k]$ be the integer part of $k$ and let
$$l_{n,\gamma}\bigl(k,k'\bigr)=\ln\bigl(1+\operatorname{Card}\bigl(\mathcal M_n^{[k]}\bigr)\bigr)+\ln\bigl(1+\operatorname{Card}\bigl(\mathcal M_n^{[k']}\bigr)\bigr)+\ln\bigl((k+1)\bigl(k'+1\bigr)\bigr)+(\ln n)^\gamma.\tag{4}$$
[V] There exist $\gamma>1$ and a sequence $(\varepsilon_n)_{n\in\mathbb N}$ with $\varepsilon_n\to0$ such that, for all $n$ in $\mathbb N$,
$$\sup_{(k,k')\in(\mathbb N^*)^2}\ \sup_{(m,m')\in\mathcal M_n^k\times\mathcal M_n^{k'}}\biggl[\biggl(\frac{v_{m,m'}^2}{R_m\vee R_{m'}}\biggr)^2\vee\frac{e_{m,m'}}{R_m\vee R_{m'}}\biggr]l_{n,\gamma}^2\bigl(k,k'\bigr)\le\varepsilon_n^4.$$
Comments.
• [V] ensures that the fluctuations of the ideal penalty are uniformly small compared to the risk of the estimator $\hat s_m$. Note that, for all $k,k'$, $l_{n,\gamma}(k,k')\ge(\ln n)^\gamma$; thus [V] holds only in non-parametric situations where $R_n=\inf_{m\in\mathcal M_n}R_m\to\infty$ as $n\to\infty$.
[BR] There exist two sequences $(h_n^*)_{n\in\mathbb N^*}$ and $(h_n^o)_{n\in\mathbb N^*}$ with $(h_n^o\vee h_n^*)\to0$ as $n\to\infty$ such that, for all $n$ in $\mathbb N^*$, for all $m_o\in\arg\min_{m\in\mathcal M_n}R_m$ and all $m_*\in\arg\max_{m\in\mathcal M_n}D_m$,
$$\frac{R_{m_o}}{D_{m_*}}\le h_n^o,\qquad\frac{n\|s-s_{m_*}\|^2}{D_{m_*}}\le h_n^*.$$
Comments.
• The slope heuristic states that the complexity $\Delta_{\hat m}$ of the selected estimator is too large when the penalty term is too small. A minimal assumption for this heuristic to hold with $\Delta_m=D_m$ is that there exists a sequence $(\theta_n)_{n\in\mathbb N^*}$ with $\theta_n\to0$ as $n\to\infty$ such that, for all $n$ in $\mathbb N^*$, for all $m_o\in\arg\min_{m\in\mathcal M_n}\mathbb E(\|s-\hat s_m\|^2)$ and all $m_*\in\arg\max_{m\in\mathcal M_n}\mathbb E(\|s_m-\hat s_m\|^2)$,
$$D_{m_o}\le\theta_nD_{m_*}.$$
Assumption [BR] is slightly stronger but will always hold in the examples (see Section 4).
In order to get an idea of the rates $R_n$, $\varepsilon_n$, $h_n^*$, $h_n^o$, let us briefly consider the following very simple example.

Example HR. We assume that $s$ is supported in $[0,1]$ and that $(S_m)_{m\in\mathcal M_n}$ is the collection of regular histograms on $[0,1]$ with $d_m=1,\dots,n$ pieces. We will see in Section 4.2 that $D_m\sim d_m$ asymptotically, hence $D_{m_*}\sim n$. Moreover, if $s$ is Hölderian and not constant, there exist positive constants $c_l,c_u,\alpha_l,\alpha_u$ such that, for all $m$ in $\mathcal M_n$ (see [4]),
$$c_ld_m^{-\alpha_l}\le\|s-s_m\|^2\le c_ud_m^{-\alpha_u}.$$
In Section 4.2, we prove that this assumption implies [V] with $\varepsilon_n\le C\ln(n)n^{-1/(8\alpha_l+4)}$.
Moreover, there exists a constant $C>0$ such that $R_{m_o}\le\inf_{m\in\mathcal M_n}(c_und_m^{-\alpha_u}+d_m)\le Cn^{1/(2\alpha_u+1)}$, thus $R_{m_o}/D_{m_*}\le Cn^{-2\alpha_u/(2\alpha_u+1)}$. Since there exists $C>0$ such that $n\|s-s_{m_*}\|^2/D_{m_*}\le Cd_{m_*}^{-\alpha_u}=Cn^{-\alpha_u}$, [BR] holds with $h_n^o=Cn^{-2\alpha_u/(2\alpha_u+1)}$ and $h_n^*=Cn^{-\alpha_u}$.
Other examples can be found in Birgé and Massart [9]; see also Section 4.
3. Main results
3.1. The slope heuristic
The first result deals with the behavior of the selected estimator when $\operatorname{pen}(m)$ is too small.
Theorem 3.1 (Minimal penalty). Let $\mathcal M_n$ be a collection of models satisfying [V] and [BR] and let $\varepsilon_n^*=\varepsilon_n\vee h_n^*$. Assume that there exists $0<\delta_n<1$ such that $0\le\operatorname{pen}(m)\le(1-\delta_n)D_m/n$. Let $\hat m$, $\tilde s$ be the random variables defined in (2) and let
$$c_n=\frac{\delta_n-28\varepsilon_n^*}{1+16\varepsilon_n}.$$
There exists a constant $C>0$ such that, with probability larger than $1-C\mathrm e^{-(1/2)(\ln n)^\gamma}$,
$$D_{\hat m}\ge c_nD_{m_*},\qquad\|s-\tilde s\|^2\ge\frac{c_n}{5h_n^o}\inf_{m\in\mathcal M_n}\|s-\hat s_m\|^2.\tag{5}$$
Comments.
• Assume that $\operatorname{pen}(m)\le(1-\delta)D_m/n$; then, for $n$ sufficiently large, $D_{\hat m}\ge cD_{m_*}$ is as large as possible. This proves SH1 with $\Delta_m=D_m/n$ and $K_{\min}=1$.
• The second part of (5) also proves that, in that case, we cannot obtain an oracle inequality.
Theorem 3.2. Let $\mathcal M_n$ be a collection of models satisfying [V]. Assume that there exist $\delta_+\ge\delta_->-1$ and $0\le p<1$ such that, with probability at least $1-p$,
$$\frac{2D_m}n+\delta_-\frac{R_m}n\le\operatorname{pen}(m)\le\frac{2D_m}n+\delta_+\frac{R_m}n.\tag{6}$$
Let $\hat m$, $\tilde s$ be the random variables defined in (2) and let
$$C_n\bigl(\delta_-,\delta_+\bigr)=\biggl(\frac{1+\delta_--46\varepsilon_n}{1+\delta_++26\varepsilon_n}\vee0\biggr)^{-1}.$$
There exists a constant $C>0$ such that, with probability larger than $1-p-C\mathrm e^{-(1/2)(\ln n)^\gamma}$,
$$D_{\hat m}\le C_n\bigl(\delta_-,\delta_+\bigr)R_{m_o},\qquad\|s-\tilde s\|^2\le C_n\bigl(\delta_-,\delta_+\bigr)\inf_{m\in\mathcal M_n}\|s-\hat s_m\|^2.\tag{7}$$
Comments.
• Assume first that $\operatorname{pen}(m)=(2+\delta)D_m/n$ with $\delta$ close to $-1$; then, for $n$ sufficiently large, (7) ensures that $D_{\hat m}=O(R_{m_o})$. Hence, $D_{\hat m}$ jumps from $D_{m_*}$ (Theorem 3.1) to $R_{m_o}$. This proves SH2, with $\Delta_m=D_m/n$ and $K_{\min}=1$.
• Assume then that $\delta_-=\delta_+=0$, so that the penalty term is $\operatorname{pen}(m)=2K_{\min}\Delta_m$. The second part of (7) ensures that $\tilde s$ satisfies a sharp oracle inequality. This proves SH3. In general, the rate of convergence of the leading constant to 1 is given by the supremum of $\delta_-$, $\delta_+$ and $\varepsilon_n$.
• The second part of (7) also proves that, if $\operatorname{pen}(m)=K\Delta_m$ with $K>2K_{\min}$, then we only lose a constant in the oracle inequality. This is why it is better to choose a penalty that is too large, which only yields a loss in the constant of the oracle inequality, than one that is too small, which may lead to an explosion of the risk.
• $D_m/n$ is unknown in general and cannot be used in the slope algorithm. We propose two alternatives to solve this issue. In Section 3.2, we give a resampling estimator of $D_m$; it can be used for every collection of models satisfying [V]. This estimator satisfies (6) with $-\delta_-=\delta_+=O(\varepsilon_n)$. In Section 4.2, we will also see that, in regular models, we can use $d_m$ instead of $D_m$ and the error is upper bounded by $CR_m/R_{m_o}$, thus Theorem 3.2 holds with $(\delta_-\vee\delta_+)\le C/R_{m_o}\vee\varepsilon_n$, $p=0$. In both cases, we deduce from Theorem 3.2 that the estimator $\tilde s$ given by the slope algorithm achieves an optimal oracle inequality. In Example HR, for example, we obtain $\varepsilon_n=Cn^{-1/(8\alpha_l+4)}\ln n$.
• Let us note here that the constant $K_{\min}=1$ is absolute when we choose $\Delta_m=D_m$. This is not true in general, and $K_{\min}$ can even depend on $s$. To see this, consider the collection of regular histograms on $[0,1]$. In that case, we have (see (9)) $D_m=(F(1)-F(0))d_m-\|s_m\|^2$, where $F(x)=\int_{-\infty}^xs(t)\,\mathrm d\mu(t)$. Hence, the slope heuristic holds for $\Delta_m=d_m$, but the associated constant $K_{\min}=F(1)-F(0)$ is not absolute if $s$ is not supported on $[0,1]$.
3.2. Resampling penalties
Theorem 3.3. Let $X_1,\dots,X_n$ be i.i.d. random variables with common density $s$. Let $\mathcal M_n$ be a collection of models satisfying [V]. Let $W_1,\dots,W_n$ be a resampling scheme, let $\bar W_n=\frac1n\sum_{i=1}^nW_i$, $v_W^2=\operatorname{Var}(W_1-\bar W_n)$ and $C_W=2(v_W^2)^{-1}$. Let $\tilde s$ be the penalized least-squares estimator defined in (2) with $\operatorname{pen}(m)$ defined in (3). Then, there exists a constant $C>0$ such that
$$\mathbb P\Bigl(\|s-\tilde s\|^2\le(1+100\varepsilon_n)\inf_{m\in\mathcal M_n}\|s-\hat s_m\|^2\Bigr)\ge1-C\mathrm e^{-(1/2)(\ln n)^\gamma}.\tag{8}$$
Comments. The main advantage of this result is that the penalty term is always totally computable. It does not depend on an arbitrary choice of a constant $K_{\min}$ made by the observer, which may be hard to detect in practice (see the paper of Arlot and Massart [5] for an extensive discussion of this important issue). However, $C_W$ is only optimal asymptotically. It is sometimes useful to overpenalize a little in order to improve the non-asymptotic performance of our procedures (see Massart [20] and the remarks after Theorem 3.2), and the slope heuristic can be used to do so in an optimal way.
4. Rates of convergence for classical examples
The aim of this section is to show that [V] can be derived from more classical hypotheses for two classical collections of models: histograms and Fourier spaces. We obtain the rates $\varepsilon_n$ under these new hypotheses.
4.1. Assumption on the risk of the oracle
Recall that $R_n=\inf_{m\in\mathcal M_n}R_m$. In this section, we make the following assumption.
[BR] (Bounds on the Risk) There exist constants $C_u>0$, $\alpha_u>0$, $\gamma>1$ and a sequence $(\theta_n)_{n\in\mathbb N}$ with $\theta_n\to\infty$ as $n\to\infty$ such that, for all $n$ in $\mathbb N^*$, for all $m$ in $\mathcal M_n$,
$$\theta_n^2(\ln n)^{2\gamma}\le R_n\le R_m\le C_un^{\alpha_u}.$$
Comments. Let $(S_m,\ m\in\mathcal M_n)$ be the collection of regular histograms of Example HR. Assume that $s$ is a Hölderian, non-constant and compactly supported function; then there exist positive constants $c_i,c_u,\alpha_i,\alpha_u$ such that (see for example Arlot [3])
$$c_id_m^{-\alpha_i}\le\|s-s_m\|\le c_ud_m^{-\alpha_u}.$$
We also have, see (9), $D_m=d_m-\|s_m\|^2$. Hence,
$$R_n\ge c\inf_{d_m=1,\dots,n}\bigl(nd_m^{-2\alpha_i}+d_m\bigr)\ge cn^{1/(2\alpha_i+1)}.$$
For all $m$ in $\mathcal M_n$, we also have $R_m\le(c_u+1)n$. Hence, [BR] holds for all $\gamma>1$ with $\theta_n=cn^{1/(4\alpha_i+2)}(\ln n)^{-2\gamma}$. It is also a classical result of minimax theory that there exist functions in Sobolev spaces satisfying this kind of assumption when $\mathcal M_n$ is the collection of Fourier spaces that we introduce below.
We also make the following assumption on the collection $(S_m,\ m\in\mathcal M_n)$.
[PC] (Polynomial collection) There exist constants $c_{\mathcal M}\ge0$, $\alpha_{\mathcal M}\ge0$ such that, for all $n$ in $\mathbb N$, $\operatorname{Card}(\mathcal M_n)\le c_{\mathcal M}n^{\alpha_{\mathcal M}}$.
Under Assumptions [BR] and [PC], for all $m\in\mathcal M_n$, $R_m\le C_un^{\alpha_u}$; thus for all $k>C_un^{\alpha_u}$, $\operatorname{Card}(\mathcal M_n^{[k]})=0$. In particular, we only have to take into account in [V] the integers $k$ and $k'$ such that $k\le C_un^{\alpha_u}$ and $k'\le C_un^{\alpha_u}$, and there exists a constant $\kappa>0$ such that $\ln[(1+k)(1+k')]\le\kappa\ln n$. Moreover, under [PC], $\ln(1+\operatorname{Card}(\mathcal M_n^{[k]}))\le\kappa\ln n$; hence, there exists a constant $\kappa>0$ such that, for all $\gamma>1$ and $n\ge3$,
$$\sup_{(k,k')\in(\mathbb N^*)^2}\ \sup_{(m,m')\in\mathcal M_n^{[k]}\times\mathcal M_n^{[k']}}\biggl[\biggl(\frac{v_{m,m'}^2}{R_m\vee R_{m'}}\biggr)^2\vee\frac{e_{m,m'}}{R_m\vee R_{m'}}\biggr]l_{n,\gamma}^2\bigl(k,k'\bigr)\le\sup_{(m,m')\in(\mathcal M_n)^2}\biggl[\biggl(\frac{v_{m,m'}^2}{R_m\vee R_{m'}}\biggr)^2\vee\frac{e_{m,m'}}{R_m\vee R_{m'}}\biggr]\kappa(\ln n)^{2\gamma}.$$
4.2. The histogram case
Let $(\mathbb X,\mathcal X)$ be a measurable space. Let $(\mathcal P_m)_{m\in\mathcal M_n}$ be a growing collection of measurable partitions $\mathcal P_m=(I_\lambda)_{\lambda\in m}$ of subsets of $\mathbb X$ such that, for all $m\in\mathcal M_n$, for all $\lambda\in m$, $0<\mu(I_\lambda)<\infty$. For $m$ in $\mathcal M_n$, the set $S_m$ of histograms associated to $\mathcal P_m$ is the set of functions that are constant on each $I_\lambda$, $\lambda\in m$; $S_m$ is a linear space. Setting, for all $\lambda\in m$, $\psi_\lambda=(\sqrt{\mu(I_\lambda)})^{-1}\mathbf 1_{I_\lambda}$, the functions $(\psi_\lambda)_{\lambda\in m}$ form an orthonormal basis of $S_m$. Let us recall that, for all $m$ in $\mathcal M_n$,
$$D_m=\sum_{\lambda\in m}\operatorname{Var}\bigl(\psi_\lambda(X)\bigr)=\sum_{\lambda\in m}\bigl(P\psi_\lambda^2-(P\psi_\lambda)^2\bigr)=\sum_{\lambda\in m}\frac{P(X\in I_\lambda)}{\mu(I_\lambda)}-\|s_m\|^2.\tag{9}$$
Moreover, from the Cauchy–Schwarz inequality, for all $x$ in $\mathbb X$, for all $m,m'$ in $\mathcal M_n$,
$$\sup_{t\in B_{m,m'}}t^2(x)\le\sum_{\lambda\in m\cup m'}\psi_\lambda^2(x),\quad\text{thus}\quad e_{m,m'}=\frac1n\sup_{\lambda\in m\cup m'}\frac1{\mu(I_\lambda)}.\tag{10}$$
Finally, it is easy to check that, for all $m,m'$ in $\mathcal M_n$,
$$v^2_{m,m'}=\sup_{\lambda\in m\cup m'}\operatorname{Var}\bigl(\psi_\lambda(X)\bigr)=\sup_{\lambda\in m\cup m'}\frac{P(X\in I_\lambda)(1-P(X\in I_\lambda))}{\mu(I_\lambda)}.\tag{11}$$
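The population quantities (9)–(11) can be computed in closed form once the density is fixed. As a small numerical check (our own illustration: the toy density $s(x)=2x$ on $[0,1]$, so $P(X\in[a,b])=b^2-a^2$, and a pair of regular partitions with $\mu$ Lebesgue are assumptions, not from the paper):

```python
import numpy as np

def histogram_quantities(d, d2):
    """Population quantities (9)-(11) for a pair of regular histogram models
    with d and d2 bins on [0, 1], under the density s(x) = 2x (so that
    P(X in [a, b]) = b^2 - a^2) and mu = Lebesgue, mu(I_lambda) = 1/d."""
    def cell_probs(k):
        e = np.linspace(0.0, 1.0, k + 1)
        return e[1:] ** 2 - e[:-1] ** 2
    p, p2 = cell_probs(d), cell_probs(d2)
    mu, mu2 = 1.0 / d, 1.0 / d2
    # (9): D_m = sum P(X in I)/mu(I) - ||s_m||^2, with ||s_m||^2 = sum P^2/mu.
    D_m = np.sum(p / mu) - np.sum(p ** 2 / mu)
    # (10): n * e_{m,m'} = sup over cells of both partitions of 1/mu(I).
    ne = max(1.0 / mu, 1.0 / mu2)
    # (11): v^2_{m,m'} = sup over cells of P(X in I)(1 - P(X in I))/mu(I).
    v2 = max(np.max(p * (1 - p) / mu), np.max(p2 * (1 - p2) / mu2))
    return D_m, ne, v2

D_m, ne, v2 = histogram_quantities(10, 20)
print(D_m, ne, v2)   # D_m = 8.67, n e = 20, v2 = 1.759875
```

Here $D_m=d_m-\|s_m\|^2$ with $\|s_m\|^2=d_m\sum_\lambda P(X\in I_\lambda)^2=1.33$, consistent with the [Reg] bounds of Section 4.2.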
We will consider two particular types of histograms.
Example 1 ([Reg]: $\mu$-regular histograms). For all $m$ in $\mathcal M_n$, $\mathcal P_m$ is a partition of $\mathbb X$ and there exist a family $(d_m)_{m\in\mathcal M_n}$ bounded by $n$ and two constants $c_{rh}$, $C_{rh}$ such that, for all $m$ in $\mathcal M_n$, for all $\lambda\in m$,
$$\frac{c_{rh}}{d_m}\le\mu(I_\lambda)\le\frac{C_{rh}}{d_m}.$$
The typical example here is the collection described in Example HR, where $c_{rh}=C_{rh}=1$. Remark that a collection satisfying [Reg] can be of infinite dimension.
Example 2 ([Ada]: Adapted histograms). There exist positive constants $c_r$, $C_{ah}$ such that, for all $m\in\mathcal M_n$, for all $\lambda\in m$, $\mu(I_\lambda)\ge c_rn^{-1}$ and
$$\frac{P(X\in I_\lambda)}{\mu(I_\lambda)}\le C_{ah}.$$
[Ada] is typically satisfied when $s$ is bounded on $\mathbb X$. Remark that the models satisfying [Ada] have finite dimension $d_m\le Cn$ since
$$1\ge\sum_{\lambda\in m}P(X\in I_\lambda)\ge C_{ah}\sum_{\lambda\in m}\mu(I_\lambda)\ge C_{ah}c_rd_mn^{-1}.$$
The example [Reg]
It follows from Eqs (9)–(11) and Assumption [Reg] that
$$C_{rh}^{-1}d_m-\|s_m\|^2\le D_m\le c_{rh}^{-1}d_m-\|s_m\|^2,\qquad e_{m,m'}\le c_{rh}^{-1}\frac{d_m\vee d_{m'}}n,$$
$$v^2_{m,m'}\le\sup_{t\in B_{m,m'}}\|t\|_\infty\|t\|\,\|s\|\le c_{rh}^{-1/2}\|s\|\sqrt{d_m\vee d_{m'}}.$$
Thus
$$\frac{e_{m,m'}}{R_m\vee R_{m'}}\le C_{rh}c_{rh}^{-1}\frac{(R_m\vee R_{m'})+\|s\|^2}{n(R_m\vee R_{m'})}\le Cn^{-1}.$$
If $D_m\vee D_{m'}\le\theta_n^2(\ln n)^{2\gamma}$,
$$\frac{v^2_{m,m'}}{R_m\vee R_{m'}}\le\frac{\sqrt{C_{rh}c_{rh}^{-1}\bigl((D_m\vee D_{m'})+\|s\|^2\bigr)}}{R_{m_o}}\le\frac C{\theta_n(\ln n)^\gamma}.$$
If $D_m\vee D_{m'}\ge\theta_n^2(\ln n)^{2\gamma}$,
$$\frac{v^2_{m,m'}}{R_m\vee R_{m'}}\le\frac{\sqrt{C_{rh}c_{rh}^{-1}\bigl((D_m\vee D_{m'})+\|s\|^2\bigr)}}{D_m\vee D_{m'}}\le\frac C{\theta_n(\ln n)^\gamma}.$$
There exists $\kappa>0$ such that $\theta_n^2(\ln n)^{2\gamma}\le\kappa n$ since, for all $m$ in $\mathcal M_n$, $R_m\le n\|s-s_m\|^2+c_{rh}^{-1}d_m\le(\|s\|^2+c_{rh}^{-1})n$.
Hence Assumption [V] holds with the $\gamma$ given in Assumption [BR] and $\varepsilon_n=C\theta_n^{-1/2}$.
The example [Ada]
It follows from inequalities (10), (11) and Assumption [Ada] that, for all $m$ and $m'$ in $\mathcal M_n$,
$$e_{m,m'}\le c_r^{-1}\quad\text{and}\quad v^2_{m,m'}\le C_{ah}.$$
Thus, there exists a constant $\kappa>0$ such that
$$\sup_{(m,m')\in(\mathcal M_n)^2}\biggl(\frac{v_{m,m'}^2}{R_m\vee R_{m'}}\biggr)^2\vee\frac{e_{m,m'}}{R_m\vee R_{m'}}\le\frac\kappa{\theta_n^2(\ln n)^{2\gamma}}.$$
Therefore Assumption [V] also holds with the $\gamma$ given in Assumption [BR] and $\varepsilon_n=\kappa\theta_n^{-1/2}$.

4.3. Fourier spaces
In this section, we assume that $s$ is supported in $[0,1]$. We introduce the classical Fourier basis. Let $\psi_0:[0,1]\to\mathbb R$, $x\mapsto1$ and, for all $k\in\mathbb N^*$, define the functions
$$\psi_{1,k}:[0,1]\to\mathbb R,\ x\mapsto\sqrt2\cos(2\pi kx),\qquad\psi_{2,k}:[0,1]\to\mathbb R,\ x\mapsto\sqrt2\sin(2\pi kx).$$
For all $j$ in $\mathbb N^*$, let
$$m_j=\{0\}\cup\bigl\{(i,k),\ i=1,2,\ k=1,\dots,j\bigr\}\quad\text{and}\quad\mathcal M_n=\{m_j,\ j=1,\dots,n\}.$$
For all $m$ in $\mathcal M_n$, let $S_m$ be the space spanned by the family $(\psi_\lambda)_{\lambda\in m}$; $(\psi_\lambda)_{\lambda\in m}$ is an orthonormal basis of $S_m$ and, for all $j$ in $1,\dots,n$, $d_{m_j}=2j+1$.
Let $j$ be in $1,\dots,n$; for all $x$ in $[0,1]$,
$$\sum_{\lambda\in m_j}\psi_\lambda^2(x)=1+2\sum_{k=1}^j\bigl(\cos^2(2\pi kx)+\sin^2(2\pi kx)\bigr)=1+2j=d_{m_j}.$$
Hence, for all $m$ in $\mathcal M_n$,
$$D_m=P\biggl(\sum_{\lambda\in m}\psi_\lambda^2\biggr)-\|s_m\|^2=d_m-\|s_m\|^2.\tag{12}$$
It is also clear that, for all $m,m'$ in $\mathcal M_n$,
$$e_{m,m'}=\frac{d_m\vee d_{m'}}n,\qquad v^2_{m,m'}\le\|s\|\sqrt{d_m\vee d_{m'}}.\tag{13}$$
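The pointwise identity $\sum_{\lambda\in m_j}\psi_\lambda^2(x)=d_{m_j}$ behind (12) is easy to check numerically (a standalone sketch; the grid of points and the choice $j=5$ are arbitrary):

```python
import numpy as np

def fourier_basis(j, x):
    """Evaluate the basis of m_j at the points x: psi_0 = 1,
    psi_{1,k} = sqrt(2) cos(2 pi k x), psi_{2,k} = sqrt(2) sin(2 pi k x)."""
    k = np.arange(1, j + 1)[:, None]
    return np.vstack([np.ones((1, len(x))),
                      np.sqrt(2.0) * np.cos(2 * np.pi * k * x[None, :]),
                      np.sqrt(2.0) * np.sin(2 * np.pi * k * x[None, :])])

j = 5
x = np.linspace(0.0, 1.0, 7)
psi = fourier_basis(j, x)                 # shape (2j + 1, len(x))
# Pointwise identity behind (12): sum_lambda psi_lambda^2(x) = d_{m_j} = 2j + 1.
print(np.sum(psi ** 2, axis=0))           # constant vector equal to 11
```

Each pair contributes $2\cos^2(2\pi kx)+2\sin^2(2\pi kx)=2$, so the sum is $1+2j$ at every point.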
The collection of Fourier spaces of dimension $d_m\le n$ satisfies Assumption [PC], and the quantities $D_m$, $e_{m,m'}$ and $v^2_{m,m'}$ satisfy the same inequalities as in the collection [Reg]; therefore, in this collection too, [V] follows from [BR].
We have obtained the following corollary of Theorem 3.3.

Corollary 4.1. Let $\mathcal M_n$ be either a collection of histograms satisfying Assumptions [PC]–[Reg] or [PC]–[Ada], or the collection of Fourier spaces of dimension $d_m\le n$. Assume that $s$ satisfies Assumption [BR] for some $\gamma>1$ and $\theta_n\to\infty$. Then, there exist constants $\kappa>0$ and $C>0$ such that the estimator $\tilde s$ selected by a resampling penalty satisfies
$$\mathbb P\Bigl(\|s-\tilde s\|^2\le\bigl(1+\kappa\theta_n^{-1/2}\bigr)\inf_{m\in\mathcal M_n}\|s-\hat s_m\|^2\Bigr)\ge1-C\mathrm e^{-(1/2)(\ln n)^\gamma}.$$
Comment. Assumption [BR] is hard to check in practice. We described at the beginning of this section some examples where it holds. In more general situations, following Arlot [4], we use our main theorem only for the models with dimension $d_m\ge(\ln n)^{4+2\gamma}$; they satisfy [BR] with $\theta_n=(\ln n)^2$, at least when $n$ is sufficiently large, because
$$\|s\|^2+R_m\ge\|s\|^2+D_m\ge cd_m\ge c(\ln n)^4(\ln n)^{2\gamma}.$$
With our concentration inequalities, we can easily control the risk of the models with dimension $d_m\le(\ln n)^{4+2\gamma}$ by $\kappa(\ln n)^{3+5\gamma/2}$ with probability larger than $1-C\mathrm e^{-(1/2)(\ln n)^\gamma}$.
We can then deduce the following corollary.
Corollary 4.2. Let $\mathcal M_n$ be either a collection of histograms satisfying Assumptions [PC]–[Reg] or [PC]–[Ada], or the collection of Fourier spaces of dimension $d_m\le n$. There exist constants $\kappa>0$, $\eta>3+5\gamma/2$ and $C>0$ such that the estimator $\tilde s$ selected by a resampling penalty satisfies
$$\mathbb P\biggl(\|s-\tilde s\|^2\le\bigl(1+\kappa(\ln n)^{-1}\bigr)\inf_{m\in\mathcal M_n}\|s-\hat s_m\|^2+\frac{(\ln n)^\eta}n\biggr)\ge1-C\mathrm e^{-(1/2)(\ln n)^\gamma}.$$
5. Proofs of the main results
5.1. Notations
Let us recall here the main notations. For all $m,m'$ in $\mathcal M_n$, let
$$p(m)=\|s_m-\hat s_m\|^2,\qquad D_m=n\mathbb E\bigl(p(m)\bigr)=n\mathbb E\bigl(\|\hat s_m-s_m\|^2\bigr),$$
$$R_m=n\mathbb E\bigl(\|s-\hat s_m\|^2\bigr)=n\|s-s_m\|^2+D_m,\qquad\delta\bigl(m,m'\bigr)=\nu_n(s_{m'}-s_m).$$
For all $n\in\mathbb N^*$, $k>0$, $k'>0$, $\gamma>0$, let $[k]$ be the integer part of $k$ and let
$$l_{n,\gamma}\bigl(k,k'\bigr)=\ln\bigl[\bigl(1+\operatorname{Card}\bigl(\mathcal M_n^{[k]}\bigr)\bigr)\bigl(1+\operatorname{Card}\bigl(\mathcal M_n^{[k']}\bigr)\bigr)\bigr]+\ln\bigl[(1+k)\bigl(1+k'\bigr)\bigr]+(\ln n)^\gamma.$$
Recall that Assumption [V] implies that, for all $m,m'$ in $\mathcal M_n$,
$$v^2_{m,m'}l_{n,\gamma}(R_m,R_{m'})\le\varepsilon_n^2(R_m\vee R_{m'}),\qquad e_{m,m'}l_{n,\gamma}(R_m,R_{m'})^2\le\varepsilon_n^4(R_m\vee R_{m'}).\tag{14}$$
Let $(\psi_\lambda)_{\lambda\in m}$ be an orthonormal basis of $S_m$. Easy algebra leads to
$$s_m=\sum_{\lambda\in m}(P\psi_\lambda)\psi_\lambda,\qquad\hat s_m=\sum_{\lambda\in m}(P_n\psi_\lambda)\psi_\lambda,\quad\text{thus}\quad\|s_m-\hat s_m\|^2=\sum_{\lambda\in m}\bigl(\nu_n(\psi_\lambda)\bigr)^2.$$
Therefore, $\hat s_m$ is an unbiased estimator of $s_m$ and
$$\operatorname{pen}_{\mathrm{id}}(m)=2\nu_n(\hat s_m)=2\nu_n(\hat s_m-s_m)+2\nu_n(s_m)=2\|s_m-\hat s_m\|^2+2\nu_n(s_m).$$
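The identity $\|s_m-\hat s_m\|^2=\sum_\lambda\nu_n(\psi_\lambda)^2$ makes $D_m=n\mathbb E(p(m))$ easy to approximate by simulation. Here is a sketch under the toy density $s(x)=2x$ on $[0,1]$ (an assumption for illustration, simulated as $X=\sqrt U$) with a regular 10-bin histogram, for which (9) gives $D_m=8.67$:

```python
import numpy as np

# Monte Carlo check of D_m = n E ||s_m - s_hat_m||^2 = n E sum nu_n(psi)^2
# for the regular 10-bin histogram under the toy density s(x) = 2x on [0, 1]
# (simulated as X = sqrt(U)); formula (9) gives D_m = 10 (1 - 0.133) = 8.67.
rng = np.random.default_rng(3)
d, n, reps = 10, 200, 2000
edges = np.linspace(0.0, 1.0, d + 1)
p = edges[1:] ** 2 - edges[:-1] ** 2       # P(X in I_lambda)
vals = []
for _ in range(reps):
    X = np.sqrt(rng.random(n))
    counts = np.bincount((X * d).astype(int).clip(0, d - 1), minlength=d)
    nu = counts / n - p                    # nu_n(psi_lambda) / sqrt(d)
    vals.append(n * d * np.sum(nu ** 2))   # n ||s_m - s_hat_m||^2
print(np.mean(vals))                       # close to D_m = 8.67
```

The exact value follows from (9): $n\mathbb E\sum_\lambda\nu_n(\psi_\lambda)^2=\sum_\lambda\operatorname{Var}(\psi_\lambda(X))=d\sum_\lambda p_\lambda(1-p_\lambda)$.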
By definition, $\hat m$ minimizes $\|s-\hat s_m\|^2+\operatorname{pen}(m)-\operatorname{pen}_{\mathrm{id}}(m)$. Hence, for all $m$ in $\mathcal M_n$,
$$\|s-\tilde s\|^2\le\|s-\hat s_m\|^2+\bigl(\operatorname{pen}(m)-2p(m)\bigr)+\bigl(2p(\hat m)-\operatorname{pen}(\hat m)\bigr)+2\delta(m,\hat m).\tag{15}$$