
HAL Id: hal-00422655

https://hal.archives-ouvertes.fr/hal-00422655

Submitted on 8 Oct 2009



To cite this version:

Matthieu Lerasle. Optimal model selection in density estimation. Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, Institut Henri Poincaré (IHP), 2012, 48 (3), pp. 884-908. DOI: 10.1214/11-AIHP425. hal-00422655.

Optimal model selection in density estimation

Matthieu Lerasle

Institut de Mathématiques (UMR 5219), INSA de Toulouse, Université de Toulouse, France

Abstract

We build penalized least-squares estimators using the slope heuristic and resampling penalties. We prove oracle inequalities for the selected estimator with leading constant asymptotically equal to 1. We compare the practical performances of these methods in a short simulation study.

Key words: Density estimation, optimal model selection, resampling methods, slope heuristic.

2000 Mathematics Subject Classification: 62G07, 62G09.

1 Introduction

The aim of model selection is to construct data-driven criteria to select a model among a given list. The history of statistical model selection goes back at least to Akaike [1], [2] and Mallows [18]. They proposed to select, among a collection of parametric models, the one which minimizes an empirical loss plus some penalty term proportional to the dimension of the model. Birgé & Massart [8] and Barron, Birgé & Massart [6] generalized this approach, making in particular the link between model selection and adaptive estimation.

They proved that previous methods, in particular cross-validation (see Rudemo [20]) and hard thresholding (see Donoho et al. [12]), can be viewed as penalization methods. More recently, Birgé & Massart [9], Arlot & Massart [5] and Arlot [4] (see also [3]) raised the problem of optimal, efficient model selection. Basically, the aim is to select an estimator satisfying an oracle inequality with leading constant asymptotically equal to 1. They obtained such procedures thanks to a sharp estimator of the ideal penalty $\mathrm{pen}_{\mathrm{id}}$. We will be interested in two natural ideas that are used in practice to evaluate $\mathrm{pen}_{\mathrm{id}}$ and have proved efficient in other frameworks. The first one is the slope heuristic. It was introduced by Birgé & Massart [9] in Gaussian regression and developed by Arlot & Massart [5] in an M-estimation framework. It allows one to optimize the choice of the leading constant in the penalty term, provided that the shape of $\mathrm{pen}_{\mathrm{id}}$ is known. The other one is Efron's resampling heuristic. The basic idea comes from Efron [14] and was used by Fromont [15] in the classification framework. Then, Arlot [4] made the link with ideal penalties and developed the general procedure. To the best of our knowledge, these methods have only been theoretically validated in regression frameworks. We propose here to prove their efficiency in density estimation. Let us now explain our context more precisely.

1.1 Least-squares estimators

In this paper, we define and study efficient penalized least-squares estimators in the density estimation framework when the error is measured with the $L^2$-loss. We observe $n$ i.i.d. random variables $X_1, \ldots, X_n$, defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, valued in a measurable space $(\mathbb{X}, \mathcal{X})$, with common law $P$. We assume that a measure $\mu$ on $(\mathbb{X}, \mathcal{X})$ is given and we denote by $L^2(\mu)$ the Hilbert space of square integrable real-valued functions defined on $\mathbb{X}$. $L^2(\mu)$ is endowed with its classical scalar product, defined for all $t, t'$ in $L^2(\mu)$ by
$$\langle t, t' \rangle = \int_{\mathbb{X}} t(x)\, t'(x)\, d\mu(x),$$
and the associated $L^2$-norm $\|\cdot\|$, defined for all $t$ in $L^2(\mu)$ by $\|t\| = \sqrt{\langle t, t \rangle}$. The parameter of interest is the density $s$ of $P$ with respect to $\mu$; we assume that it belongs to $L^2(\mu)$. The risk of an estimator $\hat{s}$ of $s$ is measured with the $L^2$-loss, that is $\|s - \hat{s}\|^2$, which is random when $\hat{s}$ is.

$s$ minimizes the integrated quadratic contrast $PQ(t)$, where $Q : L^2(\mu) \to L^1(P)$ is defined for all $t$ in $L^2(\mu)$ by $Q(t) = \|t\|^2 - 2t$. Hence, density estimation is a problem of $M$-estimation. These problems are classically solved in two steps. First, we choose a "model" $S_m$ that should be close to the parameter $s$, which means that $\inf_{t \in S_m} \|s - t\|^2$ is "small". Then, we minimize over $S_m$ the empirical version of the integrated contrast, that is, we choose
$$\hat{s}_m \in \arg\min_{t \in S_m} P_n Q(t). \qquad (1)$$

This last minimization can be computationally intractable for general sets $S_m$, leading to intractable procedures in practice. However, it can be solved easily when $S_m$ is a linear subspace of $L^2(\mu)$ since, for every orthonormal basis $(\psi_\lambda)_{\lambda \in m}$,
$$\hat{s}_m = \sum_{\lambda \in m} (P_n \psi_\lambda)\, \psi_\lambda. \qquad (2)$$
Thus, we will always assume that a model is a linear subspace of $L^2(\mu)$. The risk of the least-squares estimator $\hat{s}_m$ defined in (1) then decomposes into two terms, called bias and variance, thanks to Pythagoras' relation. Let $s_m$ be the orthogonal projection of $s$ onto $S_m$; then
$$\|s - \hat{s}_m\|^2 = \|s - s_m\|^2 + \|s_m - \hat{s}_m\|^2.$$
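As an illustration of (2), the projection estimator is just a vector of empirical coefficients in the chosen orthonormal basis. The following minimal sketch (illustrative only, not code from the paper; the regular histogram basis on [0, 1] is an assumption made for the example) computes $\hat{s}_m$ from a sample:

import numpy as np

def histogram_projection_estimator(X, d_m):
    """Projection estimator (2) on the regular histogram model with d_m cells on [0, 1].
    The basis is psi_lambda = mu(I_lambda)^{-1/2} * indicator(I_lambda)."""
    n = len(X)
    counts, _ = np.histogram(X, bins=d_m, range=(0.0, 1.0))
    mu = 1.0 / d_m                        # Lebesgue measure of each cell I_lambda
    coeffs = counts / (n * np.sqrt(mu))   # empirical coefficients P_n psi_lambda
    def s_hat(x):
        idx = np.clip((np.asarray(x) * d_m).astype(int), 0, d_m - 1)
        return coeffs[idx] / np.sqrt(mu)  # sum_lambda (P_n psi_lambda) psi_lambda(x)
    return coeffs, s_hat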

The statistician should choose a space $S_m$ realizing a trade-off between those terms. $S_m$ must be sufficiently "large" to ensure a small bias $\|s - s_m\|^2$, but not too large, so that the variance $\|s_m - \hat{s}_m\|^2$ does not explode. The best trade-off depends on unknown properties of $s$, since the bias is unknown, and on the behavior of the empirical minimizer $\hat{s}_m$ in the space $S_m$. Classically, $S_m$ is a parametric space and the dimension $d_m$ of $S_m$ as a linear space is used to give upper bounds on $D_m = n\,\mathbb{E}\|s_m - \hat{s}_m\|^2$. This approach is validated in regular models under the assumption that the support of $s$ is a known compact set, as mentioned in Section 3. However, this definition can fail dramatically because there exist simple models (histograms with a small dimension $d_m$) where $D_m$ is very large, and infinite dimensional models where $D_m$ is easily upper bounded. This issue is extensively discussed in Birgé [7]. Birgé chooses to keep the dimension $d_m$ of $S_m$ as a complexity measure and builds new estimators that achieve better risk bounds than the empirical minimizer. His procedures are unfortunately intractable for the practical user because he can only prove the existence of his estimators. Even his bounds on the risk are only interesting theoretically because they involve constants which are not optimal.

We will not take this point of view here and our estimator will always be the empirical minimizer, mainly because it can easily be computed, see (2). We will focus on the quantity $D_m/n$ and introduce a general assumption (namely Assumption [V]) that allows one to work indifferently with $D_m/n$ or with the actual risk $\|s_m - \hat{s}_m\|^2$. We will also provide and study an estimator of $D_m/n$ based on the resampling heuristic.

We insist here on the fact that, unlike classical methods, we will not use in this paper strong extra assumptions on $s$, such as $\|s\|_\infty < \infty$, nor assume that $s$ is compactly supported.

1.2 Model selection

Recall that the choice of an optimal model $S_m$ is impossible without strong assumptions on $s$, for example precise information about its regularity. However, under less restrictive hypotheses, we can build a countable collection of models $(S_m)_{m \in \mathcal{M}_n}$, growing with the number of observations, such that the best estimator in the associated collection $(\hat{s}_m)_{m \in \mathcal{M}_n}$ realizes an optimal trade-off, see for example Birgé & Massart [8] and Barron, Birgé & Massart [6]. The aim is then to build an estimator $\hat{m}$ such that our final estimator $\tilde{s} = \hat{s}_{\hat{m}}$ behaves almost as well as any oracle $m_o$, that is, any element of the set
$$\left\{ m_o \in \mathcal{M}_n,\ \|\hat{s}_{m_o} - s\|^2 = \inf_{m \in \mathcal{M}_n} \|\hat{s}_m - s\|^2 \right\}.$$

This is the problem of model selection. More precisely, we want $\tilde{s}$ to satisfy an oracle inequality, defined in general as follows.

Definition (Trajectorial oracle inequality): Let $(p_n)_{n \in \mathbb{N}}$ be a summable sequence and let $(C_n)_{n \in \mathbb{N}}$ and $(R_{m,n})_{n \in \mathbb{N}}$ be sequences of positive real numbers. The estimator $\tilde{s} = \hat{s}_{\hat{m}}$ satisfies a trajectorial oracle inequality $TO(C_n, (R_{m,n})_{m \in \mathcal{M}_n}, p_n)$ if
$$\forall n \in \mathbb{N}, \qquad \mathbb{P}\left( \|\tilde{s} - s\|^2 > C_n \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|^2 + R_{m,n} \right\} \right) \le p_n. \qquad (3)$$
When $\tilde{s}$ satisfies an oracle inequality, $C_n$ is called the leading constant.

In this paper, we are interested in the problem of optimal model selection, defined as follows.

Definition (Optimal model selection): We say that $\tilde{s}$ is optimal, or that the selection procedure $(X_1, \ldots, X_n) \mapsto \hat{m}$ is optimal, when $\tilde{s}$ satisfies a trajectorial oracle inequality $TO(1 + r_n, (R_{m,n})_{m \in \mathcal{M}_n}, p_n)$ with $r_n \to 0$ and $R_{m,n} = 0$ for all $n$ in $\mathbb{N}$ and all $m$ in $\mathcal{M}_n$. In order to simplify the notation, when $\tilde{s}$ is optimal we will say that $\tilde{s}$ satisfies an optimal oracle inequality $OTO(r_n, p_n)$.

In order to build $\hat{m}$, we remark that, for all $m$ in $\mathcal{M}_n$,
$$\|s - \hat{s}_m\|^2 = \|\hat{s}_m\|^2 - 2P\hat{s}_m + \|s\|^2 = P_n Q(\hat{s}_m) + 2\nu_n(\hat{s}_m) + \|s\|^2, \qquad (4)$$
where $\nu_n = P_n - P$ is the centered empirical process. An oracle minimizes $\|s - \hat{s}_m\|^2$ and thus $P_n Q(\hat{s}_m) + 2\nu_n(\hat{s}_m)$. As we want to imitate the oracle, we will design a map $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}^+$ and choose
$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n Q(\hat{s}_m) + \mathrm{pen}(m) \right\}, \qquad \tilde{s} = \hat{s}_{\hat{m}}. \qquad (5)$$
It is clear that the ideal penalty is $\mathrm{pen}_{\mathrm{id}}(m) = 2\nu_n(\hat{s}_m)$. For all $m$ in $\mathcal{M}_n$ and every orthonormal basis $(\psi_\lambda)_{\lambda \in m}$, $\hat{s}_m = \sum_{\lambda \in m} (P_n\psi_\lambda)\psi_\lambda$ and $s_m = \sum_{\lambda \in m} (P\psi_\lambda)\psi_\lambda$, thus
$$\nu_n(\hat{s}_m - s_m) = \nu_n\left( \sum_{\lambda \in m} (\nu_n\psi_\lambda)\psi_\lambda \right) = \sum_{\lambda \in m} (\nu_n\psi_\lambda)^2 = \|\hat{s}_m - s_m\|^2.$$

Let us define, for all $m$ in $\mathcal{M}_n$,
$$p(m) = \nu_n(\hat{s}_m - s_m) = \|\hat{s}_m - s_m\|^2.$$
From (4), for all $m$ in $\mathcal{M}_n$,
$$\|s - \tilde{s}\|^2 = \|\tilde{s}\|^2 - 2P\tilde{s} + \|s\|^2 = \|\tilde{s}\|^2 - 2P_n\tilde{s} + 2\nu_n\tilde{s} + \|s\|^2$$
$$\le P_n Q(\hat{s}_m) + \mathrm{pen}(m) + \left(2\nu_n(\tilde{s}) - \mathrm{pen}(\hat{m})\right) + \|s\|^2$$
$$= \|s - \hat{s}_m\|^2 + \left(\mathrm{pen}(m) - 2\nu_n(\hat{s}_m)\right) + \left(2\nu_n(\tilde{s}) - \mathrm{pen}(\hat{m})\right).$$
Hence, for all $m$ in $\mathcal{M}_n$,

$$\|s - \tilde{s}\|^2 \le \|s - \hat{s}_m\|^2 + \left(\mathrm{pen}(m) - 2p(m)\right) + \left(2p(\hat{m}) - \mathrm{pen}(\hat{m})\right) + 2\nu_n(s_{\hat{m}} - s_m). \qquad (6)$$
Let us define, for all $c_1, c_2 > 0$, the function
$$f_{c_1, c_2} : \mathbb{R}^+ \to \mathbb{R}^+, \qquad x \mapsto \begin{cases} \dfrac{1 + c_1 x}{1 - c_2 x} - 1 & \text{if } x < 1/c_2, \\ +\infty & \text{if } x \ge 1/c_2. \end{cases} \qquad (7)$$

It comes from inequality (6) that $\tilde{s}$ satisfies an oracle inequality $OTO(f_{2,2}(\epsilon_n), p_n)$ as soon as, with probability larger than $1 - p_n$,
$$\forall m \in \mathcal{M}_n, \qquad \frac{|2p(m) - \mathrm{pen}(m)|}{\|s - \hat{s}_m\|^2} \le \epsilon_n \qquad \text{and} \qquad (8)$$
$$\forall (m, m') \in \mathcal{M}_n^2, \qquad \frac{2\nu_n(s_{m'} - s_m)}{\|s - \hat{s}_{m'}\|^2 + \|s - \hat{s}_m\|^2} \le \epsilon_n. \qquad (9)$$
Inequality (9) does not depend on our choice of penalty; we will check that it can easily be satisfied for classical collections of models. In order to obtain inequality (8), we use two methods, defined for general M-estimation but so far studied only in some regression frameworks.

1.2.1 The slope heuristic

The first one is referred to as the "slope heuristic". The idea was introduced by Birgé & Massart [9] in the Gaussian regression framework and developed into a general algorithm by Arlot & Massart [5]. This heuristic states that there exist a sequence $(\Delta_m)_{m \in \mathcal{M}_n}$ and a constant $K_{\min}$ satisfying the following properties:

1. when $\mathrm{pen}(m) < K_{\min} \Delta_m$, then $\Delta_{\hat{m}}$ is too large, typically $\Delta_{\hat{m}} \ge C \max_{m \in \mathcal{M}_n} \Delta_m$,

2. when $\mathrm{pen}(m) \simeq (K_{\min} + \delta)\Delta_m$ for some $\delta > 0$, then $\Delta_{\hat{m}}$ is much smaller,

3. when $\mathrm{pen}(m) \simeq 2K_{\min} \Delta_m$, the selected estimator is optimal.

Thanks to the third point, when $\Delta_m$ and $K_{\min}$ are known, this heuristic says that the penalty $\mathrm{pen}(m) = 2K_{\min}\Delta_m$ selects an optimal estimator. When only $\Delta_m$ is known, the first and second points can be used to calibrate $K_{\min}$ in practice, as shown by the following algorithm (see Arlot & Massart [5]):

Slope algorithm

For all $K > 0$, compute the selected model $\hat{m}(K)$ given by (5) with the penalty $\mathrm{pen}(m) = K\Delta_m$, and the associated complexity $\Delta_{\hat{m}(K)}$.

Find the constant $K_{\min}$ such that $\Delta_{\hat{m}(K)}$ is large when $K < K_{\min}$, and "much smaller" when $K > K_{\min}$.

Take the final $\hat{m} = \hat{m}(2K_{\min})$.

We will justify the slope heuristic in the density estimation framework for $\Delta_m = \mathbb{E}(\|s_m - \hat{s}_m\|^2) = D_m/n$ and $K_{\min} = 1$. In general, $D_m$ is unknown and has to be estimated; we propose a resampling estimator and prove that it can be used, without extra assumptions, to obtain optimal results.
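As an indication of how the slope algorithm can be implemented in practice, here is a minimal sketch (illustrative only, not code from the paper; it locates $K_{\min}$ at the largest jump of the selected complexity, as done in the simulation study of Section 4):

import numpy as np

def slope_algorithm(crit_values, complexities, K_grid):
    """Slope heuristic calibration of the penalty constant.

    crit_values:  dict m -> P_n Q(s_hat_m)   (empirical least-squares criterion)
    complexities: dict m -> Delta_m          (e.g. d_m, or a resampling estimate of D_m)
    K_grid:       increasing array of candidate constants K > 0
    Returns the final model m_hat(2 * K_min), K_min being located at the largest
    jump of K -> Delta_{m_hat(K)}.
    """
    models = list(crit_values)

    def m_hat(K):
        return min(models, key=lambda m: crit_values[m] + K * complexities[m])

    path = np.array([complexities[m_hat(K)] for K in K_grid])
    drops = path[:-1] - path[1:]             # complexity decreases as K increases
    K_min = K_grid[1:][np.argmax(drops)]     # constant right after the maximal jump
    return m_hat(2.0 * K_min)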

1.2.2 Resampling penalties

Data-driven penalties have already been used in density estimation, in particular cross-validation methods as in Stone [21], Rudemo [20] or Celisse [11]. We are interested here in the resampling penalties introduced by Arlot [4]. Let $(W_1, \ldots, W_n)$ be a resampling scheme, i.e. a vector of random variables, independent of $X, X_1, \ldots, X_n$, and exchangeable, that is, for all permutations $\tau$ of $(1, \ldots, n)$,
$$(W_1, \ldots, W_n) \text{ has the same law as } (W_{\tau(1)}, \ldots, W_{\tau(n)}).$$
Hereafter, we denote by $\bar{W}_n = \sum_{i=1}^n W_i/n$ and by $\mathbb{E}_W$ and $\mathcal{L}_W$ respectively the expectation and the law conditionally on the data $X, X_1, \ldots, X_n$. Let $P_n^W = \sum_{i=1}^n W_i \delta_{X_i}/n$ and $\nu_n^W = P_n^W - \bar{W}_n P_n$ be the resampled empirical processes. Arlot's procedure is based on the resampling heuristic of Efron (see Efron [13]), which states that the law of a functional $F(P, P_n)$ is close to its resampled counterpart, that is the conditional law $\mathcal{L}_W(C_W F(\bar{W}_n P_n, P_n^W))$. Here $C_W$ is a renormalizing constant that depends only on the resampling scheme and on $F$. Following this heuristic, Arlot defines as a penalty the resampling estimate of the ideal penalty $2D_m/n$, that is
$$\mathrm{pen}(m) = 2C_W\, \mathbb{E}_W\left(\nu_n^W(\hat{s}_m^W)\right), \qquad (10)$$
where $\hat{s}_m^W$ minimizes $P_n^W Q(t)$ over $S_m$. We prove concentration inequalities for $\mathrm{pen}(m)$ and deduce that $\mathrm{pen}(m)$ provides an optimal procedure.

The paper is organized as follows. In Section 2, we state our main results: we prove the efficiency of the slope algorithm and of the resampling penalties. In Section 3, we compute the rates of convergence in the oracle inequalities for classical collections of models. Section 4 is devoted to a short simulation study where we compare the different methods in practice. The proofs are postponed to Section 5. Section 6 is an appendix where we add some probabilistic material: we prove a concentration inequality for $Z^2$, where $Z = \sup_{t \in B} \nu_n(t)$ and $B$ is symmetric. We deduce a simple concentration inequality for $U$-statistics of order 2 that extends a previous result by Houdré & Reynaud-Bouret [16].

2 Main results

Hereafter, we will denote by $c$, $C$, $K$, $\kappa$, $L$, $\alpha$, possibly with various subscripts, constants that may vary from line to line.

2.1 Concentration of the ideal penalty

Take an orthonormal basis $(\psi_\lambda)_{\lambda \in m}$ of $S_m$. Easy algebra leads to
$$s_m = \sum_{\lambda \in m} (P\psi_\lambda)\psi_\lambda, \qquad \hat{s}_m = \sum_{\lambda \in m} (P_n\psi_\lambda)\psi_\lambda, \qquad \text{thus} \qquad \|s_m - \hat{s}_m\|^2 = \sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2.$$

$\hat{s}_m$ is an unbiased estimator of $s_m$ and
$$\mathrm{pen}_{\mathrm{id}}(m) = 2\nu_n(\hat{s}_m) = 2\nu_n(\hat{s}_m - s_m) + 2\nu_n(s_m) = 2\|s_m - \hat{s}_m\|^2 + 2\nu_n(s_m).$$

For all $m, m'$ in $\mathcal{M}_n$, let
$$p(m) = \|s_m - \hat{s}_m\|^2 = \sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2, \qquad \delta(m, m') = 2\nu_n(s_m - s_{m'}). \qquad (11)$$
From (6), for all $m$ in $\mathcal{M}_n$,
$$\|s - \tilde{s}\|^2 \le \|s - \hat{s}_m\|^2 + \left(\mathrm{pen}(m) - 2p(m)\right) + \left(2p(\hat{m}) - \mathrm{pen}(\hat{m})\right) + \delta(\hat{m}, m). \qquad (12)$$
In this section, we are interested in the concentration of $p(m)$ around $\mathbb{E}(p(m)) = D_m/n$.

Let us first remark that, for all $m$ in $\mathcal{M}_n$, $p(m)$ is the supremum of the squared centered empirical process over the ellipsoid $B_m = \{t \in S_m, \|t\| \le 1\}$. From the Cauchy-Schwarz inequality, for all real numbers $(b_\lambda)_{\lambda \in m}$,
$$\sum_{\lambda \in m} b_\lambda^2 = \sup_{\sum_\lambda a_\lambda^2 \le 1} \left( \sum_{\lambda \in m} a_\lambda b_\lambda \right)^2. \qquad (13)$$
We apply this inequality with $b_\lambda = \nu_n(\psi_\lambda)$. We obtain, since the system $(\psi_\lambda)_{\lambda \in m}$ is orthonormal,
$$\sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2 = \sup_{\sum_\lambda a_\lambda^2 \le 1} \left( \sum_{\lambda \in m} a_\lambda \nu_n(\psi_\lambda) \right)^2 = \sup_{\sum_\lambda a_\lambda^2 \le 1} \left( \nu_n\left( \sum_{\lambda \in m} a_\lambda \psi_\lambda \right) \right)^2 = \sup_{t \in B_m} (\nu_n(t))^2.$$
Hence, $p(m)$ can be controlled thanks to Talagrand's concentration inequality (see Talagrand [22]).

This inequality involves $D_m = n\,\mathbb{E}\|\hat{s}_m - s_m\|^2$ and the constants
$$e_m = \frac{1}{n} \sup_{t \in B_m} \|t\|_\infty^2 \qquad \text{and} \qquad v_m^2 = \sup_{t \in B_m} \mathrm{Var}(t(X)). \qquad (14)$$

More precisely, the following proposition holds.

Proposition 2.1 Let $X, X_1, \ldots, X_n$ be i.i.d. random variables with common density $s$ with respect to a probability measure $\mu$. Assume that $s$ belongs to $L^2(\mu)$ and let $S_m$ be a linear subspace of $L^2(\mu)$. Let $s_m$ and $\hat{s}_m$ be respectively the orthogonal projection of $s$ onto $S_m$ and the projection estimator of $s$ onto $S_m$. Let $p(m) = \|s_m - \hat{s}_m\|^2$, $D_m = n\,\mathbb{E}(p(m))$ and let $v_m, e_m$ be the constants defined in (14). Then, for all $x > 0$,
$$\mathbb{P}\left( p(m) - \frac{D_m}{n} > \frac{D_m^{3/4}(e_m x^2)^{1/4} + 0.7\sqrt{D_m v_m^2 x} + 0.15\, v_m^2 x + e_m x^2}{n} \right) \le e^{-x/20}, \qquad (15)$$
$$\mathbb{P}\left( \frac{D_m}{n} - p(m) > \frac{1.8\, D_m^{3/4}(e_m x^2)^{1/4} + 1.71\sqrt{D_m v_m^2 x} + 4.06\, e_m x^2}{n} \right) \le 2.8\, e^{-x/20}. \qquad (16)$$

Comments: From (12), for all $m$ in $\mathcal{M}_n$,
$$\|s - \tilde{s}\|^2 \le \|s - \hat{s}_m\|^2 + \left(\mathrm{pen}(m) - 2\frac{D_m}{n}\right) + 2\left(\frac{D_m}{n} - p(m)\right) + 2\left(p(\hat{m}) - \frac{D_{\hat{m}}}{n}\right) + \left(2\frac{D_{\hat{m}}}{n} - \mathrm{pen}(\hat{m})\right) + \delta(\hat{m}, m). \qquad (17)$$
It appears from (17) that we can obtain oracle inequalities with a penalty of order $2D_m/n$ if, uniformly over $m, m'$ in $\mathcal{M}_n$,
$$p(m) - \frac{D_m}{n} \ll \|s - \hat{s}_m\|^2 \qquad \text{and} \qquad \delta(m', m) \ll \|s - \hat{s}_m\|^2 + \|s - \hat{s}_{m'}\|^2.$$
Proposition 2.1 proves that the first part holds with large probability for all $m$ in $\mathcal{M}_n$ such that $e_m \vee v_m^2 \ll n\,\mathbb{E}(\|s - \hat{s}_m\|^2)$. Actually, the other part also holds under the same kind of assumption.

2.2 Main assumptions

For all $m, m'$ in $\mathcal{M}_n$, let
$$D_m = n\,\mathbb{E}\|s_m - \hat{s}_m\|^2, \qquad \frac{R_m}{n} = \mathbb{E}\|s - \hat{s}_m\|^2 = \|s - s_m\|^2 + \frac{D_m}{n},$$
$$v_{m,m'}^2 = \sup_{t \in S_m + S_{m'},\, \|t\| \le 1} \mathrm{Var}(t(X)), \qquad e_{m,m'} = \frac{1}{n} \sup_{t \in S_m + S_{m'},\, \|t\| \le 1} \|t\|_\infty^2.$$
For all $k \in \mathbb{N}$, let $\mathcal{M}_n^k = \{m \in \mathcal{M}_n,\ R_m \in [k, k+1)\}$. For all $n$ in $\mathbb{N}$, all $k > 0$, $k' > 0$ and $\gamma \ge 0$, let $[k]$ be the integer part of $k$ and let
$$l_{n,\gamma}(k, k') = \ln(1 + \mathrm{Card}(\mathcal{M}_n^{[k]})) + \ln(1 + \mathrm{Card}(\mathcal{M}_n^{[k']})) + \ln((k+1)(k'+1)) + (\ln n)^\gamma. \qquad (18)$$

Assumption [V]: There exist $\gamma > 1$ and a sequence $(\epsilon_n)_{n \in \mathbb{N}}$ with $\epsilon_n \to 0$ such that, for all $n$ in $\mathbb{N}$,
$$\sup_{(k,k') \in (\mathbb{N}^*)^2}\ \sup_{(m,m') \in \mathcal{M}_n^k \times \mathcal{M}_n^{k'}} \left[ \left( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \right)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right] l_{n,\gamma}^2(k, k') \le \epsilon_n^4.$$

[BR] There exist two sequences $(h_n)_{n \in \mathbb{N}}$ and $(h_n^o)_{n \in \mathbb{N}}$ with $(h_n^o \vee h_n) \to 0$ as $n \to \infty$ such that, for all $n$ in $\mathbb{N}$, for all $m_o \in \arg\min_{m \in \mathcal{M}_n} R_m$ and all $m' \in \arg\max_{m \in \mathcal{M}_n} D_m$,
$$\frac{R_{m_o}}{D_{m'}} \le h_n^o, \qquad \frac{n\|s - s_{m'}\|^2}{D_{m'}} \le h_n.$$

Comments:

• Assumption [V] ensures that the fluctuations of the ideal penalty are uniformly small compared to the risk of the estimator $\hat{s}_m$. Note that for all $k, k'$, $l_{n,\gamma}(k, k') \ge (\ln n)^\gamma$; thus, Assumption [V] holds only in typical nonparametric situations where $R_n = \inf_{m \in \mathcal{M}_n} R_m \to \infty$ as $n \to \infty$.

• The slope heuristic states that the complexity $\Delta_{\hat{m}}$ of the selected estimator is too large when the penalty term is too small. A minimal assumption for this heuristic to hold with $\Delta_m = D_m$ would be that there exists a sequence $(\theta_n)_{n \in \mathbb{N}}$ with $\theta_n \to 0$ as $n \to \infty$ such that, for all $n$ in $\mathbb{N}$, for all $m_o \in \arg\min_{m \in \mathcal{M}_n} \mathbb{E}\|s - \hat{s}_m\|^2$ and all $m' \in \arg\max_{m \in \mathcal{M}_n} \mathbb{E}\|s_m - \hat{s}_m\|^2$,
$$D_{m_o} \le \theta_n D_{m'}.$$
Assumption [BR] is slightly stronger but will always hold in the examples (see Section 3).

In order to have an idea of the rates $R_n$, $\epsilon_n$, $h_n$, $h_n^o$ and $\theta_n$, let us briefly consider the following very simple example.

Example HR: We assume that $s$ is supported in $[0, 1]$ and that $(S_m)_{m \in \mathcal{M}_n}$ is the collection of regular histograms on $[0, 1]$, with $d_m = 1, \ldots, n$ pieces. We will see in Section 3.2 that $D_m \sim d_m$ asymptotically, hence $D_{m'} \simeq n$. Moreover, we assume that $s$ is Hölderian and not constant, so that there exist positive constants $c_l$, $c_u$, $\alpha_l$, $\alpha_u$ such that, for all $m$ in $\mathcal{M}_n$ (see for example Arlot [4]),
$$c_l\, d_m^{-\alpha_l} \le \|s - s_m\|^2 \le c_u\, d_m^{-\alpha_u}.$$
In Section 3.2, we prove that this assumption implies [V] with $\epsilon_n \le C \ln(n)\, n^{-1/(8\alpha_l + 4)}$. Moreover, there exists a constant $C > 0$ such that $R_{m_o} \le \inf_{m \in \mathcal{M}_n}(c_u n d_m^{-\alpha_u} + d_m) \le C n^{1/(2\alpha_u + 1)}$, thus $R_{m_o}/D_{m'} \le C n^{1/(2\alpha_u+1) - 1} = C n^{-2\alpha_u/(2\alpha_u+1)}$. Since there exists $C > 0$ such that $n\|s - s_{m'}\|^2/D_{m'} \le C d_{m'}^{-\alpha_u} = C n^{-\alpha_u}$, [BR] holds with $h_n^o = C n^{-2\alpha_u/(2\alpha_u+1)}$ and $h_n = C n^{-\alpha_u}$.

Other examples can be found in Birgé & Massart [8]; see also Section 3.

2.3 Results on the Slope Heuristic

Let us now turn to the slope heuristic presented in Section 1.2.1.

Theorem 2.2 (Minimal penalty) Let $\mathcal{M}_n$ be a collection of models satisfying [V] and [BR] and let $\epsilon_n' = \epsilon_n \vee h_n$. Assume that there exists $0 < \delta_n < 1$ such that $0 \le \mathrm{pen}(m) \le (1 - \delta_n) D_m/n$. Let $\hat{m}, \tilde{s}$ be the random variables defined in (5) and let
$$c_n = \frac{\delta_n - 28\epsilon_n'}{1 + 16\epsilon_n'}.$$
There exists a constant $C > 0$ such that
$$\mathbb{P}\left( D_{\hat{m}} \ge c_n D_{m'},\ \|s - \tilde{s}\|^2 \ge \frac{c_n}{5 h_n^o} \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 \right) \ge 1 - C e^{-\frac{1}{2}(\ln n)^\gamma}, \qquad (19)$$
where $m'$ is as in Assumption [BR].

Comments: Assume that $\mathrm{pen}(m) \le (1 - \delta) D_m/n$; then inequality (19) proves that an oracle inequality cannot be obtained, since $c_n/h_n^o \to \infty$. Moreover, $D_{\hat{m}} \ge c_n D_{m'}$ is as large as possible. This proves point 1 of the slope heuristic.

Theorem 2.3 Let $\mathcal{M}_n$ be a collection of models satisfying Assumption [V]. Assume that there exist $\delta_+ \ge \delta_- > -1$ and $0 \le p < 1$ such that, with probability at least $1 - p$,
$$2\frac{D_m}{n} + \delta_- \frac{R_m}{n} \le \mathrm{pen}(m) \le 2\frac{D_m}{n} + \delta_+ \frac{R_m}{n}.$$
Let $\hat{m}, \tilde{s}$ be the random variables defined in (5) and let
$$C_n(\delta_-, \delta_+) = \left( \frac{1 + \delta_- - 46\epsilon_n}{1 + \delta_+ + 26\epsilon_n} \vee 0 \right)^{-1}.$$
There exists a constant $C > 0$ such that, with probability larger than $1 - p - C e^{-\frac{1}{2}(\ln n)^\gamma}$,
$$D_{\hat{m}} \le C_n(\delta_-, \delta_+)\, R_{m_o}, \qquad \|s - \tilde{s}\|^2 \le C_n(\delta_-, \delta_+) \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2. \qquad (20)$$

Comments:

• Assume that $\mathrm{pen}(m) = K D_m/n$ with $K > 1$; then inequality (20) ensures that $D_{\hat{m}} \le C_n(K, K) R_{m_o}$. Hence, $D_{\hat{m}}$ jumps from $D_{m'}$ (Theorem 2.2) to $R_{m_o}$ (inequality (20)), which is much smaller thanks to Assumption [BR], when $\mathrm{pen}(m)$ is around $D_m/n$. This proves point 2 of the slope heuristic.

• Point 3 of this heuristic comes from inequality (20) applied with small $\delta_-$ and $\delta_+$. The rate of convergence of the leading constant to 1 is then given by the supremum of $\delta_-$, $\delta_+$ and $\epsilon_n$.

• The condition on the penalty has the same form as the one given in Arlot & Massart [5]. It comes from the fact that we do not know $D_m/n$ in many cases, so that it has to be estimated. We propose two alternatives to solve this issue. In Section 2.4, we give a resampling estimator of $D_m$. It can be used for every collection of models satisfying [V] and its approximation error is upper bounded by $\epsilon_n R_m/n$; thus Theorem 2.3 holds with $(\delta_- \vee \delta_+) \le C\epsilon_n$. In Section 3.2, we will also see that, in regular models, we can use $d_m$ instead of $D_m$, and the error is upper bounded by $C R_m/R_{m_o}$; thus Theorem 2.3 holds with $(\delta_- \vee \delta_+) \le C/R_{m_o} \ll \epsilon_n$, $p = 0$. In both cases, we deduce from Theorem 2.3 that the estimator $\tilde{s}$ given by the slope algorithm achieves an optimal oracle inequality $OTO(\kappa\epsilon_n, C e^{-\frac{1}{2}(\ln n)^\gamma})$. In Example HR, for instance, we obtain $\epsilon_n = C n^{-1/(8\alpha_l + 4)} \ln n$.

2.4 Resampling penalties

Optimal model selection is possible in density estimation provided that we have a sharp estimation of $D_m = n\,\mathbb{E}\left(\sup_{t \in B_m} (\nu_n(t))^2\right)$. We propose an estimator of this quantity based on the resampling heuristic. The model selection algorithm that we deduce is the same as the resampling penalization procedure introduced by Arlot [4]. Let $F$ be a fixed functional. Efron's heuristic states that the law $\mathcal{L}(F(\nu_n))$ is close to the conditional law $\mathcal{L}_W(C_W F(\nu_n^W))$, where $C_W$ is a normalizing constant depending only on the resampling scheme and the functional $F$. Let $P_n^W = \sum_{i=1}^n W_i \delta_{X_i}/n$ and $\nu_n^W = P_n^W - \bar{W}_n P_n$. The resampling estimator of $D_m$ is
$$D_m^W = n C_W^2\, \mathbb{E}_W\left( \sup_{t \in B_m} (\nu_n^W(t))^2 \right)$$
and the associated resampling penalty is $\mathrm{pen}(m) = 2D_m^W/n$. Actually, the following result describes the concentration of $D_m^W$ around its mean $D_m$ and around $n p(m)$.

Proposition 2.4 Let $(W_1, \ldots, W_n)$ be a resampling scheme, let $S_m$ be a linear space, $B_m = \{t \in S_m, \|t\| \le 1\}$, $p(m) = \sup_{t \in B_m} (\nu_n(t))^2$, $D_m = n\,\mathbb{E}(p(m))$ and let $D_m^W$ be the resampling estimator of $D_m$ based on $(W_1, \ldots, W_n)$, that is
$$D_m^W = n C_W^2\, \mathbb{E}_W\left( \sup_{t \in B_m} (\nu_n^W(t))^2 \right), \qquad \text{where } v_W^2 = \mathrm{Var}(W_1 - \bar{W}_n) \text{ and } C_W^2 = (v_W^2)^{-1}.$$
Then, for all $m$ in $\mathcal{M}_n$, $\mathbb{E}(D_m^W) = D_m$. Moreover, let $e_m, v_m$ be the quantities defined in (14). For all $x > 0$, on an event of probability larger than $1 - 7.8 e^{-x}$,
$$D_m^W - D_m \le \sqrt{8 e_m D_m x} + \frac{e_m\left(\frac{4x}{3} + (40.3 x)^2\right)}{n-1} + \frac{9 D_m^{3/4}(e_m x^2)^{1/4} + 7.61\sqrt{v_m^2 D_m x}}{n-1}, \qquad (21)$$
$$D_m^W - D_m \ge -\sqrt{8 e_m D_m x} - \frac{e_m\left(\frac{4x}{3} + (19.1 x)^2\right)}{n-1} - \frac{5.31 D_m^{3/4}(e_m x^2)^{1/4} + 3\sqrt{v_m^2 D_m x} + 3 v_m^2 x}{n-1}. \qquad (22)$$

For all $x > 0$,
$$\mathbb{P}\left( p(m) - \frac{D_m^W}{n} > \frac{5.31 D_m^{3/4}(e_m x^2)^{1/4} + 3\sqrt{v_m^2 D_m x} + 3 v_m^2 x + e_m (19.1 x)^2}{n-1} \right) \le 2 e^{-x}, \qquad (23)$$
$$\mathbb{P}\left( \frac{D_m^W}{n} - p(m) > \frac{9 D_m^{3/4}(e_m x^2)^{1/4} + 7.61\sqrt{v_m^2 D_m x} + e_m (40.3 x)^2}{n-1} \right) \le 3.8 e^{-x}. \qquad (24)$$

Remark: The concentration of the resampling estimator involves the same quantities as the concentration of $p(m)$; thus, it can be used to estimate the ideal penalty in the slope heuristic algorithm presented in the previous section, without extra assumptions on the collection $\mathcal{M}_n$. Proposition 2.4 and Theorem 2.3 prove that this resampling penalty leads to an efficient model selection procedure. However, we do not need to use the slope heuristic in our framework to obtain an optimal model selection procedure, as shown by the following theorem.

Theorem 2.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables with common density $s$. Let $\mathcal{M}_n$ be a collection of models satisfying Assumption [V]. Let $W_1, \ldots, W_n$ be a resampling scheme, let $\bar{W}_n = \sum_{i=1}^n W_i/n$, $v_W^2 = \mathrm{Var}(W_1 - \bar{W}_n)$ and $C_W = 2(v_W^2)^{-1}$. Let $\tilde{s}$ be the penalized least-squares estimator defined in (5) with
$$\mathrm{pen}(m) = C_W\, \mathbb{E}_W\left( \sup_{t \in B_m} (\nu_n^W(t))^2 \right).$$
Then, there exists a constant $C > 0$ such that
$$\mathbb{P}\left( \|s - \tilde{s}\|^2 \le (1 + 100\epsilon_n) \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 \right) \ge 1 - C e^{-\frac{1}{2}(\ln n)^\gamma}. \qquad (25)$$

Comments: The main advantage of this result is that the penalty term is always completely computable. Unlike the penalties derived from the slope heuristic, it does not depend on an arbitrary choice of a constant $K_{\min}$ made by the observer, which may be hard to detect in practice (see Arlot & Massart [5] for an extensive discussion of this important issue). However, $C_W$ is only optimal asymptotically. It is sometimes useful to overpenalize a little in order to improve the non-asymptotic performances of our procedures (see Massart [19]), and the slope heuristic can be used to do so in an optimal way (see our short simulation study in Section 4).

2.5 A remark on the "regularization phenomenon"

The regularization phenomenon of the bootstrap (see Arlot [3, 4] and the references therein) states that the resampling estimator $C_W \mathbb{E}_W(F(\nu_n^W))$ of a functional $F(\nu_n)$ concentrates around its mean better than $F(\nu_n)$ itself. This phenomenon can be justified with our previous results for our functional $F$. Recall that we have proven in Proposition 2.1 that, for all $x > 0$, with probability larger than $1 - 3.8 e^{-x/20}$,
$$\left| p(m) - \frac{D_m}{n} \right| \le \frac{1.8 D_m^{3/4}(e_m x^2)^{1/4} + 1.5\sqrt{D_m v_m^2 x} + 0.2 v_m^2 x + 4.1 e_m x^2}{n}.$$
In Example HR, we have the following upper bounds:
$$D_m \le d_m, \qquad e_m \le \frac{d_m}{n}, \qquad v_m^2 \le c\|s\|\sqrt{d_m}.$$

Thus, there exists a constant $C$ such that, for all $x > 0$,
$$\mathbb{P}\left( |n p(m) - D_m| > C d_m\left( \sqrt{\frac{x}{\sqrt{n}}} + \left(\frac{x}{\sqrt{n}}\right)^2 \right) \right) \le 3.8 e^{-x/20}. \qquad (26)$$
On the other hand, it comes from inequalities (21) and (22) that, for all $x > 0$, on an event of probability larger than $1 - 7.8 e^{-x/20}$,
$$|D_m^W - D_m| \le \sqrt{0.4 e_m D_m x} + \frac{e_m\left(\frac{x}{15} + 4.1 x^2\right)}{n-1} + \frac{1.8 D_m^{3/4}(e_m x^2)^{1/4} + 1.45\sqrt{v_m^2 D_m x} + 0.2 v_m^2 x}{n-1}.$$
Thus, there exists a constant $C$ such that, for all $x > 0$,
$$\mathbb{P}\left( |D_m^W - D_m| > C d_m\left( \sqrt{\frac{x}{n}} + \left(\frac{x}{n}\right)^2 \right) \right) \le 7.8 e^{-x/20}.$$
The concentration of $D_m^W$ is thus much better than that of $n p(m)$. This implies that $D_m^W$ is an estimator of $D_m$ rather than an estimator of $n p(m)$. Thus, the resampling penalty can be used when $D_m/n$ is a good penalty, for example under [V]. When $D_m/n$ is known to underpenalize (see the examples in Barron, Birgé & Massart [6]), there is no chance that $D_m^W/n$ can work.

3 Rates of convergence for classical examples

The aim of this section is to show that [V] can be derived from a more classical hypothesis for two classical collections of models: histograms and Fourier spaces. We derive the rates $\epsilon_n$ under this new hypothesis.

3.1 Assumption on the risk of the oracle

As mentioned in Section 2.2, Assumption [V] can only hold if there exists $\gamma > 1$ such that $R_n (\ln n)^{-\gamma} \to \infty$ as $n \to \infty$, where $R_n = \inf_{m \in \mathcal{M}_n} R_m$. In our examples, we will make the following assumption, which ensures that this condition is always satisfied.

[BR] (Bounds on the Risk) There exist constants $C_u > 0$, $\alpha_u > 0$, $\gamma > 1$, and a sequence $(\theta_n)_{n \in \mathbb{N}}$ with $\theta_n \to \infty$ as $n \to \infty$ such that, for all $n$ in $\mathbb{N}$ and all $m$ in $\mathcal{M}_n$,
$$\theta_n^2 (\ln n)^{2\gamma} \le R_n \le R_m \le C_u n^{\alpha_u}.$$

Comments: Assumption [BR] holds with $\theta_n = C n^\alpha$ for the collection of regular histograms of Example HR, provided that $s$ is a Hölderian, non-constant and compactly supported function (see for example Arlot [3]). It is also a classical result of minimax theory that there exist functions in Sobolev spaces satisfying this kind of assumption when $\mathcal{M}_n$ is the collection of Fourier spaces that we will introduce below.

We want to check that these collections satisfy Assumption [V], i.e. that there exists $\gamma > 1$ such that
$$\sup_{(k,k') \in (\mathbb{N}^*)^2}\ \sup_{(m,m') \in \mathcal{M}_n^k \times \mathcal{M}_n^{k'}} \left[ \left( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \right)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right] l_{n,\gamma}^2(k, k') \le \epsilon_n^4.$$

For all $m \in \mathcal{M}_n$, $R_m \le C_u n^{\alpha_u}$, thus for all $k > C_u n^{\alpha_u}$, $\mathrm{Card}(\mathcal{M}_n^k) = 0$. In particular, we can assume in the previous supremum that $k \le C_u n^{\alpha_u}$ and $k' \le C_u n^{\alpha_u}$. Hence, there exists a constant $\kappa > 0$ such that $\ln[(1+k)(1+k')] \le \kappa \ln n$. We also add the following assumption, which ensures that there exists a constant $\kappa > 0$ such that, for all $k \in \mathbb{N}$, $\ln(1 + \mathrm{Card}(\mathcal{M}_n^k)) \le \kappa \ln n$.

[PC] (Polynomial collection) There exist constants $c_{\mathcal{M}} \ge 0$, $\alpha_{\mathcal{M}} \ge 0$ such that, for all $n$ in $\mathbb{N}$,
$$\mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}}\, n^{\alpha_{\mathcal{M}}}.$$

Under Assumptions [BR] and [PC], there exists a constant $\kappa > 0$ such that, for all $\gamma > 1$ and $n \ge 3$,
$$\sup_{(k,k') \in (\mathbb{N}^*)^2}\ \sup_{(m,m') \in \mathcal{M}_n^k \times \mathcal{M}_n^{k'}} \left[ \left( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \right)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right] l_{n,\gamma}^2(k, k') \le \sup_{(m,m') \in (\mathcal{M}_n)^2} \left[ \left( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \right)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right] \kappa (\ln n)^{2\gamma}.$$

3.2 The histogram case

Let $(\mathbb{X}, \mathcal{X})$ be a measurable space. Let $(\mathcal{P}_m)_{m \in \mathcal{M}_n}$ be a collection of measurable partitions $\mathcal{P}_m = (I_\lambda)_{\lambda \in m}$ of subsets of $\mathbb{X}$ such that, for all $m \in \mathcal{M}_n$ and all $\lambda \in m$, $0 < \mu(I_\lambda) < \infty$. For $m$ in $\mathcal{M}_n$, the set $S_m$ of histograms associated with $\mathcal{P}_m$ is the set of functions which are constant on each $I_\lambda$, $\lambda \in m$; $S_m$ is a linear space. Setting, for all $\lambda \in m$, $\psi_\lambda = (\sqrt{\mu(I_\lambda)})^{-1} \mathbf{1}_{I_\lambda}$, the functions $(\psi_\lambda)_{\lambda \in m}$ form an orthonormal basis of $S_m$. Let us recall that, for all $m$ in $\mathcal{M}_n$,
$$D_m = \sum_{\lambda \in m} \mathrm{Var}(\psi_\lambda(X)) = \sum_{\lambda \in m} \left[ P(\psi_\lambda^2) - (P\psi_\lambda)^2 \right] = \sum_{\lambda \in m} \frac{P(X \in I_\lambda)}{\mu(I_\lambda)} - \|s_m\|^2. \qquad (27)$$
Moreover, from the Cauchy-Schwarz inequality, for all $x$ in $\mathbb{X}$ and all $m, m'$ in $\mathcal{M}_n$,
$$\sup_{t \in B_{m,m'}} t^2(x) \le \sum_{\lambda \in m \cup m'} \psi_\lambda^2(x), \qquad \text{thus} \qquad e_{m,m'} = \frac{1}{n} \sup_{\lambda \in m \cup m'} \frac{1}{\mu(I_\lambda)}. \qquad (28)$$
Finally, it is easy to check that, for all $m, m'$ in $\mathcal{M}_n$,
$$v_{m,m'}^2 = \sup_{\lambda \in m \cup m'} \mathrm{Var}(\psi_\lambda(X)) = \sup_{\lambda \in m \cup m'} \frac{P(X \in I_\lambda)(1 - P(X \in I_\lambda))}{\mu(I_\lambda)}. \qquad (29)$$
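Formulas (27)-(29) are easy to evaluate numerically for a given density; the following sketch (illustrative only, assuming a regular partition of [0, 1] and a density s supported on [0, 1]) computes $D_m$, $e_m$ and $v_m^2$ for the regular histogram model with d_m cells:

import numpy as np
from scipy.integrate import quad

def histogram_constants(s, d_m, n):
    """D_m, e_m and v_m^2 of (27)-(29) (with m = m') for the regular
    d_m-cell histogram model on [0, 1], for a density s on [0, 1]."""
    edges = np.linspace(0.0, 1.0, d_m + 1)
    mu = np.diff(edges)                                   # mu(I_lambda)
    p = np.array([quad(s, a, b)[0]                        # P(X in I_lambda)
                  for a, b in zip(edges[:-1], edges[1:])])
    norm_sm_sq = np.sum(p ** 2 / mu)                      # ||s_m||^2 = sum (P psi_lambda)^2
    D_m = np.sum(p / mu) - norm_sm_sq                     # equation (27)
    e_m = (1.0 / n) * (1.0 / mu.min())                    # equation (28)
    v_m_sq = np.max(p * (1.0 - p) / mu)                   # equation (29)
    return D_m, e_m, v_m_sq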

We will consider two particular types of histograms.

Example 1 [Reg]: $\mu$-regular histograms. For all $m$ in $\mathcal{M}_n$, $\mathcal{P}_m$ is a partition of $\mathbb{X}$ and there exist a family $(d_m)_{m \in \mathcal{M}_n}$ bounded by $n$ and two constants $c_{rh}$, $C_{rh}$ such that, for all $m$ in $\mathcal{M}_n$ and all $\lambda \in m$,
$$\frac{c_{rh}}{d_m} \le \mu(I_\lambda) \le \frac{C_{rh}}{d_m}.$$
The typical example here is the collection described in Example HR.

Example 2 [Ada]: Adapted histograms. There exist positive constants $c_r$, $C_{ah}$ such that, for all $m$ in $\mathcal{M}_n$ and all $\lambda \in m$, $\mu(I_\lambda) \ge c_r n^{-1}$ and
$$\frac{P(X \in I_\lambda)}{\mu(I_\lambda)} \le C_{ah}.$$
[Ada] is typically satisfied when $s$ is bounded on $\mathbb{X}$. Remark that the models satisfying [Ada] have finite dimension $d_m \le C n$, since
$$1 \ge \sum_{\lambda \in m} P(X \in I_\lambda) \ge C_{ah} \sum_{\lambda \in m} \mu(I_\lambda) \ge C_{ah}\, c_r\, d_m\, n^{-1}.$$

The example [Reg]. It comes from equations (27), (28), (29) and Assumption [Reg] that
$$C_{rh}^{-1} d_m - \|s_m\|^2 \le D_m \le c_{rh}^{-1} d_m - \|s_m\|^2,$$
$$e_{m,m'} \le c_{rh}^{-1} \frac{d_m \vee d_{m'}}{n}, \qquad v_{m,m'}^2 \le \sup_{t \in B_{m,m'}} \|t\|\, \|t\|_\infty\, \|s\| \le c_{rh}^{-1/2} \|s\| \sqrt{d_m \vee d_{m'}}.$$
Thus
$$\frac{e_{m,m'}}{R_m \vee R_{m'}} \le \frac{C_{rh}\, c_{rh}^{-1}\left[ (R_m \vee R_{m'}) + \|s\|^2 \right]}{n (R_m \vee R_{m'})} \le C n^{-1}.$$
If $D_m \vee D_{m'} \le \theta_n^2 (\ln n)^{2\gamma}$,
$$\frac{v_{m,m'}^2}{R_m \vee R_{m'}} \le \sqrt{C_{rh}\, c_{rh}^{-1}}\, \frac{\sqrt{D_m \vee D_{m'}} + \|s\|^2}{R_{m_o}} \le \frac{C}{\theta_n (\ln n)^\gamma}.$$
If $D_m \vee D_{m'} \ge \theta_n^2 (\ln n)^{2\gamma}$,
$$\frac{v_{m,m'}^2}{R_m \vee R_{m'}} \le \sqrt{C_{rh}\, c_{rh}^{-1}}\, \frac{\sqrt{D_m \vee D_{m'}} + \|s\|^2}{D_m \vee D_{m'}} \le \frac{C}{\theta_n (\ln n)^\gamma}.$$
There exists $\kappa > 0$ such that $\theta_n^2 (\ln n)^{2\gamma} \le \kappa n$, since for all $m$ in $\mathcal{M}_n$, $R_m \le n\|s - s_m\|^2 + c_{rh}^{-1} d_m \le (\|s\|^2 + c_{rh}^{-1}) n$. Hence Assumption [V] holds with $\gamma$ given in Assumption [BR] and $\epsilon_n = C \theta_n^{-1/2}$.

The example [Ada]. It comes from inequalities (28), (29) and Assumption [Ada] that, for all $m$ and $m'$ in $\mathcal{M}_n$,
$$e_{m,m'} \le c_r^{-1} \qquad \text{and} \qquad v_{m,m'}^2 \le C_{ah}.$$
Thus, there exists a constant $\kappa > 0$ such that
$$\sup_{(m,m') \in (\mathcal{M}_n)^2} \left[ \left( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \right)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right] \le \frac{\kappa}{\theta_n^2 (\ln n)^{2\gamma}}.$$
Therefore Assumption [V] also holds, with $\gamma$ given in Assumption [BR] and $\epsilon_n = \kappa \theta_n^{-1/2}$.

3.3 Fourier spaces

In this section, we assume that $s$ is supported in $[0, 1]$. We introduce the classical Fourier basis. Let $\psi_0 : [0,1] \to \mathbb{R}$, $x \mapsto 1$ and, for all $k \in \mathbb{N}^*$, define the functions
$$\psi_{1,k} : [0,1] \to \mathbb{R},\ x \mapsto \sqrt{2}\cos(2\pi k x), \qquad \psi_{2,k} : [0,1] \to \mathbb{R},\ x \mapsto \sqrt{2}\sin(2\pi k x).$$
For all $j$ in $\mathbb{N}^*$, let
$$m_j = \{0\} \cup \{(i, k),\ i = 1, 2,\ k = 1, \ldots, j\} \qquad \text{and} \qquad \mathcal{M}_n = \{m_j,\ j = 1, \ldots, n\}.$$
For all $m$ in $\mathcal{M}_n$, let $S_m$ be the space spanned by the family $(\psi_\lambda)_{\lambda \in m}$; $(\psi_\lambda)_{\lambda \in m}$ is an orthonormal basis of $S_m$ and, for all $j$ in $1, \ldots, n$, $d_{m_j} = 2j + 1$.
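For this collection the empirical coefficients of the projection estimator (2) are plain cosine and sine averages; a short illustrative sketch (not code from the paper):

import numpy as np

def fourier_projection_coefficients(X, j):
    """Empirical coefficients P_n psi_lambda of s_hat_{m_j} for the Fourier model m_j."""
    X = np.asarray(X)
    ks = np.arange(1, j + 1)
    c0 = 1.0                                                      # P_n psi_0
    cos_coeffs = np.sqrt(2) * np.cos(2 * np.pi * np.outer(ks, X)).mean(axis=1)
    sin_coeffs = np.sqrt(2) * np.sin(2 * np.pi * np.outer(ks, X)).mean(axis=1)
    return c0, cos_coeffs, sin_coeffs     # d_{m_j} = 2 j + 1 coefficients in total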

Let $j$ in $1, \ldots, n$; for all $x$ in $[0, 1]$,
$$\sum_{\lambda \in m_j} \psi_\lambda^2(x) = 1 + 2\sum_{k=1}^j \left[ \cos^2(2\pi k x) + \sin^2(2\pi k x) \right] = 1 + 2j = d_{m_j}.$$
Hence, for all $m$ in $\mathcal{M}_n$,
$$D_m = P\left( \sum_{\lambda \in m} \psi_\lambda^2 \right) - \|s_m\|^2 = d_m - \|s_m\|^2. \qquad (30)$$
It is also clear that, for all $m, m'$ in $\mathcal{M}_n$,
$$e_{m,m'} = \frac{d_m \vee d_{m'}}{n}, \qquad v_{m,m'}^2 \le \|s\| \sqrt{d_m \vee d_{m'}}. \qquad (31)$$
The collection of Fourier spaces of dimension $d_m \le n$ satisfies Assumption [PC], and the quantities $D_m$, $e_{m,m'}$ and $v_{m,m'}^2$ satisfy the same inequalities as in the collection [Reg]; therefore, [V] also follows from [BR] for this collection. We have obtained the following corollary of Theorem 2.5.

Corollary 3.1 Let $\mathcal{M}_n$ be either a collection of histograms satisfying Assumptions [PC]-[Reg] or [PC]-[Ada], or the collection of Fourier spaces of dimension $d_m \le n$. Assume that $s$ satisfies Assumption [BR] for some $\gamma > 1$ and $\theta_n \to \infty$. Then, there exist constants $\kappa > 0$ and $C > 0$ such that the estimator $\tilde{s}$ selected by a resampling penalty satisfies
$$\mathbb{P}\left( \|s - \tilde{s}\|^2 \le (1 + \kappa\theta_n^{-1/2}) \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 \right) \ge 1 - C e^{-\frac{1}{2}(\ln n)^\gamma}.$$

Comment: Assumption [BR] is hard to check in practice. We mentioned that it holds in Example HR provided that $s$ is Hölderian, non-constant and compactly supported (see Arlot [4]). It is also classical to build functions satisfying [BR] with the Fourier spaces in order to prove that the oracle reaches the minimax rate of convergence over some Sobolev balls; see for example Birgé & Massart [8], Barron, Birgé & Massart [6] or Massart [19]. In these cases, there exist $c > 0$, $\alpha > 0$ such that $\theta_n \ge c n^\alpha$. In more general situations, we can use the same trick as Arlot [4] and use our main theorem only for the models with dimension $d_m \ge (\ln n)^{4 + 2\gamma}$; they satisfy [BR] with $\theta_n = (\ln n)^2$, at least when $n$ is sufficiently large, because
$$\|s\|^2 + R_m \ge \|s\|^2 + D_m \ge c\, d_m \ge c (\ln n)^4 (\ln n)^{2\gamma}.$$
With our concentration inequalities, we can easily control the risk of the models with dimension $d_m \le (\ln n)^{4 + 2\gamma}$ by $\kappa(\ln n)^{3 + 5\gamma/2}$ with probability larger than $1 - C e^{-\frac{1}{2}(\ln n)^\gamma}$, and we can then deduce the following corollary.

Corollary 3.2 Let $\mathcal{M}_n$ be either a collection of histograms satisfying Assumptions [PC]-[Reg] or [PC]-[Ada], or the collection of Fourier spaces of dimension $d_m \le n$. There exist constants $\kappa > 0$, $\eta > 3 + 5\gamma/2$ and $C > 0$ such that the estimator $\tilde{s}$ selected by a resampling penalty satisfies
$$\mathbb{P}\left( \|s - \tilde{s}\|^2 \le (1 + \kappa(\ln n)^{-1}) \left\{ \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 + \frac{(\ln n)^\eta}{n} \right\} \right) \ge 1 - C e^{-\frac{1}{2}(\ln n)^\gamma}.$$

4 Simulation study

We propose in this section to show the practical performances of the slope algorithm and the resampling penalties on two examples. We estimate the density
$$s(x) = \frac{3}{4} x^{-1/4} \mathbf{1}_{[0,1]}(x)$$
and we compare the three following methods.

1. The first one is the slope heuristic applied with the linear dimension $d_m$ of the models. We observe two main behaviors of $d_{\hat{m}(K)}$ with respect to $K$. Most of the time, we only observe one jump, as in Figure 1, and we find $K_{\min}$ easily.

[Figure 1: Classical behavior of $K \mapsto d_{\hat{m}(K)}$.]

We also observe more difficult situations, such as the one in Figure 2 below, where several jumps can be seen. In these cases, as prescribed in the regression framework by Arlot & Massart [5], we choose the constant $K_{\min}$ realizing the maximal jump of $d_{\hat{m}(K)}$. Arlot & Massart [5] also proposed to select $K_{\min}$ as the minimal $K$ such that $d_{\hat{m}(K)} \le d_{m'} (\ln n)^{-1}$, where $m'$ is the most complex model, but they obtained worse performances of the selected estimator in their simulations.

We justify this method only for collections of models where $d_m \simeq K D_m$ for some constant $K$. We will see that it gives really good performances when this condition is satisfied.

2. The second method is the resampling-based penalization algorithm of Theorem 2.5. Note here that all the resampling penalties $D_m^W/n$ can easily be computed, without any Monte Carlo approximation. Actually, for every resampling scheme,
$$\frac{D_m^W}{n} = \frac{1}{n} \sum_{\lambda \in m} \left[ P_n \psi_\lambda^2 - \frac{1}{n(n-1)} \sum_{i \ne j} \psi_\lambda(X_i)\psi_\lambda(X_j) \right].$$
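This closed form makes the penalty a simple computation of order n|m|; a minimal sketch (illustrative only, assuming each model is given by its basis functions evaluated at the data):

import numpy as np

def resampling_penalty(Psi):
    """2 * D_m^W / n from the closed form above.

    Psi: array of shape (n, |m|) with Psi[i, k] = psi_lambda_k(X_i) for an
         orthonormal basis (psi_lambda) of S_m.
    """
    n = Psi.shape[0]
    sq_means = (Psi ** 2).mean(axis=0)                 # P_n psi_lambda^2
    col_sums = Psi.sum(axis=0)
    off_diag = col_sums ** 2 - (Psi ** 2).sum(axis=0)  # sum over i != j of psi(X_i) psi(X_j)
    D_W_over_n = np.sum(sq_means - off_diag / (n * (n - 1))) / n
    return 2.0 * D_W_over_n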

Resampling penalties always give good approximations of $D_m$. However, in non-asymptotic situations, it may be useful to overpenalize a little in order to improve the leading constants in the oracle inequality (in Theorem 2.3, imagine that $46\epsilon_n$ is very close to 1).

3. In a third method, we therefore propose to use the slope algorithm applied with the complexity $D_m^W$. In this way, we hope to overpenalize the resampling penalty a little when it is necessary.

4.1 Example 1: regular case

In the first example, we consider the collection of regular histograms described in Example HR and we observe $n = 100$ data points. In this example, we saw that $D_m^W \simeq D_m \simeq d_m$. We can actually verify in Figure 2 that these quantities almost coincide for the selected model.

[Figure 2: Comparison of $d_m$ and $D_m^W$ on the selected model; the two panels show the dimension of the selected model and the resampling estimation of $D_m$ as functions of the constant $K$.]

We compute $N = 1000$ times the oracle constant $c = \|s - \tilde{s}\|^2 / \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2$ for the three methods. We report in the following table the mean, the median and the 0.95-quantile $q_{0.95}$ of these quantities.

method | mean of the N constants c | median | q_0.95
slope + d_m | 3.56 | 2.30 | 10.07
resampling | 4.43 | 2.52 | 15.47
resampling + slope | 3.57 | 2.21 | 10.86

We observe that the slope algorithm improves on the resampling penalty in practice. This may be due to a slight overpenalization, even if it is not a straightforward consequence of our theoretical results. Note that, as $d_m \simeq D_m^W$, the slope algorithm leads to the same results when applied with $d_m$ or with $D_m^W$. Although we have an explicit formula to compute the resampling penalties, the computation time is much longer if we use $D_m^W$. Therefore, we clearly recommend using the slope algorithm with $d_m$ for regular collections of models, such as the regular histograms or the Fourier spaces described in Section 3.3.
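The Monte Carlo protocol above is easy to reproduce; here is a sketch under our assumptions (illustrative only: it samples from $s$ by inverse transform, since $F(x) = x^{3/4}$ on [0, 1], and evaluates the exact risk of each regular histogram estimator, which gives the denominator of the oracle constant c; the selection step itself would use the routines sketched earlier):

import numpy as np

def cell_probs(edges):
    # Exact cell probabilities for s(x) = (3/4) x^{-1/4} on [0, 1]: F(x) = x^{3/4}.
    return np.diff(edges ** 0.75)

def risk_regular_histogram(X, d_m, norm_s_sq=9.0 / 8.0):
    """Exact L2 risk ||s - s_hat_m||^2 of the regular d_m-cell histogram estimator,
    using ||s - s_hat_m||^2 = ||s||^2 - sum_lambda (2 p_hat p - p_hat^2) / mu."""
    n = len(X)
    edges = np.linspace(0.0, 1.0, d_m + 1)
    p = cell_probs(edges)                        # P(X in I_lambda)
    p_hat = np.histogram(X, bins=edges)[0] / n   # P_n(X in I_lambda)
    mu = 1.0 / d_m
    return norm_s_sq - np.sum(2.0 * p_hat * p - p_hat ** 2) / mu

rng = np.random.default_rng(0)
X = rng.uniform(size=100) ** (4.0 / 3.0)         # inverse transform: F^{-1}(u) = u^{4/3}
risks = {d: risk_regular_histogram(X, d) for d in range(1, 101)}
oracle_risk = min(risks.values())                # denominator of the oracle constant c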

4.2 Example 2: a more complicated collection

In the next example, we want to show that the linear dimension should not be used in general. Let us consider a slightly more complicated collection. Let $k, J_1, J_2, n$ be four nonzero integers satisfying $k \le n$, $J_1 \le k$, $J_2 \le n - k$. We denote by $S_{k, J_1, J_2, n}$ the linear space of histograms on the following partition:
$$\left\{ \left[ l\frac{k}{J_1 n},\ (l+1)\frac{k}{J_1 n} \right),\ l = 0, \ldots, J_1 - 1 \right\} \cup \left\{ \left[ \frac{k}{n} + l\frac{1 - k/n}{J_2},\ \frac{k}{n} + (l+1)\frac{1 - k/n}{J_2} \right),\ l = 0, \ldots, J_2 - 1 \right\}.$$

Let $n \in \mathbb{N}^*$ and let $\mathcal{M}_n = \{(k, J_1, J_2) \in (\mathbb{N}^*)^3;\ k \le n,\ J_1 \le k,\ J_2 \le n - k\}$. It is clear that $\mathrm{Card}(\mathcal{M}_n) \le n^3$. The oracle of this collection is better than the previous one, since the regular histograms belong to $(S_{m,n})_{m \in \mathcal{M}_n}$. It is easy to check that the dimension of $S_{k,J_1,J_2,n}$ is equal to $J_1 + J_2$ and that $D_{k,J_1,J_2,n}$ is equal to
$$\frac{n J_1}{k} F\!\left(\frac{k}{n}\right) + \frac{n J_2}{n-k} \left(1 - F\!\left(\frac{k}{n}\right)\right) - \|s_{k,J_1,J_2,n}\|^2/n,$$
where $F$ is the distribution function of the observations. Hence, unlike in the previous example, there is no constant $K_o$ such that $K_o\, d_{k,J_1,J_2,n} \simeq D_{k,J_1,J_2,n}$. Figure 3 illustrates this fact on the selected model.
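For reference, the breakpoints of the partition defining $S_{k,J_1,J_2,n}$ are easy to generate; a small illustrative sketch (the helper name is hypothetical):

import numpy as np

def partition_edges(k, J1, J2, n):
    """Breakpoints of the partition defining S_{k, J1, J2, n}: J1 regular cells
    on [0, k/n) followed by J2 regular cells on [k/n, 1)."""
    left = np.linspace(0.0, k / n, J1 + 1)
    right = np.linspace(k / n, 1.0, J2 + 1)
    return np.concatenate([left, right[1:]])   # J1 + J2 cells in total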

[Figure 3: Comparison of $d_m$ and $D_m^W$ on the selected model; the two panels show the dimension of the selected model and the resampling estimation of $D_m$ as functions of the constant $K$.]

We also compute $N = 1000$ times the oracle constant $c = \|s - \tilde{s}\|^2 / \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2$ for the three methods, taking $n = 100$ observations each time. The results are summarized in the following table.

method | mean of the N constants c | median | q_0.95
slope + d_m | 8.30 | 7.01 | 19.73
resampling | 6.11 | 5.08 | 13.52
resampling + slope | 5.33 | 4.04 | 12.92

The slope heuristic gives bad results when applied with $d_m$. This is due to the fact that $d_m$ is not proportional to $D_m$ here. The resampling-based penalty $2D_m^W/n$ is much better and, as in the regular case, it is clearly improved by the slope algorithm. Therefore, for general collections of models where we do not know an optimal shape of the ideal penalty, we recommend applying the slope algorithm with a complexity equal to $D_m^W$.


5 Proofs

5.1 Proof of Proposition 2.1

It is a straightforward application of Corollary 6.6 in the appendix.

5.2 Technical lemmas

Before giving the proofs of the main theorems, we state and prove some important technical lemmas that we will use repeatedly throughout the proofs. Let us recall the main notations. For all $m, m'$ in $\mathcal{M}_n$,
$$p(m) = \|s_m - \hat{s}_m\|^2, \qquad D_m = n\,\mathbb{E}(p(m)) = n\,\mathbb{E}\|\hat{s}_m - s_m\|^2,$$
$$R_m = n\,\mathbb{E}\|s - \hat{s}_m\|^2 = n\|s - s_m\|^2 + D_m, \qquad \delta(m, m') = \nu_n(s_m - s_{m'}).$$

For all $n \in \mathbb{N}^*$, $k > 0$, $k' > 0$, $\gamma > 0$, let $[k]$ be the integer part of $k$ and let
$$l_{n,\gamma}(k, k') = \ln\left((1 + \mathrm{Card}(\mathcal{M}_n^{[k]}))(1 + \mathrm{Card}(\mathcal{M}_n^{[k']}))\right) + \ln((1+k)(1+k')) + (\ln n)^\gamma.$$
Recall that Assumption [V] implies that, for all $m, m'$ in $\mathcal{M}_n$,
$$v_{m,m'}^2\, l_{n,\gamma}(R_m, R_{m'}) \le \epsilon_n^2 (R_m \vee R_{m'}), \qquad e_{m,m'}\, (l_{n,\gamma}(R_m, R_{m'}))^2 \le \epsilon_n^4 (R_m \vee R_{m'}). \qquad (32)$$
Let us first prove a simple result.

Lemma 5.1 For all $K > 1$,
$$\Sigma(K) = \sum_{k \in \mathbb{N}^*} \sum_{m \in \mathcal{M}_n^k} e^{-K[\ln(1 + \mathrm{Card}(\mathcal{M}_n^k)) + \ln(1+k)]} < \infty. \qquad (33)$$
For all $m$ in $\mathcal{M}_n$, let $l_m = l_{n,\gamma}(R_m, R_m)$; then, for all $K > 1/\sqrt{2}$,
$$\sum_{m \in \mathcal{M}_n} e^{-K^2 l_m} \le \Sigma(2K^2)\, e^{-K^2 (\ln n)^\gamma}. \qquad (34)$$
For all $m, m'$ in $\mathcal{M}_n$, let $l_{m,m'} = l_{n,\gamma}(R_m, R_{m'})$; then, for all $K > 1$,
$$\sum_{(m,m') \in (\mathcal{M}_n)^2} e^{-K^2 l_{m,m'}} \le (\Sigma(K^2))^2\, e^{-K^2 (\ln n)^\gamma}. \qquad (35)$$

Proof: Inequality (33) comes from the fact that, when $K > 1$,
$$\forall k \in \mathbb{N}^*, \quad \sum_{m \in \mathcal{M}_n^k} e^{-K \ln(1 + \mathrm{Card}(\mathcal{M}_n^k))} \le 1, \qquad \text{and} \qquad \sum_{k \in \mathbb{N}^*} e^{-K \ln k} < \infty.$$
For every integer $k$ such that $\mathcal{M}_n^k \ne \emptyset$ and all $m$ in $\mathcal{M}_n^k$, $l_m \ge 2[\ln(1 + \mathrm{Card}(\mathcal{M}_n^k)) + \ln(1+k)] + (\ln n)^\gamma$; thus, for all $K > 1/\sqrt{2}$, it comes from (33) that
$$\sum_{m \in \mathcal{M}_n} e^{-K^2 l_m} \le e^{-K^2 (\ln n)^\gamma} \sum_{k \in \mathbb{N}^*} \sum_{m \in \mathcal{M}_n^k} e^{-2K^2[\ln(1 + \mathrm{Card}(\mathcal{M}_n^k)) + \ln(1+k)]} \le \Sigma(2K^2)\, e^{-K^2 (\ln n)^\gamma}.$$
