
www.imstat.org/aihp 2013, Vol. 49, No. 3, 900–914

DOI: 10.1214/12-AIHP485

© Association des Publications de l’Institut Henri Poincaré, 2013

Spatially adaptive density estimation by localised Haar projections

Florian Gach^a, Richard Nickl^a and Vladimir Spokoiny^{b,1}

^a Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, CB3 0WB, Cambridge, UK. E-mail: r.nickl@statslab.cam.ac.uk

^b Weierstrass Institute for Applied Analysis and Stochastics, Mohrenstrasse 39, 10117 Berlin, Germany, Humboldt University Berlin and Moscow Institute of Physics and Technology. E-mail: spokoiny@wias-berlin.de

Received 15 March 2011; revised 10 January 2012; accepted 21 February 2012

Abstract. Given a random sample from some unknown density $f_0:\mathbb R\to[0,\infty)$ we devise Haar wavelet estimators for $f_0$ with variable resolution levels constructed from localised test procedures (as in Lepski, Mammen and Spokoiny (Ann. Statist. 25 (1997) 927–947)). We show that these estimators satisfy an oracle inequality that adapts to heterogeneous smoothness of $f_0$, simultaneously for every point $x$ in a fixed interval, in sup-norm loss. The thresholding constants involved in the test procedures can be chosen in practice under the idealised assumption that the true density is locally constant in a neighborhood of the point $x$ of estimation, and an information-theoretic justification of this practice is given.

Résumé. Starting from a sample drawn from a distribution with density $f_0:\mathbb R\to[0,\infty)$, we construct Haar wavelet estimators of $f_0$ whose resolution levels vary and are built from localised tests (as in Lepski, Mammen and Spokoiny (Ann. Statist. 25 (1997) 927–947)). We show that these estimators satisfy an oracle inequality that is adaptive with respect to the possibly heterogeneous regularity of $f_0$, simultaneously for every point $x$ in a given interval, in the supremum norm. The thresholding constants used in the test procedures can be chosen in practice under the idealised assumption that the true density is locally constant in a neighborhood of the point $x$ under consideration, a practice which we justify by an information-theoretic argument.

MSC: 62G05

Keywords: Spatial adaptation; Propagation condition

1. Introduction

One of the most enduring challenges in statistical function estimation is to devise procedures that adapt to the locally variable complexity of the unknown function. For example, if one observes a random sample $X_1,\dots,X_n$ with density $f_0:\mathbb R\to\mathbb R$, then $f_0$ may exhibit spatially inhomogeneous smoothness: The density could be infinitely differentiable on most of its support except for a few points $x_m$ where it behaves locally like $|x-x_m|^{\alpha_m}$ for some distinct numbers $\alpha_m$. The location of the irregular points $x_m$ will usually not be known, and neither the corresponding degree of smoothness $\alpha_m$. Moreover $f_0$ could possess a so-called multifractal behavior, changing its Hölder exponents continuously on its domain of definition – in fact, as shown in Jaffard [9], 'typical' functions in the Besov spaces usually considered in

1 Supported in part by the Laboratory for Structural Methods of Data Analysis in Predictive Modeling, MIPT, RF government grant, ag. 11.G34.31.0073. Financial support by the German Research Foundation (DFG) through the Collaborative Research Center 649 "Economic Risk" is also gratefully acknowledged.


nonparametric statistics are always multifractal. Donoho and Johnstone [1] and Donoho, Johnstone, Kerkyacharian and Picard [2,3] have suggested that methods based on wavelet shrinkage can, to a certain extent, adapt to spatially inhomogeneous complexity of the unknown function $f_0$. Moreover, Lepski, Mammen and Spokoiny [10] showed that this is not intrinsic to wavelet methods, and that similar spatial adaptation results can be proved for kernel methods based on locally variable bandwidth choices.

There are several ways in which one can measure spatial adaptivity of an estimator. A minimal requirement may be to devise a rule $\hat f_n(x)$ that estimates $f_0(x)$ in an optimal way at every point $x$, and the methods suggested in [1] and [10] meet this requirement. These procedures depend on the point $x$, and the natural question arises as to how a given procedure performs globally as an estimator for $f_0$. To address this question, Donoho et al. [3] and Lepski et al. [10] considered global $L^r$-loss, $r<\infty$, and argued that taking $L^r$-loss over Besov bodies $B(s,p,q)$ where smoothness is measured in $L^p$, $r>p$, gives a way to assess the spatial performance of an estimator. A probably more transparent approach to the spatial adaptation problem is to consider sup-norm loss for estimators with locally variable bandwidths: one aims to find an estimator $\hat f_n(x)$ that is locally optimal for estimating $f_0(x)$, and simultaneously so for all $x$. This approach was not considered in the literature so far – the results [5–8] address the spatially homogeneous setting only.

A first contribution of this article is to show that a dyadic histogram estimator with variable bin size spatially adapts to possibly inhomogeneous local Hölder smoothness of $f_0$, in global sup-norm loss. More precisely, for $K(x,y)$ the Haar wavelet projection kernel, we shall construct
$$\hat f_n(x)=\frac{2^{\hat j_n(x)}}{n}\sum_{i=1}^nK\bigl(2^{\hat j_n(x)}x,\,2^{\hat j_n(x)}X_i\bigr),$$
where $\hat j_n(x)$ is a variable resolution level that depends both on $x$ and the sample, and show that the random variable
$$\sup_x\frac1{r(n,x,f_0)}\bigl|\hat f_n(x)-f_0(x)\bigr|$$
is stochastically bounded, where $r(n,x,f_0)$ is the optimal risk of an 'oracle estimator' for $f_0$ at the point $x$. We show moreover that this rate equals the pointwise minimax rate of adaptive estimation for $f_0(x)$ at every $x$, and that spatial adaptation occurs uniformly in $x$ except near discontinuities of the Hölder exponent function $t(f,x)$; see after Theorem 3 for a detailed discussion.

While this result shows that spatial adaptation is indeed possible in a strong theoretical way, a drawback shared by most results in the literature on adaptive estimation remains: The theoretical findings give no indication whatsoever as to how to choose the numerical constants in the thresholds that feature in shrinkage- or Lepski-test-based methods.

It has become common practice that thresholding constants are chosen according to simulation results where simulations are drawn as if the true underlying signal is very simple (say, uniform or piecewise constant). This practice had not had any general theoretical corroboration until recently, when Spokoiny and Vial [11] gave, in a simple Gaussian regression model, a certain justification based on the idea of 'propagation.' The results in [11] are heavily tied to the simplicity of the model used, in particular to the strong Gaussianity assumption employed, and to the fact that pointwise loss is considered. In the present paper we show how the ideas of [11] generalise, subject to some nontrivial modifications, to nonparametric density estimation. A key idea in the proofs in [11], translated into the density estimation context, is to replace the sampling distribution by a locally constant product measure. The 'transportation cost' of this replacement is easy to control in the Gaussian setting of [11], but in the density estimation case the fluctuations of the likelihood ratios between the unknown sampling distribution and relevant locally constant product measures do not obey a Gaussian regime, but turn out to be of Poisson type, so that the 'Gaussian intuitions' of [11] could be entirely misleading. We show however that the main information-theoretic idea of [11] remains sound in this Poissonian setting as well: We use a Lepski-type procedure to construct $\hat j_n(x)$, and we show that if we compute sharp thresholds for this procedure as if the true density $f_0$ belonged to a family $\mathcal F$ of locally constant densities, then the resulting estimator is spatially adaptive in sup-norm loss. In contrast to the results in [11], the rates of convergence we obtain for the risk of the final estimator are exact rate-adaptive.

While the techniques and results of this paper generalise in principle to more complex estimation problems that involve in particular adaptation to higher degrees of smoothness, we prefer to stay within the simpler setting of Haar wavelets, which allows for a clean exposition of the main ideas.


2. Uniform spatial adaptation using propagation methods

We will use the symbol $\|g\|_T$ to denote the supremum $\sup_{t\in T}|g(t)|$ of a function $g$ over some set $T$, but we will still use the symbol $\|g\|_\infty$ to denote $\sup_{x\in\mathbb R}|g(x)|$ if no confusion can arise.

For any $j\in\mathbb N$, we define a dyadic partition of $(0,1]$ into $2^j$-many disjoint subintervals by setting $I_{j,k}=(k2^{-j},(k+1)2^{-j}]$, $k=0,\dots,2^j-1$; and for $0<x\le1$ we denote by $I_{j,k(x)}$ the unique interval containing $x$. For $j\in\mathbb N$, $k=0,\dots,2^j-1$, let $V_{j,k}$ be the space of all bounded density functions on $\mathbb R$ that are constant on $I_{j,k}$. Via the local projections
$$K_{j,x}(f)(z):=\begin{cases}2^j\int_{I_{j,k(x)}}f(y)\,dy&\text{if }z\in I_{j,k(x)},\\ f(z)&\text{otherwise},\end{cases}$$
we map any bounded density $f$ onto $V_{j,k(x)}$. (Note that $K_{j,x}(f)$ is indeed a density since $K_{j,x}(f)$ and $f$ assign the same probability to the interval $I_{j,k(x)}$.) For $f\in V_{j,k}$ and $j'\ge j$ we clearly have $K_{j',x}(f)=f$.
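To fix ideas, the following minimal numerical sketch (ours, not the paper's; NumPy assumed, with the integral approximated by a grid average) implements $K_{j,x}$ for a density given as a vectorised function on $(0,1]$:

```python
import numpy as np

def local_haar_projection(f, j, x, n_grid=10_001):
    """Sketch of the local projection K_{j,x}(f): replace f by its average
    over the dyadic interval I_{j,k(x)} = (k 2^{-j}, (k+1) 2^{-j}] containing x,
    and leave f unchanged outside this interval."""
    k = int(np.ceil(x * 2**j)) - 1            # k(x): the unique k with x in I_{j,k}
    lo, hi = k * 2.0**-j, (k + 1) * 2.0**-j
    grid = np.linspace(lo, hi, n_grid)
    avg = float(np.mean(f(grid)))             # approximates 2^j * integral of f over I_{j,k(x)}

    def Kf(z):
        z = np.asarray(z, dtype=float)
        return np.where((z > lo) & (z <= hi), avg, f(z))

    return Kf
```

Since averaging preserves the probability assigned to $I_{j,k(x)}$, the output is again a density (up to the quadrature error of the grid average).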

2.1. Estimation procedure

Let $X,X_1,\dots,X_n$ be i.i.d. with bounded density $f_0:\mathbb R\to[0,\infty)$, $n>1$. We wish to construct a single estimator which estimates $f_0(x)$ in an optimal way, uniformly so for points $x$ in the interval $(a,b]$. We shall take without loss of generality $(a,b]=(0,1]$, and we shall assume throughout that $f_0$ is bounded away from zero on $(0,1]$. Let $K(x,y)=\sum_k\phi(x-k)\phi(y-k)$ be the projection kernel based on the Haar wavelet $\phi=1_{(0,1]}$. We shall write $K_j(x,y)=2^jK(2^jx,2^jy)$, and the associated linear density estimator is the dyadic histogram estimator given by
$$f_n(j,x):=\frac1n\sum_{i=1}^nK_j(x,X_i).$$
We make the important observation that $E_ff_n(j,x)=2^jP_f(I_{j,k(x)})$, which directly follows from the identity $K_j(x,y)=2^j1_{I_{j,k(x)}}(y)$. If $f$ is constant on $I_{j,k(x)}$ this in particular implies $E_ff_n(j,x)=f(x)$. (In other words: for any locally (at $x$) constant density $f$ the bias of $f_n(j,x)$ equals zero if the resolution level is chosen fine enough.) We finally note that the estimator $f_n(j,x)$ by construction only depends on data points falling into $I_{j,k(x)}$. This amounts to $n2^{-j}$ being the 'effective' sample size for estimating $f_0$ at $x$.
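In code, $f_n(j,x)$ is nothing but a rescaled count of the observations in the dyadic interval containing $x$; a minimal sketch (our own helper, reused in the later snippets):

```python
import numpy as np

def f_n(X, j, x):
    """Dyadic histogram estimator f_n(j, x) = (1/n) sum_i K_j(x, X_i),
    i.e. 2^j times the fraction of the sample falling into I_{j,k(x)}."""
    k = int(np.ceil(x * 2**j)) - 1            # I_{j,k(x)} = (k 2^{-j}, (k+1) 2^{-j}]
    lo, hi = k * 2.0**-j, (k + 1) * 2.0**-j
    return 2**j * np.mean((X > lo) & (X <= hi))
```

The indicator in the last line is exactly the identity $K_j(x,y)=2^j1_{I_{j,k(x)}}(y)$ at work.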

2.2. Local choice of the resolution level

We fix $j_{\max}:=j_{\max,n}\in\mathbb N$ satisfying $2^{-j_{\max}}\simeq d(\log n)^2/n$ for some $d>0$. For thresholds $\zeta_n$ to be specified below, and for $J\in\mathbb N$, $J\le j_{\max}$ and $0<x\le1$, we define
$$\hat j_n(J,x)=\min\Bigl\{j\in\mathbb N,\ J\le j\le j_{\max}:\ \sqrt{n2^{-j'}}\,\bigl|f_n(j',x)-f_n(j,x)\bigr|\le\zeta_n\sqrt{f_n(j',x)}\ \text{for all }j',\ j<j'\le j_{\max}\Bigr\} \tag{1}$$
as well as
$$\hat j_n(x)=\hat j_n(0,x). \tag{2}$$
(If the condition in (1) is not met for any $j$, $J\le j\le j_{\max}$, we set $\hat j_n(J,x)=j_{\max}$.) Given the locally variable resolution level $\hat j_n$, we define the family of nonlinear estimators
$$\hat f_n(J,x):=f_n\bigl(\hat j_n(J,x),x\bigr),\qquad\hat f_n(x):=f_n\bigl(\hat j_n(x),x\bigr),\quad x\in[0,1]. \tag{3}$$
These are estimators for $f_0(x)$ based on a locally variable resolution level depending on $x$, and they are density-analogues of the estimators introduced in [10] in the context of the Gaussian white noise model. Note that by construction $\hat j_n(x)$ is a step function in $x$. Introducing the parameter $J$ will be useful in what follows – effectively, $\hat f_n(J,x)$ is a nonlinear estimator based on a search over the resolution levels $j\ge J$ that stops at $j_{\max}$.
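The rule (1)–(2) is a finite search over resolution levels and translates directly into code. The sketch below reuses the `f_n` helper from Section 2.1 and reflects our reading of the garbled display (1), in which the finer level $j'$ supplies the variance proxy $\sqrt{f_n(j',x)}$:

```python
import numpy as np

def j_hat(X, x, zeta_n, j_max, J=0):
    """Lepski-type local level hat j_n(J, x) of (1)-(2): the smallest
    j in [J, j_max] passing the test against every finer level j';
    defaults to j_max if no candidate passes."""
    n = len(X)
    f_lin = {j: f_n(X, j, x) for j in range(J, j_max + 1)}
    for j in range(J, j_max + 1):
        if all(
            np.sqrt(n * 2.0**-jp) * abs(f_lin[jp] - f_lin[j])
            <= zeta_n * np.sqrt(f_lin[jp])
            for jp in range(j + 1, j_max + 1)
        ):
            return j
    return j_max

def f_hat_n(X, x, zeta_n, j_max, J=0):
    """Nonlinear estimator hat f_n(J, x) = f_n(hat j_n(J, x), x) of (3)."""
    return f_n(X, j_hat(X, x, zeta_n, j_max, J), x)
```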


2.3. Threshold choice by propagation

One of the main challenges for all adaptive procedures is the choice of the thresholds $\zeta_n$ used in the tests defined in (1). Define the standardisation
$$\frac1{s_n(j,x)}:=\begin{cases}1/\sqrt{f_n(j,x)}&\text{if }f_n(j,x)>0,\\ 0&\text{otherwise}.\end{cases}$$
We suggest choosing the thresholds in such a way that the following condition is satisfied:

Condition 1. Let $\mathcal F_{j,k}$ be any triangular array of subsets of $V_{j,k}$, $j\le j_{\max}$, $k=0,\dots,2^j-1$, and let $k(m)$ be the unique $k$ such that $I_{j_{\max},m}\subseteq I_{j,k}$. We say that the thresholds $\zeta_n$ satisfy the uniform propagation condition $UP(\alpha,\mathcal F_{j,k})$ for some fixed $\alpha>0$ if for every $n$, every $j\le j_{\max}$, every $m=0,\dots,2^{j_{\max}}-1$, and every $f\in\mathcal F_{j,k(m)}$ we have that
$$E_f\sup_{x\in I_{j_{\max},m}}\max_{j\le j'\le j_{\max}}\frac{n2^{-j'}}{\log n}\biggl(\frac{\hat f_n(j',x)-f_n(j',x)}{s_n(j',x)}\biggr)^2\le\frac\alpha{n^22^{j_{\max}}}. \tag{4}$$

(Note that since $\hat j_n(j',x)\ge j'$ we have that $f_n(j',x)=0$ implies $\hat f_n(j',x)=0$ for the fully data-driven estimator $\hat f_n(j',\cdot)$, and so the error $|\hat f_n(j',x)-f_n(j',x)|$ is then 0.) An interpretation of this condition can be given along the following lines: For $0<x\le1$ the class $\mathcal F_{j,k(x)}$ contains only densities $f$ that can be exactly reconstructed on $I_{j,k(x)}$ by $\int K_j(x,y)f(y)\,dy$, so that the bias of the linear estimator $f_n(j,x)$ equals zero locally. In particular, any choice of the resolution level finer than $j$ will only increase the variance without reducing the bias, and we would want $\hat j_n(j,x)$ to detect that and equal, with large probability, $j$. This property of $\hat j_n$ will then be mirrored in the fact that $\hat f_n(j',x)-f_n(j',x)=0$ for every $j'\ge j$ on an event with large probability, in which case the l.h.s. of (4) is exactly equal to zero. The quantity $\alpha/(n^22^{j_{\max}})$ stands for the a priori expected tolerance for a probabilistic error of $\hat j_n$ to detect the 'correct' resolution level on each interval $I_{j_{\max},m}$ in this 'no-bias' situation.

The following lemma shows that Condition 1 is not empty and that thresholds $\zeta_n$ satisfying the uniform propagation condition exist. It shows furthermore that the thresholds can be taken to be of order $\sqrt{\log n}$ and independent of $f$, which will be crucial in understanding the adaptive properties of $\hat f_n$ below.

Lemma 1. Let $\mathcal F_{j,k}$ equal $V_{j,k}$ intersected with the set
$$\Bigl\{f:\ 0<\delta\le\inf_{0<x\le1}f(x),\ \|f\|_\infty\le M\Bigr\}$$
for some fixed $0<\delta,M<\infty$. Then for every given $\alpha>0$ there exists a numerical constant $\kappa>0$ that depends only on $\alpha$ such that for any threshold choice
$$\zeta_n\ge\kappa\sqrt{\log n}$$
the uniform propagation condition $UP(\alpha,\mathcal F_{j,k})$ is at least satisfied for $n$ larger than some index that only depends on $\delta$ and $M$.

While Lemma 1 proves the existence of thresholds of the order $\sqrt{\log n}$ under the uniform propagation condition – a fact that will be seen to imply adaptivity of $\hat f_n$ below – it does not suggest a practical choice of $\zeta_n$. Instead, this choice can be made by direct evaluation of (4), as follows: Condition 1 only concerns the local error bounds over small intervals $I_{j_{\max},m}$ on which the function $f$ is constant, which effectively means that it suffices to check this condition only for classes of densities which are constant on the interval of interest. The particular choice of the interval $I_{j_{\max},m}$ is unimportant. Secondly, all quantities in Condition 1 depend on known quantities after $f$ is chosen. By construction of the estimators $f_n$ and $\hat f_n$, the random variable featuring in (4) – we call it $T$ – only depends on the number of data points falling into each of the (uniquely determined) $j'$-fine intervals containing $I_{j_{\max},m}$, $j\le j'\le j_{\max}$. This observation allows for an easy computation of the l.h.s. of (4) along the following lines: Fix $0\le p\le1$. Then, for any $f\in\mathcal F_{j,k(m)}$ satisfying $2^{-j}f=p$ on $I_{j,k(m)}$, the number $Z$ of observations falling into the interval $I_{j,k(m)}$ is binomial $B(n,p)$. Conditionally on $Z=k$, take $k$-many independent random variables that are uniform on $I_{j,k(m)}$ and count the numbers of observations $V_{j'}$ in each of the $j'$-fine intervals. Then compute $f_n$, $\hat f_n$, and $T$. This shows that $T$ depends only on the $V_{j'}$, $j\le j'\le j_{\max}$, and that the l.h.s. of (4) is therefore equal to $E[E[T(V_j,\dots,V_{j_{\max}})|Z]]$.

The practical choice of $\zeta_n$ can then be obtained via a Monte Carlo simulation of (4) by choosing $\zeta_n$ as the smallest threshold for which (4) is satisfied in the simulation for one specific interval $I_{j_{\max},m}$, uniformly over the class of all densities constant on this interval. Given $j_{\max}$ and $\alpha$, this procedure has to be performed only for one fixed interval $I_{j_{\max},m}$, and then applies for every $m$ simultaneously.
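As a rough illustration of this recipe (our own sketch with illustrative parameter values, reusing the `f_n` and `f_hat_n` helpers above, and simulating directly from a uniform sample rather than via the conditional binomial/uniform construction described in the text), the following searches a grid of candidate thresholds for the smallest one meeting the tolerance $\alpha/(n^22^{j_{\max}})$ on one fixed interval $I_{j_{\max},m}$:

```python
import numpy as np

def calibrate_zeta(n, j_max, alpha, j=2, m=0, n_mc=200, n_x=16, rng=None):
    """Monte Carlo sketch of the threshold calibration: simulate uniform
    data (locally constant, hence zero bias on I_{j,k(m)}), evaluate the
    l.h.s. of (4) over a grid of points in I_{j_max, m}, and return the
    smallest candidate zeta meeting the tolerance alpha / (n^2 2^{j_max})."""
    rng = rng or np.random.default_rng(0)
    tol = alpha / (n**2 * 2.0**j_max)
    lo, hi = m * 2.0**-j_max, (m + 1) * 2.0**-j_max
    xs = np.linspace(lo, hi, n_x + 1)[1:]           # grid standing in for the sup over x
    samples = [rng.uniform(0.0, 1.0, n) for _ in range(n_mc)]
    for zeta in np.linspace(0.5, 5.0, 19):          # candidate thresholds, smallest first
        lhs = 0.0
        for X in samples:
            t_stat = 0.0
            for x in xs:
                for jp in range(j, j_max + 1):      # max over j <= j' <= j_max
                    f_lin = f_n(X, jp, x)
                    if f_lin > 0.0:
                        d = f_hat_n(X, x, zeta, j_max, J=jp) - f_lin
                        t_stat = max(t_stat, n * 2.0**-jp / np.log(n) * d * d / f_lin)
            lhs += t_stat / n_mc                    # Monte Carlo estimate of E_f[...]
        if lhs <= tol:
            return zeta
    return None                                     # no candidate on the grid passes (4)
```

Note that for a candidate $\zeta$ to pass, the propagation statistic must essentially vanish on all simulated samples, i.e. the data-driven level must reproduce the linear estimator at every start level $j'$ – which is exactly the 'propagation' behaviour the condition asks for.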

2.4. Local small bias condition

The idea behind Condition 1 is that we take 'idealised' classes of densities $\mathcal F$ for which we compute sharp thresholds $\zeta_n$. The danger arises that the true density $f_0$ may be very different from the elements in $\mathcal F$, which may lead to wrong thresholds (and inference). We have to assess the error that comes from replacing $f_0$ by an element from $\mathcal F$, in a neighborhood of a given point $x$. This can be fundamentally quantified in terms of the log-likelihood ratio between $f_0$ and its local (at $x$) approximand in $\mathcal F$. As we shall see, one of the deeper reasons behind the fact that propagation methods imply adaptation results is that this error can be related to the usual bias term in linear estimation.

Condition 2. Given real numbers $\Delta_{j,x}$, $0<x\le1$, $j\in\mathbb N\cup\{0\}$, satisfying $\Delta_{l',x}\le\Delta_{l,x}$ for every $l'>l$, we say that $f_0$ satisfies the local small bias condition at $x\in(0,1]$ and with $\Delta_{j,x}\equiv\Delta_{j,x}(f_0)$ if
$$\mathrm{Var}_{K_{j,x}(f_0)}\Bigl(\log\frac{f_0}{K_{j,x}(f_0)}\Bigr)\le\Delta_{j,x}(f_0)\quad\text{for all }j\in\mathbb N.$$

The local 'cost' of transporting a product measure $\prod_{i=1}^nf_0(x_i)$ to $\prod_{i=1}^nK_{j,x}(f_0)(x_i)$ can be quantified by $n$ times the variance featuring in the above condition, and we shall have to restrict ourselves to resolution levels $j$ for which this transportation cost is at most a fixed constant times the logarithm of the sample size $n$. The smallest resolution level for which this is still the case will be defined as $j(x)$: More precisely, for some fixed positive constant $\Delta$, define the local resolution level
$$j(x):=j(x,n,\Delta,f_0)=\min\bigl\{j\in\mathbb N:\ j\le j_{\max},\ n\Delta_{j,x}(f_0)\le\Delta\log n\bigr\}. \tag{5}$$

While this is an information-theoretic definition of $j(x)$, a key observation of this subsection is that it has the classical 'bias–variance' tradeoff generically built into it for suitable choices of $\Delta_{j,x}(f_0)$.

Lemma 2. Suppose $f_0$ is bounded by some finite number $M>0$ and that
$$\inf_{0<x\le1}f_0(x)\ge\delta>0.$$
Then $f_0$ satisfies Condition 2 with
$$\Delta_{j,x}(f_0)=\frac M{\delta^2}\,2^{-j}\,\bigl\|f_0-K_{j,x}(f_0)\bigr\|_\infty^2.$$

Proof. First, observe that $K_{j,x}(f_0)$ is bounded by $M$ and bounded below by $\delta>0$. Then, using that $K_{j,x}(f_0)$ coincides with $f_0$ outside of $I_{j,k(x)}$ and the inequality $|\log x-\log y|\le\max(x^{-1},y^{-1})|x-y|$, we get
$$\begin{aligned}\mathrm{Var}_{K_{j,x}(f_0)}\Bigl(\log\frac{f_0}{K_{j,x}(f_0)}\Bigr)&\le\int_{I_{j,k(x)}}\Bigl(\log\frac{f_0(y)}{K_{j,x}(f_0)(y)}\Bigr)^2K_{j,x}(f_0)(y)\,dy\\&\le\int_{I_{j,k(x)}}\max\bigl(f_0(y)^{-2},K_{j,x}(f_0)(y)^{-2}\bigr)\bigl(f_0(y)-K_{j,x}(f_0)(y)\bigr)^2K_{j,x}(f_0)(y)\,dy\\&\le\frac M{\delta^2}\int_{I_{j,k(x)}}\bigl(f_0(y)-K_{j,x}(f_0)(y)\bigr)^2\,dy\le\frac M{\delta^2}\,2^{-j}\,\bigl\|K_{j,x}(f_0)-f_0\bigr\|_\infty^2.\end{aligned}$$

The lemma shows that the quantity $(n/\log n)\,\Delta_{j,x}(f_0)$ can be viewed as the square of the 'bias divided by the variance' of linear projection estimators for $f_0(x)$. Hence, to choose the smallest $j\le j_{\max}$ such that $(n/\log n)\,\Delta_{j,x}(f_0)$ is still bounded by a fixed constant $\Delta$ means to locally balance the 'variance' and 'bias' term in the nonparametric setting.

To be more concrete, let us briefly discuss what this means in the classical situation where the bias is bounded by local regularity properties of the unknown density $f_0$. Since we are interested in spatial adaptation, we wish to take locally inhomogeneous smoothness into account by appealing to local Hölder conditions: Let $0<t\le1$ and let us say that a function $g:\mathbb R\to\mathbb R$ is locally $t$-Hölder at $x\in\mathbb R$ if for some $\eta>0$
$$\sup_{0<|m|\le\eta}\frac{|g(x+m)-g(x)|}{|m|^t}<\infty.$$
Define further a 'local' Hölder ball of bounded functions
$$C(t,x,L,\eta):=\Bigl\{g:\mathbb R\to\mathbb R,\ \max\Bigl(\|g\|_\infty,\ \sup_{0<|m|\le\eta}\frac{|g(x+m)-g(x)|}{|m|^t}\Bigr)\le L\Bigr\}.$$

Condition 2 then has the following more classical interpretation in terms of local smoothness properties of $f_0$:

Lemma 3. If $f_0\in C(t,x,L,\eta)$ for some $0<t\le1$, then the local bias $\|f_0-K_{j,x}(f_0)\|_\infty$ is bounded by $c2^{-jt}$ for some constant $c\equiv c(t,L,\eta)$. Furthermore, if
$$\inf_{0<x\le1}f_0(x)\ge\delta>0,$$
then Condition 2 is satisfied with
$$\Delta_{j,x}(f_0)=\frac{c^2L}{\delta^2}\,2^{-j(2t+1)}. \tag{6}$$

Proof. Let $y\in I_{j,k(x)}$ be arbitrary. Then, using the substitution $2^jz=2^jy-u$,
$$\begin{aligned}\bigl|f_0(y)-K_{j,x}(f_0)(y)\bigr|&=\Bigl|2^j\int_{I_{j,k(x)}}\bigl(f_0(y)-f_0(z)\bigr)\,dz\Bigr|\le\int_{-1}^1\bigl|f_0(y)-f_0(x)+f_0(x)-f_0\bigl(y-2^{-j}u\bigr)\bigr|\,du\\&\le2\bigl|f_0(y)-f_0(x)\bigr|+\int_{-1}^1\bigl|f_0(x)-f_0\bigl(y-2^{-j}u\bigr)\bigr|\,du.\end{aligned}$$
By definition of $x,y,I_{j,k(x)}$ we have $|y-x|\le2^{-j}$, and also $|y-2^{-j}u-x|\le2^{-j+1}$ by the triangle inequality, so that for $2^{-j+1}\le\eta$ the last quantity is bounded by $c_02^{-jt}$ in view of $f_0\in C(t,x,L,\eta)$. If $2^{-j}>\eta/2$, then the quantity in the last display can still be bounded by $6\|f_0\|_\infty\le6L$, so that choosing $c_1=6L(2/\eta)^t$ establishes the desired bound for $c=\max(c_0,c_1)$. To prove the second claim, apply Lemma 2.

Using the bound from the last lemma to verify Condition 2, we see that, by definition of $j(x)$ and for $f_0\in C(t,x,L,\eta)$,
$$\sqrt{\frac{n2^{-j(x)}}{\log n}}\asymp\Bigl(\frac n{\log n}\Bigr)^{t/(2t+1)} \tag{7}$$
is the locally (at $x$) optimal adaptive rate of convergence, so that the local small bias condition constructs a minimax optimal resolution level $j(x)$ at every $x\in[0,1]$.
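For completeness, here is the short balancing computation behind (7) (a standard step we make explicit, using $\Delta_{j,x}(f_0)=(c^2L/\delta^2)2^{-j(2t+1)}$ from Lemma 3):
$$n\,\Delta_{j(x),x}(f_0)\le\Delta\log n\ \Longleftrightarrow\ 2^{j(x)(2t+1)}\gtrsim\frac n{\log n}\ \Longrightarrow\ 2^{j(x)}\asymp\Bigl(\frac n{\log n}\Bigr)^{1/(2t+1)},$$
so that
$$\sqrt{\frac{n2^{-j(x)}}{\log n}}\asymp\Bigl(\frac n{\log n}\Bigr)^{\frac12\bigl(1-\frac1{2t+1}\bigr)}=\Bigl(\frac n{\log n}\Bigr)^{t/(2t+1)}.$$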

2.5. Main results

We now state the main results, starting with the following 'oracle' inequality. Note that the oracle $f_n(j(x),x)$ is not an estimator in itself as it depends on unknown quantities.

Theorem 1. Let $\hat f_n(\cdot)$ be the density estimator defined in (3) with thresholds $\zeta_n$ that satisfy the uniform propagation condition $UP(\alpha,\mathcal F_{j,k})$ for some $\mathcal F_{j,k}$. Suppose $f_0$ satisfies Condition 2 for every $0<x\le1$, and let $j(x)$ be as in (5). Then we have
$$E_{f_0}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\frac{|\hat f_n(x)-f_n(j(x),x)|}{s_n(j(x),x)}\le\frac{\zeta_n}{\sqrt{\log n}}+\frac{\sqrt\alpha}{\sqrt n}\,n^{\Delta e^{4U}} \tag{8}$$
for any $U$ satisfying
$$U\ge\sup_{0<x\le1}\Bigl\|\log\frac{f_0}{K_{j(x),x}(f_0)}\Bigr\|_\infty. \tag{9}$$

If $\zeta_n=O(\sqrt{\log n})$ – as follows under the conditions of Lemma 1 – and if one chooses $\Delta<1/2$ and $U$ as in the remark below, then the r.h.s. of (8) is $O(1)$ as $n$ tends to infinity. Theorem 1 thus implies that the estimator $\hat f_n$ with resolution levels chosen by the propagation approach is close to the linear 'oracle estimator' evaluated at the locally optimal resolution level $j(x)$, and this uniformly so on $(0,1]$.

Remark 1. If $\mathcal F_{j,k}$ is as in Lemma 1 and $f_0$ is bounded by $M$ and bounded below by $\delta$, we may apply Lemma 2 (using $2^{-j_{\max}}\ge d(\log n)^2/n$) to obtain the bound
$$\Bigl\|\log\frac{f_0}{K_{j(x),x}(f_0)}\Bigr\|_\infty=\Bigl\|\log\Bigl(1+\frac{f_0-K_{j(x),x}(f_0)}{K_{j(x),x}(f_0)}\Bigr)\Bigr\|_\infty\le\log\Bigl(1+\sqrt{\frac\Delta{dM\log n}}\Bigr),$$
which tends to zero as $n$ tends to infinity.

Our results then imply the following uniform spatial adaptation result:

Theorem 2. Assume that $f_0$ is bounded by $M$ and satisfies $\inf_{0<x\le1}f_0(x)\ge\delta>0$. Let $\hat f_n(\cdot)$ be the density estimator from (3) with thresholds $\zeta_n=O(\sqrt{\log n})$ that satisfy the uniform propagation condition $UP(\alpha,\mathcal F_{j,k})$ for $\mathcal F_{j,k}$ as in Lemma 1. Let $j(x)$ be as in (5) with $\Delta<1/2$ and with $\Delta_{j,x}$ as in Lemma 2. Then
$$\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\bigl|\hat f_n(x)-f_0(x)\bigr|=O_{\Pr_{f_0}}(1). \tag{10}$$

Thus the fully data-driven estimator $\hat f_n$ for $f_0$ achieves the locally optimal risk of the 'oracle' based on $j(x)$, uniformly at all points in $(0,1]$. If $j(x)$ – with $\Delta<1/2$ – is based on $\Delta_{j,x}$ as in Lemma 3, then (10) holds true and the 'oracle' rate is the adaptive locally minimax rate of convergence at every $0<x\le1$ where $f_0$ is locally $t$-Hölder with $0<t\le1$; see the discussion in Section 2.4 surrounding (7). This means that at any given point $x$ our estimator is rate-adaptive to local Hölder smoothness (with the usual $\log n$ penalty for adaptation).

One may ask further if spatial adaptation in the minimax sense occurs uniformly for every $x\in(0,1]$. A consequence of Theorem 2 is the following.


Theorem 3. Suppose the assumptions of Theorem 2 are satisfied and that the true density $f_0$ lies in $C(t(x),x,L(x),\eta(x))$, $0<x\le1$, for some $t(\cdot),L(\cdot),\eta(\cdot)$ that are bounded and uniformly bounded away from zero on $(0,1]$. Let $j(x)$ be as in (5) with $\Delta<1/2$ and with $\Delta_{j,x}$ as in Lemma 3. Then
$$\sup_{0<x\le1}\Bigl(\frac n{\log n}\Bigr)^{t(x)/(2t(x)+1)}\bigl|\hat f_n(x)-f_0(x)\bigr|=O_{\Pr_{f_0}}(1).$$

The assumptions on the functions $t,L,\eta$ need discussion. For densities that locally look like $|x-x_m|^{\alpha_m}$ we would wish to choose $t(x)$ equal to their pointwise Hölder exponents, $t(x_m)=\alpha_m$ and $t(x)=1$ otherwise, but then $\eta$ is not uniformly bounded away from zero for points $x$ near $x_m$. However, Theorem 3 holds for any choice of the functions $t,L,\eta$ for which $f_0$ satisfies $f_0\in C(t(x),x,L(x),\eta(x))$, $0<x\le1$. In other words, in the above example we can choose $t(x)=\alpha_m$ on the interval $(x_m-\eta_0,x_m+\eta_0)$ and $t(x)=1$ otherwise, where $\eta_0$ is some arbitrary lower bound for $\eta(x)$. This comes at the expense of not being adaptive near $x_m$, i.e., for $x\in(x_m-\eta_0,x_m+\eta_0)\setminus\{x_m\}$, which is sensible as we cannot expect adaptation for points $x$ arbitrarily close to $x_m$ from a finite sample. Inspection of the proofs (particularly the dependence on $\eta$ in Lemma 3) shows that, for fixed $n$, the above theorem holds for densities in $C(t(x),x,L(x),\eta(n))$, $0<x\le1$, where $\eta(n)$ can be taken of order $n^{-1/3}$, the binwidth corresponding to the maximal smoothness $t=1$ one wants to adapt to in our setting, and this is again reasonable: Hölder smoothness of $f_0$ in an interval $[x\pm r_n]$ where $r_n=o(n^{-1/3})$ does not allow to control the bias at $x$ with the locally optimal binwidth of order $n^{-1/3}$. By the same arguments multifractal densities $f_0$ which change their Hölder exponent continuously can be handled by taking $t(x)$ piecewise constant on a partition of $(0,1]$ into bins of size of order $n^{-1/3}$, the estimator achieving the local uniform minimax rate on each bin of the partition.

3. Proofs

3.1. Proof of Theorem 1

A first idea is to use a moment bound, localised at any point $x$ of estimation, on the log-likelihood ratio between $f_0$ and its approximand in $V_{j,k}$.

Lemma 4. If, for fixed $0<x\le1$,
$$\mathrm{Var}_{K_{j,x}(f_0)}\Bigl(\log\frac{f_0}{K_{j,x}(f_0)}\Bigr)\le D\,\frac{\log n}n \tag{11}$$
for some $0<D<\infty$ and every $n\in\mathbb N$, then, for every $n\in\mathbb N$,
$$E_{K_{j,x}(f_0)}\prod_{i=1}^n\biggl(\frac{f_0(X_i)}{K_{j,x}(f_0)(X_i)}\biggr)^2\le n^{2De^{4U}}$$
holds for any $U$ satisfying
$$U\ge\Bigl\|\log\frac{f_0}{K_{j,x}(f_0)}\Bigr\|_\infty.$$

Proof. Since the Kullback–Leibler distance
$$K\bigl(f_0,K_{j,x}(f_0)\bigr)=-E_{K_{j,x}(f_0)}\log\frac{f_0}{K_{j,x}(f_0)}\ge0$$
is nonnegative, we have
$$E_{K_{j,x}(f_0)}\prod_{i=1}^n\biggl(\frac{f_0(X_i)}{K_{j,x}(f_0)(X_i)}\biggr)^2\le\Bigl(E_{K_{j,x}(f_0)}e^{2(\log(f_0/K_{j,x}(f_0))-E_{K_{j,x}(f_0)}\log(f_0/K_{j,x}(f_0)))}\Bigr)^n$$
by the i.i.d. assumption. Using the power series expansion of the exponential function and that the variables in the exponent are centered, one easily bounds the previous display by
$$\Bigl(1+2De^{4U}\frac{\log n}n\Bigr)^n\le e^{2De^{4U}\log n}=n^{2De^{4U}}.$$
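The omitted step can be spelled out as follows (a routine computation we make explicit): write $W:=\log(f_0/K_{j,x}(f_0))-E_{K_{j,x}(f_0)}\log(f_0/K_{j,x}(f_0))$, so that $W$ is centered and $|W|\le2U$. Then
$$E\,e^{2W}=1+\sum_{k\ge2}\frac{E(2W)^k}{k!}\le1+4\,EW^2\sum_{k\ge2}\frac{(4U)^{k-2}}{k!}\le1+2\,EW^2e^{4U}\le1+2De^{4U}\frac{\log n}n,$$
using $k!\ge2(k-2)!$ and $EW^2\le D\log n/n$; raising this to the $n$-th power together with $1+y\le e^y$ gives the display.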

Here is the proof of Theorem 1: We first note that Condition 2 allows us to take $\Delta_{j,x}(f_0)$ to be constant on the intervals $I_{j,k}$. Consequently, $j(\cdot)$ from (5) is then constant on every interval $I_{j_{\max},m}$, and we set
$$j_m=\sup_{x\in I_{j_{\max},m}}j(x).$$

To prove the theorem, we split
$$\begin{aligned}E_{f_0}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\frac{|\hat f_n(x)-f_n(j(x),x)|}{s_n(j(x),x)}&\le E_{f_0}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\frac{|\hat f_n(x)-f_n(j(x),x)|}{s_n(j(x),x)}\,1_{\{\hat j_n(x)<j(x)\}}\\&\quad+E_{f_0}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\frac{|\hat f_n(x)-f_n(j(x),x)|}{s_n(j(x),x)}\,1_{\{\hat j_n(x)\ge j(x)\}}=:I+II\end{aligned}$$
according to whether $\hat j_n(x)$ comes to lie below the local resolution level $j(x)$ or not. By definition of $\hat j_n(x)$ in (1) one immediately has
$$I\le\frac{\zeta_n}{\sqrt{\log n}}.$$
About $II$: Define
$$S_m=\sup_{x\in I_{j_{\max},m}}\max_{j_m\le j\le j_{\max}}\sqrt{\frac{n2^{-j}}{\log n}}\,\frac{|\hat f_n(j,x)-f_n(j,x)|}{s_n(j,x)}. \tag{12}$$

Using that on the event $\{\hat j_n(x)\ge j(x)\}$ we necessarily have $\hat f_n(x)=\hat f_n(j(x),x)$, we see that
$$\begin{aligned}II&\le E_{f_0}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\frac{|\hat f_n(j(x),x)-f_n(j(x),x)|}{s_n(j(x),x)}\\&\le E_{f_0}\max_m\sup_{x\in I_{j_{\max},m}}\max_{j_m\le j\le j_{\max}}\sqrt{\frac{n2^{-j}}{\log n}}\,\frac{|\hat f_n(j,x)-f_n(j,x)|}{s_n(j,x)}\le2^{j_{\max}}\max_mE_{f_0}S_m. \tag{13}\end{aligned}$$

We use the Cauchy–Schwarz inequality to bound
$$\begin{aligned}E_{f_0}S_m&=\int\cdots\int S_m(x_1,\dots,x_n)\prod_{i=1}^nf_0(x_i)\,dx_1\cdots dx_n\\&=\int\cdots\int S_m(x_1,\dots,x_n)\prod_{i=1}^n\frac{f_0(x_i)}{K_{j_m,x}(f_0)(x_i)}\prod_{i=1}^nK_{j_m,x}(f_0)(x_i)\,dx_1\cdots dx_n\\&\le\sqrt{E_{K_{j_m,x}(f_0)}S_m^2}\,\sqrt{E_{K_{j_m,x}(f_0)}\prod_{i=1}^n\biggl(\frac{f_0(X_i)}{K_{j_m,x}(f_0)(X_i)}\biggr)^2}\end{aligned}$$
by the square-root of the second moment of $S_m$ under the 'idealised' density $K_{j_m,x}(f_0)$ times the square-root of the second moment of the likelihood ratio. (Here, $x$ is any point in $I_{j_{\max},m}$.) Using Condition 1 and Lemma 4, we obtain a bound for the last term in (13) of order
$$2^{j_{\max}}\max_mE_{f_0}S_m\le\frac{\sqrt\alpha}{\sqrt n}\,n^{\Delta e^{4U}},$$
which concludes the proof of the theorem.

3.2. Proofs of Theorems 2 and 3

We first prove Theorem 2: Clearly,
$$\begin{aligned}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\bigl|\hat f_n(x)-f_0(x)\bigr|&\le\sup_{0<x\le1}\biggl(\sqrt{\frac{n2^{-j(x)}}{\log n}}\,\frac{|\hat f_n(x)-f_n(j(x),x)|}{s_n(j(x),x)}\biggr)\cdot\sup_{0<x\le1}\sqrt{f_n\bigl(j(x),x\bigr)}\\&\quad+\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\bigl|f_n\bigl(j(x),x\bigr)-f_0(x)\bigr|.\end{aligned}$$
The first factor of the first summand is bounded in probability in view of Theorem 1, of Lemma 1 and of the hypothesis $\zeta_n=O(\sqrt{\log n})$. The second factor of the first summand is also bounded in probability since
$$\sup_{0<x\le1,\,j\le j_{\max}}\bigl|f_n(j,x)-E_{f_0}f_n(j,x)\bigr|=o_{P_{f_0}}(1)$$
by Proposition 2, using $2^{-j_{\max}}\ge d(\log n)^2/n$, and since $\sup_{x,j}|E_{f_0}f_n(j,x)|\le\|f_0\|_\infty<\infty$. It remains to prove that the second summand is bounded in probability, and we achieve this by bounding the moment

$$\begin{aligned}&E_{f_0}\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\bigl|f_n\bigl(j(x),x\bigr)-E_{f_0}f_n\bigl(j(x),x\bigr)\bigr|+\sup_{0<x\le1}\sqrt{\frac{n2^{-j(x)}}{\log n}}\bigl|E_{f_0}f_n\bigl(j(x),x\bigr)-f_0(x)\bigr|\\&\quad\le E_{f_0}\sup_{0<x\le1,\,j\le j_{\max}}\sqrt{\frac{n2^{-j}}{\log n}}\bigl|f_n(j,x)-E_{f_0}f_n(j,x)\bigr|+\max_m\sup_{x\in I_{j_{\max},m}}\sqrt{\frac{n2^{-j(x)}}{\log n}}\bigl|E_{f_0}f_n\bigl(j(x),x\bigr)-f_0(x)\bigr|.\end{aligned}$$
The first term is bounded by a fixed constant using Proposition 2 below. Recalling the definition of $j_m$ from the beginning of the proof of Theorem 1 and choosing $\Delta_{j,x}(f_0)$ from Lemma 2, the second term is bounded by
$$\max_m\sqrt{\frac{n2^{-j_m}}{\log n}}\sup_{x\in I_{j_{\max},m}}\bigl|E_{f_0}f_n(j_m,x)-f_0(x)\bigr|\le\max_m\sqrt{\frac{n2^{-j_m}}{\log n}}\,\bigl\|K_{j_m,x}(f_0)-f_0\bigr\|_\infty\le\frac{\delta\sqrt\Delta}{\sqrt M},$$
where $x$ is any point in $I_{j_{\max},m}$, and this completes the proof.

We next prove Theorem 3: Using the hypotheses on $t(\cdot),L(\cdot),\eta(\cdot)$, the proof of Lemma 3 shows that $f_0$ satisfies Condition 2 with
$$\Delta_{j,x}(f_0)=c2^{-j(2t(x)+1)}, \tag{14}$$
$0<c<\infty$, where $c$ does not depend on $x$. Using that $t(\cdot)$ is bounded below by some positive number implies that
$$\Delta_{j_{\max},x}(f_0)=c2^{-j_{\max}(2t(x)+1)}\le\Delta\,\frac{\log n}n$$
holds for $n$ large enough (independent of $x\in(0,1]$), so that the rate based on $j(x)$, when $\Delta_{j,x}(f_0)$ is chosen as in (14), is asymptotically equivalent to the minimax optimal locally adaptive rate, uniformly so for all $x$.

3.3. Proof of Lemma 1

The proof relies on Propositions 1 and 2 which are given below. Recall first from Section 2.1 that for $f\in V_{j,k}$ and $j'\ge j$ we necessarily have $E_ff_n(j',x)-f(x)=0$ for every $x\in I_{j,k}$, so the bias at $x\in I_{j,k}$ is exactly zero, a fact we shall use repeatedly below without separate mentioning. Write
$$\begin{aligned}&\sup_{x\in I_{j_{\max},m}}\max_{j\le j'\le j_{\max}}\sqrt{\frac{n2^{-j'}}{\log n}}\,\frac{|\hat f_n(j',x)-f_n(j',x)|}{s_n(j',x)}\\&\quad=\sup_{x\in I_{j_{\max},m}}\max_{j\le j'\le j_{\max}}\sqrt{\frac{n2^{-j'}}{\log n}}\sum_{l>j'}\frac{|f_n(l,x)-f_n(j',x)|}{s_n(j',x)}\,1_{\{\hat j_n(j',x)=l\}}. \tag{15}\end{aligned}$$

To treat the indicator, observe that
$$\begin{aligned}\bigl\{\hat j_n(j',x)=l\bigr\}&\subseteq\Bigl\{\sqrt{n2^{-l'}}\bigl|f_n(l',x)-f_n(l-1,x)\bigr|>\zeta_n\sqrt{f_n(l',x)}\ \text{for some }l'\ge l\Bigr\}\\&\subseteq\Bigl\{\sqrt{n2^{-l'}}\bigl|f_n(l',x)-E_ff_n(l',x)+E_ff_n(l-1,x)-f_n(l-1,x)\bigr|>\zeta_n\sqrt{\frac{f(x)}2}\ \text{for some }l'\ge l\Bigr\}\\&\qquad\cup\Bigl\{\min_{j\le l'\le j_{\max}}f_n(l',x)<\frac{f(x)}2\Bigr\}.\end{aligned}$$

Observe that the first set is a subset of
$$\begin{aligned}&\Bigl\{\sqrt{n2^{-l'}}\bigl|f_n(l',x)-E_ff_n(l',x)\bigr|>\frac{\zeta_n\sqrt{f(x)}}4\ \text{for some }l'\ge l\Bigr\}\cup\Bigl\{\sqrt{n2^{-(l-1)}}\bigl|f_n(l-1,x)-E_ff_n(l-1,x)\bigr|>\frac{\zeta_n\sqrt{f(x)}}4\Bigr\}\\&\quad\subseteq\Bigl\{\max_{j\le l'\le j_{\max}}\sqrt{n2^{-l'}}\bigl\|f_n(l',\cdot)-E_ff_n(l',\cdot)\bigr\|_{I_{j_{\max},m}}>\frac{\zeta_n\sqrt{\|f\|_{I_{j,k(m)}}}}4\Bigr\}=:B_1\end{aligned}$$
and that, using $y\ge\sqrt{\delta y}$ for $y\ge\delta$, the second set is contained in
$$\begin{aligned}\Bigl\{\max_{j\le l'\le j_{\max}}\bigl|f_n(l',x)-E_ff_n(l',x)\bigr|>\frac{f(x)}2\Bigr\}&\subseteq\Bigl\{\max_{j\le l'\le j_{\max}}\bigl|f_n(l',x)-E_ff_n(l',x)\bigr|>\frac{\sqrt{\delta f(x)}}2\Bigr\}\\&\subseteq\Bigl\{\sup_{x\in I_{j_{\max},m}}\max_{j\le l'\le j_{\max}}\bigl|f_n(l',x)-E_ff_n(l',x)\bigr|>\frac{\sqrt{\delta\|f\|_{I_{j,k(m)}}}}2\Bigr\}=:B_2;\end{aligned}$$
