www.imstat.org/aihp 2013, Vol. 49, No. 3, 900–914
DOI:10.1214/12-AIHP485
© Association des Publications de l’Institut Henri Poincaré, 2013
Spatially adaptive density estimation by localised Haar projections
Florian Gach^a, Richard Nickl^a and Vladimir Spokoiny^{b,1}

^a Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, CB3 0WB, Cambridge, UK. E-mail: r.nickl@statslab.cam.ac.uk
^b Weierstrass Institute for Applied Analysis and Stochastics, Mohrenstrasse 39, 10117 Berlin, Germany, Humboldt University Berlin and Moscow Institute of Physics and Technology. E-mail: spokoiny@wias-berlin.de
Received 15 March 2011; revised 10 January 2012; accepted 21 February 2012
Abstract. Given a random sample from some unknown density f_0: ℝ → [0,∞), we devise Haar wavelet estimators for f_0 with variable resolution levels constructed from localised test procedures (as in Lepski, Mammen and Spokoiny (Ann. Statist. 25 (1997) 927–947)). We show that these estimators satisfy an oracle inequality that adapts to heterogeneous smoothness of f_0, simultaneously for every point x in a fixed interval, in sup-norm loss. The thresholding constants involved in the test procedures can be chosen in practice under the idealised assumption that the true density is locally constant in a neighborhood of the point x of estimation, and an information theoretic justification of this practice is given.
Résumé. Given a sample from a law with density f_0: ℝ → [0,∞), we construct Haar wavelet estimators of f_0 whose resolution levels vary and are built from localised tests (as in Lepski, Mammen and Spokoiny (Ann. Statist. 25 (1997) 927–947)). We show that these estimators satisfy an oracle inequality that is adaptive with respect to the possibly heterogeneous regularity of f_0, simultaneously for every point x in a given interval, in sup-norm loss. The thresholding constants used in the test procedures can be chosen in practice under the idealised assumption that the true density is locally constant in a neighborhood of the point x under consideration, a practice which we justify by an information-theoretic argument.
MSC: 62G05
Keywords: Spatial adaptation; Propagation condition
1. Introduction
One of the most enduring challenges in statistical function estimation is to devise procedures that adapt to the locally variable complexity of the unknown function. For example, if one observes a random sample X_1, …, X_n with density f_0: ℝ → ℝ, then f_0 may exhibit spatially inhomogeneous smoothness: the density could be infinitely differentiable on most of its support except for a few points x_m where it behaves locally like |x − x_m|^{α_m} for some distinct numbers α_m. The location of the irregular points x_m will usually not be known, and neither will the corresponding degrees of smoothness α_m. Moreover, f_0 could possess a so-called multifractal behavior, changing its Hölder exponents continuously on its domain of definition – in fact, as shown in Jaffard [9], ‘typical’ functions in the Besov spaces usually considered in
1 Supported in part by Laboratory for Structural Methods of Data Analysis in Predictive Modeling, MIPT, RF government grant, ag. 11.G34.31.0073. Financial support by the German Research Foundation (DFG) through the Collaborative Research Center 649 “Economic Risk” is also gratefully acknowledged.
nonparametric statistics are always multifractal. Donoho and Johnstone [1] and Donoho, Johnstone, Kerkyacharian and Picard [2,3] have suggested that methods based on wavelet shrinkage can, to a certain extent, adapt to spatially inhomogeneous complexity of the unknown function f_0. Moreover, Lepski, Mammen and Spokoiny [10] showed that this is not intrinsic to wavelet methods, and that similar spatial adaptation results can be proved for kernel methods based on locally variable bandwidth choices.
There are several ways in which one can measure spatial adaptivity of an estimator. A minimal requirement may be to devise a rule f̂_n(x) that estimates f_0(x) in an optimal way at every point x, and the methods suggested in [1] and [10] meet this requirement. These procedures depend on the point x, and the natural question arises as to how a given procedure performs globally as an estimator for f_0. To address this question, Donoho et al. [3] and Lepski et al. [10] considered global L_r-loss, r < ∞, and argued that taking L_r-loss over Besov bodies B(s,p,q), where smoothness is measured in L_p, r > p, gives a way to assess the spatial performance of an estimator. A probably more transparent approach to the spatial adaptation problem is to consider sup-norm loss for estimators with locally variable bandwidths: one aims to find an estimator f̂_n(x) that is locally optimal for estimating f_0(x), and simultaneously so for all x. This approach has not been considered in the literature so far – the results [5–8] address the spatially homogeneous setting only.
A first contribution of this article is to show that a dyadic histogram estimator with variable bin size spatially adapts to possibly inhomogeneous local Hölder smoothness of f_0, in global sup-norm loss. More precisely, for K(x,y) the Haar wavelet projection kernel, we shall construct

$$\hat f_n(x) = \frac{2^{\hat j_n(x)}}{n} \sum_{i=1}^n K\big(2^{\hat j_n(x)}x,\, 2^{\hat j_n(x)}X_i\big),$$
where ĵ_n(x) is a variable resolution level that depends both on x and the sample, and show that the random variable

$$\sup_x \frac{1}{r(n,x,f_0)}\,\big|\hat f_n(x) - f_0(x)\big|$$

is stochastically bounded, where r(n,x,f_0) is the optimal risk of an ‘oracle estimator’ for f_0 at the point x. We show moreover that this rate equals the pointwise minimax rate of adaptive estimation for f_0(x) at every x, and that spatial adaptation occurs uniformly in x except near discontinuities of the Hölder exponent function t(f,x); see after Theorem 3 for a detailed discussion.
While this result shows that spatial adaptation is indeed possible in a strong theoretical way, a drawback shared by most results in the literature on adaptive estimation remains: the theoretical findings give no indication whatsoever as to how to choose the numerical constants in the thresholds that feature in shrinkage- or Lepski-test-based methods.
It has become a common practice that thresholding constants are chosen according to simulation results, where simulations are drawn as if the true underlying signal were very simple (say, uniform or piecewise constant). This practice had not had any general theoretical corroboration until recently, when Spokoiny and Vial [11] gave, in a simple Gaussian regression model, a certain justification based on the idea of ‘propagation.’ The results in [11] are heavily tied to the simplicity of the model used, in particular to the strong Gaussianity assumption employed, and to the fact that pointwise loss is considered. In the present paper we show how the ideas of [11] generalise, subject to some nontrivial modifications, to nonparametric density estimation. A key idea in the proofs in [11], translated into the density estimation context, is to replace the sampling distribution by a locally constant product measure. The ‘transportation cost’ of this replacement is easy to control in the Gaussian setting of [11], but in the density estimation case the fluctuations of the likelihood ratios between the unknown sampling distribution and relevant locally constant product measures do not obey a Gaussian regime, but turn out to be of Poisson type, so that the ‘Gaussian intuitions’ of [11] could be entirely misleading. We show however that the main information theoretic idea of [11] remains sound in this Poissonian setting as well: we use a Lepski-type procedure to construct ĵ_n(x), and we show that if we compute sharp thresholds for this procedure as if the true density f_0 belonged to a family F of locally constant densities, then the resulting estimator is spatially adaptive in sup-norm loss. In contrast to the results in [11], the rates of convergence we obtain for the risk of the final estimator are exact rate-adaptive.
While the techniques and results of this paper generalise in principle to more complex estimation problems that involve in particular adaptation to higher degrees of smoothness, we prefer to stay within the simpler setting of Haar wavelets, which allows for a clean exposition of the main ideas.
2. Uniform spatial adaptation using propagation methods
We will use the symbol ‖g‖_T to denote the supremum sup_{t∈T} |g(t)| of a function g over some set T, but we will still use the symbol ‖g‖_∞ to denote sup_{x∈ℝ} |g(x)| if no confusion can arise.
For any j ∈ ℕ, we define a dyadic partition of (0,1] into 2^j-many disjoint subintervals by setting I_{j,k} = (k2^{-j}, (k+1)2^{-j}], k = 0, …, 2^j − 1; and for 0 < x ≤ 1 we denote by I_{j,k(x)} the unique such interval containing x. For j ∈ ℕ, k = 0, …, 2^j − 1, let V_{j,k} be the space of all bounded density functions on ℝ that are constant on I_{j,k}. Via the local projections
$$K_{j,x}(f)(z) := \begin{cases} 2^j \int_{I_{j,k(x)}} f(y)\,dy & \text{if } z \in I_{j,k(x)},\\ f(z) & \text{otherwise}, \end{cases}$$

we map any bounded density f onto V_{j,k(x)}. (Note that K_{j,x}(f) is indeed a density since K_{j,x}(f) and f assign the same probability to the interval I_{j,k(x)}.) For f ∈ V_{j,k} and j′ ≥ j we clearly have K_{j′,x}(f) = f.
2.1. Estimation procedure
Let X, X_1, …, X_n be i.i.d. with bounded density f_0: ℝ → [0,∞), n > 1. We wish to construct a single estimator which estimates f_0(x) in an optimal way, uniformly so for points x in the interval (a,b]. We shall take without loss of generality (a,b] = (0,1], and we shall assume throughout that f_0 is bounded away from zero on (0,1]. Let

$$K(x,y) = \sum_k \phi(x-k)\phi(y-k)$$

be the projection kernel based on the Haar wavelet φ = 1_{(0,1]}. We shall write K_j(x,y) = 2^j K(2^j x, 2^j y), and the associated linear density estimator is the dyadic histogram estimator given by

$$f_n(j,x) := \frac{1}{n} \sum_{i=1}^n K_j(x, X_i).$$
We make the important observation that E_f f_n(j,x) = 2^j P_f(I_{j,k(x)}), which directly follows from the identity K_j(x,y) = 2^j 1_{I_{j,k(x)}}(y). If f is constant on I_{j,k(x)}, this in particular implies E_f f_n(j,x) = f(x). (In other words: for any locally (at x) constant density f, the bias of f_n(j,x) equals zero if the resolution level is chosen fine enough.) We finally note that the estimator f_n(j,x) by construction only depends on data points falling into I_{j,k(x)}. This amounts to n2^{-j} being the ‘effective’ sample size for estimating f_0 at x.
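In code, the estimator is nothing more than a rescaled bin count. The following minimal sketch (all function and variable names are ours, purely for illustration) computes f_n(j,x) from a sample:

```python
import numpy as np

def f_n(j, x, X):
    """Dyadic histogram estimator f_n(j,x) = (1/n) sum_i K_j(x, X_i),
    where K_j(x,y) = 2^j * 1{y in I_{j,k(x)}} and I_{j,k} = (k 2^-j, (k+1) 2^-j]."""
    k = np.ceil(x * 2**j) - 1                   # I_{j,k(x)}: dyadic interval containing x
    in_bin = (X > k * 2.0**-j) & (X <= (k + 1) * 2.0**-j)
    return 2**j * np.mean(in_bin)               # 2^j * (count in I_{j,k(x)}) / n

rng = np.random.default_rng(0)
X = rng.uniform(size=10_000)                    # sample from f0 = 1 on (0,1]
print(f_n(4, 0.3, X))                           # close to f0(0.3) = 1
```

With n = 10,000 and j = 4, on average n2^{-j} = 625 of the data points fall into I_{4,k(0.3)}, illustrating the ‘effective’ sample size remark above.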
2.2. Local choice of the resolution level
We fix j_max := j_{max,n} ∈ ℕ satisfying 2^{-j_max} ≥ d(log n)²/n for some d > 0. For thresholds ζ_n to be specified below, and for J ∈ ℕ, J ≤ j_max and 0 < x ≤ 1, we define

$$\hat j_n(J,x) = \min\Big\{ j \in \mathbb{N},\ J \le j \le j_{\max}:\ \sqrt{n2^{-j'}}\,\big|f_n(j',x) - f_n(j,x)\big| \le \zeta_n \sqrt{f_n(j,x)}\ \text{ for all } j',\ j < j' \le j_{\max} \Big\} \qquad (1)$$

as well as

$$\hat j_n(x) = \hat j_n(0,x). \qquad (2)$$
(If the condition in (1) is not met for any j, J ≤ j ≤ j_max, we set ĵ_n(J,x) = j_max.) Given the locally variable resolution level ĵ_n, we define the family of nonlinear estimators

$$\hat f_n(J,x) := f_n\big(\hat j_n(J,x), x\big), \qquad \hat f_n(x) := f_n\big(\hat j_n(x), x\big), \qquad x \in [0,1]. \qquad (3)$$
These are estimators for f_0(x) based on a locally variable resolution level depending on x, and they are density-analogues of the estimators introduced in [10] in the context of the Gaussian white noise model. Note that by construction ĵ_n(x) is a step function in x. Introducing the parameter J will be useful in what follows – effectively, f̂_n(J,x) is a nonlinear estimator based on a search over the resolution levels j ≥ J that stops at j_max.
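A schematic implementation of the rule (1)–(3) may help fix ideas. It follows our reading of the test in (1), with the threshold ζ_n√(f_n(j,x)) attached to the candidate level j; the threshold constant and the driver code are illustrative assumptions, not the authors' choices:

```python
import numpy as np

def f_n(j, x, X):
    """Dyadic histogram estimator of Section 2.1."""
    k = np.ceil(x * 2**j) - 1
    return 2**j * np.mean((X > k * 2.0**-j) & (X <= (k + 1) * 2.0**-j))

def j_hat(x, X, zeta, j_max, J=0):
    """Lepski-type rule (1): smallest candidate level j in [J, j_max] such that
    sqrt(n 2^-j') |f_n(j',x) - f_n(j,x)| <= zeta * sqrt(f_n(j,x)) for all j' > j."""
    n = len(X)
    for j in range(J, j_max + 1):
        fj = f_n(j, x, X)
        if all(np.sqrt(n * 2.0**-jp) * abs(f_n(jp, x, X) - fj) <= zeta * np.sqrt(fj)
               for jp in range(j + 1, j_max + 1)):
            return j
    return j_max                                # condition met for no j: default j_max

rng = np.random.default_rng(1)
n = 5_000
X = rng.uniform(size=n)                         # locally constant truth: f0 = 1
j_max = int(np.log2(n / np.log(n) ** 2))        # respects 2^-j_max >= d (log n)^2 / n
j_sel = j_hat(0.7, X, zeta=2.5 * np.sqrt(np.log(n)), j_max=j_max)
print(j_sel, f_n(j_sel, 0.7, X))                # a coarse level suffices for constant f0
```

For a locally constant density the bias vanishes at every level, so the procedure should ‘propagate’ to a coarse resolution level, which is exactly the behavior the next subsection exploits for threshold calibration.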
2.3. Threshold choice by propagation
One of the main challenges for all adaptive procedures is the choice of the thresholds ζ_n used in the tests defined in (1). Define the standardisation

$$s_n(j,x) := \begin{cases} \dfrac{1}{\sqrt{f_n(j,x)}} & \text{if } f_n(j,x) > 0,\\[4pt] 0 & \text{otherwise.} \end{cases}$$
We suggest choosing the thresholds in such a way that the following condition is satisfied:
Condition 1. Let F_{j,k} be any triangular array of subsets of V_{j,k}, j ≤ j_max, k = 0, …, 2^j − 1, and let k(m) be the unique k such that I_{j_max,m} ⊆ I_{j,k}. We say that the thresholds ζ_n satisfy the uniform propagation condition UP(α, F_{j,k}) for some fixed α > 0 if for every n, every j ≤ j_max, every m = 0, …, 2^{j_max} − 1, and every f ∈ F_{j,k(m)} we have that

$$E_f\Big[\sup_{x\in I_{j_{\max},m}}\ \max_{j\le j'\le j_{\max}} \frac{n2^{-j'}}{\log n}\Big(\big(\hat f_n(j',x) - f_n(j',x)\big)\,s_n(j',x)\Big)^2\Big] \le \frac{\alpha}{n^2 2^{j_{\max}}}. \qquad (4)$$
(Note that since ĵ_n(j,x) ≥ j, we have that f_n(j,x) = 0 implies f̂_n(j,x) = 0 for the fully data-driven estimator f̂_n(j,·), and so the error |f̂_n(j,x) − f_n(j,x)| is then 0.) An interpretation of this condition can be given along the following lines: for 0 < x ≤ 1 the class F_{j,k(x)} contains only densities f that can be exactly reconstructed on I_{j,k(x)} by ∫ K_j(x,y)f(y) dy, so that the bias of the linear estimator f_n(j,x) equals zero locally. In particular, any choice of the resolution level finer than j will only increase the variance without reducing the bias, and we would want ĵ_n(j,x) to detect that and equal, with large probability, j. This property of ĵ_n will then be mirrored in the fact that f̂_n(j′,x) − f_n(j′,x) = 0 for every j′ ≥ j on an event with large probability, in which case the l.h.s. of (4) is exactly equal to zero. The quantity α/(n²2^{j_max}) stands for the a priori expected tolerance for a probabilistic error of ĵ_n to detect the ‘correct’ resolution level on each interval I_{j_max,m} in this ‘no-bias’ situation.
The following lemma shows that Condition 1 is not empty and that thresholds ζ_n satisfying the uniform propagation condition exist. It shows furthermore that the thresholds can be taken to be of order √(log n) and independent of f, which will be crucial in understanding the adaptive properties of f̂_n below.
Lemma 1. Let F_{j,k} equal V_{j,k} intersected with the set

$$\Big\{ f:\ 0 < \delta \le \inf_{0<x\le1} f(x),\ \|f\|_\infty \le M \Big\}$$

for some fixed 0 < δ, M < ∞. Then for every given α > 0 there exists a numerical constant κ > 0 that depends only on α such that for any threshold choice

$$\zeta_n \ge \kappa\sqrt{\log n}$$

the uniform propagation condition UP(α, F_{j,k}) is satisfied at least for n larger than some index that only depends on δ and M.
While Lemma 1 proves the existence of thresholds of the order √(log n) under the uniform propagation condition – a fact that will be seen to imply adaptivity of f̂_n below – it does not suggest a practical choice of ζ_n. Instead, this choice can be made by direct evaluation of (4), as follows: Condition 1 only concerns the local error bounds over small intervals I_{j_max,m} on which the function f is constant, which effectively means that it suffices to check this condition only for classes of densities which are constant on the interval of interest. The particular choice of the interval I_{j_max,m} is unimportant. Secondly, all quantities in Condition 1 depend on known quantities after f is chosen. By construction of the estimators f_n and f̂_n, the random variable featuring in (4) – we call it T – only depends on the number of data points falling into each of the (uniquely determined) j′-fine intervals containing I_{j_max,m}. This observation allows for an easy computation of the l.h.s. of (4) along the following lines: fix 0 ≤ p ≤ 1. Then, for any f ∈ F_{j,k(m)} satisfying 2^{-j}f = p on I_{j,k(m)}, the number Z of observations falling into the interval I_{j,k(m)} is binomial B(n,p). Conditionally on Z = k, take k-many independent random variables that are uniform on I_{j,k(m)} and count the number of observations V_{j′} in each of the j′-fine intervals. Then compute f_n, f̂_n and T. This shows that T only depends on V_{j′}, j ≤ j′ ≤ j_max, and that the l.h.s. of (4) is therefore equal to E[E[T(V_j, …, V_{j_max}) | Z]].
The practical choice of ζ_n can then be obtained via a Monte Carlo simulation of (4), by choosing ζ_n as the smallest threshold for which (4) is satisfied in the simulation for one specific interval I_{j_max,m}, uniformly over the class of all densities constant on this interval. Given j_max and α, this procedure has to be performed only for one fixed interval I_{j_max,m}, and then applies for every m simultaneously.
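The recipe above might be implemented along the following, deliberately simplified, lines: instead of the exact left-hand side of (4) we monitor, as a surrogate, how often the tests behind (1) spuriously reject the coarsest level when the truth is exactly constant on the interval (the ‘no-bias’ case). The grid, the acceptance criterion and all names are our simplifications, not the authors' procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_levels(counts, n, j_max):
    """f_n(j, x) for j = 0..j_max at a point x lying in the first finest bin:
    the level-j value aggregates the first 2^(j_max - j) finest-level counts."""
    return np.array([2.0**j * counts[: 2 ** (j_max - j)].sum() / n
                     for j in range(j_max + 1)])

def propagation_error(zeta, n, j_max, n_sim=500):
    """Monte Carlo frequency with which the tests behind (1), started at the
    coarsest level, reject although the truth is exactly constant (no bias)."""
    errors = 0
    for _ in range(n_sim):
        counts = rng.multinomial(n, np.full(2**j_max, 2.0**-j_max))
        f = f_levels(counts, n, j_max)
        ok = all(np.sqrt(n * 2.0**-jp) * abs(f[jp] - f[0]) <= zeta * np.sqrt(f[0])
                 for jp in range(1, j_max + 1))
        errors += 0 if ok else 1
    return errors / n_sim

def calibrate(alpha, n, j_max, grid):
    """Smallest zeta on the grid keeping the simulated propagation error <= alpha."""
    for zeta in grid:
        if propagation_error(zeta, n, j_max) <= alpha:
            return zeta
    return grid[-1]

n, j_max = 2_000, 5
grid = np.sqrt(np.log(n)) * np.array([0.5, 1.0, 1.5, 2.0, 2.5])
print(calibrate(alpha=0.05, n=n, j_max=j_max, grid=grid))
```

As predicted by Lemma 1, thresholds of order a constant times √(log n) suffice in this simulation; the calibration only ever uses locally constant densities, in line with the propagation idea.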
2.4. Local small bias condition
The idea behind Condition 1 is that we take ‘idealised’ classes of densities F for which we compute sharp thresholds ζ_n. The danger arises that the true density f_0 may be very different from the elements in F, which may lead to wrong thresholds (and inference). We have to assess the error that comes from replacing f_0 by an element from F in a neighborhood of a given point x. This can be fundamentally quantified in terms of the log-likelihood ratio between f_0 and its local (at x) approximand in F. As we shall see, one of the deeper reasons behind the fact that propagation methods imply adaptation results is that this error can be related to the usual bias term in linear estimation.
Condition 2. Given real numbers Δ_{j,x}, 0 < x ≤ 1, j ∈ ℕ ∪ {0}, satisfying Δ_{l′,x} ≤ Δ_{l,x} for every l′ > l, we say that f_0 satisfies the local small bias condition at x ∈ (0,1] and with Δ_{j,x} ≡ Δ_{j,x}(f_0) if

$$\mathrm{Var}_{K_{j,x}(f_0)}\Big(\log\frac{f_0}{K_{j,x}(f_0)}\Big) \le \Delta_{j,x}(f_0) \quad \text{for all } j \in \mathbb{N}.$$
The local ‘cost’ of transporting a product measure ∏_{i=1}^n f_0(x_i) to ∏_{i=1}^n K_{j,x}(f_0)(x_i) can be quantified by n times the variance featuring in the above condition, and we shall have to restrict ourselves to resolution levels j for which this transportation cost is at most a fixed constant times the logarithm of the sample size n. The smallest resolution level for which this is still the case will be defined as j*(x): more precisely, for some fixed positive constant Δ, define the local resolution level

$$j^*(x) := j^*(x, n, \Delta, f_0) = \min\big\{ j \in \mathbb{N}:\ j \le j_{\max},\ n\Delta_{j,x}(f_0) \le \Delta \log n \big\}. \qquad (5)$$
While this is an information-theoretic definition of j*, a key observation of this subsection is that it has the classical ‘bias–variance’ tradeoff generically built into it for suitable choices of Δ_{j,x}(f_0).
Lemma 2. Suppose f_0 is bounded by some finite number M > 0 and that

$$\inf_{0<x\le1} f_0(x) \ge \delta > 0.$$

Then f_0 satisfies Condition 2 with

$$\Delta_{j,x}(f_0) = \frac{M}{\delta^2}\, 2^{-j}\, \big\|f_0 - K_{j,x}(f_0)\big\|_\infty^2.$$
Proof. First, observe that K_{j,x}(f_0) is bounded by M and bounded below by δ > 0. Then, using that K_{j,x}(f_0) coincides with f_0 outside of I_{j,k(x)} and the inequality |log x − log y| ≤ max(x^{-1}, y^{-1})|x − y|, we get

$$\mathrm{Var}_{K_{j,x}(f_0)}\Big(\log\frac{f_0}{K_{j,x}(f_0)}\Big) \le \int \Big(\log\frac{f_0(y)}{K_{j,x}(f_0)(y)}\Big)^2 K_{j,x}(f_0)(y)\,dy$$
$$\le \int \max\big(f_0(y)^{-2}, K_{j,x}(f_0)(y)^{-2}\big)\big(f_0(y) - K_{j,x}(f_0)(y)\big)^2 K_{j,x}(f_0)(y)\,dy$$
$$\le \frac{M}{\delta^2}\int \big(f_0(y) - K_{j,x}(f_0)(y)\big)^2\,dy \le \frac{M}{\delta^2}\, 2^{-j}\, \big\|K_{j,x}(f_0) - f_0\big\|_\infty^2. \qquad\square$$
The lemma shows that the quantity (n/log n)Δ_{j,x}(f_0) can be viewed as the square of the ‘bias divided by the variance’ of linear projection estimators for f_0(x). Hence, to choose the smallest j ≤ j_max such that (n/log n)Δ_{j,x}(f_0) is still bounded by a fixed constant Δ means to locally balance the ‘variance’ and ‘bias’ terms in the nonparametric setting.
To be more concrete, let us briefly discuss what this means in the classical situation where the bias is bounded by local regularity properties of the unknown density f_0. Since we are interested in spatial adaptation, we wish to take locally inhomogeneous smoothness into account by appealing to local Hölder conditions: let 0 < t ≤ 1 and let us say that a function g: ℝ → ℝ is locally t-Hölder at x ∈ ℝ if for some η > 0

$$\sup_{0<|m|\le\eta} \frac{|g(x+m) - g(x)|}{|m|^t} < \infty.$$
Define further a ‘local’ Hölder ball of bounded functions

$$C(t,x,L,\eta) := \Big\{ g: \mathbb{R}\to\mathbb{R},\ \max\Big(\|g\|_\infty,\ \sup_{0<|m|\le\eta} \frac{|g(x+m)-g(x)|}{|m|^t}\Big) \le L \Big\}.$$
Condition 2 then has the following more classical interpretation in terms of local smoothness properties of f_0:

Lemma 3. If f_0 ∈ C(t,x,L,η) for some 0 < t ≤ 1, then the local bias ‖f_0 − K_{j,x}(f_0)‖_∞ is bounded by c2^{-jt} for some constant c ≡ c(t,L,η). Furthermore, if

$$\inf_{0<x\le1} f_0(x) \ge \delta > 0,$$

then Condition 2 is satisfied with

$$\Delta_{j,x}(f_0) = \frac{c^2 L}{\delta^2}\, 2^{-j(2t+1)}. \qquad (6)$$
Proof. Let y ∈ I_{j,k(x)} be arbitrary. Then, using the substitution 2^j z = 2^j y − u,

$$\big|f_0(y) - K_{j,x}(f_0)(y)\big| = \Big|2^j \int_{I_{j,k(x)}} \big(f_0(y) - f_0(z)\big)\,dz\Big| \le \int_{-1}^{1} \big|f_0(y) - f_0(x) + f_0(x) - f_0\big(y - 2^{-j}u\big)\big|\,du$$
$$\le 2\big|f_0(y) - f_0(x)\big| + \int_{-1}^{1} \big|f_0(x) - f_0\big(y - 2^{-j}u\big)\big|\,du.$$

By definition of x, y, I_{j,k(x)} we have |y − x| ≤ 2^{-j}, and also |y − 2^{-j}u − x| ≤ 2^{-j+1} by the triangle inequality, so that for 2^{-j+1} ≤ η the last quantity is bounded by c_0 2^{-jt} in view of f_0 ∈ C(t,x,L,η). If 2^{-j} > η/2, then the quantity in the last display can still be bounded by 6‖f_0‖_∞ ≤ 6L, so that choosing c_1 = 6L(2/η)^t establishes the desired bound for c = max(c_0, c_1). To prove the second claim, apply Lemma 2. □
Using the bound from the last lemma to verify Condition 2, we see that, by definition of j*(x) and for f_0 ∈ C(t,x,L,η),

$$\sqrt{\frac{n2^{-j^*(x)}}{\log n}} \sim \Big(\frac{n}{\log n}\Big)^{t/(2t+1)} \qquad (7)$$

is the locally (at x) optimal adaptive rate of convergence, so that the local small bias condition constructs a minimax optimal resolution level j*(x) at every x ∈ [0,1].
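For completeness, the balancing behind (7) can be spelled out. With Δ_{j,x}(f_0) of order 2^{-j(2t+1)} as in Lemma 3, the defining inequality nΔ_{j*,x}(f_0) ≤ Δ log n in (5) forces (the following is our sketch, with \asymp denoting two-sided bounds up to constants depending on t, L, η, δ, Δ only):

```latex
\[
n\,2^{-j^*(x)(2t+1)} \asymp \log n
\quad\Longrightarrow\quad
2^{-j^*(x)} \asymp \Bigl(\frac{\log n}{n}\Bigr)^{1/(2t+1)},
\]
\[
\sqrt{\frac{n2^{-j^*(x)}}{\log n}}
\asymp \Bigl(\frac{n}{\log n}\Bigr)^{\frac{1}{2}\left(1-\frac{1}{2t+1}\right)}
= \Bigl(\frac{n}{\log n}\Bigr)^{t/(2t+1)}.
\]
```

Thus the information-theoretic definition (5) reproduces the classical bias–variance balance.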
2.5. Main results
We now state the main results, starting with the following ‘oracle’ inequality. Note that the oracle f_n(j*(x), x) is not an estimator in itself as it depends on unknown quantities.
Theorem 1. Let f̂_n(·) be the density estimator defined in (3) with thresholds ζ_n that satisfy the uniform propagation condition UP(α, F_{j,k}) for some F_{j,k}. Suppose f_0 satisfies Condition 2 for every 0 < x ≤ 1, and let j*(x) be as in (5). Then we have

$$E_{f_0}\Big[\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_n\big(j^*(x),x\big)\big|\, s_n\big(j^*(x),x\big)\Big] \le \frac{\zeta_n}{\sqrt{\log n}} + \frac{\alpha}{n}\, n^{\Delta e^{4U}} \qquad (8)$$

for any U satisfying

$$U \ge \sup_{0<x\le1} \Big\|\log\frac{f_0}{K_{j^*(x),x}(f_0)}\Big\|_\infty. \qquad (9)$$
If ζ_n = O(√(log n)) – as follows under the conditions of Lemma 1 – and if one chooses Δ < 1/2 and U as in the remark below, then the r.h.s. of (8) is O(1) as n tends to infinity. Theorem 1 thus implies that the estimator f̂_n with resolution levels chosen by the propagation approach is close to the linear ‘oracle estimator’ evaluated at the locally optimal resolution level j*(x), and this uniformly so on (0,1].
Remark 1. If F_{j,k} is as in Lemma 1 and f_0 is bounded by M and bounded below by δ, we may apply Lemma 2 (using 2^{-j_max} ≥ d(log n)²/n) to obtain the bound

$$\Big\|\log\frac{f_0}{K_{j^*(x),x}(f_0)}\Big\|_\infty = \Big\|\log\Big(1 + \frac{f_0 - K_{j^*(x),x}(f_0)}{K_{j^*(x),x}(f_0)}\Big)\Big\|_\infty \le \log\Big(1 + \sqrt{\frac{\Delta}{dM\log n}}\Big),$$

which tends to zero as n tends to infinity.
Our results then imply the following uniform spatial adaptation result:
Theorem 2. Assume that f_0 is bounded by M and satisfies inf_{0<x≤1} f_0(x) ≥ δ > 0. Let f̂_n(·) be the density estimator from (3) with thresholds ζ_n = O(√(log n)) that satisfy the uniform propagation condition UP(α, F_{j,k}) for F_{j,k} as in Lemma 1. Let j*(x) be as in (5) with Δ < 1/2 and with Δ_{j,x} as in Lemma 2. Then

$$\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_0(x)\big| = O_{\Pr_{f_0}}(1). \qquad (10)$$
Thus the fully data-driven estimator f̂_n for f_0 achieves the locally optimal risk of the ‘oracle’ based on j*(x), uniformly at all points in (0,1]. If j*(x) – with Δ < 1/2 – is based on Δ_{j,x} as in Lemma 3, then (10) holds true and the ‘oracle’ rate is the adaptive locally minimax rate of convergence at every 0 < x ≤ 1 where f_0 is locally t-Hölder with 0 < t ≤ 1; see the discussion in Section 2.4 surrounding (7). This means that at any given point x our estimator is rate-adaptive to local Hölder smoothness (with the usual log n penalty for adaptation).
One may ask further if spatial adaptation in the minimax sense occurs uniformly for every x ∈ (0,1]. A consequence of Theorem 2 is the following.
Theorem 3. Suppose the assumptions of Theorem 2 are satisfied and that the true density f_0 lies in C(t(x), x, L(x), η(x)), 0 < x ≤ 1, for some t(·), L(·), η(·) that are bounded and uniformly bounded away from zero on (0,1]. Let j*(x) be as in (5) with Δ < 1/2 and with Δ_{j,x} as in Lemma 3. Then

$$\sup_{0<x\le1} \Big(\frac{n}{\log n}\Big)^{t(x)/(2t(x)+1)} \big|\hat f_n(x) - f_0(x)\big| = O_{\Pr_{f_0}}(1).$$
The assumptions on the functions t, L, η need discussion. For densities that locally look like |x − x_m|^{α_m} we would wish to choose t(x) equal to their pointwise Hölder exponents, t(x_m) = α_m and t(x) = 1 otherwise, but then η is not uniformly bounded away from zero for points x → x_m. However, Theorem 3 holds for any choice of the functions t, L, η for which f_0 satisfies f_0 ∈ C(t(x), x, L(x), η(x)), 0 < x ≤ 1. In other words, in the above example we can choose t(x) = α_m on the interval (x_m − η_0, x_m + η_0) and t(x) = 1 otherwise, where η_0 is some arbitrary lower bound for η(x). This comes at the expense of not being adaptive near x_m, i.e., for x ∈ (x_m − η_0, x_m + η_0) \ {x_m}, which is sensible as we cannot expect adaptation for points x arbitrarily close to x_m from a finite sample. Inspection of the proofs (particularly the dependence on η in Lemma 3) shows that, for fixed n, the above theorem holds for densities in C(t(x), x, L(x), η(n)), 0 < x ≤ 1, where η(n) can be taken of order n^{-1/3}, the binwidth corresponding to the maximal smoothness t = 1 one wants to adapt to in our setting, and this is again reasonable: Hölder smoothness of f_0 in an interval [x ± r_n], where r_n = o(n^{-1/3}), does not allow one to control the bias at x with the locally optimal binwidth of order n^{-1/3}. By the same arguments, multifractal densities f_0 which change their Hölder exponent continuously can be handled by taking t(x) piecewise constant on a partition of (0,1] into bins of size of order n^{-1/3}, the estimator achieving the local uniform minimax rate on each bin of the partition.
3. Proofs
3.1. Proof of Theorem 1
A first idea is to use a moment bound, localised at any point x of estimation, on the log-likelihood ratio between f_0 and its approximand in V_{j,k}.
Lemma 4. If, for fixed 0 < x ≤ 1,

$$\mathrm{Var}_{K_{j,x}(f_0)}\Big(\log\frac{f_0}{K_{j,x}(f_0)}\Big) \le \frac{D\log n}{n} \qquad (11)$$

for some 0 < D < ∞ and every n ∈ ℕ, then, for every n ∈ ℕ,

$$E_{K_{j,x}(f_0)}\Big[\prod_{i=1}^n \Big(\frac{f_0(X_i)}{K_{j,x}(f_0)(X_i)}\Big)^2\Big] \le n^{2De^{4U}}$$

holds for any U satisfying

$$U \ge \Big\|\log\frac{f_0}{K_{j,x}(f_0)}\Big\|_\infty.$$
Proof. Since the Kullback–Leibler distance

$$K\big(f_0, K_{j,x}(f_0)\big) = -E_{K_{j,x}(f_0)}\log\frac{f_0}{K_{j,x}(f_0)} \ge 0$$

is nonnegative, we have

$$E_{K_{j,x}(f_0)}\Big[\prod_{i=1}^n \Big(\frac{f_0(X_i)}{K_{j,x}(f_0)(X_i)}\Big)^2\Big] \le \Big(E_{K_{j,x}(f_0)}\, e^{2(\log(f_0/K_{j,x}(f_0)) - E_{K_{j,x}(f_0)}\log(f_0/K_{j,x}(f_0)))}\Big)^n$$

by the i.i.d. assumption. Using the power series expansion of the exponential function and that the variables in the exponent are centered, one easily bounds the previous display by

$$\Big(1 + \frac{2De^{4U}\log n}{n}\Big)^n \le e^{2De^{4U}\log n} = n^{2De^{4U}}. \qquad\square$$
Here is the proof of Theorem 1: We first note that Condition 2 allows us to take Δ_{j,x}(f_0) to be constant on the intervals I_{j,k}. Consequently, j*(·) from (5) is then constant on every interval I_{j_max,m}, and we set

$$j_m^* = \sup_{x\in I_{j_{\max},m}} j^*(x).$$
To prove the theorem, we split

$$E_{f_0}\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_n\big(j^*(x),x\big)\big|\, s_n\big(j^*(x),x\big)$$
$$\le E_{f_0}\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_n\big(j^*(x),x\big)\big|\, s_n\big(j^*(x),x\big)\,1_{\{\hat j_n(x) < j^*(x)\}}$$
$$\quad + E_{f_0}\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_n\big(j^*(x),x\big)\big|\, s_n\big(j^*(x),x\big)\,1_{\{\hat j_n(x) \ge j^*(x)\}} =: I + II$$

according to whether ĵ_n(x) comes to lie below the local resolution level j*(x) or not. By definition of ĵ_n(x) in (1) one immediately has

$$I \le \frac{\zeta_n}{\sqrt{\log n}}.$$

About II, define
$$S_m = \sup_{x\in I_{j_{\max},m}}\ \max_{j_m^*\le j\le j_{\max}} \sqrt{\frac{n2^{-j}}{\log n}}\,\big|\hat f_n(j,x) - f_n(j,x)\big|\, s_n(j,x). \qquad (12)$$
Using that on the event {ĵ_n(x) ≥ j*(x)} we necessarily have f̂_n(x) = f̂_n(j*(x),x), we see that

$$II \le E_{f_0}\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n\big(j^*(x),x\big) - f_n\big(j^*(x),x\big)\big|\, s_n\big(j^*(x),x\big)$$
$$\le E_{f_0}\max_m \sup_{x\in I_{j_{\max},m}}\ \max_{j_m^*\le j\le j_{\max}} \sqrt{\frac{n2^{-j}}{\log n}}\,\big|\hat f_n(j,x) - f_n(j,x)\big|\, s_n(j,x) \le 2^{j_{\max}} \max_m E_{f_0} S_m. \qquad (13)$$
We use the Cauchy–Schwarz inequality to bound

$$E_{f_0} S_m = \int\cdots\int S_m(x_1,\ldots,x_n) \prod_{i=1}^n f_0(x_i)\,dx_1\cdots dx_n$$
$$= \int\cdots\int S_m(x_1,\ldots,x_n) \prod_{i=1}^n \frac{f_0(x_i)}{K_{j_m^*,x}(f_0)(x_i)} \prod_{i=1}^n K_{j_m^*,x}(f_0)(x_i)\,dx_1\cdots dx_n$$
$$\le \sqrt{E_{K_{j_m^*,x}(f_0)} S_m^2}\ \sqrt{E_{K_{j_m^*,x}(f_0)} \prod_{i=1}^n \Big(\frac{f_0(X_i)}{K_{j_m^*,x}(f_0)(X_i)}\Big)^2},$$

i.e., by the square root of the second moment of S_m under the ‘idealised’ density K_{j_m^*,x}(f_0) times the square root of the second moment of the likelihood ratio. (Here, x is any point in I_{j_max,m}.) Using Condition 1 and Lemma 4, we obtain a bound for the last term in (13) of order

$$2^{j_{\max}} \max_m E_{f_0} S_m \le \frac{\alpha}{n}\, n^{\Delta e^{4U}},$$

which concludes the proof of the theorem.
3.2. Proofs of Theorems 2 and 3

We first prove Theorem 2: Clearly,

$$\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_0(x)\big|$$
$$\le \Big(\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|\hat f_n(x) - f_n\big(j^*(x),x\big)\big|\, s_n\big(j^*(x),x\big)\Big)\Big(\sup_{0<x\le1}\sqrt{f_n\big(j^*(x),x\big)}\Big) + \sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|f_n\big(j^*(x),x\big) - f_0(x)\big|.$$

The first factor of the first summand is bounded in probability in view of Theorem 1, of Lemma 1 and of the hypothesis ζ_n = O(√(log n)). The second factor of the first summand is also bounded in probability since

$$\sup_{0<x\le1,\ j\le j_{\max}} \big|f_n(j,x) - E_{f_0} f_n(j,x)\big| = o_{P_{f_0}}(1)$$

by Proposition 2, using 2^{-j_max} ≥ d(log n)²/n, and since sup_{x,j} |E_{f_0} f_n(j,x)| ≤ ‖f_0‖_∞ < ∞. It remains to prove that the second summand is bounded in probability, and we achieve this by bounding the moment
$$E_{f_0}\sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|f_n\big(j^*(x),x\big) - E_{f_0} f_n\big(j^*(x),x\big)\big| + \sup_{0<x\le1} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|E_{f_0} f_n\big(j^*(x),x\big) - f_0(x)\big|$$
$$\le E_{f_0}\sup_{0<x\le1,\ j\le j_{\max}} \sqrt{\frac{n2^{-j}}{\log n}}\,\big|f_n(j,x) - E_{f_0} f_n(j,x)\big| + \max_m \sup_{x\in I_{j_{\max},m}} \sqrt{\frac{n2^{-j^*(x)}}{\log n}}\,\big|E_{f_0} f_n\big(j^*(x),x\big) - f_0(x)\big|.$$
The first term is bounded by a fixed constant using Proposition 2 below. Recalling the definition of j_m^* from the beginning of the proof of Theorem 1 and choosing Δ_{j,x}(f_0) from Lemma 2, the second term is bounded by

$$\max_m \sqrt{\frac{n2^{-j_m^*}}{\log n}} \sup_{x\in I_{j_{\max},m}} \big|E_{f_0} f_n\big(j_m^*,x\big) - f_0(x)\big| \le \max_m \sqrt{\frac{n2^{-j_m^*}}{\log n}}\,\big\|K_{j_m^*,x}(f_0) - f_0\big\|_\infty \le \delta\sqrt{\frac{\Delta}{M}},$$

where x is any point in I_{j_max,m}, and this completes the proof.
We next prove Theorem 3: Using the hypotheses on t(·), L(·), η(·), the proof of Lemma 3 shows that f_0 satisfies Condition 2 with

$$\Delta_{j,x}(f_0) = c2^{-j(2t(x)+1)}, \qquad (14)$$

0 < c < ∞, where c does not depend on x. Using that t(·) is bounded below by some positive number implies that

$$\Delta_{j_{\max},x}(f_0) = c2^{-j_{\max}(2t(x)+1)} \le \Delta\frac{\log n}{n}$$

holds for n large enough (independently of x ∈ (0,1]), so that j*(x), when based on Δ_{j,x}(f_0) as in (14), corresponds to the minimax optimal locally adaptive rate, uniformly so for all x.
3.3. Proof of Lemma 1

The proof relies on Propositions 1 and 2, which are given below. Recall first from Section 2.1 that for f ∈ V_{j,k} and j′ ≥ j we necessarily have E_f f_n(j′,x) − f(x) = 0 for every x ∈ I_{j,k}, so the bias at x ∈ I_{j,k} is exactly zero, a fact we shall use repeatedly below without separate mentioning. Write

$$\sup_{x\in I_{j_{\max},m}} \max_{j'\ge j} \sqrt{\frac{n2^{-j'}}{\log n}}\,\big|\hat f_n(j',x) - f_n(j',x)\big|\, s_n(j',x)$$
$$= \sup_{x\in I_{j_{\max},m}} \max_{j'\ge j} \sqrt{\frac{n2^{-j'}}{\log n}} \sum_{l>j'} \big|f_n(l,x) - f_n(j',x)\big|\, s_n(j',x)\, 1_{\{\hat j_n(j',x) = l\}}. \qquad (15)$$
To treat the indicator, observe that

$$\{\hat j_n(j,x) = l\} \subseteq \Big\{\sqrt{n2^{-l'}}\,\big|f_n(l',x) - f_n(l-1,x)\big| > \zeta_n\sqrt{f_n(l-1,x)}\ \text{for some } l' \ge l\Big\}$$
$$\subseteq \Big\{\sqrt{n2^{-l'}}\,\big|f_n(l',x) - E_f f_n(l',x) + E_f f_n(l-1,x) - f_n(l-1,x)\big| > \frac{\zeta_n\sqrt{f(x)}}{2}\ \text{for some } l' \ge l\Big\}$$
$$\cup\ \Big\{\min_{\ell\ge j}\sqrt{f_n(\ell,x)} \le \frac{\sqrt{f(x)}}{2}\Big\}.$$
Observe that the first set is a subset of

$$\Big\{\sqrt{n2^{-l'}}\,\big|f_n(l',x) - E_f f_n(l',x)\big| \ge \frac{\zeta_n\sqrt{f(x)}}{4}\ \text{for some } l' \ge l\Big\} \cup \Big\{\sqrt{n2^{-(l-1)}}\,\big|f_n(l-1,x) - E_f f_n(l-1,x)\big| > \frac{\zeta_n\sqrt{f(x)}}{4}\Big\}$$
$$\subseteq \Big\{\max_{\ell\ge j} \sqrt{n2^{-\ell}}\,\big\|f_n(\ell,\cdot) - E_f f_n(\ell,\cdot)\big\|_{I_{j_{\max},m}} > \frac{\zeta_n\sqrt{\|f\|_{I_{j,k(m)}}}}{4}\Big\} =: B_1,$$

and that, using y ≥ √(δy) for y ≥ δ, the second set is contained in

$$\Big\{\max_{\ell\ge j} \big|f_n(\ell,x) - E_f f_n(\ell,x)\big| > \frac{f(x)}{2}\Big\} \subseteq \Big\{\max_{\ell\ge j} \big|f_n(\ell,x) - E_f f_n(\ell,x)\big| > \frac{\sqrt{\delta f(x)}}{2}\Big\}$$
$$\subseteq \Big\{\sup_{x\in I_{j_{\max},m},\ \ell\ge j} \big|f_n(\ell,x) - E_f f_n(\ell,x)\big| > \frac{\sqrt{\delta \|f\|_{I_{j,k(m)}}}}{2}\Big\} =: B_2;$$