ANNALES DE L'I. H. P., SECTION B
LUC DEVROYE
The double kernel method in density estimation
Annales de l'I. H. P., section B, tome 25, no 4 (1989), p. 533-580
<http://www.numdam.org/item?id=AIHPB_1989__25_4_533_0>
© Gauthier-Villars, 1989, tous droits réservés.
The double kernel method in density estimation

Luc DEVROYE
School of Computer Science, McGill University

Vol. 25, n° 4, 1989, p. 533-580. Probabilités et Statistiques
ABSTRACT. - Let f_nh be the Parzen-Rosenblatt kernel estimate of a density f on the real line, based upon a sample of n i.i.d. random variables drawn from f, and with smoothing factor h. Let g_nh be another kernel estimate based upon the same data, but with a different kernel. We choose the smoothing factor H so as to minimize ∫|f_nh - g_nh|, and study the properties of f_nH and g_nH. It is shown that the estimates are consistent for all densities provided that the characteristic functions of the two kernels do not coincide in an open neighborhood of the origin. Also, for some pairs of kernels, and all densities in the saturation class of the first kernel, we show that

    lim sup_{n→∞} E∫|f_nH - f| / inf_h E∫|f_nh - f| ≤ C,

where C is a constant depending upon the pair of kernels only. This constant can be arbitrarily close to one.

Key words: Density estimation, asymptotic optimality, nonparametric estimation, strong convergence, kernel estimate, automatic choice of the smoothing factor.
RÉSUMÉ. - Let f_nh be the Parzen-Rosenblatt kernel estimate of a density f on the real line, built from a sample of n independent, identically distributed random variables with density f, and with smoothing factor h. Let g_nh be another kernel estimate based upon the same data but with a different kernel. We choose the smoothing factor H so as to minimize ∫|f_nh - g_nh|, and we study the properties of f_nH and g_nH. We show that the estimates are consistent for all densities provided that the characteristic functions of the two kernels do not coincide on an open neighborhood of the origin. Moreover, for some pairs of kernels, and for all densities in the saturation class of the first kernel, we show that

    lim sup_{n→∞} E∫|f_nH - f| / inf_h E∫|f_nh - f| ≤ C,

where C is a constant depending only upon the pair of kernels. This constant can be made arbitrarily close to one.

A.M.S. Classification: AMS 1980 Subject Classifications. 62 G 05, 62 H 99, 62 G 20.

Annales de l'Institut Henri Poincaré - Probabilités et Statistiques - 0246-0203 Vol. 25/89/04/533/48/$ 6,80/© Gauthier-Villars

TABLE OF CONTENTS

1. Introduction.
2. Consistency.
   2.1. The purpose.
   2.2. The decoupling device.
   2.3. Uniform convergence with respect to h.
   2.4. Behavior of the minimizing integral.
   2.5. Necessary conditions of convergence.
   2.6. Proof of Theorem C1.
3. C-optimality.
   3.1. The main result.
   3.2. A better estimate.
   3.3. Complete convergence of the L1 error.
   3.4. Relative stability of the estimate.
   3.5. Proof of Theorems O1 and O2.
   3.6. Remarks and further work.
   3.7. The final series of Lemmas.
4. Properties of the optimal smoothing factor.
5. Acknowledgments.
6. References.
1. INTRODUCTION
We consider the standard problem of estimating a density f on R¹ from an i.i.d. sample X_1, ..., X_n drawn from f. The density estimates considered in this note are the well-known kernel estimates

    f_nh(x) = (1/n) Σ_{i=1}^{n} K_h(x - X_i),

where h > 0 is a smoothing factor, K is an absolutely integrable function called the kernel, ∫K = 1, and K_h(x) = (1/h) K(x/h) (Parzen, 1962; Rosenblatt, 1956).
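As a concrete reference point, the estimate is a few lines of code on a grid; Bartlett's kernel and the normal sample below are purely illustrative choices, not anything prescribed by the text at this point:

```python
import numpy as np

def kernel_estimate(grid, data, h, K):
    """Parzen-Rosenblatt estimate f_nh(x) = (1/(n h)) sum_i K((x - X_i)/h)."""
    grid = np.asarray(grid, dtype=float)
    return K((grid[:, None] - data) / h).sum(axis=1) / (len(data) * h)

# Bartlett's kernel: an absolutely integrable K with integral one
bartlett = lambda u: 0.75 * np.clip(1.0 - u * u, 0.0, None)

rng = np.random.default_rng(0)
sample = rng.normal(size=500)
grid = np.linspace(-5.0, 5.0, 1001)
fnh = kernel_estimate(grid, sample, h=0.5, K=bartlett)

dx = grid[1] - grid[0]
print(abs(fnh.sum() * dx - 1.0) < 0.01)   # the estimate integrates to (about) one
```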
We are particularly interested in smoothing factors that are functions of the data (which are denoted by H to reflect the fact that they are random variables). Most proposals for functions H found in the literature minimize some criterion; for example, many attempt to keep ∫(f_nH - f)² as small as possible. Extending Stone (1984), we say that f_nH is asymptotically optimal for f if H is such that

    E∫|f_nH - f| / inf_h E∫|f_nh - f| → 1 as n → ∞.

Stone defined this notion with L2 errors instead of L1 errors, without the expected values, and with almost sure convergence to one for the ratio. In Theorem S1, we will show that in general E inf and inf E can be used interchangeably, and that ∫|f_nH - f| / E∫|f_nH - f| → 1 almost surely, so that all definitions are equivalent for our method.
It should be noted that f_nH may be asymptotically optimal for some f and K and not for other choices. A perhaps too trivial example is that in which K is the uniform density on [-1, 1], and H = c n^{-1/5}, where c is known to be optimal for the normal (0, 1) density and the given K. Obviously, such a choice leads in general to asymptotic optimality for the normal (0, 1) density, and to suboptimality in nearly all other cases. It is clearly of interest to the practitioner to make the class of densities on which asymptotic optimality is obtained as large as possible. Remarkably, in the L2 work of Stone (1984), asymptotic optimality was established for all bounded densities and all bounded kernels of compact support if H is chosen by the L2 cross-validation method of Rudemo (1982) and Bowman (1984). This H is of little use in L1, and the method is dangerous: for some unbounded densities, we have lim inf E∫|f_nH - f| ≥ 1 (Devroye, 1988). Since data-based techniques for choosing H are supposed to be automated and inserted into software packages, it is important that the method be consistent.
It is perhaps useful to reflect on the possible strategies. Hall and Wand (1988) have proposed a plug-in adaptive method, in which unknown quantities in the theoretical formula for the asymptotically optimal h are estimated from the data using other nonparametric estimates, and then plugged back into the formula to obtain H. Similar strategies have worked in the past for L2 [see e.g. Woodroofe (1970), Nadaraya (1974) and Bretagnolle and Huber (1979)]. The advantages of this approach are obvious: the designer clearly understands what is going on, and the problem is conceptually cut into clearly identifiable subproblems. On the other hand, how does one choose the smoothing parameters needed for the secondary nonparametric estimates? And, assuming that the conditions for the theoretical formula for h are not fulfilled, isn't it possible to obtain inconsistent density estimates? To avoid the latter drawback, it is imperative to go back to first principles.
Cross-validated maximum likelihood products have been studied by many: Duin (1976) and Habbema, Hermans and Vandenbroek (1974) proposed the method, and Chow, Geman and Wu (1983), Schuster and Gregory (1981), Hall (1982), Devroye and Györfi (1985), Marron (1985) and Broniatowski, Deheuvels and Devroye (1988) studied the consistency and rate of convergence. Unfortunately, the maximum likelihood methods for choosing h pertain to the Kullback-Leibler distances between densities, and bear little relation to the L1 criterion under investigation here. The L2 cross-validation method proposed by Rudemo (1982) and Bowman (1984) has no straightforward extension to L1. Its properties in L2 are now well understood; see e.g. Hall (1983, 1985), Stone (1984), Burman (1985), Scott and Terrell (1987) and Hall and Marron (1987). This seems to leave us empty-handed, were it not for the versatility of the kernel estimate itself. Indeed, the method we are about to propose does not easily generalize beyond the class of kernel estimates.
The estimator proposed below has two advantages:

A. It is consistent for all f, i.e. E∫|f_nH - f| → 0 for all f.

B. For a large family of "nice" densities, we have C-optimality, i.e. there exists a constant C such that, for all f in the class,

    lim sup_{n→∞} E∫|f_nH - f| / inf_h E∫|f_nh - f| ≤ C.

The constant C can be as close to one as desired. We define our H simply as the h that minimizes ∫|f_nh - g_nh|, where g_nh is the kernel estimate based upon the same data, but with kernel L instead of kernel K. The key idea is that most kernels considered in practice have built-in limitations; this includes the class of all kernels with compact support.
For any such kernel K, it is fairly easy to construct another kernel L whose bias is asymptotically superior in the sense that

    ∫|f*L_h - f| / ∫|f*K_h - f| → 0 as h → 0,

where * is the convolution operator. The class of densities f for which this happens coincides, roughly speaking, with the class of densities for which ∫|f*K_h - f| tends to zero at the best possible rate (or: saturation rate) for the given K. These classes are rich, but they won't satisfy everyone. What the improved kernel can do for us is simple: it is very likely that g_nh, the kernel estimate with K replaced by L, is much closer to f than f_nh, and thus that ∫|f_nh - g_nh| is of the order of magnitude of ∫|f_nh - f|.
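This bias comparison is easy to probe numerically. In the sketch below, f is the standard normal density, K is Bartlett's kernel and L is the fourth-order kernel of Example 1 further on; all three are illustrative stand-ins, and the discrete convolution merely approximates f*K_h:

```python
import numpy as np

dx = 0.001
x = np.arange(-8.0, 8.0, dx)
f = np.exp(-x * x / 2) / np.sqrt(2 * np.pi)   # standard normal density

K = lambda u: 0.75 * np.clip(1 - u**2, 0, None)                                        # class 2
L = lambda u: 75/16 * np.clip(1 - u**2, 0, None) - 105/32 * np.clip(1 - u**4, 0, None) # class 4

def l1_bias(kern, h):
    """Approximate the bias term  ∫|f*K_h - f|  on the grid."""
    u = np.arange(-h, h + dx / 2, dx)
    kh = kern(u / h) / h
    kh /= kh.sum() * dx                  # renormalize the discretized kernel
    return np.abs(np.convolve(f, kh, mode="same") * dx - f).sum() * dx

for h in (0.5, 0.25, 0.125):
    print(h, l1_bias(K, h), l1_bias(L, h))   # the class-2 bias dominates the class-4 bias
```

For K the printed values shrink roughly like h², in line with the saturation rate discussion later in the paper.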
We won't worry here about the numerical details. First of all, if K and L are polynomial and of compact support (as they often are), then the integral to be minimized can be rewritten conveniently as a finite sum with O(n) terms, by considering that each kernel estimate is piecewise polynomial with O(n) pieces at most. The minimization with respect to h is a bit harder to do. Observing that the function to be optimized is uniformly continuous on any interval (a, b) ⊂ [0, ∞), we see that the minimum exists and is a random variable. For more general non-polynomial K and L, under some smoothness conditions, we still have a modulus-of-continuity bound on the criterion, with some constant c > 0. For h, h' close enough, the change in the criterion can thus be made much smaller than the smallest possible L1 error (which is at least 1/√(528 n) by Devroye, 1988). Thus, the minimization can be carried out over a grid of points, and in any case it is possible to define a random variable H with the property that

    ∫|f_nH - g_nH| = inf_h ∫|f_nh - g_nh|.
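Putting the pieces together, a bare-bones version of the whole procedure scans a finite grid of h values, which, as noted above, is all that is needed in practice. Everything concrete below (the kernels, the sample, the grids) is an illustrative choice, a sketch rather than the paper's prescription:

```python
import numpy as np

def kde(grid, data, h, kern):
    return kern((grid[:, None] - data) / h).sum(axis=1) / (len(data) * h)

def double_kernel_H(data, grid, hs, K, L):
    """Return the grid h minimizing the criterion ∫|f_nh - g_nh|."""
    dx = grid[1] - grid[0]
    crit = [np.abs(kde(grid, data, h, K) - kde(grid, data, h, L)).sum() * dx
            for h in hs]
    return hs[int(np.argmin(crit))]

K = lambda u: 0.75 * np.clip(1 - u**2, 0, None)                                        # class 2
L = lambda u: 75/16 * np.clip(1 - u**2, 0, None) - 105/32 * np.clip(1 - u**4, 0, None) # class 4

rng = np.random.default_rng(1)
data = rng.normal(size=300)
grid = np.linspace(-4.0, 4.0, 801)
hs = np.linspace(0.05, 1.2, 40)
H = double_kernel_H(data, grid, hs, K, L)
f_nH = kde(grid, data, H, K)   # the estimate studied in the paper uses kernel K at h = H
```

With the piecewise-polynomial representation mentioned above, the same minimization can be done exactly rather than on a grid; the sketch trades that for brevity.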
There is another interesting by-product of this method, namely that we end up with two kernel estimates, f_nH and g_nH, where, for the class of densities under consideration, g_nH is probably better than f_nH. Interestingly, f_nH is asymptotically optimal for K, but g_nH is usually not asymptotically optimal for L. One way of looking at our method is as a technique for creating a better estimate (g_nH) without imposing additional smoothness conditions on the densities. Another by-product of the method is that ∫|f_nH - g_nH| is a rough estimate of the actual error ∫|f_nH - f|. Unfortunately, if one decides to use g_nH instead, then ∫|f_nH - g_nH| provides little information about the actual error obtained with g_nH.
Not all pairs (K, L) are useful. The most important property that needs to be fulfilled is that the characteristic functions of K and L do not coincide in an open neighborhood of the origin. This often forces K and L to be kernels of a different order. In addition, we will see that the constant C can be chosen equal to (1 + u)/(1 - u), where u = 4√(∫L²/∫K²).
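Reading the constant as u = 4(∫L²/∫K²)^{1/2}, the effect of rescaling L is easy to tabulate. Below, K is Bartlett's kernel and L is the fourth-order kernel of Example 1, both illustrative choices; the rescaling L_σ(x) = L(x/σ)/σ divides ∫L² by σ, as exploited after Theorem O1:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 200001)
dx = x[1] - x[0]
K = 0.75 * (1 - x**2)                          # Bartlett kernel on [-1, 1]
L = 75/16 * (1 - x**2) - 105/32 * (1 - x**4)   # a 4-kernel on [-1, 1]

int_K2 = (K**2).sum() * dx                     # ∫K² = 3/5
int_L2 = (L**2).sum() * dx                     # ∫L² = 5/4

for sigma in (50.0, 500.0, 5000.0):
    u = 4.0 * np.sqrt((int_L2 / sigma) / int_K2)   # ∫L_sigma² = ∫L²/sigma
    if u < 1.0:                                    # the formula only makes sense for u < 1
        C = (1.0 + u) / (1.0 - u)
        print(sigma, round(C, 3))                  # C decreases toward 1 as sigma grows
```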
The length of the paper is partially explained by the fact that we wanted to state as many properties of the estimate as possible in a "density-free" manner. This also renders the results more useful for future work on the same topic. Among the density-free results, we cite:
The consistency (Theorem C1).
The complete convergence (Lemma O2).
The strong relative stability of the estimate (Lemma O5).
Bounds on the error that are uniform over all h (Theorem C2 and Lemma O1).
Necessary conditions of convergence (Lemmas C1 and C2).
A universal lower bound for the expected error (Lemma O5).
Universal lower bounds for the variation in the error (Lemmas O7, O8).
While it is good to know that the estimate always converges and is C-optimal for virtually all densities in the saturation class of a kernel K, it is informative to find out what we have not been able to achieve. First of all, the kernels K considered here for C-optimality are class s kernels, i.e. all their moments up to but not including the s-th moment vanish. This implies, as we will see, that the expected error can go to zero no faster than a constant times n^{-s/(2s+1)} for any density. In this respect, we are severely limited, since it is well known that for very smooth densities kernels can be found that yield error rates that are O(1/√n), or come close to it [see e.g. Watson and Leadbetter (1963) for an equivalent statement in the L2 setting].
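For orientation, the exponent s/(2s + 1) creeps toward the root-n exponent 1/2 as the kernel class grows, but never reaches it:

```python
for s in (2, 4, 6, 8):
    print(s, s / (2 * s + 1))   # 0.4, 0.444..., 0.4615..., 0.4705...
```

So even a class 8 kernel saturates at n^{-8/17}, well short of O(1/√n).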
We can exhibit constants C and D such that, for all n larger than C,

    E∫|f_nH - f| ≥ D √(log n / n).

This inequality implies that C-optimality can only be hoped for when the best possible error rate for the present K and f is at least of the order √(log n / n). Unfortunately, this would exclude such interesting densities as the normal density, for which rates close to 1/√n are attainable (Devroye, 1988). In particular, it seems that for analytic densities in general, the techniques presented here need some strengthening.
But perhaps the biggest untackled question is what happens to the expected error for densities f that are not in the saturation class of K; these are usually densities that are not smooth enough or not small-tailed enough to attain the rate n^{-s/(2s+1)}. Despite this, it may still be possible to apply the present minimization technique to obtain good asymptotic performance for most of them. All that is needed is to verify that, for the pair (K, L), the bias with kernel L remains asymptotically superior to the bias with kernel K in the sense given above. On the other hand, it is also possible that a general result such as the one obtained by Stone (1984) for L2 errors does not exist in the L1 setting.
2. CONSISTENCY
2.1. The purpose

The purpose of this section is to prove the following:

THEOREM C1. - Let K and L be absolutely integrable kernels such that their (generalized) characteristic functions do not coincide on any open neighborhood of the origin, and let f be an arbitrary density. Then E∫|f_nH - f| and E∫|g_nH - f| tend to zero as n → ∞.

One of the difficulties with this sort of theorem is that it needs to be shown for all densities f, even those f for which the procedure for selecting H is not specifically designed. Furthermore, the L1 errors are not easily decomposed into bias and variance terms, since H depends upon the data, so that, conditional on H, the summands in the definition of the kernel estimate are not independent. We will provide a mechanism for "decoupling" H and the data. In the final analysis, the proof of Theorem C1 rests on an exponential inequality of Devroye (1988) and some other properties of ∫|f_nh - g_nh| considered as a random function of h. The proof will be cut into many lemmas, some of which will be useful outside this paper and in other sections.

The condition on K and L implies that ∫|K - L| > 0. It is possible to have consistency even if the characteristic functions of K and L coincide on some open neighborhood of the origin, but such consistency would not be universal; it would apply only to densities whose characteristic function vanishes off a compact set. The details for such cases can be deduced from the proof. We have also unveiled where it is possible to go wrong: it suffices to have the said coincidence of the two characteristic functions while f has a characteristic function with an infinite tail. In those cases, H may actually end up tending to a positive constant as n → ∞. Unfortunately, as is well known, for such densities it is impossible to have consistency unless H → 0. From this, we retain that the behavior of the characteristic function of K - L near the origin is somehow a measure of the discriminatory power of the method. Usually, we take a standard nonnegative kernel for K, whose characteristic function varies as 1 - at² near t = 0, whereas for L we can take a kernel whose characteristic function is flatter near the origin, behaving possibly as 1 - bt⁴, or even identically 1 on an open neighborhood of the origin.
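The contrast can be seen numerically from ψ(t) = ∫cos(tx) k(x) dx for symmetric kernels: with Bartlett's kernel (second moment 1/5) one gets 1 - ψ(t) ≈ t²/10, while for a fourth-order kernel, here the one of Example 1 below as an illustrative stand-in, the same quantity is O(t⁴):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 20001)
dx = x[1] - x[0]
K = 0.75 * (1 - x**2)
L = 75/16 * (1 - x**2) - 105/32 * (1 - x**4)

def cf(w, t):
    """psi(t) = ∫ e^{itx} w(x) dx; real-valued here since w is symmetric."""
    return (np.cos(t * x) * w).sum() * dx

for t in (0.4, 0.2, 0.1):
    print(t, 1 - cf(K, t), 1 - cf(L, t))
# 1 - psi_K(t) shrinks like t^2/10, while 1 - psi_L(t) shrinks like t^4
# and is orders of magnitude smaller near the origin.
```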
2.2. The decoupling device

We have seen that X_1, ..., X_n are i.i.d. random variables with density f, and that H = H(n) is a sequence of random variables, where H(n) is measurable with respect to the σ-algebra generated by X_1, ..., X_n, i.e. it is a function of X_1, ..., X_n. Consider now independent identically distributed copies of the two sequences, denoted by X̃_1, ..., X̃_n, ... and H̃ = H̃(n) respectively. Density estimates based upon the former data are typically denoted by f_nh, g_nh, f_nH and g_nH, while for the latter data we will write f̃_nh and so forth. In our decoupling, we will show that ∫|f_nH - f| and ∫|f_nH̃ - f| are close in a very strong sense. Note that the second error is the one committed if H̃ is used in a density estimate constructed with a new data set. The independence thus introduced will make the ensuing analysis more manageable. To keep the notation simple, we will write E_n for the conditional expectation given X_1, ..., X_n, and Ẽ_n for the conditional expectation given X̃_1, ..., X̃_n. With this notation, note that Ẽ_n ∫|f_nH̃ - f| is distributed as E_n ∫|f̃_nH - f|, and thus that E∫|f_nH̃ - f| = E∫|f̃_nH - f|.

2.3. Uniform convergence with respect to h

The first auxiliary result is so crucial that we are permitting ourselves to elevate it to a theorem:

THEOREM C2. - Let M be an arbitrary absolutely integrable function, and define

    m_nh = ∫ |(1/n) Σ_{i=1}^{n} M_h(x - X_i)| dx - E ∫ |(1/n) Σ_{i=1}^{n} M_h(x - X_i)| dx,

where X_1, ..., X_n are i.i.d. random variables with an arbitrary density f, and h > 0 is a real number. Then sup_{h>0} |m_nh| → 0 almost surely as n → ∞. In fact, for every ε > 0, there exists a constant γ > 0, possibly depending upon f, M and ε, such that, for all n large enough,

    P{ sup_{h>0} |m_nh| > ε } ≤ e^{-γn}.

To see how Theorem C2 provides us with exactly the required decoupling between H and the data, consider the following

COROLLARIES OF THEOREM C2. - Let f_nh and g_nh be kernel estimates with kernels K and L respectively, and define J_nh = ∫|f_nh - g_nh| - E∫|f_nh - g_nh|.

A. sup_{h>0} |J_nh| → 0 almost surely as n → ∞.

B. For any random variable H (possibly not independent of the data), E_n |J_nH̃| → 0 almost surely as n → ∞.

C. For any random variable H (possibly not independent of the data), ∫|f_nH - g_nH| → 0 in probability implies E∫|f_nH - g_nH| → 0, E_n ∫|f_nH̃ - g_nH̃| → 0 in probability, and E∫|f_nH̃ - g_nH̃| → 0.

Proof of
Theorem C2. - Note first that sup_h |m_nh - m'_nh| ≤ 2∫|M - M'| when m'_nh is defined as m_nh is, but with M replaced by another absolutely integrable function M'. The fact that the bound does not depend upon h and that M' is arbitrary means that we need only show the theorem for all M that are continuous and of compact support (since the latter collection is dense in the space of L1 functions).

The first auxiliary result is the following inequality, valid for all fixed h, n, M and f:

    P{ |m_nh| ≥ ε } ≤ 2 e^{-nε²/(2(∫|M|)²)}

(Devroye, 1988). It is this inequality that will be extended to an interval of h's using a rather standard grid technique. Set h_i = a c^i, i = 0, 1, ..., k; the following inclusion of events is valid:

    { sup_{a ≤ h ≤ b} |m_nh| > ε } ⊆ ∪_{i=0}^{k} { |m_{nh_i}| > ε/2 } ∪ ∪_{i=0}^{k-1} { sup_{h_i ≤ h ≤ h_{i+1}} |m_nh - m_{nh_i}| > ε/2 },

where k is an integer so large that a c^k ≥ b, and c > 1 is such that 4 sup_{1 ≤ s ≤ c} ∫|M_s - M| ≤ ε. Such a c can indeed be found, since for all absolutely integrable M, lim_{s → 1} ∫|M_s - M| = 0 (see e.g. Devroye, 1987, pp. 38-39). The second union in the inclusion inequality is a union of empty events, since, deterministically, |m_nh - m_{nh_i}| ≤ 2∫|M_{h/h_i} - M| ≤ ε/2 whenever h_i ≤ h ≤ c h_i. Thus,

    P{ sup_{a ≤ h ≤ b} |m_nh| > ε } ≤ 2(k + 1) e^{-nε²/(8(∫|M|)²)}.

This tends to 0 when k increases at a polynomial rate in n. To see how large k is, note that c^k ≥ b/a, so that k ≥ log(b/a)/log(c) suffices. Since c is a constant depending upon ε and M only, it suffices to have limits a and b such that b/a grows at most polynomially in n. We will only need the present inequality with b/a = O(n), so that k can be taken smaller than d + log(n) for some constant d.

Recapping, we need only establish that, for some sequences a = a(n) and b = b(n) with b/a = O(n), the suprema over h ≤ a and over h ≥ b are also under control. Assume that M vanishes off [-1, 1]. Take a = δ/(2n), where δ > 0 is a constant to be picked further on. Note that

    ∫|M| ≥ ∫ |(1/n) Σ_i M_h(x - X_i)| dx ≥ ∫|M| (1 - N/n),

where N is the number of indices i for which [X_i - 2a, X_i + 2a] contains at least one X_j with j ≠ i; the inequality is uniform over h ≤ a. Hence |m_nh| ≤ ∫|M| (N/n + E N/n) for all h ≤ a. By Markov's inequality, the probability that this exceeds ε can be made smaller than a given small constant ε' provided that E N/n is small enough. But

    E N/n ≤ ∫ f(x) min( 1, n ∫_{x-2a}^{x+2a} f(y) dy ) dx,

and the right-hand side tends to ∫ f(x) min(1, 2δ f(x)) dx by the Lebesgue dominated convergence theorem and the Lebesgue density theorem (see e.g. Wheeden and Zygmund, 1977). It can be made as small as desired by our choice of δ. We conclude that for any ε, ε' > 0, we can find δ > 0 small enough such that P{ sup_{h ≤ a} |m_nh| > ε } ≤ ε' for all n large enough. We should note here that if Markov's inequality is replaced by an exponential bounding method, then it should be obvious that for δ small enough the probability in question can be bounded by e^{-dn} for some constant d > 0.

We finally proceed to show that P{ sup_{h ≥ b} |m_nh| > ε } can be made arbitrarily small by choosing a large enough constant b; this would conclude the proof of the theorem, since the three ranges of h together cover all smoothing factors. Fix a large constant T, and let N now be the number of X_i's with |X_i| > T. Let ω be the modulus of continuity of M, defined by

    ω(y) = sup_{|z| ≤ y} ∫ |M(x) - M(x + z)| dx.

By our assumptions on M, ω(y) ↓ 0 as y ↓ 0. For h ≥ b, the contribution to |m_nh| of the points with |X_i| ≤ T is at most of the order of ω(2T/h) ≤ ω(2T/b), while the contribution of the remaining points is at most 2∫|M| N/n. Furthermore, by Hoeffding's inequality (Hoeffding, 1963), N/n concentrates exponentially fast around its mean P{|X_1| > T}. Combining all this with the crude bound |m_nh| ≤ 2∫|M|, we see that, for fixed ε > 0, we can choose T and b so large that the deterministic terms on the right-hand side are at most ε/6 each, whence

    P{ sup_{h ≥ b} |m_nh| > ε } ≤ e^{-dn}

for some constant d > 0 and all n large enough. This concludes the proof of Theorem C2. □

2.4. Behavior of the minimizing integral
Although this seems rather obvious, we will nevertheless state and prove the following property:

THEOREM C3. - Let H minimize ∫|f_nh - g_nh|. For all f and all absolutely integrable K and L, ∫|f_nH - g_nH| tends to 0 almost surely and in the mean. Also, E_n ∫|f_nH̃ - g_nH̃| → 0 almost surely, and E∫|f_nH̃ - g_nH̃| → 0.

Proof of Theorem C3. - Let the sequence h* = h*(n) be such that h* → 0 and nh* → ∞. We know that whenever h → 0 and nh → ∞, it follows that ∫|f_nh - f| → 0 almost surely and in the mean (see e.g. Devroye, 1983). In particular, inf_h ∫|f_nh - f| → 0 almost surely, and E∫|f_nh* - g_nh*| → 0 as n → ∞. Assume that n is so large that E∫|f_nh* - g_nh*| ≤ ε/2. Then, since ∫|f_nH - g_nH| ≤ ∫|f_nh* - g_nh*| by definition of H, we see that, for such n, ∫|f_nH - g_nH| → 0 almost surely and E∫|f_nH - g_nH| ≤ ε/2. We can now apply the corollaries of Theorem C2 to conclude that E_n ∫|f_nH̃ - g_nH̃| → 0 almost surely, and E∫|f_nH̃ - g_nH̃| → 0. □

The decoupling necessary for the proof of Theorem C1 is now complete.
2.5. Necessary conditions of convergence

LEMMA C1. - Let K and L be absolutely integrable kernels such that their (generalized) characteristic functions do not coincide on any open neighborhood of the origin, and let f be an arbitrary density. If E∫|f_nH - g_nH| → 0, then H → 0 in probability (and thus H̃ → 0 in probability) as n → ∞. Also, for every ε > 0,

    lim inf_{n→∞} inf_{h ≥ ε} E∫|f_nh - g_nh| > 0.

Proof of Lemma C1. - By the conditional form of Jensen's inequality and Corollary C of Theorem C2,

    E∫|f*K_H̃ - f*L_H̃| ≤ E∫|f_nH̃ - g_nH̃| → 0 as n → ∞.

By the independence of H̃ and the data, this implies that ∫|f*K_h - f*L_h|, evaluated at h = H̃, tends to 0 in probability. We can now replace H̃ by H, since they are identically distributed. Assume that f and K - L have characteristic functions (suitably generalized when K - L is not nonnegative) φ and ψ respectively. By the L∞ inequality for functions and their Fourier transforms, we have, for all t,

    ∫|f*K_h - f*L_h| ≥ |φ(t) ψ(th)|,

where ε > 0 is arbitrary. If the left-hand side tends to zero, then we deduce that P{H > ε} → 0 if we can show that inf_{h ≥ ε} sup_t |φ(t) ψ(th)| > 0. We note that ∫K = ∫L, so that ψ(0) = 0. Furthermore, the absolute integrability of K - L implies that ψ is uniformly continuous. Let b > 0 and c > 0 be such that |φ(t)| ≥ b > 0 on [-c, c]. Then,

    inf_{h ≥ ε} sup_t |φ(t) ψ(th)| ≥ b sup_{|s| ≤ cε} |ψ(s)|.

Since ψ is not identically zero on any open neighborhood of the origin, the last lower bound is positive for every ε > 0. This concludes the first part of Lemma C1.

For the second statement of the Lemma, define H = H_n in such a way that H depends upon n only, H ≥ ε, and

    E∫|f_nH - g_nH| ≤ inf_{h ≥ ε} E∫|f_nh - g_nh| + 1/n.

If this infimum tends to zero, then, by the first part of the Lemma, H → 0 deterministically. But this is impossible, since H ≥ ε. This concludes the proof of Lemma C1. □

LEMMA C2. - Let K and L be absolutely integrable kernels such that ∫|K - L| > 0, and let f be an arbitrary density. If E∫|f_nH - g_nH| → 0, then nH → ∞ in probability (and thus nH̃ → ∞ in probability) as n → ∞. Also, for every ε > 0,

    lim inf_{n→∞} inf_{h: nh ≤ ε} E∫|f_nh - g_nh| > 0.

Proof of Lemma C2. - We can assume without loss of generality that H → 0 in probability, because on any subsequence along which H stays bounded away from zero, nH → ∞ in probability trivially. Hence it suffices to consider only subsequences along which H → 0 in probability. Observe that then ∫|f*K_H - f*L_H| → 0 in probability, by the fact that H → 0 in probability; here we applied Theorem 6.1 of Devroye and Györfi (1985).

Assume that nH does not tend to ∞ in probability, i.e., along an infinite subsequence, P{nH ≤ b} remains bounded away from zero for some b < ∞. For notational convenience, we will assume that the subsequence is the full sequence. We will then show that this implies that lim inf E∫|f_nH - g_nH| > 0, which is a contradiction. The proof is in two parts. First we show that there exists a positive constant d such that the expected criterion is bounded away from zero uniformly over all h with nh ≤ d. To do so, let M = K - L, choose a large constant c, and let

    p = ∫_{-c/2}^{c/2} |M| and q = ∫|M| - p.

Let N be the number of X_i's for which [X_i - ch, X_i + ch] contains no X_j with j ≠ i. For such isolated points, the central portions of the functions (1/n) M_h(· - X_i) have disjoint supports, and each contributes at least (p - q)/n to ∫|f_nh - g_nh|, since the tails of the other terms can eat away at most q/n. Hence E∫|f_nh - g_nh| ≥ (p - q) E N/n - o(1). By the Lebesgue density theorem and Fatou's lemma, the limit infimum of E N/n, uniformly over nh ≤ d, is at least equal to

    ∫ f(x) e^{-2cd f(x)} dx > 0.

This is positive if we first choose c so large that q ≤ ∫|M|/2, and then choose d small enough. For such choices of c and d,

    lim inf_{n→∞} inf_{h: nh ≤ d} E∫|f_nh - g_nh| > 0,

where we used the fact that the o(1) term in the lower bound is indeed negligible, since, as h → 0, ∫|f*M_h| → 0.

For the second part of the proof, it suffices to handle the range d ≤ nh ≤ b, as the range nh ≤ d was dealt with above. By Fatou's lemma, the quantity to be bounded is at least equal to the integral of a pointwise limit infimum. Let us now use a lower bound for E|Σ_i Z_i|, where Z_1, ..., Z_n are i.i.d. zero mean random variables and u > 0 is arbitrary, in which E|Σ_i Z_i| is bounded from below through u and the probability that |Z_1| exceeds u [see e.g. Devroye and Györfi (1985, p. 138); the inequality can be obtained with some work from Szarek's inequality (Szarek, 1976) and Cantelli's form of Chebyshev's inequality]. We will apply this with Z_i = M_h(x - X_i) - E M_h(x - X_1) and u = c/(nh) for some constant c > 0. Now, the probability in question is controlled through the sets

    C = { v : |M(v)| ≥ 2c } and D = { v : |M(v)| ≤ c }.

By the Lebesgue density theorem, applied first to compact restrictions of C, and then used by removing the compactness, we have, for almost all x,

    lim_{h→0} (1/h) P{ X_1 ∈ x - hC } = λ(C) f(x),

where λ(C) is the Lebesgue measure of C. A similar formula holds for Dᶜ, where Dᶜ is the complement of D. From this, we obtain a positive lower bound on the pointwise limit infimum for almost all x. Recapping, the quantity under study is, in the limit, at least equal to an integral that is positive when λ(C) > 0, and this can be assured by choosing c small enough, since ∫|M| > 0.

For the second statement of the Lemma, define H = H_n in such a way that H depends upon n only, nH ≤ ε, and

    E∫|f_nH - g_nH| ≤ inf_{h: nh ≤ ε} E∫|f_nh - g_nh| + 1/n.

If this infimum tends to zero, then, by the first part of the Lemma, nH → ∞ deterministically. But this is impossible, since nH ≤ ε. This concludes the proof of Lemma C2. □
2.6. Proof of Theorem C1

From Theorem C3, we see that E∫|f_nH - g_nH| → 0. From Lemmas C1 and C2, we retain that H → 0 and nH → ∞ in probability as n → ∞. Since H and H̃ are identically distributed, the same statement is true for H̃. By Theorem 6.1 of Devroye and Györfi (1985) or Theorem 3.3 of Devroye (1987), this implies that ∫|f_nH - f| → 0 in probability for all f and all absolutely integrable K, and similarly for ∫|g_nH - f|. This in turn implies convergence in the mean of both quantities. □
3. C-OPTIMALITY
3.1. The main result

This is the main body of the paper, even though it is concerned only with specific subclasses of densities. The kernel estimates considered here are class s estimates (s is an even positive integer), i.e. estimates based upon class s kernels, which are kernels K having the following properties:

    ∫K = 1; ∫x^i K(x) dx = 0 for i = 1, ..., s - 1; ∫x^s K(x) dx ≠ 0 and ∫|x^s K(x)| dx < ∞.

Note that nonnegative kernels are at best class 2 kernels. It is worth recalling that, for every density f, no matter how h is picked as a function of n,

    lim inf_{n→∞} n^{s/(2s+1)} E∫|f_nh - f| ≥ c,

where c > 0 is a constant depending upon K only [for s = 2 and K ≥ 0, see Devroye and Penrod (1984), and for general s, see Devroye, 1988]. This lower bound is not achievable for many densities. The rate n^{-s/(2s+1)} can however be attained for densities with s - 1 absolutely continuous derivatives (i.e., f, f⁽¹⁾, ..., f⁽ˢ⁻¹⁾ all exist and are absolutely continuous) satisfying the tail condition ∫√f < ∞ [see e.g. Rosenblatt (1979), Abou-Jaoude (1977), or Devroye and Györfi (1985)]. The class of such densities will be called F_s (or F when no confusion is possible).

To handle the tails of f satisfactorily, it is necessary to introduce a minor tail condition, slightly stronger than ∫√f < ∞: we let W be the class of all f for which ∫|x|^{1+ε} f(x) dx < ∞ for some ε > 0, and let V be the class of all f for which ∫√(u_f(x)) dx < ∞, where u_f(x) = sup_{|y| ≤ 1} f(x + y). Devroye and Györfi (1985) noted that ∫√f < ∞ is virtually equivalent to ∫|x| f(x) dx < ∞, although there are exceptions both ways. Thus, F_s ∩ W is not much smaller than F_s. The same is true for V, since ∫√f < ∞ implies ∫√(u_f) < ∞ for most smooth densities (e.g., it is always implied when f is monotone in the tails).

Next, we will impose a weak smoothness condition on an absolutely integrable function M: M is said to be smooth if there exists a constant C such that

    ∫|M_c - M| ≤ C (c - 1)/c for all c > 1.

Smoothness of a kernel implies that small changes in h induce proportionally small changes on f_nh with regard to the L1 distance. It seems vital to control these changes for any method that is based upon the minimization of a criterion involving h. Consider now the problem of picking the "smoothness constant" C. For example, if M ≥ 0 is unimodal at the origin and ∫M = 1, then we can always take C = 2 [see Devroye and Györfi (1985, p. 187)]. However, this is not interesting for us, since we need to have smoothness for the difference function K - L, which takes on negative values. When M is absolutely continuous, an admissible constant C can be computed from ∫|x M'(x)| dx; the claim is then obtained by considering c → ∞ as well.

Finally, a kernel K is said to be regular if it is bounded, and if there exists a symmetric unimodal integrable nonnegative function M such that |K| ≤ M. Our main result can now be announced as follows:

THEOREM O1. - Let K be a smooth regular class s kernel. Assume that f ∈ F_s ∩ W, and that L is chosen such that:

A. ∫x^i L(x) dx = 0 for i = 1, ..., s, and ∫|x^s L(x)| dx < ∞.

C. L is symmetric, smooth, and regular.

D. The generalized characteristic functions of K and L do not coincide on any open neighborhood of the origin. This implies that ∫|K - L| > 0.

Then, the kernel estimate with smoothing factor H minimizing ∫|f_nh - g_nh| is C-optimal, where C = (1 + u)/(1 - u) and u = 4√(∫L²/∫K²). When f ∈ F_s ∩ V ∩ Wᶜ, the same is true, provided that, additionally, L has compact support.

A pair (K, L) is chosen as a function of s and the constant C only. Rescaling L changes ∫L², and can thus be used to push the value of C as close to one as desired.
Example 1. - Let us illustrate the choice of L on a simple but important example with s = 2 and nonnegative kernels K. There is plenty of evidence in favor of choosing Bartlett's kernel K(x) = (3/4)(1 - x²)₊ in those cases; see e.g. Bartlett (1963), Epanechnikov (1969) and Devroye and Penrod (1984). This 2-kernel is smooth, regular, absolutely integrable and of compact support. It is easy to find 4-kernels with the same properties. A little algebra shows that we can take, for example, the continuous kernel

    L(x) = (75/16)(1 - x²)₊ - (105/32)(1 - x⁴)₊.

Interestingly, this coincides with an optimal 4-kernel described e.g. in Gasser, Müller and Mammitzsch (1985, Table 1).
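The moment conditions for this L are quickly checked; the odd moments vanish by symmetry, and the values 1, 0 and -1/21 below can also be obtained by hand:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 200001)
dx = x[1] - x[0]
L = 75/16 * (1 - x**2) - 105/32 * (1 - x**4)

m0 = L.sum() * dx              # ∫ L    = 1
m2 = (x**2 * L).sum() * dx     # ∫ x²L  = 0
m4 = (x**4 * L).sum() * dx     # ∫ x⁴L  = -1/21 ≠ 0, so L is a class 4 kernel
print(round(m0, 6), round(m2, 6), round(m4, 6))
```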
Example 2. - It is interesting to note that the functional form of L can be fixed, once and for all, for all s and K. It suffices, for example, to consider bounded smooth symmetric kernels L whose characteristic function is identically one in an open neighborhood of the origin, thereby satisfying the moment condition A of Theorem O1 for all s. This can be done by defining the characteristic function of L as the convolution of the indicator function of [-1, 1] with a symmetric bounded infinitely many times continuously differentiable function with support on [-1/2, 1/2].
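A quick numerical rendering of this construction, with a standard C-infinity bump as the mollifier (any such bump works): the convolution is identically one on a neighborhood of the origin, consistent with ∫L = 1, and vanishes off [-3/2, 3/2]:

```python
import numpy as np

dt = 1e-3
t = np.arange(-0.5, 0.5 + dt / 2, dt)
# C-infinity bump supported on [-1/2, 1/2], normalized to integrate to one
bump = np.where(np.abs(t) < 0.5,
                np.exp(-1.0 / np.clip(1.0 - (2.0 * t)**2, 1e-12, None)), 0.0)
bump /= bump.sum() * dt

s = np.arange(-1.0, 1.0 + dt / 2, dt)
indicator = np.ones_like(s)                 # the indicator of [-1, 1]

psi = np.convolve(indicator, bump) * dt     # candidate characteristic function
u = np.linspace(-1.5, 1.5, len(psi))

flat = psi[np.abs(u) < 0.45]
print(abs(flat - 1.0).max() < 1e-9, psi[0] < 1e-6, psi[-1] < 1e-6)
```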
Example 3. - There is a systematic way of creating higher order kernels. Stuetzle and Mittal (1979) pointed out that 2K - K*K is a class 2s kernel whenever K is a class s kernel. This can be iterated at will. There are other tricks. For example, if M is a class r kernel, and K is a class s kernel, then K + M - K*M is a class s + r kernel. We can also use higher convolutions for creating better kernels. One can verify that 3K - 3K*K + K*K*K is a class 3s kernel whenever K is a class s kernel. Good sources of possible definitions of families of kernels that are optimal in certain senses are Müller (1984), Gasser, Müller and Mammitzsch (1985), Su-Wong, Prasad and Singh (1982) and Singh (1979).
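The first trick is easily checked numerically for Bartlett's kernel K (an illustrative choice): the grid convolution below shows that 2K - K*K has vanishing moments of orders 1 through 3 and a nonvanishing fourth moment (-6μ₂² = -6/25 for this K), i.e. it is a class 4 kernel:

```python
import numpy as np

dx = 1e-3
x = np.arange(-1.0, 1.0 + dx / 2, dx)
K = 0.75 * np.clip(1.0 - x**2, 0.0, None)

KK = np.convolve(K, K) * dx                       # K*K, supported on [-2, 2]
x2 = np.linspace(-2.0, 2.0, len(KK))
M = 2.0 * np.interp(x2, x, K, left=0.0, right=0.0) - KK   # 2K - K*K

moments = [(x2**j * M).sum() * dx for j in range(5)]
print([round(m, 4) for m in moments])             # ≈ [1, 0, 0, 0, -0.24]
```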
3.2. A better estimate

We have mentioned that g_nh is probably preferable over f_nh: even though it is not asymptotically optimal in any sense for its kernel L, it has a smaller error than that of f_nh, basically because the class s kernel used in f_nh limits its performance. It is interesting to observe that for any absolutely integrable compact support kernel K, we must have ∫x^s K(x) dx ≠ 0 for some finite s [see e.g. Theorem 22 of Hardy and Rogosinski (1962)]. Thus, for