ANNALES DE L'I. H. P., SECTION B
GILLES BLANCHARD
The "progressive mixture" estimator for regression trees
Annales de l'I. H. P., section B, tome 35, no 6 (1999), p. 793-820
<http://www.numdam.org/item?id=AIHPB_1999__35_6_793_0>
© Gauthier-Villars, 1999, all rights reserved.
The "progressive mixture" estimator for regression trees
Gilles BLANCHARD
DMI, École Normale Supérieure, 45 rue d'Ulm, 75230 Paris Cedex, France. E-mail: gblancha@dmi.ens.fr
Article received on 31 August 1998, revised 10 May 1999
ABSTRACT. – We present a version of O. Catoni's "progressive mixture estimator" (1999) suited for a general regression framework. Following essentially Catoni's steps, we derive strong non-asymptotic upper bounds for the Kullback-Leibler risk in this framework. We give a more explicit form of this bound when the models considered are regression trees, present a modified version of the estimator in an extended framework, and propose an approximate computation using a Metropolis algorithm. © Elsevier, Paris

RÉSUMÉ. – We give a version, adapted to a regression framework, of the so-called "progressive mixture" estimator introduced by O. Catoni (1999). Following Catoni, we give a finite-horizon bound for the Kullback-Leibler loss of the estimator in this framework. We make this bound explicit in the case of regression trees, present a variant in an extended framework, and propose a method of approximate computation by a Metropolis algorithm. © Elsevier, Paris
1. INTRODUCTION
The so-called "progressive estimator" was introduced by Catoni [6] (1997), using ideas inspired by the recent work of Willems et al. [11,12] on universal coding. The principle of a progressive estimator has also been proposed independently by Barron and Yang [4,3]. One of the main attractive features of this estimator is that we can easily derive strong non-asymptotic bounds on its mean Kullback-Leibler risk in an extremely general framework. Our goal in this paper is to present a version of this estimator, first in a general regression estimation framework, then more precisely for classification and regression trees (CARTs).
As the computation of this estimator involves performing sums of a Bayesian type over a very large (and possibly non-finite) set of models, we are naturally led in practice to compute these sums approximately using a Monte Carlo algorithm. A very similar method (a Bayesian search among CARTs using a stochastic procedure) has been proposed by Chipman et al. [7] (1997). It is interesting to note that if we start from a different theoretical point of view, which allows us to derive non-asymptotic bounds on the Kullback-Leibler risk, we are led to a very similar algorithm (see the discussion at the end of the paper). On the other hand, the work of Chipman et al. suggested to us the idea of a data-dependent Bayesian prior on the set of models, which we develop in Section 5. Finally, we also want to point out the relation of this algorithm of random walk in the model tree space with the method of tree selection by local entropy minimization developed by Amit, Geman and Wilder [1,2]. This work has been a large source of inspiration, and in some regards the local entropy minimization can be seen as the deterministic (or "zero-temperature") analogue of the Monte Carlo method presented here. This will also be discussed in some more detail in Section 7.
The structure of the paper is as follows. In Section 2, we present a version of the progressive mixture estimator for regression estimation in a very general framework. Section 3 focuses on classification and regression trees and gives a precise and explicit form of the estimator in this case. Section 4 briefly deals with the case of a "fixed-design" tree, where an exact computation of the estimator can be performed thanks to the tree weighting recursive algorithm developed by Tjalkens, Willems and Shtarkov in [11] (other authors have also used this technique for classification trees in a different framework, see [10]). Section 5 presents an extension of the framework to a possibly uncountable family of model trees, using a data-dependent prior, and we show for a precise and non-trivial example that the estimator thus obtained still achieves the minimax rate of convergence. In Section 6, we explain how to construct a Monte Carlo chain to obtain an approximate computation of the estimator for general tree models. Section 7 concludes the paper with a discussion of the results.

2. CATONI'S PROGRESSIVE ESTIMATOR IN A REGRESSION FRAMEWORK
In this section we present a version of Catoni's estimator adapted for regression estimation purposes. Let us specify the framework and notations. Let Z_1^N = (Z_i)_{i=1,...,N} = (X_i, Y_i)_{i=1,...,N} be a sequence of random variables taking their values in some measurable space Z = X × Y, with probability distribution P^N. The only hypothesis we will make on P^N is that it is exchangeable, i.e., that this distribution is invariant under any permutation of the variables Z_i. Our goal in a regression framework is, given a sample z_2, ..., z_N, to get an estimation of the conditional probability of a new observation Y_1 given X_1 (the fact that this "new" observation is actually given the index 1 is purely for later notational convenience and should not be misleading). We thus want to estimate P^N(Y_1 | X_1, Z_2^N). This can include any case where the order of the observations (including the test example Z_1) is not of importance, because we can then re-draw them in a random order.
This case is of course of only minor practical relevance, since in such a predictive framework the test example is usually given separately and cannot be drawn at random (in which case the hypothesis of exchangeability has to hold fully). However, it covers for example the customary procedure used for classifier testing, when one has a fixed database of examples that is split at random between a set used to train the classifier and a set used to validate its accuracy.
We will assume that there exists a regular version of this conditional probability and, more generally throughout this paper, we will assume that all of the conditional probabilities that we are dealing with have a regular version, allowing us to make use of all standard integration properties. For that purpose we will assume that Z is a Borel space.
To begin with, we will split the sample data set z_2^N into two subsamples: E = z_{L+1}^N, called the "estimation set", and T = z_1^L, called the "test set", for some fixed integer L > 0 (note that we consider the new observation z_1 to be part of the test set). The reason for this will become clearer in the sequel, and the issue of the choice of L will be discussed later on. Note that the ordering of the variables is purely arbitrary, since the probability distribution P^N according to which they have been drawn is exchangeable.
We then assume that we have at our disposal a countable family of estimators (Q_m)_{m∈M}, that is, a family of conditional probabilities depending on the observations of the estimation set, Q_m(Y ∈ · | X, E), also denoted by Q_m^E(Y ∈ · | X) for more compact writing (in a more general way, in this paper we will often use a superscript E for probability distributions depending on the estimation set, thus writing equally P^E(· | ·) for P(· | ·, E)). We will speak of m ∈ M as a "model".
Our purpose is to find an estimator in this family which will minimize (approximately) the Kullback-Leibler (K-L) risk with regard to P^N(Y_1 | X_1, Z_2^N), i.e., find an m ∈ M achieving

inf_{m∈M} E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q_m^E(dY_1 | X_1) ),

where H is the Kullback-Leibler divergence: for μ, ν two probability distributions on the space Y,

H(μ, ν) = ∫_Y log(dμ/dν) dμ

(with H(μ, ν) = +∞ when μ is not absolutely continuous with respect to ν).
To get a solution to our estimation problem we will actually not perform a model selection as first announced, but rather build a composite estimator for which we will be able to explicitly upper bound the Kullback-Leibler risk. Suppose that we have chosen some probability distribution π on the model set M. If we keep the estimation set E "frozen", we can consider the estimator probabilities Q_m^E(T | X) as simple product (conditional) probabilities on the test set T. We then define the mixture conditional probability on the test set, obtained by summing these product probabilities Q_m^E weighted by the prior π:

Q_π^E(dY_1^L | X_1^L) = Σ_{m∈M} π(m) Π_{i=1}^{L} Q_m^E(dY_i | X_i).
We eventually build Catoni's progressive mixture estimator Q̂_π(Y_1 ∈ · | X_1, z_2^N), also denoted by Q̂_π(· | X_1, z_2^N), in the following way:

Q̂_π(dY_1 | X_1, z_2^N) = (1/L) Σ_{k=1}^{L} Q_π^E(dY_1 | X_1, z_2^k).   (1)

In order to make the last formula somewhat clearer, let us consider for a moment the simple case where the space Y is discrete (this is actually the case that will be considered in the following sections, for classification problems). In that case the definition (1) yields the following expansion (everywhere the denominator does not vanish):

Q̂_π(y | x, z_2^N) = (1/L) Σ_{k=1}^{L} [ Σ_{m∈M} π(m) Q_m^E(y | x) Π_{i=2}^{k} Q_m^E(y_i | x_i) ] / [ Σ_{m∈M} π(m) Π_{i=2}^{k} Q_m^E(y_i | x_i) ].   (2)
Let us point out that each individual term of the sum over M is nothing but a (predictive) Bayesian estimator based on the probabilities Q_m^E(dY_1 | X_1) (when the estimation sample E is kept frozen), using the prior π. Thus Q̂_π is a Cesàro mean of Bayesian estimators with a sample of growing size. The subsample E = z_{L+1}^N is used for estimation within each model m ∈ M ("estimation set"), whereas the other subsample z_2^L is then used to "choose" between the models in a Bayesian way ("test set").
We now state a theorem upper-bounding the risk for this estimator.

THEOREM 1. – With the previous assumptions, we get the inequality

E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q̂_π(dY_1 | X_1, z_2^N) ) ≤ inf_{m∈M} { E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q_m^E(dY_1 | X_1) ) + log(1/π(m)) / L }.   (3)
weget the inequality:
Proof -
There isactually
little to bechanged
incomparison
to[6]
sowe will follow
quite closely
Catoni’ssteps..
Assume that the estimators Q_m^E have densities q_m^E(Y_1 | X_1) with respect to some dominating conditional measure μ(dY | X). We similarly will denote by lower-case letters the densities with respect to μ of the other probability distributions. In case there is no "obvious" choice for μ, we can build the following dominating measure depending on the estimation set E:

μ^E(dY | X) = Σ_{m∈M} 2^{−α(m)} Q_m^E(dY | X),

where α is some injective mapping of M into Z_+. Take an m ∈ M; if E_{P^N} H(P^N(dY_1 | X_1, Z_2^N), Q_m^E(dY_1 | X_1)) = +∞, the inequality is true for this m; else we consider the difference (whose second term is now finite)

E_{P^N} H(P^N(dY_1 | X_1, Z_2^N), Q̂_π) − E_{P^N} H(P^N(dY_1 | X_1, Z_2^N), Q_m^E) = E_{P^N} log [ q_m^E(Y_1 | X_1) / q̂_π(Y_1 | X_1, z_2^N) ].

Because −log is a convex function, we have

E_{P^N} log [ q_m^E(Y_1 | X_1) / q̂_π(Y_1 | X_1, z_2^N) ] ≤ (1/L) Σ_{k=1}^{L} E_{P^N} log [ q_m^E(Y_1 | X_1) / q_π^E(Y_1 | X_1, z_2^k) ].

Now, using the hypothesis that P^N is an exchangeable probability distribution, we can swap the roles played by Z_1 and Z_k in the kth term of the sum (formally replacing X_k, Y_k by X_1, Y_1 whenever necessary). We thus get to

E_{P^N} log [ q_m^E / q̂_π ] ≤ (1/L) Σ_{k=1}^{L} E_{P^N} log [ q_m^E(Y_k | X_k) q_π^E(Y_1^{k−1} | X_1^{k−1}) / q_π^E(Y_1^k | X_1^k) ],   (4)

where by definition

q_π^E(Y_k | X_k, z_1^{k−1}) = q_π^E(Y_1^k | X_1^k) / q_π^E(Y_1^{k−1} | X_1^{k−1}).

Let us point out that we have no more conditioning with respect to X_k in the last denominator because of the relation

q_π^E(Y_1^{k−1} | X_1^k) = q_π^E(Y_1^{k−1} | X_1^{k−1}),

coming straightforwardly from the definition of Q_π^E. Thus the sum in (4) reduces to the first component of its last term, since the other terms cancel each other successively, leading to

E_{P^N} log [ q_m^E / q̂_π ] ≤ (1/L) E_{P^N} log [ Π_{k=1}^{L} q_m^E(Y_k | X_k) / q_π^E(Y_1^L | X_1^L) ] ≤ (1/L) log(1/π(m)),

the last inequality holding because q_π^E(Y_1^L | X_1^L) ≥ π(m) Π_{k=1}^{L} q_m^E(Y_k | X_k) by the definition of the mixture. □
It has been pointed out that an "aesthetic" flaw of the progressive mixture estimator is that it is not a symmetric function of the data (different orderings give different estimators), because of the arbitrary division of the sample in two sets and because of the "progressive" sum in the definition of the estimator, which is a Cesàro average of Bayesian estimators with a growing sample. The reason for this apparently surprising progressive sum is to be able to bound the Kullback-Leibler distance between probability distributions on the only variable Z_1 by the K-L distance between full multivariate distributions on the whole test set (for which the idea of a mixture distribution comes from the compression and information theory literature), as appears in the proof of the theorem. However, it is possible to define theoretically a fully symmetric estimator by averaging Q̂_π(Y_1 | X_1, z_2^N) over all the (N − 1)! possible orderings of the sample z_2^N, which will satisfy the same majorations as Q̂_π (as can be seen using once more the convexity of the K-L distance with respect to its second argument).
3. APPLICATION TO A CART MODEL
In this section we deal with a classification problem, that is, the variable Y will take its values in Y = {0, 1, ..., c} (thus there are (c + 1) distinct classes and c is the number of free parameters for a distribution on Y), and we want to "recognize" the observation X, taking its values in a certain space X, deciding to which class it belongs, having already observed a "learning set" z_2^N. In order to do that we want to make use of a classification tree, using the "20-questions game" principle: suppose we have at our disposal a countable set of "questions" Q we can ask on the observation X, to which we can only answer by 0 or 1 (for example, we can think of X as a finite or infinite binary sequence, the questions being of the form "what is the nth bit of X?"). Having asked a first question, we are then allowed to choose a second one depending on the first answer, and so on. At each step we can decide whether to ask a new question or to stop and guess X's class (or, alternatively, to estimate the probability of X belonging to a certain class, if we prefer a regression framework). If the rule is that we can only ask at most d questions, all the possible sequences of questions and answers can be represented as a complete binary tree of depth at most d, whose internal nodes are labeled with questions (CART tree model).
Our set of models M will therefore be a set of couples (T, F_T), where T is a tree and F_T = (f_s)_{s∈int(T)} the set of questions indexed by the internal nodes of T. A question f ∈ Q is nothing but the indicator of a certain measurable set A_f ⊂ X. We will think of a binary tree T as a subset of Γ = {∅} ∪ {0,1}* (the set of all finite, possibly empty, strings formed of zeroes and ones) such that every ancestor (that is, every prefix string) t of s ∈ T also belongs to T. By definition, ∅ is the root of the tree, and if s is a node, its two sons are s0 and s1. In our case we will only consider complete binary trees (such that every s ∈ T either is a terminal node or has two sons).
If m is such a model and x is an observation, we denote by m(x) the terminal node associated to x by m (obtained by asking the questions on x and following the tree branches until we reach a terminal node of m). Also, we will denote by ∂m the set of terminal nodes (or leaves) of the model tree m.
of the model tree m.Given m =
(T, FT )
we now have to define our set of estimatorsE).
For the remainder of the paper, whendealing
with treemodels
we
will take forQm
a conditionalLaplace
estimator at every terminal node:given
a terminal node s E T we define thefollowing
counters on the estimation set E =
where
lA
denotes the indicator function of the set A.Then,
= s, we take for a multivariate distribution definedby
Let us then denote by M_{m,θ_m}(Y | X) the general conditional multivariate distribution based on the labeled tree m, where θ_m = (θ_s)_{s∈∂m} is a set of real vectors of parameters, all belonging to the c-dimensional simplex S_c = { θ ∈ [0,1]^{c+1} : Σ_{i=0}^{c} (θ)_i = 1 } and indexed by the terminal nodes of m, defined by

M_{m,θ_m}(Y = i | X) = (θ_{m(X)})_i

(where (θ_s)_i denotes the ith coordinate of the vector parameter θ_s).
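As an illustration, here is a minimal sketch (ours) of the conditional Laplace estimator at the leaves, for a tree given through a hypothetical leaf-assignment function leaf_of:

    from collections import defaultdict

    def laplace_leaf_counters(estimation_set, leaf_of, n_classes):
        # counters n_s^i on the estimation set E, indexed by leaf s
        counts = defaultdict(lambda: [0] * n_classes)
        for x, y in estimation_set:
            counts[leaf_of(x)][y] += 1
        return counts

    def laplace_predict(counts, leaf_of, x, n_classes):
        # Q_m^E(Y = i | X = x) = (n_s^i + 1) / (n_s + c + 1), s = m(x),
        # where n_classes = c + 1
        n_si = counts[leaf_of(x)]
        n_s = sum(n_si)
        return [(n_si[i] + 1) / (n_s + n_classes) for i in range(n_classes)]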
Once again following very closely [6], we give the following upper bound for the risk of the Laplace estimator:

THEOREM 2. – For any exchangeable distribution P^N on the variables Z_1^N ∈ ({0, ..., c} × X)^N, and for a given model m, the conditional Laplace estimator Q_m^E satisfies

E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q_m^E(dY_1 | X_1) ) ≤ inf_{θ_m} E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), M_{m,θ_m}(dY_1 | X_1) ) + c |∂m| / (N − L + 1),   (5)

where |∂m| is the number of terminal nodes of the tree model m.
Proof. – To shorten the equations below, we will write in the sequel Z_{N+1} = (X_{N+1}, Y_{N+1}) equally for Z_1 = (X_1, Y_1). For a given terminal node s ∈ ∂m, let us first introduce the modified counters ñ_s^i, i = 0, ..., c:

ñ_s^i = n_s^i + 1_{{m(X_{N+1}) = s, Y_{N+1} = i}},

and denote by ñ_s = Σ_{i=0}^{c} ñ_s^i the total number of examples at node s. With these new counters we get

E_{P^N} log [ M_{m,θ_m}(Y_1 | X_1) / Q_m^E(Y_1 | X_1) ] = E_{P^N} [ (1/(N − L + 1)) Σ_{s∈∂m} Σ_{i=0}^{c} ñ_s^i log( (θ_s)_i (ñ_s + c) / ñ_s^i ) ],

thus

E_{P^N} log [ M_{m,θ_m} / Q_m^E ] ≤ (1/(N − L + 1)) E_{P^N} [ Σ_{s∈∂m} ñ_s log( (ñ_s + c) / ñ_s ) ],

where the first equality follows from the exchangeability of the distribution P^N (given the counters, the example Z_{N+1} plays the same role as each of the N − L estimation examples), with the convention 0 log 0 = 0 if necessary, and the inequality holds because Σ_i ñ_s^i log((θ_s)_i ñ_s / ñ_s^i) is non-positive (it is −ñ_s times a Kullback-Leibler divergence). Now, for a fixed s ∈ ∂m, if at least one example reached this node, that is, if ñ_s > 0, we have

ñ_s log( (ñ_s + c) / ñ_s ) ≤ c.

Now this last inequality obviously remains true if ñ_s^i = 0 for all i, and thus

E_{P^N} log [ M_{m,θ_m}(Y_1 | X_1) / Q_m^E(Y_1 | X_1) ] ≤ c |∂m| / (N − L + 1). □
measure 7r on the treemodels fl4
(fl4
is countablesince Q
iscountable);
we thusget
as astraightforward
consequence of Theorems 1 and 2: .COROLLARY 1. - With the
previous
choiceof
the estimatorsQm for
labeled tree
models,
and aprior
n on .M we getfor
the estimatorQn defined
in Section 2 theinequality
For example, let us take for prior π the distribution of the genealogy tree of a branching process of parameter p ≤ 1/2 (an individual gets two sons with probability p, or dies without descendance with probability 1 − p); this prior will be used several times in the next sections. This prior is explicitly given by

π(m) = p^{|∂m|−1} (1 − p)^{|∂m|}

(a complete binary tree with |∂m| terminal nodes has |∂m| − 1 internal nodes), so that the first penalization term in Eq. (6) is

log(1/π(m)) / L = [ (|∂m| − 1) log(1/p) + |∂m| log(1/(1 − p)) ] / L;

thus the natural choice for L (the size of the test sample) in this case is of the form kN, with k ∈ [0,1] such that the two penalization terms of Eq. (6) are balanced.
Remark. – Although the K-L risk is interesting in itself and has been customarily used for classification tree construction (through "local entropy minimization"; see, e.g., [1,2]), one may be legitimately interested in the true classification error of the estimator. A thorough discussion of a precise control of this error is beyond the scope of this paper; we simply recall some basic inequalities allowing one to control the classification error through the K-L risk. Let us forget for a moment the conditioning with respect to X, and let P be a probability distribution on Y = {0, ..., c}. If P is the "true" distribution, the best classification rule is to predict the class of highest probability. Let P* = max_i P(i); then the lowest attainable average classification error is L* = 1 − P*. Now suppose we have some estimate Q of P and predict the class â = arg max_i Q(i); the average classification error for this rule is L(Q) = 1 − P(â). Then the following elementary inequalities hold:

L(Q) − L* ≤ Σ_{i=0}^{c} |P(i) − Q(i)| ≤ √(2 H(P, Q)).

The first inequality can be found in [9] and the second one (Pinsker's inequality), for example, in [8], p. 300. Now if we make this reasoning conditional to some X, taking the expectation over X of the above quantities, we get (somewhat informally)

E_X L(Q(· | X)) − E_X L*(X) ≤ E_X √(2 H(P(· | X), Q(· | X))) ≤ √(2 E_X H(P(· | X), Q(· | X))).

Using the bounds derived for the K-L distance, we can thus obtain a coarse upper bound for the difference between the classification error obtained with the estimator Q(Y | X) and the optimal average error E_X L*(X).
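As a quick numerical check (ours) of these two inequalities for a three-class example:

    import numpy as np

    P = np.array([0.5, 0.3, 0.2])      # "true" distribution on Y
    Q = np.array([0.4, 0.45, 0.15])    # an estimate of P

    a_hat = int(Q.argmax())            # class predicted from Q
    excess = P.max() - P[a_hat]        # L(Q) - L*
    l1 = np.abs(P - Q).sum()           # sum_i |P(i) - Q(i)|
    kl = float((P * np.log(P / Q)).sum())   # H(P, Q)

    print(excess, "<=", l1, "<=", np.sqrt(2 * kl))  # 0.2 <= 0.3 <= ~0.31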
4. EXACT COMPUTATION FOR A FIXED-DESIGN TREE

In this section we will assume that there is actually no choice in the question to be asked at each internal node of the model tree. In this case we will speak of a "fixed-design" tree model; the set of models is then the set of all complete subtrees of the maximal complete binary tree of depth d, denoted by T̄. The most classical example of this is a "context tree" model, where X is a binary string of length d and all nodes at depth r are labeled with the question "what is the rth bit of X?". Recall that, since we are dealing with a discrete space Y, the developed formula for the progressive mixture estimator is given by Eq. (2).
We will take for prior π the distribution of the genealogy tree of a branching process of parameter p stopped at depth d, meaning that every node of the tree will have two sons with probability p or will be a terminal node with probability 1 − p, except for the nodes at depth d, which are always terminal. In this framework we can use an efficient algorithm, found by Willems, Shtarkov and Tjalkens [11] for universal coding using weighted context trees, to compute a sum of the form

Σ_{T⊂T̄} π(T) Π_{s∈∂T} P_s^E.   (7)

Namely, let us define for every node (internal or terminal) s of T̄ the counters n_s^i and b_s^i, for i = 0, ..., c, containing the number of examples x_k of class i whose "path" passes through node s (i.e., for which s ≼ T̄(x_k)), respectively for the estimation set E = z_{L+1}^N and for the truncated test set z_2^k, where ≼ means "to be an ancestor of or to be equal to". Let us now define the local (Laplace) estimator at node s (estimated using E, and applied to the truncated test set):

P_s^E = [ Π_{i=0}^{c} Π_{j=0}^{b_s^i − 1} (n_s^i + j + 1) ] / [ Π_{j=0}^{b_s − 1} (n_s + j + c + 1) ],  with n_s = Σ_i n_s^i and b_s = Σ_i b_s^i.
(Note that any other local estimator could actually be taken here instead of the Laplace estimator; this is only for the sake of simplicity and coherence with the previous section.) By backward induction on the depth of the nodes s ∈ T̄ we build the quantities

β_s = P_s^E  if s is at depth d;   β_s = (1 − p) P_s^E + p β_{s0} β_{s1}  otherwise   (8)

(we recall that s0 and s1 are the two sons of the node s). It has been proved in [11] that with this construction

β_∅ = Σ_{T⊂T̄} π(T) Π_{s∈∂T} P_s^E.
To compute all the terms of the Cesàro sum defining the progressive estimator, we have to perform 2L such sums; but having performed one sum, the next one is obtained only by adding an observation (x_i, y_i), so it is sufficient to update the calculations for the counters and the β_s along the branch it follows, giving rise to d computations. However, to determine completely the distribution Q̂_π, we have to consider every one of the 2^d possibilities for T̄(x_1) (that is, it can be any terminal node of T̄). In conclusion, the complexity of the algorithm is of order d 2^d N; this is to be compared to the number of models considered, which is greater than 2^{2^{d−1}}.
It has to be noted that the form of the prior π is crucial in order for this algorithm to be used (namely, the principle is that it allows the nice factorization (8) of Eq. (7)). However, we can make a straightforward extension to a more general set of priors by allowing the branching parameter p to depend on the node s considered in the tree, in which case one has just to replace the constant p by the local branching parameter p_s in Eq. (8). This set of priors is large enough to allow a wide range of flexibility; for example, one can choose to give less a priori weight to "shallow" trees, which we expect not to be so relevant, by taking p_s close to 1 for those nodes s that are close to the root.
5. DATA-DEPENDENT SCALING FOR CONTINUOUS-VALUED OBSERVATIONS

In this section we will consider the case when X takes its values in R^p, where each coordinate corresponds to a "feature" of the observation. Denoting by x^k the kth coordinate of x, we will restrict ourselves to questions of the form "x^k ≤ a?", depending on k and a. We can thus represent all possible questions by couples (k, a) ∈ {1, ..., p} × R. In principle it would be possible to discretize the values taken by a over R, for instance by restricting a ∈ Q, in order to get a countable set Q of questions. If we know nothing about the range of values taken by x, however, it seems clear that choosing some a priori distribution on Q would give quite bad results in practice unless we have a huge number of examples.
Hence we propose to modify slightly the construction of Catoni's estimator by letting the prior depend on the data. A model is now a triplet m = (T, K_T, A_T), where T is a tree structure, and K_T, A_T are families of integers and real numbers, respectively, both indexed by the internal nodes of T. A quite natural way to define a distribution on the set of models M is to choose a distribution on the tree structure T, then on K_T given T, then on A_T given (T, K_T). In the sequel we will denote by τ = (T, K_T) the tree with its internal nodes labeled with the "type" of question to ask (i.e., on which coordinate of X the question should be asked).
In our case we will choose the first two distributions in a deterministic (data-independent) way, which we will denote by π(τ). Now for the last one, we notice that in fact the value of the estimator Q_m^E(Y_1 | X_1, z_2^N) depends only on the statistic (m(x_1), ..., m(x_N)), that is, on the way the sample x_1^N is "split" across the tree. In other words, we can group the possible values of A_T into statistically (with regard to the sample) equivalent sets of questions, corresponding to the different possible splittings of the sample x_1^N (a "splitting" is here a mapping from the set {1, ..., N} to the set ∂τ of the leaves of the tree τ). Let Σ_τ(x_1^N) be the set of all such possible splittings when A_T varies. What we propose is simply to give equal weight to each one of these splittings. We thus define the data-dependent mixture probability

Q̄^E(dY_1^L | X_1^L, x_1^N) = Σ_τ π(τ) |Σ_τ(x_1^N)|^{−1} Σ_{σ∈Σ_τ(x_1^N)} Q_{τ,σ}^E(dY_1^L | X_1^L),

where Q_{τ,σ}^E denotes the common value of the estimators Q_m^E for the tree τ and the splitting σ.
It is important to note here that, with this definition, the value of the mixture probability on a single point still depends on the whole X-sample x_1^N, since the data-dependent prior still depends on the number of different ways we can split the entire X-sample, and that is why the conditioning will always involve x_1^N. (Recall that, on the other hand, the dependency of Q_{τ,σ}^E with respect to the "estimation" examples, ranging from L + 1 to N, is denoted by the superscript E.)
We can then define the data-dependent progressive mixture estimator Q̂_π^{x_1^N} by Eq. (1), just replacing Q_π^E with Q̄^E and introducing the suitable modification as for the conditioning, that is:

Q̂_π^{x_1^N}(dY_1 | X_1, z_2^N) = (1/L) Σ_{k=1}^{L} Q̄^E(dY_1 | X_1, z_2^k, x_1^N).
We now give a result similar to Theorem 1:

THEOREM 3. – With the above definitions, the progressive mixture estimator with data-dependent prior Q̂_π^{x_1^N} satisfies, for any exchangeable probability distribution P^N,

E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q̂_π^{x_1^N}(dY_1 | X_1, z_2^N) ) ≤ inf_{m=(τ,A_T)∈M} { E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q_m^E(dY_1 | X_1) ) + [ log(1/π(τ)) + (|∂m| − 1) log(N + 1) ] / L },

where |∂m| denotes the number of terminal nodes of the tree model m.
If we choose for the estimators Q_m the same conditional Laplace estimators as in the previous section, then using Theorem 2 we immediately deduce:

COROLLARY 2. – Under the same hypotheses, the estimator Q̂_π^{x_1^N} satisfies

E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), Q̂_π^{x_1^N}(dY_1 | X_1, z_2^N) ) ≤ inf_{m∈M} { inf_{θ_m} E_{P^N} H( P^N(dY_1 | X_1, Z_2^N), M_{m,θ_m}(dY_1 | X_1) ) + [ log(1/π(τ)) + (|∂m| − 1) log(N + 1) ] / L + c |∂m| / (N − L + 1) },

where M_{m,θ_m} is the conditional multivariate distribution associated with the tree model m and the set of parameters θ_m defined in Section 3.
Proof of Theorem 3. – We can apply the same first steps as in the proof of Theorem 1 (however, in this case, since we are dealing with discrete distributions, we can skip the considerations about relative densities). Namely, let m = (τ, A_T) be a given model; we want to upper-bound the quantity

E_{P^N} log [ Q_m^E(Y_1 | X_1) / Q̂_π^{x_1^N}(Y_1 | X_1, z_2^N) ] ≤ (1/L) Σ_{k=1}^{L} E_{P^N} log [ Q_m^E(Y_1 | X_1) / Q̄^E(Y_1 | X_1, z_2^k, x_1^N) ] = (1/L) Σ_{k=1}^{L} E_{P^N} log [ Q_m^E(Y_k | X_k) / Q̄^E(Y_k | X_k, z_1^{k−1}, x_1^N) ] = (1/L) E_{P^N} log [ Π_{k=1}^{L} Q_m^E(Y_k | X_k) / Q̄^E(Y_1^L | X_1^L, x_1^N) ],

where the first inequality follows from the concavity of the logarithm, the following equality is obtained by swapping the roles of (X_1, Y_1) and (X_k, Y_k) using the exchangeability of P^N (replacing (X_k, Y_k) by (X_1, Y_1) if necessary), and the last one by the chain rule. Finally, the following inequality holds:

Q̄^E(Y_1^L | X_1^L, x_1^N) ≥ π(τ) |Σ_τ(x_1^N)|^{−1} Π_{k=1}^{L} Q_m^E(Y_k | X_k).

It can be derived in the following way: for our fixed model m = (τ, A_T) there exists a splitting σ_0 which separates the data exactly in the same way as does the tree model m. Thus

Q_{τ,σ_0}^E(Y_1^L | X_1^L) = Π_{k=1}^{L} Q_m^E(Y_k | X_k),

and keeping only the corresponding term in the sum defining Q̄^E, as in the proof of Theorem 1, we get the desired inequality. Now the sample reaching an internal node s is at most of size N, and thus can be split in at most N + 1 ways at this node, since we only make use of order statistics for a certain coordinate. Thus

|Σ_τ(x_1^N)| ≤ (N + 1)^{|int(m)|},

where |int(m)| denotes the number of internal nodes of the model tree m; but since for a complete binary tree |∂m| = |int(m)| + 1, we get to the desired conclusion. □
This result can be enlarged in several ways. First, one can notice that we can easily extend this result to a more general framework. Namely, assume we have a possibly uncountable family of models M, but that every model can be written in the form m = (τ, λ_τ), where τ takes its values in some countable set. In the same way as above, we can choose a fixed prior distribution for τ and a data-dependent distribution for λ_τ, grouping the values of λ_τ into statistically equivalent sets, to which we assign an equal weight. We can then apply the same reasoning, and the main remaining point is to get a bound for E_{P^N} log |Σ_τ(x_1^N)|. For instance, if instead of questions of the type "x^k ≤ a?" we choose at each node a set of questions of Vapnik-Chervonenkis dimension bounded by D, then we easily come to

E_{P^N} log |Σ_τ(x_1^N)| ≤ C D |int(τ)| log N,

where C is a real constant.
Secondly, in practice it seems uneasy to compute |Σ_τ(x_1^N)|. Instead of a uniform distribution on all the globally possible splittings, we would like to take the distribution given by the product of the uniform distributions on all locally possible splits at each node. To state it in a more formal (and probably clearer) way, let us denote by σ a given partitioning of the sample over the tree τ, and by n_s^σ the number of examples reaching an internal node s under the partitioning σ. If we choose uniformly at each internal node between the different possible splits, then the new distribution ω on the possible partitionings is given by

ω(σ) = Π_{s∈int(τ)} 1 / (n_s^σ + 1).

The advantage of this distribution is that it can be computed easily in a recursive way for every σ. Furthermore, the bound of Theorem 3 still holds true, since for every partitioning σ,

ω(σ) ≥ (N + 1)^{−|int(τ)|}.
Finally, notice that the bound in Theorem 3 is not as good as in Theorem 1: namely, there is a term of order log(N)/L in the new bound, whereas the penalty term was only of order 1/L in Theorem 1. If we take L = kN, 0 < k < 1, this means that if the real conditional distribution P^N(dY_1 | X_1) is in the model (that is, is of the form M_{m,θ_m}(dY_1 | X_1)), the modified estimator converges towards it at a rate of order log N/N, to be compared with 1/N in Theorem 1. This is because we dropped the hypothesis of a countable family of models.
However, in the particular case considered in this section, namely conditional multivariate distributions M_{m,θ_m} depending on the tree model m and the set of parameters θ_m, the following theorem states that this rate of convergence is actually the minimax rate within the set of models, and therefore cannot be significantly improved.
THEOREM 4. – Assume X is a real random variable drawn according to the uniform distribution U on [0,1]; then, with the set of tree models M defined in this section, there exists a real constant C such that for any estimator Q(y | x, z_2^N) depending on a sample of size N, the following lower bound holds true for sufficiently big N:

sup_{m,θ_m} E_{Z^N} E_X H( M_{m,θ_m}(dY | X), Q(dY | X, Z^N) ) ≥ C log N / N,

where the expectation on Z^N is taken with respect to the product distribution (U(dX) M_{m,θ_m}(dY | X))^{⊗N} (in other words, when Z^N is drawn i.i.d. according to U(dX) · M_{m,θ_m}(dY | X)).
To prove this theorem we will make use of the following version of Fano's lemma (see, e.g., [5], Corollary 2.9):
LEMMA 1. – Let P be a finite set of probability distributions on a measurable space X such that |P| = J ≥ 4, and let F be a function on P taking its values in some metric space (F, d). Assume there exist positive constants K, γ, t such that

(i) for every P ≠ P′ in P, H(P, P′) ≤ K;
(ii) for every P ≠ P′ in P, d(F(P), F(P′)) ≥ 2t;
(iii) K ≤ γ log J.

Then for any function Φ: X → F and any p > 0 the following lower bound holds:

sup_{P∈P} E_P [ d(Φ, F(P))^p ] ≥ c(γ) t^p,

where c(γ) > 0 depends only on γ.
Proof of Theorem 4. – We can restrict ourselves to the case c = 1, since the minimax rate of convergence can only increase with c. Thus the distributions we are dealing with are conditional Bernoulli, which we will denote by B_{m,θ_m}(dY | X). We will apply Fano's lemma in the following framework: as the metric space F we will take the set of all distributions on E = [0,1] × {0,1}, and for d the Hellinger distance on E. Below we will construct the set P as a finite subset of the set of product probability distributions on ([0,1] × {0,1})^N. A natural function ψ: P → F is ψ: P^{⊗N} ↦ P, and any estimator depending on a sample of size N is indeed a function Φ: ([0,1] × {0,1})^N → F. Let us choose α, β ∈ [0,1], which will remain fixed for the rest of the proof.
Let us then consider the conditional probability of Y given by a Bernoulli distribution of parameter β if X belongs to some subinterval I ⊂ [0,1] of length η, and of parameter α if X does not belong to I. This distribution clearly belongs to our model, since it can be represented as a tree with two questions corresponding to the endpoints of I. If P_1^{⊗N}, P_2^{⊗N} are Nth direct products of two such distributions on (X, Y) (drawing X uniformly on [0,1]), obtained with two disjoint intervals I_1 and I_2 of the same length η, then we easily get

H(P_1^{⊗N}, P_2^{⊗N}) = N H(P_1, P_2) = C_{α,β} N η   and   d(P_1, P_2) = c_{α,β} √η,

where C_{α,β} and c_{α,β} are positive constants depending only on α and β.
Now we can find as many as J = ⌊1/(2η)⌋ such disjoint intervals in [0,1], and take for P the set of distributions thus obtained. Taking η of order log N/N, the hypotheses of Fano's lemma are now satisfied for K = 2 log N, γ = 1/4 and t² = C_1 log N/N, for a constant C_1 and N big enough so that (iii) is satisfied. This yields the desired result if we choose p = 2 in the inequality obtained, and finally apply the well-known inequality H(P, Q) ≥ 2 d²(P, Q) as a final step. □