A NNALES DE LA FACULTÉ DES SCIENCES DE T OULOUSE
P ASCAL M ASSART
Some applications of concentration inequalities to statistics
Annales de la faculté des sciences de Toulouse 6
esérie, tome 9, n
o2 (2000), p. 245-303
<http://www.numdam.org/item?id=AFST_2000_6_9_2_245_0>
© Université Paul Sabatier, 2000, tous droits réservés.
L’accès aux archives de la revue « Annales de la faculté des sciences de Toulouse » (http://picard.ups-tlse.fr/~annales/) implique l’accord avec les conditions générales d’utilisation (http://www.numdam.org/conditions).
Toute utilisation commerciale ou impression systématique est constitu- tive d’une infraction pénale. Toute copie ou impression de ce fichier doit contenir la présente mention de copyright.
Article numérisé dans le cadre du programme Numérisation de documents anciens mathématiques
http://www.numdam.org/
- 245 -
Some Applications of Concentration Inequalities
to
Statistics
PASCAL
MASSART(1)
E-mail: [email protected]
Annales de la Faculte des Sciences de Toulouse Vol. IX, n° 2, 2000
pp. 245-303
Nous presentons quelques applications d’inegalites de con-
centration a la resolution de problemes de selection de modeles en statis-
tique. Nous etudions en detail deux exemples pour lesquels cette ap-
proche s’avere particulirement fructueuse. Nous considerons tout d’abord le classique mais delicat probleme du choix d’un bon histogramme. Nous presentons un extrait de travail de Castellan sur la question, mettant en evidence que la structure meme des inegalites de concentration de Tala-
grand pour des processus empiriques influence directement la construc- tion d’un critere de selection de type Akaike modifie. Nous presentons egalement un nouveau theoreme de selection de modeles bien adapte a la
resolution de problemes d’apprentissage. Ce resultat per met de reinterpre-
ter et d’ameliorer la methode dite de minimisation structurelle du risque due a Vapnik.
ABSTRACT. - The purpose of this paper is to illustrate the power of con-
centration inequalities by presenting some striking applications to various model selection problems in statistics. We shall study in details two main examples. We shall consider the old-standing problem of optimal selection of an histogram (following the lines of Castellan’s work on this topic) for
which the structure of Talagrand’s concentration inequalities for empir- ical processes directly influences the construction of a modified Akaike criterion. We shall also present a new model selection theorem which can
be applied to improve on Vapnik’s structural minimization of the risk
method for the statistical learning problem of pattern recognition.
(1) Equipe de "Probabilites, Statistique et Modelisation", Laboratoire de Mathemati- que UMR 8628, Bat. 425, Centre d’Orsay, Universite de Paris-Sud 91405 Orsay Cedex.
1. Introduction
Since the last ten years, the
phenomenon
of the concentration of measurehas received much attention
mainly
due to the remarkable series of worksby
MichelTalagrand
which led to avariety
of newpowerfull inequalities (see
inparticular [37]
and[39]).
Our purpose in this paper is toexplain why
concentrationinequalities
for functionals ofindependent
variables areimportant
in statistics. We shall also present someapplications
to randomcombinatorics which in turn can lead to new results in statistics. Our choice here is to present some selected
applications
in details rather thanprovide
a more or less exhaustive review of
applications.
Some of these results areborrowed from very recent papers
(mainly
from Castellan[23], Birge
andMassart
[12]
andBoucheron, Lugosi
and Massart[16])
and some others arenew
(see
Section4).
Since the seminal works of
Dudley
in theseventies,
thetheory
ofprob- ability
in Banach spaces hasdeeply
influenced thedevelopment
of asymp- toticstatistics,
the main tools involved in theseapplications being
limittheorems for
empirical
processes. This led to decisive advances for the the- ory ofasymptotic efficiency
insemi-parametric
models for instance and the interested reader will find numerous results in this direction in the booksby
van der Vaart and Wellner
[43]
or van der Vaart[42].
The maininteresting
feature of concentration
inequalities
isthat,
unlike central limit theoremsor
large
deviationsinequalities, they
arenonasymptotic.
We believe that the introduction of these new tools is animportant
step towards the con-struction of a non
asymptotic theory
in statistics.By
nonasymptotic,
wedo not mean that
large samples
of observations are not welcome butthat,
for instance, it is of
great importance
to allow the number of parameters of aparametric
model todepend
on thesample
size in order to be ableto warrant that the statistical model is not far from the truth. Let us now
introduce a framework where this idea can be
developed
in greatgenerality.
1.1. Introduction to model selection
Suppose
that one observesindependent
variables ...,fln taking
theirvalues in some measurable space E. Let us furthermore assume, for the sake of
simplicity,
that these variables areidentically
distributed with commondistribution P
depending
on some unknown"parameter"
s E S(note
thatthe results we are
presenting
here are nonasymptotic
and remain valideven in the non
stationary
case, where P is the arithmetic mean of the distributions of the variables~;
... ,gn
and therefore maydepend
onn) .
Theaim is to estimate s
by using
as fewprior
information aspossible.
One cantypically
think of s as a functionbelonging
to some infinite dimensionalspace S The two main
examples
that we have in mind arerespectively
thedensity
and theregression
frameworks. Moreprecisely:
. In the
density framework,
P = is some unknown nonnegative
function and S can be taken as the set of
probability
densities with respect to. In the
regression framework,
thevariables
=YZ)
areindepen-
dent
copies
of apair
of random variables(X, Y),
where X takes itsvalues in some measurable space X.
Assuming
the variable Y to be squareintegrable,
one defines theregression
function s as s(x) -
E
[Y X
=x]
for every x E X.Denoting by
the distribution ofX,
,one can set S
== L~ (~).
One of the most common used method to estimate s is minimum contrast estimation.
1.1.1. . Minimum contrast estimation
Let us consider some contrast function q, defined on S x
" which
meansthat
In other
words,
the functional t - E(7 (t,
achieves a minimum atpoint
s. The heuristics of minimum contrast estimation is
that,
if one substitutes theempirical
criterionto its
expectation P ~~y (t, .)~
= E~~y (t, ~1)J
and minimizes theempirical
cri-terion -yn on some subset S of S
(that
we call amodel),
there is somehope
to get a sensible estimator
of s,
at least if sbelongs (or
is closeenough)
tomodel S. This estimation method is
widely
used and has beenextensively
studied in the
asymptotic parametric setting
for which one assumes that S is agiven parametric model,
sbelongs
to S and n islarge. Probably,
the most
popular examples
are maximum likelihood and least squares esti- mation. For thedensity
and theregression frameworks,
thecorresponding
contrast functions can be defined as follows.
. Assume that one observes
independent
andidentically
distributed random variables(~1, ...~n)
with common distribution sp. Then the contrast functionleading
to maximum likelihood estimation issimply
for every
density t
and thecorresponding
loss function l isgiven by
where K
(s, t)
denotes the Kullback-Leibler information number be- tween theprobabilities
Sjj and i.e.if sp is
absolutely
continuous withrespect
totp
and K(s, t)
= +00otherwise.
. Assume that one observes
independent
andidentically
distributedcopies
of apair (X, Y),
the distribution of Xbeing
denotedby
p.Then the contrast function
leading
to least squares estimation is de- fined forevery t
EL2 (J.t) by
and the loss function l is
given by
where 11.11
denotes the norm inL2 (~c) .
The main
problem
which arises from minimum contrast estimation in aparametric setting
is the choice of a proper model S on which the minimum contrast estimator is to be defined. In otherwords,
it may be difficult to guess what is theright parametric
model to consider in order to reflect the nature of data from the real life and one canget
intoproblems
wheneverthe model S is false in the sense that the true s is too far from S. One could then be
tempted
to choose S asbig as possible. Taking
S as S itself or as a"huge"
subset of S is known to lead to inconsistent(see [5])
orsuboptimal
estimators
(see [8]).
We see thatchoosing
some model S in advance leads to some difficulties. If S is a "small" model
(think
of someparametric model,
definedby
1 or 2 parameters for
instance)
the behavior of a minimum contrast estimator on S issatisfactory
aslong
as s is closeenough
to S butthe model can
easily
turn to be false.. On the contrary, if S is a
"huge"
model(think
of the set of all contin-uous functions on
[0,1]
in theregression
framework forinstance),
theminimization of the
empirical
criterion leads to a very poor estimatorof s.
It is therefore
interesting
to consider afamily
of models instead of asingle
one andtry
to select someappropriate
model from the data in or-der to estimate s
by minimizing
someempirical
criterion on the selected model. The idea ofestimating s by
model selection viapenalization
hasbeen
extensively developed
in[10]
and[7].
.1.1.2. Model selection via
penalization
Let us describe the method. Let us consider some countable or finite
(but possibly depending
onn)
collection of models and the cor-responding
collection of minimum contrast estimatorsIdeally,
one would like to consider m
(s) minimizing
the risk E[1 (s,
with respectto m E
Mn.
The minimum contrast estimator8m(s)
on thecorresponding
model is called an omcle
(according
to theterminology
introducedby
Donoho and
Johnstone,
see[21]
forinstance). Unfortunately,
since the riskdepends
on the unknown parameter s, so does m(s)
and the oracle is notan estimator of s.
However,
the risk of an oracle is a benchmark which will be useful in order to evaluate theperformance
of any data driven selectionprocedure
among the collection of estimators The purpose is therefore to define a data driven selectionprocedure
within thisfamily
ofestimators,
which tends to mimic anoracle,
i.e. one would like the risk of the selected estimator ? to be as close aspossible
to the risk of an oracle.The trouble is that it is not that easy to compute
explicitly
the risk of aminimum contrast estimator and therefore of an oracle
(except
forspecial
cases of interest such as least square estimators on linear models for exam-
ple).
This is the reasonwhy
we shall rather compare the risk of s to anupper bound for the risk of an oracle. In a number of cases
(see [9]),
agood
upper bound
(up
toconstant)
for the risk of8m
has thefollowing
formwhere
Dm
is a measure of the "size" of the model(something
like the number of parametersdefining
the model8m)
and l(s,
=inft~Sm
l(s, t).
Thismeans
that,
at a more intuitivelevel,
an oracle ismaking
agood compromise
between the "size" of the model(that
one would like tokeep
as small aspossible)
and thefidelity
to the data. The model selection viapenalization
method can be described as follows. One considers some function penn :
Mn
-R+
which is called thepenalty function.
Note that penn canpossibly depend
on the observations~1, ..., ~~,
but not of course on s.Then,
for everym E
Mn,
one considers the minimum contrast estimator within modelSm
Selecting
fii as a minimizer ofover
Mn,
onefinally
estimates sby
the minimumpenalized
contrast esti-mator
Since some
problems
can occur with the existence of a solution to the pre-ceding
minimizationproblems,
it is useful to considerapproximate
solutions(note
that even if8m
doesexist,
it is relevant from apractical point
of viewto consider
approximate
solutions since8m
willtypically
beapproximated by
some numericalalgorithm). Therefore, given
0(in practice,
tak-ing
pn =n-2
makes the introduction of anapproximate
solutionpainless),
we shall consider for every m E
Mn
someapproximate
minimum contrastestimator
8m satisfying
and say that ? is a
pn-minimum penalized
contrast estimator of s ifTn
(S)
+ penn In(t)
+ penn(m)
+03C1n,~m
EMn
and Vt E(4)
The reader who is not familiar with model selection via
penalization
canlegitimately
ask thequestion:
where does the idea ofpenalization
comefrom? It is
possible
to answer thisquestion
at two different levels:. at some intuitive level
by presenting
the heuristics of one of the first criterion of this kind which has been introducedby
Akaike(1973);
. at some technical level
by explaining why
such a strategy of model selection has some chances to succeed.We shall now
develop
these twopoints.
We first present the heuristics of Akaike’s criterion on the caseexample
ofhistogram
selection which will be studied in details in Section3, following
the lines of[23].
.1.2. A case
example
and Akaike’s criterionWe consider here the
density
framework where one observes nindepen-
dent and
identically
distributed random variables with commondensity s
with respect to the
Lebesgue
measure on[0,1].
LetMn
be some finite(but possibly depending
onn)
collection ofpartitions
of[0,1]
into intervals. For anypartition
m, we consider thecorresponding histogram
estimator8m
defined
by
--where tt denotes the
Lebesgue
measure on[0,1],
and the purpose is to select"the best one". Recall that The
histogram
estimator on somepartition
m is known to be the maximum likelihood estimator on the model
Sm
of densities which arepiecewise
constants on thecorresponding partition
m and therefore falls into our
analysis.
Then the natural loss function to be considered is the Kullback-Leibler loss and in order to understand the construction of Akaike’scriterion,
it is essential to describe the behavior ofan oracle and therefore to
analyze
the Kullback-Leibler risk. First it easy tosee that the Kullback-Leibler
projection
sm of s on thehistogram
model8m
(i.e.
the minimizer of t --> K(s, t)
on8m)
issimply given by
theorthogonal projection
of s on the linear space ofpiecewise
constant functions on thepartition
m and that thefollowing Pythagore’s
typedecomposition
holdsHence the oracle should minimize K
(s, 8m)
+E[K (sr,.~, s~.,.~)~
orequivalently,
since s - sm is
orthogonal
tolog (s~.,z),
Since sm log sm depends
on s, it has to be estimated. One could think ofj 8m log 8m
asbeing
agood
candidate for this purpose but since E~s~.,.z~
=the following identity
holdswhich shows that it is necessary to remove the bias
of j 8m log 8m
if onewants to use it as an estimator
of j
smlog
sm . In order to summarize thepreceding analysis
of the oracleprocedure (with
respect to the Kullback- Leiblerloss),
let us setThen the oracle minimizes
The idea
underlying
Akaike’s criterion relies on two heuristics:.
neglecting
the remainder termRm (which
is centered at its expecta-tion),
.
replacing
E[K (sm , [K (s.",,, s"~)~ by
itsasymptotic equivalent
when n goes to
infinity
which isequal
toDm/n,
where1+D",
denotesthe number of
pieces
of thepartition
m(see [23]
for aproof
of thisresult).
Making
these twoapproximations
leads to Akaike’s method which amounts toreplace (6) by
and proposes to select a
partition
mminimizing
Akaike’s criterion(7).
Anelementary computation
shows thatIf we denote
by yn
theempirical
criterioncorresponding
to maximumlikelihood
estimation,
i.e.(t)
=Pn [- log (t)~,
we derive that Akaike’s criterion can be written asand is indeed a
penalized
model selection criterion of type(3)
with penn(m) = Dm /n.
It will be one of the main issues of Section 3 to discuss whether this heuristicapproach
can be validated or not but we canright
now try to guesswhy
concentrationinequalities
will be useful and in what circumstances Akaike’s criterion should be corrected.Indeed,
we have seen that Akaike’s heuristicsrely
on the fact that somequantities Rm
stay close to their ex-pectations (they
areactually
all centered at0). Moreover,
this should hold with a certainuniformity
over the list ofpartitions Mn .
This means that if the collection ofpartitions
is not toorich,
we canhope
that theIUn’s
willbe concentrated
enough
around theirexpectations
to warrant that Akaike’s heuristicsworks,
while if the collection is toorich,
concentrationinequali-
ties will turn to be an essential tool to understand how one should correct
(substantially)
Akaike’s criterion.1.3. The role of concentration
inequalities
The role of the
penalty
function isabsolutely
fundamental in the defi- nition of thepenalized
criterion(3)
or(4),
both from a theoretical and apractical point
of view.Ideally,
one would like to understand how to choose thispenalty
function in anoptimal
way, that is in order that theperfor-
mance of the
resulting penalized
estimator be as close aspossible
as thatof an oracle. Moreover one would like to
provide explicit
formulas for thesepenalty
functions to allow theimplementation
of thecorresponding
modelselection
procedures.
We shall see that concentrationinequalities
are notonly helpful
toanalyze
the mathematicalperformance
of the minimum pe- nalized contrast estimator in terms of risk bounds but also tospecify
whatkind of
penalty
functions are sensible. In fact thesequestions
are two dif- ferent aspects of the sameproblem
since in ouranalysis,
sensiblepenalty
functions will be those for which we shall be able to
provide
efficient risk bounds for thecorresponding penalized
estimators. Thekey
of thisanalysis
is to take l
(s, t)
as a loss function and notice that the definition of the pe- nalizedprocedure
leads to a verysimple
but fundamental control forl ( s, s’ . Indeed, by
the definition of s wehave,
whatever m EMn
and sm Eand therefore
If we introduce the centered
empirical
processand notice that E
~~y (t, ~1 )~ -
E~7 (u, $1)]
= l(s, t) - l (s, u)
forall t, u
ES,
we
readily
get from(8)
Roughly speaking, starting
frominequality (9),
one would like to choose apenalty
function in such a way that it dominates the fluctuations of the variableyn (s~) - ~yn (S).
The trouble is that this variable is not that easy to control since we do not know where m is located. This is the reasonwhy
we shall rather control
yn (s~.,.z) - ~yn (s~.,.t~ ), uniformly
over m’ E At thisstage
we shall useempirical
processestechniques
and moreprecisely
concen-tration
inequalities
that willhelp
us to understand howthe "complexity"
of the collection of models has to be taken into account in the definition of the
penalty
function.We can indeed derive from this
approach
somequite general
way of de-scribing
thesepenalty
functions. Let be afamily
ofnonnegative weights
suchthat,
for some absolute constant ~Most of the time E can be taken as 1, but it is useful to
keep
some flex-ibility
with the choice of thisnormalizing
constant. We should think ofas some
prior
finite measure on the collection of models which in some sense measuresits " complexity" .
For every m Efl4n,
let~~
besome
quantity measuring
thedifficulty
forestimating
within model Assuggested by (2),
one shouldexpect
thatam
=Dm/n,
ifDm
is a prop-erly
defined "dimension" of the modelSm (typically,
in our caseexample
ofhistograms
above 1 +Dm
is the number ofpieces
ofpartition m).
Thenwhere
Ki
andK2
are proper constants.Obviously
one would like to know what is theright
definition foram
and what values for~1
andK2
are al- lowed. The advances for a betterunderstanding
of the calibration ofpenalty
functions are of three different types.
. When the centered
empirical
process"1 n
isacting linearly
onSm
which is itself some
part
of aDm
dimensional linear space, it is pos- sible to get a veryprecise
idea of what the minimalpenalty
functionsto be used are. This
idea,
first introduced in[10],
was based on a pre-liminary
version ofTalagrand’s
deviationinequalities
forempirical
processes established in
[38].
Itreally
became successful with the ma-jor improvement
obtainedby Talagrand
in[39].
This new version istruly
a concentrationinequality
around itsexpectation
for the supre-mum of an
empirical
process and not a deviationinequality
fromsome unknown
(or unrealistic)
numerical constant as was theprevi-
ous one. As a
result,
in the linear situations describedabove,
it ispossible
tocompute explicitly cr~ together
with a minimal value for the constantKi . This,
of course, is of greatimportance
in the situa- tion where Xm behaves like a corrective term and not like aleading
term in formula
(10),
whichconcretely
means that the list of mod- els is not too rich(one
model per dimension is atypical example
ofthat kind like in the
problem
ofselecting
the bestregular histogram
for
instance).
This program has beensuccessfully applied
in variousframeworks
(see [2], [23], [24], [12]
where randompenalty
functionsare also considered and
[3]
where the context ofweakly dependent
data is even more
involved).
. When the collection of models is too
rich,
like in theproblem
ofselecting irregular histograms
for instance, Xm nolonger
behaves likea corrective term but like a
leading
term and it becomes essential to evaluate the constantK2.
This program can be achieved whenever onehas at one’s
disposal
someprecise exponential
concentrationbounds,
that isinvolving explicit
numerical constants. This isexactly
the casefor the Gaussian frameworks considered in
[11],
where one can usethe Gaussian concentration
inequality
due toCirelson, Ibragimov
andSudakov
(see [1 7]
andInequality
11below).
More thanthat,
thanks toLedoux’s way of
proving Talagrand’s inequality (as
initiated in[28])
it is also
possible
to evaluate the constants involved inTalagrand’s
concentration
inequalities
forempirical
processes(see [33]),
whichleads to some evaluations of
K2.
This idea isexploited
in[12]
andwill be illustrated in Section 3
following
the lines of Castellan’s work[23].
. For the linear situations mentioned
above,
there is some obvious def- inition for the dimensionDm
ofSm
In the nonlinear case,things
arenot that clear and one can find in the literature various ways of defin-
ing analogues
of the linear dimension. In[7],
various notions of metric dimensions are introduced(defined
fromcovering by
differenttypes
of
brackets)
while an alternative notion of dimension is theVapnik-
Chervonenkis dimension
(see [40]).
Since theearly eighties Vapnik
has
developed
astatistical learning theory (see
thepioneering
book[40]
and more recent issues in[41]).
One of the main methods intro- ducedby Vapnik
is what he called the structural minimizationof
therisk which is a model selection via
penalization
method in the abovesense but with a calibration for the
penalty
function which differssubstantially
from(10).
A minor difference with what can be found in[7]
is that inVapnik’s theory,
theVapnik-Chervonenkis
dimensionis used to measure the size of a
given
model instead of" bracketing"
dimensions. But there exists some much more
significant
differenceconcerning
the order ofmagnitude
of thepenalty
functions since thecalibration of the
penalty
inVapnik’s
structural minimization of the risk method is rather of the order of the square root of(10).
Ourpurpose in Section 4 will be to show that the reason for this is that this calibration for the
penalty
function is related to"global" (that
isof
Hoeffding type)
concentrationinequalities
forempirical
processes.We shall also propose some
general
theorem(for
bounded contrastfunctions)
based onTalagrand’s inequality
in[39] (which
is "local", that is of Bernstein’s
type)
which allows to deal with any kind of dimension(bracketing
dimension orVC-dimension).
From this newresult we can recover some of the results in
[7]
and remove the squareroot in
Vapnik’s
calibration of thepenalty
function which leads toimprovements
on the risk bounds for thecorresponding penalized
es-timators.
Moreover, following
the idea introduced in[16],
we showthat it is
possible
to use a random combinatorial entropy number rather than the VC dimension in the definition of thepenalty
func-tion. This is made
possible by using again
a concentration argument for the random combinatorial entropy number around its expecta- tion. The concentrationinequality
which is used for this purpose isestablished in
[16].
It is an extension to nonnegative
functionals ofindependent
variables of a Poissonian boundgiven
in[33]
for thesupremum of non
negative empirical
processes. There existapplica-
tions of this bound to random combinatorics which are
developed
in[16]
and that we shall notreproduce
here.However,
thisinequality
is so
simple
to state and to prove that we could not resist to thetemptation
ofpresenting
itsproof
in Section 2 of the present paper dedicated to MichelTalagrand.
2. Concentration
inequalities
and some directapplications
The oldest
striking
resultillustrating
the concentration ofproduct prob- ability
measuresphenomenon
is the concentration of the standard Gaussianmeasure on Let P denote the canonical Gaussian measure on the Eu- clidean
space RD
andlet (
be someLipschitz
function onRD
withLipschitz
constant L.
Then,
for every x0,
where M denotes either the mean or the median
of (
with respect to P(of
course the sameinequality
holds whenreplacing ( by -(
whichimplies
that
(
concentrates aroundM).
Thisinequality
has been established inde-pendently by
Cirelson and Sudakov in[18]
and Borell in[15]
when M isa median and
by Cirelson, Ibragimov
and Sudakov in[17]
when M is themean. The
striking
feature of theseinequalities
is the fact thatthey
do notdepend
on the dimension D(or
moreprecisely only through
M andL)
whichallows to use them for
studying
infinite dimensional Gaussian measures and Gaussian processes for instance(see [27]). Extending
such results to moregeneral product
measures is not easy.Talagrand’s approach
to thisproblem
relies on
isoperimetric
ideas in the sense that concentrationinequalities
forfunctionals around their median are derived from
probability inequalities
for
enlargements
of sets with respect to various distances. Atypical
resultwhich can be obtained
by
his methods is as follows(see [36]
and[37]
forthe best
constant).
Let~a,
beequipped
with the canonical Euclidean distance andlet (
be some convex andLipschitz
function on(a,
withLipschitz
constant L. Let P be someproduct probability
measure on[a, b]n
and M be some median
of ( (with
respect to theprobability P), then,
forevery x 0
Moreover,
the sameinequality
holds for-(
instead of(.
For thisproblem,
the
isoperimetric approach developed by Talagrand
consists inproving
thatfor any convex set A of
[a,
,where d
(., A)
denotes the Euclidean distance function to A . The latterinequality
can beproved by
at least two methods: theoriginal proof by Talagrand
in[37]
relies on a control of theLaplace
transform of d(., A)
which is
proved by
induction on the number of coordinates while Marton’sor Dembo’s
proofs (see [31]
and[20] respectively)
are based on some in-formation
inequalities.
An alternativeapproach
to thisquestion
has beenproposed by
Ledoux in[28].
It consists infocusing
on the functional(,
ratherthan
starting
from sets, andproving
alogarithmic
Sobolevtype inequality
which leads to some differential
inequality
on theLaplace
transform of(.
Integrating
this differentialinequality yields,
forevery ~ ~ 0,
which in turn
implies
via Chernoff’sinequality
It is instructive to compare
(12)
and(13).
Since(12)
holds whenreplacing ( by -(,
somestraightforward integration
shows that for someappropriate positive
numerical constant Cwhich is of the same nature as
(13)
but isobviously
weaker from the point of view ofgetting
asgood
constants aspossible.
For theapplications
thatwe have in
view,
deviationinequalities
of a functional from its mean(rather
than from its
median)
are more suitable. For the sake ofpresenting inequal-
ities with the best available constants, we shall record below a number of
inequalities
of that type obtainedby
direct methods such asmartingale
dif-ferences
(see [35])
orlogarithmic
Sobolevinequalities (see [28]).
Webegin
with
Hoeffding
typeinequalities
forempirical
processes.2.0.1.
Hoeffding type inequalities for empirical
processesThe connection between convex
Lipschitz
functionals andHoeffding type inequalities
comes from thefollowing elementary
observation. Let~1, ..., ~n
be
independent ~a, b]-valued
random variables and let be somefinite
family
of realnumbers,
thenby Cauchy-Schwarz inequality,
the func-tion (
defined on[a, by
is convex and
Lipschitz
on withLipschitz
constant y, whereo~2
= supt~T03A3ni=1 ai t
Therefore(13) implies
that the random variable Z definedby
satisfies for every x 0
This
inequality
is due to Ledoux(see inequality (1.9)
in[28])
and canbe
easily
extended to thefollowing
moregeneral
framework. Let Z = SUPtETXi,t,
wherebi,t
for some real numbers andbi,t,
for all i x n and all t E T . .
Then, setting L2
= suptET~2 1 (bZ,t
-one has for every x > 0
(see [33]),
The classical
Hoeffding inequality (see [25])
ensures that when T= ~ 1 ~,
the variable Z =
~Z ~
satisfiesfor every
positive
x. When we compare(15)
to(16),
we notice that somefactor 4 has been lost in the exponent of the upper bound. As a matter of
fact,
at theprice
ofchanging
Inequality (16)
can be shown to hold for Z = supt~T03A3ni=1 XZ,t.
This re-sult derives from the
martingale
difference method as shownby
McDiarmidin
[35].
A useful consequence of this result concernsempirical
processes.Indeed,
if ...,~n
areindependent
random variables and .?~’ is a finite orcountable class of functions such
that,
for some real numbers a andb,
one hasa
f
b for everyf
thensetting
Z =sUPfEF ~Z 1 f (~2 ) --1E [1 (~2 )~,
we
get by
monotone convergenceIt should be noticed that
(17)
does notgenerally provide
asubgaussian inequality.
The reason is that the maximal variancecan be
substantially
smaller thann(b-a)2 /4
and therefore(17)
can bemuch worse than its "sub-Gaussian" version which should make
0’2
appear instead of n(b - a)2 /4 .
It isprecisely
our purpose now toprovide sharper
bounds than
Hoeffding
typeinequalities.
Webegin by considering
nonneg- ative functionals.2.1. A Poissonian
inequality
fornonnegative
functionalsWe intend to present here some extension to
nonnegative
functionals ofindependent
random variables due toBoucheron, Lugosi
and Massart(see [16])
of a Poissonian bound for the supremum of nonnegative empirical
pro-cesses established in
[33] by using
Ledoux’sapproach
to concentration in-equalities.
The motivations forconsidering general nonnegative
functionals ofindependent
random variables came from random combinatorics. Several illustrations aregiven
in[16]
but we shall focus here on the caseexample
ofrandom combinatorial
entropy
since thecorresponding
concentration result will turn out to be very useful fordesigning
randompenalties
to solve the model selectionproblem
for patternrecognition (see
Section4). Roughly speaking,
under some condition(C)
to begiven below,
we shall show thata
nonnegative
functional Z ofindependent
variables concentrates around itsexpectation
like a Poisson random variable withexpectation
E[Z] (this comparison being expressed
in terms ofLaplace transform).
This Poissonianinequality
can be deduced from theintegration
of a differentialinequality
for the
Laplace
transform of Z which derives from akey
information bound.A very remarkable fact is that Han’s
inequality
for Kullback-Leibler infor- mation is at the heart of theproof
of this bound and is alsodeeply
involvedin the verification of condition
(C)
for combinatorialentropies.
2.1.1. Han’s
inequality for entropy
Let us first recall some well known fact about Shannon’s
entropy
and Kullback-Leibler information. Given some random variable Ytaking
its val-ues in some finite set
y,
Shannon entropy is definedby
Setting,
qy = P(Y
=y)
for anypoint
y in the supportof Y,
Shannon entropycan also be written as
hs (Y)
= E[- log qy],
from which onereadily
seesthat it is a
nonnegative quantity.
Therelationship
between Shannon entropy and Kullback-Leibler information isgiven by
thefollowing identity.
LetQ
be the distribution of
Y,
P be the uniform distribution onV
and N be thecardinality
ofy,
thenWe derive from this
equation
and thenonnegativity
of the Kullback-Leibler information thatwith
equality
if andonly
if Y isuniformly
distributed on its support. Han’sinequality
for Shannon entropy can be stated as follows(see ~19~,
p. 491 fora
proof ) .
PROPOSITION 2.1
(Han’s inequality).
- Let us consider some randomvariable Y with values in some
finite product
spaceyn
and write =(Yl,
...,YZ-1, YZ+1, ...Yn) for
every i E~1,
...,n~.
. ThenIn view of
(18)
Han’sinequality
for discrete variables can benaturally
extended to
arbitrary
distributions in thefollowing
way. Let(on, An,
(~Z 1 SZi, ~i 1.~4Z, ~Z 1 ~i) be
someproduct probability
spaceand Q
besome
probability
distribution on which isabsolutely
continuous with re-spect to Pn. Let Y be the
identity
map onon,
=(Yl,
...,Yi-1, YZ+1, ...Yn )
for