2003 Éditions scientifiques et médicales Elsevier SAS. All rights reserved S0246-0203(02)00017-1/FLA
LAPLACE TRANSFORM ESTIMATES AND DEVIATION INEQUALITIES
ESTIMÉES DE LA TRANSFORMÉE DE LAPLACE ET INÉGALITÉS DE DÉVIATION
Olivier CATONI
Laboratoire de probabilités et modèles aléatoires, U.M.R. 7599 du C.N.R.S., case 188, Université Paris 6, bureau 4 E 19, bât Chevaleret, 4, place Jussieu, 75252 Paris cedex 05, France
Received 16 February 2000, revised 11 December 2001
ABSTRACT. – We derive deviation inequalities from non-asymptotic bounds of the log-Laplace transform of a function ofNrandom variables. We assume either that these random variables are independent or that they form a Markov chain. We assume also that the partial finite differences of order one and two of the function are suitably bounded, or more generally that they have some exponential moments. The estimates we get are sharp enough to induce a central limit theorem whenN goes to infinity and to prove non-asymptotic “almost Gaussian” deviation bounds.
2003 Éditions scientifiques et médicales Elsevier SAS MSC: 60E15; 60F05; 94A17; 60G42
Keywords: Concentration of product measures; Deviation inequalities; Markov chains;
Maximal coupling; Central limit theorem
RÉSUMÉ. – Nous démontrons des inégalités de déviation à partir d’estimées du logarithme de la transformée de Laplace d’une fonction deN variables aléatoires. Nous supposons soit que ces variables sont indépendantes, soit qu’elles forment une chaîne de Markov. Nous supposons aussi que les différences finies partielles d’ordre un et deux de la fonction sont convenablement bornées, ou plus généralement qu’elles ont certains moments exponentiels. Nos estimées sont suffisamment précises pour induire un théorème de la limite centrale quandN tend vers l’infini et pour prouver des inégalités de déviation “presque Gaussiennes”.
2003 Éditions scientifiques et médicales Elsevier SAS
E-mail address: [email protected] (O. Catoni).
Introduction
We are going to give “almost Gaussian” finite sample bounds for the log-Laplace transform of some functions of the type f (X1, . . . , XN), where the random variables (X1, . . . , XN)are assumed to be independent, or to form a Markov chain.
We will use throughout the paper a normalisation that parallels the classical case of the sum
√1 N
N i=1
Xi
of real valued random variables. As is usual in these matters, we will deduce from upper log-Laplace estimates finite sample almost sub-Gaussian deviation inequalities for f (X1, . . . , XN). We will also obtain that the central limit theorem holds when N goes to infinity. Although limit laws are not the main subject of this paper, they will be a guide for using a relevant normalisation of constants.
To make sure that this is feasible, it is necessary to make assumptions not only on the first order partial derivatives (or more generally first order partial finite differences) of f, but also on the second order partial derivatives (or more generally on the second order partial finite differences) off. Indeed, the simple example of
f (X1, . . . , XN)=g 1
√N N i=1
Xi
,
should immediately convince the reader that Lipschitz conditions are not enough to enforce a Gaussian limit.
A proper normalisation being chosen, we will be interested in expansions of Z def= f (X1, . . . , XN)−E(f (X1, . . . , XN))of the type
logEexp(λZ)=λ2
2V(Z)+ · · ·,
whereλis “of order one”,V(Z)=E{[Z−E(Z)]2}and the remaining terms are small whenN is large.
Our line of proof will be a combination of the martingale difference sequence approach initiated by Hoeffding [6] and Yurinskii [21] and the statistical mechanics philosophy we already used in [4]. The martingale approach to deviation inequalities is also reviewed in [10, p. 30] and [16]. More precisely, we will decompose Z into its martingale difference sequence Z =Ni=1Fi and we will take appropriate partial derivatives of the log-Laplace transform
(λ1, . . . , λN)→logEexp N
i=1
λiFi
.
We will consider first the case of independent random variablesX1, . . . , XN, ranging in some product of probability spaces Ni=1(Xi,Bi, µi). In the first section, we will
assume that the partial finite differences of f (x1, . . . , xN) of order one and two are bounded. In the second section, we will assume that they have exponential moments instead, and in the third section, we will study the case of Markov chains.
1. Bounded range functionals of independent variables
Let the collection of random variables X =(X1, . . . , XN) take its values in some product of measurable spaces Ni=1(Xi,Bi). We will assume in the following that (X1, . . . , XN) is the canonical process. Let P = Ni=1µi be a product probability measure on Ni=1(Xi,Bi).
For any bounded measurable functionW of(X1, . . . , XN), we will define the modified probability distributionPW by
dPW= exp(W ) E[exp(W )]dP,
and we will use the notation EW for the expectation operator with respect toPW. On the other hand, ifFis some sub-sigma algebra of Ni=1Bi, thenEFwill be used as a short notation for the conditional expectation with respect toF.
As(EF)W =(EW)F, we will simply writeEFW for this conditional expectation which we can more explicitely define as
EFW(U )=E[Uexp(W )|F] E[exp(W )|F] .
Note that we have EW(EFW)= EW, whereas E(EFW) =EW and EW(EF) =EW in general. In the same way we will use the notation VW(U )forEW{[U−EW(U )]2}and the notation M3W(U ) for EW{[U −EW(U )]3}. Note that for any bounded measurable functionW,
∂
∂λlogEexp(λW )=EW(W ),
∂2
∂λ2logEexp(λW )=VW(W ), and
∂3
∂λ3logEexp(λW )=M3W(W ).
For eachi=1, . . . , N, letFi be the sigma algebra generated by(X1, . . . , Xi)and let Gi be the sigma algebra generated by(X1, . . . , Xi−1, Xi+1, . . . , XN).
To stress the role of the independence assumption, we will put the superscript “i” on the equalities and inequalities requiring this assumption.
Let us introduce some notations linked with the martingale differences of a bounded random variableW measurable with respect toFN. We will put
Gi(W )=W −EGi(W ), Fi(W )=EFi(W )−EFi−1(W )
=i EFi(Gi).
As we explained in the introduction, we will study the log-Laplace transform of Z=f (X)−Ef (X)
of some bounded measurable real valued functionf:Ni=1Xi →R. We will decompose Z into the sum of its martingale differences
Z= N i=1
Fi(Z), and use the short notationZi=EFi(Z).
In this context, it is natural to assume that for some positive constants Bj, j = 1, . . . , N, for any(x1, . . . , xN)∈Nj=1Xj, for anyi=1, . . . , N, anyyi ∈Xi,
f (x1, . . . , xN)−f (x1, . . . , xi−1, yi, xi+1, . . . , xN) Bi
√N.
The reader should understand that we are interested mainly in the case when the constantsBi are of order one. Although all these scaling factors are not really needed for finite sample bounds, we have found them useful to indicate what should be considered to be small and what should not.
To ensure that f (X) is almost Gaussian, we need to make also an assumption on its second partial differences. This corresponds to conditions on the second partial derivatives off in the case when the random variables (X1, . . . , XN) take their values in some finite dimensional vector space, are bounded, andf is a smooth function.
For simplicity, we will use the short notationx1N for(x1, . . . , xN). Let us put for any x1N∈Nj=1Xj, anyyi ∈Xi,
ifx1N, yi
=fx1N−fx1i−1, yi, xiN+1.
For a fixed value of yi,if may be seen as a function ofx1N, and when we will write jif (x1N, yi, yj), we will mean that we applyj to this function and toyj. (A more accurate but lengthy notation would have beenj(if (·, yi))(x1N, yj).)
Let us assume that for some nonnegative exponentζ, for anyi=j, for some positive constantCi,j, and for anyx1N∈Nk=1Xk,yi ∈Xi,yj∈Xj,
ijfx1N, yj, yi
Ci,j
N3/2−ζ.
Note that ijf (x1N, yj, yi)=jif (xN1, yi, yj) and therefore that we can assume thatCi,j =Cj,i. We will moreover assume by convention thatCi,i=0.
The normalisation is made so thatζ =0 corresponds to the case of f (X1, . . . , XN)=√
N g X1
N , . . . ,XN
N
,
where g:[0,1]N →R is a smooth real function of N real bounded arguments with bounded first and second partial derivatives.
Another class of functions satisfying these hypotheses are the functions of the type fx1N= 1
N3/2
1i<jN
ψi,j(xi, xj),
whereψi,j are bounded measurable functions. Hereζ=0, and we can take Bi= 2
N j <iψj,i∞+
j >i
ψi,j∞
, and
Ci,j =Cj,i=4ψi,j∞, i < j.
More generally
f (xN1)=N1/2−r
1i1<···<irN
ψ(i1,...,ir)(xi1, . . . , xir) also satisfies our hypotheses, when the functionsψ(i1,...,ir)are bounded.
THEOREM 1.1. – Under the previous hypotheses, for any positiveλ, logEexpλf (X)−λEf (X)−λ2
2 Vf (X)
λ3
N1/2−ζ BCB
4N2 + λ3
√N N i=1
Bi3 3N.
COROLLARY 1.1. – Thusf (X)satisfies the following deviation inequalities:
Pf (X)Ef (X)+εexp
− ε2
2(V(f (X))+V(f (X))ηε )
, Pf (X)Ef (X)−εexp
− ε2
2(V(f (X))+V(f (X))ηε )
, with
η= 1
2N1/2−ζ BCB
N2 + 2 3√ N
N i=1
Bi3 N .
Remark 1.1. – We obtain for some constant K depending on max1iNBi and max1i,jNCi,j, but not onN, that
logEexpλf (X)−λEf (X)−λ2
2Vf (X) Kλ3 N1/2−ζ.
Therefore if we consider a sequence of problems indexed byNsuch that the constantsBi
andCi,j stay bounded, and such thatV(f (X))converges, we get a central limit theorem
as soon as ζ <1/2 (with the caveat that the limiting distribution may degenerate to a Dirac mass if the asymptotic variance is 0): f (X)−E(f (X))converges in distribution to a Gaussian measure.
Remark 1.2. – The critical valueζc=1/2 is sharp, since when f (X)=g
1
√N N
i=1
Xi
,
the central limit theorem obviously does not hold in general, andζ =1/2.
Proof. – After decomposingZinto the sum of its martingale differences, we can view the log-Laplace transform ofZas a function ofN equal “temperatures”:
logEexp(λZ)=logE
exp N
i=1
λiFi(Z)
, λi=λ, i=1, . . . , N .
The first step is to take three derivatives with respect to λi, for i ranging from N backward to 1:
logEexp(λZi)=logEexp(λZi−1)+ λ 0
EλZi−1+αFi(Z)
Fi(Z)dα.
Therefore
logEexp(λZ)= N i=1
λ 0
EλZi−1+αFi(Z)
Fi(Z)dα
= N i=1
λ 0
(λ−β)VλZi−1+βFi(Z)
Fi(Z)dβ
= N i=1
λ2 2 EλZi−1
Fi(Z)2+ λ 0
(λ−γ )2
2 M3λZi−1+γ Fi(Z)
Fi(Z)dγ .
Thus, using the fact that for any real random variable
Eξ−E(ξ )32ξ∞Eξ−E(ξ )22ξ∞Eξ22ξ3∞, we obtain
LEMMA 1.1. –
logEexp(λZ)−λ2 2
N i=1
EλEFi−1(Z)
Fi(Z)2
N i=1
λ3Bi3 3N3/2.
Remark 1.3. – The upper bound can easily be improved in the following way: notice that for any nonnegative γ, EλZi−1+γ Fi(Z)[Fi(Z)]0, because this expression is equal to zero when γ =0 and is nondecreasing with respect to γ. As moreover for any real random variable r →E[(ξ −r)3] is nonincreasing, we see that when E(ξ )0, E{[ξ−E(ξ )]3}E(ξ3)ξ3∞. This shows that indeed
logEexp(λZ)−λ2 2
N i=1
EλEFi−1(Z)
Fi(Z)2 N
i=1
λ3Bi3 6N3/2.
To proceed in the proof of Theorem 1.1, we have now to approximateEλZi−1[Fi(Z)2] byE[Fi(Z)2], in order to get the variance ofZ, that can be written as
N i=1
EFi(Z)2.
Let us put for shortVi=Fi(Z)2and let us introduce its martingale differences:
EλZi−1
Vi−E(Vi)=
i−1
j=1
EλZi−1
Fj(Vi).
To deal with the jth term of this sum, we introduce the conditional expectation with respect toGj:
EGλZji−1
Fj(Vi)=EGλGjj(Zi−1)
Fj(Vi)
=EGjFj(Vi)
=0i
+ λ 0
EGαGjj(Zi−1)
Fj(Vi)Gj(Zi−1)
−EGαGjj(Zi−1)Gj(Zi−1)dα. (1) As a consequence, lettingU=Fj(Vi)and W =(Gj(Zi−1)−EGαGjj(Zi−1)Gj(Zi−1))and applying the Cauchy–Schwartz inequality, we get
EGλZji−1
Fj(Vi) λ 0
EGαGjj(Zi−1)
U2EGαGjj(Zi−1)
W2dα.
Reminding that we are analysing the case whenj < i, we can now observe that Gj(Zi−1)=i EFi−1Gj(Z),
and therefore that its conditional range is upper bounded by ess supGj(Zi−1)|Gj
−ess infGj(Zi−1)|Gj
Bj
√N.
This implies that its variance is bounded by B
2 j
4N.
(Let us remind that by definition, for any real random variable W on a probability space(),F,P), and any sub sigma algebraG⊂F,
ess sup(W |G)def=supλ∈R: P(Wλ|G) >0. It can equivalently be defined by the relation
ess sup(W |G) > λ=P(W λ|G) >0, λ∈Q,
which shows that it belongs toL0(),G,P)– the set of measurable functions factorised by almost sure equality with respect to P. Here and below, inequalities involving conditional expectations and conditional essential suprema and infima are meant without further notice to hold almost surely.)
Let us consider now on some enlarged probability space two independent random variables Xi and Xj, such that (X1, . . . , XN, Xj, Xi) is distributed according to
N
k=1µk⊗µj⊗µi. We have Fj
Fi(Z)2=Fj
EFiif (XN1, Xi)2. Moreover for any functionh(XN1),
Fj(h(X)2)=EFjjh2(XN1, Xj)
=EFjh(X1N)+h(Xj1−1, Xj, XNj+1)jh(X1N, Xj) 2 ess suph(X)EFjjh(XN1, Xj).
Applying this toh(X)=EFi(if (XN1, Xi)), we get Fj
Fi(Z)22 Bi
√NEFjEFijif (XN1, Xi, Xj)2BiCi,j
N2−ζ . Therefore
EλZi−1Fj(Vi)λBiCi,jBj
N5/2−ζ .
This, combined with Lemma 1.1, ends the proof of Theorem 1.1. The derivation of its corollary is standard: it is obtained by taking
λ= ε
V(f )+ηε/V(f ), and by applying successively the theorem tof and−f. ✷
2. Extension to unbounded ranges
The boundedness assumption of the finite differences of f can be relaxed to exponential moment assumptions. To achieve this, we will suitably modify the bounds of the previous section, still assuming that f is bounded, and will afterwards let more general functions be approximated by bounded ones.
THEOREM 2.1. – Let λ be a positive real constant. Let Ni=1(Xi,Bi, µi) be some product of probability spaces. Letf ∈L2( Ni=1(Xi,Bi, µi))be such that
Eexpλf (X)<+∞,
whereX=(X1, . . . , XN)is the canonical process. Let(X1, . . . , XN, X1, . . . , XN)be the canonical process on( Ni=1(Xi,Bi))⊗2and letEbe the expectation with respect to the probability measure ( Ni=1µi)⊗2 defined on this enlarged space. Let us introduce the short notations
i=if (X1N, Xi), 1iN,
j,i=jif (XN1, Xi, Xj), 1j < iN, ,(r)=exp(r)−1−r−r2
2 = r
0
(r−u)2
2 exp(u) du, and let us decide by convention that 0×(+∞)= +∞. Then
logEexpλf (X)−λEf (X)−λ2
2 Vf (X)
5
N i=1
ess supEGi,λ|i| +√
2λ2 N i=1
i−1
j=1
ess supEGi(2i)1/2ess supEFj−1(2j,i)1/2
×ess supEGjλ|j|exp(2λ|j|)−11/2. Let us begin with a lemma.
LEMMA 2.1. – In the case whenf ∈L∞,
logEexp(λZ)−λ2 2
N i=1
EλZi−1
Fi(Z)2 5
N i=1
ess supEFi−1,λFi(Z). Proof. – We can notice that for any bounded real random variableξ
E[ξ−E(ξ )]3E|ξ|(ξ−E(ξ ))2+E(|ξ|)E(ξ−E(ξ ))2 E(|ξ|3)+2E(ξ2)E(|ξ|)+E(|ξ|)3+E(|ξ|)E(ξ2) 5E(|ξ|3).
(Indeed E(ξ2)E(|ξ|) E(|ξ|3), from the convexity of g:β → log[E(|ξ|β)], which implies thatg(1)−g(0)g(3)−g(2).)
Coming back to the proof of Lemma 1.1 we can write the following chain of inequalities:
λ o
(λ−γ )2
2 M3λZi−1+γFi(z)
Fi(Z)dγ
5 2 λ 0
(λ−γ )2EλZi−1+γ Fi(Z)
|Fi(Z)|3dγ
EλZi−1
5 2 λ
0
(λ−γ )2EFi−1expγ|Fi(Z)||Fi(Z)|3dγ
5
2ess supEFi−1 λ
0
(λ−γ )2|Fi(Z)|3expγ|Fi(Z)|dγ
.
(As in the case of Lemma 1.1 – see Remark 1.3 – the upper bound can be improved, noticing that for any real random variable E{[ξ−E(ξ )]3}E(ξ3)as soon asE(ξ ) 0.) ✷
Proof of Theorem 2.1. – Let us prove the theorem first whenf ∈L∞. In addition to those introduced in Theorem 2.1, we will need the following short notation:
j
i=if(X1, . . . , Xj−1, Xj, XjN+1), Xi, 1j < iN.
Let us come back to Eq. (1) and notice that
Fj(Vi)=EFjEFii+ji
EFi(j,i), Gj(Zi−1)=EFi−1(i),
and therefore, from the Cauchy–Schwartz inequality and the convexity ofr→r2, that Fj(Vi)EFji+ji
21/2
EFj2j,i1/2. We can thus bound the integrand ofEGGjj(Zi−1)in the last term of Eq. (1) by
Fj(Vi)Gj(Zi−1)−EGGjj(Zi−1)Gj(Zi−1)A×B, where
A=EFj2j,i1/2exp
−α
2Gj(Zi−1)
, B=exp
α
2Gj(Zi−1)EFji +ji
21/2Gj(Zi−1)+EGGjj(Zi−1)Gj(Zi−1). Applying the Cauchy–Schwartz inequality and noticing that for any real random variable ξ such thatE(ξ )0,
Eexp(αξ )expαE(ξ )1, we get
EλZi−1
Fj(Vi)EλZi−1
λ 0
EGαGjj(Zi−1)A21/2EGαGjj(Zi−1)B21/2dα
EλZi−1
λ 0
EGjexpαGj(Zi−1)A201/2EGαGjj(Zi−1)
B21/2dα
2EλZi−1
EFj−12j,i1/2
× λ 0
EGαGjj(Zi−1)
expαGj(Zi−1)Gj(Zi−1)2
+EGαGjj(Zi−1)Gj(Zi−1)21/2dα
×ess supEFj2i+ess supEFjj2i1/2,
where we have also used the fact that(a+b)22(a2+b2). To simplify the bound, we can now notice that for any random variableξsuch thatE(ξ )0,[E(ξ )]2E[exp(αξ )] E[ξ2exp(αξ )](because[E(ξ )]2[Eαξ(ξ )]2Eαξ(ξ2)which is the inequality we seek).
We can also notice that
ess supEFj2i=ess supEFjj2iess supEGi2i. These two remarks lead to
EλZi−1
Fj(Vi) 4EλZi−1
EFj−12j,i1/2
× λ 0
EGjexp2αGj(Zi−1)Gj(Zi−1)21/2dα
ess supEGi2i1/2. Remembering that Gj(Zi−1)=EFi−1(j), using the convexity ofr →r2exp(2αr)on the positive real axis and another round of the Cauchy–Schwartz inequality, we get
λ 0
EGjexp2αGj(Zi−1)Gj(Zi−1)21/2dα
λ 2
EGjEFi−1exp(2λ|j|)−1|j|1/2
λ 2
ess supEGjexp(2λ|j|)−1|j|1/2. Thus EλZi−1
Fj(Vi)
2√
2EλZi−1
EFj−12j,i1/2
×ess supEGjλ|j|exp(2λ|j|)−11/2ess supEGi2i1/2.
This combined with Lemma 2.1 proves the following slight improvement of Theorem 2.1 in the case whenf ∈L∞:
logEexpλf (X)−λEf (X)−λ2
2 Vf (X)
5
N i=1
ess supEFi−1,λ|i| +√
2λ2 N i=1
i−1
j=1
ess supEGi2i1/2EλZi−1
EFj−12j,i1/2
×ess supEGjλ|j|exp2λ|j|−11/2. (2) When f satisfies only the weaker assumptions of Theorem 2.1, we can introduce the bounded functionsfT(X)=min{max{f (X),−T}, T},T ∈N.
The terms of the left-hand side of Eq. (2) for fT converge from the dominated convergence theorem to their counterpart forf. Indeed
expλfT(X)max1,expλf (X)∈L1, fT(X)|f (X)| ∈L2.
In the right-hand side, we can use the bounds
ifT(X, Xi)if (X, Xi),
and apply the dominated convergence theorem to prove that, introducing the notations j,i(f )=jif (X, Xi, Xj),
Zi(f )=EFi[f (X)] −E[f (X)],
T→+∞lim EλZi−1(fT)
EFj−1j,i(fT)21/2=EλZi−1(f )
EFj−1[j,i(f )]21/2,
as soon as ess supEGi(2i) <+∞ (in the other case, Theorem 2.1 is trivial). Indeed EλZi−1(f )=EλEFi−1[f (X)],
expλEFi−1fT(X)max1,EFi−1expλf (X)∈L1 and, puttingji(f )=if (Xj, Xi), (so thatj,i(f )=i(f )−ji(f )),
EFj−1[j,i(fT)]22EFj−1i(fT)2+ji(fT)24 ess supEGii(f )2. This proves that inequality (2), and consequently Theorem 2.1 which is weaker, holds forf. ✷
3. Generalisation to Markov chains
We will study here the case when(X1, . . . , XN)is a Markov chain. The assumptions on f will be the same as in the first section. Therefore we will assume throughout this section that
ifx1N, yi
Bi
√N, i=1, . . . , N, and that
jifx1N, yi, yj
Ci,j
N3/2−ζ, 1j < iN.
When the random variables (X1, . . . , XN) are dependent, we have to modify the definition of the operators Gi(Z) and of the sigma algebras Gi. Indeed to generalise the first part of the proof we would like to have the identity
EFi(Gi(Z))=Fi(Z),
whereGi is “as small as possible”. This identity does not hold in the general case with the definition we had forGi, so we will have to change it.
To generalise the second part of the proof, we need to consider a new definition of the sigma algebraGj for which
EGjFj(W )=0, andZ−EGj(Z)is as small as possible.
We propose a solution here where the two objectsGi andGj are built with the help of coupled processes. This use of coupling was inspired by the works of Katalin Marton [11–13].
Let us remind first some general facts about maximal coupling: given some probability space (),S), and two probability measures µ and ν on (),S), a maximal coupling between µ and ν is a probability measure ρ on ()×),S⊗S) such thatρ(dω1)= µ(dω1), ρ(dω2)=ν(dω2)and ρ(ω1=ω2) is maximal (whereω1 andω2 are the first and second coordinates on)×)). In words, a maximal coupling is a joint distribution with prescribed marginals and a maximal weight given to the diagonal. The maximal weight of the diagonal is obviously (µ∧ν)()), whereµ∧ν is defined asd(µ∧ν)= min{d(µdµ+ν),d(µdν+ν)}d(µ+ν) and is necessarily the trace of any maximal coupling distribution on the diagonal; it is also related to the variation distance µ−νvar
def=
1
2|µ−ν|())betweenµandνby the formula(µ∧ν)())=1− µ−νvar. A maximal coupling betweenµ and ν is not unique (its off diagonal behaviour being arbitrary, as long as the marginal constraints are satisfied), however one possible explicit construction is the following: consider on the product space()×)× [0,1],S⊗2⊗B), whereBis the Borel sigma algebra, the product probability measureµ⊗ µ−ν−var1(ν−µ)+⊗U, where (ν−µ)+=ν−(ν∧µ) is the positive part of the signed measure ν−µ and whereU is the Lebesgue measure on[0,1]. Let(ω1, ω2, r) be the three coordinates of