Basic tools from empirical processes
Applications to statistics
Sébastien Loustau
LAREMA, Université d'Angers
Dynstoch meeting, June 2010, Angers
Non serious motivation
Angers, 16-19 June 2010
"The principal aim of the DYNSTOCH network is to make a major contribution to the theory of statistical inference for stochastic processes by taking advantage..."
... not so far from
"The principal aim of the DYNSTOCH network is to make a major contribution to the theory of stochastic processes for statistical inference by taking advantage..."
Empirical processes for statistical inference !
Serious motivation
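$Z_1, \dots, Z_n$ i.i.d. from $P$. Given $\mathcal{G}$, we aim at:
$$f^* = \arg\min_{f \in \mathcal{G}} \mathbb{E}_P\, l(f, Z) = \arg\min_{f \in \mathcal{G}} R(f),$$
where $l$ is some loss function.
We study the performance (consistency, rates of convergence) of
$$\hat f_n = \arg\min_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n l(f, Z_i) = \arg\min_{f \in \mathcal{G}} R_n(f),$$
in terms of the excess risk:
$$R(\hat f_n) - R(f^*).$$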
Where do empirical processes appear ?
$$R(\hat f_n) - R(f^*) \le R(\hat f_n) - R_n(\hat f_n) + R_n(f^*) - R(f^*) \le 2 \sup_{f \in \mathcal{G}} |R_n(f) - R(f)|.$$
$\Rightarrow$ Aim: to control uniformly the empirical process $R_n(f) - R(f)$ indexed by $\mathcal{G}$.
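As an illustration, the bound can be checked numerically on a toy problem. The sketch below makes its own assumptions, none of which come from the slides: $X$ uniform on $[0,1]$, labels $1\{X > 0.5\}$ flipped with probability 0.1, the 0-1 loss, and a finite class of threshold classifiers on a grid. The helper names (`true_risk`, `erm_check`) are hypothetical.

```python
import random

def true_risk(t, noise=0.1):
    # Exact 0-1 risk of f_t(x) = 1{x > t} when X ~ U[0,1] and
    # Y = 1{X > 0.5} with labels flipped independently with prob. `noise`.
    return noise + (1.0 - 2.0 * noise) * abs(t - 0.5)

def empirical_risk(t, sample):
    # Average 0-1 loss of the threshold classifier f_t on the sample.
    return sum((x > t) != y for x, y in sample) / len(sample)

def draw_sample(n, rng, noise=0.1):
    sample = []
    for _ in range(n):
        x = rng.random()
        y = x > 0.5
        if rng.random() < noise:
            y = not y
        sample.append((x, y))
    return sample

def erm_check(n=500, seed=0):
    rng = random.Random(seed)
    grid = [j / 20 for j in range(21)]          # finite class G (assumed)
    sample = draw_sample(n, rng)
    emp = {t: empirical_risk(t, sample) for t in grid}
    t_hat = min(grid, key=lambda t: emp[t])     # empirical risk minimizer
    t_star = min(grid, key=true_risk)           # best element of G
    excess = true_risk(t_hat) - true_risk(t_star)
    sup_dev = max(abs(emp[t] - true_risk(t)) for t in grid)
    return excess, sup_dev

excess, sup_dev = erm_check()
print("excess risk:", excess, " 2 * sup deviation:", 2 * sup_dev)
```

Whatever the sample, the excess risk printed here stays below twice the uniform deviation, which is exactly the inequality above.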
Examples
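- Density estimation:
$$\hat p_n = \arg\max_{p \in \mathcal{P}} \frac{1}{n} \sum_{i=1}^n \log p(Z_i).$$
- Regression:
$$\hat g_n = \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i))^2.$$
- Classification:
$$\hat f_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+.$$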
Outline
1. Uniform Law of Large Numbers
Entropy / Chaining / Symmetrization / Hoeffding's inequality / Consistency of M-estimators.
2. Increments of empirical processes
Finer results in a neighbourhood of a fixed function give rates of convergence of M-estimators.
3. Statistical learning theory
Examples of Rademacher complexities of RKHS balls and Besov balls give rates of convergence of SVM estimators.
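Notation, definitions
- $Z_1, \dots, Z_n$ i.i.d. with law $P$, and for $g$ measurable, $Pg := \mathbb{E}_P\, g(Z)$ and $P_n g := \frac{1}{n} \sum_{i=1}^n g(Z_i)$.
- $\mathcal{G}$ satisfies the ULLN if $\sup_{g \in \mathcal{G}} |P_n g - P g| \to 0$ almost surely.
- $N_p(\mathcal{G}, \delta, Q)$ := smallest number of balls of radius $\delta$ needed to cover $\mathcal{G}$:
$$\forall g \in \mathcal{G},\ \exists i \in \{1, \dots, N_p\}: \|g - g_i\|_{L_p(Q)} \le \delta.$$
$H_p(\mathcal{G}, \delta, Q) := \log N_p(\mathcal{G}, \delta, Q)$ is called the entropy of $\mathcal{G}$.
- $N_{p,B}(\mathcal{G}, \delta, Q)$ := smallest number of $\delta$-brackets needed to cover $\mathcal{G}$:
$$\forall g \in \mathcal{G},\ \exists i \in \{1, \dots, N_{p,B}\}: g_i^L \le g \le g_i^U \ \text{and}\ \|g_i^L - g_i^U\|_{L_p(Q)} \le \delta.$$
$H_{p,B}(\mathcal{G}, \delta, Q) := \log N_{p,B}(\mathcal{G}, \delta, Q)$ is called the entropy with bracketing of $\mathcal{G}$.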
Preliminary result
Lemma
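$H_{1,B}(\mathcal{G}, \delta, P) < \infty$ for all $\delta > 0$ $\Rightarrow$ $\mathcal{G}$ satisfies the ULLN.
Proof:
Let $g \in \mathcal{G}$. Then there exists a $\delta$-bracket $[g_j^L, g_j^U]$, $j \in \{1, \dots, N\}$, such that
$$(P_n - P) g \le P_n g_j^U - P g = (P_n - P) g_j^U + P(g_j^U - g) \le (P_n - P) g_j^U + \delta,$$
and similarly $(P_n - P) g \ge (P_n - P) g_j^L - \delta$.
Since the family of brackets $([g_j^L, g_j^U])_{j=1,\dots,N}$ is finite, the LLN gives, for $n$ large enough:
$$\max_{j=1,\dots,N} |(P_n - P) g_j^U| \le \delta \quad\text{and}\quad \max_{j=1,\dots,N} |(P_n - P) g_j^L| \le \delta \quad\text{a.s.}$$
Then
$$\sup_{g \in \mathcal{G}} |P_n g - P g| \le 2\delta \quad\text{a.s.}$$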
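The sandwich argument of the proof can be checked on a concrete class. The sketch below is only an illustration under assumed settings: $X$ uniform on $[0,1]$ and $\mathcal{G} = \{1\{x \le t\}: t \in [0,1]\}$, for which the brackets $[1\{x \le t_{j-1}\}, 1\{x \le t_j\}]$ on a grid of mesh at most $\delta$ are $\delta$-brackets in $L_1(P)$. The function names are hypothetical.

```python
import math
import random

def bracket_endpoints(delta):
    # Grid 0 = t_0 < ... < t_N = 1 with mesh <= delta. For X ~ U[0,1],
    # [1{x <= t_{j-1}}, 1{x <= t_j}] is a delta-bracket in L1(P).
    n_brackets = math.ceil(1.0 / delta)
    return [j / n_brackets for j in range(n_brackets + 1)]

def empirical_cdf(t, xs):
    # P_n g for g = 1{x <= t}.
    return sum(x <= t for x in xs) / len(xs)

def sandwich_check(n=2000, delta=0.1, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ts = bracket_endpoints(delta)
    # Deviation over the finitely many bracket endpoints (P g = t here).
    finite_dev = max(abs(empirical_cdf(t, xs) - t) for t in ts)
    # Deviation over the whole class: the supremum of |F_n(t) - t| is
    # attained at, or just before, a sample point.
    xs_sorted = sorted(xs)
    sup_dev = max(
        max(abs((i + 1) / n - x), abs(i / n - x))
        for i, x in enumerate(xs_sorted)
    )
    return sup_dev, finite_dev, delta

sup_dev, finite_dev, delta = sandwich_check()
print("sup over G:", sup_dev, " finite max + delta:", finite_dev + delta)
```

The uniform deviation over the infinite class never exceeds the maximal deviation over the bracket endpoints plus $\delta$, so only finitely many laws of large numbers are needed.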
Chaining method
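Let $\mathcal{G} \subset L_2(Q)$ with $\sup_{g \in \mathcal{G}} \|g\|_Q \le R$.
Idea: approximate $g \in \mathcal{G}$ by a finite class called a chain.
Consider $(g_j^s)_{j=1,\dots,N_s}$, a $2^{-s}R$-covering set of $\mathcal{G}$, for $s = 1, \dots, S$:
$$\forall g \in \mathcal{G},\ \exists g^s \in \{g_1^s, \dots, g_{N_s}^s\}: \|g - g^s\|_Q \le 2^{-s} R.$$
Then, with the convention $g^0 := 0$, we can write:
$$g = g - g^S + \sum_{s=1}^S (g^s - g^{s-1}),$$
where
- $S$ is taken large enough to make $\|g - g^S\|_Q$ small enough;
- $\sum_{s=1}^S (g^s - g^{s-1})$ only involves a finite number of functions.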
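A short worked step, added as a sketch, shows why the chain is useful: consecutive links are close in norm, so each term of the telescoping sum is small. It assumes $g^s$ is chosen within $2^{-s}R$ of $g$ at every level and uses the convention $g^0 = 0$ (which is a valid choice since $\|g\|_Q \le R$).

```latex
\|g^s - g^{s-1}\|_Q
  \le \|g^s - g\|_Q + \|g - g^{s-1}\|_Q
  \le 2^{-s} R + 2^{-(s-1)} R
  = 3 \cdot 2^{-s} R ,
  \qquad s = 1, \dots, S .
```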
Application: intermediate result
Lemma
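Let $(\xi_1, \dots, \xi_n)$ be fixed. Suppose $W_i$, $i = 1, \dots, n$, are such that, for all $\gamma \in \mathbb{R}^n$ and $a > 0$:
$$P\Big(\Big|\sum_{i=1}^n W_i \gamma_i\Big| \ge a\Big) \le C_1 \exp\Big(-\frac{a^2}{C_2 \sum_{i=1}^n \gamma_i^2}\Big).$$
Assume $\sup_{g \in \mathcal{G}} \|g\|_{Q_n} \le R$, where $Q_n = \frac{1}{n} \sum_{i=1}^n \delta_{\xi_i}$. Then for
$$\sqrt{n}\,\delta \ge C\Big(\int_{\delta/(8K)}^{R} \sqrt{H_2(\mathcal{G}, u, Q_n)}\, du \vee R\Big),$$
we have:
$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n W_i g(\xi_i)\Big| \ge \delta \ \wedge\ \frac{1}{n} \sum_{i=1}^n W_i^2 \le K^2\Big) \le C \exp\Big(-\frac{n\delta^2}{C^2 R^2}\Big).$$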
Proof using chaining
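Consider $(g_j^s)_{j=1,\dots,N_s}$, a $2^{-s}R$-covering set of $\mathcal{G}$ w.r.t. $L_2(Q_n)$, for $s = 1, \dots, S$. Rewrite, for $g \in \mathcal{G}$:
$$\frac{1}{n} \sum_{i=1}^n W_i g(\xi_i) = \frac{1}{n} \sum_{i=1}^n W_i \big(g(\xi_i) - g^S(\xi_i)\big) + \frac{1}{n} \sum_{i=1}^n W_i g^S(\xi_i).$$
Choosing $S = \min\{s \ge 1: 2^{-s} R \le \frac{\delta}{2K}\}$ ensures (with Cauchy-Schwarz), on the event $\{\frac{1}{n} \sum_{i=1}^n W_i^2 \le K^2\}$:
$$\frac{1}{n} \sum_{i=1}^n W_i \big(g(\xi_i) - g^S(\xi_i)\big) \le K \|g - g^S\|_{Q_n} \le \frac{\delta}{2}.$$
It remains to control the probability:
$$P\Big(\max_{j=1,\dots,N_S} \Big|\frac{1}{n} \sum_{i=1}^n W_i g_j^S(\xi_i)\Big| \ge \frac{\delta}{2}\Big).$$
Write $g^S = \sum_{s=1}^S (g^s - g^{s-1})$ and, for $\sum_s \eta_s \le 1$:
$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{s=1}^S \sum_{i=1}^n W_i (g^s - g^{s-1})(\xi_i)\Big| \ge \frac{\delta}{2}\Big)
\le \sum_{s=1}^S P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n W_i (g^s - g^{s-1})(\xi_i)\Big| \ge \frac{\delta \eta_s}{2}\Big)$$
$$\le \sum_{s=1}^S C_1 \exp\big(H_2(\mathcal{G}, 2^{-s}R, Q_n)\big) \exp\Big(-\frac{n \delta^2 \eta_s^2}{C_2\, 2^{-2s} R^2}\Big).$$
With a good choice of $(\eta_s)$, we get the result.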
Symmetrization
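Goal: replace the study of an empirical process by the study of a symmetrized version.
Idea: consider a ghost sample $(Z_i')_{i=1}^n$, i.i.d. with the same law as $Z$ and independent of $(Z_i)_{i=1}^n$. Then:
$$\varepsilon_i \big(f(Z_i) - f(Z_i')\big) \sim f(Z_i) - f(Z_i'),$$
where $(\varepsilon_i)_{i=1}^n$ are i.i.d. Rademacher variables.
Then we have, in expectation:
$$\mathbb{E} \sup_{g \in \mathcal{G}} |(P_n - P) g| \le 2\, \mathbb{E} \sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i g(Z_i)\Big|,$$
or in probability:
$$P\Big(\sup_{g \in \mathcal{G}} |(P_n - P) g| \ge \delta\Big) \le 4 P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big).$$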
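Hoeffding's inequality
Theorem
Consider $Z_1, \dots, Z_n$ centered independent r.v. such that $a_i \le Z_i \le b_i$, $i = 1, \dots, n$. Then $\forall a > 0$:
$$P\Big(\sum_{i=1}^n Z_i \ge a\Big) \le \exp\Big(-\frac{2a^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$
Hence we get, for $\varepsilon_i$ i.i.d. Rademacher:
$$P\Big(\Big|\sum_{i=1}^n \varepsilon_i \gamma_i\Big| \ge a\Big) \le 2 \exp\Big(-\frac{a^2}{2 \sum_{i=1}^n \gamma_i^2}\Big).$$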
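The passage from the theorem to the Rademacher case is a one-line computation, spelled out here as a worked step. It applies the theorem to $Z_i = \varepsilon_i \gamma_i$, which is centered and takes values in $[-|\gamma_i|, |\gamma_i|]$.

```latex
% Z_i = \varepsilon_i \gamma_i, so a_i = -|\gamma_i| and b_i = |\gamma_i|.
\sum_{i=1}^n (b_i - a_i)^2 = 4 \sum_{i=1}^n \gamma_i^2
\quad\Longrightarrow\quad
P\Big(\sum_{i=1}^n \varepsilon_i \gamma_i \ge a\Big)
  \le \exp\Big(-\frac{2a^2}{4 \sum_{i=1}^n \gamma_i^2}\Big)
  = \exp\Big(-\frac{a^2}{2 \sum_{i=1}^n \gamma_i^2}\Big).
% The same bound holds for -\sum_i \varepsilon_i \gamma_i (by symmetry),
% and a union bound over the two one-sided events gives the factor 2.
```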
ULLN under entropy condition
Theorem
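Assume $\sup_{g \in \mathcal{G}} \|g\|_\infty < R$ and:
$$\frac{1}{n} H_2(\mathcal{G}, \delta, P_n) \xrightarrow{P} 0, \quad \forall \delta > 0.$$
Then $\mathcal{G}$ satisfies the ULLN.
Proof:
Apply symmetrization:
$$P\Big(\sup_{g \in \mathcal{G}} |(P_n - P) g| \ge \delta\Big) \le 4 P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big).$$
Hoeffding's inequality:
$$P\Big(\Big|\sum_{i=1}^n \varepsilon_i \gamma_i\Big| \ge a\Big) \le 2 \exp\Big(-\frac{a^2}{2 \sum_{i=1}^n \gamma_i^2}\Big).$$
Apply the intermediate result conditionally on $Z_1, \dots, Z_n$ (with $W_i = \varepsilon_i$, so that $K = 1$) on the set
$$A_n = \Big\{\sqrt{n}\,\delta \ge C\Big(R \sqrt{H_2(\mathcal{G}, \tfrac{\delta}{32}, P_n)} \vee R\Big)\Big\};$$
we get:
$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big) \le C \exp\Big(-\frac{n\delta^2}{C^2 R^2}\Big) + P(A_n^c) \xrightarrow[n \to \infty]{} 0.$$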
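Application to the MLE
Observe $X_1, \dots, X_n$ i.i.d. with density $p_0 \in \mathcal{P}$ with respect to a $\sigma$-finite measure $\mu$. Suppose there exists:
$$\hat p_n = \arg\max_{p \in \mathcal{P}} P_n(\log p).$$
We want the a.s. convergence of the Hellinger distance:
$$h(\hat p_n, p_0) = \sqrt{\frac{1}{2} \int \big(\sqrt{\hat p_n} - \sqrt{p_0}\big)^2 d\mu}.$$
For convex $\mathcal{P}$, we have:
$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{2 \hat p_n}{\hat p_n + p_0}\Big).$$
Then it remains to get the ULLN for the class $\mathcal{G} = \big\{\frac{2p}{p + p_0},\ p \in \mathcal{P}\big\}$.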
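Application to the MLE (2)
Examples:
- The class of monotone Lebesgue densities:
$$\mathcal{P} = \{p \text{ is a decreasing density on } [0,1]\}.$$
Here use the convexity of $\mathcal{P}$.
- The class of smooth Lebesgue densities:
$$\mathcal{P} = \Big\{p: [0,1] \to \mathbb{R}_+ : \int p\, d\mu = 1,\ \int_0^1 \big(p^{(m)}(x)\big)^2 dx \le M^2\Big\}.$$
Here $H_{1,B}(\mathcal{P}, \delta, \mu) \le A \delta^{-1/m}$.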
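Increments of empirical processes
Where do empirical processes appear? (a finer bound)
$$R(\hat f_n) - R(f^*) \le R(\hat f_n) - R_n(\hat f_n) + R_n(f^*) - R(f^*) \le \sup_{f \in \mathcal{G}(\delta)} |(R_n - R)(f - f^*)|.$$
$\Rightarrow$ We study the behaviour of $\nu_n(g - g_0) = \sqrt{n}\,(P_n - P)(g - g_0)$ on $\mathcal{G}(\delta) = \{g \in \mathcal{G}: \|g - g_0\|_P \le \delta\}$.
Agenda:
- Bernstein's inequality
- Uniform Bernstein over $\mathcal{G}$
- Application to $\mathcal{G}(\delta)$ for $\delta \to 0$.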
Bernstein's inequality
Theorem
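Let $\|g\|_\infty \le K$ and $\|g\|_P \le R$. Then:
$$P(\nu_n(g) \ge a) \le \exp\Big(-\frac{a^2}{2\big(\frac{aK}{\sqrt{n}} + R^2\big)}\Big).$$
It gives subgaussian or subexponential tails, and we are interested in the subgaussian regime:
$$P(\nu_n(g) \ge a) \le \exp\Big(-\frac{a^2}{4R^2}\Big)$$
holds for $a \le \frac{\sqrt{n} R^2}{K}$, in particular when $R \to 0$.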
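The subgaussian regime follows from the general bound by comparing the two terms of the denominator. A worked step, using only the quantities $a$, $K$, $R$, $n$ of the theorem:

```latex
a \le \frac{\sqrt{n}\, R^2}{K}
\;\Longrightarrow\; \frac{aK}{\sqrt{n}} \le R^2
\;\Longrightarrow\; 2\Big(\frac{aK}{\sqrt{n}} + R^2\Big) \le 4R^2
\;\Longrightarrow\;
\exp\Big(-\frac{a^2}{2\big(\frac{aK}{\sqrt{n}} + R^2\big)}\Big)
  \le \exp\Big(-\frac{a^2}{4R^2}\Big).
```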
Uniform Bernstein under entropy condition
Theorem
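Let $\mathcal{G}$ be such that $\sup_{g \in \mathcal{G}} \|g\|_\infty \le K$, $\sup_{g \in \mathcal{G}} \|g\|_P \le R$ and $\int_0^1 \sqrt{H_{2,B}(\mathcal{G}, u, P)}\, du < \infty$. Then, for
$$C_0 \Big(\int_0^R \sqrt{H_{2,B}(\mathcal{G}, u, P)}\, du \vee R\Big) \le a \le C_1 \frac{\sqrt{n} R^2}{K},$$
we have
$$P\Big(\sup_{g \in \mathcal{G}} |\nu_n(g)| \ge a\Big) \le \exp\Big(-\frac{a^2}{C^2 (C_1 + 1) R^2}\Big).$$
Proof: chaining, similar to the intermediate result.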
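Application to $\mathcal{G}(\delta_n)$ when $\delta_n \to 0$
Theorem
Suppose $H_{2,B}(\mathcal{G}, \delta, P) \le A \delta^{-\alpha}$, $0 < \alpha < 2$. Then for $\delta \ge n^{-\frac{1}{2+\alpha}}$, we have
$$P\Big(\sup_{g \in \mathcal{G}(\delta)} |\nu_n(g) - \nu_n(g_0)| \ge c\, \delta^{1 - \frac{\alpha}{2}}\Big) \le C \exp\Big(-\frac{C' \delta^{-\alpha}}{C''}\Big).$$
Proof:
- $\delta \to 0$ not too fast, to have subgaussian tails.
- $a \sim \int_0^\delta \sqrt{H_{2,B}(\mathcal{G}, u, P)}\, du \sim \delta^{1 - \frac{\alpha}{2}}$ to ensure uniform Bernstein.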
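Both points of the proof reduce to short computations, sketched here with constants ignored (the symbols $\lesssim$, $\gtrsim$ hide factors depending only on $A$, $\alpha$, $K$). The first line evaluates the entropy integral under the assumed bound $A u^{-\alpha}$; the second checks that $a \sim \delta^{1-\alpha/2}$ stays in the subgaussian range $a \lesssim \sqrt{n} R^2$ with $R = \delta$.

```latex
\int_0^\delta \sqrt{A u^{-\alpha}}\, du
  = \sqrt{A} \int_0^\delta u^{-\alpha/2}\, du
  = \frac{\sqrt{A}}{1 - \alpha/2}\, \delta^{1 - \alpha/2}
  \qquad (0 < \alpha < 2),
\\[4pt]
\delta^{1 - \alpha/2} \lesssim \sqrt{n}\, \delta^2
  \iff \delta^{-(1 + \alpha/2)} \lesssim \sqrt{n}
  \iff \delta \gtrsim n^{-\frac{1}{2 + \alpha}} .
```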
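Application to $\mathcal{G}(\delta_n)$ when $\delta_n \to 0$
Consequence:
$$\text{(i)}\quad P\Big(\sup_{g \in \mathcal{G}(n^{-\frac{1}{2+\alpha}})} |\nu_n(g) - \nu_n(g_0)| \ge T n^{-\frac{2-\alpha}{2(2+\alpha)}}\Big) \le \exp\Big(-\frac{T n^{\frac{\alpha}{2+\alpha}}}{C}\Big).$$
$$\text{(ii)}\quad P\Big(\sup_{g \in \mathcal{G}^c(n^{-\frac{1}{2+\alpha}})} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1 - \frac{\alpha}{2}}} \ge T\Big) \le \exp\Big(-\frac{T}{C}\Big).$$
Here the threshold in (i) is $\delta^{1-\frac{\alpha}{2}}$ evaluated at $\delta = n^{-\frac{1}{2+\alpha}}$, and $\mathcal{G}^c(\delta)$ denotes the complement of $\mathcal{G}(\delta)$ in $\mathcal{G}$.
$\Rightarrow$ We arrive at:
$$\sup_{g \in \mathcal{G}} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1 - \frac{\alpha}{2}} \vee n^{-\frac{2-\alpha}{2(2+\alpha)}}} = O_P(1).$$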
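Proof of (ii): Peeling
Goal: to study the tails of $\frac{\nu_n(g)}{w(g)}$ from the tails of $\nu_n(g)$.
Idea: consider a decreasing sequence $(m_s)_{s=1}^S$ and peel $\mathcal{G}$ as follows:
$$\mathcal{G} \subseteq \bigcup_{s=1}^S \mathcal{G}_s, \quad\text{where } \mathcal{G}_s = \{g \in \mathcal{G}: m_s \le w(g) \le m_{s-1}\}.$$
Then write:
$$P\Big(\sup_{g \in \mathcal{G}} \frac{\nu_n(g)}{w(g)} \ge a\Big) \le \sum_{s=1}^S P\Big(\sup_{g \in \mathcal{G}_s} \frac{\nu_n(g)}{w(g)} \ge a\Big) \le \sum_{s=1}^S P\Big(\sup_{g:\, w(g) \le m_{s-1}} \nu_n(g) \ge a\, m_s\Big).$$
Here, to get (ii), $w(g) = \|g - g_0\|$ and $m_s = 2^{-s}$.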
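The second inequality of the peeling bound uses both sides of the definition of each slice. A worked step, assuming $a > 0$ and $w > 0$:

```latex
% Lower bound w(g) >= m_s on the slice: the ratio event implies an
% event on the unweighted process.
g \in \mathcal{G}_s,\ \frac{\nu_n(g)}{w(g)} \ge a
  \;\Longrightarrow\; \nu_n(g) \ge a\, w(g) \ge a\, m_s ,
\\[4pt]
% Upper bound w(g) <= m_{s-1}: the slice sits inside a small ball.
\mathcal{G}_s \subseteq \{ g : w(g) \le m_{s-1} \} .
% With m_s = 2^{-s} we have m_{s-1} = 2 m_s, so radius and threshold
% differ only by a factor 2 on every slice.
```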
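Application to get rates of convergence of the MLE
Observe $X_1, \dots, X_n$ i.i.d. with density $p_0 \in \mathcal{P}$. We want to upper bound the quantity:
$$h(\hat p_n, p_0) = \sqrt{\frac{1}{2} \int \big(\sqrt{\hat p_n} - \sqrt{p_0}\big)^2 d\mu}.$$
Using for instance:
$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{\sqrt{\hat p_n}}{\sqrt{p_0}}\Big).$$
If $H_{2,B}\big(\mathcal{P}^{1/2} / p_0^{1/2}, \delta, P\big) \le \delta^{-\alpha}$, we have:
$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{\sqrt{\hat p_n}}{\sqrt{p_0}}\Big) \le O_P(n^{-\frac{1}{2}})\, h^{1 - \frac{\alpha}{2}}(\hat p_n, p_0) \vee O_P(n^{-\frac{2}{2+\alpha}}),$$
which gives
$$h(\hat p_n, p_0) = O_P(n^{-\frac{1}{2+\alpha}}).$$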
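The last step amounts to solving the inequality for $h = h(\hat p_n, p_0)$. A heuristic sketch, in which the $O_P$ factors are replaced by a fixed constant $C$ on an event of large probability (an assumption made only for this illustration):

```latex
% Case 1: the first term of the maximum dominates.
h^2 \le C n^{-1/2} h^{1 - \alpha/2}
  \iff h^{1 + \alpha/2} \le C n^{-1/2}
  \iff h \le C^{\frac{2}{2+\alpha}}\, n^{-\frac{1}{2+\alpha}} ,
\\[4pt]
% Case 2: the second term dominates.
h^2 \le C n^{-\frac{2}{2+\alpha}}
  \iff h \le C^{1/2}\, n^{-\frac{1}{2+\alpha}} .
```

Both cases give the same order $n^{-\frac{1}{2+\alpha}}$.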
How to choose F ?
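A penalized M-estimator for classification
Consider the SVM minimization:
$$\min_{f \in \mathcal{H}} \Big[\frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+ + \alpha_n \|f\|_{\mathcal{H}}^2\Big],$$
where:
- $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \dots, n$, are i.i.d.,
- $\mathcal{H}$ is a given functional space,
- $\alpha_n$ is a smoothing parameter to determine.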
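To make the criterion concrete, here is a minimal sketch of its minimization by subgradient descent. It assumes the simplest possible $\mathcal{H}$, linear functions $f(x) = \langle w, x \rangle$ with $\|f\|_{\mathcal{H}} = \|w\|$, which is a choice made for illustration only; the data generator, step sizes and function names are hypothetical.

```python
import random

def hinge(z):
    return max(0.0, 1.0 - z)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def objective(w, data, alpha):
    # (1/n) sum_i (1 - y_i <w, x_i>)_+ + alpha * ||w||^2
    loss = sum(hinge(y * dot(w, x)) for x, y in data) / len(data)
    return loss + alpha * dot(w, w)

def svm_subgradient(data, alpha, steps=200, step0=0.5):
    d = len(data[0][0])
    n = len(data)
    w = [0.0] * d
    for t in range(1, steps + 1):
        grad = [2.0 * alpha * wj for wj in w]   # gradient of the penalty
        for x, y in data:
            if y * dot(w, x) < 1.0:             # hinge loss is active
                for j in range(d):
                    grad[j] -= y * x[j] / n
        eta = step0 / t ** 0.5                  # decreasing step size
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

def toy_data(n, rng):
    # Two well-separated clouds in R^2 centred at (+2, 0) and (-2, 0).
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        x = (2.0 * y + rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
        data.append((x, y))
    return data

rng = random.Random(0)
data = toy_data(200, rng)
alpha_n = 0.01                                  # smoothing parameter (assumed)
w_hat = svm_subgradient(data, alpha_n)
print("w_hat:", w_hat, " objective:", objective(w_hat, data, alpha_n))
```

A larger `alpha_n` shrinks `w_hat` towards zero and increases the hinge term, which is the trade-off the choice of $\alpha_n$ has to balance.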
Rates of convergence
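- Estimation error:
$$R(\hat f_n, f^*) \le C \inf_{f \in \mathcal{H}} \big(R(f, f^*) + \alpha_n \|f\|_{\mathcal{H}}^2\big) + \delta_n(\alpha_n).$$
- Approximation error:
$$a(\alpha_n) = \inf_{f \in \mathcal{H}} \big(R(f, f^*) + \alpha_n \|f\|_{\mathcal{H}}^2\big).$$
Then you have:
$$R(\hat f_n, f^*) \le C a(\alpha_n) + \delta_n(\alpha_n) \underset{\alpha_n \sim n^{*}}{\sim} n^{-?}$$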
Local Rademacher
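We are interested in
$$\mathbb{E} \sup_{f \in B_{\mathcal{H}}(R):\ \mathbb{E} f(X)^2 \le \delta} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i)\Big| := \mathbb{E}\, \mathrm{Rad}_n(R, \delta),$$
where the $\varepsilon_i$ are i.i.d. with $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = \frac{1}{2}$.
- In the RKHS situation, we have:
$$\mathbb{E}\, \mathrm{Rad}_n(R, \delta) \le \frac{1}{\sqrt{n}} \inf_{d \in \mathbb{N}} \Big(\sqrt{d \delta} + R \sqrt{\sum_{j > d} \lambda_j}\Big),$$
where $(\lambda_j)$ is the eigenspectrum of the integral operator $L_K: f \mapsto \int f(x) K(x, \cdot)\, dx$.
- What happens if $\mathcal{H} = B^s_{pq}(\mathbb{R}^d)$?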
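The RKHS bound is easy to evaluate once the eigenvalues are known. The sketch below is a hedged illustration: it works with a finite, truncated spectrum, lets $d$ range over $0, \dots, $ the number of eigenvalues supplied, and uses a hypothetical polynomial decay $\lambda_j = j^{-2}$ that does not come from the slides.

```python
import math

def rkhs_local_rademacher_bound(eigenvalues, R, delta, n):
    # Evaluates (1/sqrt(n)) * min_d ( sqrt(d*delta) + R*sqrt(sum_{j>d} lambda_j) )
    # over d = 0, ..., len(eigenvalues), for a finite (truncated) spectrum.
    lam = sorted(eigenvalues, reverse=True)
    tail = sum(lam)                  # sum_{j > 0} lambda_j
    best = R * math.sqrt(tail)       # value of the bracket at d = 0
    for d in range(1, len(lam) + 1):
        tail -= lam[d - 1]           # now tail = sum_{j > d} lambda_j
        value = math.sqrt(d * delta) + R * math.sqrt(max(tail, 0.0))
        best = min(best, value)
    return best / math.sqrt(n)

# Hypothetical spectrum lambda_j = j^{-2}, truncated at 1000 terms.
spectrum = [j ** -2.0 for j in range(1, 1001)]
for n in (100, 10_000):
    print(n, rkhs_local_rademacher_bound(spectrum, R=1.0, delta=0.01, n=n))
```

The minimizing $d$ balances the "dimension" term $\sqrt{d\delta}$ against the spectral tail, so faster eigenvalue decay gives a smaller bound for the same $\delta$.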
Besov case
Theorem
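Suppose $X$ admits a bounded density $\rho$ with compact support. Then if $s > \frac{d}{p}$ and $1 \le p \le 2$,
$$\forall \delta > 0, \quad \mathbb{E} \sup_{f \in B(R):\ \mathbb{E} f(X)^2 \le \delta} \Big|\frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i)\Big| \le \frac{c}{\sqrt{n}}\, R^{\frac{d}{2u}}\, \delta^{\frac{s - d/p}{2u}},$$
where $u = s + d\big(\frac{1}{2} - \frac{1}{p}\big)$.
Consequence: if $f^* \in B^r_{pq}$,
$$R(\hat g_n) - R(f^*) = O_P\Big(n^{-\frac{r}{2s - r} \cdot \frac{2u}{2u + d}}\Big),$$
where we choose $\alpha_n \sim n^{-\frac{2u}{2u + d}}$.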
References
- S. van de Geer, Empirical Processes in M-Estimation, 2000.
- J. Wellner and A. van der Vaart, Weak Convergence and Empirical Processes, 2000.
- S. Mendelson, Local Rademacher Complexities, 2005.
- P. Massart, Some Applications of Concentration Inequalities to Statistics, 2000.
- S. Loustau, Penalized ERM over Besov Spaces, 2009.