Basic tools from empirical processes: Applications to statistics


Sébastien Loustau

LAREMA, Université d'Angers

Dynstoch meeting, June 2010, Angers

Non-serious motivation

Angers, 16-19 June 2010

"The principal aim of the DYNSTOCH network is to make a major contribution to the theory of statistical inference for stochastic processes by taking advantage..."

... not so far from

"The principal aim of the DYNSTOCH network is to make a major contribution to the theory of stochastic processes for statistical inference by taking advantage..."

Empirical processes for statistical inference!

Serious motivation

$Z_1, \dots, Z_n$ i.i.d. from $P$. Given a class $\mathcal{G}$, we aim at:

$$f^* = \arg\min_{f \in \mathcal{G}} E_P\, l(f, Z) = \arg\min_{f \in \mathcal{G}} R(f),$$

where $l$ is some loss function.

We study the performance (consistency, rates of convergence) of

$$\hat f_n = \arg\min_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n l(f, Z_i) = \arg\min_{f \in \mathcal{G}} R_n(f),$$

in terms of the excess risk:

$$R(\hat f_n) - R(f^*).$$

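Where do empirical processes appear?

Since $R_n(\hat f_n) \le R_n(f^*)$ by definition of $\hat f_n$:

$$R(\hat f_n) - R(f^*) \le R(\hat f_n) - R_n(\hat f_n) + R_n(f^*) - R(f^*) \le 2 \sup_{f \in \mathcal{G}} |R_n(f) - R(f)|.$$

⇒ Aim: to control uniformly the empirical process $R_n(f) - R(f)$ indexed by $\mathcal{G}$.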
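As a small numerical illustration of this bound, the sketch below runs empirical risk minimization over a finite class and compares the excess risk with twice the uniform deviation. Everything in it is an assumption made for illustration: the data are Uniform(0, 1), the class is a grid of constants $t$, and the loss is $l(t, z) = (z - t)^2$, so that $R(t) = 1/12 + (t - 1/2)^2$ is known in closed form.

```python
import random

def true_risk(t):
    # R(t) = E (Z - t)^2 for Z ~ Uniform(0, 1): variance 1/12 plus squared bias.
    return 1.0 / 12.0 + (t - 0.5) ** 2

def empirical_risk(t, sample):
    # R_n(t) = (1/n) sum_i (Z_i - t)^2
    return sum((z - t) ** 2 for z in sample) / len(sample)

def erm_experiment(n=200, grid_size=21, seed=0):
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    # Finite class G: constants on a regular grid of [0, 1] (illustrative choice).
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    f_star = min(grid, key=true_risk)                            # minimizer of R over G
    f_hat = min(grid, key=lambda t: empirical_risk(t, sample))   # minimizer of R_n over G
    excess = true_risk(f_hat) - true_risk(f_star)
    sup_dev = max(abs(empirical_risk(t, sample) - true_risk(t)) for t in grid)
    return excess, sup_dev

excess, sup_dev = erm_experiment()
print("excess risk:", excess, " 2 * sup deviation:", 2 * sup_dev)
```

The inequality holds for every sample, not just on average, because it only uses the definitions of $\hat f_n$ and $f^*$.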

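Examples

- Density estimation:

$$\hat p_n = \arg\max_{p \in \mathcal{P}} \frac{1}{n} \sum_{i=1}^n \log p(Z_i).$$

- Regression:

$$\hat g_n = \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i))^2.$$

- Classification:

$$\hat f_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+.$$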

Outline

1. Uniform Law of Large Numbers

Entropy / Chaining / Symmetrization / Hoeffding's inequality / Consistency of M-estimators.

2. Increments of empirical processes

Finer results in a neighbourhood of a fixed function give rates of convergence of M-estimators.

3. Statistical learning theory

Rademacher complexities of RKHS balls and Besov balls give rates of convergence of SVM estimators.

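Notations, definitions

- $Z_1, \dots, Z_n$ i.i.d. with law $P$, and for $g$ measurable, $Pg := E_P\, g(Z)$ and $P_n g := \frac{1}{n} \sum_{i=1}^n g(Z_i)$.

- $\mathcal{G}$ satisfies the ULLN if $\sup_{g \in \mathcal{G}} |P_n g - Pg| \to 0$ almost surely.

- $N_p(\mathcal{G}, \delta, Q)$ := smallest number of balls of radius $\delta$ needed to cover $\mathcal{G}$:

$$\forall g \in \mathcal{G},\ \exists i \in \{1, \dots, N_p\}: \|g - g_i\|_{L_p(Q)} \le \delta.$$

$H_p(\mathcal{G}, \delta, Q) := \log N_p(\mathcal{G}, \delta, Q)$ is called the entropy of $\mathcal{G}$.

- $N_p^B(\mathcal{G}, \delta, Q)$ := smallest number of $\delta$-brackets needed to cover $\mathcal{G}$:

$$\forall g \in \mathcal{G},\ \exists i \in \{1, \dots, N_p^B\}: g_i^L \le g \le g_i^U \ \text{and}\ \|g_i^L - g_i^U\|_{L_p(Q)} \le \delta.$$

$H_p^B(\mathcal{G}, \delta, Q) := \log N_p^B(\mathcal{G}, \delta, Q)$ is called the entropy with bracketing of $\mathcal{G}$.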
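To make the covering number concrete, here is a minimal sketch that builds a $\delta$-net greedily for a finite class in the empirical $L_2(Q_n)$ norm. The class (indicators of half-lines evaluated at 50 fixed design points) and the helper names are assumptions chosen for illustration. The greedy net takes its centers inside the class and only gives an upper bound on $N_2(\mathcal{G}, \delta, Q_n)$, not the exact minimum.

```python
import math

def empirical_l2_distance(u, v):
    # ||u - v||_{L2(Q_n)} for functions given by their values at the n design points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

def greedy_cover(vectors, delta):
    # Keep a function as a new center only if no existing center is within delta.
    # Every function then lies within delta of some center, so len(centers)
    # is an upper bound on the covering number at radius delta.
    centers = []
    for v in vectors:
        if all(empirical_l2_distance(v, c) > delta for c in centers):
            centers.append(v)
    return centers

# Hypothetical class: g_t(x) = 1{x <= t}, t on a grid, at 50 fixed points.
points = [(i + 0.5) / 50 for i in range(50)]
thresholds = [k / 100 for k in range(101)]
vectors = [[1.0 if x <= t else 0.0 for x in points] for t in thresholds]

for delta in (0.5, 0.25, 0.1):
    cover = greedy_cover(vectors, delta)
    print("delta =", delta, " net size =", len(cover), " log size =", math.log(len(cover)))
```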

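Preliminary result

Lemma

$H_1^B(\mathcal{G}, \delta, P) < \infty,\ \forall \delta > 0 \ \Rightarrow\ \mathcal{G}$ satisfies the ULLN.

Proof:

Let $g \in \mathcal{G}$. Then there exists a $\delta$-bracket $[g_j^L, g_j^U]$, $j \in \{1, \dots, N\}$, containing $g$, and:

$$(P_n - P)g \le P_n g_j^U - Pg = (P_n - P)g_j^U + P(g_j^U - g) \le (P_n - P)g_j^U + \delta,$$

and similarly $(P_n - P)g \ge (P_n - P)g_j^L - \delta$.

Since the family $([g_j^L, g_j^U])_{j=1,\dots,N}$ is finite, the LLN gives, for $n$ large enough:

$$\max_{j=1,\dots,N} |(P_n - P)g_j^U| \le \delta \quad\text{and}\quad \max_{j=1,\dots,N} |(P_n - P)g_j^L| \le \delta \quad\text{a.s.}$$

Then

$$\sup_{g \in \mathcal{G}} |P_n g - Pg| \le 2\delta \quad\text{a.s.}$$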

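Chaining method

Let $\mathcal{G} \subset L^2(Q)$ with $\sup_{\mathcal{G}} \|g\|_Q \le R$.

Idea: approximate $g \in \mathcal{G}$ by a finite class called a chain.

Consider $(g_j^s)_{j=1,\dots,N_s}$ a $2^{-s}R$-covering set of $\mathcal{G}$, for $s = 1, \dots, S$:

$$\forall g \in \mathcal{G},\ \exists g^s \in \{g_1^s, \dots, g_{N_s}^s\}: \|g - g^s\|_Q \le 2^{-s}R.$$

Then we can write, with the convention $g^0 = 0$:

$$g = g - g^S + \sum_{s=1}^S (g^s - g^{s-1}).$$

- $S$ is taken large enough to make $\|g - g^S\|_Q$ small enough.

- $\sum_{s=1}^S (g^s - g^{s-1})$ only involves a finite number of functions.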

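Application: intermediate result

Lemma

Let $(\xi_1, \dots, \xi_n)$ be fixed. Suppose $W_i$, $i = 1, \dots, n$, are such that:

$$P\Big(\Big|\sum_{i=1}^n W_i \gamma_i\Big| \ge a\Big) \le C_1 \exp\Big(-\frac{a^2}{C_2 \sum_{i=1}^n \gamma_i^2}\Big).$$

Assume $\sup_{\mathcal{G}} \|g\|_{Q_n} \le R$ where $Q_n = \frac{1}{n} \sum_{i=1}^n \delta_{\xi_i}$. Then for

$$\sqrt{n}\,\delta \ge C\Big(\int_{\delta/8K}^{R} \sqrt{H_2(u, \mathcal{G}, Q_n)}\,du \vee R\Big),$$

we have:

$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n W_i g(\xi_i)\Big| \ge \delta \ \wedge\ \frac{1}{n} \sum_{i=1}^n W_i^2 \le K^2\Big) \le C \exp\Big(-\frac{n\delta^2}{C^2 R^2}\Big).$$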

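Proof using chaining

Consider $(g_j^s)_{j=1,\dots,N_s}$ a $2^{-s}R$-covering set of $\mathcal{G}$ w.r.t. $L^2(Q_n)$, for $s = 1, \dots, S$. Rewrite, for $g \in \mathcal{G}$:

$$\frac{1}{n} \sum_{i=1}^n W_i g(\xi_i) = \frac{1}{n} \sum_{i=1}^n W_i \big(g(\xi_i) - g^S(\xi_i)\big) + \frac{1}{n} \sum_{i=1}^n W_i g^S(\xi_i).$$

Choosing $S = \min\{s \ge 1: 2^{-s}R \le \frac{\delta}{2K}\}$ ensures (with Cauchy-Schwarz), on the event $\{\frac{1}{n} \sum_{i=1}^n W_i^2 \le K^2\}$:

$$\Big|\frac{1}{n} \sum_{i=1}^n W_i \big(g(\xi_i) - g^S(\xi_i)\big)\Big| \le K \|g - g^S\|_{Q_n} \le \frac{\delta}{2}.$$

It remains to control the probability:

$$P\Big(\max_{j=1,\dots,N_S} \Big|\frac{1}{n} \sum_{i=1}^n W_i g_j^S(\xi_i)\Big| \ge \frac{\delta}{2}\Big).$$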

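Write $g^S = \sum_{s=1}^S (g^s - g^{s-1})$ and, for $\sum_{s=1}^S \eta_s \le 1$:

$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{s=1}^S \sum_{i=1}^n W_i (g^s - g^{s-1})(\xi_i)\Big| \ge \frac{\delta}{2}\Big) \le \sum_{s=1}^S P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n W_i (g^s - g^{s-1})(\xi_i)\Big| \ge \frac{\delta \eta_s}{2}\Big)$$

$$\le \sum_{s=1}^S C_1 \exp\big(H_2(2^{-s}R, \mathcal{G}, Q_n)\big) \exp\Big(-\frac{n \delta^2 \eta_s^2}{C_2\, 2^{-2s} R^2}\Big).$$

With a good choice of $(\eta_s)$, we get the result.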

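Symmetrization

Goal: replace the study of an empirical process by the study of a symmetrized version.

Idea: consider a ghost sample $(Z_i')_{i=1}^n$, i.i.d. with the same law as $Z$ and independent of $(Z_i)_{i=1}^n$. Then, in distribution:

$$\epsilon_i \big(f(Z_i) - f(Z_i')\big) \sim f(Z_i) - f(Z_i'),$$

where $(\epsilon_i)_{i=1}^n$ are i.i.d. Rademacher variables.

Then we have in expectation:

$$E \sup_{g \in \mathcal{G}} |P_n - P|(g) \le 2 E \sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \epsilon_i g(Z_i)\Big|,$$

or in probability:

$$P\Big(\sup_{g \in \mathcal{G}} |P_n - P|(g) \ge \delta\Big) \le 4 P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \epsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big).$$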
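The expectation version of the symmetrization inequality can be checked by simulation. The sketch below is only an illustration under assumptions not in the slides: the data are Uniform(0, 1) and the class is $\{1\{\cdot \le t\}: t \in [0, 1]\}$, for which both suprema can be computed exactly from the order statistics.

```python
import random

def sup_empirical_deviation(sample):
    # sup_t |F_n(t) - t| for Uniform(0, 1) data; the supremum is attained
    # just before or at one of the order statistics.
    n = len(sample)
    zs = sorted(sample)
    return max(max((i + 1) / n - z, z - i / n) for i, z in enumerate(zs))

def sup_rademacher_average(sample, signs):
    # sup_t |(1/n) sum_i eps_i 1{Z_i <= t}|: largest absolute partial sum of
    # the signs, taken in the order of the sorted sample.
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    partial, best = 0, 0
    for i in order:
        partial += signs[i]
        best = max(best, abs(partial))
    return best / n

def symmetrization_check(n=100, reps=500, seed=0):
    # Monte Carlo estimates of E sup |P_n - P|(g) and 2 E sup |(1/n) sum eps_i g(Z_i)|.
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        lhs += sup_empirical_deviation(sample)
        rhs += sup_rademacher_average(sample, signs)
    return lhs / reps, 2.0 * rhs / reps

lhs, rhs = symmetrization_check()
print("E sup |P_n - P| ~", lhs, "   2 E sup Rademacher average ~", rhs)
```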

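Hoeffding's inequality

Theorem

Consider $Z_1, \dots, Z_n$ centered independent r.v. such that $a_i \le Z_i \le b_i$, $i = 1, \dots, n$. Then $\forall a > 0$:

$$P\Big(\sum_{i=1}^n Z_i \ge a\Big) \le \exp\Big(-\frac{2a^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$

Hence we get for $\epsilon_i$ i.i.d. Rademacher:

$$P\Big(\Big|\sum_{i=1}^n \epsilon_i \gamma_i\Big| \ge a\Big) \le 2 \exp\Big(-\frac{a^2}{2 \sum_{i=1}^n \gamma_i^2}\Big).$$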
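A quick simulation can be used to compare the Rademacher tail with the Hoeffding bound above. The weights $\gamma_i = 1$, the level $a = 2\sqrt{n}$ and the function names are illustrative assumptions, not part of the talk.

```python
import math
import random

def hoeffding_rademacher_bound(gammas, a):
    # 2 exp(-a^2 / (2 sum_i gamma_i^2)), the two-sided bound for Rademacher sums.
    return 2.0 * math.exp(-a * a / (2.0 * sum(g * g for g in gammas)))

def empirical_tail(gammas, a, reps=20000, seed=0):
    # Monte Carlo estimate of P(|sum_i eps_i gamma_i| >= a).
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s = sum(g if rng.random() < 0.5 else -g for g in gammas)
        if abs(s) >= a:
            hits += 1
    return hits / reps

gammas = [1.0] * 50
a = 2.0 * math.sqrt(len(gammas))
print("empirical tail:", empirical_tail(gammas, a))
print("Hoeffding bound:", hoeffding_rademacher_bound(gammas, a))
```

With these choices the bound equals $2e^{-2}$, and the simulated tail should fall well below it, since Hoeffding's inequality is not sharp at this level.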

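ULLN under entropy condition

Theorem

Assume $\sup_{g \in \mathcal{G}} \|g\|_\infty < R$ and:

$$\frac{1}{n} H_2(\delta, \mathcal{G}, P_n) \xrightarrow{P} 0, \quad \forall \delta > 0.$$

Then $\mathcal{G}$ satisfies the ULLN.

Proof:

Apply symmetrization:

$$P\Big(\sup_{g \in \mathcal{G}} |P_n - P|(g) \ge \delta\Big) \le 4 P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \epsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big).$$

Hoeffding's inequality:

$$P\Big(\Big|\sum_{i=1}^n \epsilon_i \gamma_i\Big| \ge a\Big) \le 2 \exp\Big(-\frac{a^2}{2 \sum_{i=1}^n \gamma_i^2}\Big).$$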

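Apply the intermediate result conditionally on $Z_1, \dots, Z_n$ on the set

$$A_n = \Big\{\sqrt{n}\,\delta \ge C\Big(R \sqrt{H_2\big(\tfrac{\delta}{32}, \mathcal{G}, P_n\big)} \vee R\Big)\Big\};$$

we get:

$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^n \epsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big) \le C \exp\Big(-\frac{n\delta^2}{C^2 R^2}\Big) + P(A_n^C) \xrightarrow[n \to \infty]{} 0.$$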

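Applications to MLE estimator

Observe $X_1, \dots, X_n$ i.i.d. with density $p_0 \in \mathcal{P}$ with respect to a $\sigma$-finite measure $\mu$. Suppose there exists:

$$\hat p_n = \arg\max_{p \in \mathcal{P}} P_n(\log p).$$

We want the a.s. convergence of the quantity:

$$h(\hat p_n, p_0) = \sqrt{\frac{1}{2} \int \big(\sqrt{\hat p_n} - \sqrt{p_0}\big)^2 d\mu}.$$

For convex $\mathcal{P}$, we have:

$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{2 \hat p_n}{\hat p_n + p_0}\Big).$$

Then it remains to get the ULLN for the class $\mathcal{G} = \big\{\frac{2p}{p + p_0},\ p \in \mathcal{P}\big\}$.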

Applications to MLE estimator (2)

Examples:

- The class of monotone Lebesgue densities:

$$\mathcal{P} = \{p \text{ is a decreasing density on } [0, 1]\}.$$

Here use the convexity of $\mathcal{P}$.

- The class of Lebesgue smooth densities:

$$\mathcal{P} = \Big\{p: [0, 1] \to \mathbb{R}^+ : \int p\, d\mu = 1,\ \int_0^1 \big(p^{(m)}(x)\big)^2 dx \le M^2\Big\}.$$

Here $H_1^B(\mathcal{P}, \delta, \mu) \le A \delta^{-1/m}$.

Increments of empirical process

Where do empirical processes appear? (a finer bound)

$$R(\hat f_n) - R(f^*) \le R(\hat f_n) - R_n(\hat f_n) + R_n(f^*) - R(f^*) \le \sup_{f \in \mathcal{G}(\delta)} |(R_n - R)(f - f^*)|.$$

⇒ We study the behaviour of $\nu_n(g - g_0) = \sqrt{n}\,(P_n - P)(g - g_0)$ on $\mathcal{G}(\delta) = \{g \in \mathcal{G}: \|g - g_0\|_P \le \delta\}$.

Agenda:

- Bernstein's inequality

- Uniform Bernstein over $\mathcal{G}$

- Application to $\mathcal{G}(\delta)$ for $\delta \to 0$.

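Bernstein's inequality

Theorem

Let $\|g\|_\infty \le K$ and $\|g\|_P \le R$. Then:

$$P(\nu_n(g) \ge a) \le \exp\Big(-\frac{a^2}{2\big(\frac{aK}{\sqrt{n}} + R^2\big)}\Big).$$

It gives subgaussian or subexponential tails, and we are interested in the subgaussian regime:

$$P(\nu_n(g) \ge a) \le \exp\Big(-\frac{a^2}{4R^2}\Big) \quad\text{holds for}\quad a \le \frac{\sqrt{n}\, R^2}{K},$$

in particular when $R \to 0$.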
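The passage from the Bernstein bound to the subgaussian bound is a one-line comparison of denominators: when $a \le \sqrt{n} R^2 / K$, we have $aK/\sqrt{n} \le R^2$, so $2(aK/\sqrt{n} + R^2) \le 4R^2$. The sketch below evaluates both bounds numerically; the function names and the values of $n$, $K$, $R$ are illustrative assumptions.

```python
import math

def bernstein_bound(a, n, K, R):
    # exp(-a^2 / (2 (a K / sqrt(n) + R^2)))
    return math.exp(-a * a / (2.0 * (a * K / math.sqrt(n) + R * R)))

def subgaussian_bound(a, R):
    # exp(-a^2 / (4 R^2))
    return math.exp(-a * a / (4.0 * R * R))

def subgaussian_threshold(n, K, R):
    # Largest a for which the subgaussian bound is implied: sqrt(n) R^2 / K.
    return math.sqrt(n) * R * R / K

n, K, R = 1000, 1.0, 0.1
a_max = subgaussian_threshold(n, K, R)
for frac in (0.1, 0.5, 1.0, 10.0):
    a = frac * a_max
    print("a / a_max =", frac,
          " Bernstein:", bernstein_bound(a, n, K, R),
          " subgaussian:", subgaussian_bound(a, R))
```

Beyond the threshold (the last row) the Bernstein bound exceeds the subgaussian one, which is the subexponential regime.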

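Uniform Bernstein under entropy condition

Theorem

Let $\mathcal{G}$ be such that $\sup_{\mathcal{G}} \|g\|_\infty \le K$, $\sup_{\mathcal{G}} \|g\|_P \le R$ and $\int_0^1 \sqrt{H_2^B(\mathcal{G}, u, P)}\,du < \infty$. Then for

$$C_0\Big(\int_0^R \sqrt{H_2^B(\mathcal{G}, u, P)}\,du \vee R\Big) \le a \le C_1 \frac{\sqrt{n}\, R^2}{K},$$

we have

$$P\Big(\sup_{g \in \mathcal{G}} |\nu_n(g)| \ge a\Big) \le \exp\Big(-\frac{a^2}{C^2 (C_1 + 1) R^2}\Big).$$

Proof: chaining, similar to the intermediate result.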

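Application to $\mathcal{G}(\delta_n)$ when $\delta_n \to 0$

Theorem

Suppose $H_2^B(\mathcal{G}, \delta, P) \le A \delta^{-\alpha}$, $0 < \alpha < 2$. Then for $\delta \ge n^{-\frac{1}{2+\alpha}}$, we have

$$P\Big(\sup_{g \in \mathcal{G}(\delta)} |\nu_n(g) - \nu_n(g_0)| \ge c\, \delta^{1 - \frac{\alpha}{2}}\Big) \le C \exp\Big(-\frac{C' \delta^{-\alpha}}{C''}\Big).$$

Proof:

- $\delta \to 0$ not too fast, in order to have subgaussian tails.

- $a \sim \int_0^\delta \sqrt{H_2^B(\mathcal{G}, u, P)}\,du \sim \delta^{1 - \frac{\alpha}{2}}$ to ensure uniform Bernstein.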

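Application to $\mathcal{G}(\delta_n)$ when $\delta_n \to 0$

Consequence:

$$\text{(i)}\quad P\Big(\sup_{g \in \mathcal{G}(n^{-\frac{1}{2+\alpha}})} |\nu_n(g) - \nu_n(g_0)| \ge T n^{-\frac{2-\alpha}{2(2+\alpha)}}\Big) \le \exp\Big(-\frac{T n^{\frac{\alpha}{2+\alpha}}}{C}\Big).$$

$$\text{(ii)}\quad P\Big(\sup_{g \in \mathcal{G}^C(n^{-\frac{1}{2+\alpha}})} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1 - \frac{\alpha}{2}}} \ge T\Big) \le \exp\Big(-\frac{T}{C}\Big).$$

⇒ We arrive at:

$$\sup_{g \in \mathcal{G}} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1 - \frac{\alpha}{2}} \vee n^{-\frac{2-\alpha}{2(2+\alpha)}}} = O_P(1).$$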

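Proof of (ii): Peeling

Goal: to study the tails of $\frac{\nu_n(g)}{w(g)}$ from the tails of $\nu_n(g)$.

Idea: consider $(m_s)_{s=1}^S$ a decreasing sequence and peel $\mathcal{G}$ as follows:

$$\mathcal{G} \subseteq \bigcup_{s=1}^S \mathcal{G}_s, \quad\text{where}\quad \mathcal{G}_s = \{g \in \mathcal{G}: m_s \le w(g) \le m_{s-1}\}.$$

Then write:

$$P\Big(\sup_{g \in \mathcal{G}} \frac{\nu_n(g)}{w(g)} \ge a\Big) \le \sum_{s=1}^S P\Big(\sup_{g \in \mathcal{G}_s} \frac{\nu_n(g)}{w(g)} \ge a\Big) \le \sum_{s=1}^S P\Big(\sup_{g:\, w(g) \le m_{s-1}} \nu_n(g) \ge a\, m_s\Big).$$

Here, to get (ii), $w(g) = \|g - g_0\|$ and $m_s = 2^{-s}$.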

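Application to get rates of convergence of MLE

Observe $X_1, \dots, X_n$ i.i.d. with density $p_0 \in \mathcal{P}$. We want to upper bound the quantity:

$$h(\hat p_n, p_0) = \sqrt{\frac{1}{2} \int \big(\sqrt{\hat p_n} - \sqrt{p_0}\big)^2 d\mu}.$$

Using for instance:

$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{\sqrt{\hat p_n}}{\sqrt{p_0}}\Big).$$

If $H_2^B\big(\delta, \frac{\mathcal{P}^{1/2}}{p_0^{1/2}}, P\big) \le \delta^{-\alpha}$, we have:

$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{\sqrt{\hat p_n}}{\sqrt{p_0}}\Big) \le O_P(n^{-\frac{1}{2}})\, h^{1 - \frac{\alpha}{2}}(\hat p_n, p_0) \vee O_P(n^{-\frac{2}{2+\alpha}}),$$

which gives

$$h(\hat p_n, p_0) = O_P(n^{-\frac{1}{2+\alpha}}).$$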
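The last step can be spelled out by solving the inequality for $h = h(\hat p_n, p_0)$ in each of the two cases of the maximum. This is a heuristic sketch: the $O_P$ terms are replaced by a constant $c > 0$, which is a notation introduced here and not in the original slides.

```latex
\begin{align*}
h^2 \le c\, n^{-\frac{1}{2}}\, h^{1 - \frac{\alpha}{2}}
  &\iff h^{1 + \frac{\alpha}{2}} \le c\, n^{-\frac{1}{2}}
   \iff h \le c^{\frac{2}{2+\alpha}}\, n^{-\frac{1}{2+\alpha}}, \\
h^2 \le c\, n^{-\frac{2}{2+\alpha}}
  &\iff h \le c^{\frac{1}{2}}\, n^{-\frac{1}{2+\alpha}}.
\end{align*}
```

Both cases lead to the same rate $n^{-\frac{1}{2+\alpha}}$.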

How to choose $\mathcal{F}$?

A penalized M-estimator for classification

Consider the SVM minimization:

$$\min_{f \in \mathcal{H}} \Big[\frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+ + \alpha_n \|f\|_{\mathcal{H}}^2\Big],$$

where:

- $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \dots, n$, are i.i.d.,

- $\mathcal{H}$ is a given functional space,

- $\alpha_n$ is a smoothing parameter to be determined.
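The penalized criterion above can be minimized numerically in the simplest case. The sketch below assumes that $\mathcal{H}$ is the space of linear functions $f(x) = \langle w, x \rangle$ with $\|f\|_{\mathcal{H}}^2 = \|w\|^2$, and uses plain subgradient descent with step $0.5/\sqrt{t}$. The toy data set, the step size and the function names are all illustrative choices, not the method discussed in the talk.

```python
def hinge_objective(w, data, alpha):
    # (1/n) sum_i (1 - y_i <w, x_i>)_+ + alpha ||w||^2
    n = len(data)
    hinge = sum(max(0.0, 1.0 - y * sum(wj * xj for wj, xj in zip(w, x)))
                for x, y in data) / n
    return hinge + alpha * sum(wj * wj for wj in w)

def svm_subgradient_descent(data, alpha, steps=200):
    n, d = len(data), len(data[0][0])
    w = [0.0] * d
    best_w, best_obj = list(w), hinge_objective(w, data, alpha)
    for t in range(1, steps + 1):
        # Subgradient: 2 alpha w - (1/n) sum over margin violations of y_i x_i.
        grad = [2.0 * alpha * wj for wj in w]
        for x, y in data:
            if y * sum(wj * xj for wj, xj in zip(w, x)) < 1.0:
                for j in range(d):
                    grad[j] -= y * x[j] / n
        step = 0.5 / t ** 0.5
        w = [wj - step * gj for wj, gj in zip(w, grad)]
        obj = hinge_objective(w, data, alpha)
        if obj < best_obj:  # subgradient steps need not decrease: keep the best iterate
            best_w, best_obj = list(w), obj
    return best_w, best_obj

# Hypothetical linearly separable toy sample in R^2.
data = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((2.0, 2.0), 1), ((3.0, 1.0), 1),
        ((-2.0, -1.0), -1), ((-1.0, -2.0), -1), ((-2.0, -2.0), -1), ((-1.0, -3.0), -1)]
w_hat, obj = svm_subgradient_descent(data, alpha=0.01)
print("w_hat =", w_hat, " objective =", obj)
```

A larger `alpha` shrinks `w_hat` towards zero and a smaller one lets the hinge term dominate, which is the trade-off that the choice of $\alpha_n$ has to balance.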

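Rates of convergence

- Estimation error:

$$R(\hat f_n, f^*) \le C \inf_{f \in \mathcal{H}} \big(R(f, f^*) + \alpha_n \|f\|_{\mathcal{H}}^2\big) + \delta_n(\alpha_n).$$

- Approximation error:

$$a(\alpha_n) = \inf_{f \in \mathcal{H}} \big(R(f, f^*) + \alpha_n \|f\|_{\mathcal{H}}^2\big).$$

Then you have:

$$R(\hat f_n, f^*) \le C a(\alpha_n) + \delta_n(\alpha_n) \underset{\alpha_n \sim \alpha_n^*}{\sim} n^{-?}$$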

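Local Rademacher

We are interested in

$$E \sup_{f \in B_{\mathcal{H}}(R):\ E f(X)^2 \le \delta} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) := E\,\mathrm{Rad}_n(R, \delta),$$

where the $\epsilon_i$ are i.i.d. with $P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}$.

- In the RKHS situation, we have:

$$E\,\mathrm{Rad}_n(R, \delta) \le \frac{1}{\sqrt{n}} \inf_{d \in \mathbb{N}} \Big(\sqrt{d \delta} + R \sqrt{\sum_{j > d} \lambda_j}\Big),$$

where $(\lambda_j)$ is the eigenspectrum of the integral operator $L_K: f \mapsto \int f(x) K(x, \cdot)\,dx$.

- What happens if $\mathcal{H} = B^s_{pq}(\mathbb{R}^d)$?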
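The RKHS bound above is easy to evaluate once an eigenspectrum is given. The sketch below is an illustration under assumptions not in the slides: the spectrum is truncated to a finite list (eigenvalues beyond it are treated as zero), it decays like $\lambda_j = j^{-2}$, and the infimum is taken over $d \in \{0, \dots, \text{len}\}$.

```python
import math

def local_rademacher_bound(eigenvalues, n, R, delta):
    # (1 / sqrt(n)) * min over d of ( sqrt(d * delta) + R * sqrt(sum_{j > d} lambda_j) )
    tail = sum(eigenvalues)
    best = R * math.sqrt(tail)  # d = 0: no eigenvalue removed from the tail
    for d, lam in enumerate(eigenvalues, start=1):
        tail = max(tail - lam, 0.0)  # guard against tiny negative rounding errors
        best = min(best, math.sqrt(d * delta) + R * math.sqrt(tail))
    return best / math.sqrt(n)

# Hypothetical polynomially decaying spectrum, truncated at 1000 terms.
eigenvalues = [j ** -2.0 for j in range(1, 1001)]
for delta in (1.0, 0.1, 0.01):
    print("delta =", delta, " bound =", local_rademacher_bound(eigenvalues, 100, 1.0, delta))
```

The bound shrinks with $\delta$ because a smaller variance constraint makes it cheaper to keep more leading directions in the $\sqrt{d\delta}$ term.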

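Besov case

Theorem

Suppose $X$ admits a bounded density $\rho$ with compact support. Then if $s > \frac{d}{p}$ and $1 \le p \le 2$,

$$\forall \delta > 0, \quad E \sup_{f \in B(R):\ E f(X)^2 \le \delta} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \le \frac{c}{\sqrt{n}}\, R^{\frac{d}{2u}}\, \delta^{\frac{s - d/p}{2u}},$$

where $u = s + d\big(\frac{1}{2} - \frac{1}{p}\big)$.

Consequence: if $f^* \in B^r_{pq}$,

$$R(\hat g_n) - R(f^*) = O_P\Big(n^{-\frac{r}{2s - r} \cdot \frac{2u}{2u + d}}\Big),$$

where we choose $\alpha_n \sim n^{-\frac{2u}{2u + d}}$.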
