(1)

Basic tools from empirical processes

Applications to statistics

Sébastien Loustau

LAREMA, Université d'Angers

Dynstoch meeting, June 2010, Angers

(2)

Non-serious motivation

Angers, 16-19 June 2010

"The principal aim of the DYNSTOCH network is to make a major contribution to the theory of statistical inference for stochastic processes by taking advantage..."

... not so far from

"The principal aim of the DYNSTOCH network is to make a major contribution to the theory of stochastic processes for statistical inference by taking advantage..."

Empirical processes for statistical inference!

(5)

Serious motivation

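$Z_1, \dots, Z_n$ i.i.d. from $P$. Given a class $\mathcal{G}$, we aim at:

$$f^* = \arg\min_{f \in \mathcal{G}} \mathbb{E}_P\, l(f, Z) = \arg\min_{f \in \mathcal{G}} R(f),$$

where $l$ is some loss function.

We study the performance (consistency, rates of convergence) of

$$\hat f_n = \arg\min_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} l(f, Z_i) = \arg\min_{f \in \mathcal{G}} R_n(f),$$

in terms of the excess risk:

$$R(\hat f_n) - R(f^*).$$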

(6)

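Where do empirical processes appear?

Since $R_n(\hat f_n) \le R_n(f^*)$ by definition of $\hat f_n$, with $f^*$ the minimiser of $R$ over $\mathcal{G}$:

$$R(\hat f_n) - R(f^*) \le R(\hat f_n) - R_n(\hat f_n) + R_n(f^*) - R(f^*) \le 2 \sup_{f \in \mathcal{G}} |R_n(f) - R(f)|.$$

⇒ Aim: to control uniformly the empirical process $R_n(f) - R(f)$ indexed by $\mathcal{G}$.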
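The bound above can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not in the slides: a finite class of threshold classifiers on $[0,1]$, the 0-1 loss, and labels $Y = 1\{X > 0.3\}$ flipped with probability 0.1, so that the true risk has the closed form $0.1 + 0.8\,|t - 0.3|$. The function name `erm_excess_risk` is hypothetical.

```python
import numpy as np

def erm_excess_risk(n=500, seed=0):
    """Compare the excess risk of ERM with 2 * sup_f |R_n(f) - R(f)| on a toy class."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.0, 1.0, 101)        # assumed finite class: f_t(x) = 1{x > t}
    x = rng.uniform(size=n)
    flip = rng.uniform(size=n) < 0.1               # assumed label noise of level 0.1
    y = (x > 0.3) ^ flip
    preds = x[None, :] > thresholds[:, None]       # one row of predictions per threshold
    emp_risk = (preds != y[None, :]).mean(axis=1)  # R_n(f_t) under the 0-1 loss
    true_risk = 0.1 + 0.8 * np.abs(thresholds - 0.3)  # R(f_t) in closed form
    f_hat = emp_risk.argmin()                      # empirical risk minimiser
    f_star = true_risk.argmin()                    # risk minimiser over the class
    excess = true_risk[f_hat] - true_risk[f_star]
    sup_dev = np.abs(emp_risk - true_risk).max()
    return excess, sup_dev

excess, sup_dev = erm_excess_risk()
print(f"excess risk = {excess:.4f}, 2 * sup deviation = {2 * sup_dev:.4f}")
```

The inequality holds on every sample, not only on average, because it only uses $R_n(\hat f_n) \le R_n(f^*)$.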

(7)

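Examples

• Density estimation:

$$\hat p_n = \arg\max_{p \in \mathcal{P}} \frac{1}{n} \sum_{i=1}^{n} \log p(Z_i).$$

• Regression:

$$\hat g_n = \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} (Y_i - g(X_i))^2.$$

• Classification:

$$\hat f_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (1 - Y_i f(X_i))_+.$$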

(8)

Outline

1. Uniform Law of Large Numbers

Entropy / Chaining / Symmetrization / Hoeffding's inequality / Consistency of M-estimators.

2. Increments of empirical processes

Finer results in a neighbourhood of a fixed function give rates of convergence of M-estimators.

3. Statistical learning theory

Examples of Rademacher complexities of RKHS balls and Besov balls give rates of convergence of SVM estimators.

(9)

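Notations, definitions

• $Z_1, \dots, Z_n$ i.i.d. with law $P$, and for $g$ measurable, $Pg := \mathbb{E}_P\, g(Z)$ and $P_n g := \frac{1}{n} \sum_{i=1}^{n} g(Z_i)$.

• $\mathcal{G}$ satisfies the ULLN if $\sup_{g \in \mathcal{G}} |P_n g - Pg| \to 0$ almost surely.

• $N_p(\mathcal{G}, \delta, Q)$ := smallest number of balls of radius $\delta$ needed to cover $\mathcal{G}$:

$$\forall g \in \mathcal{G},\ \exists i \in \{1, \dots, N_p\}: \|g - g_i\|_{L_p(Q)} \le \delta.$$

$H_p(\mathcal{G}, \delta, Q) := \log N_p(\mathcal{G}, \delta, Q)$ is called the entropy of $\mathcal{G}$.

• $N_{p,B}(\mathcal{G}, \delta, Q)$ := smallest number of $\delta$-brackets needed to cover $\mathcal{G}$:

$$\forall g \in \mathcal{G},\ \exists i \in \{1, \dots, N_{p,B}\}: g_i^L \le g \le g_i^U \ \text{and}\ \|g_i^L - g_i^U\|_{L_p(Q)} \le \delta.$$

$H_{p,B}(\mathcal{G}, \delta, Q) := \log N_{p,B}(\mathcal{G}, \delta, Q)$ is called the entropy with bracketing of $\mathcal{G}$.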

(10)

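Preliminary result

Lemma

$H_{1,B}(\mathcal{G}, \delta, P) < \infty,\ \forall \delta > 0 \ \Rightarrow\ \mathcal{G}$ satisfies the ULLN.

Proof:

Let $g \in \mathcal{G}$. Then there exists a $\delta$-bracket $[g_j^L, g_j^U]$, $j \in \{1, \dots, N\}$, such that:

$$(P_n - P) g \le P_n g_j^U - P g = (P_n - P) g_j^U + P(g_j^U - g) \le (P_n - P) g_j^U + \delta,$$

and similarly $(P_n - P) g \ge (P_n - P) g_j^L - \delta$.

Since the family $([g_j^L, g_j^U])_{j=1,\dots,N}$ is finite, the LLN gives, for $n$ large enough:

$$\max_{j=1,\dots,N} |(P_n - P) g_j^U| \le \delta \quad \text{and} \quad \max_{j=1,\dots,N} |(P_n - P) g_j^L| \le \delta \quad \text{a.s.}$$

Then

$$\sup_{g \in \mathcal{G}} |P_n g - P g| \le 2\delta \quad \text{a.s.}$$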

(11)

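Chaining method

Let $\mathcal{G} \subset L_2(Q)$ with $\sup_{\mathcal{G}} \|g\|_Q \le R$.

Idea: approximate $g \in \mathcal{G}$ by a finite class called a chain.

Consider $(g_j^s)_{j=1,\dots,N_s}$ a $2^{-s}R$-covering set of $\mathcal{G}$, for $s = 1, \dots, S$:

$$\forall g \in \mathcal{G},\ \exists g^s \in \{g_1^s, \dots, g_{N_s}^s\}: \|g - g^s\|_Q \le 2^{-s} R.$$

Then we can write, with the convention $g^0 := 0$:

$$g = g - g^S + \sum_{s=1}^{S} (g^s - g^{s-1}).$$

• $S$ is taken large enough to make $\|g - g^S\|_Q$ small enough.

• $\sum_{s=1}^{S} (g^s - g^{s-1})$ only involves a finite number of functions.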
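The chain can be made concrete for a finite class of functions observed at $n$ points. The following is a minimal sketch, not the construction used in the talk: it assumes $Q$ is the empirical measure on those $n$ points, builds each covering set greedily (one possible choice among many), and uses the convention $g^0 = 0$. The names `q_norm`, `greedy_net` and `build_chain` are hypothetical.

```python
import numpy as np

def q_norm(v):
    """Empirical L2(Q) norm of functions given by their values at n points (last axis)."""
    return np.sqrt((v ** 2).mean(axis=-1))

def greedy_net(points, radius):
    """Indices of centers such that every row of `points` is within `radius` of a center."""
    centers = []
    for i in range(len(points)):
        if not centers or q_norm(points[centers] - points[i]).min() > radius:
            centers.append(i)
    return centers

def build_chain(points, R, S):
    """Return [g^0, ..., g^S]: g^0 = 0 and g^s is the nearest center at scale 2^-s * R."""
    levels = [np.zeros_like(points)]
    for s in range(1, S + 1):
        centers = points[greedy_net(points, 2.0 ** -s * R)]
        dist = q_norm(points[:, None, :] - centers[None, :, :])
        levels.append(centers[dist.argmin(axis=1)])
    return levels

rng = np.random.default_rng(0)
G = rng.normal(size=(40, 25))          # 40 functions evaluated at 25 points
R = q_norm(G).max()
levels = build_chain(G, R, S=5)
for s in range(1, 6):
    print(s, q_norm(G - levels[s]).max() <= 2.0 ** -s * R)
```

Each printed line confirms that the approximation error at level $s$ is at most $2^{-s}R$, which is the only property of the covering sets the chaining argument uses.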

(12)

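Application: intermediate result

Lemma

Let $(\xi_1, \dots, \xi_n)$ be fixed. Suppose $W_i$, $i = 1, \dots, n$, are such that:

$$P\Big(\Big|\sum_{i=1}^{n} W_i \gamma_i\Big| \ge a\Big) \le C_1 \exp\Big(-\frac{a^2}{C_2 \sum_{i=1}^{n} \gamma_i^2}\Big).$$

Assume $\sup_{\mathcal{G}} \|g\|_{Q_n} \le R$ where $Q_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\xi_i}$. Then for

$$\sqrt{n}\,\delta \ge C\Big(\int_{\delta/(8K)}^{R} \sqrt{H_2(u, \mathcal{G}, Q_n)}\,du \vee R\Big),$$

we have:

$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^{n} W_i g(\xi_i)\Big| \ge \delta \ \wedge\ \frac{1}{n} \sum_{i=1}^{n} W_i^2 \le K^2\Big) \le C \exp\Big(-\frac{n\delta^2}{C^2 R^2}\Big).$$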

(13)

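Proof using chaining

Consider $(g_j^s)_{j=1,\dots,N_s}$ a $2^{-s}R$-covering set of $\mathcal{G}$ w.r.t. $L_2(Q_n)$, for $s = 1, \dots, S$. Rewrite, for $g \in \mathcal{G}$:

$$\frac{1}{n} \sum_{i=1}^{n} W_i g(\xi_i) = \frac{1}{n} \sum_{i=1}^{n} W_i \big(g(\xi_i) - g^S(\xi_i)\big) + \frac{1}{n} \sum_{i=1}^{n} W_i g^S(\xi_i).$$

Choosing $S = \min\{s \ge 1: 2^{-s} R \le \frac{\delta}{2K}\}$ ensures (with Cauchy-Schwarz), on the event $\{\frac{1}{n} \sum W_i^2 \le K^2\}$:

$$\Big|\frac{1}{n} \sum_{i=1}^{n} W_i \big(g(\xi_i) - g^S(\xi_i)\big)\Big| \le K \|g - g^S\|_{Q_n} \le \frac{\delta}{2}.$$

It remains to control the probability:

$$P\Big(\max_{j=1,\dots,N_S} \Big|\frac{1}{n} \sum_{i=1}^{n} W_i g_j^S(\xi_i)\Big| \ge \frac{\delta}{2}\Big).$$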

(14)

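Write $g^S = \sum_{s=1}^{S} (g^s - g^{s-1})$ and, for $\sum_s \eta_s \le 1$:

$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{s=1}^{S} \sum_{i=1}^{n} W_i (g^s - g^{s-1})(\xi_i)\Big| \ge \frac{\delta}{2}\Big) \le \sum_{s=1}^{S} P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^{n} W_i (g^s - g^{s-1})(\xi_i)\Big| \ge \frac{\delta \eta_s}{2}\Big)$$

$$\le \sum_{s=1}^{S} C_1 \exp\big(H_2(2^{-s}R, \mathcal{G}, Q_n)\big) \exp\Big(-\frac{n \delta^2 \eta_s^2}{C_2\, 2^{-2s} R^2}\Big).$$

With a good choice of $(\eta_s)$, we get the result.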

(15)

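Symmetrization

Goal: replace the study of an empirical process by the study of a symmetrized version.

Idea: consider a ghost sample $(Z_i')_{i=1}^{n}$, i.i.d. with the same law as $Z$ and independent of $(Z_i)_{i=1}^{n}$. Then:

$$\varepsilon_i \big(f(Z_i) - f(Z_i')\big) \sim f(Z_i) - f(Z_i'),$$

where $(\varepsilon_i)_{i=1}^{n}$ are i.i.d. Rademacher variables.

Then we have in expectation:

$$\mathbb{E} \sup_{g \in \mathcal{G}} |(P_n - P)(g)| \le 2\, \mathbb{E} \sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i g(Z_i)\Big|,$$

or in probability:

$$P\Big(\sup_{g \in \mathcal{G}} |(P_n - P)(g)| \ge \delta\Big) \le 4\, P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big).$$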
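The expectation form of the symmetrization inequality can be illustrated by Monte Carlo. The sketch below rests on assumptions that are not in the slides: the class is the set of indicators $1\{z \le t\}$, $t \in [0,1]$, and $Z$ is uniform on $[0,1]$, so the left-hand supremum is the Kolmogorov-Smirnov statistic and the right-hand supremum is the maximum of a random walk of signs. The function name `symmetrization_check` is hypothetical.

```python
import numpy as np

def symmetrization_check(n=200, reps=300, seed=0):
    """Monte Carlo estimates of E sup|P_n - P| and 2 E sup|(1/n) sum eps_i g(Z_i)|."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1)
    lhs, rhs = [], []
    for _ in range(reps):
        z = np.sort(rng.uniform(size=n))
        # sup_t |F_n(t) - t| for the uniform law (Kolmogorov-Smirnov statistic)
        lhs.append(max(np.max(i / n - z), np.max(z - (i - 1) / n)))
        # sup_t |(1/n) sum_i eps_i 1{Z_i <= t}| is a maximum of partial sums of signs
        eps = rng.choice([-1.0, 1.0], size=n)
        rhs.append(np.abs(np.cumsum(eps)).max() / n)
    return float(np.mean(lhs)), 2.0 * float(np.mean(rhs))

lhs, rhs = symmetrization_check()
print(f"E sup|P_n - P| ~ {lhs:.4f}  <=  2 E sup|Rademacher average| ~ {rhs:.4f}")
```

For this particular class the symmetrized side is noticeably larger than the empirical-process side, which is consistent with the inequality not being tight in general.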

(16)

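Hoeffding's inequality

Theorem

Consider $Z_1, \dots, Z_n$ centered independent r.v. such that $a_i \le Z_i \le b_i$, $i = 1, \dots, n$. Then $\forall a > 0$:

$$P\Big(\sum_{i=1}^{n} Z_i \ge a\Big) \le \exp\Big(-\frac{2 a^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big).$$

Hence we get for $\varepsilon_i$ i.i.d. Rademacher:

$$P\Big(\Big|\sum_{i=1}^{n} \varepsilon_i \gamma_i\Big| \ge a\Big) \le 2 \exp\Big(-\frac{a^2}{2 \sum_{i=1}^{n} \gamma_i^2}\Big).$$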
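For the Rademacher version, the bound can be compared with an exact tail probability. This is a small sketch under the simplifying assumption $\gamma_i = 1$ for all $i$, in which case $\sum_i \varepsilon_i = 2B - n$ with $B$ binomial$(n, 1/2)$; the function names are hypothetical.

```python
from math import comb, exp

def rademacher_tail(n, a):
    """Exact P(|eps_1 + ... + eps_n| >= a) for i.i.d. Rademacher signs (gamma_i = 1)."""
    favourable = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= a)
    return favourable / 2 ** n

def hoeffding_bound(n, a):
    """Two-sided Hoeffding bound 2 exp(-a^2 / (2 n)) for gamma_i = 1."""
    return 2.0 * exp(-a ** 2 / (2.0 * n))

n = 50
for a in (5, 10, 20, 30):
    print(a, rademacher_tail(n, a), hoeffding_bound(n, a))
```

The exact tail stays below the bound for every $a$, and the gap widens as $a$ grows.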

(17)

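ULLN under entropy condition

Theorem

Assume $\sup_{g \in \mathcal{G}} \|g\|_\infty < R$ and:

$$\frac{1}{n} H_2(\delta, \mathcal{G}, P_n) \xrightarrow{P} 0,\quad \forall \delta > 0.$$

Then $\mathcal{G}$ satisfies the ULLN.

Proof:

Apply symmetrization:

$$P\Big(\sup_{g \in \mathcal{G}} |(P_n - P)(g)| \ge \delta\Big) \le 4\, P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big),$$

then Hoeffding's inequality:

$$P\Big(\Big|\sum_{i=1}^{n} \varepsilon_i \gamma_i\Big| \ge a\Big) \le 2 \exp\Big(-\frac{a^2}{2 \sum_{i=1}^{n} \gamma_i^2}\Big).$$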

(18)

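Apply the intermediate result conditionally on $Z_1, \dots, Z_n$ on the set

$$A_n = \Big\{\sqrt{n}\,\delta \ge C\Big(R \sqrt{H_2\big(\tfrac{\delta}{32}, \mathcal{G}, P_n\big)} \vee R\Big)\Big\},$$

we get:

$$P\Big(\sup_{g \in \mathcal{G}} \Big|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i g(Z_i)\Big| \ge \frac{\delta}{4}\Big) \le C \exp\Big(-\frac{n \delta^2}{C^2 R^2}\Big) + P(A_n^c) \xrightarrow[n \to \infty]{} 0.$$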

(19)

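Applications to the MLE

Observe $X_1, \dots, X_n$ i.i.d. with density $p_0 \in \mathcal{P}$ with respect to a $\sigma$-finite measure $\mu$. Suppose there exists:

$$\hat p_n = \arg\max_{p \in \mathcal{P}} P_n(\log p).$$

We want the a.s. convergence to zero of the quantity:

$$h(\hat p_n, p_0) = \sqrt{\frac{1}{2} \int \big(\sqrt{\hat p_n} - \sqrt{p_0}\big)^2 d\mu}.$$

For convex $\mathcal{P}$, we have:

$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{2 \hat p_n}{\hat p_n + p_0}\Big).$$

Then it remains to get the ULLN for the class $\mathcal{G} = \big\{\frac{2p}{p + p_0},\ p \in \mathcal{P}\big\}$.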

(20)

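Applications to the MLE (2)

Examples:

• The class of monotone Lebesgue densities:

$$\mathcal{P} = \{p \text{ is a decreasing density on } [0,1]\}.$$

Here we use the convexity of $\mathcal{P}$.

• The class of smooth Lebesgue densities:

$$\mathcal{P} = \Big\{p: [0,1] \to \mathbb{R}_+ : \int p\,d\mu = 1,\ \int_0^1 \big(p^{(m)}(x)\big)^2 dx \le M^2\Big\}.$$

Here $H_{1,B}(\mathcal{P}, \delta, \mu) \le A \delta^{-1/m}$.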

(21)

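Increments of empirical process

Where do empirical processes appear? (a finer bound)

$$R(\hat f_n) - R(f^*) \le R(\hat f_n) - R_n(\hat f_n) + R_n(f^*) - R(f^*) \le \sup_{f \in \mathcal{G}(\delta)} |(R_n - R)(f - f^*)|.$$

⇒ We study the behaviour of $\nu_n(g - g_0) = \sqrt{n}\,(P_n - P)(g - g_0)$ on $\mathcal{G}(\delta) = \{g \in \mathcal{G}: \|g - g_0\|_P \le \delta\}$.

Agenda:

• Bernstein's inequality

• Uniform Bernstein over $\mathcal{G}$

• Application to $\mathcal{G}(\delta)$ for $\delta \to 0$.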

(22)

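Bernstein's inequality

Theorem

Let $\|g\|_\infty \le K$ and $\|g\|_P \le R$. Then:

$$P(\nu_n(g) \ge a) \le \exp\Big(-\frac{a^2}{2(a K n^{-1/2} + R^2)}\Big).$$

It gives subgaussian or subexponential tails, and we are interested in the subgaussian regime:

$$P(\nu_n(g) \ge a) \le \exp\Big(-\frac{a^2}{4R^2}\Big)$$

holds for $a \le \frac{\sqrt{n} R^2}{K}$, in particular when $R \to 0$.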
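The two regimes follow from a direct comparison of the two terms in the denominator of Bernstein's bound. The derivation below is a short worked sketch; the subexponential constant $4K$ is what this crude comparison gives and is not claimed to be optimal.

```latex
% Subgaussian regime: the variance term R^2 dominates.
a \le \frac{\sqrt{n}\,R^2}{K}
\;\Longrightarrow\; a K n^{-1/2} \le R^2
\;\Longrightarrow\; 2\,(a K n^{-1/2} + R^2) \le 4 R^2
\;\Longrightarrow\; \exp\Big(-\frac{a^2}{2(a K n^{-1/2} + R^2)}\Big) \le \exp\Big(-\frac{a^2}{4R^2}\Big).

% Subexponential regime: the sup-norm term a K n^{-1/2} dominates.
a > \frac{\sqrt{n}\,R^2}{K}
\;\Longrightarrow\; 2\,(a K n^{-1/2} + R^2) < 4\, a K n^{-1/2}
\;\Longrightarrow\; \exp\Big(-\frac{a^2}{2(a K n^{-1/2} + R^2)}\Big) < \exp\Big(-\frac{a \sqrt{n}}{4K}\Big).
```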

(23)

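Uniform Bernstein under entropy condition

Theorem

Let $\mathcal{G}$ be such that $\sup_{\mathcal{G}} \|g\|_\infty \le K$, $\sup_{\mathcal{G}} \|g\|_P \le R$ and $\int_0^1 \sqrt{H_{2,B}(\mathcal{G}, u, P)}\,du < \infty$. Then take

$$C_0\Big(\int_0^R \sqrt{H_{2,B}(\mathcal{G}, u, P)}\,du \vee R\Big) \le a \le C_1 \frac{\sqrt{n} R^2}{K},$$

then

$$P\Big(\sup_{g \in \mathcal{G}} |\nu_n(g)| \ge a\Big) \le \exp\Big(-\frac{a^2}{C^2 (C_1 + 1) R^2}\Big).$$

Proof: chaining, similar to the intermediate result.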

(24)

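Application to $\mathcal{G}(\delta_n)$ when $\delta_n \to 0$

Theorem

Suppose $H_{2,B}(\mathcal{G}, \delta, P) \le A \delta^{-\alpha}$, $0 < \alpha < 2$. Then for $\delta \ge n^{-\frac{1}{2+\alpha}}$, we have

$$P\Big(\sup_{g \in \mathcal{G}(\delta)} |\nu_n(g) - \nu_n(g_0)| \ge c\,\delta^{1 - \frac{\alpha}{2}}\Big) \le C \exp\Big(-\frac{C' \delta^{-\alpha}}{C''}\Big).$$

Proof:

• $\delta \to 0$ not too fast, in order to have subgaussian tails.

• $a \sim \int_0^\delta \sqrt{H_2(\mathcal{G}, u, P)}\,du \sim \delta^{1 - \frac{\alpha}{2}}$ to ensure uniform Bernstein.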

(25)

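Application to $\mathcal{G}(\delta_n)$ when $\delta_n \to 0$

Consequence:

$$\text{(i)}\quad P\Big(\sup_{g \in \mathcal{G}(n^{-\frac{1}{2+\alpha}})} |\nu_n(g) - \nu_n(g_0)| \ge T n^{-\frac{2-\alpha}{2(2+\alpha)}}\Big) \le \exp\Big(-\frac{T n^{\frac{\alpha}{2+\alpha}}}{C}\Big).$$

$$\text{(ii)}\quad P\Big(\sup_{g \in \mathcal{G}^C(n^{-\frac{1}{2+\alpha}})} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1 - \frac{\alpha}{2}}} \ge T\Big) \le \exp\Big(-\frac{T}{C}\Big).$$

⇒ We arrive at:

$$\sup_{g \in \mathcal{G}} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1 - \frac{\alpha}{2}} \vee n^{-\frac{2-\alpha}{2(2+\alpha)}}} = O_P(1).$$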

(26)

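Proof of (ii): Peeling

Goal: to study the tails of $\frac{\nu_n(g)}{w(g)}$ from the tails of $\nu_n(g)$.

Idea: consider a decreasing sequence $(m_s)_{s=1}^{S}$ and peel $\mathcal{G}$ as follows:

$$\mathcal{G} \subseteq \bigcup_{s=1}^{S} \mathcal{G}_s, \quad \text{where } \mathcal{G}_s = \{g \in \mathcal{G}: m_s \le w(g) \le m_{s-1}\}.$$

Then write:

$$P\Big(\sup_{g \in \mathcal{G}} \frac{\nu_n(g)}{w(g)} \ge a\Big) \le \sum_{s=1}^{S} P\Big(\sup_{g \in \mathcal{G}_s} \frac{\nu_n(g)}{w(g)} \ge a\Big) \le \sum_{s=1}^{S} P\Big(\sup_{g:\, w(g) \le m_{s-1}} \nu_n(g) \ge a\, m_s\Big).$$

Here, to get (ii), $w(g) = \|g - g_0\|$ and $m_s = 2^{-s}$.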

(27)

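Application to get rates of convergence of MLE

Observe $X_1, \dots, X_n$ i.i.d. with density $p_0 \in \mathcal{P}$. We want to upper bound the quantity:

$$h(\hat p_n, p_0) = \sqrt{\frac{1}{2} \int \big(\sqrt{\hat p_n} - \sqrt{p_0}\big)^2 d\mu}.$$

Using for instance:

$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{\sqrt{\hat p_n}}{\sqrt{p_0}}\Big).$$

If $H_{2,B}\big(\delta, \mathcal{P}^{1/2} / p_0^{1/2}, P\big) \le \delta^{-\alpha}$, we have:

$$h^2(\hat p_n, p_0) \le (P_n - P)\Big(\frac{\sqrt{\hat p_n}}{\sqrt{p_0}}\Big) \le O_P(n^{-\frac{1}{2}})\, h^{1 - \frac{\alpha}{2}}(\hat p_n, p_0) \vee O_P(n^{-\frac{2}{2+\alpha}}),$$

which gives

$$h(\hat p_n, p_0) = O_P(n^{-\frac{1}{2+\alpha}}).$$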
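The last step is a short computation that can be spelled out as follows. It is a sketch in which the $O_P$ factors are replaced by a constant $C$ on an event of large probability, and $h$ stands for $h(\hat p_n, p_0)$; both terms of the maximum lead to the same rate.

```latex
% Case 1: the maximum is attained by the first term.
h^2 \le C\, n^{-1/2}\, h^{1 - \alpha/2}
\;\Longrightarrow\; h^{1 + \alpha/2} \le C\, n^{-1/2}
\;\Longrightarrow\; h \le C^{\frac{2}{2+\alpha}}\, n^{-\frac{1}{2+\alpha}}.

% Case 2: the maximum is attained by the second term.
h^2 \le C\, n^{-\frac{2}{2+\alpha}}
\;\Longrightarrow\; h \le C^{1/2}\, n^{-\frac{1}{2+\alpha}}.
```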

(28)

How to choose $\mathcal{F}$?

(29)

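A penalized M-estimator for classification

Consider the SVM minimization:

$$\min_{f \in \mathcal{H}} \Big[\frac{1}{n} \sum_{i=1}^{n} (1 - Y_i f(X_i))_+ + \alpha_n \|f\|_{\mathcal{H}}^2\Big],$$

where:

• $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \dots, n$, are i.i.d.,

• $\mathcal{H}$ is a given functional space,

• $\alpha_n$ is a smoothing parameter to determine.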
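To make the objective concrete, here is a minimal sketch of the penalized hinge-loss minimization. Everything specific in it is an assumption for illustration, not the setting of the talk: $\mathcal{H}$ is taken to be the space of linear functions $f(x) = \langle w, x \rangle$ with $\|f\|_{\mathcal{H}} = \|w\|$, the solver is plain subgradient descent with a constant step, and the toy data are linearly separable. The names `objective`, `fit_linear_svm` and `toy_data` are hypothetical.

```python
import numpy as np

def objective(w, X, y, alpha):
    """Penalized empirical hinge risk for the linear function f(x) = <w, x>."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.mean() + alpha * np.dot(w, w)

def fit_linear_svm(X, y, alpha, steps=500, lr=0.1):
    """Subgradient descent; returns the best iterate seen and its objective value."""
    w = np.zeros(X.shape[1])
    best_w, best_val = w.copy(), objective(w, X, y, alpha)
    for _ in range(steps):
        active = y * (X @ w) < 1.0                  # points with positive hinge loss
        grad = -(X * (y * active)[:, None]).mean(axis=0) + 2.0 * alpha * w
        w = w - lr * grad
        val = objective(w, X, y, alpha)
        if val < best_val:
            best_w, best_val = w.copy(), val
    return best_w, best_val

def toy_data(n=200, seed=0):
    """Assumed toy sample: label = sign of the first coordinate, with a margin of 0.2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    X[:, 0] = np.where(X[:, 0] >= 0, 0.2 + 0.8 * X[:, 0], -0.2 + 0.8 * X[:, 0])
    y = np.sign(X[:, 0])
    return X, y

X, y = toy_data()
w_hat, value = fit_linear_svm(X, y, alpha=0.01)
print("objective at w = 0:", objective(np.zeros(2), X, y, 0.01))
print("objective at fitted w:", value)
```

Increasing `alpha` shrinks the fitted `w` towards zero, which is the trade-off between the empirical hinge risk and the penalty that the choice of $\alpha_n$ has to balance.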

(30)

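Rates of convergence

• Estimation error:

$$R(\hat f_n, f^*) \le C\Big(\inf_{f \in \mathcal{H}} \big[R(f, f^*) + \alpha_n \|f\|_{\mathcal{H}}^2\big] + \delta_n(\alpha_n)\Big).$$

• Approximation error:

$$a(\alpha_n) = \inf_{f \in \mathcal{H}} \big[R(f, f^*) + \alpha_n \|f\|_{\mathcal{H}}^2\big].$$

Then you have, for a suitable choice of $\alpha_n$:

$$R(\hat f_n, f^*) \le C\,a(\alpha_n) + \delta_n(\alpha_n) \sim n^{-?}$$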

(31)

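Local Rademacher

We are interested in

$$\mathbb{E} \sup_{f \in B_{\mathcal{H}}(R):\ \mathbb{E} f(X)^2 \le \delta} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i) := \mathbb{E}\,\mathrm{Rad}_n(R, \delta),$$

where the $\varepsilon_i$ are i.i.d. with $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = \frac{1}{2}$.

• In the RKHS situation, we have:

$$\mathbb{E}\,\mathrm{Rad}_n(R, \delta) \le \frac{1}{\sqrt{n}} \inf_{d \in \mathbb{N}} \Big(\sqrt{d \delta} + R \sqrt{\sum_{j > d} \lambda_j}\Big),$$

where $(\lambda_j)$ is the eigenspectrum of the integral operator $L_K: f \mapsto \int f(x) K(x, \cdot)\,dx$.

• What happens if $\mathcal{H} = B^s_{pq}(\mathbb{R}^d)$?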
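The RKHS bound is easy to evaluate once the eigenvalues are given. The sketch below assumes a finite (truncated) list of eigenvalues and lets $d$ range over $0, \dots, m$ where $m$ is the length of that list; both choices are assumptions for illustration, as is the polynomially decaying spectrum $\lambda_j = j^{-2}$ used in the example. The function name `local_rademacher_bound` is hypothetical.

```python
import numpy as np

def local_rademacher_bound(eigs, n, R, delta):
    """Evaluate (1/sqrt(n)) * min_d ( sqrt(d*delta) + R*sqrt(sum_{j>d} eigs_j) ).

    `eigs` is a finite list of nonnegative eigenvalues (assumed truncation);
    d ranges over 0..len(eigs). Returns the bound and the minimising d.
    """
    eigs = np.sort(np.asarray(eigs, dtype=float))[::-1]
    # tails[d] = sum of the eigenvalues with (1-based) index j > d
    tails = np.concatenate([np.cumsum(eigs[::-1])[::-1], [0.0]])
    d = np.arange(len(eigs) + 1)
    values = np.sqrt(d * delta) + R * np.sqrt(tails)
    best = int(values.argmin())
    return values[best] / np.sqrt(n), best

eigs = 1.0 / np.arange(1, 201) ** 2      # assumed spectrum lambda_j = j^-2
for delta in (1.0, 0.1, 0.01):
    bound, d_star = local_rademacher_bound(eigs, n=100, R=1.0, delta=delta)
    print(f"delta = {delta}: bound = {bound:.4f}, attained at d = {d_star}")
```

Smaller values of $\delta$ make the term $\sqrt{d\delta}$ cheaper, so the minimising $d$ moves further into the spectrum and the bound decreases.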

(32)

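Besov case

Theorem

Suppose $X$ admits a bounded density $\rho$ with compact support. Then if $s > \frac{d}{p}$ and $1 \le p \le 2$,

$$\forall \delta > 0,\quad \mathbb{E} \sup_{f \in B(R):\ \mathbb{E} f(X)^2 \le \delta} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i) \le \frac{c}{\sqrt{n}}\, R^{\frac{d}{2u}}\, \delta^{\frac{s - d/p}{2u}},$$

where $u = s + d\big(\frac{1}{2} - \frac{1}{p}\big)$.

Consequence: if $f^* \in B^r_{pq}$,

$$R(\hat g_n) - R(f^*) = O_P\Big(n^{-\frac{r}{2s - r} \cdot \frac{2u}{2u + d}}\Big),$$

where we choose $\alpha_n \sim n^{-\frac{2u}{2u + d}}$.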

(33)

References

• S. van de Geer, Empirical Processes in M-Estimation, 2000.

• J. Wellner and A. van der Vaart, Weak Convergence and Empirical Processes, 2000.

• S. Mendelson, Local Rademacher Complexities, 2005.

• P. Massart, Some applications of concentration inequalities to statistics, 2000.

• S. Loustau, Penalized ERM over Besov spaces, 2009.
