
Concentration inequalities for Markov chains with statistical applications


Academic year: 2022


François Portier

Télécom ParisTech Joint work with Patrice Bertail

May, 14th

Outline

1 Regeneration for Markov chains
   Harris recurrence
   The atomic case
   Nummelin splitting trick

2 Concentration inequality for Markov chains
   Rademacher complexity: the i.i.d. case
   Rademacher complexity for Markov chains
   Assumptions
   Concentration inequality for Markov chains

3 Application to kernel density estimation

François Portier (Télécom ParisTech) May, 14th 2 / 36


Regeneration for Markov Chains


Regeneration for Markov Chains Harris recurrence

General framework : Harris recurrent Markov chains

Object of interest: $X = (X_n)_{n \in \mathbb{N}}$, a $\psi$-irreducible, Harris recurrent, aperiodic, time-homogeneous Markov chain, valued in a measurable space $(E, \mathcal{E})$ with transition probability $\Pi(x, dy)$ and initial distribution $\nu$.

Notations:

$P_\nu$ (respectively, $P_x$ for $x$ in $E$) denotes the probability measure such that $X_0 \sim \nu$ (resp., conditioned upon $X_0 = x$).

$E_A[\cdot]$ denotes the expectation given that $\{X_0 \in A\}$.

Main tool: regeneration properties of Markov chains.

Refer to the book by Meyn and Tweedie (2009).


Regeneration for Markov Chains The atomic case

Regenerative chains

Definition

$X$ is called regenerative when it possesses an accessible atom, i.e. a measurable set $A$ such that $\psi(A) > 0$ and $\Pi(x, \cdot) = \Pi(y, \cdot)$ for all $x, y$ in $A$.

Define the hitting times

$$\tau_A = \tau_A(1) = \inf\{n \ge 1 : X_n \in A\},$$

$$\tau_A(j) = \inf\{n > \tau_A(j-1) : X_n \in A\}, \quad j \ge 2.$$

Idea: the sample paths of the chain may be divided into i.i.d. blocks of random length corresponding to consecutive visits to $A$:

$$B_1 = (X_{\tau_A(1)+1}, \ldots, X_{\tau_A(2)}), \; \ldots, \; B_j = (X_{\tau_A(j)+1}, \ldots, X_{\tau_A(j+1)}), \; \ldots$$
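The block construction can be illustrated on a simulated path. The following is a minimal sketch, assuming a hypothetical example chain (a random walk on the non-negative integers reflected at 0, with atom A = {0}); the function names are illustrative and not from the talk.

```python
import random

def hitting_times(path, atom):
    """Indices n >= 1 with path[n] in the atom: tau_A(1), tau_A(2), ..."""
    return [n for n in range(1, len(path)) if path[n] in atom]

def regeneration_blocks(path, atom):
    """Complete blocks B_j = (X_{tau_A(j)+1}, ..., X_{tau_A(j+1)})."""
    taus = hitting_times(path, atom)
    return [path[s + 1 : t + 1] for s, t in zip(taus, taus[1:])]

def reflected_walk(n, p_up=0.4, seed=0):
    """Hypothetical example chain: random walk on {0, 1, ...} reflected at 0."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(n):
        x = x + 1 if rng.random() < p_up else max(x - 1, 0)
        path.append(x)
    return path

path = reflected_walk(200)
blocks = regeneration_blocks(path, atom={0})
```

Each block ends with a visit to the atom, so the next block starts afresh from the same transition law; this is what makes the blocks i.i.d.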


Example 1 : Cramer-Lundberg with a dividend barrier

Claim arrivals in an interval $[0, t]$: $\{N(t), t \ge 0, N(0) = 0\}$ is a homogeneous Poisson process with rate $\lambda$, modeling the number of claims. The claim times are denoted $(T_n)_{n \in \mathbb{N}}$.

Claim sizes $U_i$, $i = 1, 2, \ldots$, are i.i.d. random variables with cdf $F$. The total claim amount is

$$S(t) = \sum_{i=1}^{N(t)} U_i.$$

Constant premium rate (price per unit of time) $c$.

The reserve of the company evolves as

$$R(t) = u + ct - S(t).$$

A constant barrier $b$, above which profit is redistributed:

$$X(t) = (u + ct - S(t)) \wedge b.$$

The embedded chain $X_n = X(T_n)$ is an atomic Markov chain with an atom at $b$.
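A simulation makes the atom at the barrier visible. This is only a sketch under assumptions that are not in the slides: claim sizes are exponential, the barrier acts recursively (the reserve is capped at b each time premiums would push it above), the reserve is observed just before each claim, and ruin is ignored. All parameter values are hypothetical.

```python
import random

def pre_claim_reserves(n_claims, u=4.0, b=8.0, c=2.0, lam=1.0,
                       mean_claim=1.0, seed=0):
    """Reserve observed just before each claim, capped at the barrier b.

    Assumptions (not from the slides): exponential claims, ruin is ignored.
    """
    rng = random.Random(seed)
    reserve, path = u, []
    for _ in range(n_claims):
        wait = rng.expovariate(lam)                    # time to the next claim
        reserve = min(reserve + c * wait, b)           # premiums accrue, excess paid out
        path.append(reserve)
        reserve -= rng.expovariate(1.0 / mean_claim)   # the claim is paid
    return path

path = pre_claim_reserves(2000)
visits_to_barrier = sum(1 for x in path if x == 8.0)
```

Every time the observed reserve equals b, the future of the chain no longer depends on the past, so the visits to b delimit regeneration blocks.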


Figure: Cramér-Lundberg model with a dividend barrier at b, ruin at 0 (company reserves X(t) plotted against time).

Regeneration for Markov Chains Nummelin splitting trick

Harris chains : Nummelin splitting trick

Definition

A set $S \in \mathcal{E}$ is said to be small for $X$ if there exist $m \in \mathbb{N}$, $\delta > 0$ and a probability measure $\Phi$ on $(E, \mathcal{E})$ (with support $S$) such that, for all $x \in S$, $B \in \mathcal{E}$,

$$\Pi^m(x, B) \ge \delta \Phi(B).$$

Property: Harris recurrent chains have small sets.

A simplification: we start by considering the case $m = 1$.


The Nummelin splitting trick

Nummelin (1978); Athreya and Ney (1978)

Any Harris recurrent Markov chain can be made atomic!

Expand the initial chain $(X_n)_{n \ge 1}$ to $(X_n, Y_n)_{n \ge 1}$, where the $(Y_n)_{n \ge 1}$ are i.i.d. Bernoulli $\mathcal{B}(\delta)$. Consider the randomization:

If $X_n \in S$ and $Y_n = 1$ (with probability $\delta \in \,]0, 1[$), then $X_{n+1} \sim \Phi$.

If $X_n \in S$ and $Y_n = 0$, then $X_{n+1} \sim (1 - \delta)^{-1} (\Pi(X_n, \cdot) - \delta \Phi(\cdot))$.

Property: the set $A = S \times \{1\}$ is an atom for the bivariate Markov chain $(X, Y)$. This chain inherits all its communication and stochastic stability properties from $X$ (refer to Chapter 14 of Meyn and Tweedie (2009)).
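On a finite state space the splitting can be written out explicitly. The sketch below assumes a hypothetical 3-state transition matrix and the small set S = {0, 1} with m = 1; the minorization constant and measure are obtained by taking columnwise minima over S, which is one convenient choice and not necessarily the one used by the authors.

```python
import random

# Hypothetical transition matrix: P[x][y] = Pi(x, {y}).
P = [[0.6, 0.4, 0.0],
     [0.3, 0.4, 0.3],
     [0.1, 0.2, 0.7]]
S = [0, 1]  # candidate small set

# Minorization Pi(x, .) >= delta * Phi(.) for x in S, via columnwise minima.
lower = [min(P[x][y] for x in S) for y in range(3)]   # equals delta * Phi
delta = sum(lower)
Phi = [v / delta for v in lower]

def residual(x):
    """Residual law (Pi(x, .) - delta * Phi(.)) / (1 - delta), for x in S."""
    return [(P[x][y] - lower[y]) / (1 - delta) for y in range(3)]

def split_step(x, rng):
    """One transition of the split chain: returns (X_{n+1}, Y_n)."""
    y = 1 if rng.random() < delta else 0   # Y_n ~ Bernoulli(delta)
    if x in S:
        law = Phi if y == 1 else residual(x)
    else:
        law = P[x]
    return rng.choices(range(3), weights=law)[0], y

rng = random.Random(0)
x, trajectory = 0, []
for _ in range(100):
    x_next, y = split_step(x, rng)
    trajectory.append((x, y))
    x = x_next
```

Averaging over Y_n recovers the original kernel, since delta * Phi + (1 - delta) * residual(x) = Pi(x, .) for x in S; the regeneration times are the indices where X_n is in S and Y_n = 1.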


Figure: Splitting a financial time series exhibiting thresholds and conditional heteroscedasticity, $n = 1000$, $\alpha_1 = 0.95$, $\alpha_2 = 0.45$, $\beta = 0.35$ and $\sigma^2 = 1$.


Some literature

The splitting technique was proposed in Nummelin (1978); Athreya and Ney (1978)

Functional CLT: Levental (1988)

Nice proof of the CLT: Bednorz et al. (2008)

Edgeworth expansion, large deviations, U-statistics, bootstrap: Bertail and Clémençon (2004a, 2004b, 2010, 2011)

Concentration inequalities: Adamczak (2008); Dedecker and Gouëzel (2015); Paulin (2015)

Concentration inequality for Markov chains


Concentration inequality for Markov chains Rademacher complexity : the i.i.d case

Empirical process indexed by functions : the i.i.d case

Let $(\Omega, \mathcal{F}, P)$ be a probability space and suppose that $X = (X_i)_{i \in \mathbb{N}}$ is a sequence of random variables on $(\Omega, \mathcal{F}, P)$ valued in $(E, \mathcal{E})$. Let $\mathcal{F}$ denote a countable class of real-valued measurable functions defined on $E$. Let $n \in \mathbb{N}$, define

$$Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E[f(X_i)]).$$

The random variable $Z$ plays a crucial role in machine learning and statistics (Boucheron et al., 2013; Bousquet et al., 2003).

Rademacher complexity: $(X_i)_{i \in \mathbb{N}}$ i.i.d.

The Rademacher complexity associated to $\mathcal{F}$ is given by

$$R_{n,\xi}(\mathcal{F}) = E \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \epsilon_i f(X_i),$$

where the $(\epsilon_i)_{i \in \mathbb{N}}$ are i.i.d. Rademacher random variables, i.e., taking values $+1$ and $-1$ with probability $1/2$, independent from $X$.
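For a finite class, the Rademacher average can be approximated by Monte Carlo over the signs. This is a minimal sketch: it conditions on one fixed sample (so it estimates the empirical version rather than the full expectation over X), and the class of half-line indicators is a hypothetical choice.

```python
import random

def rademacher_complexity(sample, functions, n_draws=2000, seed=0):
    """Monte Carlo estimate of E sup_f sum_i eps_i f(X_i), given the sample."""
    rng = random.Random(seed)
    values = [[f(x) for x in sample] for f in functions]  # f(X_i), computed once
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in sample]       # Rademacher signs
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values)
    return total / n_draws

# Hypothetical class: indicators of half-lines, f_t(x) = 1{x <= t}.
thresholds = [-1.0, -0.5, 0.0, 0.5, 1.0]
functions = [lambda x, t=t: 1.0 if x <= t else 0.0 for t in thresholds]
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
estimate = rademacher_complexity(sample, functions)
```

Since every function in this class is bounded by 1, the estimate cannot exceed the sample size; the point of the bounds in this section is that it is in fact of order the square root of n for VC-type classes.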


Key tools:

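- the function $F$ is an envelope for the class $\mathcal{F}$ if $|f(x)| \le F(x)$ for all $x \in E$ and $f \in \mathcal{F}$

- for a metric space $(\mathcal{F}, d)$, the covering number $N(\epsilon, \mathcal{F}, d)$ is the minimal number of balls of size $\epsilon$ needed to cover $\mathcal{F}$. The metric that we use here is induced by the norm $\|f\|_{L_2(Q)} = \{\int f^2 \, dQ\}^{1/2}$.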


Polynomial entropy : VC type classes

Hypothesis on the classF

$\mathcal{F}$ is a class of measurable functions $E \to \mathbb{R}$ of VC type (or Vapnik-Chervonenkis type) for an envelope $F$ and admissible characteristic $(C, v)$ (positive constants) such that $C \ge (3\sqrt{e})^v$ and $v \ge 1$, that is, for every probability measure $Q$ on $(E, \mathcal{E})$ with $0 < \|F\|_{L_2(Q)} < \infty$ and every $0 < \epsilon < 1$,

$$N\left(\epsilon \|F\|_{L_2(Q)}, \mathcal{F}, \|\cdot\|_{L_2(Q)}\right) \le C \epsilon^{-v}.$$

NB: the class is countable to avoid measurability issues (but the non-countable case may be handled similarly by using outer probability and additional measurability assumptions, see van der Vaart and Wellner, 2001).


Giné and Guillou (2002)'s key theorem

Theorem (i.i.d. case, Giné and Guillou (2002))

Let $\mathcal{F}$ be a measurable uniformly bounded VC class of functions defined on $E$ with envelope $F$ and characteristic $(C, v)$. Let $U > 0$ be such that $|f(x)| \le U$ for all $x \in E$ and $f \in \mathcal{F}$. Let $\sigma^2$ be such that $E[f(X)^2] \le \sigma^2$ for all $f \in \mathcal{F}$. Then, whenever $0 < \sigma \le U$, it holds that

$$R_{n,\xi}(\mathcal{F}) \le M \left[ v U \log\frac{CU}{\sigma} + \sqrt{v n \sigma^2 \log\frac{CU}{\sigma}} \right],$$

where $M$ is a universal constant.

Very useful in combination with Talagrand (1996) or McDiarmid (1998) (see also Einmahl and Mason (2005)).

Purpose of this work: (i) prove some concentration inequalities on $Z$ for Markov chains under assumptions on $f$ similar to the i.i.d. case, (ii) apply them to uniform results on density estimation and MCMC.


Concentration inequality for Markov chains Rademacher complexity for Markov chains

Block decompositions of the empirical process

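Decompose the chain according to the elements $X_i$ that belong to the complete blocks $B_1, \ldots, B_{l_n - 1}$ and the elements $X_i$ in $B_0$ and $B_{l_n}$:

$$\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E_\pi[f]) \le \sup_{f \in \mathcal{F}} \sum_{i=\tau_A(1)+1}^{\tau_A(l_n)} (f(X_i) - E_\pi[f]) + \sup_{f \in \mathcal{F}} \sum_{i=1}^{\tau_A} (f(X_i) - E_\pi[f]) + \sup_{f \in \mathcal{F}} \sum_{i=\tau_A(l_n)+1}^{n} (f(X_i) - E_\pi[f])$$

($B_0$ and $B_{l_n}$ will be treated separately). Because $\tau_A(l_n) - \tau_A(1) = \sum_{k=1}^{l_n - 1} \ell(B_k)$, where $\ell(B_k)$ denotes the size of block $k$, it holds that

$$\sum_{i=\tau_A(1)+1}^{\tau_A(l_n)} (f(X_i) - E_\pi[f]) = \sum_{k=1}^{l_n - 1} (f'(B_k) - \ell(B_k) E_\pi[f]),$$

where $f'(B_k) = \sum_{i=\tau_A(k)+1}^{\tau_A(k+1)} f(X_i)$.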
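The identity between the sum over time indices and the sum over complete blocks can be checked on a short path. This is a small sketch with a hypothetical integer-valued path and atom A = {0}; the constant `centering` stands in for the stationary mean of f.

```python
def block_decomposition(path, atom, f, centering):
    """Return both sides of the block identity on a finite path X_0, ..., X_n."""
    # Successive visits to the atom: tau_A(1), tau_A(2), ...
    taus = [n for n in range(1, len(path)) if path[n] in atom]
    # Complete blocks between consecutive visits.
    blocks = [path[s + 1 : t + 1] for s, t in zip(taus, taus[1:])]
    # Sum over the time indices between the first and the last visit.
    lhs = sum(f(x) - centering for x in path[taus[0] + 1 : taus[-1] + 1])
    # Sum over complete blocks of f'(B_k) - l(B_k) * centering.
    rhs = sum(sum(f(x) for x in b) - len(b) * centering for b in blocks)
    return lhs, rhs

path = [3, 1, 0, 2, 0, 0, 1, 4, 0, 5]   # hypothetical path
lhs, rhs = block_decomposition(path, {0}, f=lambda x: x * x, centering=2)
```

Here the atom is visited at times 2, 4, 5 and 8, which gives the three complete blocks (2, 0), (0) and (1, 4, 0); both sides equal 9.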


Two main difficulties

$l_n$ is random; it is very easy to get rid of this randomness asymptotically, since $l_n \sim n / E_A[\tau_A]$. It is much more complicated for finite $n$, because $l_n$ is correlated with the $f'(B_k)$.

Even if $f$ is bounded, $f'$ is not, and it depends on the behavior of $\tau_A$.


Block Rademacher complexity of the class F

The block Rademacher complexity of the class $\mathcal{F}$ is

$$R_{n,B}(\mathcal{F}) = E_A \sup_{f \in \mathcal{F}} \sum_{k=1}^{n} \epsilon_k f'(B_k),$$

where the $(\epsilon_k)_{k \in \mathbb{N}}$ are Rademacher random variables independent from the blocks $(B_k)_{k \in \mathbb{N}}$.

Issues

- how to control this quantity with the original VC complexity of the class
- how to control the concentration of $Z$ with this Rademacher complexity


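For this, define the torus $E' = \cup_{k=1}^{\infty} E^k$ and let the occupation measure $M$ be given by

$$M(B, dy) = \sum_{x \in B} \delta_x(y), \quad \text{for every } B \in E'.$$

For any function $f : E \to \mathbb{R}$, define the corresponding block function $f' : E' \to \mathbb{R}$ given by

$$f'(B) = \int f(y) M(B, dy) = \sum_{x \in B} f(x).$$

For any class $\mathcal{F}$ of real-valued functions defined on $E$, denote $\mathcal{F}' = \{f' : f \in \mathcal{F}\}$.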


Change of measure

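Let $Q'$ denote a probability measure on $(E', \mathcal{E}')$. The measure $Q$ defined by

$$Q(A) = E_{Q'}\left( \ell(B) \times \int_A M(B, dy) \right) / Q'(\ell^2), \quad \text{for every } A \in \mathcal{E},$$

is a probability measure on $(E, \mathcal{E})$.

Lemma

Let $Q'$ be a probability measure on $(E', \mathcal{E}')$ such that $0 < \|\ell\|_{L_2(Q')} < \infty$. Then we have, for every $0 < \epsilon < \infty$,

$$N(\epsilon \|\ell\|_{L_2(Q')}, \mathcal{F}', L_2(Q')) \le N(\epsilon, \mathcal{F}, L_2(Q)).$$

Moreover, if $\mathcal{F}$ is VC with constant envelope $U$ and characteristic $(C, v)$, then $\mathcal{F}'$ is VC with envelope $U\ell$ and characteristic $(C, v)$.

Define the truncated version $\mathcal{F}' 1_{\{\ell \le L\}} = \{f' 1_{\{\ell \le L\}} : f \in \mathcal{F}\}$; it remains VC with envelope $UL$.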


Idea of the proof (Jensen inequality)

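$$Q'(f'^2) = E_{Q'}\left( \left( \int f(y) M(B, dy) \right)^2 \right) \le E_{Q'}\left( \ell(B) \int f(y)^2 M(B, dy) \right) = Q(f^2) \, Q'(\ell^2).$$

Apply this inequality to the function

$$f'(B) - f_k'(B) = \int (f(y) - f_k(y)) M(B, dy),$$

for $f_k$ the centers of an $\epsilon$-cover of the space $\mathcal{F}$ and $\|f - f_k\|_{L_2(Q)} \le \epsilon$.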
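The inequality under the expectation holds block by block: the squared sum of f over a block is at most the block length times the sum of squares. A quick numerical check, with randomly generated blocks and a hypothetical test function:

```python
import random

def block_sum_squared(block, f):
    """f'(B)^2, the square of the sum of f over the block."""
    return sum(f(x) for x in block) ** 2

def length_times_sum_of_squares(block, f):
    """l(B) times the sum of f^2 over the block (the Jensen upper bound)."""
    return len(block) * sum(f(x) ** 2 for x in block)

rng = random.Random(0)
# Hypothetical blocks: random lengths between 1 and 10, values in [-1, 1].
blocks = [[rng.uniform(-1, 1) for _ in range(rng.randint(1, 10))]
          for _ in range(1000)]
f = lambda x: x ** 3 - 0.5 * x   # hypothetical test function
gaps = [length_times_sum_of_squares(b, f) - block_sum_squared(b, f)
        for b in blocks]
```

All the gaps are non-negative (up to rounding), and averaging over the blocks gives the inequality between the two expectations under the block distribution.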


Concentration inequality for Markov chains Assumptions

(PM) there exists $p > 1$ such that $E_A[\tau_A^p] < \infty$,

(EM) there exists $\lambda > 0$ such that $E_A[\exp(\tau_A \lambda)] < \infty$.

Remarks

(EM): Condition (EM) is equivalent to each of the following assertions: (i) the geometric ergodicity of the chain $X$, (ii) the (uniform) Doeblin condition, as well as (iii) the Foster-Lyapunov drift condition (see Theorem 16.0.2 in Meyn and Tweedie (2009) for the details).

Mixing and (PM): the relationship between (PM) and the rate of decay of mixing coefficients was investigated in Bolthausen (1982). This condition is typically fulfilled as soon as the sequence of strong mixing coefficients decreases at an arithmetic rate $n^{-s}$, for some $s > p - 1$.


Main results

Theorem (Block Rademacher complexity)

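Let $\mathcal{F}$ be VC with constant envelope $U$ and characteristic $(C, v)$. Let $\sigma'^2$ be such that $E_A\left[ \sum_{i=1}^{\tau_A} f(X_i)^2 \right] \le \sigma'^2$ for all $f \in \mathcal{F}$. For some universal constant $M > 0$, and any $L$ such that $0 < \sigma' \le LU$,

1. if (PM) holds, then

$$R_{n,B}(\mathcal{F}) \le M \left[ v L U \log\frac{CLU}{\sigma'} + \sqrt{v n \sigma'^2 \log\frac{CLU}{\sigma'}} \right] + n \frac{E_A[\tau_A^p]}{L^{p-1}},$$

2. if (EM) holds, then

$$R_{n,B}(\mathcal{F}) \le M \left[ v L U \log\frac{CLU}{\sigma'} + \sqrt{v n \sigma'^2 \log\frac{CLU}{\sigma'}} \right] + n U \exp(-L\lambda/2) C_\lambda,$$

where $C_\lambda = 2 E_A[\exp(\tau_A \lambda)] / \lambda$.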


Theorem (Expectation bound)

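Let $\mathcal{F}$ be a countable class of measurable functions bounded by $U$. It holds that

$$E_\nu\left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E_\pi[f]) \right] \le 4 R_{n,B}(\mathcal{F}) + 4 \sup_{f \in \mathcal{F}} |E_\pi[f]| \sqrt{n E_A[\tau_A^2]} + 2U (E_\nu[\tau_A] + E_A[\tau_A]),$$

where $\nu$ stands for the initial measure.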

Concentration inequality for Markov chains Concentration inequality for Markov Chains

Main concentration inequality

Applying Talagrand's or McDiarmid's inequality yields a concentration bound for the empirical process.

Theorem (Concentration bound via Rademacher control)

Assume that (EM) holds and that there exists $\lambda > 0$ such that $E_\nu[\exp(\lambda \tau_A)] < \infty$. Let $\mathcal{F}$ be a countable class of measurable functions bounded by $U$. Let $R_n$ and $\sigma'^2$ be such that

$$R_n \ge 4 R_{n,B}(\mathcal{F}) + 4 \sup_{f \in \mathcal{F}} |E_\pi[f]| \sqrt{n E_A[\tau_A^2]} + 2U (E_\nu[\tau_A] + E_A[\tau_A]),$$

$$\sigma'^2 \ge \sup_{f \in \mathcal{F}} E_A\left( \sum_{i=1}^{\tau_A} f(X_i) \right)^2.$$

Then, for some universal constant $K > 0$, and for $\tau > 0$ depending on the tails of the regeneration time,


with probability 1−δ we have,

sup

f∈F

Xn

i=1

(f(Xi) −Eπ(f))

≤KRn+

max √

0 s

Klog K

δ

,log K

δ

τ3Ulog(n) EAA]

! .


Generalization to m > 1

If $m > 1$, then the blocks $(B_k)$ are 1-dependent (see for instance Chen (1999), Corollary 2.3).

Split the sum as follows:

$$\sum_{k=0}^{l_n} f(B_k) = \sum_{\substack{k=0 \\ k \text{ even}}}^{l_n} f(B_k) + \sum_{\substack{k=0 \\ k \text{ odd}}}^{l_n} f(B_k).$$

Because of the 1-dependence property, the blocks within each sum are independent.

This reduces the problem to two sums of at most $n/2$ independent blocks, which can be treated separately.


Application to kernel density estimation


Application to density kernel estimation

Given $n \ge 1$ observations of a Markov chain $X \subset \mathbb{R}^d$, the kernel density estimator of the stationary measure $\pi$ is given by

$$\hat{\pi}_n(x) = n^{-1} \sum_{i=1}^{n} K((x - X_i)/h_n) / h_n^d,$$

where $K : \mathbb{R}^d \to \mathbb{R}$, called the kernel, is such that $\int K(x) \, dx = 1$ and $(h_n)_{n \ge 1}$ is a positive sequence of bandwidths.

The bias term, $E\hat{\pi}_n - \pi$, is classically treated by using techniques from functional analysis (regularity of the density).

The variance term, $\hat{\pi}_n - E\hat{\pi}_n$, is treated using empirical process techniques in the case of independent random variables.
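The estimator is straightforward to compute from a simulated path. The following is a minimal sketch in dimension d = 1, assuming an Epanechnikov kernel and a Gaussian AR(1) chain as a hypothetical example; the bandwidth rate is a common rule of thumb and is not taken from the slides.

```python
import random

def epanechnikov(u):
    """Kernel supported on [-1, 1] that integrates to 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, sample, h, kernel=epanechnikov):
    """pi_hat_n(x) = n^{-1} sum_i K((x - X_i) / h) / h^d, here with d = 1."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def ar1_path(n, rho=0.5, seed=0):
    """Hypothetical example chain: X_{i+1} = rho * X_i + standard Gaussian noise."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

sample = ar1_path(5000)
h = len(sample) ** (-1 / 5)   # rule-of-thumb rate, an assumption of this sketch
density_at_zero = kde(0.0, sample, h)
```

For this chain the stationary law is Gaussian with variance 1 / (1 - rho^2), so the estimate at 0 can be compared with the known stationary density as a sanity check.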


Hypotheses

We shall consider kernel functions $K : \mathbb{R}^d \to \mathbb{R}$ taking one of the two following forms:

$$\text{(i)} \; K(x) = K^{(0)}(|x|), \quad \text{or} \quad \text{(ii)} \; K(x) = \prod_{k=1}^{d} K^{(0)}(x_k),$$

where $K^{(0)}$ is a bounded function of bounded variation with support $[-1, 1]$.

From Nolan and Pollard (1987), the class of functions

$$\mathcal{K} = \{y \mapsto K((x - y)/h) : h > 0, \; x \in \mathbb{R}^d\}$$

is a uniformly bounded VC class.


Theorem

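Suppose that $\pi$ is bounded, that $h_n \to 0$ and that there exists $\beta > 0$ such that $h_n \ge n^{-\beta}$.

1. If (PM) holds for $p > 2$ and $0 < \beta (p/(p-1)) < 1/d$, we have

$$E_\nu \sup_{x \in \mathbb{R}^d} |\hat{\pi}_n(x) - E_\pi[\hat{\pi}_n(x)]| = O\left( \sqrt{\frac{\log(n h_n^{-1})}{n h_n^{dp/(p-1)}}} \right).$$

2. If (EM) holds and $0 < \beta < 1/d$, we have

$$E_\nu \sup_{x \in \mathbb{R}^d} |\hat{\pi}_n(x) - E_\pi[\hat{\pi}_n(x)]| = O\left( \sqrt{\frac{\log(n)^2}{n h_n^d}} \right).$$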


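Getting rid of a log(n)

Theorem

Under (EM), suppose that $\pi$ is bounded, that $h_n \to 0$ and there exists $\beta > 0$ such that $\sqrt{|\log(h_n)| / (n h_n^d)} \to 0$. If there exist $p > 2$ and $C > 0$ such that for all $x \in E$, $\pi(x) E_x[\tau_A^p] \le C$, then we have

$$E_\nu \sup_{x \in \mathbb{R}^d} |\hat{\pi}_n(x) - E_\pi[\hat{\pi}_n(x)]| = O\left( \sqrt{\frac{|\log(h_n)|}{n h_n^d}} \right).$$

Main idea: control the variance

$$E_A\left( \sum_{i=1}^{\tau_A} K((x - X_i)/h_n) \right)^2 \le c h_n^d, \quad \text{for all } x \in E. \qquad (1)$$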


Bibliography I

Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability 13, 1000–1034.

Athreya, K. B. and P. Ney (1978). A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc. 245, 493–501.

Azaïs, R., B. Delyon, and F. Portier (2018). Integral estimation based on Markovian design. Advances in Applied Probability 50(3), 833–857.

Bednorz, W., K. Latuszynski, and R. Latala (2008). A regeneration proof of the central limit theorem for uniformly ergodic Markov chains. Electronic Communications in Probability 13, 85–98.

Bertail, P. and S. Clémençon (2004a). Edgeworth expansions for suitably normalized sample mean statistics of atomic Markov chains. Probab. Relat. Fields 130(3), 388–414.

Bertail, P. and S. Clémençon (2004b). Note on the regeneration-based bootstrap for atomic Markov chains. TEST 16, 109–122.

Bertail, P. and S. Clémençon (2010). Sharp bounds for the tails of functionals of Markov chains. Th. Prob. Appl. 54(3), 505–515.

Bertail, P. and S. Clémençon (2011). A renewal approach to Markovian U-statistics. Mathematical Methods of Statistics 20(2), 79–105.


Bibliography II

Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

Bousquet, O., S. Boucheron, and G. Lugosi (2003). Introduction to statistical learning theory. In Summer School on Machine Learning, pp. 169–207. Springer.

Dedecker, J. and S. Gouëzel (2015). Subgaussian concentration inequalities for geometrically ergodic Markov chains. Electronic Communications in Probability 20.

Einmahl, U. and D. M. Mason (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33(3), 1380–1403.

Giné, E. and A. Guillou (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38(6), 907–921. En l’honneur de J. Bretagnolle, D. Dacunha-Castelle, I. Ibragimov.

Levental, S. (1988). Uniform limit theorems for Harris recurrent Markov chains. Probability Theory and Related Fields 80(1), 101–118.

McDiarmid, C. (1998). Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed (Eds.), Probabilistic Methods for Algorithmic Discrete Mathematics, Volume 16 of Algorithms and Combinatorics, pp. 195–248. Springer Berlin Heidelberg.

Meyn, S. and R. L. Tweedie (2009). Markov chains and stochastic stability (Second ed.). Cambridge University Press. With a prologue by Peter W. Glynn.


Bibliography III

Nummelin, E. (1978). A splitting technique for Harris recurrent Markov chains. Z. Wahrsch. Verw. Gebiete 43(4), 309–318.

Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab. 20, 32 pp.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones mathematicae 126(3), 505–563.
