Concentration inequalities for Markov chains with statistical applications

François Portier

Télécom ParisTech

Joint work with Patrice Bertail

May 14th

Outline

1 Regeneration for Markov chains
   Harris recurrence
   The atomic case
   Nummelin splitting trick

2 Concentration inequality for Markov chains
   Rademacher complexity: the i.i.d. case
   Rademacher complexity for Markov chains
   Assumptions
   Concentration inequality for Markov chains

3 Application to kernel density estimation

Regeneration for Markov chains: Harris recurrence

General framework: Harris recurrent Markov chains

Object of interest: $X = (X_n)_{n \in \mathbb{N}}$, a $\psi$-irreducible, Harris recurrent, aperiodic, time-homogeneous Markov chain valued in a measurable space $(E, \mathcal{E})$, with transition probability $\Pi(x, dy)$ and initial distribution $\nu$.

Notation:

- $P_\nu$ (respectively, $P_x$ for $x$ in $E$) is the probability measure such that $X_0 \sim \nu$ (resp., conditioned upon $X_0 = x$);

- $E_A[\cdot]$ denotes the expectation given that $\{X_0 \in A\}$.

Main tool: regeneration properties of Markov chains.

Reference: Meyn and Tweedie (2009).

Regeneration for Markov chains: the atomic case

Regenerative chains

Definition

$X$ is called regenerative when it possesses an accessible atom, i.e., a measurable set $A$ such that $\psi(A) > 0$ and $\Pi(x, \cdot) = \Pi(y, \cdot)$ for all $x, y$ in $A$.

Define the hitting times

$$\tau_A = \tau_A(1) = \inf\{n \geq 1 : X_n \in A\},$$

$$\tau_A(j) = \inf\{n > \tau_A(j-1) : X_n \in A\}, \quad j \geq 2.$$

Idea: the sample paths of the chain may be divided into i.i.d. blocks of random length corresponding to consecutive visits to $A$:

$$B_1 = (X_{\tau_A(1)+1}, \ldots, X_{\tau_A(2)}), \; \ldots, \; B_j = (X_{\tau_A(j)+1}, \ldots, X_{\tau_A(j+1)}), \; \ldots$$
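The block construction above can be sketched numerically. This is a minimal illustration only, assuming a three-state chain with a hypothetical transition matrix `P` and taking the single state 0 as the atom (a single accessible state of a finite chain is always an atom). The helper names `simulate_chain` and `regeneration_blocks` are ours, not from the talk.

```python
import numpy as np

def simulate_chain(P, x0, n, rng):
    """Simulate X_0, ..., X_n from a transition matrix P (rows sum to 1)."""
    path = np.empty(n + 1, dtype=int)
    path[0] = x0
    for t in range(n):
        path[t + 1] = rng.choice(len(P), p=P[path[t]])
    return path

def regeneration_blocks(path, atom):
    """Return the complete blocks (X_{tau(j)+1}, ..., X_{tau(j+1)}) between visits to the atom."""
    # Successive hitting times tau_A(1) < tau_A(2) < ... (time 0 is excluded, as in the definition)
    hits = [t for t in range(1, len(path)) if path[t] == atom]
    return [path[hits[j] + 1 : hits[j + 1] + 1] for j in range(len(hits) - 1)]

rng = np.random.default_rng(0)
# Hypothetical transition matrix, chosen only for illustration
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.2, 0.5],
              [0.4, 0.0, 0.6]])
path = simulate_chain(P, x0=0, n=2000, rng=rng)
blocks = regeneration_blocks(path, atom=0)
lengths = np.array([len(b) for b in blocks])  # block sizes, i.e. successive return times to the atom
```

Each block ends with a visit to the atom, and the average of `lengths` gives a rough empirical counterpart of the mean return time to the atom.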

Example 1: Cramér-Lundberg model with a dividend barrier

- Claim arrivals: $\{N(t), t \geq 0\}$, with $N(0) = 0$, is a homogeneous Poisson process with rate $\lambda$ counting the number of claims in the interval $[0, t]$; $(T_n)_{n \in \mathbb{N}}$ denotes the claim times.

- Claim sizes: $U_i$, $i = 1, 2, \ldots$, are i.i.d. random variables with cdf $F$, and the aggregate claim amount is

$$S(t) = \sum_{i=1}^{N(t)} U_i.$$

- Constant premium rate (price per unit of time) $c$.

- The reserve of the company evolves as

$$R(t) = u + ct - S(t).$$

- A constant barrier $b$, above which profit is redistributed:

$$X(t) = (u + ct - S(t)) \wedge b.$$

The embedded chain $X_n = X(T_n)$ is an atomic Markov chain with an atom at $b$.
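A rough simulation can make the atom at the barrier visible. The sketch below rests on assumptions that are ours rather than the slide's: the reserve is observed just before each claim, it grows at rate `c` between claims but is capped at `b`, claim sizes are exponential, and ruin is ignored. The function name and all parameter values are hypothetical.

```python
import numpy as np

def embedded_reserve_chain(n, u, b, c, lam, claim_mean, rng):
    """Reserve observed just before each claim, capped at the barrier b.

    Assumed dynamics (illustrative only): after a claim of size U the reserve
    grows at rate c until the next claim and never exceeds b.
    Ruin (reserve below 0) is not handled in this sketch.
    """
    x = np.empty(n + 1)
    x[0] = min(u, b)
    for k in range(n):
        claim = rng.exponential(claim_mean)   # claim size, exponential by assumption
        wait = rng.exponential(1.0 / lam)     # inter-arrival time of the Poisson process
        x[k + 1] = min(x[k] - claim + c * wait, b)
    return x

rng = np.random.default_rng(0)
b = 8.0
x = embedded_reserve_chain(n=5000, u=5.0, b=b, c=1.5, lam=1.0, claim_mean=1.0, rng=rng)
visits_to_atom = np.flatnonzero(x == b)       # indices at which the chain sits exactly at the barrier
```

Because the cap returns exactly `b`, visits to the atom can be detected by equality, and the stretches between consecutive entries of `visits_to_atom` play the role of regeneration blocks.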

Figure: Cramér-Lundberg model with a dividend barrier at b, ruin at 0 (horizontal axis: time; vertical axis: company reserves X(t)).

Regeneration for Markov chains: Nummelin splitting trick

Harris chains: Nummelin splitting trick

Definition

A set $S \in \mathcal{E}$ is said to be small for $X$ if there exist $m \in \mathbb{N}^*$, $\delta > 0$ and a probability measure $\Phi$ on $(E, \mathcal{E})$ (with support $S$) such that, for all $x \in S$ and $B \in \mathcal{E}$,

$$\Pi^m(x, B) \geq \delta \Phi(B).$$

Property: Harris recurrent chains have small sets.

A simplification: we start by considering the case $m = 1$.

The Nummelin splitting trick

Nummelin (1978); Athreya and Ney (1978)

Any Harris recurrent Markov chain can be made atomic!

Expand the initial chain $(X_n)_{n \geq 1}$ to $(X_n, Y_n)_{n \geq 1}$, where the $(Y_n)_{n \geq 1}$ are i.i.d. Bernoulli $\mathcal{B}(\delta)$. Consider the randomization:

- if $X_n \in S$ and $Y_n = 1$ (which happens with probability $\delta \in \,]0,1[$), then $X_{n+1} \sim \Phi$;

- if $X_n \in S$ and $Y_n = 0$, then $X_{n+1} \sim (1-\delta)^{-1} (\Pi(X_n, \cdot) - \delta \Phi(\cdot))$;

- if $X_n \notin S$, then $X_{n+1} \sim \Pi(X_n, \cdot)$.

Property: the set $A = S \times \{1\}$ is an atom for the bivariate Markov chain $(X, Y)$. This chain inherits all its communication and stochastic stability properties from $X$ (refer to Chapter 14 of Meyn and Tweedie (2009)).
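The randomization can be sketched on a toy example. We assume a finite chain with a strictly positive, hypothetical transition matrix `P`, so that the whole state space is small with $m = 1$; $\delta$ and $\Phi$ are then read off the column minima of `P`. The function names are ours, and this is an illustration rather than the construction used in the talk for general state spaces.

```python
import numpy as np

def doeblin_minorization(P):
    """Return (delta, Phi) such that P[x] >= delta * Phi for every row x."""
    col_min = P.min(axis=0)            # equals delta * Phi, componentwise
    delta = col_min.sum()
    return delta, col_min / delta

def split_chain(P, x0, n, rng):
    """Simulate the split chain (X_k, Y_k) when the small set is the whole state space."""
    delta, phi = doeblin_minorization(P)
    # Residual kernel (Pi(x, .) - delta * Phi) / (1 - delta); its rows are probability vectors
    residual = (P - P.min(axis=0)) / (1.0 - delta)
    xs = np.empty(n + 1, dtype=int)
    ys = np.empty(n + 1, dtype=int)
    xs[0] = x0
    for k in range(n):
        ys[k] = rng.random() < delta                   # Y_k ~ Bernoulli(delta)
        law = phi if ys[k] == 1 else residual[xs[k]]   # regenerate from Phi when Y_k = 1
        xs[k + 1] = rng.choice(len(P), p=law)
    ys[n] = rng.random() < delta
    return xs, ys

# Hypothetical transition matrix with positive entries (assumption for this sketch)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
xs, ys = split_chain(P, x0=0, n=20000, rng=np.random.default_rng(1))
regeneration_times = np.flatnonzero(ys == 1)           # visits of (X, Y) to the atom S x {1}
```

Mixing the two branches with weights $\delta$ and $1 - \delta$ gives back the original kernel, so the first coordinate `xs` is still a chain with transition matrix `P`.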

Figure: Splitting a financial time series exhibiting thresholds and conditional heteroscedasticity, with $n = 1000$, $\alpha_1 = 0.95$, $\alpha_2 = 0.45$, $\beta = 0.35$ and $\sigma^2 = 1$.

Some literature

- The splitting technique was proposed in Nummelin (1978) and Athreya and Ney (1978).

- Functional CLT: Levental (1988).

- A nice proof of the CLT: Bednorz et al. (2008).

- Edgeworth expansions, large deviations, U-statistics, bootstrap: Bertail and Clémençon (2004a, 2004b, 2010, 2011).

- Concentration inequalities: Adamczak (2008); Dedecker and Gouëzel (2015); Paulin (2015).

Concentration inequality for Markov chains: Rademacher complexity (the i.i.d. case)

Empirical process indexed by functions: the i.i.d. case

Let $(\Omega, \mathscr{F}, P)$ be a probability space and suppose that $X = (X_i)_{i \in \mathbb{N}}$ is a sequence of random variables on $(\Omega, \mathscr{F}, P)$ valued in $(E, \mathcal{E})$. Let $\mathcal{F}$ denote a countable class of real-valued measurable functions defined on $E$. For $n \in \mathbb{N}$, define

$$Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E[f(X_i)]).$$

The random variable $Z$ plays a crucial role in machine learning and statistics (Boucheron et al., 2013; Bousquet et al., 2003).

Rademacher complexity: $(X_i)_{i \in \mathbb{N}}$ i.i.d.

The Rademacher complexity associated with $\mathcal{F}$ is given by

$$R_{n,\xi}(\mathcal{F}) = E \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \epsilon_i f(X_i),$$

where the $(\epsilon_i)_{i \in \mathbb{N}}$ are i.i.d. Rademacher random variables, i.e., taking the values $+1$ and $-1$ with probability $1/2$ each, independent of $X$.
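The definition can be explored by Monte Carlo. The sketch below estimates the Rademacher average conditionally on one simulated sample, which is a simplification of the unconditional expectation above. The finite class of indicators $x \mapsto 1\{x \leq t\}$ on a grid of thresholds, the uniform sample and the function name are assumptions made for illustration.

```python
import numpy as np

def rademacher_complexity_mc(values, n_draws, rng):
    """Monte Carlo estimate of E sup_f sum_i eps_i f(X_i), given the sample.

    values: array of shape (n_functions, n) holding f(X_i) for each f in a finite class.
    """
    n = values.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))   # i.i.d. Rademacher signs
    sups = (eps @ values.T).max(axis=1)                # supremum over the class, one per draw
    return sups.mean()

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=n)                                # i.i.d. sample (assumption of this sketch)
thresholds = np.linspace(0.0, 1.0, 21)
values = (x[None, :] <= thresholds[:, None]).astype(float)   # f_t(x) = 1{x <= t}
estimate = rademacher_complexity_mc(values, n_draws=2000, rng=rng)
```

For a class of this kind the estimate is expected to grow like $\sqrt{n}$ rather than $n$, which is the behaviour the bounds in this section quantify.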

Empirical process indexed by functions: the i.i.d. case

Key tools:

- the function $F$ is an envelope for the class $\mathcal{F}$ if $|f(x)| \leq F(x)$ for all $x \in E$ and $f \in \mathcal{F}$;

- for a metric space $(\mathcal{F}, d)$, the covering number $N(\epsilon, \mathcal{F}, d)$ is the minimal number of balls of size $\epsilon$ needed to cover $\mathcal{F}$. The metric that we use here is induced by the norm $\|f\|_{L_2(Q)} = \{\int f^2 \, dQ\}^{1/2}$.

Polynomial entropy: VC-type classes

Hypothesis on the class $\mathcal{F}$

$\mathcal{F}$ is a class of measurable functions $E \to \mathbb{R}$ of VC type (Vapnik-Chervonenkis type) for an envelope $F$ and admissible characteristic $(C, v)$ (positive constants) such that $C \geq (3\sqrt{e})^v$ and $v \geq 1$; that is, for every probability measure $Q$ on $(E, \mathcal{E})$ with $0 < \|F\|_{L_2(Q)} < \infty$ and every $0 < \epsilon < 1$,

$$N(\epsilon \|F\|_{L_2(Q)}, \mathcal{F}, \|\cdot\|_{L_2(Q)}) \leq C \epsilon^{-v}.$$

NB: the class is countable to avoid measurability issues (but the non-countable case may be handled similarly by using outer probability and additional measurability assumptions; see van der Vaart and Wellner, 2001).

Giné and Guillou's (2002) key theorem

Theorem (i.i.d. case, Giné and Guillou (2002))

Let $\mathcal{F}$ be a measurable uniformly bounded VC class of functions defined on $E$ with envelope $F$ and characteristic $(C, v)$. Let $U > 0$ be such that $|f(x)| \leq U$ for all $x \in E$ and $f \in \mathcal{F}$. Let $\sigma^2$ be such that $E[f(X)^2] \leq \sigma^2$ for all $f \in \mathcal{F}$. Then, whenever $0 < \sigma \leq U$, it holds that

$$R_{n,\xi}(\mathcal{F}) \leq M \left[ v U \log\frac{CU}{\sigma} + \sqrt{v n \sigma^2 \log\frac{CU}{\sigma}} \right],$$

where $M$ is a universal constant.

Very useful in association with Talagrand (1996) or McDiarmid (1998) (see also Einmahl and Mason (2005)).

Purpose of this work: (i) prove concentration inequalities on $Z$ for Markov chains under assumptions on $f$ similar to those of the i.i.d. case; (ii) apply them to uniform results on density estimation and MCMC.

Concentration inequality for Markov chains: Rademacher complexity for Markov chains

Block decompositions of the empirical process

Decompose the chain according to the elements $X_i$ that belong to the complete blocks $B_1, \ldots, B_{l_n - 1}$ and the remaining elements $X_i$ in $B_0$ and $B_{l_n}$:

$$\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E_\pi[f]) \leq \sup_{f \in \mathcal{F}} \sum_{i=\tau_A(1)+1}^{\tau_A(l_n)} (f(X_i) - E_\pi[f]) + \sup_{f \in \mathcal{F}} \sum_{i=1}^{\tau_A} (f(X_i) - E_\pi[f]) + \sup_{f \in \mathcal{F}} \sum_{i=\tau_A(l_n)+1}^{n} (f(X_i) - E_\pi[f])$$

($B_0$ and $B_{l_n}$ will be treated separately). Because $\tau_A(l_n) - \tau_A(1) = \sum_{k=1}^{l_n - 1} \ell(B_k)$, where $\ell(B_k)$ denotes the size of block $k$, it holds that

$$\sum_{i=\tau_A(1)+1}^{\tau_A(l_n)} (f(X_i) - E_\pi[f]) = \sum_{k=1}^{l_n - 1} (f^0(B_k) - \ell(B_k) E_\pi[f]),$$

where $f^0(B_k) = \sum_{i=\tau_A(k)+1}^{\tau_A(k+1)} f(X_i)$.
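The rewriting of the middle sum as a sum over complete blocks can be checked numerically. The sketch below assumes a hypothetical three-state chain with state 0 as the atom and uses an arbitrary constant `center` in place of $E_\pi[f]$, since the identity is purely algebraic and holds for any centering constant.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical transition matrix, for illustration only
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.2, 0.5],
              [0.4, 0.0, 0.6]])
n, atom = 1000, 0
path = np.empty(n + 1, dtype=int)
path[0] = 1
for t in range(n):
    path[t + 1] = rng.choice(3, p=P[path[t]])

f = np.cos        # any real-valued function on the state space
center = 0.3      # stands in for E_pi[f]; the identity holds for any constant

# Visits to the atom among X_1, ..., X_n: tau_A(1) < ... < tau_A(l_n)
hits = np.flatnonzero(path[1:] == atom) + 1

# Left-hand side: centered sum over i = tau_A(1) + 1, ..., tau_A(l_n)
lhs = np.sum(f(path[hits[0] + 1 : hits[-1] + 1]) - center)

# Right-hand side: sum over complete blocks of (block sum of f) - (block size) * center
blocks = [path[a + 1 : b + 1] for a, b in zip(hits[:-1], hits[1:])]
rhs = sum(f(B).sum() - len(B) * center for B in blocks)
```

The two quantities `lhs` and `rhs` agree up to floating-point rounding, whatever the value of `center`.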

Block decompositions of the empirical process

Two main difficulties:

- $l_n$ is random; it is very easy to get rid of this randomness asymptotically, since $l_n \sim n / E_A[\tau_A]$. It is much more complicated for finite $n$, because $l_n$ is correlated with the $f^0(B_k)$;

- even if $f$ is bounded, $f^0$ is not, and it depends on the behavior of $\tau_A$.

Block Rademacher complexity of the class $\mathcal{F}$

The block Rademacher complexity of the class $\mathcal{F}$ is

$$R_{n,B}(\mathcal{F}) = E_A \sup_{f \in \mathcal{F}} \sum_{k=1}^{n} \epsilon_k f^0(B_k),$$

where the $(\epsilon_k)_{k \in \mathbb{N}}$ are Rademacher random variables independent of the blocks $(B_k)_{k \in \mathbb{N}}$.

Issues:

- how to control this quantity with the original VC complexity of the class;

- how to control the concentration of $Z$ with this Rademacher complexity.
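A Monte Carlo sketch may help fix ideas about the block version. It is computed conditionally on the blocks observed along one simulated path, not under $E_A$ with exactly $n$ blocks as in the definition. The chain, the atom (state 0) and the finite class of state indicators are hypothetical choices, and the helper names are ours.

```python
import numpy as np

def block_sums(path, atom, f):
    """Sums of f over each complete regeneration block of the path."""
    hits = np.flatnonzero(path[1:] == atom) + 1
    return np.array([f(path[a + 1 : b + 1]).sum() for a, b in zip(hits[:-1], hits[1:])])

def block_rademacher_mc(block_values, n_draws, rng):
    """Estimate E sup_f sum_k eps_k * (block sum of f), given the observed blocks.

    block_values: array of shape (n_functions, n_blocks).
    """
    eps = rng.choice([-1.0, 1.0], size=(n_draws, block_values.shape[1]))
    return (eps @ block_values.T).max(axis=1).mean()

rng = np.random.default_rng(2)
# Hypothetical transition matrix, for illustration only
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.2, 0.5],
              [0.4, 0.0, 0.6]])
n = 3000
path = np.empty(n + 1, dtype=int)
path[0] = 0
for t in range(n):
    path[t + 1] = rng.choice(3, p=P[path[t]])

# Finite class of indicators f_s(x) = 1{x = s}, s = 0, 1, 2 (assumption of this sketch)
block_values = np.array([
    block_sums(path, 0, lambda x, s=s: (x == s).astype(float)) for s in (0, 1, 2)
])
estimate = block_rademacher_mc(block_values, n_draws=2000, rng=rng)
```

Even though each indicator is bounded by 1, its block sums are only bounded by the block length, which is the unboundedness issue raised on the previous slide.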

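Block Rademacher complexity of the class $\mathcal{F}$

For this, define the torus $E^0 = \cup_{k=1}^{\infty} E^k$ and let the occupation measure $M$ be given by

$$M(B, dy) = \sum_{x \in B} \delta_x(y), \quad \text{for every } B \in E^0.$$

For any function $f : E \to \mathbb{R}$, define the corresponding block function $f^0 : E^0 \to \mathbb{R}$ given by

$$f^0(B) = \int f(y) M(B, dy) = \sum_{x \in B} f(x).$$

For any class $\mathcal{F}$ of real-valued functions defined on $E$, denote $\mathcal{F}^0 = \{f^0 : f \in \mathcal{F}\}$.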

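Change of measure

Let $Q^0$ denote a probability measure on $(E^0, \mathcal{E}^0)$. The measure $Q$ defined by

$$Q(A) = E_{Q^0}\left[ \ell(B) \int_A M(B, dy) \right] / Q^0(\ell^2), \quad \text{for every } A \in \mathcal{E},$$

is a probability measure on $(E, \mathcal{E})$.

Lemma

Let $Q^0$ be a probability measure on $(E^0, \mathcal{E}^0)$ such that $0 < \|\ell\|_{L_2(Q^0)} < \infty$. Then we have, for every $0 < \epsilon < \infty$,

$$N(\epsilon \|\ell\|_{L_2(Q^0)}, \mathcal{F}^0, L_2(Q^0)) \leq N(\epsilon, \mathcal{F}, L_2(Q)).$$

Moreover, if $\mathcal{F}$ is VC with constant envelope $U$ and characteristic $(C, v)$, then $\mathcal{F}^0$ is VC with envelope $U\ell$ and characteristic $(C, v)$.

Define the truncated version $\mathcal{F}^0 1_{\{\ell \leq L\}} = \{f^0 1_{\{\ell \leq L\}} : f \in \mathcal{F}\}$; it remains VC with envelope $UL$.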

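Idea of the proof (Jensen inequality)

$$Q^0((f^0)^2) = E_{Q^0}\left[ \left( \int f(y) M(B, dy) \right)^2 \right] \leq E_{Q^0}\left[ \ell(B) \int f(y)^2 M(B, dy) \right] = Q(f^2) Q^0(\ell^2).$$

Apply this inequality to the function

$$f^0(B) - f_k^0(B) = \int (f(y) - f_k(y)) M(B, dy),$$

where the $f_k$ are the centers of an $\epsilon$-cover of the space $\mathcal{F}$, so that $\|f - f_k\|_{L_2(Q)} \leq \epsilon$.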

Concentration inequality for Markov chains: Assumptions

(PM) There exists $p > 1$ such that $E_A[\tau_A^p] < \infty$.

(EM) There exists $\lambda > 0$ such that $E_A[\exp(\tau_A \lambda)] < \infty$.

Remarks

- (EM): Condition (EM) is equivalent to each of the following assertions: (i) the geometric ergodicity of the chain $X$, (ii) the (uniform) Doeblin condition, and (iii) the Foster-Lyapunov drift condition (see Theorem 16.0.2 in Meyn and Tweedie (2009) for the details).

- Mixing and (PM): the relationship between (PM) and the rate of decay of mixing coefficients is investigated in Bolthausen (1982). This condition is typically fulfilled as soon as the sequence of strong mixing coefficients decreases at an arithmetic rate $n^{-s}$, for some $s > p - 1$.

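Main results

Theorem (Block Rademacher complexity)

Let $\mathcal{F}$ be VC with constant envelope $U$ and characteristic $(C, v)$. Let $(\sigma^0)^2$ be such that $E_A\left[ \sum_{i=1}^{\tau_A} f(X_i)^2 \right] \leq (\sigma^0)^2$ for all $f \in \mathcal{F}$. For some universal constant $M > 0$ and any $L$ such that $0 < \sigma^0 \leq LU$:

1. if (PM) holds, then

$$R_{n,B}(\mathcal{F}) \leq M \left[ v L U \log\frac{CLU}{\sigma^0} + \sqrt{v n (\sigma^0)^2 \log\frac{CLU}{\sigma^0}} \right] + \frac{n E_A[\tau_A^p]}{L^{p-1}};$$

2. if (EM) holds, then

$$R_{n,B}(\mathcal{F}) \leq M \left[ v L U \log\frac{CLU}{\sigma^0} + \sqrt{v n (\sigma^0)^2 \log\frac{CLU}{\sigma^0}} \right] + n U \exp(-L\lambda/2) C_\lambda,$$

where $C_\lambda = 2 E_A[\exp(\tau_A \lambda)] / \lambda$.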

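Theorem (Expectation bound)

Let $\mathcal{F}$ be a countable class of measurable functions bounded by $U$. It holds that

$$E_\nu\left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E_\pi[f]) \right] \leq 4 R_{n,B}(\mathcal{F}) + 4 \sup_{f \in \mathcal{F}} |E_\pi[f]| \sqrt{n E_A[\tau_A^2]} + 2U (E_\nu[\tau_A] + E_A[\tau_A]),$$

where $\nu$ stands for the initial measure.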

Concentration inequality for Markov chains

Main concentration inequality

Applying Talagrand's or McDiarmid's inequality yields a concentration bound for the empirical process.

Theorem (Concentration bound via Rademacher control)

Assume (EM) and that there exists $\lambda > 0$ such that $E_\nu[\exp(\lambda \tau_A)] < \infty$. Let $\mathcal{F}$ be a countable class of measurable functions bounded by $U$. Let $R_n$ and $\sigma^0$ be such that

$$R_n \geq 4 R_{n,B}(\mathcal{F}) + 4 \sup_{f \in \mathcal{F}} |E_\pi[f]| \sqrt{n E_A[\tau_A^2]} + 2U (E_\nu[\tau_A] + E_A[\tau_A]),$$

$$(\sigma^0)^2 \geq \sup_{f \in \mathcal{F}} E_A\left[ \left( \sum_{i=1}^{\tau_A} f(X_i) \right)^2 \right].$$

Then, for some universal constant $K > 0$ and for $\tau > 0$ depending on the tails of the regeneration time, with probability $1 - \delta$ we have

$$\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} (f(X_i) - E_\pi[f]) \leq K R_n + \max\left( \sqrt{n}\, \sigma^0 \sqrt{K \log\frac{K}{\delta}}, \; \log\frac{K}{\delta} \, \frac{\tau^3 U \log(n)}{E_A[\tau_A]} \right).$$

Generalization to $m > 1$

If $m > 1$, then the blocks $(B_i)$ are 1-dependent (see for instance Chen (1999), Corollary 2.3).

Split the sum as follows:

$$\sum_{k=0}^{l_n} f(B_k) = \sum_{k=0,\, k \text{ even}}^{l_n} f(B_k) + \sum_{k=0,\, k \text{ odd}}^{l_n} f(B_k).$$

Because of the 1-dependence property, the blocks within each sum are independent.

This reduces the problem to two sums of at most $n/2$ independent blocks, which can be treated separately.

Application to kernel density estimation

Given $n \geq 1$ observations of a Markov chain $X \subset \mathbb{R}^d$, the kernel density estimator of the stationary measure $\pi$ is given by

$$\hat{\pi}_n(x) = n^{-1} \sum_{i=1}^{n} K((x - X_i)/h_n) / h_n^d,$$

where $K : \mathbb{R}^d \to \mathbb{R}$, called the kernel, is such that $\int K(x) \, dx = 1$, and $(h_n)_{n \geq 1}$ is a positive sequence of bandwidths.

- The bias term, $E_\pi[\hat{\pi}_n] - \pi$, is classically treated by using techniques from functional analysis (regularity of the density).

- The variance term, $\hat{\pi}_n - E_\pi[\hat{\pi}_n]$, is treated using empirical process techniques in the case of independent random variables.
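The estimator is easy to try on a chain whose stationary density is known. The sketch below assumes $d = 1$, a Gaussian AR(1) chain with coefficient `rho` (stationary law $N(0, 1/(1 - \rho^2))$), an Epanechnikov kernel supported on $[-1, 1]$, and the bandwidth $h_n = n^{-1/5}$. All of these are illustrative choices of ours, not settings taken from the talk.

```python
import numpy as np

def epanechnikov(u):
    """Bounded kernel with support [-1, 1] that integrates to 1."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def kde(grid, sample, h):
    """Kernel density estimate n^{-1} sum_i K((x - X_i)/h) / h on a grid, for d = 1."""
    u = (grid[:, None] - sample[None, :]) / h
    return epanechnikov(u).mean(axis=1) / h

rng = np.random.default_rng(0)
n, rho = 5000, 0.5
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.normal()        # AR(1) chain (assumed example)

grid = np.linspace(-4.0, 4.0, 81)
pi_hat = kde(grid, x, h=n ** (-1 / 5))

# Stationary density of the AR(1) chain: centered Gaussian with variance 1 / (1 - rho^2)
sd = 1.0 / np.sqrt(1.0 - rho**2)
pi_true = np.exp(-0.5 * (grid / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
sup_error = np.abs(pi_hat - pi_true).max()      # sup-norm error over the grid
```

Here `sup_error` mixes the bias and variance terms; the uniform results of this section concern the variance part, that is, the deviation of the estimator from its expectation.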

Hypotheses

We shall consider kernel functions $K : \mathbb{R}^d \to \mathbb{R}$ taking one of the two following forms:

$$\text{(i) } K(x) = K^{(0)}(|x|), \quad \text{or} \quad \text{(ii) } K(x) = \prod_{k=1}^{d} K^{(0)}(x_k),$$

where $K^{(0)}$ is a bounded function of bounded variation with support $[-1, 1]$.

From Nolan and Pollard (1987), the class of functions

$$\mathcal{K} = \{y \mapsto K((x - y)/h) : h > 0, \; x \in \mathbb{R}^d\}$$

is a uniformly bounded VC class.

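Theorem

Suppose that $\pi$ is bounded, that $h_n \to 0$, and that there exists $\beta > 0$ such that $h_n \geq n^{-\beta}$.

1. If (PM) holds for $p > 2$ and $0 < \beta (p/(p-1)) < 1/d$, we have

$$E_\nu\left[ \sup_{x \in \mathbb{R}^d} |\hat{\pi}_n(x) - E_\pi[\hat{\pi}_n(x)]| \right] = O\left( \sqrt{\frac{\log(n h_n^{-1})}{n h_n^{dp/(p-1)}}} \right).$$

2. If (EM) holds and $0 < \beta < 1/d$, we have

$$E_\nu\left[ \sup_{x \in \mathbb{R}^d} |\hat{\pi}_n(x) - E_\pi[\hat{\pi}_n(x)]| \right] = O\left( \sqrt{\frac{\log(n)^2}{n h_n^d}} \right).$$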

Getting rid of a $\log(n)$

Theorem

Under (EM), suppose that $\pi$ is bounded, that $h_n \to 0$, and that $\sqrt{|\log(h_n)| / (n h_n^d)} \to 0$. If there exist $p > 2$ and $C > 0$ such that $\pi(x) E_x[\tau_A^p] \leq C$ for all $x \in E$, then we have

$$E_\nu\left[ \sup_{x \in \mathbb{R}^d} |\hat{\pi}_n(x) - E_\pi[\hat{\pi}_n(x)]| \right] = O\left( \sqrt{\frac{|\log(h_n)|}{n h_n^d}} \right).$$

Main idea: control the variance,

$$E_A\left[ \left( \sum_{i=1}^{\tau_A} K((x - X_i)/h_n) \right)^2 \right] \leq c h_n^d, \quad \text{for all } x \in E. \quad (1)$$

Bibliography

Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability 13, 1000–1034.

Athreya, K. B. and P. Ney (1978). A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc. 245, 493–501.

Azaïs, R., B. Delyon, and F. Portier (2018). Integral estimation based on Markovian design. Advances in Applied Probability 50(3), 833–857.

Bednorz, W., K. Latuszynski, and R. Latala (2008). A regeneration proof of the central limit theorem for uniformly ergodic Markov chains. Electronic Communications in Probability 13, 85–98.

Bertail, P. and S. Clémençon (2004a). Edgeworth expansions for suitably normalized sample mean statistics of atomic Markov chains. Probab. Relat. Fields 130(3), 388–414.

Bertail, P. and S. Clémençon (2004b). Note on the regeneration-based bootstrap for atomic Markov chains. TEST 16, 109–122.

Bertail, P. and S. Clémençon (2010). Sharp bounds for the tails of functionals of Markov chains. Th. Prob. Appl. 54(3), 505–515.

Bertail, P. and S. Clémençon (2011). A renewal approach to Markovian U-statistics. Mathematical Methods of Statistics 20(2), 79–105.

Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

Bousquet, O., S. Boucheron, and G. Lugosi (2003). Introduction to statistical learning theory. In Summer School on Machine Learning, pp. 169–207. Springer.

Dedecker, J. and S. Gouëzel (2015). Subgaussian concentration inequalities for geometrically ergodic Markov chains. Electronic Communications in Probability 20.

Einmahl, U. and D. M. Mason (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33(3), 1380–1403.

Giné, E. and A. Guillou (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38(6), 907–921. En l'honneur de J. Bretagnolle, D. Dacunha-Castelle, I. Ibragimov.

Levental, S. (1988). Uniform limit theorems for Harris recurrent Markov chains. Probability Theory and Related Fields 80(1), 101–118.

McDiarmid, C. (1998). Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed (Eds.), Probabilistic Methods for Algorithmic Discrete Mathematics, Volume 16 of Algorithms and Combinatorics, pp. 195–248. Springer Berlin Heidelberg.

Meyn, S. and R. L. Tweedie (2009). Markov chains and stochastic stability (Second ed.). Cambridge University Press. With a prologue by Peter W. Glynn.

Nummelin, E. (1978). A splitting technique for Harris recurrent Markov chains. Z. Wahrsch. Verw. Gebiete 43(4), 309–318.

Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab. 20, 32 pp.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones mathematicae 126(3), 505–563.