Concentration inequalities for Markov chains with statistical applications
François Portier
Télécom ParisTech Joint work with Patrice Bertail
May, 14th
Outline
1 Regeneration for Markov Chains
  Harris recurrence
  The atomic case
  Nummelin splitting trick
2 Concentration inequality for Markov chains
  Rademacher complexity: the i.i.d. case
  Rademacher complexity for Markov chains
  Assumptions
  Concentration inequality for Markov Chains
3 Application to kernel density estimation
François Portier (Télécom ParisTech) May, 14th 2 / 36
Regeneration for Markov Chains
Regeneration for Markov Chains Harris recurrence
General framework : Harris recurrent Markov chains
Object of interest: $X = (X_n)_{n\in\mathbb{N}}$, a $\psi$-irreducible, Harris recurrent, aperiodic, time-homogeneous Markov chain, valued in a measurable space $(E,\mathcal{E})$ with transition probability $\Pi(x,dy)$ and initial distribution $\nu$.
Notations:
- $P_\nu$ (respectively, $P_x$ for $x$ in $E$) is the probability measure such that $X_0\sim\nu$ (resp., conditioned upon $X_0=x$).
- $E_A[\cdot]$ denotes the expectation given that $\{X_0\in A\}$.
Main tool: regeneration properties of Markov chains.
Reference: the book of Meyn and Tweedie (2009).
Regeneration for Markov Chains The atomic case
Regenerative chains
Definition
$X$ is called regenerative when it possesses an accessible atom, i.e. a measurable set $A$ such that $\psi(A)>0$ and $\Pi(x,\cdot)=\Pi(y,\cdot)$ for all $x,y$ in $A$.
Define the hitting times
$$\tau_A=\tau_A(1)=\inf\{n\ge 1 : X_n\in A\},\qquad \tau_A(j)=\inf\{n>\tau_A(j-1) : X_n\in A\},\quad j\ge 2.$$
Idea: the sample paths of the chain may be divided into i.i.d. blocks of random length corresponding to consecutive visits to $A$:
$$B_1=(X_{\tau_A(1)+1},\dots,X_{\tau_A(2)}),\ \dots,\ B_j=(X_{\tau_A(j)+1},\dots,X_{\tau_A(j+1)}),\ \dots$$
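As a minimal sketch of this block construction (the function name `regeneration_blocks` and the toy path are illustrative assumptions, not part of the talk), the following cuts an observed path into the complete blocks between consecutive visits to the atom:

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def regeneration_blocks(path: Sequence[T], in_atom: Callable[[T], bool]) -> List[List[T]]:
    """Cut X_1, ..., X_n (path[0] plays the role of X_1) into complete blocks.

    Block j collects the states strictly after the j-th visit to the atom,
    up to and including the (j+1)-th visit.
    """
    # 1-based times n >= 1 with X_n in A: tau_A(1) < tau_A(2) < ...
    hits = [n for n, x in enumerate(path, start=1) if in_atom(x)]
    blocks = []
    for start, end in zip(hits, hits[1:]):
        # X_{start+1}, ..., X_{end} in 1-based indexing is path[start:end]
        blocks.append(list(path[start:end]))
    return blocks

# Toy path on {0, 1, 2} with atom A = {0} (made-up data, for illustration only).
path = [2, 0, 1, 1, 0, 2, 0, 0, 1]
blocks = regeneration_blocks(path, lambda x: x == 0)
print(blocks)
```

The states before the first visit and after the last visit are left out on purpose: they form the incomplete first and last segments of the path.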
Example 1: Cramér-Lundberg with a dividend barrier
Claim arrivals: $\{N(t), t\ge 0\}$, with $N(0)=0$, is a homogeneous Poisson process with rate $\lambda$, modeling the number of claims in the interval $[0,t]$; $(T_n)_{n\in\mathbb{N}}$ are the arrival times of the claims.
Claim sizes $U_i$, $i=1,2,\dots$, are i.i.d. random variables with cdf $F$, and
$$S(t)=\sum_{i=1}^{N(t)}U_i.$$
Constant premium rate (price per unit of time) $c$.
The reserve of the company evolves like
$$R(t)=u+ct-S(t).$$
A constant barrier $b$, over which profit is redistributed:
$$X(t)=(u+ct-S(t))\wedge b.$$
The embedded chain $X_n=X(T_n)$ is an atomic Markov chain with an atom at $b$.
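A rough simulation sketch of such an embedded chain is given below. Several modeling choices are assumptions made only for illustration and are not stated on the slide: the reserve is recorded just before each claim, premiums accrue at rate `c` and are capped at `b` between claims, claim sizes are exponential, and the reserve is floored at 0 instead of stopping at ruin. The function name `simulate_barrier_chain` and all numerical values are hypothetical.

```python
import random

def simulate_barrier_chain(n, u, c, b, lam, mean_claim, seed=0):
    """Reserve observed just before each of the first n claims (illustrative model)."""
    rng = random.Random(seed)
    x = min(u, b)
    chain = []
    for _ in range(n):
        wait = rng.expovariate(lam)                 # inter-arrival time of the Poisson process
        x = min(x + c * wait, b)                    # premiums accrue, capped at the barrier b
        chain.append(x)                             # state of the embedded chain
        claim = rng.expovariate(1.0 / mean_claim)   # assumption: exponential claim sizes
        x = max(x - claim, 0.0)                     # assumption: reserve floored at 0
    return chain

b = 8.0
chain = simulate_barrier_chain(n=2000, u=5.0, c=1.5, b=b, lam=1.0, mean_claim=1.0)
visits_to_atom = sum(1 for x in chain if x == b)
print(visits_to_atom, "visits to the atom {b} out of", len(chain), "claims")
```

Under these assumptions, every time the chain sits exactly at `b` its next value has the same distribution regardless of the past, which is the defining property of an atom.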
Figure: Cramér-Lundberg model with a dividend barrier at b, ruin at 0 (horizontal axis: time; vertical axis: company reserves X(t)).
Regeneration for Markov Chains Nummelin splitting trick
Harris chains : Nummelin splitting trick
Definition
A set $S\in\mathcal{E}$ is said to be small for $X$ if there exist $m\in\mathbb{N}^*$, $\delta>0$ and a probability measure $\Phi$ on $(E,\mathcal{E})$ (with support $S$) such that, for all $x\in S$, $B\in\mathcal{E}$,
$$\Pi^m(x,B)\ge\delta\,\Phi(B).$$
Property: Harris recurrent chains have small sets.
A simplification: we start by considering the case $m=1$.
The Nummelin splitting trick
Nummelin (1978); Athreya and Ney (1978)
Any Harris recurrent Markov chain can be made atomic!
Expand the initial chain $(X_n)_{n\ge 1}$ to $(X_n,Y_n)_{n\ge 1}$, where the $(Y_n)_{n\ge 1}$ are i.i.d. according to the Bernoulli distribution $\mathcal{B}(\delta)$. Consider the randomization:
- if $X_n\in S$ and $Y_n=1$ (with probability $\delta\in\,]0,1[$), then $X_{n+1}\sim\Phi$;
- if $X_n\in S$ and $Y_n=0$, then $X_{n+1}\sim(1-\delta)^{-1}(\Pi(X_n,\cdot)-\delta\Phi(\cdot))$.
Property: the set $A=S\times\{1\}$ is an atom for the bivariate Markov chain $(X,Y)$. This chain inherits all its communication and stochastic stability properties from $X$ (refer to Chapt. 14 of Meyn and Tweedie (2009)).
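The randomization can be tried out on a finite state space. The sketch below is only an illustration for the case $m=1$: the transition matrix `P`, the set `S`, and the helper names `minorization` and `split_step` are assumptions introduced here, and $\Phi$ is built from the column-wise minimum of the rows of `P` indexed by `S`.

```python
import random

# Hypothetical 3-state transition matrix (rows sum to 1) and candidate small set.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
S = [0, 1]

def minorization(P, S):
    """Return (delta, Phi), Phi supported on S, with P[x][y] >= delta * Phi[y] for x in S."""
    lower = [min(P[x][y] for x in S) if y in S else 0.0 for y in range(len(P))]
    delta = sum(lower)
    return delta, [v / delta for v in lower]

def split_step(x, P, S, delta, phi, rng):
    """One transition of the split chain; returns (X_{n+1}, regenerated?)."""
    y = 1 if rng.random() < delta else 0           # Y_n ~ Bernoulli(delta)
    if x in S and y == 1:
        weights = phi                               # regeneration: X_{n+1} ~ Phi
    elif x in S:
        # residual kernel (1 - delta)^{-1} (P(x, .) - delta * Phi(.))
        weights = [max(P[x][z] - delta * phi[z], 0.0) / (1.0 - delta)
                   for z in range(len(P))]
    else:
        weights = P[x]                              # outside S: ordinary transition
    nxt = rng.choices(range(len(P)), weights=weights)[0]
    return nxt, (x in S and y == 1)

delta, phi = minorization(P, S)
rng = random.Random(1)
x, regenerations = 0, 0
for _ in range(10_000):
    x, regenerated = split_step(x, P, S, delta, phi, rng)
    regenerations += regenerated
print(delta, phi, regenerations)
```

Averaging the two branches taken from a state in `S` gives back the original row of `P`, so the first coordinate of the split chain keeps the law of the initial chain.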
Figure: Splitting a financial time series exhibiting thresholds and conditional heteroscedasticity, $n=1000$, $\alpha_1=0.95$, $\alpha_2=0.45$, $\beta=0.35$ and $\sigma^2=1$.
Some literature
- The splitting technique was proposed in Nummelin (1978); Athreya and Ney (1978)
- Functional CLT: Levental (1988)
- Nice proof of the CLT: Bednorz et al. (2008)
- Edgeworth expansion, large deviations, U-statistics, bootstrap: Bertail and Clémençon (2004a, 2004b, 2010, 2011)
- Concentration inequalities: Adamczak (2008); Dedecker and Gouëzel (2015); Paulin (2015)
Concentration inequality for Markov chains
Concentration inequality for Markov chains Rademacher complexity : the i.i.d case
Empirical process indexed by functions : the i.i.d case
Let $(\Omega,\mathcal{F},P)$ be a probability space and suppose that $X=(X_i)_{i\in\mathbb{N}}$ is a sequence of random variables on $(\Omega,\mathcal{F},P)$ valued in $(E,\mathcal{E})$. Let $\mathcal{F}$ denote a countable class of real-valued measurable functions defined on $E$. Let $n\in\mathbb{N}$, define
$$Z=\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\big(f(X_i)-E[f(X_i)]\big).$$
The random variable $Z$ plays a crucial role in machine learning and statistics (Boucheron et al., 2013; Bousquet et al., 2003).
Rademacher complexity: $(X_i)_{i\in\mathbb{N}}$ i.i.d.
The Rademacher complexity associated to $\mathcal{F}$ is given by
$$R_{n,\xi}(\mathcal{F})=E\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\varepsilon_i f(X_i),$$
where the $(\varepsilon_i)_{i\in\mathbb{N}}$ are i.i.d. Rademacher random variables, i.e., taking values $+1$ and $-1$ with probability $1/2$, independent from $X$.
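For a finite class, the supremum of the sign-weighted sums can be approximated by Monte Carlo. The sketch below is only an illustration: it averages over the random signs for one fixed sample (the conditional, or empirical, version of the quantity above), and the function name `rademacher_complexity`, the uniform sample and the class of threshold indicators are all assumptions chosen for the example.

```python
import random

def rademacher_complexity(sample, functions, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f sum_i eps_i f(X_i), given the sample."""
    rng = random.Random(seed)
    values = [[f(x) for x in sample] for f in functions]   # f(X_i), precomputed
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in sample]        # i.i.d. Rademacher signs
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values)
    return total / n_draws

rng = random.Random(42)
sample = [rng.random() for _ in range(200)]                # i.i.d. Uniform(0, 1)
thresholds = [k / 10 for k in range(1, 10)]
functions = [lambda x, t=t: 1.0 if x <= t else 0.0 for t in thresholds]
estimate = rademacher_complexity(sample, functions)
print(round(estimate, 2))
```

Averaging such estimates over independent samples would approximate the unconditional expectation.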
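Key tools:
- the function $F$ is an envelope for the class $\mathcal{F}$ if $|f(x)|\le F(x)$ for all $x\in E$ and $f\in\mathcal{F}$;
- for a metric space $(\mathcal{F},d)$, the covering number $N(\varepsilon,\mathcal{F},d)$ is the minimal number of balls of size $\varepsilon$ needed to cover $\mathcal{F}$. The metric that we use here is induced by the norm $\|f\|_{L_2(Q)}=\{\int f^2\,dQ\}^{1/2}$.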
Polynomial entropy : VC type classes
Hypothesis on the class $\mathcal{F}$
$\mathcal{F}$ is a class of measurable functions $E\to\mathbb{R}$ of VC type (or Vapnik-Chervonenkis type) for an envelope $F$ and admissible characteristic $(C,v)$ (positive constants) such that $C\ge(3\sqrt{e})^{v}$ and $v\ge 1$; that is, for every probability measure $Q$ on $(E,\mathcal{E})$ with $0<\|F\|_{L_2(Q)}<\infty$ and every $0<\varepsilon<1$,
$$N\big(\varepsilon\|F\|_{L_2(Q)},\mathcal{F},\|\cdot\|_{L_2(Q)}\big)\le C\varepsilon^{-v}.$$
NB: the class is countable to avoid measurability issues (but the non-countable case may be handled similarly by using outer probability and additional measurability assumptions, see van der Vaart and Wellner, 2001).
Giné and Guillou (2002)'s key theorem
Theorem (i.i.d. case, Giné and Guillou (2002))
Let $\mathcal{F}$ be a measurable uniformly bounded VC class of functions defined on $E$ with envelope $F$ and characteristic $(C,v)$. Let $U>0$ be such that $|f(x)|\le U$ for all $x\in E$ and $f\in\mathcal{F}$. Let $\sigma^2$ be such that $E[f(X)^2]\le\sigma^2$ for all $f\in\mathcal{F}$. Then, whenever $0<\sigma\le U$, it holds that
$$R_{n,\xi}(\mathcal{F})\le M\left[vU\log\frac{CU}{\sigma}+\sqrt{vn\sigma^2\log\frac{CU}{\sigma}}\right],$$
where $M$ is a universal constant.
Very useful in association with Talagrand (1996) or McDiarmid (1998) (see also Einmahl and Mason (2005)).
Purpose of this work: (i) prove some concentration inequalities on $Z$ for Markov chains under assumptions on $f$ similar to the i.i.d. case; (ii) apply them to uniform results on density estimation and MCMC.
Concentration inequality for Markov chains Rademacher complexity for Markov chains
Block decompositions of the empirical process
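Decompose the chain according to the elements $X_i$ that belong to complete blocks $B_1,\dots,B_{l_n-1}$ and the elements $X_i$ in $B_0$ and $B_{l_n}$:
$$\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\big(f(X_i)-E_\pi[f]\big)\le\sup_{f\in\mathcal{F}}\sum_{i=\tau_A(1)+1}^{\tau_A(l_n)}\big(f(X_i)-E_\pi[f]\big)+\sup_{f\in\mathcal{F}}\sum_{i=1}^{\tau_A}\big(f(X_i)-E_\pi[f]\big)+\sup_{f\in\mathcal{F}}\sum_{i=\tau_A(l_n)+1}^{n}\big(f(X_i)-E_\pi[f]\big)$$
($B_0$ and $B_{l_n}$ will be treated separately). Because $\tau_A(l_n)-\tau_A(1)=\sum_{k=1}^{l_n-1}\ell(B_k)$, where $\ell(B_k)$ denotes the size of block $k$, it holds that
$$\sum_{i=\tau_A(1)+1}^{\tau_A(l_n)}\big(f(X_i)-E_\pi[f]\big)=\sum_{k=1}^{l_n-1}\big(f'(B_k)-\ell(B_k)E_\pi[f]\big),$$
where $f'(B_k)=\sum_{i=\tau_A(k)+1}^{\tau_A(k+1)}f(X_i)$.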
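The identity between the centered sum over the complete blocks and the sum of centered block functionals can be checked numerically on a toy path. In this sketch the helper name `block_sums`, the path, the function `f` and the constant `mean_f` (a stand-in for $E_\pi[f]$) are illustrative assumptions; the identity is purely algebraic and holds for any constant.

```python
def block_sums(path, in_atom, f):
    """Return the 1-based hitting times of the atom and the block sums of f."""
    hits = [n for n, x in enumerate(path, start=1) if in_atom(x)]
    # Block sum over X_{tau(k)+1}, ..., X_{tau(k+1)}, i.e. path[tau(k):tau(k+1)]
    sums = [sum(f(x) for x in path[s:e]) for s, e in zip(hits, hits[1:])]
    return hits, sums

path = [2, 0, 1, 1, 0, 2, 0, 0, 1]          # toy path, atom A = {0}
f = lambda x: x * x
mean_f = 1.0                                 # stand-in for E_pi[f]

hits, sums = block_sums(path, lambda x: x == 0, f)
lengths = [e - s for s, e in zip(hits, hits[1:])]

# Left side: centered sum over the indices tau_A(1)+1, ..., tau_A(last visit)
lhs = sum(f(x) - mean_f for x in path[hits[0]:hits[-1]])
# Right side: sum over complete blocks of (block sum - block length * mean)
rhs = sum(s - l * mean_f for s, l in zip(sums, lengths))
print(lhs, rhs)
```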
2 main difficulties
- $l_n$ is random; it is very easy to get rid of this randomness asymptotically, since $l_n\sim n/E_A[\tau_A]$. It is much more complicated for finite $n$, because $l_n$ is correlated with the $f'(B_k)$.
- even if $f$ is bounded, $f'$ is not, and it depends on the behavior of $\tau_A$.
Block Rademacher complexity of the class F
Block Rademacher complexity of the class $\mathcal{F}$:
$$R_{n,B}(\mathcal{F})=E_A\sup_{f\in\mathcal{F}}\sum_{k=1}^{n}\varepsilon_k f'(B_k),$$
where $(\varepsilon_k)_{k\in\mathbb{N}}$ are Rademacher random variables independent from the blocks $(B_k)_{k\in\mathbb{N}}$.
Issues:
- how to control this quantity with the original VC complexity of the class;
- how to control the concentration of $Z$ with this Rademacher complexity.
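To get a feel for this quantity, one can estimate an empirical analogue from the blocks of a single simulated path. This is only a sketch under assumptions: the lazy reflected random walk, the atom $\{0\}$, the class of threshold indicators and the names `simulate_chain` and `block_rademacher` are all illustrative, and the average is taken over the random signs for the observed blocks rather than over $n$ blocks under $E_A$.

```python
import random

def simulate_chain(n, seed=0):
    """Lazy reflected random walk on {0, ..., 4}; the state 0 serves as the atom."""
    rng = random.Random(seed)
    x, path = 0, []
    for _ in range(n):
        x = min(max(x + rng.choice((-1, 0, 1)), 0), 4)
        path.append(x)
    return path

def block_rademacher(path, in_atom, functions, n_draws=1000, seed=1):
    """Monte Carlo estimate of E_eps sup_f sum_k eps_k f'(B_k) over the observed blocks."""
    hits = [i for i, x in enumerate(path, start=1) if in_atom(x)]
    blocks = [path[s:e] for s, e in zip(hits, hits[1:])]
    # f'(B_k): sum of f over block k, for every f in the class
    table = [[sum(f(x) for x in b) for b in blocks] for f in functions]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in blocks]
        total += max(sum(e * v for e, v in zip(eps, row)) for row in table)
    return total / n_draws, len(blocks)

path = simulate_chain(5000)
functions = [lambda x, t=t: 1.0 if x <= t else 0.0 for t in range(4)]
estimate, n_blocks = block_rademacher(path, lambda x: x == 0, functions)
print(n_blocks, round(estimate, 2))
```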
For this, define the torus $E'=\cup_{k=1}^{\infty}E^k$. Let the occupation measure $M$ be given by
$$M(B,dy)=\sum_{x\in B}\delta_x(y),\quad\text{for every }B\in E'.$$
For any function $f:E\to\mathbb{R}$, define the corresponding block function $f':E'\to\mathbb{R}$ given by
$$f'(B)=\int f(y)M(B,dy)=\sum_{x\in B}f(x).$$
For any class $\mathcal{F}$ of real-valued functions defined on $E$, denote $\mathcal{F}'=\{f' : f\in\mathcal{F}\}$.
Change of measure
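Let $Q'$ denote a probability measure on $(E',\mathcal{E}')$. The measure $Q$ defined by
$$Q(A)=E_{Q'}\left(\ell(B)\times\int_A M(B,dy)\right)\Big/Q'(\ell^2),\quad\text{for every }A\in\mathcal{E},$$
is a probability measure on $(E,\mathcal{E})$.
Lemma
Let $Q'$ be a probability measure on $(E',\mathcal{E}')$ such that $0<\|\ell\|_{L_2(Q')}<\infty$. Then we have, for every $0<\varepsilon<\infty$,
$$N\big(\varepsilon\|\ell\|_{L_2(Q')},\mathcal{F}',L_2(Q')\big)\le N\big(\varepsilon,\mathcal{F},L_2(Q)\big).$$
Moreover, if $\mathcal{F}$ is VC with constant envelope $U$ and characteristic $(C,v)$, then $\mathcal{F}'$ is VC with envelope $U\ell$ and characteristic $(C,v)$.
Define the truncated version $\mathcal{F}'1_{\{\ell\le L\}}=\{f'1_{\{\ell\le L\}} : f\in\mathcal{F}\}$; it remains VC with envelope $UL$.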
Idea of the proof (Jensen inequality)
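$$Q'(f'^2)=E_{Q'}\left[\left(\int f(y)M(B,dy)\right)^2\right]\le E_{Q'}\left[\ell(B)\int f(y)^2M(B,dy)\right]=Q(f^2)\,Q'(\ell^2).$$
Apply this inequality to the function
$$f'(B)-f_k'(B)=\int\big(f(y)-f_k(y)\big)M(B,dy),$$
for $f_k$ the centers of an $\varepsilon$-cover of the space $\mathcal{F}$ and $\|f-f_k\|_{L_2(Q)}\le\varepsilon$.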
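For readers who want the intermediate steps, the argument can be written out as follows. This is a sketch that uses only the definitions of $M$, $\ell$ and $Q$ given earlier; the shorthand $g=f-f_k$ is introduced here for convenience.

```latex
% Jensen's inequality for the uniform average over the \ell(B) points of a block B:
\Big(\sum_{x\in B} g(x)\Big)^2
  = \ell(B)^2 \Big(\frac{1}{\ell(B)}\sum_{x\in B} g(x)\Big)^2
  \le \ell(B)^2 \cdot \frac{1}{\ell(B)}\sum_{x\in B} g(x)^2
  = \ell(B)\int g(y)^2\, M(B,dy).

% Taking the expectation under Q' and using the definition of Q:
Q'(g'^2) \le E_{Q'}\Big[\ell(B)\int g(y)^2\, M(B,dy)\Big] = Q(g^2)\, Q'(\ell^2).

% With g = f - f_k and \|f - f_k\|_{L_2(Q)} \le \varepsilon:
\|f' - f_k'\|_{L_2(Q')} \le \|\ell\|_{L_2(Q')}\, \|f - f_k\|_{L_2(Q)}
  \le \varepsilon\, \|\ell\|_{L_2(Q')}.
```

Hence the block functions $f_k'$ built from an $\varepsilon$-cover of $\mathcal{F}$ in $L_2(Q)$ form an $\varepsilon\|\ell\|_{L_2(Q')}$-cover of $\mathcal{F}'$ in $L_2(Q')$, which is the covering number comparison stated in the lemma.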
Concentration inequality for Markov chains Assumptions
(PM) there exists $p>1$ such that $E_A[\tau_A^p]<\infty$,
(EM) there exists $\lambda>0$ such that $E_A[\exp(\tau_A\lambda)]<\infty$.
Remarks
- (EM): Condition (EM) is equivalent to each of the following assertions: (i) the geometric ergodicity of the chain $X$, (ii) the (uniform) Doeblin condition, as well as (iii) the Foster-Lyapunov drift condition (see Theorem 16.0.2 in Meyn and Tweedie (2009) for the details).
- Mixing and (PM): the relationship between (PM) and the rate of decay of mixing coefficients is investigated in Bolthausen (1982). This condition is typically fulfilled as soon as the sequence of strong mixing coefficients decreases at an arithmetic rate $n^{-s}$, for some $s>p-1$.
Main results
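Theorem (Block Rademacher complexity)
Let $\mathcal{F}$ be VC with constant envelope $U$ and characteristic $(C,v)$. Let $\sigma'^2$ be such that $E_A\big[\sum_{i=1}^{\tau_A}f(X_i)^2\big]\le\sigma'^2$ for all $f\in\mathcal{F}$. For some universal constant $M>0$, and any $L$ such that $0<\sigma'\le LU$,
1. if (PM) holds, then
$$R_{n,B}(\mathcal{F})\le M\left[vLU\log\frac{CLU}{\sigma'}+\sqrt{vn\sigma'^2\log\frac{CLU}{\sigma'}}\right]+\frac{nE_A[\tau_A^p]}{L^{p-1}},$$
2. if (EM) holds, then
$$R_{n,B}(\mathcal{F})\le M\left[vLU\log\frac{CLU}{\sigma'}+\sqrt{vn\sigma'^2\log\frac{CLU}{\sigma'}}\right]+nU\exp(-L\lambda/2)\,C_\lambda,$$
where $C_\lambda=2E_A[\exp(\tau_A\lambda)]/\lambda$.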
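The truncation level $L$ is free in both bounds, so it can be tuned. The following sketch evaluates the right-hand side of the (EM) bound on a grid of values of $L$ to illustrate the trade-off between the bracketed term (increasing in $L$) and the exponential tail term (decreasing in $L$). All numerical values are hypothetical, the universal constant $M$ is unknown and set to 1 purely for illustration, and `em_bound` is a name introduced here.

```python
import math

def em_bound(L, n, U, v, C, sigma, lam, C_lam, M=1.0):
    """Right-hand side of the (EM) bound; M = 1 is a placeholder for the unknown constant."""
    log_term = math.log(C * L * U / sigma)
    main = M * (v * L * U * log_term + math.sqrt(v * n * sigma ** 2 * log_term))
    tail = n * U * math.exp(-L * lam / 2.0) * C_lam
    return main + tail

# Hypothetical constants satisfying C >= (3 * sqrt(e))^v and sigma <= L * U on the grid.
params = dict(n=10_000, U=1.0, v=1.0, C=5.0, sigma=1.0, lam=0.5, C_lam=8.0)
grid = [float(L) for L in range(1, 201)]
best_L = min(grid, key=lambda L: em_bound(L, **params))
print(best_L, round(em_bound(best_L, **params), 1))
```

With these made-up constants the minimizer sits at a moderate value of $L$, of the order of $\log(n)/\lambda$: large enough to make the tail term small, small enough to keep the envelope $LU$ under control.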
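Theorem (Expectation bound)
Let $\mathcal{F}$ be a countable class of measurable functions bounded by $U$. It holds that
$$E_\nu\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\big(f(X_i)-E_\pi[f]\big)\right]\le 4R_{n,B}(\mathcal{F})+4\sup_{f\in\mathcal{F}}|E_\pi[f]|\sqrt{nE_A[\tau_A^2]}+2U\big(E_\nu[\tau_A]+E_A[\tau_A]\big),$$
where $\nu$ stands for the initial measure.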
Concentration inequality for Markov chains Concentration inequality for Markov Chains
Main concentration inequality
Applying the Talagrand or McDiarmid inequality yields a concentration bound for the empirical process.
Theorem (Concentration bound via Rademacher control)
Assume (EM) and that there exists $\lambda>0$ such that $E_\nu[\exp(\lambda\tau_A)]<\infty$. Let $\mathcal{F}$ be a countable class of measurable functions bounded by $U$. Let $R_n$ and $\sigma'^2$ be such that
$$R_n\ge 4R_{n,B}(\mathcal{F})+4\sup_{f\in\mathcal{F}}|E_\pi[f]|\sqrt{nE_A[\tau_A^2]}+2U\big(E_\nu[\tau_A]+E_A[\tau_A]\big),$$
$$\sigma'^2\ge\sup_{f\in\mathcal{F}}E_A\left(\sum_{i=1}^{\tau_A}f(X_i)\right)^2.$$
Then, for some universal constant $K>0$, and for $\tau>0$ depending on the tails of the regeneration time,
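with probability $1-\delta$ we have
$$\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\big(f(X_i)-E_\pi(f)\big)\le KR_n+\max\left(\sqrt{n}\,\sigma'\sqrt{K\log\frac{K}{\delta}},\ \log\Big(\frac{K}{\delta}\Big)\frac{\tau^3U\log(n)}{E_A[\tau_A]}\right).$$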
Generalization to m > 1
If $m>1$, then the blocks $(B_k)$ are 1-dependent (see for instance Chen (1999), Corollary 2.3).
Split the sum as follows:
$$\sum_{k=0}^{l_n}f(B_k)=\sum_{k=0,\ k\ \mathrm{even}}^{l_n}f(B_k)+\sum_{k=0,\ k\ \mathrm{odd}}^{l_n}f(B_k).$$
Because of the 1-dependence property, the blocks within each sum are independent.
This reduces the problem to two sums of at most $n/2$ independent blocks that can be treated separately.
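The splitting step itself is elementary, as the following sketch shows on made-up block values (the name `split_even_odd` and the numbers are illustrative only). Blocks with indices of the same parity are at least two apart, which is why 1-dependence makes each sub-sum a sum of independent terms.

```python
def split_even_odd(block_values):
    """Separate f(B_0), f(B_1), ... into even-indexed and odd-indexed blocks."""
    even = block_values[0::2]   # B_0, B_2, B_4, ...
    odd = block_values[1::2]    # B_1, B_3, B_5, ...
    return even, odd

values = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0]   # f(B_0), ..., f(B_6), made-up numbers
even, odd = split_even_odd(values)
total = sum(even) + sum(odd)
print(even, odd, total)
```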
Application to kernel density estimation
Application to density kernel estimation
Given $n\ge 1$ observations of a Markov chain $X\subset\mathbb{R}^d$, the kernel density estimator of the stationary measure $\pi$ is given by
$$\hat\pi_n(x)=n^{-1}\sum_{i=1}^{n}K\big((x-X_i)/h_n\big)/h_n^d,$$
where $K:\mathbb{R}^d\to\mathbb{R}$, called the kernel, is such that $\int K(x)\,dx=1$ and $(h_n)_{n\ge 1}$ is a positive sequence of bandwidths.
The bias term, $E\hat\pi_n-\pi$, is classically treated by using techniques from functional analysis (regularity of the density).
The variance term, $\hat\pi_n-E\hat\pi_n$, is treated using empirical process techniques in the case of independent random variables.
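As an illustration of the estimator for $d=1$, the sketch below applies it to a simulated Gaussian AR(1) chain, whose stationary density is known in closed form. Everything specific here is an assumption chosen for the example: the Epanechnikov kernel, the AR(1) model, the bandwidth $h_n=n^{-1/5}$, and the function names.

```python
import math
import random

def epanechnikov(u):
    """Kernel supported on [-1, 1] that integrates to 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, sample, h):
    """Kernel density estimate at x for d = 1: n^{-1} sum_i K((x - X_i)/h) / h."""
    return sum(epanechnikov((x - xi) / h) for xi in sample) / (len(sample) * h)

def ar1_path(n, rho, seed=0, burn_in=200):
    """X_{t+1} = rho * X_t + N(0, 1) noise; the first burn_in values are discarded."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for t in range(n + burn_in):
        x = rho * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            path.append(x)
    return path

n, rho = 2000, 0.5
sample = ar1_path(n, rho)
h = n ** (-1.0 / 5.0)

# Stationary law of this AR(1) chain: centered Gaussian with variance 1 / (1 - rho^2).
sd = 1.0 / math.sqrt(1.0 - rho ** 2)
true_density = lambda x: math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

grid = [k / 4.0 for k in range(-12, 13)]
sup_error = max(abs(kde(x, sample, h) - true_density(x)) for x in grid)
print(round(sup_error, 3))
```

The quantity `sup_error` mixes bias and variance; the uniform results of this section concern the variance part only.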
Hypotheses
We shall consider kernel functions $K:\mathbb{R}^d\to\mathbb{R}$ taking one of the two following forms:
$$\text{(i)}\ K(x)=K^{(0)}(|x|),\quad\text{or}\quad\text{(ii)}\ K(x)=\prod_{k=1}^{d}K^{(0)}(x_k),$$
where $K^{(0)}$ is a bounded function of bounded variation with support $[-1,1]$.
From Nolan and Pollard (1987), the class of functions
$$\mathcal{K}=\{y\mapsto K((x-y)/h) : h>0,\ x\in\mathbb{R}^d\}$$
is a uniformly bounded VC class.
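Theorem
Suppose that $\pi$ is bounded, that $h_n\to 0$ and that there exists $\beta>0$ such that $h_n\ge n^{-\beta}$.
1. If (PM) holds for $p>2$ and $0<\beta(p/(p-1))<1/d$, we have
$$E_\nu\left[\sup_{x\in\mathbb{R}^d}|\hat\pi_n(x)-E_\pi[\hat\pi_n(x)]|\right]=O\left(\sqrt{\frac{\log(nh_n^{-1})}{nh_n^{dp/(p-1)}}}\right).$$
2. If (EM) holds and $0<\beta<1/d$, we have
$$E_\nu\left[\sup_{x\in\mathbb{R}^d}|\hat\pi_n(x)-E_\pi[\hat\pi_n(x)]|\right]=O\left(\sqrt{\frac{\log(n)^2}{nh_n^d}}\right).$$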
Getting rid of a log(n)
Theorem
Under (EM), suppose that $\pi$ is bounded, that $h_n\to 0$ and that $\sqrt{|\log(h_n)|/(nh_n^d)}\to 0$. If there exist $p>2$ and $C>0$ such that for all $x\in E$, $\pi(x)E_x[\tau_A^p]\le C$, then we have
$$E_\nu\left[\sup_{x\in\mathbb{R}^d}|\hat\pi_n(x)-E_\pi[\hat\pi_n(x)]|\right]=O\left(\sqrt{\frac{|\log(h_n)|}{nh_n^d}}\right).$$
Main idea: control the variance
$$E_A\left(\sum_{i=1}^{\tau_A}K\big((x-X_i)/h_n\big)\right)^2\le c\,h_n^d,\quad\text{for all }x\in E.\qquad(1)$$
Bibliography I
Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability 13, 1000–1034.
Athreya, K. B. and P. Ney (1978). A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc. 245, 493–501.
Azaïs, R., B. Delyon, and F. Portier (2018). Integral estimation based on Markovian design. Advances in Applied Probability 50(3), 833–857.
Bednorz, W., K. Latuszynski, and R. Latala (2008). A regeneration proof of the central limit theorem for uniformly ergodic Markov chains. Electronic Communications in Probability 13, 85–98.
Bertail, P. and S. Clémençon (2004a). Edgeworth expansions for suitably normalized sample mean statistics of atomic Markov chains. Probab. Relat. Fields 130(3), 388–414.
Bertail, P. and S. Clémençon (2004b). Note on the regeneration-based bootstrap for atomic Markov chains. TEST 16, 109–122.
Bertail, P. and S. Clémençon (2010). Sharp bounds for the tails of functionals of Markov chains. Th. Prob. Appl. 54(3), 505–515.
Bertail, P. and S. Clémençon (2011). A renewal approach to Markovian U-statistics. Mathematical Methods of Statistics 20(2), 79–105.
Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Bousquet, O., S. Boucheron, and G. Lugosi (2003). Introduction to statistical learning theory. In Summer School on Machine Learning, pp. 169–207. Springer.
Dedecker, J. and S. Gouëzel (2015). Subgaussian concentration inequalities for geometrically ergodic Markov chains. Electronic Communications in Probability 20.
Einmahl, U. and D. M. Mason (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33(3), 1380–1403.
Giné, E. and A. Guillou (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38(6), 907–921. En l’honneur de J. Bretagnolle, D. Dacunha-Castelle, I. Ibragimov.
Levental, S. (1988). Uniform limit theorems for Harris recurrent Markov chains. Probability Theory and Related Fields 80(1), 101–118.
McDiarmid, C. (1998). Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed (Eds.), Probabilistic Methods for Algorithmic Discrete Mathematics, Volume 16 of Algorithms and Combinatorics, pp. 195–248. Springer Berlin Heidelberg.
Meyn, S. and R. L. Tweedie (2009). Markov chains and stochastic stability (Second ed.). Cambridge University Press. With a prologue by Peter W. Glynn.
Nummelin, E. (1978). A splitting technique for Harris recurrent Markov chains. Z. Wahrsch. Verw. Gebiete 43(4), 309–318.
Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab. 20, 32 pp.
Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones mathematicae 126(3), 505–563.