arXiv:1805.10721v1 [math.ST] 28 May 2018

(1)

BERNSTEIN’S INEQUALITY FOR GENERAL MARKOV CHAINS

By Bai Jiang^†, Qiang Sun^‡ and Jianqing Fan^†^,^∗ Princeton University^†, University of Toronto^‡

We prove a sharp Bernstein inequality for general-state-space and not necessarily reversible Markov chains. It is sharp in the sense that the variance proxy term is optimal. Our result covers the classical Bernstein’s inequality for independent random variables as a special case.

1. Introduction. Concentration inequalities bounds the probability that the sum of random variables deviates from its mean and have found enor- mous applications in statistics, machine learning and information theory.

Many of them can be derived by the Chernoff approach (Boucheron et al., 2013). Let Z₁, . . . , Z_n be n independent random variables with EZ_i = 0.

Suppose the moment generating function of their sum P_n

i=1Z_i is bounded with a certain convex functiong in the way

(1.1) Eh

e^t^Pⁿⁱ⁼¹^Zⁱi

≤e^ng(t).

Let g^∗() = sup_t{t−g(t)} be the Fenchel conjugate of g. The Chernoff approach implies that, for >0,

(1.2) P 1

n Xn

i=1

Z_i >

!

≤e⁻^ng^∗⁽⁾.

The conjugation and second-order properties of convex functions (Gorni, 1991) assert that if

g(t) = V t²

2 +o(t²) for small t, then

g^∗() = ²

2V +o(²) for small .

∗Corresponding author.

MSC 2010 subject classifications:Primary 60J05; secondary 62J05, 65C05 Keywords and phrases:Markov chain, Bernstein’s inequality, general state space

1

arXiv:1805.10721v1 [math.ST] 28 May 2018

(2)

Thus a tighter bound of the moment generating function in (1.1) with a sharper coefficient V := 2×lim_t→0g(t)/t² leads to a tighter concentration in (1.2).

We refer the quantityV as the variance proxy of Pn

i=1Zi/√

n, as it not only characterizes the shape of the sub-gaussian tail in (1.2) but also natu- rally upper bounds the variance ofPn

i=1Z_i/√

n. Indeed, plugging expansions Eh

e^t^Pⁿⁱ⁼¹^Zⁱi

= 1 +E

" _n X

i=1

Z_i

#

·t+E



 Xn

i=1

Z_i

!2

·t² 2 +. . .

= 1 + Var

" _n X

i=1

Z_i

#

·t²

2 +o(t²), and e^ng(t)= 1 +ng(t) +(ng(t))²

2 +. . .

= 1 +nV ·t²

2 +o(t²) into (1.1) yields Var [Pn

i=1Zi]·t² ≤nV ·t²+o(t²). Dividing both sides by t² and takingt→0 yields Var [P_n

i=1Z_i]≤nV.

Sergei Bernstein, George Bennett and Wassily Hoeffding pioneered in the study of concentration inequalities and named three well-known results. For nindependent random variables Z₁, . . . , Z_n such thatEZ_i = 0 and|Z_i|< c almost surely, Hoeffding’s inequality (Hoeffding,1963) holds with

g(t) = c²t²

2 , g^∗() = ²

2c², V =c².

If taking the variances of Z_i into account and letting σ² =P_n

i=1VarZ_i/n, Bennett’s inequality (Bennett,1962) holds with

(1.3) g(t) = σ²

c²(e^tc−1−tc), g^∗() = σ² c²hc

σ²

, V =σ²,

where h(u) = (1 +u) log(1 + u) −u. Bennett’s inequality, together with the fact that h(u) ≥ u²/2(1 +u/3) for u ≥ 0, further implies Bernstein’s inequality (Bernstein,1946):

(1.4) P 1

n Xn

i=1

Z_i >

!

≤exp − n² 2 σ²+^c₃

! .

For both theoretical interest in probability and practical needs from mod- ern applications in statistics, machine learning and information theory, re- searchers have tried to study concentration inequalities for Markov-dependent

(3)

random variables. These inequalities consider the cases in whichZi =fi(Xi) with {X_i}i≥1 being a Markov chain. A large body of literature utilize the operator theory in Hilbert spaces and produce inequalities involving theL2- spectral gap, a quantity measuring the converging speed of the Markov chain towards its invariant distribution.

Among these work,Le´on and Perron(2004) established a sharp Hoeffding- type inequality for finite-state-space, reversible Markov chains, using convex majorizations of transition probability matrices. Miasojedow (2014) developed a discretization technique to prove a variant of L´eon and Perron’s result for general Markov chains. Throughout the paper we use general Markov chains to denote general-state-space and not necessarily reversible Markov chains.Fan et al.(2018) improved upon (Miasojedow,2014) and established the exact counterpart of Hoeffding’s inequality for general Markov chains.

Comparing with Hoeffding-type inequalities which only use the boundc, both Bennett’s and Berstein’s inequalities have incorporated variances and thus can be sharper, especially in the case σ² c², as would be the case for a random variable that occasionally takes on large values but has rela- tively small variance. Lezaud (1998a) derive an inequality of this kind for finite-state-space, reversible Markov chains. This work has recently inspired Bernstein-type inequalities in (Paulin,2015). Unfortunately, their inequalities do not hold for general-state-space Markov chains¹.

In this paper, we establish a sharp Bernstein-type inequality for general Markov chains. Throughout the paper, we denote by π(f) the integral of functionf with respect to a distributionπ. We formally state our first main result.

Theorem1.1. Let{X_i}i≥1 be a stationary Markov chain with invariant distribution π and L2-spectral gap (see Definition 2.1) 1−λ ∈ (0,1]. Let f_i : X → [−c,+c] be a sequence of bounded functions with π(f_i) = 0. Let σ²=Pn

i=1π(f_i²)/n. Then, for any 0≤t <(1−λ)/5c, (1.5) Eh

e^t^Pⁿⁱ⁼¹^fⁱ^(Xⁱ⁾i

≤exp nσ²

c² (e^tc−1−tc) + nσ²λt² 1−λ−5ct

.

1At the beginning of Section 4 and in the remarks after display (13) in (Lezaud,1998a), Lezaud mentioned that his method based on Kato’s perturbation theory (Kato,2013) does not work for general-state-space Markov chains. However, on page 97 of his PhD thesis (Lezaud,1998b) (in French), Lezaud wrote that his method has the possibility to be ex- tended to general-state-space Markov chains but provided no proof.Paulin (2015) used Lezaud’s method to establish Bernstein-type inequalities for Markov chains and claimed that his inequalities hold for general-state-space Markov chains. After attentively investi- gating into these literature, we concluded that Lezaud’s method does not work for general- state-space Markov chains.

(4)

Moreover, we have, for any >0,

(1.6) P 1

n Xn

i=1

f_i(X_i)>

!

≤exp

− n²

2(A₁σ²+A₂c)

,

where

A₁ = 1 +λ

1−λ, A₂= 1

3I(λ= 0) + 5

1−λI(λ >0).

In case thatf_i =f are time-independent, we can sharpen coefficients in Theorem 1.1 by replacing λ with a smaller quantity λ+∨0 related to the right L2-spectral gap (see Definition 2.2) 1−λ₊. Here a∨b denotes the maximum ofaand b. Note thatλ₊∈[−1,1), λ∈[0,1) and |λ₊| ≤λ, thus λ+∨0≤λ. We summarize this result in the following theorem.

Theorem1.2. Let{X_i}i≥1 be a stationary Markov chain with invariant distributionπ and rightL2-spectral gap (see Definition2.2)1−λ₊∈(0,2].

Let f :X → [−c,+c] be a bounded function withπ(f) = 0 and π(f²) =σ². Then, for any 0≤t <(1−λ₊∨0)/5c,

(1.7) Eπ

he^t^Pⁿⁱ⁼¹^f(Xⁱ⁾i

≤exp nσ²

c² (e^tc−1−tc) + nσ²(λ₊∨0)t² 1−λ₊∨0−5ct

.

Moreover, we have, for any >0,

(1.8) Pπ

1 n

Xn i=1

f(X_i)>

!

≤exp

− n²

2(A₁σ²+A₂c)

,

where

A₁ = 1 +λ₊∨0

1−λ₊∨0 A₂ = 1

3I(λ₊ ≤0) + 5

1−λ₊I(λ₊>0).

Our results improve upon the literature on three folds. First, our inequality recovers Bernstein’s inequality for independent random variables as a special case, whereas the previous inequalities do not. The exponent in the bound (1.5) has two terms. The leading term (scaled by 1/n) is exactly the functiong(t) of Bennett’s and Bernstein’s inequalities in (1.3). The remain- der is caused by the dependence across the Markov chain, and decreases to 0 linearly fast as λ decreases to zero. Correspondingly, the coefficients A1

and A₂ in (1.6) are increasing with λ. When λ = 0, Theorem 1.1 recovers the classical Bernstein’s inequality for independent random variables. In- deed, independent random variables Z1, . . . , Zn ∈ [−c,+c] can be seen as

(5)

transformations of independently and identically distributed (i.i.d.) random variables U₁, . . . , U_n∼Uniform[0,1] via the inverse cumulative distribution functionF_Z⁻¹

i , i.e.Z_i =F_Z⁻¹

i (U_i); and, i.i.d. random variables{U_i}i≥1 form a stationary Markov chain withλ=λ+= 0. Similarly, Theorem1.2reduces to the classical Bernstein’s inequality for i.i.d. random variables withλ₊ = 0.

Second, our inequality holds for general-state-space Markov chains, in contrast that the Bernstein-type inequalities in (Paulin,2015) do not. The latter, in a similar vein to (Lezaud,1998a), used Kato’s perturbation theory (Kato, 2013) to estimate the largest eigenvalue of the perturbed transition probability matrix. However, this estimation method is developed for finite- dimensional transition probability matrices, whereas the transition kernels of general-state-space Markov chains are infinite-dimensional. Such perturbed transition kernels may not have eigenvalues. This obstacle is the main fo- cus of our paper. We overcome it by using techniques developed for the Hoeffding-type inequalities for general Markov chains (Fan et al.,2018).

Third, the variance proxiesσ²·(1 +λ)/(1−λ) for time-dependent functions orσ²·(1 +λ₊∨0)/(1−λ₊∨0) for time-independent functions in our inequalities are optimal. For any λ ∈ [0,1), consider a stationary Markov chain{Xi}i≥1with transition probabilityP(x, B) =λI(x∈B)+(1−λ)π(B) for statexand subsetB of the state space. This chain has invariant distribu- tionπ,L2-spectral gap 1−λand rightL2-spectral gap 1−λ, see definitions in Section 2. For function f :x 7→[−c,+c] with π(f) = 0 and π(f²) = σ², Theorem 2.1 in (Geyer,1992) asserts that

nlim→∞Var Pn

i=1f(X_i)

√n

=σ²·1 +λ 1−λ. Recall that any variance proxy V bounds the variance of Pn

i=1f(X_i)/√ n.

Thus,V ≥σ²·(1 +λ)/(1−λ). This lower bound is indeed achieved in our Theorem1.1. Variance proxies of our inequalities and others’ are compared in Table1(for time-independent functionsf_i) and Table2(for time-dependent functionfi =f). Notations in the two tables are referred to Theorems 1.1 and1.2.

The remaining of this paper is structured as follows. Section 2 presents the preliminaries about operator and spectral theories. Section 3 is devoted to the proofs of Theorems1.1 and 1.2.

2. Preliminaries. Throughout the paper, we assume the state space X equipped with a sigma-algebra B is a standard Borel space². This assumption holds in most practical examples in which X is a subset of a

2A measurable space (X,B) is standard Borel if it is isomorphic to a subset ofR. See Definition 4.33 in (Breiman,1992).

(6)

Table 1

Variance proxy in inequalities for time-dependent functions fi of Markov chains

type reference condition variance proxyV

Hoeffding (Hoeffding,1963) independent c² Hoeffding (Fan et al.,2018) general-state-space ^1+λ_1−λ·c² Bernstein (Bernstein,1946) independent σ²

Bennett (Bennett,1962) independent σ²

Bernstein (Paulin,2015), (3.22) finite-state-space, reversible ₁₋⁴_λ2 ·σ² Bernstein Theorem1.1 general-state-space ^1+λ₁₋_λ·σ²

Table 2

Variance proxy in inequalities for time-independent functionf of Markov chains

type reference condition variance proxyV

Hoeffding (Hoeffding,1963) independent c²

Hoeffding (Le´on and Perron,2004) finite-state-space, reversible ^1+λ₁₋_λ⁺^∨⁰

+∨0 ·c² Hoeffding (Miasojedow,2014) general-state-space ^1+λ₁₋_λ·c² Hoeffding (Fan et al.,2018) general-state-space ^1+λ₁₋_λ⁺^∨⁰

+∨0 ·c²

Bernstein (Bernstein,1946) independent σ²

Bennett (Bennett,1962) independent σ²

Chernoff (Lezaud,1998a), (1) finite-state-space, reversible ₁₋_λ²

+∨0 ·σ² Chernoff (Lezaud,1998a), (13) general-state-space ₁₋⁴_λ·σ² Bernstein (Paulin,2015), (3.20) finite-state-space, reversible _1+λ

+∨0 1−λ+∨0 + 0.8

σ²(*) Bernstein (Paulin,2015), (3.21) finite-state-space, reversible ₁₋_λ²

+∨0 ·σ² Bernstein Theorem1.2 general-state-space ^1+λ₁₋_λ⁺^∨⁰

+∨0 ·σ² (*) Inequality (3.20) in (Paulin,2015) has variance proxyσasy² + 0.8σ². Hereσ²asyis the asymptotic variance ofPn

i=1f(Xi)/√n, which is unknown in practice and is ^1+λ₁₋_λ⁺

+σ²in the worst case for reversible Markov chains (Rosenthal,2003). And, the proof in (Paulin, 2015) implicitly requiresλ+≥0. So we write_1+λ

+∨0 1−λ+∨0+ 0.8

σ²as the variance proxy for convenience of comparison.

multi-dimensional real space andB is the Borel sigma-algebra overX. Let {X_i}i≥1 be a Markov chain on the state space (X,B) with invariant distri- butionπ.

(7)

2.1. Hilbert space andL2-spectral gap. Formally, letπ(h) :=R

h(x)π(dx) for any real-valued,B-measurable functionh:X →R. The set of all square- integrable functions

L2(X,B, π) :=

h:π(h²)<∞ is a Hilbert space endowed with the inner product

hh1, h2iπ = Z

h1(x)h2(x)π(dx), ∀h1, h2∈ L2(X,B, π).

We define the norm of a functionh∈ L2(X,B, π) as khkπ =p

hh, hiπ,

which induces the norm of a linear operatorT on L2(X,B, π) as

|||T|||_π = sup{kT hkπ :khkπ = 1}.

We writeL2in place ofL2(X,B, π), whenever the probability space (X,B, π) is clear in the context.

Each transition probability kernel P(x, B) with x ∈ X and B ∈ B, if invariant with respect toπ, corresponds to a bounded linear operator h7→

Rh(y)P(·, dy) on L2. We abuse P to denote this linear operator:

P h(x) = Z

h(y)P(x, dy), ∀x∈ X, ∀h∈ L2.

Let 1 : x ∈ X 7→ 1 denote the constant operator, and let Π denote the projection operator

Π : h7→ hh,1iπ1. TheL2-spectral gap is defined as follows.

Definition2.1 (L2-spectral gap). A Markov operatorP hasL2-spectral gap 1−λ(P) if

λ(P) :=|||P−Π|||π <1.

An equivalent characterization of λ(P) is given by

λ(P) = sup{kP hkπ :khkπ = 1, π(h) = 0}.

LetP^∗ be the adjoint operator of Markov operator P. It corresponds to the time-reversal of the transition probability kernel. The real part or the additive reversiblization (Fill,1991) ofP is given by (P+P^∗)/2, and is self- adjoint. Let L⁰2(π) :={h ∈ L2 : π(h) = 0}. It is known that the spectrum of self-adjoint Markov operator acting on L⁰2(π) := {h ∈ L2 : π(h) = 0} is contained in [−1,+1]. We define the gap between 1 and the maximum of this spectrum as the rightL2-spectral gap of P.

(8)

Definition2.2 (RightL2-spectral gap). A Markov operatorP has right L2-spectral gap1−λ₊(R)if its additive reversiblizationR = (P+P^∗)/2has

λ₊(R) := sup

s: s∈spectrum of R acting onL⁰2(π) <1.

2.2. Le´on-Perron operator and three lemmas. Let I be the identity operator onL2. Note that every convex combination of Markov operators pro- duces a Markov operator. We call a Markov operatorLe´on-Perron if it is a convex combination ofI and Π.

Definition 2.3 (Le´on-Perron operator). A Markov operator Pb on L2

is said Le´on-Perron if it is a convex combination of operatorsI and Π with some coefficientλ∈[0,1), that is

Pb=λI+ (1−λ)Π. The associated transition kernel,

Pb(x, B) =λI(x∈B) + (1−λ)π(B), ∀x∈ X, ∀B∈ B,

characterizes a random-scan mechanism: the Markov chain either stays at the current state with probability λ or samples a new state from π with probability 1−λat each step.

If a Le´on-Perron operatorPbshares the sameL2-spectral gap and invariant measureπwith a Markov operatorP then we call it the Le´on-Perron version ofP, which is formally defined below.

Definition 2.4 (Le´on-Perron version). For a Markov operator P with invariant measure π and L2-spectral gap 1−λ(P) = 1−λ, we say Pb = λI+ (1−λ)Π is the Le´on-Perron version ofP.

The León-Perron versionPbhas played a central role in (León and Perron, 2004) on Hoeffding-type inequalities for Markov chains. This is why we call it León-Perron.

This paper uses three lemmas of Le´on-Perron operators for general Markov chains, which are originally developed in (Fan et al.,2018). We list them as the following lemmas. In these results,E^tf denotes the multiplication operator of functione^tf(x):

E^tf :h∈ L2 7→he^tf.

(9)

Lemma2.1 (Lemma 4.1 in (Fan et al.,2018)). Let {Xi}i≥1 be a Markov chain with invariant measureπ and L2-spectral gap1−λ∈(0,1]. Let Pb= λI+ (1−λ)Π . For any bounded functions f_i and any t∈R,

Eh

e^t^Pⁿⁱ⁼¹^fⁱ^(Xⁱ⁾i

≤ Yn i=1

|||E^tfⁱ^/2P Eb ^tfⁱ^/2|||_π.

Lemma2.2 (Part of Theorem 2.2 in (Fan et al.,2018)). Let {X_i}i≥1 be a Markov chain with invariant measureπ and rightL2-spectral gap1−λ₊∈ (0,2]. LetRb₊= (λ₊∨0)I+ (1−λ₊∨0)Π . For any bounded functionf and anyt∈R,

Eh

e^t^Pⁿⁱ⁼¹^f(Xⁱ⁾i

≤ |||E^{tf /2}Rb+E^{tf /2}|||ⁿ_π.

Lemma2.3 (Lemma 4.3 in (Fan et al.,2018)). Let {Xb_i}i≥1 be a Markov chain driven by a Le´on-Perron operator Pb = λI + (1−λ)Π with some λ∈[0,1). For any bounded functionf and any t∈R,

n→∞lim 1

nlogEh

e^t^Pⁿⁱ⁼¹^f(^X^bⁱ⁾i

= log|||E^{tf /2}P Eb ^{tf /2}|||_π.

Lemmas 2.1 and 2.2 upper bound the moment generating function of Pn

i=1fi(Xi) orPn

i=1f(Xi) by the product of the norms of operatorsE^tfⁱP Eb ^tfⁱ orE^tfRb₊E^tf, where Pb and Rb₊ are León-Perron. Comparing them to (1.5) and (1.7) suggests that we only need to bound |||E^{tf /2}P Eb ^{tf /2}|||π for a León- Perron operator Pb. Lemma 2.3 studies the large deviation behavior of a Markov chain driven by a León-Perron operator, and shows that the upper bound in Lemma2.2is asymptotically tight for such Markov chains.

2.3. Asymptotic variance and reduced resolvent. This subsection consid- ers a reversible Markov chain {X_i}i≥1 driven by a self-adjoint P. Suppose this Markov chain admits a right L2-spectral gap 1−λ₊, i.e. the second largest eigenvalue of P is λ₊. Define the asymptotic variance of f with π(f) = 0 andπ(f²) =σ² as

σ_asy² := lim

n→∞Var Pn

i=1f(X_i)

√n

= lim

n→∞

1 nE^π



 Xn i=1

Xn j=1

f(X_i)f(X_j)



.

By (Rosenthal,2003, Proposition 1), σ_asy² ≤ 1 +λ₊

1−λ₊σ².

(10)

LetSdenote the reduced resolvent ofPwith respect to its largest eigenvalue 1, which can be equivalently defined through (2.11) in (Kato,2013, Chapter 2):

SΠ =ΠS= 0, (P−I)S=S(P−I) =I−Π. LetZ =−S then, with the conventionT⁰ =I for any operatorT,

Z =−S= (I−P)⁻¹(I−Π) = X∞ n=0

Pⁿ

!

(I−Π)

= X∞ n=0

(Pⁿ−Π) = X∞ n=0

(P−Π)ⁿ−Π

= (I−P+Π)⁻¹−Π. We now write

Eπ



 Xn i=1

Xn j=1

f(X_i)f(X_j)



=

*

 Xn

i=1

Xn j=1

(P−Π)^|j−i|



f, f +

π

=

n(2Z−I)−2Z²P(I−Pⁿ) f, f

π, which implies that

h(2Z−I)f, fiπ =σ²_asy≤ 1 +λ₊

1−λ₊σ², hZf, fiπ = σ²_asy+σ²

2 ≤ σ²

1−λ₊. Becausef could be an arbitrary function with mean 0, thatZ is self-adjoint, thatZΠ =ΠZ = 0 and thathZf, fi ≥0 for anyf with mean 0, we obtain

|||Z|||π = sup

h:khkπ=1|hZh, hiπ|

= sup

h:khkπ=1|hZ(I−Π)h,(I−Π)hiπ|

= sup

h:khkπ=1hZ(I −Π)h,(I−Π)hiπ

≤ sup

h:khkπ=1

k(I −Π)hk²π

1−λ₊ = 1 1−λ₊.

2.4. Kato’s perturbation theory. This subsection uses Kato’s perturbation theory (Kato,2013) to expand the largest eigenvalue ofP E^tf as a series int. Here,P is a transition probability kernel (matrix) of a finite-state-space, irreducible Markov chain, andE^tf is the multiplication operator of function e^tf(x), appearing as a diagonal matrix with elements {e^tf(x):x∈ X }.

(11)

The irreducibility ofP implies the uniqueness of the invariant distribution π. Let D be the diagonal matrix with elements {f(x) : x ∈ X }. Then we can expandP E^tf as

P E^tf =P X∞ n=0

tⁿDⁿ n!

!

= X∞ n=0

tⁿT⁽ⁿ⁾,

where T⁽ⁿ⁾ = P Dⁿ/n! for n ≥ 0. Let β(t) be the largest eigenvalue of P E^tf. By (2.31) in (Kato,2013, Chapter 2), for small|t|< t₀(witht₀ being determined later),

β(t) =β⁽⁰⁾+β⁽¹⁾t+β⁽²⁾t²+. . . , whereβ⁽⁰⁾ = 1 is the largest eigenvalue ofP and

β⁽ⁿ⁾= Xn p=1

(−1)^p p

X

v1+· · ·+vp =n, vi ≥1 k₁+· · ·+k_p=p−1, k_j ≥0

tr

T^(v¹⁾S^(k¹⁾. . . T^(v^p⁾S^(k^p⁾

withS⁽⁰⁾=−Π,S⁽ⁿ⁾=Sⁿ forn≥1, and S being the reduced resolvent of P with respect to its largest eigenvalue 1. It is more convenient to substitute S,T⁽ⁿ⁾ with−Z and P Dⁿ/n!, respectively. LetZ⁽ⁿ⁾= (−1)ⁿS⁽ⁿ⁾forn≥0, i.e.Z⁽⁰⁾ =−Π and Z⁽ⁿ⁾=Zⁿ forn≥1.

β⁽ⁿ⁾ = Xn p=1

1 p

X

v₁+· · ·+v_p =n, v_i ≥1 k₁+· · ·+k_p =p−1, k_j ≥0

−tr P D^v¹Z^(k¹⁾. . . P D^v^pZ^(k^p⁾ v₁!. . . v_p!

By a simple calculation, β⁽¹⁾ =−tr

P D¹Z⁽⁰⁾

= tr (P DΠ) = tr (DΠP) = tr (DΠ) =hf,1iπ = 0.

Using the fact thatZP =Z−(I−Π) derived from the proceeding subsection, β⁽²⁾=−1

1

tr P D²Z⁽⁰⁾

2! −1

2

tr P D¹Z⁽⁰⁾P D¹Z¹

+ tr P D¹Z¹P D¹Z⁽⁰⁾ 1!1!

= hf, fiπ

2 +hZP f, fiπ =hZf, fiπ− hf, fiπ

2 = σ_asy² 2 .

Forn≥3,β⁽ⁿ⁾ is more complicated. Boundingβ⁽ⁿ⁾ is the main task in our proof of the main theorems, and will shown in the proof section.

(12)

The convergence radiust0 is given by (6) in (Lezaud,1998a)

t₀ = 1

2|||T⁽¹⁾|||π(1−λ₊)⁻¹+c₀ with anyc₀ such that

|||T⁽ⁿ⁾|||_π ≤ |||T⁽¹⁾|||_πcⁿ⁻¹₀ .

It is easy to verify thatc₀ =c≥ |||D|||_π is a satisfactory choice. Indeed,

|||T⁽¹⁾|||π =|||P D|||π ≤c,

|||T⁽ⁿ⁾|||_π = 1

n!|||P Dⁿ|||_π ≤ |||P D|||_π|||D|||ⁿ⁻¹_π ≤ |||T⁽¹⁾|||_πcⁿ⁻¹. With this choice,

t₀ ≥ 1

2c(1−λ+)⁻¹+c = 1−λ₊ (3−λ+)c.

3. Proof of Theorems1.1 and 1.2. Comparing Lemmas2.1and 2.2 to (1.5) and (1.7) suggests that we need to bound |||E^{tf /2}P Eb ^{tf /2}|||_π for a Le´on-Perron operator Pb=λI+ (1−λ)Π. This task is attacked by Lemma 3.1.

3.1. Bounding |||E^{tf /2}P Eb ^{tf /2}|||_π. The proof of Lemma 3.1 consists of three steps. First, we discretize function f as f_k such that sup_x_∈X|f_k(x)− f(x)| ≤ c/k, π(f_k) = 0, sup_x_∈X|f_k(x)| ≤ c and f_k takes finitely many possible values.

Denote by Y the range of f_k. Next, we show that |||E^tf^k^/2P Eb ^tf^k^/2|||_π =

|||E^ty/2QEb ^ty/2|||µ, where Qb = λI + (1−λ)1µ⁰ is a transition probability matrix of a finite-state-space Markov chain onYwithµ(y) =π({x:f_k(x) = y}) for each y∈ Y, and E^ty is the diagonal matrix with elements{e^ty :y∈ Y}. This is formally proved in Lemma3.2.

Finally, we bound |||E^ty/2QEb ^ty/2|||_µ through Lemma 3.3, which is based on a refinement of the arguments in (Lezaud,1998a). Lemma3.1is concluded by tendingk→ ∞ such that|||E^tf^k^/2P Eb ^tf^k^/2|||π → |||E^{tf /2}P Eb ^{tf /2}|||π.

Lemma 3.1. Let Pb = λI + (1−λ)Π with some λ ∈ [0,1) be a Le´on- Perron operator. Let f :X →[−c,+c]be a bounded function with π(f) = 0 andπ(f²) =σ². Then, for any 0≤t <(1−λ)/5c,

|||E^{tf /2}P Eb ^{tf /2}|||π ≤exp σ²

c²(e^tc−1−tc) + σ²λt² 1−λ−5ct

.

(13)

Proof of Lemma3.1. Let a simple functionf_kbe the (c/3k)-approximation off in the way

f_k(x) =

f(x) +c c/3k

× c 3k −c.

Then, |π(f_k)| ≤ c/3k, and sup_x|f_k(x)−π(f_k)| ≤ sup_x|f_k(x)|+|π(f_k)| ≤ c+c/3k. Letfe_k be the normalized version off_k in the way

fe_k= f_k−π(f_k) 1 + 1/3k . Then, sup_x|fe_k(x)| ≤c,π(fe_k) = 0 and

sup

x |fe_k(x)−f(x)| ≤sup

x |fe_k(x)−[f_k(x)−π(f_k)]|+ sup

x |f_k(x)−f(x)|+|π(f_k)|

= sup

x

f_k(x)−π(f_k) 1 + 1/3k

+ sup

x |f_k(x)−f(x)|+|π(f_k)|

≤ c/3k

1 + 1/3k +c/3k+c/3k < c/k.

We still writefe_kasf_k. Note thatf_k takes k⁰≤(6k+ 1) possible values, say {yj : 1≤j≤k⁰} ⊂[−c,+c]. LetQb =λI+ (1−λ)1µ⁰ be ak⁰×k⁰ transition probability matrix withµ_j =π({x:f_k(x) =y_j}) forj= 1, . . . , k⁰, and E^ty be the diagonal matrix with elements{e^ty^j : 1≤j ≤k⁰}. By Lemma3.2, (3.1) |||E^tf^k^/2P Eb ^tf^k^/2|||π =|||E^ty/2QEb ^ty/2|||µ.

By Lemma3.3, for any 0≤t <(1−λ)/5c,

|||E^ty/2QEb ^ty/2|||_µ≤exp



 P_k0

j=1µ_jy_j²

c² (e^tc−1−tc) + Pk⁰

j=1µjy²_j λt² 1−λ−5ct



.

Note thatPk⁰

j=1µjy²_j =π(f_k²). And, from the fact that sup_x|f_k(x)−f(x)| ≤ c/k, it follows that

|||E^{tf /2}P Eb ^{tf /2}|||_π ≤e^ct/k× |||E^tf^k^/2P Eb ^tf^k^/2|||_π.

Collecting these pieces together yields that, for any 0≤t <(1−λ)/5c,

|||E^{tf /2}P Eb ^{tf /2}|||_π ≤exp ct

k + π(f_k²)

c² (e^tc−1−tc) + π(f_k²)λt² 1−λ−5ct

. Tendingk→ ∞ and π(f_k²)→σ² concludes the proof.

(14)

Lemma 3.2 details the proof of (3.1). For any function f taking finitely many values and any Le´on-Perron operator P, we construct two stationaryb Markov chain{Xb_i}i≥1 driven by Pb and{Yb_i}i≥1 driven by Qb such thatYb_i = f(Xb_i). Then we use the large deviation results of {Xb_i}i≥1 and {Yb_i}i≥1 in Lemma2.3 to establish the equivalence between the norms ofE^{tf /2}P Eb ^{tf /2} andE^ty/2QEb ^ty/2.

Lemma 3.2. Let Pb = λI + (1− λ)Π be a Le´on-Perron operator on a general state space X. Let f : X → {y_j : j = 1, . . . , k} be a simple function taking k possible values. Define Qb = λI+ (1−λ)1µ⁰, with µ_j = π({x : f(x) = y_j}), as a transition probability matrix on the finite state space Y = {y_j : j = 1, . . . , k}. Let E^ty denote the diagonal matrix with elements{e^ty^j : j= 1, . . . , k}. Then

|||E^{tf /2}P Eb ^{tf /2}|||π =|||E^ty/2QEb ^ty/2|||µ.

Proofof Lemma 3.2. Construct two stationary Markov chains{Xb_i}i≥1

and {Yb_i}i≥1 in the following way. Let {B_i}i≥1 and {Z_i}i≥1 be sequences of i.i.d. Bernoulli random variables with success probabilityλand i.i.d. random variables followingπ, respectively. It is easy to verify that

Xb₁=Z₁, Xb_i =B_iXb_i₋₁+ (1−B_i)Z_i, ∀i≥2;

Yb₁ =f(Z₁), Yb_i =B_iYb_i₋₁+ (1−B_i)f(Z_i), ∀i≥2.

are stationary Markov chains driven by Pb and Q, respectively; and thatb Ybi =f(Xbi) for i≥1. Putting them together with Lemma 2.3yields

log|||E^{tf /2}P Eb ^{tf /2}|||π = lim

n→∞logE

"

exp t Xn i=1

f(Xb_i)

!#

= lim

n→∞logE

"

exp t Xn i=1

Yb_i

!#

= log|||E^ty/2QEb ^ty/2|||_µ.

We now refine the arguments inLezaud(1998a) to bound|||E^ty/2QEb ^ty/2|||_µ, and give a tighter bound for the coefficient β⁽ⁿ⁾ in Kato’s expansion than Lezaud (1998a) andPaulin (2015). This tighter bound is stated in Lemma 3.3and eventually leads to the improvement over their inequalities.

(15)

Lemma 3.3. Let P be the transition probability matrix of a reversible, finite-state-space, irreducible Markov chain with invariant distributionπ and right L2-spectral gap 1−λ₊, i.e. the second largest eigenvalue of P is λ₊. Let f :X → [−c,+c] be a bounded function withπ(f) = 0 and π(f²) =σ². If λ₊∈[0,1)then, for any 0≤t <(1−λ₊)/5c,

|||E^{tf /2}P E^{tf /2}|||π ≤exp σ²

c²(e^tc−1−tc) +σ²|||P −Π|||_πt² 1−λ₊−5ct

. Proofof Theorem 3.3. Note that each element of matrixE^{tf /2}P E^{tf /2} is non-negative. By Perron-Frobenius theorem,|||E^{tf /2}P E^{tf /2}|||π is equal to its largest eigenvalue. P E^tf is similar to E^{tf /2}P E^{tf /2}, and thus shares the same eigenvalues. It follows that

|||E^{tf /2}P E^{tf /2}|||_π =β(t) := the largest eigenvalue of P E^tf.

Recall that D is the diagonal matrix with elements {f(x) : x ∈ X }, and Z⁽⁰⁾ =−Π and Z⁽ⁿ⁾ =Zⁿ forn≥1 with Z =P_∞

n=0(Pⁿ−Π). By Kato’s perturbation theory,

β(t) =β⁽⁰⁾+β⁽¹⁾t+β⁽²⁾t²+. . . , ∀ |t| ≤ 1−λ+

(3−λ₊)c, whereβ⁽⁰⁾ = 1, β⁽¹⁾ = 0, β⁽²⁾=σ_asy² /2, and in general

β⁽ⁿ⁾= Xn p=1

1 p

X

v₁+· · ·+v_p=n, v_i ≥1 k₁+· · ·+k_p =p−1, k_j ≥0

−tr P D^v¹Z^(k¹⁾. . . P D^v^pZ^(k^p⁾ v1!. . . vp! .

Proceed to boundβ⁽ⁿ⁾ forn≥3. Consider two cases.

Case I:p= 1. Then v₁=n,k₁ = 0, thus the above summand is equal to tr (P DⁿΠ)

n! = π(fⁿ) n! .

Case II:p≥2. Sincek₁+. . . k_p =p−1, there exists at least one indexjsuch that k_j is zero. Supposej₁ is the lowest of such indices. Let (k₁⁰, . . . , k_p⁰) = (k_j₁₊₁, . . . , k_p, k₁, . . . , k_j₁) be the cyclic rotation of the indices. Correspond-

(16)

ingly, let (v₁⁰, . . . , v⁰_p) = (vj1+1, . . . , vp, v1, . . . , vj1). Then

−tr

P D^v¹Z^(k¹⁾P D^v¹Z^(k²⁾. . . P D^v^pZ^(k^p⁾

=−tr

P D^v¹⁰Z^(k⁰¹⁾P D^v⁰²Z^(k⁰²⁾. . . P D^v^p⁰Z^(k^p⁰⁾

= tr

P D^v⁰¹Z^(k⁰¹⁾P D^v²⁰Z^(k²⁰⁾. . . P D^v⁰^p−1Z^(k⁰^p−1⁾P D^v^p⁰Π

= tr

D^v⁰¹Z^(k⁰¹⁾P D^v⁰²Z^(k⁰²⁾. . . P D^v⁰^p⁻¹Z^(k^p⁰⁻¹⁾P D^v⁰^pΠP

= tr

D^v⁰¹Z^(k⁰¹⁾P D^v⁰²Z^(k⁰²⁾. . . P D^v⁰^p−1Z^(k^p−1⁰ ⁾P D^v⁰^pΠ

=hf, D^v¹⁰⁻¹Z^(k¹⁰⁾P D^v²⁰Z^(k⁰²⁾. . . P D^v⁰^p−1Z^(k⁰^p−1⁾P D^v^p⁰⁻¹fiπ.

On the other hand,k⁰₁+· · ·+k⁰_p₋₁ =p−1≥1, implying that at least one k_j⁰₂ with indexj2 ∈ {1, . . . , p−1} is non-zero. FromZΠ = 0, it follows that Z^(k^j⁰²⁾P =Z^(k^j⁰²⁾(P−Π). Thus,

|hf, D^v¹⁰⁻¹Z^(k¹⁰⁾P D^v²⁰Z^(k⁰²⁾. . . Z^(k^j⁰²⁾P . . . P D^v⁰^p⁻¹Z^(k^p⁰⁻¹⁾P D^v⁰^p⁻¹fiπ|

=|hf, D^v⁰¹⁻¹Z^(k⁰¹⁾P D^v⁰²Z^(k⁰²⁾. . . Z^(k⁰^j²⁾(P−Π). . . P D^v^p−1⁰ Z^(k^p−1⁰ ⁾P D^v⁰^p⁻¹fiπ|

≤σ²|||P−Π|||_π|||D|||ⁿ_π⁻²|||Z|||^p⁻¹≤σ²|||P−Π|||_πcⁿ⁻²(1−λ₊)⁻⁽ⁿ⁻¹⁾, where the last step uses the facts that|||D|||π ≤c, that|||Z|||π ≤(1−λ₊)⁻¹ and the assumption thatλ₊≥0 (thus (1−λ₊)⁻¹≥1). Note that

# (

(v₁, . . . , v_p) : Xp i=1

v_i =n, v_i≥1 )

=

n−1 p−1

,

#



(k₁, . . . , k_p) : Xp j=1

k_j =p−1, k_j ≥0



=

2p−2 p−1

.

and that, byLezaud (1998a), pp. 856, Xn

p=1

1 p

n−1 p−1

2p−2 p−1

≤5ⁿ⁻², ∀n≥3.

Collecting these pieces together yields

|β⁽ⁿ⁾| ≤ π(fⁿ)

n! + 5ⁿ⁻²×σ²|||P−Π|||_πcⁿ⁻² (1−λ₊)ⁿ⁻¹

= π(fⁿ)

n! +σ²|||P −Π|||_π 5c

5c 1−λ₊

n−1

, ∀ n≥3.

(17)

This inequality also holds forn= 2, as β⁽²⁾ = σ_asy²

2 ≤σ² 1

2 + λ₊ 1−λ₊

≤ σ²

2 +|||P−Π|||π × σ² 1−λ₊. It follows that, for any 0≤ t < (1−λ₊)/5c < (1−λ₊)/(3−λ₊)c, β(t) is bounded as

β(t) =β⁽⁰⁾+β⁽¹⁾t+ X∞ n=2

β⁽ⁿ⁾tⁿ

≤1 + 0 + X∞ n=2

π(fⁿ)tⁿ

n! +|||P−Π|||_π× X∞ n=2

σ²t 5c

5ct 1−λ₊

n−1

≤exp X∞ n=2

π(fⁿ)tⁿ

n! +|||P−Π|||_π× X∞ n=2

σ²t 5c

5ct 1−λ₊

n−1! .

Putting it together with the facts that, for anyt≥0 X∞

n=2

π(fⁿ)tⁿ

n! ≤

X∞ n=2

π(f²)cⁿ⁻²tⁿ n! = σ²

c² X∞ n=2

cⁿtⁿ n! = σ²

c² e^tc−1−tc ,

and that, for any 0≤t <(1−λ+)/5c X∞

n=2

σ²t 5c

5ct 1−λ₊

n−1

= σ²t² 1−λ₊−5ct. completes the proof.

3.2. Proofs of Theorems1.1and1.2. We present the proofs to Theorems 1.1and 1.2.

Proofof Theorem 1.1. The first claim follows from Lemmas 2.1 and 3.1. For the second claim, we first consider the case of λ >0. Let

g₁(t) =

(0 ift≤0

σ²

c² e^tc−1−tc

ift >0 and

g₂(t) =







0 ift≤0

σ²λt²

1−λ−5ct if 0< t < ¹⁻_5c^λ

∞ ift≥ ¹⁻_5c^λ .

(18)

Both are closed proper convex functions and admit Frechet conjugates g^∗₁(₁) =

(σ²

c²h1(c1/σ²) if1 ≥0, +∞ if₁ <0, withh₁(u) = (1 +u) log(1 +u)−u; and

g₂^∗(₂) =

( ₍₁₋_λ)2 2

2λσ²h2(5c2/λσ²) if₂ ≥0, +∞ if₂ <0, withh₂(u) =√

1 +u+u/2 + 1. By the Chernoff bound,

−logP 1 n

Xn i=1

f_i(X_i)>

!

≥n×sup{t−g₁(t)−g₂(t) : t >0}. Since g₁(t) = O(t²) and g₂(t) = O(t²) as t→ 0, t−g₁(t)−g₂(t) > 0 for somet >0; and, for t≤0,t−g₁(t)−g₂(t)≤0. Hence

sup{t−g1(t)−g2(t) : t >0}= sup{t−g1(t)−g2(t) : t∈R}= (g1+g2)^∗().

Recall that both g₁ and g₂ are closed proper convex functions. Thus, the Fenchel conjugate of their sum is the infimal convolution of their conjugates g₁^∗ andg₂^∗.

(g₁+g₂)^∗() = inf{g₁^∗(₁) +g₂^∗(₂) : ₁+₂ =}

= inf (σ²

c²h₁c₁ σ²

+ (1−λ)²₂

2λσ²h2(^5c_λσ2²) : ₁+₂ =, ₁ ≥0, ₂≥0 )

Using the facts thath1(u)≥u²/2(1 +u/3) andh2(u)≤2 +uforu≥0, we lower bound this term as

(g₁+g₂)^∗()≥inf



 ²₁

2 σ²+^c₃¹+ ²₂ 2

2λ

1−λσ²+^5c₁₋_λ² : ₁+₂ =, ₁ ≥0, ₂ ≥0



 Using the fact that _a²¹ +_b²² ≥ ⁽¹_a+b⁺²⁾² for any positive ₁, ₂, a, b, we further

bound

(g1+g2)^∗()≥inf





(1+2)² 2 σ²+^c₃¹

+ 2

1−λ2λ σ²+ ^5c_1−λ² : 1+2 =, 1 ≥0, 2≥0





= ²

2

1+λ

1−λσ²+₁^5c₋_λ.