HAL Id: hal-02292419
https://hal.archives-ouvertes.fr/hal-02292419
Submitted on 19 Sep 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Gaussian Concentration bound for potentials satisfying Walters condition with subexponential continuity rates
J.-R. Chazottes, J. Moles, E. Ugalde
To cite this version:
J.-R. Chazottes, J. Moles, E. Ugalde. Gaussian Concentration bound for potentials satisfying Walters condition with subexponential continuity rates. Nonlinearity, IOP Publishing, 2019. hal-02292419
Gaussian Concentration bound for potentials satisfying Walters condition with subexponential continuity rates

J.-R. Chazottes*1, J. Moles†2, and E. Ugalde‡2

1 Centre de Physique Théorique, École Polytechnique, CNRS, 91128 Palaiseau, France. Email: chazottes@cpht.polytechnique.fr
2 Instituto de Física, Universidad Autónoma de San Luis Potosí, S.L.P., 78290 México. Emails: molesjdn@ifisica.uaslp.mx, ugalde@ifisica.uaslp.mx

Dated: September 10, 2019
Abstract
We consider the full shift $T:\Omega\to\Omega$ where $\Omega=A^{\mathbb N}$, $A$ being a finite alphabet. For a class of potentials which contains in particular potentials $\phi$ with variation decreasing like $O(n^{-\alpha})$ for some $\alpha>2$, we prove that their corresponding equilibrium state $\mu_\phi$ satisfies a Gaussian concentration bound. Namely, we prove that there exists a constant $C>0$ such that, for all $n$ and for all separately Lipschitz functions $K(x^0,\dots,x^{n-1})$, the exponential moment of $K(x,\dots,T^{n-1}x)-\int K(y,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y)$ is bounded by $\exp\big(C\sum_{i=0}^{n-1}\mathrm{Lip}_i(K)^2\big)$. The crucial point is that $C$ is independent of $n$ and $K$. We then derive various consequences of this inequality. For instance, we obtain bounds on the fluctuations of the empirical frequency of blocks, the speed of convergence of the empirical measure, and the speed of Markov approximation of $\mu_\phi$. We also derive an almost-sure central limit theorem.
Keywords: concentration inequalities, empirical measure, Kantorovich distance, Wasserstein distance, d-bar distance, relative entropy, Markov approximation, almost-sure central limit theorem.
* JRC benefited from an ECOS Nord project for his stay at San Luis Potosí.
† JM benefited from an ECOS Nord project for his stay at École Polytechnique.
‡ EU acknowledges École Polytechnique for financial support (one-month stay).
Contents
1 Introduction
2 Setting and preliminary results
3 Main result and applications
  3.1 Gaussian concentration bound
  3.2 Related works
  3.3 Applications
    3.3.1 Birkhoff sums
    3.3.2 Empirical frequency of blocks
    3.3.3 Hitting times and entropy
    3.3.4 Speed of convergence of the empirical measure
    3.3.5 Relative entropy, $\bar d$-distance and speed of Markov approximation
    3.3.6 Shadowing of orbits
    3.3.7 Almost-sure central limit theorem
4 Proof of Theorem 3.1
  4.1 Some preparatory results
  4.2 Proof of Theorem 3.1
1 Introduction
We consider the full shift $T:\Omega\to\Omega$ where $\Omega=A^{\mathbb N}$, $A$ being a finite alphabet. Given an ergodic measure $\mu$ on $\Omega$ and a continuous observable $f:\Omega\to\mathbb R$, we know by Birkhoff's ergodic theorem that $n^{-1}S_nf(x)$ converges, for $\mu$-almost every $x$, to $\int f\,\mathrm d\mu$. (We use the standard notation $S_nf=f+f\circ T+\cdots+f\circ T^{n-1}$.) To refine this result, we need more assumptions on $\mu$ and $f$. For instance, if $\mu=\mu_\phi$ is the equilibrium state for a Lipschitz potential $\phi:\Omega\to\mathbb R$ and $f$ is also a Lipschitz function, then the following central limit theorem holds:
$$\lim_{n\to\infty}\mu_\phi\left(x:\frac{S_nf(x)-n\int f\,\mathrm d\mu_\phi}{\sqrt n}\le u\right)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{u}e^{-\frac{\xi^2}{2\sigma^2}}\,\mathrm d\xi\qquad(1.1)$$
for all $u\in\mathbb R$, where $\sigma^2=\sigma_f^2$ is the variance of the process $\{f(T^nx)\}_{n\ge0}$, where $x$ is distributed according to $\mu_\phi$ (see footnote 1). This result says in essence that the fluctuations of $S_nf(x)-n\int f\,\mathrm d\mu_\phi$ are with high probability of order $\sqrt n$, when $n\to\infty$. Fluctuations of order $n$, referred to as 'large deviations', are unlikely to appear. Indeed, for instance one has
$$\mu_\phi\left(x:\frac1n S_nf(x)\ge\int f\,\mathrm d\mu_\phi+u\right)\asymp e^{-nI\left(u+\int f\,\mathrm d\mu_\phi\right)}\qquad(1.2)$$
where $u\ge0$ and $I(u)\ge0$ is the so-called 'rate function', which is (strictly) convex, such that $I\big(\int f\,\mathrm d\mu_\phi\big)=0$, and equal to $+\infty$ outside a certain finite interval $(\underline u_f,\overline u_f)$ (see footnote 2). Of course, both the central limit theorem and the large deviation asymptotics have been obtained for more general potentials, and for more general 'chaotic' dynamical systems. For a fairly recent review on probabilistic properties of nonuniformly hyperbolic dynamical systems modeled by Young towers, we refer to [4].

Footnote 1: $\sigma_f^2=\int f^2\,\mathrm d\mu_\phi-\big(\int f\,\mathrm d\mu_\phi\big)^2+2\sum_{\ell\ge1}\big(\int f\cdot f\circ T^\ell\,\mathrm d\mu_\phi-\big(\int f\,\mathrm d\mu_\phi\big)^2\big)$.
In this paper, we are interested in concentration inequalities which describe the fluctuations of observables of the form $K(x,Tx,\dots,T^{n-1}x)$ around their average. The only restriction on $K$ is that it has to be separately Lipschitz. By this we mean that, for all $i=0,\dots,n-1$, there exists a constant $\mathrm{Lip}_i(K)$ with
$$|K(x^0,\dots,x^i,\dots,x^{n-1})-K(x^0,\dots,x'^i,\dots,x^{n-1})|\le\mathrm{Lip}_i(K)\,d(x^i,x'^i)$$
for all points $x^0,\dots,x^i,\dots,x^{n-1},x'^i$ in $\Omega$, where $d$ is the usual distance on $\Omega$ (see (2.1)). So $K$ can be nonlinear and implicitly defined. Of course, such a class contains partial sums of Lipschitz functions, namely functions of the form $K(x^0,\dots,x^{n-1})=f(x^0)+\cdots+f(x^{n-1})$, for which $\mathrm{Lip}_i(K)=\mathrm{Lip}(f)$ for all $i$.
Besides considering very general observables, the other essential characteristic of concentration inequalities is that they are valid for all $n$, contrary to the above two results which are valid only in the limit $n\to\infty$. More precisely, we shall prove the following 'Gaussian concentration bound'. There exists a constant $C$ such that, for all $n$ and for all separately Lipschitz functions $K(x^0,\dots,x^{n-1})$, we have
$$\int\exp\big(K(x,Tx,\dots,T^{n-1}x)\big)\,\mathrm d\mu_\phi(x)\le\exp\left(\int K(x,Tx,\dots,T^{n-1}x)\,\mathrm d\mu_\phi(x)\right)\exp\left(C\sum_{j=0}^{n-1}\mathrm{Lip}_j(K)^2\right).\qquad(1.3)$$
The crucial point is that $C$ is independent of $n$ and $K$. By a standard argument (see below), the previous inequality implies that for all $u>0$
$$\mu_\phi\left(x:K(x,Tx,\dots,T^{n-1}x)-\int K(y,Ty,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y)\ge u\right)\le\exp\left(-\frac{u^2}{4C\sum_{i=0}^{n-1}\mathrm{Lip}_i(K)^2}\right).\qquad(1.4)$$
Footnote 2: For two positive sequences $(a_n)$, $(b_n)$, $a_n\asymp b_n$ means that $\lim_n(1/n)\log a_n=\lim_n(1/n)\log b_n$.
The Gaussian concentration bound (1.3) is known for Lipschitz potentials [7]. We shall prove that it remains true for a large subclass of potentials $\phi$ satisfying Walters condition. For instance, the bound holds for a potential whose variation is $O(n^{-\alpha})$ for some $\alpha>2$. The proof of our result relies on two main ingredients. First, we start with a classical decomposition of $K-\int K$ as a telescopic sum of martingale differences. Second, we have to do a second telescoping to use Ruelle's Perron-Frobenius operator. But we do not have a spectral gap anymore as in [7] (in the case of Lipschitz potentials). Instead, we use a result of V. Maume-Deschamps [18] based on Birkhoff cones.

We apply the Gaussian concentration bound and its consequences, like (1.4), to various observables. On the one hand, we obtain concentration bounds for previously studied observables; we get the same bounds, but they are no longer limited to equilibrium states with Lipschitz potentials. On the other hand, we consider observables not considered before. Even when $K(x,\dots,T^{n-1}x)=S_nf(x)$, we get a non-trivial bound. We then obtain a control on the fluctuations of the empirical frequency of blocks $a_0,\dots,a_{k-1}$ around $\mu([a_0,\dots,a_{k-1}])$, uniformly in $a_0,\dots,a_{k-1}\in A^k$. We then consider an estimator of the entropy of $\mu_\phi$ based on hitting times. The next application is about the speed of convergence of the empirical measure $(1/n)\sum_{i=0}^{n-1}\delta_{T^ix}$ towards $\mu_\phi$ in Wasserstein distance. Then we obtain an upper bound for the $\bar d$-distance between any shift-invariant probability measure and $\mu_\phi$. This distance is bounded by the square root of their relative entropy, times a constant. A consequence of this inequality is a bound for the speed of convergence of the Markov approximation of $\mu_\phi$ in $\bar d$-distance. Then we quantify the 'shadowing' of an orbit by another one which has to start in a subset of $\Omega$ with $\mu_\phi$-measure $1/3$, say. Finally, we prove an almost-sure version of the central limit theorem. This application shows in particular that concentration inequalities can also be used to obtain limit theorems.
2 Setting and preliminary results
Let $\Omega=A^{\mathbb N}$ where $A$ is a finite set. We denote by $x=x_0x_1\dots$ the elements of $\Omega$ (hence $x_i\in A$), and by $T$ the shift map: $(Tx)_k=x_{k+1}$, $k\in\mathbb N$. (We use upper indices instead of lower indices because we will need to consider bunches of points in $\Omega$, e.g., $x^0,x^1,\dots,x^p$, $x^i\in\Omega$.) We use the classical distance
$$d_\theta(x,y)=\theta^{\inf\{k\,:\,x_k\ne y_k\}}\qquad(2.1)$$
where $\theta\in(0,1)$ is some fixed number. Probability measures are defined on the Borel sigma-algebra of $\Omega$, which is generated by cylinder sets. Let $\phi:\Omega\to\mathbb R$ be a continuous potential, which means that
$$\mathrm{var}_n(\phi):=\sup\{|\phi(x)-\phi(y)|:x_i=y_i,\ 0\le i\le n-1\}\xrightarrow[n\to\infty]{}0.$$
The sequence $(\mathrm{var}_n(\phi))_{n\ge1}$ is the modulus of continuity of $\phi$, and it is called the 'variation' of $\phi$ in our context. By the way, we denote by $C(\Omega)$ the Banach space of real-valued continuous functions on $\Omega$ equipped with the supremum norm $\|\cdot\|_\infty$. We put further restrictions on $\phi$, namely that it must satisfy the Walters condition [22]. For $x,y$ in $\Omega$ let
$$W(\phi,x,y)=\sup_{n\in\mathbb N}\,\sup_{a\in A^n}|S_n\phi(ax)-S_n\phi(ay)|.$$
We assume that $W(\phi,x,y)$ exists and that there exists $W(\phi)>0$ such that
$$\sup_{x,y\in\Omega}W(\phi,x,y)\le W(\phi).\qquad(2.2)$$
Now for $p\in\mathbb N$ let
$$W_p(\phi):=\sup\{W(\phi,x,y):x_i=y_i,\ 0\le i\le p-1\}.$$

Definition 2.1. $\phi$ is said to satisfy Walters' condition if $(W_p(\phi))_{p\in\mathbb N}$ is a strictly positive sequence and decreases to $0$ as $p\to\infty$.
We now make several remarks on Walters' condition. First, observe that locally constant potentials do not satisfy this condition because $W_p(\phi)=0$ for all $p$ larger than some $p_0$. But one can in fact work with any strictly positive sequence $(\widetilde W_p(\phi))_{p\in\mathbb N}$ decreasing to zero such that $W_p(\phi)\le\widetilde W_p(\phi)$ for all $p$, e.g., $\max(W_p(\phi),\eta^p)$ for some fixed $\eta\in(0,1)$. Second, one easily checks that
$$\mathrm{var}_{p+1}(\phi)\le W_p(\phi)\le\sum_{k=p+1}^{\infty}\mathrm{var}_k(\phi),\quad p\in\mathbb N.\qquad(2.3)$$
Hence the set of potentials satisfying Walters' condition contains the set of potentials with summable variation. In particular, $(W_p(\phi))_p$ is bounded above by a geometric sequence if and only if $(\mathrm{var}_p(\phi))_p$ is also bounded above by a geometric sequence. This corresponds to the case of Lipschitz or Hölder potentials (with respect to $d_\theta$).
Now define Ruelle's Perron-Frobenius operator $P_\phi:C(\Omega)\to C(\Omega)$ as
$$P_\phi f(x)=\sum_{Ty=x}f(y)\,e^{\phi(y)}.$$
The next step is to define a function space preserved by $P_\phi$ and on which it has good spectral properties. We take the space of Lipschitz functions with respect to a new distance $d_\phi$ built out of $\phi$ as follows.

Definition 2.2 (The distance $d_\phi$). For $x,y\in\Omega$ let
$$d_\phi(x,y)=W_p(\phi)\ \text{ if }\ d_\theta(x,y)=\theta^p,\qquad\text{and}\qquad d_\phi(x,x)=0.$$

Now define
$$\mathcal L_\phi=\{f\in C(\Omega):\exists M>0\text{ such that }\mathrm{var}_n(f)\le M\,W_n(\phi),\ n=1,2,\dots\}$$
and
$$\mathrm{Lip}_\phi(f)=\sup\left\{\frac{|f(x)-f(y)|}{d_\phi(x,y)}:x\ne y\right\}=\sup\left\{\frac{\mathrm{var}_n(f)}{W_n(\phi)}:n\in\mathbb N\right\}.$$
One can then define a norm on $\mathcal L_\phi$, making it a Banach space, by setting $\|f\|_{\mathcal L_\phi}=\|f\|_\infty+\mathrm{Lip}_\phi(f)$.
Remark 2.1. The usual Banach space of Lipschitz functions is defined as follows. Let
$$\mathcal L_\theta=\{f\in C(\Omega):\exists M>0\text{ such that }\mathrm{var}_n(f)\le M\theta^n,\ n=1,2,\dots\}$$
and
$$\mathrm{Lip}_\theta(f)=\sup\left\{\frac{|f(x)-f(y)|}{d_\theta(x,y)}:x\ne y\right\}=\sup\left\{\frac{\mathrm{var}_n(f)}{\theta^n}:n\in\mathbb N\right\}.$$
The canonical norm making $\mathcal L_\theta$ a Banach space is $\|f\|_{\mathcal L_\theta}=\|f\|_\infty+\mathrm{Lip}_\theta(f)$. In view of (2.3), if we have $W_n(\phi)=O(\theta^n)$, then $\mathcal L_\theta=\mathcal L_\phi$. If we now have, for instance, $W_n(\phi)=O(n^{-q})$ for some $q>0$, then we get a bigger space which contains in particular all functions $f$ such that $\mathrm{var}_n(f)=O(n^{-r})$ with $r\ge q$.
The following result is instrumental to this article. In brief, it tells us that a potential $\phi$ satisfying Walters' condition has a unique equilibrium state, which will be denoted by $\mu_\phi$, and gives a speed of convergence for the properly normalized iterates of the associated Ruelle's Perron-Frobenius operator. The first part of the theorem is due to Walters, while the second one is due to Maume-Deschamps and can be found in her PhD thesis [18, Chapter I.2]. Unfortunately, her result was not published, even though it is much sharper than the result in [16].
Theorem 2.1 ([22], [18]). Let $\phi:\Omega\to\mathbb R$ satisfy Walters' condition as above. Then the following holds.

A. There exists a unique triplet $(h_\phi,\lambda_\phi,\nu_\phi)$ such that $h_\phi\in\mathcal L_\phi$ and is strictly positive, $\|\log h_\phi\|_\infty<\infty$, $\lambda_\phi>0$, and $\nu_\phi$ is a fully supported probability measure such that $\int h_\phi\,\mathrm d\nu_\phi=1$. Moreover, $P_\phi h_\phi=\lambda_\phi h_\phi$ and $P_\phi^*\nu_\phi=\lambda_\phi\nu_\phi$, and $\phi$ has a unique equilibrium state $\mu_\phi=h_\phi\nu_\phi$, which is mixing (see footnote 3).

B. There exists a positive sequence $(\epsilon_n)_{n\in\mathbb N}$ converging to zero such that, for any $f\in\mathcal L_\phi$,
$$\left\|\frac{P_\phi^nf}{\lambda_\phi^n}-h_\phi\int f\,\mathrm d\nu_\phi\right\|_\infty\le C_{(2.1)}\,\epsilon_n\,\|f\|_{\mathcal L_\phi},\quad\forall n\in\mathbb N.\qquad(2.4)$$
Moreover, one has the following behaviors:
1. If $W_n(\phi)=O(\eta^n)$ for some $\eta\in(0,1)$, then there exists $\eta'\in(0,1)$ such that $\epsilon_n=O(\eta'^{\,n})$.
2. If $W_n(\phi)=O(n^{-\alpha})$ for some $\alpha>0$, then $\epsilon_n=O(n^{-\alpha})$.
3. If $W_n(\phi)=O(\theta^{(\log n)^\alpha})$ for some $\theta\in(0,1)$ and $\alpha>1$, then, for any $\varepsilon>0$, $\epsilon_n=O\big(\theta^{(\log n)^{\alpha-\varepsilon}}\big)$.
4. If $W_n(\phi)=O(e^{-cn^\alpha})$ for some $c>0$ and $\alpha\in(0,1)$, then there exists $c'>0$ such that $\epsilon_n=O\big(e^{-c'n^{\frac{\alpha}{\alpha+1}}}\big)$.

Footnote 3: This means that for any pair of cylinders $B,B'$, $\lim_{n\to\infty}\mu_\phi(B\cap T^{-n}B')=\mu_\phi(B)\mu_\phi(B')$. In particular $\mu_\phi$ is ergodic.

The fact that $\mu_\phi$ is an equilibrium state means that it maximizes the functional $\nu\mapsto h(\nu)+\int\phi\,\mathrm d\nu$ over the set of shift-invariant probability measures on $\Omega$, where $h(\nu)$ is the entropy of $\nu$; the maximum is equal to the topological pressure $P(\phi)$ of $\phi$ (see e.g. [15]), and we have $P(\phi)=\log\lambda_\phi$.
Let us give examples of potentials. First consider $A=\{-1,1\}$ and $p>1$, and define
$$\phi(x)=-\sum_{n\ge2}\frac{x_0x_{n-1}}{n^p}.$$
One can check that $W_n(\phi)=O(n^{-p+2})$. This is the analog of the so-called long-range Ising model on $\mathbb N$. Let us now take $A=\{0,1\}$ and let $[0^k1]=\{x\in\Omega:x_i=0,\ 0\le i\le k-1,\text{ and }x_k=1\}$. Let $(v_n)$ be a monotone decreasing sequence of real numbers converging to $0$ and define
$$\phi(x)=\begin{cases}v_k&\text{if }x\in[0^k1]\\ 0&\text{if }x=(0,0,\dots)\\ 0&\text{otherwise}.\end{cases}$$
One can check that $\mathrm{var}_n(\phi)=v_n$. This example is taken from [19].
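To make the order of these variations concrete, here is a small brute-force check (an illustrative script, not part of the paper): we truncate the Ising-type series at a finite length $N$ and verify that $\mathrm{var}_n(\phi)$ equals the tail sum $\sum_{m\ge n+1}2/m^p$ over the truncated range, which is $O(n^{-p+1})$; via the right-hand side of (2.3), summing these tails in turn gives $W_n(\phi)=O(n^{-p+2})$.

```python
import itertools

P, N = 3.0, 6  # exponent p and truncation length; illustrative values only

def phi(x):
    # truncated long-range Ising-type potential: -sum_{2 <= m < N} x_0 x_{m-1} / m^P
    return -sum(x[0] * x[m - 1] / m ** P for m in range(2, N))

def var_n(n):
    # sup |phi(x) - phi(y)| over spin configurations agreeing on coordinates 0..n-1
    configs = list(itertools.product((-1, 1), repeat=N))
    return max(abs(phi(x) - phi(y))
               for x in configs for y in configs if x[:n] == y[:n])

for n in range(1, N):
    # only the terms with m - 1 >= n can differ, each by at most 2/m^P
    tail = sum(2 / m ** P for m in range(max(n + 1, 2), N))
    print(n, round(var_n(n), 6), round(tail, 6))
```

The brute-force supremum and the closed-form tail agree exactly, because the extremal pair of configurations flips every free spin.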
Remark 2.2. Let us briefly explain how we can interpret an equilibrium state for a non-Lipschitz potential as an absolutely continuous invariant measure of a piecewise expanding map of the unit interval with a Markov partition. It is well known that a uniformly expanding map $S$ of the unit interval with a finite Markov partition which is piecewise $C^{1+\eta}$, for some $\eta>0$, can be coded by a subshift of finite type $(\Omega,T)$ over a finite alphabet. Then $-\log|S'|$ induces a potential $\phi$ on $\Omega$ which is Lipschitz (with respect to $d_\theta$). The pullback of $\mu_\phi$ is then the unique absolutely continuous invariant probability measure for $S$. In [10], the authors showed that, given $\phi$ which is not Lipschitz, one can construct a uniformly expanding map of the unit interval with a finite Markov partition which is piecewise $C^1$, but not piecewise $C^{1+\eta}$ for any $\eta>0$, and such that the pullback of $\mu_\phi$ is the Lebesgue measure.
3 Main result and applications
3.1 Gaussian concentration bound
We can now state our main theorem, whose proof is deferred to Section 4. We start with the definition of separately $d_\theta$-Lipschitz functions.

Definition 3.1. A function $K:\Omega^n\to\mathbb R$ is said to be separately $d_\theta$-Lipschitz if, for all $i$, there exists a constant $\mathrm{Lip}_{\theta,i}(K)$ with
$$|K(x^0,\dots,x^i,\dots,x^{n-1})-K(x^0,\dots,x'^i,\dots,x^{n-1})|\le\mathrm{Lip}_{\theta,i}(K)\,d_\theta(x^i,x'^i)$$
for all points $x^0,\dots,x^i,\dots,x^{n-1},x'^i$ in $\Omega$.

Theorem 3.1. Suppose that $\phi$ satisfies one of the following conditions:
1. $W_n(\phi)=O(\theta^n)$ (that is, $\phi$ is $d_\theta$-Lipschitz);
2. $W_n(\phi)=O(n^{-\alpha})$ for some $\alpha>1$;
3. $W_n(\phi)=O(\theta^{(\log n)^\alpha})$ for some $\theta\in(0,1)$ and $\alpha>1$;
4. $W_n(\phi)=O(e^{-cn^\alpha})$ for some $c>0$ and $\alpha\in(0,1)$.
Then the process $(x,Tx,\dots)$, with $x$ distributed according to $\mu_\phi$, satisfies the following Gaussian concentration bound: there exists $C_{(3.1)}>0$ such that for any $n\in\mathbb N$ and for any separately $d_\theta$-Lipschitz function $K:\Omega^n\to\mathbb R$, we have
$$\int\exp\big(K(x,Tx,\dots,T^{n-1}x)\big)\,\mathrm d\mu_\phi(x)\le\exp\left(\int K(x,Tx,\dots,T^{n-1}x)\,\mathrm d\mu_\phi(x)\right)\exp\left(C_{(3.1)}\sum_{j=0}^{n-1}\mathrm{Lip}_{\theta,j}(K)^2\right).\qquad(3.1)$$

Three remarks are in order. First, we conjecture that this theorem is valid under the condition $\sum_n\mathrm{var}_n(\phi)<\infty$. Second, it would be useful to have an explicit formula for $C_{(3.1)}$ in (3.1). Unfortunately, this constant is proportional to $C_{(2.1)}$ (see Theorem 2.1), which is cumbersome since it involves the eigendata of $P_\phi$. Third, for the sake of simplicity, we considered the full shift $A^{\mathbb N}$. In fact, our results remain true if $\Omega\subset A^{\mathbb N}$ is a topologically mixing one-sided subshift of finite type. Moreover, one can extend Theorem 3.1 to bilateral subshifts of finite type by a trick used in [7].

We now give some corollaries of our main theorem that will be used in the section on applications. First, by (2.3) we immediately obtain the following corollary.

Corollary 3.2. If there exists $\alpha>2$ such that $\mathrm{var}_n(\phi)=O\big(\frac{1}{n^\alpha}\big)$, then the Gaussian concentration bound (3.1) holds.
Next, we get the following concentration inequalities from (3.1).

Corollary 3.3. For all $u>0$, we have
$$\mu_\phi\left(x\in\Omega:K(x,Tx,\dots,T^{n-1}x)-\int K(y,Ty,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y)\ge u\right)\le\exp\left(-\frac{u^2}{4C_{(3.1)}\sum_{i=0}^{n-1}\mathrm{Lip}_{\theta,i}(K)^2}\right)\qquad(3.2)$$
and
$$\mu_\phi\left(x\in\Omega:\left|K(x,Tx,\dots,T^{n-1}x)-\int K(y,Ty,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y)\right|\ge u\right)\le2\exp\left(-\frac{u^2}{4C_{(3.1)}\sum_{i=0}^{n-1}\mathrm{Lip}_{\theta,i}(K)^2}\right).\qquad(3.3)$$
Proof. Inequality (3.2) follows by a well-known trick referred to as Chernoff's bounding method [2]. Let us give the proof for completeness. Let $u>0$. For any random variable $Y$, Markov's inequality tells us that $\mathbb P(Y\ge u)\le e^{-\xi u}\,\mathbb E\big[e^{\xi Y}\big]$ for all $\xi>0$. Now let
$$Y=K(x,Tx,\dots,T^{n-1}x)-\int K(y,Ty,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y).$$
Using (3.1) and optimizing over $\xi$, we get (3.2). Inequality (3.3) follows by applying (3.2) to $-K$ and then summing up the two bounds.
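The optimization over $\xi$ is elementary: applying (3.1) to $\xi K$ gives $\mathbb E\,e^{\xi Y}\le e^{CL\xi^2}$ with $L=\sum_i\mathrm{Lip}_{\theta,i}(K)^2$, so $\mathbb P(Y\ge u)\le\inf_{\xi>0}e^{-\xi u+CL\xi^2}$, minimized at $\xi=u/(2CL)$, which yields $e^{-u^2/(4CL)}$. A quick numerical sanity check of this minimization (with arbitrary illustrative constants):

```python
import math

# Chernoff bound: P(Y >= u) <= exp(-xi*u + C*L*xi^2) for every xi > 0,
# where L stands for sum_i Lip_i(K)^2. Minimize the right-hand side on a grid.
u, C, L = 1.7, 0.8, 2.5

xi_star = u / (2 * C * L)                    # optimum from calculus
closed_form = math.exp(-u ** 2 / (4 * C * L))

grid = [xi_star * k / 1000 for k in range(1, 2001)]  # grid containing xi_star
numeric = min(math.exp(-xi * u + C * L * xi ** 2) for xi in grid)

print(numeric, closed_form)  # the two values coincide
```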
The last corollary we want to state is about the variance of any separately $d_\theta$-Lipschitz function.

Corollary 3.4. We have
$$\int\left(K(x,Tx,\dots,T^{n-1}x)-\int K(y,Ty,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y)\right)^2\mathrm d\mu_\phi(x)\le2C_{(3.1)}\sum_{i=0}^{n-1}\mathrm{Lip}_{\theta,i}(K)^2.\qquad(3.4)$$
Proof. To alleviate notations, we simply write $K$ instead of $K(x,Tx,\dots,T^{n-1}x)$, $\int K$ instead of $\int K(y,Ty,\dots,T^{n-1}y)\,\mathrm d\mu_\phi(y)$, and so on and so forth. Applying (3.1) to $\xi K$, where $\xi$ is any real number different from $0$, we get
$$\int\exp\left(\xi\Big(K-\int K\Big)\right)\le\exp\left(C_{(3.1)}\xi^2\sum_{j=0}^{n-1}\mathrm{Lip}_{\theta,j}(K)^2\right).$$
Now by Taylor expansion we get
$$1+\frac{\xi^2}{2}\int\Big(K-\int K\Big)^2+o(\xi^2)\le1+C_{(3.1)}\xi^2\sum_{j=0}^{n-1}\mathrm{Lip}_{\theta,j}(K)^2+o(\xi^2).$$
Dividing by $\xi^2$ on both sides and then taking the limit $\xi\to0$, we obtain the desired inequality.
Although we were not able to prove the Gaussian concentration bound for separately $d_\phi$-Lipschitz functions, for many applications separately $d_\theta$-Lipschitz functions are more natural. Furthermore, there is a notable class of separately $d_\phi$-Lipschitz functions, namely Birkhoff sums of the potential itself, for which our theorem holds. Indeed, when $\phi\in\mathcal L_\phi$, the function $K(x,\dots,T^{n-1}x)=S_n\phi(x)$ is obviously separately $d_\phi$-Lipschitz and $\mathrm{Lip}_{\phi,j}(K)=\mathrm{Lip}_\phi(\phi)$ for all $j$. We have the following result.

Theorem 3.5. Under the hypotheses of Theorem 3.1, there exists $C_{(3.5)}>0$ such that, for any $\psi\in\mathcal L_\phi$, for all $u>0$, and for all $n\in\mathbb N$, we have
$$\mu_\phi\left(x\in\Omega:\frac1nS_n\psi(x)-\int\psi\,\mathrm d\mu_\phi\ge u\right)\le\exp\left(-\frac{nu^2}{4C_{(3.5)}\mathrm{Lip}_\phi(\psi)^2}\right).\qquad(3.5)$$

The proof is left to the reader. The main (simple) modification lies in the proof of Lemma 4.3, in which considering a Birkhoff sum of a $d_\phi$-Lipschitz function works fine, whereas we are stuck for a general separately $d_\phi$-Lipschitz function. We will apply this result with $\psi=-\phi$ to derive concentration bounds for hitting times. Note that under the assumptions of this theorem, $\{\psi(T^nx)\}_{n\ge0}$ satisfies the central limit theorem [18, Chapter 2].
3.2 Related works
The novelty here is to prove a Gaussian concentration bound for potentials with a variation decaying subexponentially. For $\phi$ Lipschitz, Theorem 3.1 was proved in [7]. The main goal of [7] was then to deal with nonuniformly hyperbolic systems modeled by a Young tower. For a tower with a return time to the base with exponential tails, the authors of [7] proved a Gaussian concentration bound. For polynomial tails, they proved moment concentration bounds. For $C^{1+\eta}$ maps of the unit interval with an indifferent fixed point, which are thus nonuniformly expanding, we are in the latter situation. In view of Remark 2.2 above, we deal here with maps whose derivative is not Hölder continuous, but which are still uniformly expanding.

Let us also mention the paper [14] in which the authors prove a Gaussian concentration bound for $\phi$ of summable variation (whereas we need a bit more than summable variation). Their proof is based on coupling. However, they consider functions $K$ on $A^n$, not on $(A^{\mathbb N})^n=\Omega^n$ as in this paper. For such functions, the analogue of $\mathrm{Lip}_{\theta,i}(K)$ is $\delta_i(K)=\sup\{|K(a_0,\dots,a_i,\dots,a_{n-1})-K(b_0,\dots,b_i,\dots,b_{n-1})|:a_j=b_j,\ \forall j\ne i\}$. It is clear that a Gaussian concentration bound for functions $K:(A^{\mathbb N})^n\to\mathbb R$ implies a Gaussian concentration bound for functions $K:A^n\to\mathbb R$, but the converse is not true.
3.3 Applications
We now give several applications of the Gaussian concentration bound (3.1) and its corollaries. Throughout this section, $\mu_\phi$ is the equilibrium state for a potential $\phi$ satisfying one of the conditions 1-4 in Theorem 3.1.
3.3.1 Birkhoff sums
Let $f:\Omega\to\mathbb R$ be a $d_\theta$-Lipschitz function and define $K(x^0,\dots,x^{n-1})=f(x^0)+\cdots+f(x^{n-1})$, whence $K(x,Tx,\dots,T^{n-1}x)=f(x)+f(Tx)+\cdots+f(T^{n-1}x)=:S_nf(x)$ is the Birkhoff sum of $f$. Clearly, $\mathrm{Lip}_{\theta,i}(K)=\mathrm{Lip}_\theta(f)$ for all $i=0,\dots,n-1$. Applying Corollary 3.3 we immediately get
$$\mu_\phi\left(x:\left|\frac{S_nf(x)}{n}-\int f\,\mathrm d\mu_\phi\right|\ge u\right)\le2\exp\big(-c_{(3.6)}nu^2\big)\qquad(3.6)$$
for all $n\ge1$ and $u\in\mathbb R_+$, where
$$c_{(3.6)}=\frac{1}{4C_{(3.1)}\mathrm{Lip}_\theta(f)^2}.$$
This bound can be compared with the large deviation asymptotics (1.2). We see that it has the right behavior in $n$. Replacing $u$ by $u/\sqrt n$ in (3.6) we get
$$\mu_\phi\left(x:\left|S_nf(x)-n\int f\,\mathrm d\mu_\phi\right|\ge u\sqrt n\right)\le2\exp\big(-c_{(3.6)}u^2\big)$$
for all $n$ and $u>0$. This can be compared with the central limit theorem (1.1); we can see that the previous bound is consistent with that theorem. Note that the central limit theorem is about convergence in law, whereas here we obtain a (non-asymptotic) bound from which one cannot deduce a convergence in law.
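The $e^{-cnu^2}$ scaling in (3.6) is easy to see numerically. The sketch below uses the uniform Bernoulli measure on two symbols as a stand-in sampler for $\mu_\phi$ (a genuinely locally constant potential falls outside Definition 2.1, so this is only an illustration of the scaling, not an instance of the theorem), with a $d_\theta$-Lipschitz observable depending on the first two coordinates:

```python
import random

random.seed(0)
theta = 0.5

def f(w):
    # observable depending on the first two symbols; d_theta-Lipschitz
    return w[0] + theta * w[1]

mean_f = 0.5 + theta * 0.5  # expectation of f under the uniform Bernoulli measure

def deviation_prob(n, u, trials=2000):
    # Monte Carlo estimate of mu(|S_n f / n - int f dmu| >= u)
    hits = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n + 1)]
        s = sum(f(x[i:i + 2]) for i in range(n)) / n
        hits += abs(s - mean_f) >= u
    return hits / trials

for n in (50, 200, 800):
    print(n, deviation_prob(n, 0.1))
```

For fixed $u$, the estimated probability drops sharply as $n$ grows, consistent with the exponential bound.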
3.3.2 Empirical frequency of blocks

Take $f(x)=\mathbf 1_{[a_{0,k-1}]}(x)$ where
$$[a_{0,k-1}]=\{x\in\Omega:x_i=a_i,\ i=0,\dots,k-1\}$$
is a given $k$-cylinder. Let
$$f_n(x,a_{0,k-1})=\frac{\sum_{j=0}^{n-k}\mathbf 1_{[a_{0,k-1}]}(T^jx)}{n-k+1}.$$
This is the 'empirical frequency' of the block $a_{0,k-1}\in A^k$ in the orbit of $x$ up to time $n-k$. By Birkhoff's ergodic theorem, we know that, for each $a_{0,k-1}$, $f_n(x,a_{0,k-1})$ goes to $\mu_\phi([a_{0,k-1}])$ for $\mu_\phi$-almost all $x$. The next theorem quantifies this asymptotic statement. Notice that we can control the fluctuations of $f_n(x,a_{0,k-1})$ around $\mu_\phi([a_{0,k-1}])$ uniformly in $a_{0,k-1}$.

Theorem 3.6. For all $n\in\mathbb N$, for all $1\le k\le n$ and for all $u>0$ we have
$$\mu_\phi\left(x:\max_{a_{0,k-1}}\big|f_n(x,a_{0,k-1})-\mu_\phi([a_{0,k-1}])\big|\ge\frac{(u+c\sqrt k)\,\theta^{-k}}{\sqrt{n-k+1}}\right)\le e^{-\frac{u^2}{4C_{(3.1)}}}$$
where $c=2\sqrt{2C_{(3.1)}\log|A|}$. Moreover, if $k=k(n)=\zeta\log n$ for some $\zeta>0$, then
$$\mu_\phi\left(x:\max_{a_{0,k(n)-1}}\big|f_n(x,a_{0,k(n)-1})-\mu_\phi([a_{0,k(n)-1}])\big|\ge\frac{(u+c'\sqrt{\log n})\,n^{\zeta|\log\theta|}}{\sqrt{n-k(n)+1}}\right)\le e^{-\frac{u^2}{4C_{(3.1)}}}$$
where $c'=2\sqrt{2\zeta C_{(3.1)}\log|A|}$.
Proof. Define the function $K:\Omega^{n-k+1}\to\mathbb R$ by
$$K(x^0,\dots,x^{n-k})=\max_{a_{0,k-1}}Z(a_{0,k-1};x^0,\dots,x^{n-k})$$
where
$$Z(a_{0,k-1};x^0,\dots,x^{n-k})=\left|\frac{\sum_{j=0}^{n-k}\mathbf 1_{[a_{0,k-1}]}(x^j)}{n-k+1}-\mu_\phi([a_{0,k-1}])\right|.$$
It is left to the reader to check that $\mathrm{Lip}_{\theta,j}(K)=\frac{\mathrm{Lip}_\theta(f)}{n-k+1}=\frac{1}{\theta^k(n-k+1)}$, so we get immediately from (3.2)
$$\mu_\phi\left(x\in\Omega:K(x,Tx,\dots,T^{n-k}x)\ge u+\int K(y,Ty,\dots,T^{n-k}y)\,\mathrm d\mu_\phi(y)\right)\le\exp\left(-\frac{\theta^{2k}(n-k+1)u^2}{4C_{(3.1)}}\right)$$
for all $n\ge1$ and $u>0$. To complete the proof, we need a good upper bound for $\int K(y,Ty,\dots,T^{n-k}y)\,\mathrm d\mu_\phi(y)$. Actually, this can be done by using again the Gaussian concentration bound. Using (3.1) and Jensen's inequality we get, for any $\xi>0$,
$$\exp\left(\xi\int K(x,Tx,\dots,T^{n-k}x)\,\mathrm d\mu_\phi(x)\right)\le\int\exp\left(\xi\max_{a_{0,k-1}}Z(a_{0,k-1};x,Tx,\dots,T^{n-k}x)\right)\mathrm d\mu_\phi(x)$$
$$\le\sum_{a_{0,k-1}\in A^k}\int\exp\big(\xi Z(a_{0,k-1};x,Tx,\dots,T^{n-k}x)\big)\,\mathrm d\mu_\phi(x)\le2|A|^k\exp\left(\frac{C_{(3.1)}\theta^{-2k}\xi^2}{n-k+1}\right).$$
The second inequality is obtained by using the trivial inequality
$$e^{\max_{i=1,\dots,p}a_i}\le\sum_{i=1}^pe^{a_i},$$
and the factor $2$ in the last one comes from bounding $e^{|w|}\le e^w+e^{-w}$ and applying (3.1) to $\pm Z$ for each block. Taking logarithms on both sides and then dividing by $\xi$, we have
$$\int K(x,Tx,\dots,T^{n-k}x)\,\mathrm d\mu_\phi(x)\le\frac{\log2+k\log|A|}{\xi}+\frac{C_{(3.1)}\theta^{-2k}\xi}{n-k+1}.$$
There is a unique $\xi>0$ minimizing the right-hand side, hence
$$\int K(x,Tx,\dots,T^{n-k}x)\,\mathrm d\mu_\phi(x)\le2\theta^{-k}\sqrt{\frac{C_{(3.1)}(k+1)\log|A|}{n-k+1}},$$
where we used that $\log2\le\log|A|$. Hence we get the desired estimate.

Note that $\log|A|$ is the topological entropy of the full shift with alphabet $A$.
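For intuition on Theorem 3.6, the following sketch computes the maximal deviation of empirical $k$-block frequencies over all blocks in $A^k$; the uniform Bernoulli measure is again used as an illustrative stand-in sampler, so the numbers only illustrate the $\theta^{-k}\sqrt k/\sqrt n$ scale of the bound:

```python
import itertools
import random
from collections import Counter

random.seed(1)
A = "01"
k, n = 3, 5000

# one sample orbit segment x_0 .. x_{n-1} from the uniform Bernoulli measure
x = "".join(random.choice(A) for _ in range(n))

counts = Counter(x[j:j + k] for j in range(n - k + 1))
true_prob = 1 / len(A) ** k  # mu([a_{0,k-1}]) for the uniform measure

# max over all |A|^k blocks of |f_n(x, a) - mu([a])|
max_dev = max(abs(counts["".join(b)] / (n - k + 1) - true_prob)
              for b in itertools.product(A, repeat=k))
print(max_dev)
```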
3.3.3 Hitting times and entropy

For $x,y\in\Omega$, let
$$T_{x_{0,n-1}}(y)=\inf\{j\ge1:y_{j,j+n-1}=x_{0,n-1}\}.$$
This is the first time that the $n$ first symbols of $x$ appear in $y$. We assume that $\phi$ satisfies
$$\mathrm{var}_n(\phi)=O\Big(\frac{1}{n^\alpha}\Big)\ \text{ for some }\alpha>2.\qquad(3.7)$$
One can prove (see [9]) that
$$\lim_{n\to\infty}\frac1n\log T_{x_{0,n-1}}(y)=h(\mu_\phi),\ \text{ for }\mu_\phi\otimes\mu_\phi\text{-almost every }(x,y).$$
Roughly, this means that, if we pick $x$ and $y$ independently, each one according to $\mu_\phi$, then the time it takes to see the first $n$ symbols of $x$ appearing in $y$ for the first time is $\approx e^{nh(\mu_\phi)}$.

Theorem 3.7. If $\phi$ satisfies (3.7), then there exist strictly positive constants $c_1,c_2$ and $u_0$ such that, for all $n$ and for all $u>u_0$,
$$(\mu_\phi\otimes\mu_\phi)\left\{(x,y):\frac1n\log T_{x_{0,n-1}}(y)\ge h(\mu_\phi)+u\right\}\le c_1e^{-c_2nu^2}$$
and
$$(\mu_\phi\otimes\mu_\phi)\left\{(x,y):\frac1n\log T_{x_{0,n-1}}(y)\le h(\mu_\phi)-u\right\}\le c_1e^{-c_2nu}.$$

These bounds were obtained in [8] when $\phi$ is Lipschitz. Observe that the probability of being above $h(\mu_\phi)$ is bounded above by $c_1e^{-c_2nu^2}$, whereas the probability of being below $h(\mu_\phi)$ is bounded above by $c_1e^{-c_2nu}$. The proof of this theorem being very similar to that given in [8], we omit the details and only sketch it. We cannot directly deal with $T_{x_{0,n-1}}(y)$, but we have
$$\log T_{x_{0,n-1}}(y)=\log\big(T_{x_{0,n-1}}(y)\,\mu_\phi([x_{0,n-1}])\big)-\log\mu_\phi([x_{0,n-1}]).$$
Then we use Theorem 3.5 for $\psi=-\phi$, assuming (without loss of generality) that $P(\phi)=0$, that is, $h(\mu_\phi)=-\int\phi\,\mathrm d\mu_\phi$, because we can control, uniformly in $x$, the approximation $-\log\mu_\phi([x_{0,n-1}])\approx S_n(-\phi)(x)$. To control the other term, we use that the law of $T_{x_{0,n-1}}(y)\,\mu_\phi([x_{0,n-1}])$ is well approximated by an exponential law.

Another estimator of $h(\mu_\phi)$ is the so-called plug-in estimator. We could also obtain concentration bounds for it in the spirit of [8].
3.3.4 Speed of convergence of the empirical measure
Instead of looking at the frequency of a block $a_{0,k-1}$ we can consider a global object, namely the empirical measure
$$\mathcal E_n(x)=\frac1n\sum_{j=0}^{n-1}\delta_{T^jx}.$$
For $\mu_\phi$-almost every $x$, we know that
$$\frac1n\sum_{j=0}^{n-1}\delta_{T^jx}\xrightarrow[n\to\infty]{}\mu_\phi,$$
where the convergence is in the weak topology on the space $\mathcal M(\Omega)$ of probability measures on $\Omega$. This is a consequence of Birkhoff's ergodic theorem. The natural question is: how fast does this convergence take place? We can answer this question by using the Kantorovich distance $d_K$, which metrizes the weak topology on $\mathcal M(\Omega)$:
$$d_K(\nu_1,\nu_2)=\sup\left\{\int f\,\mathrm d\nu_1-\int f\,\mathrm d\nu_2:f:\Omega\to\mathbb R\text{ such that }\mathrm{Lip}_\theta(f)=1\right\}.$$
We have the following result.

Theorem 3.8. For all $u>0$ and all $n\ge1$ we have
$$\mu_\phi\left(x:\left|d_K(\mathcal E_n(x),\mu_\phi)-\int d_K(\mathcal E_n(y),\mu_\phi)\,\mathrm d\mu_\phi(y)\right|\ge u\right)\le2e^{-c_{(3.8)}nu^2}\qquad(3.8)$$
where $c_{(3.8)}=(4C_{(3.1)})^{-1}$.

Proof. Let
$$K(x^0,\dots,x^{n-1})=\sup\left\{\frac1n\sum_{j=0}^{n-1}f(x^j)-\int f\,\mathrm d\mu_\phi:f:\Omega\to\mathbb R\text{ with }\mathrm{Lip}_\theta(f)=1\right\}.$$
Of course, $K(x,Tx,\dots,T^{n-1}x)=d_K(\mathcal E_n(x),\mu_\phi)$. It is left to the reader to check that
$$\mathrm{Lip}_{\theta,i}(K)\le\frac1n,\quad i=0,\dots,n-1.$$
The result follows at once by applying inequality (3.3).

It is natural to ask for a good upper bound for $\int d_K(\mathcal E_n(y),\mu_\phi)\,\mathrm d\mu_\phi(y)$, because this would give a control on the fluctuations of $d_K(\mathcal E_n(x),\mu_\phi)$ around $0$. Getting such a bound turns out to be difficult. In [5, Section 8] it is proved that
$$\int d_K(\mathcal E_n(y),\mu_\phi)\,\mathrm d\mu_\phi(y)\lesssim\frac{1}{n^{\frac{1}{2(1+\log|A|)}}},$$
where, for two positive sequences $(a_n)$, $(b_n)$, $a_n\lesssim b_n$ means that $\limsup_n\frac{\log a_n}{\log b_n}\le1$. One could in principle get a non-asymptotic but messy bound.
3.3.5 Relative entropy, $\bar d$-distance and speed of Markov approximation

Given $n\in\mathbb N$ and $x_{0,n-1},y_{0,n-1}\in A^n$, the (non-normalized) Hamming distance between $x_{0,n-1}$ and $y_{0,n-1}$ is
$$\bar d_n(x_{0,n-1},y_{0,n-1})=\sum_{i=0}^{n-1}\mathbf 1_{\{x_i\ne y_i\}}.\qquad(3.9)$$
Now, given two shift-invariant probability measures $\mu,\nu$ on $\Omega$, denote by $\mu_n$ and $\nu_n$ their projections on $A^n$, and define their $\bar d_n$-distance by
$$\bar d_n(\mu_n,\nu_n)=\inf\sum_{x_{0,n-1}\in A^n}\sum_{y_{0,n-1}\in A^n}\bar d_n(x_{0,n-1},y_{0,n-1})\,P_n(x_{0,n-1},y_{0,n-1})$$
where the infimum is taken over all the joint shift-invariant probability distributions $P_n$ on $A^n\times A^n$ such that $\sum_{y_{0,n-1}\in A^n}P_n(x_{0,n-1},y_{0,n-1})=\mu_n(x_{0,n-1})$ and $\sum_{x_{0,n-1}\in A^n}P_n(x_{0,n-1},y_{0,n-1})=\nu_n(y_{0,n-1})$. By [20, Theorem I.9.6, p. 92], the following limit exists:
$$\bar d(\mu,\nu):=\lim_{n\to\infty}\frac1n\bar d_n(\mu_n,\nu_n)\qquad(3.10)$$
and defines a distance on the set of shift-invariant probability measures. It induces a finer topology than the weak topology and, in particular, the $\bar d$-limit of ergodic measures is ergodic, and the entropy is $\bar d$-continuous on the class of ergodic measures (see footnote 4). Next, given $n\in\mathbb N$ and a shift-invariant probability measure $\nu$ on $\Omega$, define the $n$-block relative entropy of $\nu$ with respect to $\mu_\phi$ by
$$H_n(\nu|\mu_\phi)=\sum_{x_{0,n-1}\in A^n}\nu_n(x_{0,n-1})\log\frac{\nu_n(x_{0,n-1})}{\mu_{\phi,n}(x_{0,n-1})}.$$
One can easily prove that the following limit exists and defines the relative entropy of $\nu$ with respect to $\mu_\phi$:
$$\lim_{n\to\infty}\frac1nH_n(\nu_n|\mu_{\phi,n})=:h(\nu|\mu_\phi)=P(\phi)-\int\phi\,\mathrm d\nu-h(\nu)\qquad(3.11)$$
where $P(\phi)$ is the topological pressure of $\phi$:
$$P(\phi)=\lim_{n\to\infty}\frac1n\log\sum_{a_{0,n-1}\in A^n}e^{\sup\{S_n\phi(x)\,:\,x\in[a_{0,n-1}]\}}.$$
This limit exists for any continuous $\phi$. (To prove (3.11), we use that there exists a positive sequence $(\varepsilon_n)_n$ going to $0$ such that, for any $a_{0,n-1}\in A^n$ and any $x\in[a_{0,n-1}]$, $\mu_\phi([a_{0,n-1}])/\exp(-nP(\phi)+S_n\phi(x))$ is bounded below by $\exp(-n\varepsilon_n)$ and above by $\exp(n\varepsilon_n)$.) By the variational principle, $h(\nu|\mu_\phi)\ge0$ with equality if and only if $\nu=\mu_\phi$ (recall that $\mu_\phi$ is the unique equilibrium state of $\phi$). We refer to [21] for details. We can now formulate the first theorem of this section.
Theorem 3.9. For every shift-invariant probability measure $\nu$ on $\Omega$ and for all $n\in\mathbb N$, we have
$$\bar d_n(\nu_n,\mu_{\phi,n})\le c_{(3.9)}\sqrt{nH_n(\nu_n|\mu_{\phi,n})}\qquad(3.12)$$
where $c_{(3.9)}=\sqrt{2C_{(3.1)}}$. In particular,
$$\bar d(\nu,\mu_\phi)\le c_{(3.9)}\sqrt{h(\nu|\mu_\phi)}.\qquad(3.13)$$

Footnote 4: These two properties are false in the weak topology.

Proof. For a function $f:A^n\to\mathbb R$, define for each $i=0,\dots,n-1$
$$\delta_i(f)=\sup\{|f(a_{0,n-1})-f(b_{0,n-1})|:a_j=b_j,\ \forall j\ne i\}.$$
We obviously have that, for all $a_{0,n-1},b_{0,n-1}\in A^n$,
$$|f(a_{0,n-1})-f(b_{0,n-1})|\le\sum_{j=0}^{n-1}\mathbf 1_{\{a_j\ne b_j\}}\,\delta_j(f).$$
A function $f:A^n\to\mathbb R$ such that $\delta_j(f)\le1$ for $j=0,\dots,n-1$ is $1$-Lipschitz for the Hamming distance (3.9). We now consider the set of functions
$$\mathcal H(n,\phi)=\left\{f:A^n\to\mathbb R:f\text{ $1$-Lipschitz for }\bar d_n,\ \int_{A^n}f\,\mathrm d\mu_{\phi,n}=0\right\}.$$
We can identify a function $f\in\mathcal H(n,\phi)$ with a function $\tilde f:\Omega^n\to\mathbb R$ in a natural way: $\tilde f(x^0,\dots,x^{n-1})=f(\pi(x^0),\dots,\pi(x^{n-1}))$, where $\pi:\Omega\to A$ is defined by $\pi(x)=x_0$. We obviously have $\int\tilde f\,\mathrm d\mu_\phi=0$ and it is easy to check that $\mathrm{Lip}_j(\tilde f)=\delta_j(f)\le1$, $j=0,\dots,n-1$. Therefore we can apply the Gaussian concentration bound (3.1) to get
$$\int_{A^n}e^{\xi f}\,\mathrm d\mu_{\phi,n}\le e^{C_{(3.1)}n\xi^2},\ \text{ for all }f\in\mathcal H(n,\phi)\text{ and for all }\xi\in\mathbb R.\qquad(3.14)$$
We now apply an abstract result [1, Theorem 3.1] which says that (3.14) is equivalent to
$$\bar d_n(\nu_n,\mu_{\phi,n})\le\sqrt{2C_{(3.1)}nH_n(\nu_n|\mu_{\phi,n})}$$
for all probability measures $\nu_n$ on $A^n$. Hence (3.12) is proved. To get (3.13), divide by $n$ on both sides, take the limit $n\to\infty$, and use (3.10) and (3.11).
We now give an application of inequality (3.13). Let
\[
\phi_1(x)=\log \mu_\phi([x_0]) \quad\text{and}\quad \phi_n(x)=\log \mu_\phi(x_{n-1}\,|\,x_{0,n-2}),\quad n\ge 2.
\]
The equilibrium state for $\phi_n$ is an $(n-1)$-step Markov measure. One can prove that $(\mu_{\phi_n})_n$ converges to $\mu_\phi$ in the weak topology, but one cannot get any speed of convergence. We get the following upper bound on the speed of convergence of $(\mu_{\phi_n})_n$ to $\mu_\phi$ in the finer $\bar d$ topology.
Corollary 3.10. Assume, without loss of generality, that $\phi$ is normalized in the sense that
\[
\sum_{a\in A} \mathrm{e}^{\phi(ax)}=1, \quad \forall x\in\Omega.
\]
Then there exists $n_\phi\ge 1$ such that, for all $n\ge n_\phi$, we have
\[
\bar d(\mu_{\phi_n},\mu_\phi) \le \rho_\phi\, \mathrm{var}_n(\phi) \tag{3.15}
\]
where
\[
\rho_\phi=\sqrt{2|A|\,C_{(3.1)}}\,(\mathrm{e}-1)\,\mathrm{e}^{\frac{3}{2}\|\phi\|_\infty}.
\]
More details on how to normalize a potential are given in Subsection 4.1.
Proof. Using (3.11) and the variational principle, we get
\[
h(\mu_{\phi_n}|\mu_\phi) = -\int \phi\,\mathrm{d}\mu_{\phi_n} - h(\mu_{\phi_n}) = \int (\phi_n-\phi)\,\mathrm{d}\mu_{\phi_n}. \tag{3.16}
\]
Indeed, since $\phi$ and $\phi_n$ are normalized, we have in particular that $P(\phi)=P(\phi_n)=0$, and by the variational principle $h(\mu_{\phi_n})=-\int \phi_n\,\mathrm{d}\mu_{\phi_n}$. Now
\[
\int (\phi_n-\phi)\,\mathrm{d}\mu_{\phi_n} = \int \log\frac{\mathrm{e}^{\phi_n}}{\mathrm{e}^{\phi}}\,\mathrm{d}\mu_{\phi_n} = \int \log\Big(1+\frac{\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}}{\mathrm{e}^{\phi}}\Big)\mathrm{d}\mu_{\phi_n} \le \int \frac{\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}}{\mathrm{e}^{\phi}}\,\mathrm{d}\mu_{\phi_n} \tag{3.17}
\]
where we used the inequality $\log(1+u)\le u$, valid for all $u>-1$. Now, using the shift-invariance of $\mu_{\phi_n}$ and replacing $\mathrm{e}^{\phi_n}$ by $\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}+\mathrm{e}^{\phi}$, we get
\begin{align*}
\int \frac{\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}}{\mathrm{e}^{\phi}}\,\mathrm{d}\mu_{\phi_n}
&= \int \mathrm{d}\mu_{\phi_n}(x) \sum_{a\in A} \mathrm{e}^{\phi_n(ax)}\,\frac{\mathrm{e}^{\phi_n(ax)}-\mathrm{e}^{\phi(ax)}}{\mathrm{e}^{\phi(ax)}}\\
&= \int \mathrm{d}\mu_{\phi_n}(x) \sum_{a\in A} \frac{\big(\mathrm{e}^{\phi_n(ax)}-\mathrm{e}^{\phi(ax)}\big)^2}{\mathrm{e}^{\phi(ax)}}
+ \int \mathrm{d}\mu_{\phi_n}(x) \sum_{a\in A} \big(\mathrm{e}^{\phi_n(ax)}-\mathrm{e}^{\phi(ax)}\big)\\
&\le |A|\,\mathrm{e}^{\|\phi\|_\infty}\,\big(\|\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}\|_\infty\big)^2 \tag{3.18}
\end{align*}
where we used that $\sum_{a\in A}\big(\mathrm{e}^{\phi_n(ax)}-\mathrm{e}^{\phi(ax)}\big)=0$. Combining (3.13), (3.16), (3.17) and (3.18), we thus obtain
\[
\bar d(\mu_{\phi_n},\mu_\phi) \le \sqrt{2|A|\,C_{(3.1)}\,\mathrm{e}^{\|\phi\|_\infty}}\;\|\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}\|_\infty.
\]
It remains to estimate $\|\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}\|_\infty$ in terms of $\mathrm{var}_n(\phi)$. We have
\[
\|\mathrm{e}^{\phi_n}-\mathrm{e}^{\phi}\|_\infty = \big\|\mathrm{e}^{\phi}\big(\mathrm{e}^{\phi_n-\phi}-1\big)\big\|_\infty \le \mathrm{e}^{\|\phi\|_\infty}\,\big\|\mathrm{e}^{\phi_n-\phi}-1\big\|_\infty \le (\mathrm{e}-1)\,\mathrm{e}^{\|\phi\|_\infty}\,\|\phi-\phi_n\|_\infty
\]
provided that $\|\phi-\phi_n\|_\infty<1$, where we used the inequality $|\mathrm{e}^u-1|\le(\mathrm{e}-1)|u|$, valid for $|u|<1$. Finally, since $\|\phi-\phi_n\|_\infty\le\mathrm{var}_n(\phi)$, we define $n_\phi$ to be the smallest integer such that $\mathrm{var}_n(\phi)<1$, and we can take
\[
\rho_\phi=\sqrt{2|A|\,C_{(3.1)}}\,(\mathrm{e}-1)\,\mathrm{e}^{\frac{3}{2}\|\phi\|_\infty}.
\]
We thus proved (3.15).
Let us mention the paper [13], in which the authors obtain the same bound for the speed of convergence of the Markov approximation, up to the constant. Their approach is a direct estimation of $\bar d(\mu_{\phi_n},\mu_\phi)$ by a coupling method. The point here is to obtain the same speed of convergence as an easy corollary of inequality (3.13). Let us remark that from (4.8) we would get a worse result, since we would end up with a bound proportional to $\sqrt{\mathrm{var}_n(\phi)}$. The trick which leads to the correct bound was told to us by Daniel Takahashi.
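For intuition about the objects $\mu_{\phi_n}$, here is a small numerical sketch estimating the conditional probabilities $\mu_\phi(x_{n-1}|x_{0,n-2})$ that define $\phi_n$ from a long sample path. The two-state chain and its transition matrix are hypothetical stand-ins for $\mu_\phi$; for a genuine 1-step Markov measure, the 2-block conditionals recover the transition matrix itself:

```python
import random
from collections import Counter

random.seed(0)
P = {0: (0.8, 0.2), 1: (0.3, 0.7)}  # illustrative transition matrix

def sample_path(length):
    # Sample a path of the 2-state chain standing in for mu_phi.
    x, out = 0, []
    for _ in range(length):
        out.append(x)
        x = 0 if random.random() < P[x][0] else 1
    return out

def empirical_conditionals(path, n):
    # Counts of n-blocks and (n-1)-blocks give mu(x_{n-1} | x_{0,n-2}).
    blocks = Counter(tuple(path[i:i + n]) for i in range(len(path) - n + 1))
    prefixes = Counter(tuple(path[i:i + n - 1]) for i in range(len(path) - n + 2))
    return {b: blocks[b] / prefixes[b[:-1]] for b in blocks}

cond = empirical_conditionals(sample_path(200_000), 2)
assert abs(cond[(0, 0)] - 0.8) < 0.01  # estimate of P(0 -> 0)
```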
3.3.6 Shadowing of orbits

Let $A$ be a Borel subset of $\Omega$ such that $\mu_\phi(A)>0$, and define for all $n\in\mathbb{N}$
\[
S_A(x,n)=\frac{1}{n}\inf_{y\in A}\sum_{j=0}^{n-1} d_\theta(T^j x, T^j y).
\]
A basic example of a set $A$ is a cylinder set $[a_{0,k-1}]$. The quantity $S_A(x,n)$, which lies between $0$ and $1$, measures how well we can trace, in the best possible way, the orbit of some initial condition not in $A$ by an orbit starting in $A$.
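A small sketch may help parse the definition. Here $A$ is replaced by a finite set of points taken from cylinders, all sharing the tail of $x$ (a simplifying assumption so that everything reduces to finitely many coordinates), and $T$ is the shift acting by slicing:

```python
# Sketch: S_A(x, n) on the full shift over {0,1} with the metric
# d_theta(x, y) = theta^{min{k : x_k != y_k}}, truncated to finite words.
theta = 0.5

def d_theta(x, y):
    for k, (a, b) in enumerate(zip(x, y)):
        if a != b:
            return theta ** k
    return 0.0  # the truncations agree on all compared coordinates

def S_A(x, A, n):
    # (1/n) inf over y in A of sum_j d_theta(T^j x, T^j y);
    # T is the shift, so T^j x is x[j:].
    return min(sum(d_theta(x[j:], y[j:]) for j in range(n)) for y in A) / n

x = (1, 0, 1, 1, 0, 1, 0, 0)
A = [(0, 0) + x[2:], (0, 1) + x[2:]]  # points of the cylinders [00], [01]
val = S_A(x, A, 4)
assert 0.0 <= val <= 1.0
```

The point $y=(0,0)+x_{2:}$ shadows the orbit of $x$ after one shift, so only the $j=0$ term contributes and $S_A(x,\cdot,4)=1/4$ here.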
Theorem 3.11. For any Borel subset $A\subset\Omega$ such that $\mu_\phi(A)>0$, for any $n\in\mathbb{N}$ and for any $u>0$,
\[
\mu_\phi\Big(x\in\Omega : S_A(x,n)\ge \frac{u_A+u}{\sqrt n}\Big) \le \mathrm{e}^{-\frac{u^2}{4C_{(3.1)}}}
\]
where
\[
u_A=2\sqrt{-C_{(3.1)}\ln\mu_\phi(A)}\,.
\]
We give a shorter and simpler proof than in [11].
Proof. Let $K(x_0,\dots,x_{n-1})=\frac{1}{n}\inf_{y\in A}\sum_{j=0}^{n-1} d_\theta(x_j,T^j y)$. One can easily check that
\[
\mathrm{Lip}_{\theta,i}(K)=\frac{1}{n}, \quad \forall i=0,\dots,n-1.
\]
It follows from (3.2) that
\[
\mu_\phi\Big(S_A(x,n)\ge \int S_A(y,n)\,\mathrm{d}\mu_\phi(y)+\frac{u}{\sqrt n}\Big) \le \mathrm{e}^{-\frac{u^2}{4C_{(3.1)}}}
\]
for all $n\ge 1$ and for all $u>0$. We now need an upper bound for $\int S_A(y,n)\,\mathrm{d}\mu_\phi(y)$. We simply observe that, by (3.1) and the definition of $S_A(\cdot,n)$ (which vanishes on $A$),
\[
\mu_\phi(A)=\int \mathrm{e}^{-\xi S_A(x,n)}\,\mathbf{1}_A(x)\,\mathrm{d}\mu_\phi(x) \le \int \mathrm{e}^{-\xi S_A(x,n)}\,\mathrm{d}\mu_\phi(x) \le \mathrm{e}^{-\xi\int S_A(y,n)\,\mathrm{d}\mu_\phi(y)}\,\mathrm{e}^{\frac{C_{(3.1)}\xi^2}{n}}
\]
for all $\xi>0$. Hence
\[
\int S_A(y,n)\,\mathrm{d}\mu_\phi(y) \le \frac{C_{(3.1)}\,\xi}{n}+\frac{1}{\xi}\ln\big(\mu_\phi(A)^{-1}\big).
\]
Optimizing this bound over $\xi>0$ gives
\[
\int S_A(y,n)\,\mathrm{d}\mu_\phi(y) \le 2\sqrt{\frac{C_{(3.1)}\ln\big(\mu_\phi(A)^{-1}\big)}{n}}\,.
\]
The theorem follows at once.
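For completeness, the optimization over $\xi$ in the last step is the elementary minimization of $\xi\mapsto C_{(3.1)}\xi/n+B/\xi$ with $B=\ln\big(\mu_\phi(A)^{-1}\big)$:

```latex
g(\xi) = \frac{C_{(3.1)}\,\xi}{n} + \frac{B}{\xi}, \qquad
g'(\xi^*) = \frac{C_{(3.1)}}{n} - \frac{B}{(\xi^*)^2} = 0
\;\Longleftrightarrow\;
\xi^* = \sqrt{\frac{nB}{C_{(3.1)}}}, \qquad
g(\xi^*) = 2\sqrt{\frac{C_{(3.1)}\,B}{n}}.
```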
3.3.7 Almost-sure central limit theorem

It was proved in [18, Chapter 2] that $(\Omega,T,\mu_\phi)$ satisfies the central limit theorem for the class of $d_\theta$-Lipschitz functions $f:\Omega\to\mathbb{R}$ such that $\int f\,\mathrm{d}\mu_\phi=0$; that is, for any such $f$, the process $\{f\circ T^n\}_{n\ge0}$ satisfies
\[
\mu_\phi\Big(x : \frac{S_k f(x)}{\sqrt k}\le t\Big) = \int \mathbf{1}_{\big\{\frac{S_k f(x)}{\sqrt k}\le t\big\}}\,\mathrm{d}\mu_\phi(x) \xrightarrow[k\to\infty]{} G_{0,\sigma^2(f)}((-\infty,t]) \tag{3.19}
\]
where
\[
\sigma^2(f)=\int f^2\,\mathrm{d}\mu_\phi+2\sum_{i\ge1}\int f\cdot f\circ T^i\,\mathrm{d}\mu_\phi \in [0,+\infty).
\]
If $\sigma^2(f)>0$, $G_{0,\sigma^2(f)}$ denotes the law of a Gaussian random variable with mean $0$ and variance $\sigma^2(f)$, that is,
\[
\mathrm{d}G_{0,\sigma^2(f)}(u)=\frac{1}{\sigma(f)\sqrt{2\pi}}\,\mathrm{e}^{-\frac{u^2}{2\sigma^2(f)}}\,\mathrm{d}u, \quad u\in\mathbb{R}.
\]
When $\sigma^2(f)=0$, we set $G_{0,0}=\delta_0$, the Dirac mass at zero.

Remark 3.1. In fact, a more general statement was proved in [18, Chapter I.2]. Namely, (3.19) holds when $\phi$ satisfies a suitable summability condition and $f\in L_\phi$.

Now, for each $N\ge1$ and $x\in\Omega$, define the probability measure
\[
A_N(x)=\frac{1}{L_N}\sum_{n=1}^{N}\frac{1}{n}\,\delta_{\frac{S_n f(x)}{\sqrt n}} \tag{3.20}
\]
where $L_N=\sum_{n=1}^{N}\frac{1}{n}$ and where, as usual, $\delta_u$ is the Dirac mass at the point $u\in\mathbb{R}$. Of course, $L_N=\log N+O(1)$. Notice that $A_N$ is a random probability measure. Finally, the Wasserstein distance between two probability measures $\nu,\nu'$ on the Borel sigma-algebra $\mathcal{B}(\mathbb{R})$ is
\[
W_1(\nu,\nu')=\inf_{\pi\in\Pi(\nu,\nu')}\int |x-x'|\,\mathrm{d}\pi(x,x')
\]
where the infimum is taken over all probability measures $\pi$ on $\mathbb{R}\times\mathbb{R}$ such that
\[
\pi(B\times\mathbb{R})=\nu(B) \quad\text{and}\quad \pi(\mathbb{R}\times B)=\nu'(B)
\]
for any Borel subset $B$ of $\mathbb{R}$. By the Kantorovich–Rubinstein duality theorem, $W_1(\nu,\nu')$ is equal to the Kantorovich distance, which is the supremum of $\int \ell\,\mathrm{d}\nu-\int \ell\,\mathrm{d}\nu'$ over the set of $1$-Lipschitz functions $\ell:\mathbb{R}\to\mathbb{R}$. We refer to [12] for background and proofs.
Now we can formulate the almost-sure central limit theorem.

Theorem 3.12. Let $f:\Omega\to\mathbb{R}$ be a $d_\theta$-Lipschitz function. Then, for $\mu_\phi$-almost every $x\in\Omega$, we have
\[
W_1\big(A_N(x), G_{0,\sigma^2(f)}\big)\xrightarrow[N\to+\infty]{} 0.
\]
We make several comments. Recall that the Wasserstein distance metrizes the weak topology on the set of probability measures on $\mathcal{B}(\mathbb{R})$. Moreover, if $(\nu_n)_{n\ge1}$ is a sequence of probability measures on $\mathcal{B}(\mathbb{R})$ and $\nu$ is a probability measure on $\mathcal{B}(\mathbb{R})$, then
\[
\lim_{n\to\infty} W_1(\nu_n,\nu)=0 \iff \nu_n\xrightarrow{\mathrm{law}}\nu \ \text{ and }\ \int |u|\,\mathrm{d}\nu_n(u)\xrightarrow[n\to+\infty]{}\int |u|\,\mathrm{d}\nu(u)
\]
where ``$\xrightarrow{\mathrm{law}}$'' means weak convergence of probability measures on $\mathcal{B}(\mathbb{R})$.
To compare with (3.19), observe that Theorem 3.12 implies that, for $\mu_\phi$-almost every $x$, $A_N(x)\xrightarrow{\mathrm{law}} G_{0,\sigma^2(f)}$, which in turn implies that
\[
\int \mathbf{1}_{\{u\le t\}}\,\mathrm{d}A_N(u)=\frac{1}{L_N}\sum_{n=1}^{N}\frac{1}{n}\,\mathbf{1}_{\{S_n f/\sqrt n\le t\}} \xrightarrow[N\to+\infty]{} G_{0,\sigma^2(f)}((-\infty,t]).
\]
Therefore, the expectation with respect to $\mu_\phi$ in (3.19) is replaced by a pathwise logarithmic average in the almost-sure central limit theorem.
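To see the logarithmic average (3.20) in action numerically, one can replace the dynamical sequence $f\circ T^n$ by i.i.d. $\pm1$ steps (an illustrative assumption only; then $\sigma^2(f)=1$) and inspect the CDF of $A_N$ at a point:

```python
import math
import random

# A_N(x) from (3.20): weights (1/n)/L_N on the atoms S_n f(x)/sqrt(n),
# with the dynamical orbit replaced by i.i.d. +-1 steps for illustration.
random.seed(1)
N = 5000
steps = [random.choice((-1.0, 1.0)) for _ in range(N)]  # sigma^2 = 1

s, atoms, weights = 0.0, [], []
for n, step in enumerate(steps, start=1):
    s += step
    atoms.append(s / math.sqrt(n))   # S_n f / sqrt(n)
    weights.append(1.0 / n)
L_N = sum(weights)                   # L_N = log N + O(1)
weights = [w / L_N for w in weights]

# The CDF of A_N at t = 0 should approach G_{0,1}((-inf, 0]) = 1/2,
# though the logarithmic averaging makes the convergence quite slow.
cdf0 = sum(w for w, a in zip(weights, atoms) if a <= 0)
```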
Proof. The proof follows from an abstract theorem proved in [6]. In words, that theorem says the following. Let $(X_n)_{n\ge0}$ be a stationary stochastic process where the $X_n$'s are random variables taking values in $\Omega$. Assume that if $f:\Omega\to\mathbb{R}$ is $d$-Lipschitz and such that $\mathbb{E}[f(X_0)]=0$, then it satisfies the central limit theorem, that is, for all $u\in\mathbb{R}$,
\[
\mathbb{P}\left(\frac{\sum_{j=0}^{n-1} f(X_j)}{\sqrt n}\le u\right)\xrightarrow[n\to\infty]{} G_{0,\sigma^2(f)}((-\infty,u])
\]
where $\sigma^2(f):=\mathbb{E}[f^2(X_0)]+2\sum_{\ell\ge1}\mathbb{E}[f(X_0)f(X_\ell)]$ is assumed to be nonzero. Moreover, assume that the process $(X_n)_{n\ge0}$ satisfies the following variance inequality: there exists $C>0$ such that, for all $n$ and all separately Lipschitz functions $K:\Omega^n\to\mathbb{R}$ (with respect to some distance $d$ on $\Omega$),
\[
\mathbb{E}\Big[\big(K(X_0,\dots,X_{n-1})-\mathbb{E}[K(X_0,\dots,X_{n-1})]\big)^2\Big] \le C\sum_{i=0}^{n-1}\mathrm{Lip}_i(K)^2.
\]
Then the conclusion is that, almost surely, $\frac{1}{L_N}\sum_{n=1}^{N}$