• Aucun résultat trouvé

Markov approximation of chains of infinite order in the $\bar{d}$-metric

N/A
N/A
Protected

Academic year: 2021

Partager "Markov approximation of chains of infinite order in the $\bar{d}$-metric"

Copied!
25
0
0

Texte intégral

(1)

HAL Id: hal-00913858

https://hal.archives-ouvertes.fr/hal-00913858

Submitted on 5 Dec 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Markov approximation of chains of infinite order in the

¯

d-metric

Sandro Gallo, Matthieu Lerasle, D.Y. Takahashi

To cite this version:

Sandro Gallo, Matthieu Lerasle, D.Y. Takahashi. Markov approximation of chains of infinite order in the ¯d-metric. Markov Processes and Related Fields, Polymath, 2013, 19 (1), pp.51–82. �hal-00913858�

(2)

THE ¯d-METRIC

S. GALLO, M. LERASLE, AND D. Y. TAKAHASHI

Abstract. We obtain explicit upper bounds for the ¯d-distance between a chain of in-finite order and its canonical k-steps Markov approximation. Our proof is entirely con-structive and involves a “coupling from the past” argument. The new method covers non-necessarily continuous probability kernels, and chains with null transition probabil-ities. These results imply in particular the Bernoulli property for these processes.

1. INTRODUCTION

Chains of infinite order are random processes specified by probability kernels (condi-tional probabilities), which may depend on the whole past. They constitute a wide class of flexible models that are very useful in different areas of applied probability and statistics, from bioinformatics (Bejerano & Yona, 2001; Busch et al., 2009) to linguistics (Galves et al., 2012). They are also models of considerable theoretical interest in ergodic theory (Coelho & Quas, 1998; Hulse, 1991; Quas, 1996; Walters, 2007) and in the general theory of stochastic process (Bramson & Kalikow, 1993; Comets et al., 2002; Fern´andez & Maillard, 2005). A natural approach to study chains of infinite order is to approximate the original process by Markov chains of growing orders. In this article, we obtain new upper-bounds on the ¯d-distance between a chain and its canonical k-steps Markov approximation.

Introduced by Ornstein (1974) to study the isomorphism problem for Bernoulli shifts, the ¯d-metric is of fundamental importance in ergodic theory where chains of infinite order are also known as g-measures. The ¯d-distance between two processes can be informally described as the minimal proportion of sites we have to change in a typical realization of one process in order to obtain a typical realization of the other. Ornstein (1974) showed that the set of processes which are measure theoretic isomorphic to Bernoulli shifts, i.e., have the Bernoulli property, is ¯d-closed. Ergodic Markov chains are examples of processes that are isomorphic to Bernoulli shifts. Therefore, if a process can be approximated arbitrary well under the ¯d-metric by a sequence of ergodic Markov chains, then this process has the Bernoulli property. In this article we prove the existence of Markov approximation schemes for classes of chains of infinite order with non-necessary continuous and with possibly null transition probabilities. Several of these processes were not considered before. For example, Coelho & Quas (1998), Fern´andez & Galves (2002), and Johansson et al. (2010) required the continuity of the probability kernels. Our results show that these new

2000 Mathematics Subject Classification. Primary 60G10; Secondary 60K99.

Key words and phrases. Chains of infinite order, coupling from the past algorithms, canonical Markov approximation, ¯d-distance.

SG was supported by FAPESP grant 2009/09809-1. ML was supported by FAPESP grant 2009/09494-0. DYT was supported by FAPESP grant 2008/08171-0 and Pew Latin American Fellowship. This work is part of USP project “Mathematics, computation, language and the brain”, FAPESP project “NeuroMat” (grant 2011/51350-6), and CNPq project “Stochastic modeling of the brain activity” (grant 480108/2012-9). The three authors thank NUMEC-USP for hospitality during the elaboration of the paper.

(3)

examples are isomorphic to Bernoulli shifts and provide explicit upper bounds for the Markov approximation in several important cases, giving therefore information on how good these approximations are.

Additionally, the ¯d-distance is useful in statistics and information theory. Rissanen (1983) proposed to model data as realizations of stochastic processes, and proved that these data can be optimally compressed using the (unknown) probability kernel of the chain. The statistical problem is then to recover this probability kernel from the observation of typical data. Since the number of parameters to estimate is infinite, this task is impossible in general. A possible strategy to overcome this problem is the following. (1) Couple the original chain with a Markov approximation and (2) work with the approximating Markov chain. The ¯d-distance between the chain and its Markov approximation controls the error made in step (1). The idea is that, if this control is good enough, the good properties of the approximating Markov chain proved in step (2) can be used to study the original chain. For instance, Duarte et al. (2006) and Csisz´ar & Talata (2010) obtained consistency results for chains of infinite order from the consistency of BIC estimators for Markov chains proved in Csisz´ar & Talata (2006). This “two steps” procedure was also used in Collet et al.(2005) to obtain a bootstrap central limit theorem for chains of infinite order from the renewal property of the approximating Markov chains.

Our main results use coupling arguments. We first introduce a flexible class of Coupling from the pastalgorithms (CFTP algorithms, see Section 2.3). CFTP algorithms constitute an important class of perfect simulation algorithms popularized by Propp & Wilson (1996). Our main assumption on the chain is that the original chain of infinite order can be perfectly simulated via such CFTP algorithms. We state a technical result, Lemma 4.1, which provides an abstract upper bound for the ¯d-distance between the chain and its canonical Markov approximation. This bound is then made explicit under various extra assumptions on the process used in the study of the CFTP algorithms (Comets et al., 2002; De Santis & Piccioni, 2012; Gallo, 2011; Gallo & Garcia, 2011).

To our knowledge, Fern´andez & Galves (2002) provide the best explicit bounds in the literature for the ¯d-distance between a chain of infinite order and its canonical Markov approximation, depending only on the continuity rate of the probability kernels. Their result applies to weakly non-null chains having summable continuity rates. Our method recovers the same bounds under weaker assumption, substituting weak non-nullness by a weaker assumption (see Theorem 4.1). Assuming weak non-nullness, we also obtain explicit upper bounds in some non-summable continuity regimes and other not even necessarily continuous, but satisfying certain types of localized continuity, as introduced in De Santis & Piccioni (2012), Gallo (2011) and Gallo & Garcia (2011). This is the content of Theorems 4.2 and 4.3 which provide, as far as we know, the first results for non-continuous chains. Our results should also be compared with the results in Johansson et al. (2010), where they prove the Bernoulli property for square summable continuity regime assuming strong non-nullness, although they don’t provide an explicit upper bound for the approximations. The article is organized as follows. In Section 2, we introduce the notation and basic definitions used all along the paper. In Section 3, we construct the coupling between the original chain and its canonical Markov approximation and we introduce the class of CFTP algorithms perfectly simulating the chains. Our main results are stated in Section 4. We postpone the proofs to Section 5. For convenience of the reader, we leave in Appendix

(4)

some extensions and technical results on the “house of cards” processes that are useful in our applications and are of independent interest.

2. NOTATION, DEFINITIONS AND BACKGROUND

2.1. Notation. We use the conventions that N ={1, 2, . . .} and N = N ∪ {0, ∞}. Let A be the set{1, 2, . . . , N } for some N ∈ N. Given two integers m ≤ n, let an

m be the string

am. . . an of symbols in A. For any m≤ n, the length of the string anm is denoted by |anm|

and is equal to n− m + 1. Let ∅ denote the empty string, with length |∅| = 0. For any n ∈ Z, we will use the convention that an

n+1 = ∅, and naturally |ann+1| = 0. Given two

strings v and v0, we denote by vv0 the string of length|v| + |v0| obtained by concatenating the two strings. If v0=∅, then v∅ = ∅v = v. The concatenation of strings is also extended to the case where v = . . . a−2a−1 is a semi-infinite sequence of symbols. If n∈ N and v is a finite string of symbols in A, vn = v . . . v is the concatenation of n times the string v.

In the case where n = 0, v0 is the empty string∅. Let A−N= A{...,−2,−1} and A? =

+∞

[

j=0

A{−j,...,−1},

be, respectively, the set of all infinite strings of past symbols and the set of all finite strings of past symbols. The case j = 0 corresponds to the empty string∅. Finally, we denote by a = . . . a−2a−1 the elements of A−N.

2.2. Kernels, chains and coupling.

Definition 2.1. A family of transition probabilities, or kernel, on an alphabet A is a function P : A× A−N [0, 1] (a, x) 7→ P (a|x) such that X a∈A P (a|x) = 1 , ∀x ∈ A−N.

P is called a Markov kernel if there exists k such that P (a|x) = P (a|y) when x−1−k = y−k−1. In the present article we are mostly interested in non-Markov kernels, in which P (a|x) may depend on the whole past x. When we consider a Markov kernel of order k, we will make explicit the dependence on the k past values and use the notation P (·|x−1−k) to indicate P (·|x).

Definition 2.2. A stationary stochastic process X={Xn}n∈Z with distribution µ on AZ

is said to be compatible with a family of transition probabilities P if the later is a regular version of the conditional probabilities of the former, that is

µ(X0 = a|X−∞−1 = x) = P (a|x) (1)

for everya∈ A and µ-a.e. x in A−N.

If P is non-Markov, it may be hard to prove the existence of a stationary process X compatible with it. In order to solve this issue, we will assume the existence of coupling from the past algorithms for the chain (see Section 2.3). This “constructive argument” garantees the existence and uniqueness of the stationary process X compatible with P .

(5)

Definition 2.3(Canonical k-steps Markov approximation). Assume that X is a stationary process with distributionµ. Let Pµ[k] be the kernel defined by

Pµ[k](a|x−1−k) = µ(X0= a|X−k−1= x−1−k).

The canonical k-steps Markov approximation of X is the stationary k-step Markov chain X[k] compatible with the kernel Pµ[k](a|x−1−k).

Since, in all cases considered in this article, µ is uniquely determined by P , we will omit in what follows the subscript µ in Pµ[k], and it will be understood that P[k]= Pµ[k].

Let us recall that a coupling between two chains X and Y taking values in the same alphabet A is a stochastic process Z ={Zn}n∈Z ={( ¯Xn, ¯Yn)}n∈Z on (A× A)Z such that

¯

Xhas the same distribution as X and ¯Y has the same distribution as Y. For any pair of stationary processes X and Y, letC(X, Y) be the set of couplings between X and Y. Definition 2.4 ( ¯d-distance). The ¯d-distance between two stationary processes X and Y is defined by

¯

d(X, Y) = inf

( ¯X, ¯Y)∈C(X,Y)P( ¯X0 6= ¯Y0).

For the class of ergodic processes, this distance has another interpretation which is more intuitive: it is the minimal proportion of sites we have to change in a typical realization of X in order to obtain a typical realization of Y. Formally,

¯ d(X, Y) = inf ( ¯X, ¯Y)∈C(X,Y)n→+∞lim 1 n n X i=1 1{ ¯Xi6= ¯Yi}.

2.3. Coupling from the past algorithm (CFTP). Our CFTP algorithm constructs a sample of the stationary process compatible with a given kernel P , using a sequence U = {Un}n∈Z of i.i.d. random variables uniformly distributed in [0, 1[. We denote by

(Ω,F, P) the probability space associated to U. The CFTP is completely determined by its update function F : A−N× [0, 1[→ A which satisfies, for any a ∈ A−N and for any a∈ A, P(F (a, U0) = a) = P (a|a). Using this function, we define the set of coalescence

times Θ and the reconstruction function Φ associated to F . For any pair of integers m, n such that −∞ < m ≤ n < +∞, let F{m,n}(a, Umn) ∈ An−m+1 be the sample obtained by

applying recursivelyF on the fixed past a, i.e, let F{m,m}(a, Um) := F (a, Um) and

F{m,n}(a, Umn) := F{m,n−1}(a, Umn−1)F (aF{m,n−1}(a, Umn−1), Un) .

Secondly, let F[m,m](a, Um) := F (a, Um) and

F[m,n](a, Umn) = F aF{m,n−1}(a, Umn−1), Un . (2)

F[m,n](a, Un

m) is the last symbol of the sample F{m,n}(a, Umn). The set

Θ[n] :={j ≤ n : F[j,n](a, Ujn) = F[j,n](b, Ujn) for all a, b∈ A−N} (3) is called the set of coalescence times for the time index n. Finally, the reconstruction function of time n is defined by

[Φ(U)]n= F[θ[n],n](a, Uθ[n]n ) (4)

where θ[n] is any element of Θ[n]. Given a kernel P , if Θ[0]6= ∅ a.s. and therefore Θ[n] 6= ∅ a.s. for any n ∈ Z, then ([Φ(U)]n)n∈Z is distributed according to the unique stationary

(6)

3. CONSTRUCTION OF THE COUPLING

For any a ∈ A−N, letI(a) := {Ik(a|a−1−k)}k∈N, a∈A be any partition of [0, 1[ having the

following properties:

(1) For any k ∈ N, the Lebesgue measure or length |Ik(a|a−1−k)| of Ik(a|a−1−k) only

depends on a and a−1−k, (2) for any a and a

X

k∈N

|Ik(a|a−1−k)| = P (a|a),

(3) the intervals are disposed as represented in the upper part of Figure 1.

I0(1|∅) I0(2|∅) I1(1|a−1) I1(2|a−1) I2(1|a−1−2) I2(2|a−1−2) . . . Ik(1|a−1−k) Ik(2|a−1−k) . . . . . . α0 α1(a−1) αk(a−1−k) I∞(1|a) I∞(2|a) I[k](1|a−1 −k) I[k](2|a−1 −k)

Figure 1. Illustration of a range partition related to some infinite past a. The upper partition is the one used for the original kernel P , whereas the one below is used for the approximating kernel P[k].

Definition 3.1. We call range partitions the partitions of [0, 1[ satisfying (1), (2) and (3) for some kernel P .

The following lemma is proved in Section 5.1.

Lemma 3.1. A set of range partitions satisfies, for anya and a∈ A,

k X i=0 |Ii(a|a−1−i)| ≤ inf z P (a|a −1 −kz) , ∀k ≥ 0 .

Given a range partitionI(a), the following F is an update function, due to property (2): F (a, U0) := X a∈A a.1    U0 ∈ [ k∈N Ik(a|a−1−k)    . (5)

This function F explains the name “range partition”: for a given past a, when the uniform r.v. U0 belongs to Sa∈A

Sk

i=0Ii(a|a−1−i), then F constructs a symbol looking at a range

≤ k in the past.

Let L : A−N× [0, 1[→ {0, 1, 2, . . .} be the range function defined by L(a, u) :=X

k∈N

k.1{u ∈ ∪a∈AIk(a|a−1−k)} . (6)

L associates to a past a and a real number u∈ [0, 1[ the length of the suffix of a that F needs in order to construct the next symbol when U0 = u.

(7)

Using these functions, define, as in Section 2.3, the related coalescence sets Θ[i], i∈ Z, and the reconstruction function Φ(U), which is distributed according to the unique stationary distribution compatible with P whenever Θ[0] is a.s. non-empty.

Let us now define the functions F[k] and L[k] that we will use for the construction of X[k]. Observe that, on the one hand, by definition of the canonical k-steps Markov

approximation we have for any a∈ A and a−1−k ∈ Ak

P[k](a|a−1−k) := µ(X0 = a|X−k−1 = a−1−k) = Z A−N P (a|a−1−kz)dµ(z|a−1−k)≥ inf z P (a|a −1 −kz) .

On the other hand, by Lemma 3.1, infzP (a|a−1−kz) ≥ Pkj=0|Ik(a|a−1−k)|. Thus we can

define, for any a−1−k, the set of intervals {I[k](a|a−1

−k)}a∈A having length |I[k](a|a−1−k)| =

P[k](a|a−1 −k)−

Pk

j=0|Ik(a|a−1−k)| and disposed as in Figure 1. The functions F[k] and L[k]

are defined as follows

F[k](a, U0) := X a∈A a1{U0 ∈ ∪kj=0Ij(a|a−j−1)∪ I[k](a|a−1−k)} (7) and L[k](a, U0) := k X j=0

j.1{U0∈ ∪a∈AIj(a|a−1−j)} + k.1{U0 ∈ ∪a∈AI[k](a|a−1−k)}. (8)

Observe that, since F[k] and L[k] are derived from k-steps Markov kernels, these func-tions depend on a only through the k last value a−1−k. For this reason we will use the notation F[k](a−1−k, U0) to indicate F[k](a, U0); the same for L[k].

Now, using these functions, define, as in Section 2.3, the related coalescence sets Θ[k][i], i ∈ Z, and the reconstruction function Φ[k](U), which is distributed according to the

unique stationary distribution compatible with P[k] whenever Θ[k][0] is a.s. non-empty. Using the same sequence of uniforms U and assuming that Θ[0] and Θ[k][0] are a.s. non-empty, (Φ(U), Φ[k](U)) is a (A× A)-valued chain with coordinates distributed as X

and X[k] respectively. It follows that (Φ(U), Φ[k](U)) is a coupling between both chains. Hence, we have constructed a CFTP algorithm for perfect simulation of the coupled chains.

4. STATEMENTS OF THE RESULTS

4.1. A key lemma. Let us first state a technical lemma that is central in the proof of our main results.

Lemma 4.1. Assume that there exists a set of range partitions{I(a)}a such that the sets

of coalescence timesΘ[0]∩ Θ[k][0] is P-a.s. non-empty. Then, for any θ[0]∈ Θ[0],

¯ d(X, X[k])≤ P   [ a 0 [ i=θ[0] n LaF{θ[0],i−1}(a, Uθ[0]i−1), Ui  > ko  . (9)

where, fori = θ[0], the event reads {L a, Uθ[0] > k}.

Examples of range partitions satisfying the conditions of this lemma have already been built, for example in Comets et al. (2002), Gallo (2011), Gallo & Garcia (2011) and De Santis & Piccioni (2012). These works assume some regularity conditions on P and

(8)

some non-nullness hypothesis which are presented in Sections 4.2, 4.3, 4.4. In these sec-tions, we obtain explicit upper bounds for (9) under the respective assumptions. Before that, let us give an interesting remark on Bernoullicity.

Observation 4.1 (A remark on Bernoullicity). In the conditions of each works cited above, we will exhibitθ[0]∈ Θ[0] which belongs to Θ[k][0] for any sufficiently large k’s, and

we will prove that

P   [ a 0 [ i=θ[0] n LaF{θ[0],i−1}(a, Uθ[0]i−1), Ui  > ko   k−→ 0.→∞ (10)

It follows, by Lemma 4.1, that in this case lim

k→∞

¯

d(X, X[k]) = 0.

We also have, for any sufficiently large k’s, that X[k] is an ergodic Markov chain since

Θ[k][0] is non-empty. Now, by the ¯d-closure of the set of processes isomorphic to a Bernoulli shift (see for example Ornstein (1974)) and the fact that ergodic Markov processes have the Bernoulli property (Ornstein (1974), p.45), we conclude using the results of following sections that the processes considered in Comets et al. (2002), Gallo (2011), Gallo & Garcia (2011) and De Santis & Piccioni (2012) have the Bernoulli property.

4.2. Kernels with summable continuity rate. Let us first define continuity.

Definition 4.1 (Continuity points and continuous kernels). For any k ∈ N, a and a−1−k, let αk(a|a−1−k) := infzP (a|a−1−kz) and α0(a) := infzP (a|z). A past a is called a continuity

point for P or P is said to be continuous in a if αk(a−1−k) :=

X

a∈A

αk(a|a−1−k)k→+∞−→ 1.

We say that P is continuous when αk:= inf

a−1−k

αk(a−1−k)k→+∞−→ 1.

We say that P has summable continuity rate whenP

k≥0(1− αk) <∞.

We also define weak non-nullness.

Definition 4.2. We say that a kernel P is weakly non-null if α0 > 0, where α0 :=

P

a∈Aα0(a).

De Santis & Piccioni (2012) have introduced a more general assumption that we call very weak non-nullness, see Definition 5.2. We postpone this definition to Section 5.3 in order to avoid technicality at this stage.

Theorem 4.1. Assume thatP has summable continuity rate and is very weakly non-null. Then, there exists a constantC < +∞ such that, for any sufficiently large k,

¯

(9)

Remark 4.1. This upper bound is new since we do not assume weak non-nullness. Fern´andez & Galves (2002) showed, under weak non-nullness, that for any sufficiently largek, there exists a positive constant C such that

¯

d(X, X[k])≤ Cβk

where

βk:= sup{|P (a|a−1−kx)− P (a|a−1−ky)| : a ∈ A, a−1−k ∈ Ak, x, y∈ A−N}.

This later quantity is related toαkthrough the inequalities1−αk≥ |A|2 βkand1−αk≤ Dβk

for some D > 1 and sufficiently larges k’s. Moreover, 1− αk = βk for binary alphabets.

Thus, Theorem 4.1 extends the bound in Fern´andez & Galves (2002).

4.3. Using a prior knowledge of the histories that occur. De Santis & Piccioni (2012) introduced the following assumptions on kernels. Define

∀k ≥ 1, Jk(U−k−1) :={x ∈ A−N s.t. ∀1 ≤ l ≤ k, x−l= a if U−l∈ I(a|∅) for some a ∈ A},

∀k ≥ 1, Ak(U−k−1) := inf{αk(x−1−k) : x∈ Jk(U−k−1)} .

Finally, let

`(U−∞0 ) := inf{j ≥ 0 : U0< Aj(U−j−1)} where A0(U01) := α0. (11)

Theorem 4.2. If X has a kernel that satisfies EQ

k≥0Ak(U−k−1)−1



< ∞, then there exists a positive constant C < +∞ such that

¯

d(X, X[k])≤ CP(`(U−∞0 ) > k).

In order to illustrate the interest of this result, let us give two simple examples. Other examples can be found in De Santis & Piccioni (2012) and Gallo & Garcia (2011). Summable continuity regime with weak non-nullness. Theorem 4.2 allows to re-cover the result of Theorem 4.1 in the weakly non-null case. To see this, it is enough to observe that, for any U−k−1, Ak(U−k−1) ≥ αk (see Definition 4.1 for αk). It follows that

Q

k≥0αk > 0 (which is equivalent to Pk≥0(1− αk) < +∞), implies that Qk≥0Ak(U−k−1)

is bounded away from zero, hence, its inverse has finite expectation. Hence, Theorem 4.2 applies and gives

¯

d(X, X[k])≤ CP(`(U−∞0 ) > k)≤ C(1 − αk) .

A simple discontinuous kernel on A ={1, 2}. Let  ∈ (0, 1/2) and let {pi}i≥0 be any

sequence such that, ≤ pi < 1−  for any i ≥ 0. Let t(a) := inf{i ≥ 0 : a−i−1 = 2} let ¯P

be the following kernel:

∀a ∈ {1, 2}−N, ¯P (2|a) = pt(a) . (12)

The existence of a unique stationary process compatible with this kernel is proven in Gallo (2011) for instance. This chain is the renewal sequence, that is, a concatenation of blocks of the form 1 . . . 12 having random length with finite expectation. It is clearly weakly non-null, however, it is not necessarily continuous. In fact, a simple calculation shows that αk = 1− supl,m≥k|pl− pm|, which needs not to go to 1. Nevertheless, if we assume

furthermore that supk≥0αk> 1− α0(2), we have E

 Q

k≥0Ak(U−k−1)−1



<∞. The proof of this fact was originally done in an unpublished preliminary version of the paper of De Santis & Piccioni (2012). We include it here for the sake of completeness. Define

(10)

and observe that (1) Ak(U−k−1) = 1 for any k≥ N(U−∞−1 ), and (2) N (U−∞−1 ) has geometric

distribution, with probability of success α0(2). Now, for any δ∈ (0, supkαk− 1 + α0(2)),

we choose n0(δ) such that αn0 ≥ 1 − α0(2) + δ, and compute

  Y k≥0 Ak(U−k−1)−1  = N Y k=0 Ak(U−k−1)−1 ! ≤ (1 − α0(2) + δ)−N n0 Y k=0 α−1k .

The last equality follows from the fact that Ak(U−k−1)≥ αkfor any k and any sample U−k−1.

Since the probability generating function of N has radius of convergence α0(2)

1−α0(2) which is

strictly smaller than (1− α0(2) + δ)−1, it follows that

 Q

k≥0Ak(U−k−1)−1



is integrable. We can now obtain an upper bound for P(`(U0

−∞) > k)

P(`(U−∞0 ) > k) = P(inf{j ≥ 0 : U0< Aj(U−j−1)} > k)

≤ P(U0 ≥ Ak(U−k−1))≤ P(Ak(U−k−1) < 1)≤ P(N(U−∞−1) > k)≤ (1 − )k.

Observation 4.2. The preceding theorems yield explicit upper bounds. However, they hold under restrictions we would like to surpass.

First, in the continuous regime, we have assumed thatP

k≥0(1−αk) < +∞. Nevertheless,

CFTP are known to exist with the weaker assumption P

k≥1

Qk−1

i=0 αi = +∞, and it is

known that ¯d(X, X[k]) goes to zero in this case. We will be interested in upper bounds for

the rate of convergence to zero under these weak conditions. Second, in Theorem 4.2, the assumption E

 Q

k≥0Ak(U−k−1)−1



<∞ is generally difficult to check: this is particularly clear for the example of ¯P where, moreover, it requires the (unnecessary) extra-assumptionsupk≥0αk> 1− α0(2).

The next section will solve part of these objections.

4.4. A simple upper bound under weak non-nullness. Hereafter, we assume that P is weakly non-null. Let Θ0[0] be the following subset of Θ[0]:

Θ0[0] :={i ≤ 0 : for any a , L(aF{i,j−1}(a, Uij−1), Uj)≤ j − i , j = i, . . . , 0}. (13)

We have the following theorem in which a priori nothing is assumed on the continuity. Theorem 4.3. Assume thatP is weakly non-null and that we can construct a set of range partitions{I(a)}a for which Θ0[0]6= ∅, P-a.s. Then, for any θ[0] ∈ Θ0[0]

¯

d(X, X[k])≤ P(θ[0] < −k). (14)

In order to illustrate this result, let us consider the examples of continuous kernels and of the kernel ¯P . Gallo & Garcia (2011) proposed a unified framework, including these examples and several other cases, which provides more examples of applications of this theorem. This is postponed to Appendix A in order to avoid technicality.

Application to the continuity regime. Let us first introduce the following range par-tition.

(11)

Definition 4.3. Let{I(1)(a)}a be the range partition such that, for any a and a ∀k ≥ 0, I (1) k (a|a−1−k) := αk(a|a −1 −k)− αk−1(a|a−1−(k−1)) , I (1) ∞ (a|a) := P (a|a) − lim k→∞αk(a|a −1 −k) ,

with the convention α−1(a|∅) = 0.

Let F(1) and L(1) be the associated functions defined in (5) and (6). Let

θ[0] := max{i ≤ 0 : Uj ≤ αj−i, j = i, . . . , 0}.

Observe that L(1) satisfies L(1)(a, U0) ≤ k whenever U0 ≤ αk. Hence, θ[0] belongs to

Θ0(1)[0], the set defined by (13) using F(1) and L(1). Moreover, Comets et al. (2002) proved that, ifP

k≥1

Qk−1

i=0 αi = +∞ (that is, under weak non-nullnes but not necessarily

summable continuity) P(θ[0] <−k) ≤ vk:= k X j=1 X t1, . . . , tj ≥ 1 t1+ . . . + tj = k j Y m=1 (1− αtm−1) tm−2 Y l=0 αl (15)

which goes to 0. This upper bound is not very satisfactory since it is difficult to handle in general. Nevertheless, Propositions B.1 and B.2, given in Appendix B, shed light on the behavior of this vanishing sequence. In particular, under the summable continuity assumptionP

k≥0(1− αk) < +∞, Proposition B.1 states that (15) essentially recovers the

rates of Theorems 4.1 and 4.2. Also, if there exists a constant r∈ (0, 1) and a summable sequence (sk)k≥1 such that, ∀k ≥ 1, 1 − αk = rk + sk, then, from Proposition B.2, there

exists a positive constant C such that ¯

d(X, X[k])≤ C(log k)

3+r

k2−(1+r)2. (16)

Application to the kernel ¯P . As a second direct application of Theorem 4.3, let us consider the kernel ¯P defined in (12). Let{I(2)(a)}

a be the set of range partitions, such

that |I(2)(2|∅)| = α0(2), |I(2)(1|∅)| = α0(1) and Ik(2)(a|a−1−k) =∅ for any k ≥ 1 except for

k = t(a) + 1 for which |Ik(2)(1|a−1−k)| = 1 − pk− α0(1) and |Ik(2)(2|a−1−k)| = pk− α0(2). It

satisfies

L(2)(a, U0) = (t(a) + 1)1{U0> α0}.

Hence, θ[0] := max{i ≤ 0 : Ui ∈ I(2)} belongs to Θ0[0] (Θ0[0] is defined by (13) with the

functions F(2) and L(2) obtained from the set of range partitions{I(2)(a)}a). Therefore,

¯

d(X, X[k])≤ P(θ[0] < −k) ≤ (1 − )k (17) independently of the value supk≥0αk. For this simple example, Theorem 4.3 is then less

restrictive than Theorem 4.2.

5. PROOFS OF THE RESULTS 5.1. Proof of Lemma 3.1. Assume that for some k≥ 0 we have

k X i=0 |Ii(a|a−1−i)| > inf z P (a|a −1 −kz) .

(12)

Then, consider a past z? such that |I(a)| +Pk

i=1|Ii(a|a−1−i)| > P (a|a−1−kz?). As, for all

l≥ k + 1, |Il(a|a−1−kz−1−l)| ≥ 0, we have k X i=0 |Ii(a|a−1−i)| + X l≥k+1 |Il(a|a−1−kz−l−1)| > P (a|a−1−kz?) .

This is a contradiction with the second properties of the partition. This concludes the

proof. 

5.2. Proof of Lemma 4.1. We assume that Θ[0]∩ Θ[k][0] is P-a.s. non-empty, and we

therefore have a coupling (Φ(U), Φ[k](U)) of both chains. By definitions of F[k] and L[k], we observe that

L(ba−1−k, U0)≤ k ⇒ for any b we have



L(ba−1−k, U0) = L[k](a−1−k, U0) and,

F (ba−1−k, U0) = F[k](a−1−k, U0). (18)

Assume that,∀a ∈ A−N and ∀i = θ[0], . . . , 0, L(aF

{θ[0],i−1}(a, Uθ[0]i−1), Ui)≤ k. Then, using

recursively (18), F{θ[0],0}(a, Uθ[0]0 ) = F{θ[0],0}[k] (a−1−k, Uθ[0]0 ). In particular, θ[0] ∈ Θ[k][0] and

[Φ(U)]0= [Φ[k](U)]0. Therefore,

¯ d(X, X[k])≤ P([Φ(U)]0 6= [Φ[k](U)]0) ≤ P   [ a 0 [ i=θ[0] n

LaF{θ[0],i−1}(a, Uθ[0]i−1), Ui

 > ko

.

5.3. Proof of Theorem 4.1. This section is divided in three parts. First, as mentioned before the statement of the theorem, we define very weak non-nullness. Then, we prove some technical lemmas allowing to apply Lemma 4.1. Finally, we prove the theorem. 5.3.1. Definition of very weak non-nullness. Consider the set of range partitions{I(1)(a)}

a

of Definition 4.3. As observed by De Santis & Piccioni (2012), in the continuous case, since {αk}k≥0 increases monotonically to 1, there exists k≥ 0 such that αk > 0. Let k? be the

smallest of these integers and let F? be the following update function F?(a−1−k?, U0) := F(1)(b, U0αk?) ∀b s.t. a−1

−k? = b−1−k? .

F? is well defined, since U0αk? ≤ αk?, hence L(1)(b, U0αk?)≤ k?. In the case where k?= 0,

F? is simply defined as

F?(∅, U0) := F(1)(b, U0α0) =

X

a∈A

1{U0α0 ∈ I0(a|∅)} ∀b .

Definition 5.1(Coalescence set). For m≥ k?+ 1, let E

m, the coalescence set (different

from the set of coalescence times), be defined as the set of all u0−m+1 ∈ Am such that

F{−k? ?+1,0} a−1−k?F{−m+1,−k? ?} a−1−k?, u−k−m+1? αk? ! ,u 0 −k?+1 αk? !

does not depend on a−1−k?.

When k?= 0 and m = 1, we have E1:=∪a∈AI0(a|∅).

Definition 5.2. We say thatP is very weakly non-null if

(13)

Weak non-nullness corresponds to P(U0 ∈ E1) > 0, hence, it implies very weak

non-nullness.

5.3.2. Technical lemmas. Let Θ(1)[0] be the set of coalescence times defined by (3) for the function F(1). In a first part of the proof, we define a random time θ[0] (see (20)) and we show that it belongs to Θ(1)[0] and that it has finite expectation wheneverP

k≥k?(1−αk) <

+∞. This random variable is defined in the proof of Theorem 2 in De Santis & Piccioni (2012).

Recall that, by construction of the range partition {I(1)(a)}

a, for any a, L(a, Ui) = k

whenever αk−1 ≤ Ui < αk. This means that the sequence of ranges forms a sequence

{Li}i∈Z :={L(a, Ui)}i∈Z of i.i.d. (N∪ {0})-valued r.v.’s. We now introduce two sequences

of random times in the past, which are represented on Figure 2, in the particular case where k? = 2. Let

W1 := sup{m ≤ 0 : Uj < αj−m+k?, j = m, . . . , 0},

and for any i≥ 1

Yi:= inf{m < Wi : Un< αk?, n = m + 1, . . . , Wi} and Wi+1:= sup{m ≤ Yi : Uj < αj−m+k?, j = m, . . . , Yi}. time 0 −1 . . . −2 . . . W1 W2 WQ Y1 Y2 YQ YQ−1 |BQ| |B2| |B1|

Figure 2. We consider a realization of L0−∞in the particular case k? = 2, that is, the arrows, which represent the length function at each time index, have length larger or equal to 2.

Consider now the random variable

Q := inf{i ≥ 1 : (UYi+1, . . . , UWi−1)∈ EWi−Yi−1}

(see Definition 5.1 for Em) and put

θ[0] := YQ. (20)

Lemma 5.1. θ[0]∈ Θ(1)[0].

Proof. If θ[0] = −k, then there exists some l(= −WQ + 1) ≤ k such that U−i ≤ αk?,

i = l, . . . , k, and moreover, U−k−l ∈ Ek−l+1, that is

F{−l−k? ?+1,−l}  a−1−k?F{−k,−l−k? ?}(a−1−k?, 1 αk?U −l−k? −k ), 1 αk?U −l −l−k?+1  is independent of a−1−k?.

Since U−i ≤ αk?, i = l, . . . , k, it follows that

F{−l−k?+1,−l}  bF{−k,−l−k?}(b, U−l−k ? −k ), U−l−k−l ?+1  is independent of b.

By definition of the random times Wi, all the symbols in times {WQ, . . . , 0} can then

(14)

WQ until 0 go further time WQ− k?, see the Figure 2. Therefore, the construction of the

symbol at times 0 does not depend on the symbols before θ[0], i.e θ[0]∈ Θ(1)[0]. 

Lemma 5.2. E|W1| < +∞ whenever Pk≥k?(1− αk) < +∞.

Proof. Letting ¯αl−k? = αl for any l≥ k?, we have

W1 = sup{m ≤ 0 : Uj < ¯αj−m, j = m, . . . , 0}.

Thus W1 is defined exactly as τ [0] of display (4.2) in Comets et al. (2002), substituting

their ak’s by our ¯αk’s. They proved (see display (4.6) and item (ii) Proposition 5.1 therein)

that E|τ[0]| < +∞ whenever P

k≥0(1− ¯αk) < +∞. It follows that E|W1| < +∞ whenever

P

k≥k?(1− αk) < +∞. 

Lemma 5.3. E|θ[0]| < +∞ wheneverP

k≥k?(1− αk) < +∞.

Proof. As observed in De Santis & Piccioni (2012),{Wi− Yi− 1}i≥1 is a sequence of i.i.d.

geometric r.v.’s with success probability 1− αk? and {Yi− Wi+1}i≥0 (with Y0 := 0) is a

sequence of i.i.d. r.v.’s distributed as−W1, conditional to be non-zero. Moreover, Lemma

5.2 states that E|W1| < +∞. It follows that {Bi}i≥1 :={Yi−1− Yi− 1}i≥1 is a sequence

of i.i.d. N-valued r.v.’s with finite expectation. ThusPn

k=1Bi− nEB1 forms a martingale

with respect to the filtrationF(B1, . . . , Bi : i≥ 1) and we have by the optional sampling

theorem E|θ[0]| := E|YQ| = E Q X i=1 Bi ! = EQ.EB1< +∞.  We finally need the following lemma.

Lemma 5.4. For anyk≥ k?, θ[0]∈ Θ(1),[k][0].

Proof. For any k ≥ k?, F(1),[k] and L(1),[k] satisfy (18). This implies that, in the interval {YQ, . . . , WQ− 1}, coalescence occurs as well for F(1),[k], i.e. U−θ[0]WQ−1 ∈ E[k]θ[0]−WQ. Both

constructed chains are equals until the first time F(1) uses a range larger than k. But

at this moment, due to the definition of the Wi’s, we have already perfectly simulated at

least k symbols of both chains, and therefore, we can continue constructing until time 0 because the ranges of F(1),[k] are smaller of equal to k. It follows that YQ is a coalescence

time for F(1),[k], and therefore, θ[0]∈ Θ(1),[k][0] for any k≥ k?.  5.3.3. Proof of Theorem 4.1. By definition of F(1) and L(1), we have for any sufficiently

large k, [ a 0 [ i=θ[0] n

L(1)aF{θ[0],i−1}(1) (a, Uθ[0]i−1), Ui

 > ko 0 [ i=θ[0] {Ui> αk}.

By Lemmas 5.1 and 5.4, Lemma 4.1 applies and gives, for sufficiently large k’s ¯ d(X, X[k])≤ P   0 [ i=θ[0] {Ui > αk}  = P   |θ[0]| X i=0 1{Ui > αk} ≥ 1  ≤ E   |θ[0]| X i=0 1{Ui> αk}  

where we used the Markov inequality for the last inequality. Using the fact that θ[0] is a stopping time in the past for the sequence Ui, i≤ 0, and that it has finite expectation by

Lemma 5.3, we can apply the Wald’s equality to obtain ¯

(15)

5.4. Proof of Theorem 4.2. We divide this proof into two parts. First, we prove tech-nical lemmas allowing to use Lemma 4.1. Then, we prove the theorem.

5.4.1. Technical lemmas. Using the quantity `(U0

−∞) defined by (11), we define

θ[0] := sup{j ≤ 0 : `(U−∞i )≤ i − j , i = j, . . . , 0}. (21) Lemma 5.5. θ[0], defined by (21), belongs to Θ(1)[0]∩ Θ(1),[k][0] for any k≥ 0 and

P   [ a 0 [ i=θ[0] n L(1)aF{θ[0],i−1}(1) (a, Uθ[0]i−1), Ui  > ko  ≤ P   0 [ i=θ[0] ` Ui −∞ > k  . (22)

Proof. For any U−k−1, the way the sets of strings {zF{−k,−1}(1) (z, U−k−1)}z and Jk(U−k−1) are

defined ensure that the former is included in the later. It follows that, for any U−k−1, Ak(U−k−1) := inf x−1−k:x∈Jk(U−k−1) X a∈A inf z P (a|x −1 −kz)≤ X a∈A inf z P (a|F (1) {−k,−1}(z, U−k−1)z).

As the inequality A0≤Pa∈AinfzP (a|z) is also true, we deduce that, for any k ≥ 0, any

z∈ A−N, and any U−∞0

`(U−∞0 )≤ k ⇒ L(1)zF{−k,−1}(1) (z, U−k−1), U0

 ≤ k . By recurrence, this means that, for all θ[0]≤ i ≤ 0, F{θ[0],i}(1) (z, Ui

θ[0]) does not depend on z.

Hence, θ[0] is also a coalescence time for the update function F(1), that is θ[0]∈ Θ(1)[0].

Observe that we have proved, more specifically, that θ[0]∈ Θ0(1)[0], where Θ0(1)[0]⊂ Θ(1)[0]

is defined by 13 using F(1)and L(1). By Lemma 5.6 below, this implies that θ[0]∈ Θ(1),[k][0]

for any k≥ 0 as well.

We now prove the second statement of the lemma. If there exist i ≥ k, U−i−1 and z such that L(1)zF{−i,−1}(1) (z, U−i−1), U0



> k, then there exists some past a (take a = zF{−i,−k−1}(1) (z, U−i−k−1) for instance) such that L(1)aF(1)

{−k,−1}(a, U−k−1), U0

 > k. We now have the following sequence of inclusions

[ a 0 [ i=θ[0] n L(1)aF{θ[0],i−1}(1) (a, Uθ[0]i−1), Ui  > ko =[ a 0 [ i=θ[0] n L(1)aF{θ[0],i−1}(1) (a, Uθ[0]i−1), Ui  > ko∩ {θ[0] ≤ i − k − 1} ⊂[ a 0 [ i=θ[0] n L(1)aF{i−k,i−1}(1) (a, Uii−k−1), Ui  > ko∩ {θ[0] ≤ i − k − 1} ⊂ 0 [ i=θ[0] ` Ui −∞ > k .

This concludes the proof of the lemma. 

Recall the definition (13) of Θ0[0] for generic range partitions of a weakly non-null kernel P . We will need the following lemma.

(16)

Lemma 5.6. For anyk≥ 0, Θ0[0]⊂ Θ[k][0].

Proof. Let θ[0]∈ Θ0[0]. For any fixed k≥ 0, we separate two cases.

(1) If θ[0]≥ −k, then, by the definition of Θ0[0], the ranges used by F from θ[0] to 0 are

all smaller than or equals to k, and therefore using (18), we have that the length used by F[k] in the same interval of indexes are the same and the constructed symbols are the same as well. Thus θ[0]∈ Θ[k][0].

(2) If θ[0] <−k, then, by the definition of Θ0[0], we can apply the same method as in

the preceding case, and obtain that θ[0] is a coalescence time for F[k] for the time indexes from θ[0] up to θ[0] + k. But θ[0] is also a coalescence time for the time indexes from θ[0] + k + 1 up to 0, since the ranges used by F[k] are always smaller than or equal to k. Thus, in this case also, θ[0]∈ Θ[k][0].

 5.4.2. Proof of Theorem 4.2. In the conditions of this theorem, by Theorem 1 in De Santis & Piccioni (2012), θ[0] is P-a.s. finite. Moreover, by Lemma 5.5, θ[0]∈ Θ(1)[0]∩ Θ(1),[k][0]

for any k≥ 0. Thus we can apply Lemma 4.1, and obtain, using Lemma 5.5 ¯ d(X, X[k])≤ P   0 [ i=θ[0] ` Ui −∞ > k   and moreover P   0 [ i=θ[0] ` Ui −∞ > k  = P   0 X i=θ[0] 1` U−∞i  > k ≥ 1   ≤ E   0 X i=θ[0] 1` U−∞i  > k   .

Consider the σ-algebra Fk generated by U−k0 , k ≥ 0. Then, `(U−∞0 ) is a stopping time

with respect to Fk and, by definition, so is θ[0]. Moreover, `(U−∞i ) is independent of

Ui+10 by independence of the Uj’s. Finally, by stationarity, ` U−∞i

 D

= ` U−∞0 , hence E 1` U−∞i  > k  = E 1 ` U−∞0  > k , for any i ∈ Z. By Theorem 1 in De Santis & Piccioni (2012), θ[0] has finite expectation, hence we can use Wald equality to obtain

¯

d(X, X[k])≤ E|θ[0]|P(` U−∞0  > k) . (23) This concludes the proof of Theorem 4.2.

5.5. Proof of Theorem 4.3. Recall the definition of the set Θ0[0] given by (13). If θ[0]∈ Θ0[0] and θ[0]≥ −k, then we know that the range LaF{θ[0],i−1}(a, Uθ[0]i−1), Ui

 ≤ k for any i = θ[0], . . . , 0 and any a, therefore

[ a 0 [ i=θ[0] n LaF{θ[0],i−1}(a, Uθ[0]i−1), Ui  > ko⊂ {θ[0] < −k}.

By Lemma 5.6, any θ[0]∈ Θ0[0] also belongs to Θ[k][0] for any k≥ 0. We can thus apply

(17)

References

Bejerano, G. & Yona, G. (2001). Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43.

Bramson, M. & Kalikow, S. (1993). Nonuniqueness in g-functions. Israel J. Math. 84(1-2), 153–160.

Bressaud, X., Fern´andez, R. & Galves, A. (1999). Decay of correlations for non-H¨olderian dynamics. A coupling approach. Electron. J. Probab. 4, no. 3, 19 pp. (elec-tronic).

Busch, J. R., Ferrari, P. A., Flesia, A. G., Fraiman, R., Grynberg, S. P. & Leonardi, F. (2009). Testing statistical hypothesis on random trees and applications to the protein classification problem. Annals of applied statistics 3(2).

Coelho, Z. & Quas, A. (1998). Criteria for ¯d-continuity. Transactions of the American Mathematical Society 350, 3257–3268.

Collet, P., Duarte, D. & Galves, A. (2005). Bootstrap central limit theorem for chains of infinite order via Markov approximations. Markov Process. Related Fields 11(3), 443–464.

Comets, F., Fern´andez, R. & Ferrari, P. A. (2002). Processes with long memory: regenerative construction and perfect simulation. Ann. Appl. Probab. 12(3), 921–943. URL http://dx.doi.org/10.1214/aoap/1031863175.

Csisz´ar, I. & Talata, Z. (2006). Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory 52(3), 1007–1016. Csisz´ar, I. & Talata, Z. (2010). On rate of convergence of statistical estimation of

stationary ergodic processes. IEEE Trans. Inform. Theory 56(8), 3637–3641.

De Santis, E. & Piccioni, M. (2012). Backward coalescence times for perfect simulation of chains with infinite memory. J. Appl. Probab. 49(2), 319–337.

Duarte, D., Galves, A. & Garcia, N. L. (2006). Markov approximation and consis-tent estimation of unbounded probabilistic suffix trees. Bull. Braz. Math. Soc. (N.S.) 37(4), 581–592.

Fern´andez, R. & Galves, A. (2002). Markov approximations of chains of infinite order. Bull. Braz. Math. Soc. (N.S.) 33(3), 295–306. Fifth Brazilian School in Probability (Ubatuba, 2001).

Fern´andez, R. & Maillard, G. (2005). Chains with complete connections: general theory, uniqueness, loss of memory and mixing properties. J. Stat. Phys. 118(3-4), 555–588. URL http://dx.doi.org/10.1007/s10955-004-8821-5.

Gallo, S. (2011). Chains with unbounded variable length memory: perfect simulation and a visible regeneration scheme. Adv. in Appl. Probab. 43(3), 735–759. URL http: //dx.doi.org/10.1239/aap/1316792668.

Gallo, S. & Garcia, N. L. (2011). General context-tree-based approch to perfect simulation for chains of infinite order. Submitted, arXiv: 1103.2058v3 .

Galves, A., Galves, C., Garc´ıa, J. E., Garcia, N. L. & Leonardi, F. (2012). Context tree selection and linguistic rhythm retrieval from written texts. Ann. Appl. Stat. 6(1), 186–209. URL http://dx.doi.org/10.1214/11-AOAS511.

Hulse, P. (1991). Uniqueness and ergodic properties of attractive g-measures. Er-godic Theory Dynam. Systems 11(1), 65–77. URL http://dx.doi.org/10.1017/ S0143385700006015.

Johansson, A., ¨Oberg, A. & Pollicott, M. (2010). Unique bernoulli g-measures. arXiv:1004.0650v1 , 1–18.

(18)

Kallenberg, O. (2002). Foundations of modern probability. Probability and its Appli-cations (New York). New York: Springer-Verlag, second ed.

Ornstein, D. S. (1974). Ergodic theory, randomness, and dynamical systems. New Haven, Conn.: Yale University Press. James K. Whittemore Lectures in Mathematics given at Yale University, Yale Mathematical Monographs, No. 5.

Propp, J. G. & Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. In: Proceedings of the Seventh International Conference on Random Structures and Algorithms (Atlanta, GA, 1995), vol. 9.

Quas, A. (1996). Non-ergodicity for c1 expanding maps and g-measures. Ergodic Theory and Dynamical Systems 16, 531–544.

Rissanen, J. (1983). A universal data compression system. IEEE Trans. Inform. Theory 29(5), 656–664.

Walters, P. (2007). A natural space of functions for the ruelle operator theorem. Ergodic Theory and Dynamical Systems 27, 1323–1348.

Appendix A. Local continuity with respect to the past 1

In this section, we assume that A = {1, 2}, and that P has only one discontinuity point, the point 1 = . . . 111. We refer the interested reader to Gallo & Garcia (2011) for examples with discontinuities in more complicated set of pasts. To begin, we need the following definition.

Definition A.1(Local continuity with respect to the past 1). We say that a kernel P on {1, 2} is locally continuous with respect to the past 1 if

∀i ≥ 0, inf a−1−k X a∈A inf z P (a|1 i2a−1 −kz)

converges to1 as k diverges. We distinguish two particular situations of interest.

• We say that P is strongly locally continuous with respect to 1 if there exists an integer function ` : (N∪ {0}) → (N ∪ {0}) such that

∀i ≥ 0, inf a−1−k X a∈A inf z P (a|1 i2a−1 −kz) = 1 (24)

for any k≥ `(i), and

• we say that P is uniformly locally continuous with respect to 1 if α1k:= inf i≥0ainf−1−k X a∈A inf z P (a|1 i2a−1 −kz) (25) converges to 1 as k diverges.

Strongly locally continuous kernels are known as probabilistic context trees, a model that have been introduced by Rissanen (1983) as a universal data compression model. It was first consider, from the “CFTP point of view”, by Gallo (2011). The kernel ¯P is a simple example which is strongly and uniformly locally continuous with respect to 1. Assumption 1: P is strongly locally continuous with respect to 1.

(19)

Assumption 2: P is uniformly locally continuous with respect to 1. Notation A.1. Let us introduce the following notation.

• Stationary processes compatible with kernels satisfying Assumptions i=1 and 2 are denoted X(i), and the corresponding canonical k-steps Markov approximations are denoted X(i),[k].

• We use the notations r0(i):= α0 for i=1 and 2, and for k≥ 1,

r(1)k := rk−1(1) ∨ (1 − (1 − α(2))`−1(k)) r(2)k := rk(2)−1∨ (1 − (1 − α1k)/α(2))

where ` and α1k are the parameters of the kernels under assumptions 1 and 2 respectively.

• For i=1 and 2 vk(i):= k X j=1 X t1, . . . , tj ≥ 1 t1+ . . . + tj = k j Y m=1 (1− r(i)tm−1) tm−2 Y l=0 rl(i) (26) where Q−1 l=0:= 1.

• And finally, for any k ≥ 0, let uk:=bkα(2)/2cP   bkα(2)/2c X j=0 ξj −bkα(2)/2c α(2) > k/2   (27)

It is well-known that this sequence goes exponentially fast to 0 (see Kallenberg (2002) for instance). An explicit upper bound is obtained in Appendix C.

Corollary A.1. Under the weak non-nullness assumption, we have for i=1 and 2 that, if P k≥1 Qk−1 i=0 r (i) k =∞, ¯

d(X(i), X(i),[k])≤ uk+ v(i)bkα(2)/2c→ 0. (28)

The quantity defined on display (26) is related to the house of card process presented in Section B (see equation (30)). We provide in Propositions B.1 and B.2 explicit upper-bounds on the term (26) that can be plugged in (28). The term (27) is studied in Corollary C.1. It follows in particular from these propositions that, whenever rk(i)is not exponentially decreasing, the leading term in (28) is v(i)k , and therefore, we obtain for some constant C > 1 and any sufficiently large k

¯

d(X(i), X(i),[k])≤ Cv(i)bkα(2)/2c.

For instance, Proposition B.2, states that, if 1− rk(i) = rk+ sk, k ≥ 1 with r ∈ (0, 1) and

{sk}k≥1 is any summable sequence, we obtain for some constant C > 1

¯

d(X(i), X(i),[k])≤ C(log k)

3+r

k2−(1+r)2 . (29)

Proof. Under Assumptions 1 and 2 with weak non-nullness, Gallo & Garcia (2011) con-structed a set of range partitions generating a set of coalescence times Θ0[0] which is a.s. non-empty. This is what is stated in Corollaries 5.1 and 5.2 (and the discussions follow-ing them) for respectively Assumption 1 and 2. They defined a random time Λ(i)[0] (see

(20)

display (26) therein) which belongs to Θ0[0], as stated by Lemma 6.2 therein. They also prove that P(Λ(i)[0] < −k) is upper bounded by u(i)

k + v (i)

bkα(2)/2c. This is the content of

display (29) therein, making the necessary changes from their notation to our. By Theorem 4.3, these upper bounds are therefore upper bounds for ¯d(X, X[k]).

 Appendix B. Some results on the House of Cards Markov chain Fix a non-decreasing sequence{rk}k≥0of [0, 1]-valued real numbers converging to 1. The

house of Cards Markov chain H ={Hn}n≥0related to this sequence is the (N∪{0})-valued

Markov chain starting from state 0 and having transition matrix Q = {Q(i, j)}i≥0, j≥0

where Q(i, j) := ri1{j = i + 1} + (1 − ri)1{j = 0}. Let us denote vk:= Pr(Hk = 0), the

probability that the house of cards is at state 0 at time k. We want to obtain explicit rates of convergence to 0 of this sequence when H is not positive recurrent. These results will be used in the next section in order to obtain explicit upper bounds for ¯d(X, X[k]) under several types of assumptions. Decomposing the event {Hk = 0} into the possible

come back of the process{H`}`=0,...,k to 0 yields, for any n≥ 1

vk:= k X j=1 X t1, . . . , tj ≥ 1 t1+ . . . + tj = k j Y m=1 (1− rtm−1) tm−2 Y l=0 rl, (30) where Q−1

l=0 := 1. Although explicit, this bound cannot be used directly and has to be

simplified. As a first insight, we borrow the following Proposition of Bressaud et al. (1999). Proposition B.1. (i) vk goes to zero as k diverges if Pm≥1

Qm−1

l=0 rl= +∞,

(ii) vk is summable in k if 1− rk is summable in k,

(iii) vk behaves as O(1− rk) if 1− rk is summable ink and supjlim supk→+∞(11−t−rkjj)≤ 1

(iv) vk goes to zero exponentially fast if 1− rk decreases exponentially.

As observe in Bressaud et al. (1999), the conditions of item (iii) are satisfied if, for example, 1− r(i)k ∼ (log k)ηk−ζ for some ζ > 1, and for any η. However, this is one of

the only cases in which this proposition yields explicit rates. In the present paper, we will prove the following proposition.

Proposition B.2. We have the following explicit upper bounds.

(i) A non summable case: if 1− rk = kr + sk, k≥ 1 where r ∈ (0, 1) and {sn}n≥1 is a

summable sequence, there exists a constant C > 0 such that vk≤ C

(ln k)3+r (k)2−(1+r)2.

(ii) Generic summable case: if t:=Q

k≥0rk> 0, then

vk≤ inf K=1,...,k K

2(1− r

k/K) + (1− t∞)K .

(iii) Exponential case: if 1− rk ≤ Crrk, k ≥ 1, for some r ∈ (0, 1) and a constant

Cr∈ (0, log1r) then

vk ≤

1 Cr

(21)

B.1. Proof of Proposition B.2. Before we come into the proofs of each item of this proposition, let us collect some simple remarks on the House of Cards Markov chain.

Let{Tk}k≥0 be a sequence of the stopping times defined as T0 := 0 and, recursively, for

any k≥ 1, Tk := inf{l ≥ Tk−1+ 1 s.t. Hl= 0}. The Markov property ensures that the

random variables Ik:= Tk+1− Tk are i.i.d., valued in N and it is easy to check that

∀k ≥ 1, tk := Pr (I1 = k) = (1− rk−1) k−2 Y i=0 ri , whereQ−1

l=0 := 1. We have, for any n≥ 0,

Pr (Hn= 0) = Pr (∃k ≥ 0, s.t. Tk= n) = ∞

X

k=0

Pr (Tk= n) .

We write Tk = Pkl=0−1Il. As all the Il ≥ 1, we have Pr (Tk = n) = 0 for all k > n.

Therefore, for all K ∈ [1, n], Pr (Hn= 0) = n X k=0 Pr (Tk = n) = K X k=0 Pr (Tk = n) + n X k=K+1 Pr (Tk= n) . (31)

Fact B.1. Let K ∈ [1, n], we have Pr (∀l ∈ [1, K], Il ≤ n) = (1 − νn+1)K. In particular,

ifK ∈ [1, n], then Pr ∃j ∈ [K, n], s.t. j X l=0 Il = n ! ≤ Pr (∀l ∈ [1, K], Il≤ n) = (1− νn+1)K . In order to control PK

k=0Pr (Tk= n) = Pr (∃k = 0, . . . , K, Tk= n), we can simply

re-mark that, if there exists k ∈ 1, . . . K such that Pk

i=1Il = n, there exists necessarily

i∈ [1, K] and r ∈ [1, . . . , K] such that Ii= n/r. This implies that

Pr (∃k = 0, . . . , K, Tk= n)≤ Pr  ∃i ∈ [1, K], ∃r ∈ [1, . . . , K], s.t. Ii= n r  ≤ K X i=1 K X j=1 PrI1 = n r  ≤ K2tn/K . We obtained the following result.

Fact B.2. Let K∈ [1, n], we have

K

X

k=0

Pr (Tk= n) ≤ K2tn/K .

Restricting our attention to the summable case (that is, when P

k≥0(1− rk) < +∞),

the following fact is fundamental. Its proof is immediate. Fact B.3. If P

n≥0(1− rn) <∞, then t∞ := Pr (I1 =∞) = Q∞i=0ri > 0, in particular,

νn:= Pr (I1 ≥ n) ≥ t∞> 0. Moreover, for all n≥ 0, t∞(1− rn)≤ tn≤ (1 − rn)

Using Facts B.1, B.2 and B.3, we are ready to prove items (i) and (ii) of Proposition B.2.

(22)

Proof of Item (i) of Proposition B.2. As far as we know, all the results on the house of card process hold in the summable case. WhenP

k≥0(1− rk) =∞, it is only known that

P

n≥0Pr (Hn= 0) = ∞. It is interesting to notice that we can still obtain some rate of

convergence for Pr (Hn= 0) from our elementary facts, at least in the following example.

Let us assume that there exists r < 1 and a summable sequence snsuch that, for all n≥ 1,

1− rn= nr+ sn. In this case, we have Pn≥0(1− rn) =∞, therefore t∞= 0. Nevertheless, n Y i=0 ri ≤ n Y i=1 e−(1−ri)= e−r ln n+O(1)≤ Cn−r .

Therefore tn ≤ Cn−(1+r). Moreover, using the inequality (1− u) ≥ e−u−u

2

, valid for all u < 1/8, we see that tn≥ cn−(1+r). Therefore, νn=Pk≥ntk≥ cn−r. It follows from Fact

B.1 that, for large K and n,

n

X

k=K+1

Pr (Tk= n)≤ (1 − νn+1)K ≤ e−cKn−r .

Using Fact B.1, we also have

K

X

k=0

Pr (Tk= n)≤ CK2tn/K ≤ CK3+rn−(1+r) . (32)

We deduce then from (31) that, for all K∈ [0, n], Pr (Hn= 0)≤ C K 3+r n1+r + e−cKn −r . For K = 2nrln n, we obtain Pr (Hn= 0)≤ C (ln n)3+r n1−2r−r2 = C (ln n)3+r n2−(1+r)2 .

If 0 < r < 1, we have 2−(1+r)2> 0. This bound may not be optimal, but it is interesting

to see that we still can obtain rates of convergence from our basic remarks even in this

pathological example. 

Proof of Item (ii) of Proposition B.2. We deduce from Facts B.1 and B.3 that, in the sum-mable case

n

X

k=K+1

Pr (Tk= n)≤ (1 − t∞)K .

Therefore, from Facts B.2 and B.3, Pr (Hn= 0)≤ inf

K=1,...,n K

2(1− r

n/K) + (1− t∞)K . (33)

 Proof of Item (iii) of Proposition B.2. In this section, we assume that, for all k, 1− rk ≤

(23)

independence, Pr k X l=1 Il= n ! = X i1+...+ik=n Pr k \ l=1 Il= il ! = X i1+...+ik=n k Y l=1 Pr (Il= il) ≤ X i1+...+ik=n Crkri1+...+ik = Ck rrn X i1+...+ik=n 1 . Let us evaluate the numbers pk,n =Pi1+...+ik=n1. We have p1,n= 1 and

pk,n= n−k+1 X l=1 X ik=l X i1+...+ik−1=n−l 1 = n−k+1 X l=1 p1,lpk−1,n−l = n−k+1 X l=1 pk−1,n−l .

Let us then assume that, for some k, we have, for all n≥ k − 1, pk−1,n ≤ nk−2/(k− 2)!.

Notice that this is the case for k = 2, then, for all n≥ k, pk,n≤ n−k+1 X l=1 (n− l)k−2 (k− 2)! = n−1 X l=k−1 lk−2 (k− 2)! ≤ Z n k−1 x(k−2) (k− 2)! ≤ nk−1 (k− 1)! . We deduce that n X k=1 Crk X i1+...+ik=n 1≤ 1 Cr n X k=1 (Crn)k−1 (k− 1)! ≤ eCrn Cr . Therefore, Pr (Hn= 0) = n X k=1 Pr (Tk= n)≤ 1 Cr (eCrr)n .

Hence, when Cr < ln(1/r), eCrr < 1 and Pr (Hn= 0) decreases exponentially fast. 

Appendix C. Concentration of geometric random variables

Let ξ, ξ1:nbe i.i.d. geometric random variables with parameter α, i.e.,∀k ≥ 1, P (ξ = k ) =

(1− α)k−1α. We obtain in this section the following upper bounds.

Proposition C.1. letC1,α= 1−αα + 4 1−αα 2, C2,α= ln  2−α 2(1−α)∧ 2  . Then, ∀x > 0, P 1 n n X i=1 Xi− 1 α > x ! ≤ e−n  x2 2C1,α∧ C2,α 2 x  . (34) P 1 n n X i=1 Xi− 1 α <−x ! ≤ e−n  x2 2C1,α∧ x 2  .

As a corollary of this result, we obtain the following bound when n = bkα/2c and x = 1/α.

(24)

Corollary C.1. Let k ∈ N, α ∈ (0, 1), n = bkα/2c, x = k/(2n) ≥ 1/α, ξ1:n be i.i.d.

random variables with parametersα, and

uk:= nP   n X j=1 ξj− n α > nx   .

Then, we have, forC3,α:= 4(1−α)(4−3α)α ∧14ln



2−α 2(1−α)∧ 2



, for all > 0 and all k > k(), uk≤ αe−k(C3,α−) .

C.1. Chernov’s bound. Let Y, Y1:n be i.i.d. random variables such that ∀a < λ < b,

E eλY < ∞, then, ∀x > 0, P 1 n n X i=1 Yi > x ! ≤ inf na<λ<nbe −λx E  eλnY  n . (35)

Proof. We have, by independence of the Yi and Markov’s inequality, for all na < λ < nb,

P 1 n n X i=1 Yi > x ! = Peλn Pn i=1Yi > eλx≤ e−λx E  eλn Pn i=1Yi= e−λx E  eλnY  n .  C.2. Exponential moments of geometric random variables. Let ξ be a geometric random variable with parameter α, then

∀λ < − ln(1 − α), E  eλξ αe λ 1− (1 − α)eλ , (36) ∀λ > ln(1 − α), E  eλ(−ξ) αe−λ 1− (1 − α)e−λ .

Proof. By definition, we have,∀λ < − ln(1 − α), E  eλξ=X k≥1 eλk(1− α)k−1α = αeλX k≥0  (1− α)eλk= αe λ 1− (1 − α)eλ .

Moreover, for all λ > ln(1− p), E  e−λξ= αe−λX k≥0  (1− α)e−λk= αe−λ 1− (1 − α)e−λ .

C.3. Proof of the deviation bounds. Plugging (36) in (35), we obtain, for all λ < −n ln(1 − α), P 1 n n X i=1 ξi > 1 α + x ! ≤ e−λ(α1+x) αe λ/n 1− (1 − α)eλ/n !n = αne−λ(α1+x−1)e−n ln(1−(1−α)eλ/n)

(25)

Choosing λ = n for  ≤ ln2(12−α−α)∧ 2, using the inequalities e ≤ 1 +  + 2 for all

≤ ln 2 and − ln(1 − u) ≤ 1 + u + u2 when u≤ 1/2, this last bound is equal to  αe−(α1+x−1)e− ln( 1−(1−α)e) n ≤  αe−(α1+x−1)e− ln(α)−ln  1−(1−α)α (e−1) n ≤  e−(α1+x−1)e (1−α) α (e−1)+ (1 −α) α (e−1) 2n ≤ e−n  x−1−αα +4( 1−α α ) 2  .

Let Cα= 1−αα + 4 1−αα 2, choosing ≤ x/(2Cα), we have x− Cα≥ x/2, hence, choosing

 = 2Cxα ∧ ln2(1−α)2−α ∧ 2, we conclude the proof. Plugging (36) in (35), we obtain, for all λ > n ln(1− α), P 1 n n X i=1 ξi < 1 α − x ! = P 1 n n X i=1 (−ξi) >− 1 α + x ! ≤ e−λ(−α1+x) αe −λ/n 1− (1 − α)e−λ/n !n = αne−λ(−α1+x+1)e−n ln(1−(1−α)e−λ/n)

Choosing λ = n, with ≤ 1, this last bound is equal to  αe−(−α1+x+1)e− ln(1−(1−α)e−) n =αe−(−α1+x+1)e− ln( α )−ln(1−(e−−1)1−αα ) n ≤e−(−1α+x+1)e(e−−1)1−αα +((e−−1)1−αα ) 2n ≤e−(−α1+x+1)e(−+2) 1−α α +((−+2) 1−α α ) 2n ≤ e−n h x−21−α α +4( 1−α α ) 2 i .

Let Cα= 1−αα + 4 1−αα 2, choosing ≤ x/(2Cα), we have x− Cα≥ x/2, hence, choosing

 = 2Cx

α ∧ 1, we conclude the proof. 

Departamento de m´etodos estat´ısticos, Instituto de Mat´ematica, Universidade Federal do Rio de Janeiro, Caixa Postal 68530, 21945-970, Rio de Janeiro, Brasil

E-mail address: sandro@im.ufrj.br

Laboratoire J.A.Dieudonn´e UMR CNRS 6621, Universit´e de Nice Sophia-Antipolis, Parc Valrose, 06108 Nice Cedex 2, France

E-mail address: mlerasle@unice.fr

Institute of Neuroscience and Psychology Department, Princeton University, Green Hall, NJ, 08540

Figure

Figure 1. Illustration of a range partition related to some infinite past a.
Figure 2. We consider a realization of L 0 −∞ in the particular case k ? = 2, that is, the arrows, which represent the length function at each time index, have length larger or equal to 2.

Références

Documents relatifs

The mapping h with c = 1 can be based on a space-filling curve: Wächter and Keller [2008] use a Lebesgue Z-curve and mention others; Gerber and Chopin [2015] use a Hilbert curve

is a continuous linear functional on E^ (resp. We topologize this space with the topology of convergence of coefficients. Q,) be the space of formal power series whose

Tikhonov-type Cauchy problems are investigated for systems of ordinary differential equations of infinite order with a small parameter  and initial conditions.. The

In this context, we say that ρ is modular if it is isomorphic to the p-adic Galois representation ρΠ attached to a cuspidal automorphic Galois representation Π of GL2 AF which

for elliptic curves E defined over function fields provided the Tate-Shafarevich group for the function field is finite.. This provides an effective algorithm

tree onstraints, that is to say, the ability to onisely desribe trees whih the most powerful.. mahine will never have time

Expressiveness of Full First Order Constraints in the Algebra of Finite or Infinite Trees.. Alain Colmerauer,

Let K (1) be the identity S × S matrix (seen as a speed transfer kernel, it forces the chain to keep going always in the same direction) and consider again the interpolating