Exponential inequalities for VLMC empirical trees

(1)

January 2008, Vol. 12, p. 219–229 www.esaim-ps.org

DOI: 10.1051/ps:2007035

EXPONENTIAL INEQUALITIES FOR VLMC EMPIRICAL TREES

^∗

Antonio Galves

¹

, V´ eronique Maume-Deschamps

²

and Bernard Schmitt

²

Abstract. A seminal paper by Rissanen, published in 1983, introduced the class of Variable Length Markov Chains and the algorithm Context which estimates the probabilistic tree generating the chain.

Even if the subject was recently considered in several papers, the central question of the rate of convergence of the algorithm remained open. This is the question we address here. We provide an exponential upper bound for the probability of incorrect estimation of the probabilistic tree, as a function of the size of the sample. As a consequence we prove the almost sure consistency of the algorithm Context. We also derive exponential upper bounds for type I errors and for the probability of underestimation of the context tree. The constants appearing in the bounds are all explicit and obtained in a constructive way.

Mathematics Subject Classification. 62M05, 60G99.

Received October 22, 2006. Revised June 10, 2007.

1. Introduction

Variable Length Markov Chains were ﬁrst introduced in the information theory literature by Rissanen [11]

under the name of ﬁnite memory sourceor probabilistic tree. More recently this class of models became quite popular in the statistics literature under the name of Variable Length Markov Chains (VLMC)(B¨uhlman and Wyner [2]).

VLMC is a ﬂexible class of Markov chains in which the part of the past which is relevant to predict the next symbol has a variable length depending on the observed past values. These relevant parts of the past are called contexts. The set of contexts deﬁne a partition of the past and can be represented by a tree.

The notion of context makes VLMC models parsimonious, with less parameters to estimate than the tradi- tional approach to Markov chains in which the number of parameters grows exponentially with the length of the fixed past. Rissanen [6] introduced VLMC models as an universal tool to perform data compression. Recently, it has been used as a flexible class of processes to model scientific data, for instance in genomics to classify proteins [1, 9].

Keywords and phrases. Variable Length Markov Chain, context tree, algorithm context, weak dependance.

∗This work has been partially supported by the program ACI-NIMDynamicAland by CNPq grant 301301/79 (A.G.) ans is part of Pronex/FAPESP’s ProjectStochastic behavior, critical phenomena and rhythmic pattern identiﬁcation in natural languages (grant number 03/09930-9), FAPESP-CNRS ProjectProbabilistic phonlogy of rhythmand CNPq ProjectStochastic modeling of speech(grant 475177/2004-5).

1Instituto de Matemática e Estat´ıstica, Universidade de São Paulo, BP 66281, 05315-970 São Paulo, Brasil;[email protected]

2 Institut de Math´ematiques de Bourgogne, BP 47870, 21078 Dijon cedex France;{vmaume;schmittb}@u-bourgogne.fr c EDP Sciences, SMAI 2008

Article published by EDP Sciences and available at http://www.esaim-ps.org or http://dx.doi.org/10.1051/ps:2007035

(2)

Rissanen [11] not only introduced the notion of VLMC, but he also introduced the algorithm Context to estimate the tree of contexts of the VLMC as well as the family of probability transitions generating the VLMC.

The way the algorithm Context works can be summarized as follows. Given a sample produced by a VLMC, we start with a maximal tree of candidate contexts for the sample. The branches of this ﬁrst tree are then pruned until we obtain a minimal tree of contexts well adapted to the sample. Given a candidate tree of contexts we associate to each context a probability transition which is estimated as the proportion of time in the sample the context is followed by this symbol. Several variants of the algorithm Context have been presented in the literature. In all the variants the decision to prune a branch is taken by considering a gainfunction. A branch is pruned if the gain function assumes a value smaller than a given threshold. The estimated context tree is the smallest tree satisfying this condition. The estimated family of probability transitions is the one associated to the minimal tree of contexts.

Several recent papers addressed the question of the estimation of the tree of contexts of the VLMC as well as its associated set of probability transitions, either using variants of the algorithm Context (cf. [2, 11] for the bounded case, and Ferrari and Wyner [8] for inﬁnite trees), the BIC (Csiszar and Talata [4]), or other algorithms (cf. [12, 13]). However the central question of the rate of convergence of the estimation algorithm remained open. This is the question we address here.

The central result of this paper is an exponential upper bound for the probability of error estimation of the ﬁnite probabilistic tree deﬁning a Variable Length Markov Chain. As a consequence we prove the almost sure consistency of the algorithm Context. We also derive exponential upper bounds for type I errors and for the probability of underestimation of the context tree.

Our proofs are inspired by recent exponential upper bounds obtained by Dedecker and Doukhan [5], Dedecker and Prieur [6] and Maume-Deschamps [10]. In the first paper, an upper bound for the L^p norm of sums of weakly dependant variables is obtained. This upper bound is used in the second paper to obtain an exponential inequality for sums of weakly dependent random variables. Finally, this exponential inequality leads to an estimation of conditional probabilities in the third paper. VLMC satisfy theτ-dependence condition as defined in [6], then exponential inequalities follows. Nevertheless, the present paper takes advantage of the VLMC structure to obtain a more specific version of the estimation of conditional probabilities. This makes possible to control the probability of error of the version of the algorithm Context we consider here. Our proofs are constructive and the constants appearing in the bounds are all explicit and computable.

This paper is organized as follows. In Section 2, we state the deﬁnitions and our results. Section 3 is devoted to the proof of an exponential bound for conditional probabilities. In Section 4, we apply this exponential bound to estimate the rate of convergence of the algorithm Context and the corollaries.

2. Definitions and results

Let (Xn)_n∈Z be a stationary process taking values on a ﬁnite alphabetA. Given two time indicesm < nwe shall use the short hand notationxⁿ_mto denote the string (xm, . . . , xn). The process is called aVariable Length Markov Chain (VLMC)if for any pastx⁻¹_−∞ there exists an indexk=k(x⁻¹_−∞) such that for anya∈Awe have P(X₀=a|X_−∞⁻¹ =x⁻¹_−∞) =P(X₀=a| X_−k⁻¹=x⁻¹_−k). (2.1) Following Rissanen [11], we call any suﬃx stringx⁻¹_−k satisfying (2.1) acontext.

We will use the shorthand notation

p(a|x⁻¹_−k) =P(X0=a|X_−k⁻¹=x⁻¹_−k) and p(x⁻¹_−k) =P(X_−k⁻¹=x⁻¹_−k).

(3)

Let us callτ the set of all contexts associated to the VLMC (Xn)_n∈Z. We will assume thatτ is minimal. This means that for each contextx⁻¹_−k∈τ, there is no 1≤j < ksuch that

p(a|x⁻¹_−j) =p(a|x⁻¹_−k),

for anya∈A. This minimality condition is equivalent to thesuffix propertywhich means that no context is a suffix of any other context. The suffix property implies thatτ can be represented as a treein which contexts are identified by the leaves of the tree. The tree of contextsτ defines a partition of the pasts accepted by the process (Xn)_n∈Z.

The tree of contextsτ together with a familypof conditional probabilities onA, indexed by the leaves ofτ, will be called aprobabilistic tree.

In what follows we shall always assume that the treeτ isﬁnite. This means that for any pastx⁻¹_−∞the index k=k(x⁻¹_−∞) appearing in (2.1) is bounded.

From now on the length of a ﬁnite string wwill be denoted |w|. Given two ﬁnite stringsw andv, we will denote by vw the string with length|w|+|v|obtained by concatenating the two strings. We will also denote byh(τ) the height of the treeτ,i.e.

h(τ) = max{|w|;w∈τ}.

Given a finite sample (X1, . . . , Xn) and a finite stringw, with|w| ≤n, we define the counting random variables Nn(w) as the number of timesw appears in the sample

N_n(w) =

m=n−|w|+1

m=1

1{X_m^m+|w|−1=w}.

For any elementa∈A, the empirical probability transition ˆpn(a|w) is deﬁned by

pn(a|w) = Nn(wa) + 1

N_n(w) +|A|· (2.2)

This definition ofp_n(a|w) is convenient because it is asymptotically equivalent to ^N_Nⁿ_n^(wa)_(w) and it avoids an extra definition in the case Nn(w) = 0. To estimate the probabilistic tree producing a sampleX1, . . . , Xn we define thegainfunction ∆ as follows. For any pair of finite stringsuand w, define the random function

∆_n(u, w) = max

a∈A|pˆn(a|w)−pˆn(a|uw)|.

For any≥1 and anyδ >0, we consider the complete tree of heightand prune it in such way that all nodes wof the resulting tree satisfy:

maxa∈A∆_n(a, w)> δ.

Denoteτn(, δ) the resulting tree.

We associate to eachw∈τn(, δ) a probabilitypn(·|w) onAusing Deﬁnition (2.2).

If there is no danger of confusion we will omit to mentionandδin the notationτn(, δ). This construction of the empirical probabilistic tree (τn,pn) is a variant of Rissanen’s algorithm Context.

Let us introduce some quantities associated to the family of probability transitionspwhich will be useful in what follows. First of all, we extend the conditional probabilitiespto suﬃxes of the contexts in a natural way.

Namely, if w= (w₋₁, . . . , w_−k)∈τ, for alli= 1, . . . , k−1, let

p(a|w⁻¹_−i) =P(X0=a|X_−i⁻¹=w_−i⁻¹),

(4)

where the probabilities appearing in the formula are those of the stationary VLMC corresponding to (τ, p). Let us introduce some extra notations

D(τ) = min

w∈τ min

i=1,...,|w|−1max

a∈A|p(a|w)−p(a|w⁻¹_−i)|, ρ= min

w∈τa∈A

p(a|w), p_min= min

w∈τp(w),

⎧⎪

⎪⎪

⎪⎨

⎪⎪

⎩

θ= min

(u,v)∈τ

a∈A

min(p(a|u), p(a|v))

γ=

1−θ^h(τ)_h(τ)¹ β= (1−γ).

(2.3)

The quantityθis called the Dobrushin’s coeﬃcient of (Xn)_n∈Z. Observe thatτ ﬁnite implies thatpmin>0.

We will assume the following assumption.

Positivity Assumptionρ >0.

This implies that the process is aperiodic and irreducible and that 0 < θ. Obviouslyθ is always bounded above by 1. Moreover it only takes the value 1 in the trivial case in which the process is a sequence of independent random variables which will not be considered here. Our main result is the following theorem.

Theorem 2.1. For all ≥h(τ), there existsn¯(), such that for all n≥n¯ the following inequality holds

P(τn() =τ)≥1−4e¹^e

|A|exp

−nβp²_minρ^−h(τ) 4e

δ

2 −|A|+ 1 np_min

₂

+|A|^h(τ)exp

−nβp²_min 4e

D(τ)−δ

2 −|A|+ 1 npmin ρ^−h(τ)

₂ .

A ﬁrst consequence of the theorem is the following almost sure consistency result for the empirical probability trees.

Corollary 2.2. Let us assume that the sample(X1, . . . , Xn)has been produced by a VLMC whose probabilistic tree is (τ, pτ). Letκ= (κn)_n≥1 be any sequence such that

n≥1

exp[−κ²_nnβp²_min 4e ]<∞.

Then for almost all inﬁnite samples X₁, X₂, . . .there existsn¯= ¯n(X₁, X₂, . . .), such that for all n≥¯nwe have

• τn=τ and

• sup_w∈τmax_a∈A|pn(a|w)−p(a|w)|< κn.

Where(ˆτn,pn)is the probabilistic tree constructed with=h(τ).

This shows that our variant of the context algorithm is consistent almost surely.

A convenient choice for κn could be _n¹_α,0< α < ¹₂.

Another application of the theorem are the following exponential upper bounds for type I errors and for the probability of underestimation of the context tree.

(5)

In the ﬁrst case, we want to test thenull assumptionthat the sample was generated by a VLMC corresponding to the probabilistic tree (τ, p). To estimate the probability of type I error, we take the control parameter=h(τ).

The result is the following

Corollary 2.3. Let us assume that the sample (X1, . . . , Xn, . . .) has been produced by a VLMC whose proba- bilistic tree is (τ, p_τ). Then the following inequality holds

P

{τ_n=τ } ∪ {τ_n =τ , sup

w∈τmax

a∈A|p_n(a|w)−p_τ(a|w)|> t}

≤

e¹^e

8|A|^h(τ)exp[−nk1] + 2|A|exp

−

t−|A|+ 1 npmin

₂ nk2

where

k1= βp²_min 8e max

[δ/2− |A|/(npmin)]²ρ^h(τ)−1,

D(τ)−δ

2 −|A|+ 1 npmin

₂

and

k2=βp²_min 8e ·

Corollary 2.3 provides an exponential upper bound for the error of type I,i.e. the probability of erroneously rejecting the null hypothesis that the sample was generated by the probabilistic tree (τ, p).

When testing in the class of probabilistic trees, the most serious mistake consists in erroneously accepting that the contexts are smaller than they really are. The following lemma provides an exponential upper bound for this type of error that we callunderestimate error.

Given two trees of contextτ andτ, we shall writeτ≤τ if allw∈τ is a suffix of some w=uw∈τ. This defines an order on trees, which is not complete. Obviously, if τ < τ, there is at least one w∈τ which is a strict suffixw=uw∈τ, |u| ≥1.

For 0< θ0<1, 0< β0<1, 0< p0<1 and∈Ndeﬁne

T(D₀, β₀, p₀, ) ={(τ, p), / h(τ)≤, D(τ)≥D₀, β(τ)≥β₀, p_min(τ)≥p₀}.

Corollary 2.4. Let us suppose that the sample X1, . . . , Xn has been generated by a probabilistic tree belonging to the class T(D₀, β₀, p₀, ). The following holds

sup

(τ,p)∈T(D0,β0,p0,)P(ˆτn< τ)≤4e¹^e|A|exp

−β₀p²₀n 8e

D₀−δ

2 −|A|+ 1 np₀

₂ .

The main ingredient of the proof are the following exponential inequalities.

Theorem 2.5. For any ﬁnite sequencea^j₀ and any t >0 the following inequality holds P(|N_n(a^j₀)−np(a^j₀)|> t)≤e¹^eexp

−β t²p_min 2enp(a^j₀) . Moreover, for any n >(|A|+ 1)/tp(a^j₀) we have

P

|ˆpn(aj|a^j−1₀ )−p(aj|a^j−1₀ )|> t

≤2·e¹^eexp

⎛

⎜⎝−

t−_np(a^|A|+1j−1 0

₂

npminp(a^j−1₀ )β 8e

⎞

⎟⎠. (2.4)

(6)

3. Proof of Theorem 2.5

The proof of Theorem 2.5 follows the strategy developed in Maume-Deschamps [10] together with the following classical mixing property of Markov chains.

Lemma 3.1. Let (Xi)_i∈Z be a VLMC satisfying the Positivity Assumption ρ > 0. Let τ be the related tree whose height equalsh <∞. For anyn≥0, for any m≥0, any word u⁻¹_−k inτ and any b^m₀ ﬁnite sequence, the following inequality holds

|P(X_n^n+m=b^m₀|X_−k⁻¹=u⁻¹_−k)−p(b^m₀)| ≤ p(b^m₀)

p_min γⁿ. (3.1)

Recall that γandpmin have been deﬁned by (2.3).

Proof. We will use the method of maximal coupling as presented in [7]. The VLMC process (Xi)_i∈Z being a Markov chain of order h, we consider the Markov process of order 1 taking values in A^h: (Yn)_n∈Z, Yn = (Xnh, . . . , X(n+1)h−1). For anyn∈Z, (a^2h−1_h )∈A^h, (b^h−1₀ )∈A^h,

P(Y1=a^2h−1_h |Y0=b^h−1₀ )

=

h−1

j=0

P(Xj=aj|X_j+1^2h−1=a^2h−1_j+1 ).

If we denoteθh the Dobrushin’s coeﬃcient of (Yn)_n∈Z, a simple computation givesθh≥θ^h.

The irreducibility of (Xn)_n∈Zinduces irreducibility of (Yn)_n∈Z. The conditionρ >0 implies that (Xn)_n∈Zis aperiodic of degree 1 and (Yn)_n∈Z is also aperiodic of degree 1. It remains to apply Theorem 3.2.3 in Ferrari and Galves [7] for the process (Yn)_n∈Z. We obtain : ∀(u^(n+1)h−1_nh )∈A^h, ∀(v^h−1₀ )∈A^h, ∀n≥0,

|P(Yn =u^(n+1)h−1_nh |Y0=v₀^h−1)−µ(u^(n+1)h−1_nh )| ≤(1−θh)ⁿ (3.2) whereµis the invariant probability measure of (Yn)_n∈Z onA^h. Of course,µcoincides with the probability that we have denotedp. We can reformulate (3.2) in the following way: ∀(u^(n+1)h−1_nh )∈A^h, ∀(v^h−1₀ )∈τ, ∀n≥0,

|P(X_nh^(n+1)h−1=u^(n+1)h−1_nh |X₀^h−1=v₀^h−1)−p(u^(n+1)h−1_nh )| ≤(1−θ^h)ⁿ. (3.3) We are now able to prove the announced mixing property. Takea^h−1₀ ∈A^h and any ﬁnite sequenceb^m₀ and any n≥0, denoten₀= [ⁿ_k]. Using the fact that (X_n)_n∈Zis a Markov chain of orderh:

P(X_−h⁻¹=a⁻¹_−h∩X_n^n+m=b^m₀)

= P(X_−h⁻¹=a⁻¹_−h)

z⁻¹_−h∈A^h

P(X_n^n+m=b^m₀|Y_n₀ =z_−h⁻¹)·P(Y_n₀=z_−h⁻¹|Y₀=a⁻¹_−h)

= P(Y₀=a⁻¹_−h)

P(X_n^n+m=b^m₀) +

z⁻¹_−h∈A^h

P(X_n^n+m=b^m₀ |Y_n₀ =z_−h⁻¹)· P(Y_n₀ =z_−h⁻¹|Y₀=a⁻¹_−h)−p(z_−h⁻¹)

.

Using (3.2), we get :

|P(Y0=a⁻¹_−h∩X_n^n+m=b^m₀)−P(Y0=a⁻¹_−h)P(X_n^n+m=b^m₀)|

≤ P(Y0=a⁻¹_−h)

pmin ·P(X_n^n+m=b^m₀)(1−θ^h)^[ⁿ^h^]. (3.4)

(7)

Then, (3.1) follows from (3.4) by dividing by P(Y0 =a⁰_−h) and observing that the conditional probabilities of the VLMC (Xn)_n∈Z by the event{Y0 =a⁰_−h} are the same as those conditioned by the event {X_−k⁻¹ =a⁻¹_−k}

witha⁻¹_−k ∈τ (see formulae (2.1)).

Corollary 3.2. Let (X_i)_i∈Z be a VLMC satisfying the Positivity Assumption ρ >0. Let τ be the related tree whose height equals h <∞. For anyn≥0, for anym≥0, any ﬁnite wordsu⁻¹_−j,b^m₀, inequality (3.1) holds.

Proof. We only have to prove (3.1) for a wordu⁻¹_−k which is a strict suﬃx of a branch ofτ. We denote by τ_u⁻¹

−j

the set of branches ofτ admittingu⁻¹_−k as a strict suﬃx. These are wordswsuch that : wu⁻¹_−k∈τ. Then :

|P(X_n^n+m=b^m₀ ∩X_−j⁻¹=u⁻¹_−j)−P(X_n^n+m=b^m₀)P(X_−j⁻¹=u⁻¹_−j)|

= |

w∈τ_u−1

−j

P(X_n^n+m=b^m₀ ∩X_−j−1^−j−|w|=w∩X_−j⁻¹=u⁻¹_−j)

−P(X_n^n+m=b^m₀)P(X_−j−1^−j−|w|=w∩X_−j⁻¹=u⁻¹_−j)|

≤ 1

pmin|

w∈τ_u−1

−j

P(X_n^n+m=b^m₀)·P(X_−j−|w|^−j−1 =w∩X_−j⁻¹=u⁻¹_−j)(1−θ^h)^[ⁿ^h^]|

= 1

pmin(1−θ^h)^[ⁿ^h^]P(X_n^n+m=b^m₀ )·P(X_−j⁻¹=u⁻¹_−j).

We conclude the proof by dividing byP(X_−j⁻¹=u⁻¹_−j).

Proof of Theorem 2.5. It is similar to Corollary 1.3 and Theorem 1.5 in Maume-Deschamps [10]. It is based on an inequality by Dedecker-Doukhan (2003)[5], Proposition 4.

In our setting, let a^j₀ = w·u with w ∈ τ and u a ﬁnite sequence. Let Yi = 1_{X^i+j−1

i =a^j₀}−p(a^j₀) and Mi be the σ-algebra generated by X0, . . . , Xi−1. From the deﬁnition of the counting function Nn, following Dedecker-Doukhan [5], Prop. 4), we have for anyq≥2,

Nn(a^j₀)−np(a^j₀) q ≤

2q

n−j−1

i=1

i≤≤nmax Yi

k=i

E(Yk|Mi) ^q₂ ¹₂

≤

2q

n−j−1

i=1

i≤≤nmax Y_i ^q₂

n−j−1

k=i

E(Y_k|M_i) _∞ ¹₂

≤

2q

n−j−1

i=1 n−j−1

k=i

sup

xⁱ⁻¹₀ ∈Aⁱ|P(X_k^k+j=a^j₀|X₀ⁱ =xⁱ₀)−p(a^j₀)|

¹₂

≤

n2qp(a^j₀) p_minβ

¹₂ .

The last inequality follows from (3.4), where the parameterβ was deﬁned in (2.3).

Then, we follow Dedecker-Prieur (2005)[6] and we get that for allt >0, P(|N_n(a^j₀)−np(a^j₀)|> t)≤e¹^eexp

−t²βpmin

2enp(a^j₀)

. (3.5)

A similar estimation may be found in I. Csisz´ar [3], but our parameters and our proof are diﬀerent.

(8)

Inequality (2.4) follows from (3.5) with an improvement of the proof of Theorem 1.5 in Maume-Deschamps [10].

Let us sketch it.

First of all, remark that

Nn(a^j₀) + 1 N_n(a^j−1₀ ) +|A| ≤1.

Now,

P

|pˆn(aj|a^j−1₀ )− np(a^j₀) + 1 np(a^j−1₀ ) +|A||> t

≤P

|N_n(a^j₀)−np(a^j₀)|> t

2(np(a^j−1₀ ) +|A|)

+P

|N_n(a^j−1₀ )−np(a^j−1₀ )|> t

2(np(a^j−1₀ ) +|A|)

≤2e¹^eexp

− βp_min

8enp(a^j−1₀ )t²(np(a^j−1₀ ) +|A|)² using (3.5). To conclude the proof, it is enough to observe that

p(aj|a^j−1₀ )− np(a^j₀) + 1 np(a^j−1₀ ) +|A|

≤ p(a^j−1₀ )(|A|+ 1) p(a^j−1₀ )(np(a^j−1₀ ) +|A|)

≤ |A|+ 1 n p(a^j−1₀ .

We get (2.4).

4. Proof of the main theorem

Proof of the main theorem. First of all we need to bound above the probability of selecting a context which is too long. Given two stringsuandwdeﬁne On(u, w) as the event in which ∆n(u, w)> δ. The event in which the algorithm selects a context longer than the real one is

O_n=

uw∈ˆw∈ττn

O_n(u, w).

We also need to estimate the probability of selecting a context shorter than the real one. This event is represented by

Un(u, w) ={∆_n(u, w)≤δ} , Un =

w∈ˆuw∈ττn

Un(u, w).

Ifn > then

{τ_n =τ}=O_n∪U_n. The proof of the theorem follows from a succession of lemmas.

Lemma 4.1. For any w∈τ,uw∈τˆ_n and≥h(τ)and for any n≥ 2|A|

δpminρ^−h(τ),

(9)

we have

P(O_n(u, w))≤4e¹^e|A|exp

−βnp²_minρ⁻¹ 8e

δ 2 − |A|

npmin

₂ . Proof. Recall that

∆_n(u, w) = max

a∈A|pˆn(a|w)−pˆn(a|uw)|.

Remark thatw∈τ then for all ﬁnite sequenceuanda∈Awe havep(a|w) =p(a|uw). Therefore, P(∆_n(u, w) > δ)≤

a∈A

P

|ˆp_n(a|w)−p(a|w)|> δ 2

+P

|ˆpn(a|uw)−p(a|uw)|> δ 2

.

Using Theorem 2.5 we bound above the right hand side of this inequality by 4e^e¹|A|exp

−βnp(uw)p_min 8e

δ 2− |A|

npminρ^−h(τ) ₂

.

Recall that, by deﬁnition, a complete branch of ˆτnhas its length bounded above by. Thusp(uw)≥pminρ^−h(τ).

This concludes the proof.

Lemma 4.2. For any uw∈τ, for any

n≥ 2|A|

p_min(D(τ)−δ), andw∈τˆ_n we have

P(U_n(u, w))≤4e¹^eexp

−βnp²_min 8e

D(τ)−δ

2 − |A|

npmin

₂ . Proof. We start by observing that for anya∈A,

|pˆ_n(a|w)−pˆ_n(a|uw)| ≥ |p(a|w)−p(a|uw)| − |ˆp_n(a|w)−p(a|w)|

+|pˆn(a|uw)−p(a|uw)|].

I follows that for anya∈Awe have

∆_n(u, w) ≥ D(τ)− |pˆn(a|w)−p(a|w)| − |pˆn(a|uw)−p(a|uw)|.

Therefore,

P(∆_n(u, w)≤δ) ≤ P

∀a∈A, |ˆp_n(a|w)−p(a|w)| ≥ D(τ)−δ 2

+P

∀a∈A, |pˆ_n(a|uw)−p(a|uw)| ≥ D(τ)−δ 2

·

Observing thatp(uw)≥p_min, the result now follows from Theorem 2.5.

We can now conclude the proof of the main theorem. We have P(τn =τ) =P(On) +P(Un).

(10)

It follows from the deﬁnition that

P(On)≤

uw∈ˆw∈ττn

P(On(u, w)).

Using Lemma 4.1 we obtain the inequality P(O_n)≤4e¹^e|A|exp

−βnp²_minρ^−h(τ) 8e

δ

2 − |A|

npminρ^−h(τ) ₂

.

Using Lemma 4.2 we obtain the bounds

P(Un)≤4e¹^e|A|^h(τ)exp

−βnp²_min 8e

D(τ)−δ

2 − |A|

np_min ₂

· (4.1)

To conclude the proof, it suﬃces to sum these two terms.

Proof of Corollary 2.2. Using Theorem 2.5, we have that P(sup

w∈τmax

a∈A|p_n(a|w)−p(a|w)|> κ_n)

≤2e¹^e|A|exp

−

κ_n− |A|

npmin

₂

nβp²_min 8e .

Now, we use Theorem 2.1 with=h(τ) andδ= (1−ε)D(τ). The Borel-Cantelli lemma implies the following almost sure result

P

τn =τ, sup

w∈τmax

a∈A|pn(a|w)−p(a|w)|< κn, ∀n≥n

≥ 1−Ce^−L˜ⁿ+Ke¹^e|A|

n≥˜n

exp

−κ²_nnβp²_min 8e

,

whereK andC are suitable positive constants.

Proof of Corollary 2.3. We take=h(τ) and apply Theorem 2.1 together with equation (2.4).

Proof of Corollary 2.4. This is an immediate application of equation (4.1).

As a ﬁnal remark, we observe that Theorem 2.1 could be extended by takingincreasing withnin a suitable way. An example of this appears in Corollary 2.4 where we could take =O(lnn). In the same way, δ may be decreasing with n in a suitable way. This appears in Corollary 2.3 where we could takeδ=O(n^−α) with 0< α <1/2.

Acknowledgements. Many thanks to Pierre Collet whose careful reading of a preliminary version of this paper helped us making it hopefully more clear. We are also grateful to the referee for many interesting questions that remain open and are a source for a further work.

(11)

References

[1] G. Bejerano and G. Yona, Variations on probabilistic suﬃx trees: statistical modeling and prediction of protein families.

Bioinformatics17(2001) 23–43.

[2] P. B¨uhlmann and A. Wyner, Variable length Markov chains.Ann. Statist.27(1999) 480–513.

[3] I. Csisz´ar, Large-scale typicality of Markov sample paths and consistency of MDL order estimators. Special issue on Shannon theory: perspective, trends, and applications.IEEE Trans. Inform. Theory48(2002) 1616–1628.

[4] I. Csisz´ar and Z. Talata,Context tree estimation for not necessarily finite memory processes via BIC and MDL, manuscript (2005).

[5] J. Dedecker and P. Doukhan, A new covariance inequality and applications.Stochastic Process. Appl.106(2003) 63–80.

[6] J. Dedecker and C. Prieur, New dependence coeﬃcients. Examples and applications to statistics.Prob. Theory Relat. Fields 132(2005) 203–236.

[7] P. Ferrari and A. Galves,Coupling and regeneration for stochastic processes. Notes for a minicourse presented in XIII Escuela Venezolana de Matematicas. Can be downloaded fromwww.ime.usp.br/˜pablo/book/abstract.html (2000).

[8] F. Ferrari and A. Wyner, Estimation of general stationary processes by variable length Markov chains.Scand. J. Statist.30 (2003) 459–480.

[9] F. Leonardi and A. Galves, Sequence Motif identiﬁcation and protein classiﬁcation using probabilistic trees. Lect. Notes Comput. Sci.3594(2005) 190–193.

[10] V. Maume-Deschamps, Exponential inequalities and estimation of conditional probabilities in Dependence in probability and statistics,Lect. Notes in Stat., Vol.187, P. Bertail, P. Doukhan and P. Soulier Eds. Springer (2006).

[11] J. Rissanen, A universal data compression system.IEEE Trans. Inform. Theory29(1983) 656–664.

[12] T.J. Tjalkens and F.M.J.F. Willems, Implementing the context-tree weighting method: arithmetic coding. Recent advances in interdisciplinary mathematics (Portland, ME, 1997).J. Combin. Inform. System Sci.25(2000) 49-58.

[13] F.M. Willems, Y.M. Shtarkov and T.J Tjalkens, The context-tree weighting method: basic properties.IEEE Trans. Inform.

Theory41(1995) 653–664.