COMPOUND POISSON APPROXIMATION OF WORD
COUNTS IN DNA SEQUENCES
SOPHIE SCHBATH
Abstract. Identifying words with unexpected frequencies is an impor- tant problem in the analysis of long DNA sequences. To solve it, we need an approximation of the distribution of the number of occurrences N(W) of a wordW. Modeling DNA sequences with m-order Markov chains, we use the Chen-Stein method to obtain Poisson approxima- tions for two dierent counts. We approximate the \declumped" count of W by a Poisson variable and the number of occurrences N(W) by a compound Poisson variable. Combinatorial results are used to solve the general case of overlapping words and to calculate the parameters of these distributions.
1. Introduction
Because of many important sequencing projects, biologists now have large sets of DNA sequences from many dierent organisms. They need quanti- tative tools and automatic methods to help them in analyzing sequences.
Statistics, computer science and graphical representations have already pro- vided a lot of useful ways to analyze sequences.
A simple representation of a sequence is a nite series
X
1X
2X
n of letters taken from the alphabet A=fACGTg, with the four letters corre- sponding to the four bases adenine, cytosine, guanine and thymine. Quite a number of sub-sequences, or words, have a known biological function. Each single occurrence may interact with proteins during biological processes as replication, translation, repairing. Moreover, the statistical repetition of a given word or of a group of words, may be related to a biological code (Trifonov, 1989).Therefore, the question of identifying words
W
with an unexpected fre- quency with respect to a given model is of interest. Here we study the asymptotic distribution ofN
(W
), the number of occurrences of a given wordW
in a sequence. These occurrences can overlap ifW
has a periodic composition. In this paper, we model the sequenceX
1X
2X
n with an homogeneous Markov chain of orderm
on the state space A. This sim- ple model is useful to identify exceptional long words, given (m
+ 1)-words frequencies.Prum et al. (1995) and Schbath et al. (1995) study the normal approx- imation of
N
(W
) corresponding to the asymptotic frame where the expec- tation ofN
(W
) converges to innity withn
. If the expectation ofN
(W
) is bounded whenn
increases, we then say thatW
is rare, and Poisson approxi- mations are more reasonable. We show here that the number of overlappingReceived by the journal February 1, 1995. Revised September 4, 1995. Accepted for publication October 26, 1995.
occurrences of a rare word can be approximated by a compound Poisson variable, which reduces to a Poisson variable if the word
W
cannot overlap itself. As we do not want the model to depend onn
, we consider a series of wordsW
n, with lengthh
n converging slowly to innity at a rate greater or equal to logn
, so that EN
(W
n) is bounded.In the literature, the case of i.i.d. variables
X
i has been widely consid- ered. The convergence ofN
(W
) for rare words to a Poisson variable is then proved either with generating functions or by using the Chen-Stein method (Chryssaphinou and Papastavridis (1988a, 1988b), Godbole (1991), Hirano and Aki (1993), Godbole and Schaner (1993), Fu (1993)). When the se- quence (X
i)i=1 n is a rst order Markov chain onf01gandW
is a run of ones, some of these authors show the convergence ofN
(W
) to a Poisson or compound Poisson variable whenh
n!+1. Others show the convergence when the transition probabilities (11) and(01) converge to zero. Geske et al. (1995) considered the case of a rst order Markov chain with states in a general alphabet. They proved the compound Poisson convergence for rare words with a single principal period. More renements are required in the general case when a word overlaps in more ways than those associated with one principal period.In this paper, we consider this general case, using combinatory results to get the whole set of overlapped compound words based on
W
. We suppose that the Markov chain is of order one and stationary. We do not loose generality, because we may write anm
-order chain on A as a rst-order chain on Am. IfW
can appear in clumps, the probability of a second occurrence after a rst occurrence is dierent from that of an isolated one, so that the number of occurrences is well approximated by a compound Poisson variable, while the number of clumps is approximated by a Poisson variable. This result is easily shown using the Chen-Stein method (Chen (1975), Arratia et al. (1989), Barbour et al. (1992b)). The method used here is very similar to the method in Karlin and Ost (1987), Arratia et al.(1990), Godbole and Schaner (1993).
The Chen-Stein method gives a bound of the total variation distance between the distribution of a sum of non i.i.d. Bernoulli variables
Y
i,i
2I
, and the distribution of a Poisson variable with parameter=Pi2IEY
i. In our case,Y
i is one ifW
=w
1w
2w
h occurs at positioni
, zero otherwise:Y
i = 1IfX
i=w
1X
i+1 =w
2::: X
i+h;1=w
hg andN
(W
) =Pni=1;h+1Y
i.Chen-Stein theorem says that
d
TV(L(N
(W
))Po())2(b
1+b
2+b
3) whereb
1 = Xi2I
X
j2Bi
E
Y
iEY
jb
2 = Xi2I
X
j2Binfig
E(
Y
iY
j)b
3 = Xi2I
EjE(
Y
i ;EY
ij(Y
jj
2B
ci))jESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
I
=f1::: n
;h
+ 1gandB
iI
is a neighborhood ofi
.We choose
B
i as a set of indices`
so that, forj
not inB
i, there are no commonX
k toY
i andY
j. Moreover, we want theX
k deningY
i and those deningY
j to be separated by at leastr
positions, with somer >
0. Takingr=
logn
greater than a certain constant is sucient to haveb
3 converging to zero, as shown in the Appendix. Therefore, we chooseB
i as the indices`
such thatB
i=fi
;r
;h
+ 2::: i
+r
+h
;2g\I:
Therefore,
b
1(n
;h
+ 1)(2r
+ 2h
;3)2(W
)where
(W
) = EY
i. Ifr
andh
are both o(n
) andh=
logn
is greater than some xed numberC
, thenn
(W
) =O(1) andb
1 is o(1). Let us consider this asymptotic framework.The second term
b
2 is also writtenb
22Xi2I
X
`2f1:::r+h;2g
E(
Y
iY
i+`)using the symmetry of
B
i. Considering separately the indices`
where` < h
corresponds to overlapping occurrences ofW
, and` h
corresponds to non overlapping occurrences, we get two termsb
02 = 2Xi2I
X
1`<h
E(
Y
iY
i+`)b
002 = 2Xi2I
X
h`r+h;2
E(
Y
iY
i+`):
For` h
,E(
Y
iY
i+`) = ((w
1));1`;h+1(w
1w
h)2(W
)where is the transition matrix of the model and
is the invariant proba- bility. Thereforeb
002 is of order O(nr
2(W
)) and converges to zero.For
` < h
, there are non-zero terms if the wordW
overlaps, that is when there exists an integerp
, 1p
h
;1, such thatw
i =w
i+p fori
= 1::: h
;p
. Such an integer is called a period ofW
. We denote P(W
) the set of periods ofW
. If`
is not in P(W
), E(Y
iY
i+`) = 0. If`
is inP(
W
), E(Y
iY
i+`) = (W
(p)W
) whereW
(p)W
is the concatenated wordw
1w
2w
pw
1w
h. In general,b
02 is not O(nh
2(W
)). For instance, ifp
is bounded, it is O(nh
(W
)), which does not converge to zero.Therefore, it is necessary to declump the count of occurrences and consider
N
e(W
) =Pi2IY
ei, whereY
ei only counts occurrences which do not overlap a preceding occurrence :Y
ei=Y
i(1;Y
i;1) (1;Y
i;h+1):
In the terme
b
2, associated withN
e(W
), we have nowE(Y
eiY
ei+`) = 0 for` < h
, solving the previous problem. For other cases, we useY
eiY
i. Therefore, if we take the neighborhoodB
ei =fi
;r
;2h
+ 3::: i
+r
+ 2h
;3g\I
, the second termeb
2is of order O(n
(r
+h
)2(W
)), whileeb
1 is O(n
(r
+2h
)e2(W
)).Moreover, we prove in the Appendix that the third terme
b
3is of order O(n
r) for some 0< <
1, where only depends on the Markov chain. SinceESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
r >
logn=
log , eb
3 converges to zero. We therefore have the following theorem.Theorem 1.
d
TVL(N
e(W
))Po((n
;h
+ 1)e(W
))(
n
;h
+ 1);A
1(2h
+r
)e2(W
) +A
2(h
+r
)2(W
)+A
3n
r whereA
1A
2A
3 are of order O(1).Remark 2. The denition of
N
e(W
) = Pi2IY
eisupposes thatX
0X
;1:::
X
;h+2 are known. In practical situation, we use the number of clumpsN
e(W
) = Pi2IY
ei, whereY
e1 =Y
1,Y
ei =Y
iQi;1j=1(1;
Y
j) if 1< i < h
, andY
ei =Y
ei otherwise. As the total variation betweenN
e(W
) andN
e(W
) is bounded by 2PfN
e(W
)6=N
e(W
)g, andPf
N
e(W
)6=N
e(W
)gh
(W
) both counts have the same asymptotic distribution.The next point is to calculate
e(W
) = EY
ei. This is easy using the setP(
W
) of the periods ofW
, as shown in the next section. Finally, an esti- mation of e(W
) is necessary for a statistical use of the theorem. We use the fact that the total variation distance between two Poisson variables of parameters and 0 is less than j;0j. In Markov chain model, using the LIL, the estimator of (W
n) dened with the MLE of the parameters (plug-in estimator) is such that (n
;h
+1)(b(W
n);(W
n)) =o(1) a.e. This solves the problem of the approximation of the declumped count.To approximate the count
N
(W
), we writeN
(W
) =Xk1
k N
e(k)(W
)where
N
e(k)(W
) is the number ofk
-clumps. We say that ak
-clump oc- curs at positioni
if there exists a concatenated wordC
composed of ex- actlyk
overlapping occurrences ofW
, and if there is an occurrence ofC
at positioni
which does not overlap any other occurrence ofW
in the se- quence. We denote byY
ei(k) the variable that is one if ak
-clump occurs ini
, and is zero otherwise. For example, for the wordW
=AACAA, the sequence TGAACAAACAACAATAGAACAAAA has a 3-clump ati
= 3 and an iso- lated occurrence, or 1-clump, ati
= 18. We use the process version of the Chen-Stein theorem as stated by Arratia et al. (1990) for the process Y= (Y
ei(k))ik. The process (N
e(k)(W
))k is therefore approximated by a Poisson process (Z
(k))k with parameter =(n
;h
+ 1)ek = (n
;h
+ 1)EY
ei(k)k, andZ
= Pk1k Z
(k) is a compound Poisson variable with parameter . Again, we need combinatory results to calculate ek (Sections 2 and 3). The use of the process version of Chen-Stein theorem needs a denition of the neighborhoodB
ik of the double index (ik
), and its application to our prob- lem requires careful calculation of the bounds in order to deal with the case when we have words with more than one principal period. This is done in Section 4.ESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
2. Combinatory results
In this section, we give a simple expression of the two variables,
Y
ei =Y
i(1;Y
i;1) (1;Y
i;h+1) andY
ei(k), dened in the previous section, in order to calculate their expectations, denoted e and ek. If we develop the expressionY
i(1;Y
i;1) (1;Y
i;h+1), we obtain 2h;1 terms that are products likeY
iY
i;j1Y
i;j` a lot of them are equal to zero because they do not correspond to possible overlaps ofW
. If we know the periods of the word, we simply get the terms of this sum, which have a positive expectation.We show that they are necessarily written as
Y
iY
i;p wherep
is a period ofW
. We then use the same ideas to describe the set of concatenated wordsC
composed of exactlyk
overlapping occurrences ofW
. Conversely, every possible overlap corresponds to a period of the word.We study now the set of periods of a given word.
Definition 3. Let
W
=w
1w
h be a given word of lengthh
andp
2f1
::: h
;1g.p
is a period ofW
ifw
i =w
i+pi
2f1::: h
;p
g.For each period
p
, the wordw
1w
p, denotedW
(p), is called a root ofW
. Example 4. The periods of the wordAAACAAAACAAAare 5, 9, 10 and 11.Remark 5. The set P(
W
) of the periods ofW
is empty if and only ifW
cannot overlap itself.If
p
is a period ofW
, we decompose the word asW
= (yz
)ry
whereyz
is the rootW
(p),z
is a non-empty word andr
1 (Figure 1). WeW y z
| {z }
p
y z y z y
6 6
Q
Q s
Q
Q
Q s
Q
Q
Q
y z y z y z y
sW
Figure 1. Periodic decomposition of a word
W
with periodp
. also consider the canonical decomposition ofW
which is associated with the minimal periodp
0. The rootW
(p0) is then called the minimal root.The question of nding the whole set of periods of a word
W
is usually solved in a recursive way (Guibas and Odlyzko, 1981). Suppose the minimal periodp
0 is known andyz
=W
(p0) then, the following theorem species how the other possible periods are obtained from the periods of the shorter wordyzy
.Theorem 6. If
p
0 is the minimal period of a wordW
, and(yz
)ry
its canon- ical decomposition, the set P(W
) of the periods ofW
isP(
W
) =fjp
0 1j
r
gf(r
;1)p
0+q q
2P(yzy
)g:
The only dicult point is to prove (Section 3) that any period is either a multiple of the minimal period or corresponds to a second word
W
starting only in the last partyzy
of the word.Remark 7. Moreover, it will be shown in Section 3 that, if
yz
is the minimal root ofW
, then the periods ofyzy
are greater than the length ofy
, denotedj
y
j.ESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
Example(cont.)
p
0= 5,r
= 2, periods 9 and 11 correspond to the periods 4 and 6 of AAACAAA=yzy
.When two occurrences of
W
overlap in the sequence, Figure 1 shows that the second occurrence is preceded by a root ofW
. Thus, an occurrence ofW
at positioni
does not overlap a preceding one if and only if there is no root ofW
just before this occurrence. Now, if an occurrence is not preceded by the minimal rootW
(p0), it cannot be preceded by one of the rootsW
(jp0). Therefore, we dene the set P0(W
) of the principal periods ofW
as the periods ofW
that are not a multiple ofp
0. Consequently, an occurrence ofW
at positioni
does not overlap a preceding one if and only if there is no occurrence of any principal rootW
(p) at positioni
;p
, forp
in P0(W
). Moreover, we show in Section 3 that it is not possible to observe two dierent principal roots,W
(p) at positioni
;p
andW
(q) at positioni
;q
. Using these two results, we get an expression ofY
ei asY
ei =Y
i(W
); Xp2P0(W)
Y
i;p(W
(p)W
) (1) whereY
i(W
0) is the indicator of the occurrence ofW
0 at positioni
in the innite sequence (X
i)ii=+1=;1. Clearly,Y
i(W
) =Y
i.Taking the expectation in (1), we get
e=(W
); Xp2P0(W)
(W
(p)W
):
(2) This gives an easy way to calculate and then estimate the expected de- clumped count.To extend this result to an occurrence of a
k
-clump at positioni
in the innite sequence (X
i)ii=+1=;1, we write thatY
ei(k) = 1 if and only if there is an occurrence at positioni
of a wordC
, composed of exactlyk
overlapping occurrences ofW
, which does not overlap any other occurrence ofW
in the sequence. We denote Ck the whole set of concatenated words composed of exactlyk
overlapping occurrences ofW
.We noticed before that an occurrence at position
i
ofC
does not overlap a preceding occurrence ofW
if and only if there is no occurrence of any principal rootW
(p)at positioni
;p
. We consider now suxes ofW
, denotedW
(p) =w
h;p+1w
h. With the same methods, an occurrence ofC
does not overlap a following occurrence ofW
if noW
(p) occurs just afterC
, withp
2 P0(W
). Since we have Ck = fW
(p1)W
(p2)W
(pk ;1)W p
j 2P
0(
W
)g, we show in Section 3 that simultaneous occurrences of two dierent compound words ofCk,C
andC
0, at positioni
in the sequence is not possible.Therefore, we obtain the following expression of
Y
ei(k)(W
):Y
ei(k) = XC2Ck
0
@
Y
i(C
); Xp2P0(W)
Y
i;pW
(p)C
; Xq2P0(W)
Y
i;C W
(q)+ X
pq2P0(W)
Y
i;pW
(p)C W
(q)1
A
:
(3)ESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
Using (3), we obtain expression of
ek =EY
eik as ek = XC2Ck
(C
) ;2 XC02Ck +1
(C
0) + XC0 02Ck +2
(C
00):
(4) A straightforward manipulation of formula (4) leads to the following expres- sion which shows the geometric structure ofek (Schbath, 1995): ek = (1;A
)2A
k;1(W
) withA
=Pp2P0(W)Qpj=1(w
jw
j+1).Remark 8. Since
Y
ei =Pk1Y
ei(k), formula (4) has to and does verify e=Xk1
ek:
(5)We can also easily prove that
X
k1
k
ek =(W
) (6)this result has to be interpreted carefully taking care that
N
(W
) dened byN
(W
) =Pk1k
Pi2IY
ei(k) is not equal toN
(W
).3. Combinatory proofs
All the proofs below are based on the property that the minimal root
yz
cannot be simultaneously written asxx
0andx
0x
. This property follows from the Theorem of Lothaire (1983).Theorem 9 (Lothaire (1983)). Two nonempty words
x
andx
0commute if and only if they are powers of the same word.The next proposition is a simple application of this theorem.
Proposition 10. If
yz
is the minimal root ofW
, then the periods ofyzy
are greater than jy
j.Proof. If
yzy
has a period less or equal to jy
j, the minimal rootyz
has a non trivial decomposition asyz
=xx
0=x
0x
(in Figure 2, the wordx
is in black). Using Theorem 9, we get the result thatyz
is a power of another word, and thereforeyz
is not the minimal root ofW
.y z y
z }|yzz
{z }|y {=) #
y z y y
| {z }
yz
y
Figure 2.
yzy
has a period less or equal tojy
j.Proof of Theorem 6. We show that if (
yz
)ry
is the canonical de- composition ofW
, then there can be two occurrences ofW
at positioni
andi
+`
only if`
is a multiple of the minimal periodp
0 or is of the form`
= (r
;1)p
0+q
whereq
is a period ofyzy
. Suppose that two wordsW
occurESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
with a shift of
`
positions. Obviously, if` >
(r
;1)p
0 thenq
=`
;(r
;1)p
0 is a period ofyzy
, and`
is inP(W
). If` <
(r
;1)p
0, we show that`
is a mul- tiple ofp
0. If it is not, we consider separately the two cases corresponding to a second occurrence ofW
starting in a wordy
(Figure 3.a), or in a wordz
(Figure 3.b). In the two cases, the minimal rootyz
can be decomposed asyz
=xx
0 =x
0x
where the wordsx
, in black in the gure, andx
0 are not empty. By Theorem 9, this is a contradiction. Such an overlap is notpossible. 2
(
a
)W y z
z }|yzz
{z }|y {W y
#| {z }
yz
y z y
(
b
)W
z }|yz {z }|yz {y W
#| {z }
yz
y z y
Figure 3. Simultaneous occurrences of
W
at positioni
andi
+`
in two cases: (a
)`
is writtenr
0p
0+`
0 with 0< `
0jy
j, (b
)`
is writtenr
00p
0+`
00withjy
j< `
00<
jyz
j.Proposition 11. If
W
occurs at positioni
in the sequence, there cannot be occurrences of two dierent principal rootsW
(p) at positioni
;p
andW
(s) at positioni
;s
.Proof. By Theorem 6, we classify the principal periods into three classes:
the minimal period
p
0 (class I), the periods of the form (r
;1)p
0+q
such that jy
j< q <
jyz
j (class II), and the periods of the form (r
;1)p
0 +q
such that jyz
j< q <
jyzy
j(class III). To prove the proposition, we have to consider the ve cases corresponding to (ps
) in classes (I-II), (I-III), (II-II), (II-III) and (III-III). We only study the two cases (I-III) and (III-III) for the other cases, the same method is used.Case (I-III): We suppose that
W
(p0)=yz
occurs at positioni
;p
0 andW
(s) = (yz
)ry
0 occurs at positioni
;s
wherey
0 is a root ofy
. This overlap, represented in Figure 4, implies thatyz
=xy
0 =y
0x
for a nonempty wordx
, becausey
0 is a root ofy
. Theorem 9 completes the proof this overlap is not possible.Case (III-III): We suppose that
W
(p)= (yz
)ry
0occurs at positioni
;p
andW
(s) = (yz
)ry
00 occurs at positioni
;s
wherey
0 andy
00 are two dierent roots ofy
. This overlap, represented in Figure 5, leads toyz
=xx
0=x
0x
. Theorem 9 completes the proof.ESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
W
(p0)y z
zx
}|yz {=) "
W
(s)y z
|{z}y0
x
| {z }
yz |{z}y0 Figure 4. Simultaneous occurrences of
W
(p0) at positioni
;p
0 andW
(s) at positioni
;s
whens
is in class (III).W
(p)y z y
0 z }|yzx
{ y0
z }| {
=) #
W
(s)y z y
00x
| {z }
yz
y
00Figure 5. Simultaneous occurrences of
W
(p)at positioni
;p
andW
(s) at positioni
;s
whenp
ands
are in class (III).Proposition 12. Simultaneous occurrences at position
i
of two dierent composed words in Ck are not possible.Proof. Let
C
andC
0 be two dierent composed words in Ck. We can de- compose these words into principal roots ofW
in the following way:C
=W
(p1)W
(p2)W
(pk ;1)W
andC
0 =W
(q1)W
(q2)W
(qk ;1)W
withp
jq
j 2 P0(W
),j
= 1k
;1. SinceC
andC
0 are dierent, there exists at least one pair (p
jq
j) such thatp
j 6=q
j we denotej
the rst index such thatp
j is not equal toq
j. IfC
andC
0 both occur at positioni
in the se- quence, the two dierent principal rootsW
(pj) andW
(qj) occur also at the same position. Moreover the two composed wordsW
(pj)W
(pk ;1)W
andW
(qj)W
(qk ;1)W
appear at the same position. Some tedious manipula- tion, using the three classes of principal roots, as in the proof of Proposition 11, yields to a decomposition of the minimal rootyz
in all the cases, which is impossible by Theorem 9.4. Compound Poisson approximation
We noticed in Section 1 that to approximate the count
N
(W
), we need to study the occurrences of allk
-clumps of this word in the sequence. Therefore, let us suppose that the innite sequence fX
igii=+1=;1 is observed we dene the following countN
(W
) =Xk1
k
Xi2I
Y
ei(k)where
I
=f1::: n
;h
+ 1g andY
ei(k) is the indicator of the occurrence of ak
-clump at positioni
in the innite sequence, calculated in (3).Comparison of
N
(W
) andN
(W
). The countsN
(W
) andN
(W
) are not equal but their dierence is negligible in probability. Actually, they canESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
only dier in two cases: a clump of
W
starts before positionn
;h
+ 2 and stops beyond positionn
in the innite sequence fX
igii=+1=;1 such a clump, if it exists, is unique. Consequently, all the occurrences ofW
contained in this clump are taken into account inN
(W
), but only those we can observe in the nite sequenceX
1X
n are counted inN
(W
). In this case, there is necessarily an occurrence ofW
at one of the positionsn
;2h
+3::: n
;h
+1.The second case is when a clump starts before position 1 and stops beyond position
h
;1 in the innite sequence. Here, no occurrence ofW
contained in this clump is counted inN
(W
), but those observed inX
1X
2X
n are taken into account inN
(W
). Necessarily,W
occurs at one of the (h
;1) rst positions of the nite sequence. Therefore, we havePf
N
(W
)6=N
(W
)g2h
(W
):
Moreover, these bias complement each other as shown by (6) which is
E
N
(W
) =EN
(W
). Our problem is now just to approximate the variableN
(W
). 2We now consider the Poisson processZ= (
Z
i(k))i2Ik1such thatZ
i(k) are independent Poisson variables with mean ek =E(Y
ei(k)). A process version of the Chen-Stein theorem, as formulated in Arratia et al. (1990), allows us to approximate the process Ye = (Y
ei(k))i2Ik1 by the Poisson process Z. The total variation is bounded byd
TV(L(Ye)L(Z))2(2b
1+ 2b
2+b
3)where the expressions of
b
1b
2 andb
3 are similar to those ofb
1b
2 andb
3 given in Section 1, for the variablesY
ei(k) with the double index (ik
).Therefore, with the same error bound, we approximate the count
N
(W
) by the variable Pk1k
Pi2IZ
i(k), which is distributed according to the compound Poisson distribution CP(), where the discrete measure =P
k1k
k is dened by k = (n
;h
+ 1)ek, calculated in (4). Using the triangular inequality, we then haved
TV(L(N
(W
))CP())2(2b
1+ 2b
2+b
3) + 4h
(W
):
(7) Our task is now to show thatb
1b
2 andb
3 tend to zero. First, we need to choose a suitable neighborhood of the double index (ik
) 2I
, whereI
=I
Nnf0g.Choice of the neighborhood
B
ik. We deneZ
(ik
) =fj
:i
;h
j
i
+ (k
+ 1)h
g (Figure 6). Therefore,Z
(ik
) contains the positionsj
of the lettersX
j that dene the variableY
ei(k). Actually, the length of a compound wordC
inCk is less thankh
, and we need to know (h
;1) letters at the ends ofC
to be sure that an occurrence ofC
at positioni
is in fact an occurrence of ak
-clump of compositionC
.Moreover, in the term
b
3, we need to have at leastr
positions (r >
0) between the setsZ
(ik
) andZ
(j`
), so we dene (ik
) and (j`
) as not neighbors ifZ
(ik
) andZ
(j`
) are separated with at leastr
positions, that is if one ofESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.
Z(ik)
z }| {
i;r;h i;h i i+(k+1)h i+(k+1)h+r
Figure 6. Construction of the neighborhood of the index (
ik
).the two following inequalities holds (Figure 6):
i
+ (k
+ 1)h
+r < j
;h j
+ (`
+ 1)h
+r < i
;h:
Therefore, the neighborhood
B
ik of (ik
) is the set of indices (j`
) that are neighbors on (ik
):B
ik =f(j`
) : ;(`
+ 2)h
;r
j
;i
(k
+ 2)h
+r
g \I
:
A bound for the term
b
1. By denition, the rst termb
1 is writtenb
1 = X(ik)2I
X
(j`)2Bik
E(
Y
ei(k))E(Y
ej(`)):
E(
Y
ei(k)) =ek and E(Y
ej(`)) =e` does not depend onj
we then separate the indicesj
2B
ik such thatj < i
andj i
, and we haveb
1Xi2I
X
k1
X
`11 + (
k
+ 2)h
+r
]eke`+Xi2I
X
k1
X
`1(
`
+ 2)h
+r
]eke`:
Using the symmetry betweenk
and`
, the above expression reduces tob
1 2Xi2I
0
@ X
`1
e`1
A 0
@ X
k1(
k
+ 2)h
+r
+ 1]ek1
A
:
Therefore, it follows from (5) and (6) thatb
1 2(n
;h
+ 1)(h
+ (2h
+r
+ 1)e)e (8) where denotes(W
).A bound for the term
b
2. The second termb
2 is sum of EY
ei(k)Y
ej(`) with (j`
) inB
ik but not equal to (ik
). SinceY
ei(k)Y
ei(`) = 0 ifk
6=`
, and using the symmetry ofB
ik,b
2 is also writtenb
22 X(ik)2I
X
`1
i+(kX+2)h+r j=i+1
E
Y
ei(k)Y
ej(`):
A straightforward majoration is not enough. Bounding P`1
Y
ej(`) =Y
ej byY
j gives eitherb
2O(n
(3h
+r
)) if we only use thatY
j 1, orb
2 O;n
2Pk(k
+ 2)h
+r
]A
k;1:
card(Ck) if we bound EY
ei(k)Y
jby
PC2Ck(C
) and use that (C
)A
k;1 for a certain jA
j<
1, as in Geske et al. (1995). In the second case, the upper bound can diverge because card(Ck) = card(P0(W
))k;1 we can only conclude if card(P0(W
)) = 1 which is exactly the framework of Geske et al. (1995).We have to take into account that other products
Y
ei(k)Y
ej(`)are equal to zero,ESAIM : P&S November-1995, Vol.1, Art.1, pp.1-16.