
2 Theoretical Results

Our starting point will be formulae for the average subword complexity of strings drawn from some stochastic processes. A few such formulae have been given in [14, 18, 19]. We will derive a weaker but more general bound. First of all, let us recall, cf. [25], that the function $f(k|X_1^n)$ for a fixed sample $X_1^n$ is unimodal, and the $k$ for which $f(k|X_1^n)$ attains its maximum is called the maximal repetition. For $k$ greater than the maximal repetition we have $f(k|X_1^n) = n - k + 1$. If we want to have a nondecreasing function of $k$, which is often more convenient, we may consider $f(k|X_1^{n+k-1})$.

Now let us observe that $f(k|X_1^{n+k-1}) \le \min\{(\operatorname{card}\mathbb{X})^k,\, n\}$ [19].
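For illustration, here is a minimal Python sketch of the subword complexity of a concrete string (the function name and the toy sample are ours, introduced only for this example):

```python
def subword_complexity(x, k):
    """f(k|x): the number of distinct substrings of length k occurring in x."""
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

x = "abracadabra"                      # a toy sample of length n = 11
print([subword_complexity(x, k) for k in range(1, len(x) + 1)])
# -> [5, 7, 7, 7, 7, 6, 5, 4, 3, 2, 1]; for k >= 5 this equals n - k + 1 = 12 - k
```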

For a stationary process this bound can be strengthened in this way.

Theorem 1 For a stationary process $(X_i)_{i\in\mathbb{Z}}$ we have
$$E\, f(k|X_1^{n+k-1}) \le \tilde S_n^k := \sum_{w \in \mathbb{X}^k} \min\{ n P(w),\, 1 \} . \qquad (4)$$

For independent random variables, bound (4) can be strengthened again. Let $o_k(f(k))$ denote a term that, divided by $f(k)$, vanishes as $k \to \infty$, i.e., $\lim_{k\to\infty} o_k(f(k))/f(k) = 0$ [15, Chap. 9].

Theorem 2 ([14, Theorem 2.1]) For a sequence of independent identically distributed (IID) random variables $(X_i)_{i\in\mathbb{Z}}$ we have
$$E\, f(k|X_1^{n+k-1}) = S_n^k + o_k(S_n^k) , \quad\text{where } S_n^k := \sum_{w \in \mathbb{X}^k} \left[ 1 - (1 - P(w))^n \right] . \qquad (5)$$

Formula (5) is remarkable. It states that the expectation of subword complexity for an IID process behaves asymptotically as if each substring $w$ were drawn $n$ times with replacement with probability $P(w)$. Correlations among overlaps of a given string asymptotically cancel out on average [14]. The function $S_n^k$ is also known in the analysis of large numbers of rare events (LNRE) [21], developed to investigate Zipf's law in quantitative linguistics [2, 30].
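To make the "drawn with replacement" picture concrete, here is a small sketch (the Bernoulli source, the parameter values, and the helper names are assumptions made only for this illustration) comparing the expected number of distinct blocks under $n$ independent draws from the block distribution with the subword complexity of one simulated sample:

```python
import itertools, random

def block_probs(p, k):
    """P(w) for all binary blocks w of length k under an IID Bernoulli(p) source."""
    return {"".join(w): p ** w.count("1") * (1 - p) ** (k - w.count("1"))
            for w in itertools.product("01", repeat=k)}

def expected_distinct(p, k, n):
    """Expected number of distinct blocks when n blocks are drawn independently
    with replacement with probabilities P(w): sum_w [1 - (1 - P(w))^n]."""
    return sum(1 - (1 - q) ** n for q in block_probs(p, k).values())

def subword_complexity(x, k):
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

p, k, n = 0.3, 8, 1000
sample = "".join("1" if random.random() < p else "0" for _ in range(n + k - 1))
print(subword_complexity(sample, k), expected_distinct(p, k, n))
```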

For the sake of further reasoning it is convenient to rewrite the quantities $\tilde S_n^k$ and $S_n^k$ as expectations of certain functions of block probability. For $x > 0$, let us define auxiliary functions (among them $g_n$ and $g$) through which $\tilde S_n^k$ and $S_n^k$ are expressed in formulae (9) and (10), where the last formula holds for sufficiently large $n$ (looking at the graphs of $g_n$ and $g$, for say $n \ge 20$).

Usually the probability of a block decreases exponentially with increasing block length. Thus it is convenient to rewrite formulae (9) and (10) further, using the minus log-probability $Y_k = -\log P(X_1^k)$. The expectation of this random variable equals, by definition, the block entropy, $E\, Y_k = H(k)$. In contrast, we obtain
$$\tilde S_n^k / n = E\, \tilde\sigma(Y_k - \log n) , \qquad (11)$$
$$S_n^k / n = E\, \sigma_n(Y_k - \log n) . \qquad (12)$$
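As a point of reference (a standard fact, added here only as an illustration), for an IID process the block entropy is proportional to the block length:
$$H(k) = E\, Y_k = -\sum_{w \in \mathbb{X}^k} P(w) \log P(w) = k\, H(1) ,$$
so, e.g., for a Bernoulli($p$) source we have $H(k) = k\,[-p \log p - (1-p)\log(1-p)]$.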

Apparently, formulae (4) and (5) combined with (11) and (12) could be used for estimating the block entropy of a process. In fact, we have the following proposition:

Theorem 3 For a stationary ergodic process $(X_i)_{i\in\mathbb{Z}}$, we have
$$\tilde S_n^k / n = \tilde\sigma\big(H(k) + o_k(k) - \log n\big) + o_k(1) , \qquad (13)$$
$$S_n^k / n = \sigma_n\big(H(k) + o_k(k) - \log n\big) + o_k(1) \qquad (14)$$
$$\;\ge \sigma\big(H(k) + o_k(k) - \log n\big) + o_k(1) . \qquad (15)$$

Proof By the Shannon-McMillan-Breiman theorem (asymptotic equipartition property), for a stationary ergodic process the difference $Y_k - H(k)$ is of order $o_k(k)$ with probability $1 - o_k(1)$ [1]. Since the functions $\tilde\sigma(x)$ and $\sigma_n(x)$ increase in the considered domain and take values in the range $[0, 1]$, the claims follow from formulae (11) and (12), respectively.

Simulations performed in the next section suggest that for a large class of processes, the observed subword complexity $f(k|X_1^n)$ is practically equal to its expectation. Hence if $n$ is large and the term $o_k(k)$ in inequality (15) is negligible, Theorems 2 and 3 suggest the following estimation procedure for the block entropy of IID processes.

First, we compute the subword complexity for a sufficiently large sample $X_1^n$ and, second, we apply some inverse function to obtain an estimate of entropy. Namely, by (5) and (15), we obtain the following estimate of block entropy $H(k)$:

$$H_{\mathrm{est}}^{(1)}(k) := \log(n - k + 1) + \sigma^{-1}\!\left( \frac{f(k|X_1^n)}{n - k + 1} \right) . \qquad (16)$$

Formula (16) is applicable only to sufficiently small $k$, which stems from using the sample $X_1^n$ rather than $X_1^{n+k-1}$. Consequently, this substitution introduces an error that grows with $k$ and explodes at the maximal repetition. As we have mentioned, for $k$ greater than the maximal repetition we have $f(k|X_1^n) = n - k + 1$, which implies $H_{\mathrm{est}}^{(1)}(k) = \infty$, since $\lim_{x\to 1} \sigma^{-1}(x) = \infty$. For $k$ smaller than the maximal repetition, we have $H_{\mathrm{est}}^{(1)}(k) < \infty$.
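A minimal code sketch of estimator (16), assuming the function $\sigma$ appearing in (15) is supplied by the caller (its exact form is fixed by the definitions preceding Theorem 3); the helper names are ours:

```python
import math

def subword_complexity(x, k):
    """f(k|x): number of distinct length-k substrings of x."""
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

def invert_increasing(sigma, y, lo=-50.0, hi=50.0, iters=200):
    """Numerically invert an increasing function sigma by bisection on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if sigma(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def h_est_1(x, k, sigma):
    """Estimator (16): log(n - k + 1) + sigma^{-1}( f(k|x) / (n - k + 1) )."""
    n = len(x)
    ratio = subword_complexity(x, k) / (n - k + 1)
    if ratio >= 1.0:   # k at or beyond the maximal repetition: the estimate diverges
        return math.inf
    return math.log(n - k + 1) + invert_increasing(sigma, ratio)
```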

The estimator $H_{\mathrm{est}}^{(1)}(k)$ resembles in spirit the inversion of the Padé approximant proposed by Ebeling and Pöschel [10]. The quality of this estimator will be tested empirically for Bernoulli processes in the next section. In fact, formula (16) works very well for uniform probabilities of characters. Then the terms $o_k(k)$ and $o_k(1)$ in inequality (15) vanish. Thus we can estimate the entropy of relatively large blocks, for which only a tiny fraction of typical blocks can be observed in the sample. Unfortunately, this estimator works so well only for uniform probabilities of characters. The term $o_k(k)$ in inequality (15) is not negligible for nonuniform probabilities of characters. The more nonuniform the probabilities are, the larger the term $o_k(k)$ is. The sign of this term also varies: it is systematically positive for small $k$ and systematically negative for large $k$. Hence reliable estimation of entropy via formula (16) is impossible in general. This suggests that the approach of [10] cannot be trusted, either.


The persistent error of formula (16) comes from the fact that the asymptotic equipartition is truly only an asymptotic property. Now we will show how to improve the estimates of block entropy for quite a general stationary process. We will show that the terms $o_k(k)$ and $o_k(1)$ in equality (13) may become negligible for $\tilde S_n^k / n$ close to $1/2$. Introduce the Heaviside function

$$\theta(y) = \begin{cases} 0 , & y < 0 , \\ 1/2 , & y = 0 , \\ 1 , & y > 0 . \end{cases} \qquad (17)$$

In particular, $E\,\theta(Y_k - B)$ is a decreasing function of $B$. Thus we can define $M(k)$, the median of minus log-probability $Y_k$ of block $X_1^k$, as
$$M(k) := \sup\{ B : E\,\theta(Y_k - B) \ge 1/2 \} . \qquad (18)$$
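As a small worked instance of definitions (17) and (18) (added here only for illustration): suppose $Y_k$ takes just two values $a < b$ with $P(Y_k = a) = 0.7$ and $P(Y_k = b) = 0.3$. Then $E\,\theta(Y_k - B)$ equals $1$ for $B < a$, equals $0.3 + \tfrac{1}{2}\cdot 0.7 = 0.65$ for $B = a$, and is at most $0.3$ for $B > a$, so the set $\{B : E\,\theta(Y_k - B) \ge 1/2\}$ is $(-\infty, a]$ and $M(k) = a$, the usual median of $Y_k$.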

Theorem 4 For any $C > 0$ we have
$$\frac{\tilde S_n^k}{n} \ge \frac{1 + \exp(-C)}{2} \;\Longrightarrow\; M(k) \ge \log n - C , \qquad (19)$$
$$\frac{\tilde S_n^k}{n} < \frac{1}{2} \;\Longrightarrow\; M(k) \le \log n . \qquad (20)$$

Proof First, we have $\tilde\sigma(y) \le (1 - \exp(-C))\,\theta(y + C) + \exp(-C)$. Thus
$$\tilde S_n^k / n = E\,\tilde\sigma(Y_k - \log n) \le (1 - \exp(-C))\, E\,\theta(Y_k - \log n + C) + \exp(-C) .$$
Hence if $\tilde S_n^k / n \ge (1 + \exp(-C))/2$ then $E\,\theta(Y_k - \log n + C) \ge 1/2$ and consequently $M(k) \ge \log n - C$. As for the converse, $\tilde\sigma(y) \ge \theta(y)$. Thus
$$\tilde S_n^k / n = E\,\tilde\sigma(Y_k - \log n) \ge E\,\theta(Y_k - \log n) .$$
Hence if $\tilde S_n^k / n < 1/2$ then $E\,\theta(Y_k - \log n) < 1/2$ and consequently $M(k) \le \log n$.
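For a concrete numerical instance of implication (19): taking $C = \log 4$ gives the threshold
$$\frac{1 + \exp(-\log 4)}{2} = \frac{1 + 1/4}{2} = 0.625 ,$$
so whenever $\tilde S_n^k / n \ge 0.625$, the median satisfies $M(k) \ge \log n - \log 4$.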

In the second step we can compare the median and the block entropy.

Theorem 5 For a stationary ergodic process $(X_i)_{i\in\mathbb{Z}}$, we have
$$M(k) = H(k) + o_k(k) . \qquad (21)$$

Proof The claim follows by the mentioned Shannon-McMillan-Breiman theorem. Namely, the difference $Y_k - H(k)$ is of order $o_k(k)$ with probability $1 - o_k(1)$. Hence the difference $M(k) - H(k)$ must be of order $o_k(k)$ as well.

A stronger inequality may hold for a large subclass of processes. Namely, we suppose that the distribution of strings of a fixed length is skewed towards less probable values. This means that the distribution of minus log-probability $Y_k$ is right-skewed. Consequently, such processes satisfy the following condition:

Definition 1 A stationary process $(X_i)_{i\in\mathbb{Z}}$ is called properly skewed if for all $k \ge 1$ we have
$$H(k) \ge M(k) . \qquad (22)$$
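The condition can be checked directly for simple sources. Below is a small sketch (the Bernoulli parameter and the helper names are assumptions made only for this illustration; natural logarithms are used) that enumerates the distribution of $Y_k$ for an IID Bernoulli($p$) source and tests inequality (22):

```python
import math

def Yk_distribution(p, k):
    """Atoms (y, P(Y_k = y)) of Y_k = -log P(X_1^k) for an IID Bernoulli(p) source."""
    dist = {}
    for ones in range(k + 1):
        block_prob = p ** ones * (1 - p) ** (k - ones)   # probability of one such block
        y = -math.log(block_prob)
        dist[y] = dist.get(y, 0.0) + math.comb(k, ones) * block_prob
    return sorted(dist.items())

def block_entropy(dist):
    """H(k) = E Y_k."""
    return sum(y * q for y, q in dist)

def median_Yk(dist):
    """M(k) per (18); for a discrete Y_k this is the largest atom y with P(Y_k >= y) >= 1/2."""
    return max(y for y, _ in dist if sum(q for yy, q in dist if yy >= y) >= 0.5)

for k in (1, 2, 5, 10):
    d = Yk_distribution(0.3, k)
    print(k, block_entropy(d) >= median_Yk(d))   # does condition (22) hold at this k?
```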

Assuming that the variance of subword complexity is small, formulae (22), (19), and (4) can be used to provide a simple lower bound of block entropy for properly skewed processes. The recipe is as follows. First, to increase precision, we extend functions $s(k)$ of natural numbers, such as the subword complexity $f(k|X_1^n)$ and the block entropy $H(k)$, to real arguments by linear interpolation. Namely, for $r = qk + (1-q)(k-1)$, where $q \in (0, 1)$, we will put $s(r) := q\, s(k) + (1-q)\, s(k-1)$. Then we apply the following procedure (a code sketch is given below).

1. Choose a $C > 0$ and let $S := (1 + \exp(-C))/2$.

2. Compute $s(k) := f(k|X_1^n)/(n - k + 1)$ for growing $k$ until the least $k$ is found such that $s(k) \ge S$. Denote this $k$ as $k_1$ and let $k_2 := k_1 - 1$.

3. Define
$$k^* := \frac{(S - s(k_2))\, k_1 + (s(k_1) - S)\, k_2}{s(k_1) - s(k_2)} . \qquad (23)$$
(We have $s(k^*) = S$.)

4. Estimate the block entropy $H(k^*)$ as
$$H_{\mathrm{est}}^{(2)}(k^*) := \log(n - k^* + 1) - C . \qquad (24)$$
5. If needed, estimate the entropy rate $h = \lim_{k\to\infty} H(k)/k$ as
$$h_{\mathrm{est}}^{(2)} = H_{\mathrm{est}}^{(2)}(k^*)/k^* . \qquad (25)$$

For a fixed sample size $n$, the above procedure yields an estimate of block entropy $H(k)$ but only for a single block length $k = k^*$. Thus to compute the estimates of block entropy $H(k)$ for varying $k$, we have to apply the above procedure for varying $n$. The so extended procedure is quite computationally intensive and resembles baking a cake from which we eat only the cherry put on top of it. By Theorems 1 and 4 we conjecture that the estimator $H_{\mathrm{est}}^{(2)}(k^*)$ with a large probability gives a lower bound of the block entropy $H(k^*)$ for properly skewed processes. The exact quality of this bound remains, however, unknown, because we do not know the typical difference between subword complexity and the function $\tilde S_n^k$. Judging from the experiments with Bernoulli processes described in the next section, this difference should be small, but it is only our conjecture. Moreover, it is an open question whether $k^*$ tends to infinity for growing $n$ and whether the estimator $h_{\mathrm{est}}^{(2)}$ is consistent, i.e., whether $h_{\mathrm{est}}^{(2)} - h$ tends to $0$ for $n \to \infty$. That the estimator $H_{\mathrm{est}}^{(2)}(k^*)$ yields only a lower bound of block entropy is not a great problem in principle since, if some upper bound is needed, it can be computed using a universal code, such as the Lempel-Ziv code [31, 32].
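A minimal code sketch of steps 1-5 (the helper names are ours, natural logarithms are used, and the degenerate case $k_1 = 1$ is handled ad hoc, outside the recipe):

```python
import math, random

def subword_complexity(x, k):
    """f(k|x): number of distinct length-k substrings of x."""
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

def h_est_2(x, C=1.0):
    """Lower-bound estimate of block entropy following steps 1-5 above.
    Returns (k_star, H_est2, h_est2)."""
    n = len(x)
    S = (1.0 + math.exp(-C)) / 2.0                        # step 1
    def s(k):
        return subword_complexity(x, k) / (n - k + 1)
    k1 = 1
    while s(k1) < S:                                      # step 2: least k with s(k) >= S
        k1 += 1
    k2 = k1 - 1
    if k2 == 0:                                           # edge case not covered by the recipe
        k_star = float(k1)
    else:                                                 # step 3: formula (23)
        k_star = ((S - s(k2)) * k1 + (s(k1) - S) * k2) / (s(k1) - s(k2))
    H_est2 = math.log(n - k_star + 1) - C                 # step 4: formula (24)
    return k_star, H_est2, H_est2 / k_star                # step 5: formula (25)

# Example call on a pseudo-random binary sample (illustration only).
x = "".join(random.choice("01") for _ in range(10000))
print(h_est_2(x, C=1.0))
```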