
Relative Entropy Densities

In the document Entropy and Information Theory (Page 66-72)

Many of the convergence results to come will be given and stated in terms of relative entropy densities. In this section we present a simple but important result describing the asymptotic behavior of relative entropy densities. Although the result of this section is only for finite alphabet processes, it is stated and proved in a manner that will extend naturally to more general processes later on. The result will play a fundamental role in the basic ergodic theorems to come.

Throughout this section we will assume that M and P are two process distributions describing a random process $\{X_n\}$. Denote as before the sample vector $X^n = (X_0, X_1, \cdots, X_{n-1})$, that is, the vector beginning at time 0 having length n. The distributions on $X^n$ induced by M and P will be denoted by $M_n$ and $P_n$, respectively. The corresponding pmf's are $p_{X^n}$ and $m_{X^n}$. The key assumption in this section is that for all n, if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$, that is,

$$ M_n \gg P_n \quad \text{for all } n. \tag{2.30} $$

If this is the case, we can define the relative entropy density

$$ h_n(x) = \ln \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \ln f_n(x), \tag{2.31} $$

where

$$ f_n(x) = \begin{cases} \dfrac{p_{X^n}(x^n)}{m_{X^n}(x^n)} & \text{if } m_{X^n}(x^n) \neq 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2.32} $$

Observe that the relative entropy is found by integrating the relative entropy density:

$$ H_{P\|M}(X^n) = D(P_n\|M_n) = \sum_{x^n} p_{X^n}(x^n) \ln \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \int \ln \frac{p_{X^n}(X^n)}{m_{X^n}(X^n)} \, dP. \tag{2.33} $$

Thus, for example, if we assume that

$$ H_{P\|M}(X^n) < \infty, \quad \text{all } n, \tag{2.34} $$

then (2.30) holds.
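To make (2.31)-(2.33) concrete, here is a small numerical sketch (not from the text) for a pair of hypothetical i.i.d. binary measures, where the vector pmf's $p_{X^n}$ and $m_{X^n}$ are products of single-letter pmf's; in that case the relative entropy of the vectors is just n times the single-letter relative entropy:

```python
import itertools
import math

# Hypothetical single-letter pmf's; under i.i.d. measures the vector pmf's
# p_{X^n} and m_{X^n} are products of single-letter probabilities.
p1 = {0: 0.7, 1: 0.3}
m1 = {0: 0.5, 1: 0.5}
n = 3

def vector_pmf(pmf, xn):
    """pmf of the sample vector x^n = (x_0, ..., x_{n-1}) under an i.i.d. measure."""
    prob = 1.0
    for x in xn:
        prob *= pmf[x]
    return prob

def h_n(xn):
    """Relative entropy density h_n(x) = ln p_{X^n}(x^n)/m_{X^n}(x^n)  (eq. 2.31)."""
    return math.log(vector_pmf(p1, xn) / vector_pmf(m1, xn))

# Relative entropy as the sum (integral) of p times the density  (eq. 2.33).
D = sum(vector_pmf(p1, xn) * h_n(xn)
        for xn in itertools.product([0, 1], repeat=n))

# For i.i.d. measures D(P_n||M_n) = n * D(p1||m1), which serves as a check.
D1 = sum(p1[a] * math.log(p1[a] / m1[a]) for a in (0, 1))
print(D, n * D1)  # the two agree
```

Here $m_{X^n}$ never vanishes, so (2.30) holds trivially and the `otherwise` branch of (2.32) is never needed.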

The following lemma will prove to be useful when comparing the asymptotic behavior of relative entropy densities for different probability measures. It is the first almost everywhere result for relative entropy densities that we consider. It is somewhat narrow in the sense that it only compares limiting densities to zero and not to expectations. We shall later see that essentially the same argument implies the same result for the general case (Theorem 5.4.1), only the interim steps involving pmf’s need be dropped. Note that the lemma requires neither stationarity nor asymptotic mean stationarity.

Lemma 2.7.1: Given a finite alphabet process $\{X_n\}$ with process measures P, M satisfying (2.30), then

$$ \limsup_{n\to\infty} \frac{1}{n} h_n \le 0, \quad M\text{-a.e.} \tag{2.35} $$

and

$$ \liminf_{n\to\infty} \frac{1}{n} h_n \ge 0, \quad P\text{-a.e.} \tag{2.36} $$

If in addition $M \gg P$, then

$$ \lim_{n\to\infty} \frac{1}{n} h_n = 0, \quad P\text{-a.e.} \tag{2.37} $$

Proof: First consider the probability

$$ M\left(\frac{1}{n} h_n \ge \epsilon\right) = M\left(f_n \ge e^{n\epsilon}\right) \le e^{-n\epsilon} E_M(f_n), $$

where the final inequality is Markov's inequality. But

$$ E_M(f_n) = \int f_n \, dM = \sum_{x^n: m_{X^n}(x^n)\neq 0} m_{X^n}(x^n) \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \sum_{x^n: m_{X^n}(x^n)\neq 0} p_{X^n}(x^n) \le 1 $$

and therefore

$$ M\left(\frac{1}{n} h_n \ge \epsilon\right) \le e^{-n\epsilon} $$

and hence

$$ \sum_{n=1}^{\infty} M\left(\frac{1}{n} h_n \ge \epsilon\right) \le \sum_{n=1}^{\infty} e^{-n\epsilon} < \infty. $$

From the Borel-Cantelli lemma (e.g., Lemma 4.6.3 of [50]) this implies that $M(\frac{1}{n} h_n \ge \epsilon \text{ i.o.}) = 0$, which implies the first equation of the lemma.

Next consider

$$ P\left(-\frac{1}{n} h_n > \epsilon\right) = \sum_{x^n: -\frac{1}{n}\ln p_{X^n}(x^n)/m_{X^n}(x^n) > \epsilon} p_{X^n}(x^n) = \sum_{x^n: -\frac{1}{n}\ln p_{X^n}(x^n)/m_{X^n}(x^n) > \epsilon,\ m_{X^n}(x^n)\neq 0} p_{X^n}(x^n), $$

where the last statement follows since if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$ and hence nothing would be contributed to the sum. In other words, terms violating this condition add zero to the sum and hence adding this condition to the sum does not change the sum's value. Thus

$$ P\left(-\frac{1}{n} h_n > \epsilon\right) = \sum_{x^n: -\frac{1}{n}\ln p_{X^n}(x^n)/m_{X^n}(x^n) > \epsilon,\ m_{X^n}(x^n)\neq 0} \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)}\, m_{X^n}(x^n) = \int_{f_n < e^{-n\epsilon}} f_n \, dM \le \int_{f_n < e^{-n\epsilon}} e^{-n\epsilon} \, dM = e^{-n\epsilon}\, M\left(f_n < e^{-n\epsilon}\right) \le e^{-n\epsilon}. $$

Thus as before we have that $P(-\frac{1}{n} h_n > \epsilon) \le e^{-n\epsilon}$ and hence that $P(\frac{1}{n} h_n < -\epsilon \text{ i.o.}) = 0$, which proves the second claim. If also $M \gg P$, then the first equation of the lemma is also true P-a.e., which when coupled with the second equation proves the third. □

Chapter 3

The Entropy Ergodic Theorem

3.1 Introduction

The goal of this chapter is to prove an ergodic theorem for the sample entropy of finite alphabet random processes. The result is sometimes called the ergodic theorem of information theory or the asymptotic equipartition theorem, but it is best known as the Shannon-McMillan-Breiman theorem. It provides a common foundation to many of the results of both ergodic theory and information theory. Shannon [129] first developed the result for convergence in probability for stationary ergodic Markov sources. McMillan [103] proved $L^1$ convergence for stationary ergodic sources and Breiman [19] [20] proved almost everywhere convergence for stationary and ergodic sources. Billingsley [15] extended the result to stationary nonergodic sources. Jacobs [67] [66] extended it to processes dominated by a stationary measure and hence to two-sided AMS processes. Gray and Kieffer [54] extended it to processes asymptotically dominated by a stationary measure and hence to all AMS processes. The generalizations to AMS processes build on the Billingsley theorem for the stationary mean. Following generalizations of the definitions of entropy and information, corresponding generalizations of the entropy ergodic theorem will be considered in Chapter 8.

Breiman’s and Billingsley’s approach requires the martingale convergence theorem and embeds the possibly one-sided stationary process into a two-sided process. Ornstein and Weiss [117] recently developed a proof for the stationary and ergodic case that does not require any martingale theory and considers only positive time and hence does not require any embedding into two-sided processes. The technique was described for both the ordinary ergodic theorem and the entropy ergodic theorem by Shields [132]. In addition, it uses a form of coding argument that is both more direct and more information theoretic in flavor than the traditional martingale proofs. We here follow the Ornstein and Weiss approach for the stationary ergodic result. We also use some modifications similar to those of Katznelson and Weiss for the proof of the ergodic theorem. We then generalize the result first to nonergodic processes using the “sandwich” technique of Algoet and Cover [7] and then to AMS processes using a variation on a result of [54].

We next state the theorem to serve as a guide through the various steps. We also prove the result for the simple special case of a Markov source, for which the result follows from the usual ergodic theorem.

We consider a directly given finite alphabet source $\{X_n\}$ described by a distribution m on the sequence measurable space $(\Omega, \mathcal{B})$. Define as previously $X_k^n = (X_k, X_{k+1}, \cdots, X_{k+n-1})$. The subscript is omitted when it is zero. For any random variable Y defined on the sequence space (such as $X_k^n$) we define the random variable $m(Y)$ by $m(Y)(x) = m(Y = Y(x))$.

Theorem 3.1.1: The Entropy Ergodic Theorem

Given a finite alphabet AMS source $\{X_n\}$ with process distribution m and stationary mean $\bar{m}$, let $\{\bar{m}_x;\ x \in \Omega\}$ be the ergodic decomposition of the stationary mean $\bar{m}$. Then

$$ \lim_{n\to\infty} -\frac{\ln m(X^n)}{n} = h; \quad m\text{-a.e. and in } L^1(m), \tag{3.1} $$

where h(x) is the invariant function defined by

$$ h(x) = \bar{H}_{\bar{m}_x}(X). \tag{3.2} $$

Furthermore,

$$ E_m h = \lim_{n\to\infty} \frac{1}{n} H_m(X^n) = \bar{H}_m(X); \tag{3.3} $$

that is, the entropy rate of an AMS process is given by the limit, and

$$ \bar{H}_{\bar{m}}(X) = \bar{H}_m(X). \tag{3.4} $$

Comments: The theorem states that the sample entropy using the AMS measure m converges to the entropy rate of the underlying ergodic component of the stationary mean. Thus, for example, if m is itself stationary and ergodic, then the sample entropy converges to the entropy rate of the process m-a.e. and in $L^1(m)$. The $L^1(m)$ convergence follows immediately from the almost everywhere convergence and the fact that sample entropy is uniformly integrable (Lemma 2.3.6). $L^1$ convergence in turn immediately implies the left-hand equality of (3.3). Since the limit exists, it is the entropy rate. The final equality states that the entropy rates of an AMS process and its stationary mean are the same. This result follows from (3.2)-(3.3) by the following argument:

We have that $\bar{H}_m(X) = E_m h$ and $\bar{H}_{\bar{m}}(X) = E_{\bar{m}} h$, but h is invariant and hence the two expectations are equal (see, e.g., Lemma 6.3.1 of [50]). Thus we need only prove almost everywhere convergence in (3.1) to prove the theorem.
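For the simplest stationary ergodic case, the theorem can be checked numerically. The following sketch (not from the text, assuming a hypothetical i.i.d. ternary source, for which the entropy rate is the single-letter entropy) computes the sample entropy $-\frac{1}{n}\ln m(X^n)$ along one long path and compares it with the entropy rate:

```python
import math
import random

# Hypothetical i.i.d. source on {0, 1, 2}; an i.i.d. measure is stationary and
# ergodic, so (3.1) says -(1/n) ln m(X^n) -> entropy rate almost everywhere.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}
symbols, weights = zip(*pmf.items())

# Entropy rate of an i.i.d. source = single-letter entropy (in nats).
entropy_rate = -sum(p * math.log(p) for p in pmf.values())

rng = random.Random(1)
n = 50000
xn = rng.choices(symbols, weights=weights, k=n)

# Sample entropy: for an i.i.d. measure ln m(X^n) is a sum of single-letter terms.
sample_entropy = -sum(math.log(pmf[x]) for x in xn) / n

print(entropy_rate, sample_entropy)  # close for large n
```

The convergence here is just the strong law of large numbers; the content of the theorem is that the same conclusion holds for arbitrary AMS sources.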

In this section we limit ourselves to the following special case of the theorem that can be proved using the ordinary ergodic theorem without any new techniques.

Lemma 3.1.1: Given a finite alphabet stationary kth order Markov source $\{X_n\}$, there is an invariant function h such that

$$ \lim_{n\to\infty} -\frac{\ln m(X^n)}{n} = h; \quad m\text{-a.e. and in } L^1(m), $$

where h is defined by

$$ h(x) = -E_{\bar{m}_x} \ln m(X_k \mid X^k), \tag{3.5} $$

where $\{\bar{m}_x\}$ is the ergodic decomposition of the stationary mean $\bar{m}$. Furthermore,

$$ h(x) = \bar{H}_{\bar{m}_x}(X) = H_{\bar{m}_x}(X_k \mid X^k). \tag{3.6} $$

Proof of Lemma: We have from the chain rule for pmf's that

$$ -\frac{1}{n} \ln m(X^n) = -\frac{1}{n} \sum_{i=0}^{n-1} \ln m(X_i \mid X^i). $$

Since the process is kth order Markov with stationary transition probabilities, for $i > k$ we have that

$$ m(X_i \mid X^i) = m(X_i \mid X_{i-k}, \cdots, X_{i-1}) = m(X_k \mid X^k)\, T^{i-k}. $$

The terms $\ln m(X_i \mid X^i)$, $i = 0, 1, \cdots, k-1$, have finite expectation and hence are finite m-a.e., so that the ergodic theorem can be applied to deduce

$$ \lim_{n\to\infty} -\frac{\ln m(X^n)(x)}{n} = -E_{\bar{m}_x} \ln m(X_k \mid X^k) = h(x), \quad m\text{-a.e.}, $$

proving the first statement of the lemma. It follows from the ergodic decomposition of Markov sources (see Lemma 8.6.3 of [50]) that with probability 1,

$$ \bar{m}_x(X_k \mid X^k) = m(X_k \mid \psi(x), X^k) = m(X_k \mid X^k), $$

where $\psi$ is the ergodic component function. This completes the proof. □
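Lemma 3.1.1 can be illustrated with a sketch (not from the text) for a hypothetical two-state first-order (k = 1) Markov chain. For an ergodic chain the ergodic decomposition is trivial, and (3.5)-(3.6) reduce to the conditional entropy $h = -\sum_a \pi(a) \sum_b P(a,b) \ln P(a,b)$, which the sample entropy of a long simulated path should approach:

```python
import math
import random

# Hypothetical stationary first-order Markov source on {0, 1}.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
# Stationary distribution pi solves pi P = pi; for this chain pi = (2/3, 1/3).
pi = {0: 2 / 3, 1: 1 / 3}

# Entropy rate per (3.5)-(3.6): h = -E ln m(X_1 | X_0) under the stationary measure.
h = -sum(pi[a] * P[a][b] * math.log(P[a][b]) for a in P for b in P[a])

# Sample entropy -(1/n) ln m(X^n) along one long path; the chain is ergodic,
# so by Lemma 3.1.1 this converges to h almost everywhere.
rng = random.Random(2)
n = 100000
x = 0
log_m = math.log(pi[x])  # start the chain in its stationary distribution
for _ in range(n - 1):
    nxt = 0 if rng.random() < P[x][0] else 1
    log_m += math.log(P[x][nxt])
    x = nxt

print(h, -log_m / n)  # close for large n
```

The contribution of the initial term $\ln \pi(x_0)$ vanishes after division by n, mirroring the proof's observation that the first k terms are finite a.e. and hence asymptotically negligible.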

We prove the theorem in three steps: The first step considers stationary and ergodic sources and uses the approach of Ornstein and Weiss [117] (see also Shields [132]). The second step removes the requirement for ergodicity. This result will later be seen to provide an information theoretic interpretation of the ergodic decomposition. The third step extends the result to AMS processes by showing that such processes inherit limiting sample entropies from their stationary mean. The later extension of these results to more general relative entropy and information densities will closely parallel the proofs of the second and third steps for the finite case.
