


Markov chain and note that given $Z$, $X$ does not depend on $Y$. (Note that if $Y \to Z \to X$ is a Markov chain, then so is $X \to Z \to Y$.) Thus the conditional mutual information is 0 if and only if the variables form a Markov chain with the conditioning variable in the middle. One might be tempted to infer from Lemma 2.3.3 that given finite valued measurements $f$, $g$, and $r$

$$I(f(X); g(Y) \mid r(Z)) \stackrel{?}{\leq} I(X; Y \mid Z).$$

This does not follow, however, since it is not true that if $Q$ is the partition corresponding to the three quantizers, then $D(P_{f(X),g(Y),r(Z)} \| P_{f(X)\times g(Y)|r(Z)})$ is $H_{P_{X,Y,Z}\|P_{X\times Y|Z}}(f(X), g(Y), r(Z))$ because of the way that $P_{X\times Y|Z}$ is constructed; e.g., the fact that $X$ and $Y$ are conditionally independent given $Z$ implies that $f(X)$ and $g(Y)$ are conditionally independent given $Z$, but it does not imply that $f(X)$ and $g(Y)$ are conditionally independent given $r(Z)$. Alternatively, if $M$ is $P_{X\times Y|Z}$, then it is not true that $P_{f(X)\times g(Y)|r(Z)}$ equals $M(fgr)^{-1}$. Note that if this inequality were true, choosing $r(z)$ to be trivial (say 1 for all $z$) would result in $I(X;Y\mid Z) \geq I(X;Y\mid r(Z)) = I(X;Y)$. This cannot be true in general since, for example, choosing $Z$ as $(X, Y)$ would give $I(X;Y\mid Z) = 0$. Thus one must be careful when applying Lemma 2.3.3 if the measures and random variables are related as they are in the case of conditional mutual information.
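As a numerical illustration of this caution (added here; not an example from the text), the following Python sketch computes conditional mutual information directly from a joint pmf. The helper name `cond_mutual_info` and the toy distribution, in which $Z = (X, Y)$, are illustrative choices: with $Z = (X,Y)$ we get $I(X;Y \mid Z) = 0$, while the trivial quantizer $r(z) = 0$ gives $I(X;Y \mid r(Z)) = I(X;Y) = \ln 2 > 0$, so no inequality of the questioned form can hold.

```python
import numpy as np

def cond_mutual_info(p_xyz):
    """I(X;Y|Z) in nats for a joint pmf indexed as p[x, y, z]."""
    p_z = p_xyz.sum(axis=(0, 1))
    p_xz = p_xyz.sum(axis=1)
    p_yz = p_xyz.sum(axis=0)
    total = 0.0
    for x in range(p_xyz.shape[0]):
        for y in range(p_xyz.shape[1]):
            for z in range(p_xyz.shape[2]):
                p = p_xyz[x, y, z]
                if p > 0:
                    total += p * np.log(p * p_z[z] / (p_xz[x, z] * p_yz[y, z]))
    return total

# Toy distribution: X = Y, uniform on {0, 1}; Z = (X, Y) encoded as z = 2x + y.
p = np.zeros((2, 2, 4))
p[0, 0, 0] = 0.5
p[1, 1, 3] = 0.5

print(cond_mutual_info(p))            # 0.0: X -> Z -> Y is a Markov chain
# Trivial quantizer r(z) = 0 merges all z-values into a single cell:
p_r = p.sum(axis=2).reshape(2, 2, 1)
print(cond_mutual_info(p_r))          # ln 2 ~ 0.693 = I(X; Y)
```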

We close this section with an easy corollary of the previous lemma and of the definition of conditional entropy. Results of this type are referred to as chain rules for information and entropy.

Corollary 2.5.1: Given finite alphabet random variables $Y, X_1, X_2, \ldots, X_n$,

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})$$

$$H_{p\|m}(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H_{p\|m}(X_i \mid X_1, \ldots, X_{i-1})$$

$$I(Y; (X_1, X_2, \ldots, X_n)) = \sum_{i=1}^n I(Y; X_i \mid X_1, \ldots, X_{i-1}).$$
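The chain rules can be checked numerically for small alphabets. The following minimal sketch (not part of the original text; the helper `H` and the array shape are arbitrary choices) verifies the first identity for a random joint pmf of three variables, computing each conditional entropy directly as an average of entropies of conditional pmfs rather than as a difference of joint entropies.

```python
import numpy as np

def H(p):
    """Entropy in nats of a pmf given as any nonnegative array summing to 1."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))              # joint pmf of (X1, X2, X3)
p /= p.sum()

lhs = H(p)                              # H(X1, X2, X3)

# H(X1) + H(X2 | X1) + H(X3 | X1, X2)
p1 = p.sum(axis=(1, 2))                 # pmf of X1
p12 = p.sum(axis=2)                     # pmf of (X1, X2)
rhs = H(p1)
rhs += sum(p1[a] * H(p12[a] / p1[a]) for a in range(p.shape[0]))
rhs += sum(p12[a, b] * H(p[a, b] / p12[a, b])
           for a in range(p.shape[0]) for b in range(p.shape[1]))

print(np.isclose(lhs, rhs))             # True
```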

2.6 Entropy Rate Revisited

The chain rule of Corollary 2.5.1 provides a means of computing entropy rates for stationary processes. We have that

$$\frac{1}{n} H(X^n) = \frac{1}{n} \sum_{i=0}^{n-1} H(X_i \mid X^i).$$

First suppose that the source is a stationary $k$th order Markov process, that is, for any $n > k$

$$\Pr(X_n = x_n \mid X_i = x_i;\ i = 0, 1, \ldots, n-1) = \Pr(X_n = x_n \mid X_i = x_i;\ i = n-k, \ldots, n-1).$$

For such a process we have for all $n \geq k$ that

$$H(X_n \mid X^n) = H(X_n \mid X^k_{n-k}) = H(X_k \mid X^k),$$

where $X^m_i = (X_i, \ldots, X_{i+m-1})$. Thus taking the limit as $n \to \infty$ of the $n$th order entropy, all but a finite number of terms in the sum are identical and hence the Cesàro (or arithmetic) mean is given by the common conditional entropy $H(X_k \mid X^k)$. We have therefore proved the following lemma.

Lemma 2.6.1: If $\{X_n\}$ is a stationary $k$th order Markov source, then

$$\bar{H}(X) = H(X_k \mid X^k).$$
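For concreteness, here is an added sketch (not from the text) of the lemma for $k = 1$: the entropy rate of a stationary first-order Markov chain is $\bar{H}(X) = H(X_1 \mid X_0) = -\sum_x \pi(x) \sum_y P(x,y) \ln P(x,y)$, where $P$ is the transition matrix and $\pi$ its stationary distribution. The transition matrix below is an arbitrary example.

```python
import numpy as np

# Entropy rate of a stationary first-order (k = 1) Markov chain,
# H-bar(X) = H(X_1 | X_0) = -sum_x pi(x) sum_y P(x, y) ln P(x, y).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])             # illustrative transition matrix

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

entropy_rate = -np.sum(pi[:, None] * P * np.log(P))
print(entropy_rate)                    # H(X_1 | X_0) in nats
```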

If we have a two-sided stationary process $\{X_n\}$, then all of the previous definitions for entropies of vectors extend in an obvious fashion and a generalization of the Markov result follows if we use stationarity and the chain rule to write

$$\frac{1}{n} H(X^n) = \frac{1}{n} \sum_{i=0}^{n-1} H(X_0 \mid X_{-1}, \ldots, X_{-i}).$$

Since conditional entropy is nonincreasing with more conditioning variables ((2.22) or Lemma 2.5.2), $H(X_0 \mid X_{-1}, \ldots, X_{-i})$ has a limit. Again using the fact that a Cesàro mean of terms all converging to a common limit also converges to the same limit, we have the following result.

Lemma 2.6.2: If $\{X_n\}$ is a two-sided stationary source, then

$$\bar{H}(X) = \lim_{n \to \infty} H(X_0 \mid X_{-1}, \ldots, X_{-n}).$$

It is tempting to identify the above limit as the conditional entropy given the infinite past, $H(X_0 \mid X_{-1}, \ldots)$. Since the conditioning variable is a sequence and does not have a finite alphabet, such a conditional entropy is not included in any of the definitions yet introduced. We shall later demonstrate that this interpretation is indeed valid when the notion of conditional entropy has been suitably generalized.

The natural generalization of Lemma 2.6.2 to relative entropy rates unfortunately does not work because conditional relative entropies are not in general monotonic with increased conditioning, and hence the chain rule does not immediately yield a limiting argument analogous to that for entropy. The argument does work if the reference measure is $k$th order Markov, as considered in the following lemma.

Lemma 2.6.3: If $\{X_n\}$ is a source described by process distributions $p$ and $m$ and if $p$ is stationary and $m$ is $k$th order Markov with stationary transitions, then for $n \geq k$, $H_{p\|m}(X_0 \mid X_{-1}, \ldots, X_{-n})$ is nondecreasing in $n$ and

$$\bar{H}_{p\|m}(X) = \lim_{n \to \infty} H_{p\|m}(X_0 \mid X_{-1}, \ldots, X_{-n}) = -\bar{H}_p(X) - E_p[\ln m(X_k \mid X^k)].$$

Proof: For $n \geq k$ we have that

$$H_{p\|m}(X_0 \mid X_{-1}, \ldots, X_{-n}) = -H_p(X_0 \mid X_{-1}, \ldots, X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k|X^k}(x_k \mid x^k).$$

Since the conditional entropy is nonincreasing with $n$ and the remaining term does not depend on $n$, the combination is nondecreasing with $n$. The remainder of the proof then parallels the entropy rate result. $\Box$

It is important to note that the relative entropy analogs to entropy properties often require $k$th order Markov assumptions on the reference measure (but not on the original measure).
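As a hedged illustration (added here, not from the text), consider the simplest case in which the source $p$ is itself a stationary first-order Markov chain with transition matrix $P$ and the reference $m$ is first-order Markov with transition matrix $Q$. Lemma 2.6.3 then reduces to the closed form $\bar{H}_{p\|m}(X) = \sum_x \pi(x) \sum_y P(x,y) \ln\big(P(x,y)/Q(x,y)\big)$, where $\pi$ is the stationary distribution of $P$; the matrices below are arbitrary examples.

```python
import numpy as np

# Special case: p is a first-order Markov chain with transitions P and
# the reference m is first-order Markov with transitions Q. Then
# H-bar_{p||m}(X) = -H-bar_p(X) - E_p[ln m(X_1 | X_0)]
#                 = sum_x pi(x) sum_y P(x,y) ln(P(x,y) / Q(x,y)).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
Q = np.array([[0.5, 0.5],
              [0.5, 0.5]])              # reference: i.i.d. fair coin flips

w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

rel_entropy_rate = np.sum(pi[:, None] * P * np.log(P / Q))
print(rel_entropy_rate)                 # nats per symbol
```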

Markov Approximations

Recall that the relative entropy rate $\bar{H}_{p\|m}(X)$ can be thought of as a distance between the process with distribution $p$ and that with distribution $m$, and that the rate is given by a limit if the reference measure $m$ is Markov. A particular Markov measure relevant to $p$ is the distribution $p^{(k)}$, which is the $k$th order Markov approximation to $p$ in the sense that it is a $k$th order Markov source and it has the same $k$th order transition probabilities as $p$. To be more precise, the process distribution $p^{(k)}$ is specified by its finite dimensional distributions

$$p^{(k)}_{X^k}(x^k) = p_{X^k}(x^k)$$

$$p^{(k)}_{X^n}(x^n) = p_{X^k}(x^k) \prod_{l=k}^{n-1} p_{X_l|X^k_{l-k}}(x_l \mid x^k_{l-k}); \quad n = k, k+1, \ldots$$

so that

$$p^{(k)}_{X_k|X^k} = p_{X_k|X^k}.$$
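A minimal sketch of the construction for $k = 1$ (added here; not part of the original text): given a stationary three-dimensional distribution $p_{X^3}$, form $p^{(1)}_{X^3}(x_0,x_1,x_2) = p_{X^1}(x_0)\, p_{X_1|X^1}(x_1 \mid x_0)\, p_{X_1|X^1}(x_2 \mid x_1)$. The toy $p$ below is generated from a first-order Markov chain so the result can be checked exactly (the approximation reproduces $p$); for a general stationary $p$ the same code produces a genuinely different $p^{(1)}$.

```python
import numpy as np

# Toy stationary p_{X^3} generated from a first-order Markov chain so the
# construction can be checked exactly; any stationary p_{X^3} would do.
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
p3 = pi[:, None, None] * P[:, :, None] * P[None, :, :]   # p(x0, x1, x2)

# First-order Markov approximation built from p's own marginals:
p2 = p3.sum(axis=2)                    # p(x0, x1)
p1 = p2.sum(axis=1)                    # p(x0)
cond = p2 / p1[:, None]                # p(x1 | x0) = p_{X_1 | X^1}

p3_markov = p1[:, None, None] * cond[:, :, None] * cond[None, :, :]
print(np.allclose(p3, p3_markov))      # True here, since p is already Markov
```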

It is natural to ask how good this approximation is, especially in the limit, that is, to study the behavior of the relative entropy rate $\bar{H}_{p\|p^{(k)}}(X)$ as $k \to \infty$.

Theorem 2.6.2: Given a stationary process $p$, let $p^{(k)}$ denote the $k$th order Markov approximations to $p$. Then

$$\lim_{k \to \infty} \bar{H}_{p\|p^{(k)}}(X) = \inf_k \bar{H}_{p\|p^{(k)}}(X) = 0.$$

Thus the Markov approximations are asymptotically accurate in the sense that the relative entropy rate between the source and approximation can be made arbitrarily small (zero if the original source itself happens to be Markov).

Proof: As in the proof of Lemma 2.6.3 we can write for $n \geq k$ that

$$H_{p\|p^{(k)}}(X_0 \mid X_{-1}, \ldots, X_{-n}) = -H_p(X_0 \mid X_{-1}, \ldots, X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln p_{X_k|X^k}(x_k \mid x^k)$$

$$= H_p(X_0 \mid X_{-1}, \ldots, X_{-k}) - H_p(X_0 \mid X_{-1}, \ldots, X_{-n}).$$

Note that this implies that $p^{(k)}_{X^n} \gg p_{X^n}$ for all $n$ since the entropies are finite.

This automatic domination of the finite dimensional distributions of a measure by those of its Markov approximation will not hold in the general case to be encountered later; it is specific to the finite alphabet case. Taking the limit as $n \to \infty$ gives

$$\bar{H}_{p\|p^{(k)}}(X) = \lim_{n \to \infty} H_{p\|p^{(k)}}(X_0 \mid X_{-1}, \ldots, X_{-n}) = H_p(X_0 \mid X_{-1}, \ldots, X_{-k}) - \bar{H}_p(X).$$

The theorem then follows immediately from Lemma 2.6.2, since $H_p(X_0 \mid X_{-1}, \ldots, X_{-k})$ is nonincreasing in $k$ and converges to $\bar{H}_p(X)$. $\Box$
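To see the convergence numerically, here is an added sketch (not from the text) using the proof's formula on a stationary binary hidden Markov source, a standard finite alphabet process that is not Markov of any finite order; the matrices and block lengths below are illustrative. Block entropies $H_p(X^n)$ are computed exactly by enumeration, $H_p(X_0 \mid X_{-1}, \ldots, X_{-k}) = H_p(X^{k+1}) - H_p(X^k)$ by stationarity, and the entropy rate $\bar{H}_p(X)$ is approximated by the last such difference.

```python
import numpy as np
from itertools import product

# Stationary binary hidden Markov source: hidden chain with transitions A,
# observed through the emission matrix B[s, x].
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.95, 0.05],
              [0.10, 0.90]])
w, v = np.linalg.eig(A.T)
mu = np.real(v[:, np.argmax(np.real(w))])
mu /= mu.sum()

def block_entropy(n):
    """H_p(X^n) in nats, by enumerating all 2^n observation blocks."""
    H = 0.0
    for xs in product((0, 1), repeat=n):
        vec = mu * B[:, xs[0]]
        for x in xs[1:]:
            vec = (vec @ A) * B[:, x]
        prob = vec.sum()
        H -= prob * np.log(prob)
    return H

Hn = [block_entropy(n) for n in range(1, 12)]
cond = [Hn[0]] + [Hn[i] - Hn[i - 1] for i in range(1, len(Hn))]
h_rate = cond[-1]        # proxy for H-bar_p(X): H_p(X_0 | X_{-1},...,X_{-10})
for k in range(1, len(cond) - 1):
    # H-bar_{p||p^(k)}(X) = H_p(X_0 | X_{-1},...,X_{-k}) - H-bar_p(X)
    print(k, cond[k] - h_rate)          # decreases toward 0 as k grows
```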

Markov approximations will play a fundamental role when considering relative entropies for general (nonfinite alphabet) processes. The basic result above will generalize to that case, but the proof will be much more involved.
