
Basic Properties of Entropy


For simplicity we focus on the entropy rate of a directly given finite alphabet random process $\{X_n\}$. We will also emphasize stationary measures, but we will try to clarify those results that require stationarity and those that are more general.

Let $A$ be a finite set. Let $\Omega = A^{\mathbb{Z}_+}$ and let $\mathcal{B}$ be the sigma-field of subsets of $\Omega$ generated by the rectangles. Since $A$ is finite, $(A, \mathcal{B}_A)$ is standard, where $\mathcal{B}_A$ is the power set of $A$. Thus $(\Omega, \mathcal{B})$ is also standard by Lemma 2.4.1 of [50]. In fact, from the proof that cartesian products of standard spaces are standard, we can take as a basis for $\mathcal{B}$ the fields $\mathcal{F}_n$ generated by the finite dimensional rectangles having the form $\{x : X^n(x) = x^n = a^n\}$ for all $a^n \in A^n$ and all positive integers $n$. (Members of this class of rectangles are called thin cylinders.) The union of all such fields, say $\mathcal{F}$, is then a generating field.

Many of the basic properties of entropy follow from the following simple inequality.

Lemma 2.3.1: Given two probability mass functions $\{p_i\}$ and $\{q_i\}$, that is, two countable or finite sequences of nonnegative numbers that sum to one, then
$$\sum_i p_i \ln \frac{p_i}{q_i} \ge 0$$
with equality if and only if $q_i = p_i$, all $i$.

Proof: The lemma follows easily from the elementary inequality for real numbers
$$\ln x \le x - 1 \qquad (2.5)$$
(with equality if and only if $x = 1$) since
$$\sum_i p_i \ln \frac{q_i}{p_i} \le \sum_i p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_i q_i - \sum_i p_i = 0,$$
with equality if and only if $q_i/p_i = 1$, all $i$. Alternatively, the inequality follows from Jensen's inequality [63] since $\ln$ is a convex $\cap$ function:
$$\sum_i p_i \ln \frac{q_i}{p_i} \le \ln \left( \sum_i p_i \frac{q_i}{p_i} \right) = 0. \qquad \Box$$
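As a quick sanity check, the following Python sketch (the function name `pmf_divergence` and the random pmfs are ours, not from the text) verifies Lemma 2.3.1 numerically for randomly drawn probability mass functions.

```python
# A minimal numerical check of Lemma 2.3.1: for pmfs p and q,
# sum_i p_i ln(p_i / q_i) >= 0, with equality when q = p.
import numpy as np

def pmf_divergence(p, q):
    """Compute sum_i p_i ln(p_i / q_i), skipping terms with p_i = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(8))              # a random pmf on 8 symbols
    q = rng.dirichlet(np.ones(8))
    assert pmf_divergence(p, q) >= 0.0         # the inequality of Lemma 2.3.1
    assert abs(pmf_divergence(p, p)) < 1e-12   # equality if and only if q = p
```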

The quantity used in the lemma is of such fundamental importance that we pause to introduce another notion of information and to recast the inequality in terms of it. As with entropy, the definition for the moment is only for finite alphabet random variables. Also as with entropy, there are a variety of ways to define it. Suppose that we have an underlying measurable space $(\Omega, \mathcal{B})$ and two measures on this space, say $P$ and $M$, and we have a random variable $f$ with finite alphabet $A$ defined on the space and that $Q$ is the induced partition $\{f^{-1}(a);\, a \in A\}$. Let $P_f$ and $M_f$ be the induced distributions and let $p$ and $m$ be the corresponding probability mass functions, e.g., $p(a) = P_f(\{a\}) = P(f = a)$.

Define the relative entropy of a measurement $f$ with measure $P$ with respect to the measure $M$ by
$$H_{P\|M}(f) = \sum_{a \in A} p(a) \ln \frac{p(a)}{m(a)}.$$
Observe that this only makes sense if $p(a)$ is 0 whenever $m(a)$ is, that is, if $P_f$ is absolutely continuous with respect to $M_f$, or $M_f \gg P_f$. Define $H_{P\|M}(f) = \infty$ if $P_f$ is not absolutely continuous with respect to $M_f$. The measure $M$ is referred to as the reference measure. Relative entropies will play an increasingly important role as general alphabets are considered. In the early chapters the emphasis will be on ordinary entropy, with similar properties for relative entropies following almost as an afterthought. When considering more abstract (nonfinite) alphabets later on, relative entropies will prove indispensable.
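A minimal computational sketch of this definition may help; the function below (ours, with hypothetical argument names) builds the induced pmfs $P_f$ and $M_f$ and returns $+\infty$ when the absolute continuity condition fails.

```python
# A sketch of the relative entropy H_{P||M}(f) of a finite-alphabet
# measurement f: it is +infinity unless P_f << M_f.
import math
from collections import defaultdict

def relative_entropy(omega, P, M, f):
    """H_{P||M}(f) for a finite sample space `omega` (a list of points),
    pmfs P and M given as dicts over omega, and a measurement f."""
    p, m = defaultdict(float), defaultdict(float)
    for w in omega:                    # induced distributions P_f and M_f
        p[f(w)] += P[w]
        m[f(w)] += M[w]
    total = 0.0
    for a, pa in p.items():
        if pa == 0.0:
            continue
        if m[a] == 0.0:                # P_f not absolutely continuous w.r.t. M_f
            return math.inf
        total += pa * math.log(pa / m[a])
    return total

# Example usage (hypothetical numbers): a four-point space collapsed by parity.
omega = [0, 1, 2, 3]
P = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
M = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
print(relative_entropy(omega, P, M, lambda w: w % 2))   # finite, since M_f >> P_f
```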

Analogous to entropy, given a random process $\{X_n\}$ described by two process distributions $p$ and $m$, if it is true that
$$m_{X^n} \gg p_{X^n}; \quad n = 1, 2, \cdots,$$
then we can define for each $n$ the $n$th order relative entropy $n^{-1} H_{p\|m}(X^n)$ and the relative entropy rate
$$\bar{H}_{p\|m}(X) = \limsup_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n).$$
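As a simple illustration (ours, not from the text), suppose both process distributions make $\{X_n\}$ i.i.d., say with per-letter pmfs $p_1$ and $m_1$ where $m_1 > 0$ wherever $p_1 > 0$; then $n^{-1} H_{p\|m}(X^n)$ is the same for every $n$, so the limit supremum above is an ordinary limit. The short Python check below confirms this for a three-letter example.

```python
# Under the i.i.d. assumption, (1/n) H_{p||m}(X^n) equals the single-letter
# relative entropy for every n.
import itertools
import numpy as np

p1 = np.array([0.5, 0.3, 0.2])                 # hypothetical per-letter pmfs
m1 = np.array([0.2, 0.3, 0.5])
per_letter = float(np.sum(p1 * np.log(p1 / m1)))

for n in (1, 2, 3, 4):
    total = 0.0
    for xn in itertools.product(range(3), repeat=n):   # all n-tuples
        pn = float(np.prod(p1[list(xn)]))
        mn = float(np.prod(m1[list(xn)]))
        total += pn * np.log(pn / mn)
    assert np.isclose(total / n, per_letter)
```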

When dealing with relative entropies it is often the measures that are important and not the random variable or partition. We introduce a special notation which emphasizes this fact. Given a probability space $(\Omega, \mathcal{B}, P)$, with $\Omega$ a finite space, and another measure $M$ on the same space, we define the divergence of $P$ with respect to $M$ as the relative entropy of the identity mapping with respect to the two measures:

$$D(P\|M) = \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M(\omega)}.$$

Thus, for example, given a finite alphabet measurement $f$ on an arbitrary probability space $(\Omega, \mathcal{B}, P)$, if $M$ is another measure on $(\Omega, \mathcal{B})$ then
$$H_{P\|M}(f) = D(P_f\|M_f).$$

Similarly,
$$H_{p\|m}(X^n) = D(P_{X^n}\|M_{X^n}),$$
where $P_{X^n}$ and $M_{X^n}$ are the distributions for $X^n$ induced by the process measures $p$ and $m$, respectively. The theory and properties of relative entropy are therefore determined by those for divergence.

There are many names and notations for relative entropy and divergence throughout the literature. The idea was introduced by Kullback for applications of information theory to statistics (see, e.g., Kullback [92] and the references therein) and was used to develop information theoretic results by Perez [120] [121] [122], Dobrushin [32], and Pinsker [125]. Various names in common use for this quantity are discrimination, discrimination information, Kullback-Leibler number, directed divergence, and cross entropy.

The lemma can be summarized simply in terms of divergence as in the following theorem, which is commonly referred to as the divergence inequality.

Theorem 2.3.1: Given any two probability measures $P$ and $M$ on a common finite alphabet probability space, then
$$D(P\|M) \ge 0 \qquad (2.6)$$
with equality if and only if $P = M$.

The fact that the divergence of one probability measure with respect to another is nonnegative and zero only when the two measures are the same suggests the interpretation of divergence as a "distance" between the two probability measures, that is, a measure of how different the two measures are. It is not a true distance or metric in the usual sense since it is not a symmetric function of the two measures and it does not satisfy the triangle inequality. The interpretation is, however, quite useful for adding insight into results characterizing the behavior of divergence and it will later be seen to have implications for ordinary distance measures between probability measures.

The divergence plays a basic role in the family of information measures: all of the information measures that we will encounter (entropy, relative entropy, mutual information, and the conditional forms of these information measures) can be expressed as a divergence.

There are three ways to view entropy as a special case of divergence. The first is to permit $M$ to be a general measure instead of requiring it to be a probability measure and have total mass 1. In this case entropy is minus the divergence if $M$ is the counting measure, i.e., assigns measure 1 to every point in the discrete alphabet. If $M$ is not a probability measure, then the divergence inequality (2.6) need not hold. Second, if the alphabet of $f$ is $A_f$ and has $\|A_f\|$ elements, then letting $M$ be the uniform pmf assigning probability $1/\|A_f\|$ to all symbols in $A_f$ yields
$$D(P\|M) = \ln \|A_f\| - H_P(f) \ge 0$$
and hence the entropy is the log of the alphabet size minus the divergence with respect to the uniform distribution. Third, we can also consider entropy a special case of divergence while still requiring that $M$ be a probability measure by using product measures and a bit of a trick. Say we have two measures $P$ and $Q$ on a common probability space $(\Omega, \mathcal{B})$. Define two measures on the product space $(\Omega \times \Omega, \mathcal{B}(\Omega \times \Omega))$ as follows: Let $P \times Q$ denote the usual product measure, that is, the measure specified by its values on rectangles as $P \times Q(F \times G) = P(F)Q(G)$. Thus, for example, if $P$ and $Q$ are discrete distributions with pmf's $p$ and $q$, then the pmf for $P \times Q$ is just $p(a)q(b)$. Let $P_0$ denote the "diagonal" measure defined by its values on rectangles as $P_0(F \times G) = P(F \cap G)$. In the discrete case $P_0$ has pmf $p_0(a, b) = p(a)$ if $a = b$ and 0 otherwise. Then
$$H_P(f) = D(P_0\|P \times P).$$

Note that if we let $X$ and $Y$ be the coordinate random variables on our product space, then both $P_0$ and $P \times P$ give the same marginal probabilities to $X$ and $Y$, that is, $P_X = P_Y = P$. $P_0$ is an extreme distribution on $(X, Y)$ in the sense that with probability one $X = Y$; the two coordinates are deterministically dependent on one another. $P \times P$, however, is the opposite extreme in that it makes the two random variables $X$ and $Y$ independent of one another. Thus the entropy of a distribution $P$ can be viewed as the relative entropy between these two extreme joint distributions having marginals $P$.
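The identities above are easy to verify numerically. The sketch below (the array names and the 5-symbol alphabet are ours) checks both the uniform-reference identity $D(P\|M) = \ln \|A_f\| - H_P(f)$ and the diagonal/product identity $H_P(f) = D(P_0\|P \times P)$.

```python
# Entropy as a special case of divergence, checked on a random pmf.
import numpy as np

def divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(5))                 # a pmf on a 5-symbol alphabet
H = -float(np.sum(P * np.log(P)))             # ordinary entropy

uniform = np.full(5, 1 / 5)
assert np.isclose(divergence(P, uniform), np.log(5) - H)

P0 = np.diag(P)                               # diagonal joint pmf p0(a,b) = p(a) 1{a=b}
PxP = np.outer(P, P)                          # product joint pmf
assert np.isclose(divergence(P0.ravel(), PxP.ravel()), H)
```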

We now return to the general development for entropy. For the moment fix a probability measure $m$ on a measurable space $(\Omega, \mathcal{B})$ and let $X$ and $Y$ be two finite alphabet random variables defined on that space. Let $A_X$ and $A_Y$ denote the corresponding alphabets. Let $P_{XY}$, $P_X$, and $P_Y$ denote the distributions of $(X, Y)$, $X$, and $Y$, respectively.

First observe that since $P_X(a) \le 1$, all $a$, $-\ln P_X(a)$ is nonnegative and hence
$$H(X) = -\sum_{a \in A_X} P_X(a) \ln P_X(a) \ge 0. \qquad (2.7)$$
From (2.6) with $M$ uniform as in the second interpretation of entropy above, if $X$ is a random variable with alphabet $A_X$, then
$$H(X) \le \ln \|A_X\|.$$

Since for any $a \in A_X$ and $b \in A_Y$ we have that $P_X(a) \ge P_{XY}(a, b)$, it follows that
$$H(X, Y) = -\sum_{a,b} P_{XY}(a, b) \ln P_{XY}(a, b) \ge -\sum_{a,b} P_{XY}(a, b) \ln P_X(a) = H(X).$$

Using Lemma 2.3.1 we have that since $P_{XY}$ and $P_X P_Y$ are probability mass functions,
$$H(X, Y) - (H(X) + H(Y)) = \sum_{a,b} P_{XY}(a, b) \ln \frac{P_X(a) P_Y(b)}{P_{XY}(a, b)} \le 0.$$

This proves the following result:

Lemma 2.3.2: Given two discrete alphabet random variables $X$ and $Y$ defined on a common probability space, we have
$$0 \le H(X) \qquad (2.8)$$
and
$$\max(H(X), H(Y)) \le H(X, Y) \le H(X) + H(Y), \qquad (2.9)$$
where the right hand inequality holds with equality if and only if $X$ and $Y$ are independent. If the alphabet of $X$ has $\|A_X\|$ symbols, then
$$H(X) \le \ln \|A_X\|. \qquad (2.10)$$
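A short numerical check of Lemma 2.3.2 (the joint pmf and helper names below are ours, chosen only for illustration):

```python
# For a random joint pmf on a 3 x 4 alphabet:
# max(H(X), H(Y)) <= H(X,Y) <= H(X) + H(Y), and H(X) <= ln ||A_X||.
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(2)
P_XY = rng.dirichlet(np.ones(12)).reshape(3, 4)   # joint pmf
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)     # marginal pmfs

H_XY, H_X, H_Y = entropy(P_XY), entropy(P_X), entropy(P_Y)
assert max(H_X, H_Y) <= H_XY + 1e-12              # left side of (2.9)
assert H_XY <= H_X + H_Y + 1e-12                  # right side of (2.9)
assert H_X <= np.log(3) + 1e-12                   # (2.10)
```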

There is another proof of the left hand inequality in (2.9) that uses an inequality for relative entropy that will be useful later when considering codes. The following lemma gives the inequality. First we introduce a definition. A partition $\mathcal{R}$ is said to refine a partition $\mathcal{Q}$ if every atom in $\mathcal{Q}$ is a union of atoms of $\mathcal{R}$, in which case we write $\mathcal{Q} < \mathcal{R}$.

Lemma 2.3.3: Suppose that $P$ and $M$ are two measures defined on a common measurable space $(\Omega, \mathcal{B})$ and that we are given finite partitions $\mathcal{Q} < \mathcal{R}$. Then
$$H_{P\|M}(\mathcal{Q}) \le H_{P\|M}(\mathcal{R})$$
and
$$H_P(\mathcal{Q}) \le H_P(\mathcal{R}).$$

Comments: The lemma can also be stated in terms of random variables and mappings in an intuitive way: Suppose that $U$ is a random variable with finite alphabet $A$ and $f : A \to B$ is a mapping from $A$ into another finite alphabet $B$. Then the composite random variable $f(U)$ defined by $f(U)(\omega) = f(U(\omega))$ is also a finite random variable. If $U$ induces a partition $\mathcal{R}$ and $f(U)$ a partition $\mathcal{Q}$, then $\mathcal{Q} < \mathcal{R}$ (since knowing the value of $U$ implies the value of $f(U)$). Thus the lemma immediately gives the following corollary.

Corollary 2.3.1: If $M \gg P$ are two measures describing a random variable $U$ with alphabet $A$ and if $f : A \to B$, then
$$H_{P\|M}(f(U)) \le H_{P\|M}(U)$$
and
$$H_P(f(U)) \le H_P(U).$$

Since $D(P_f\|M_f) = H_{P\|M}(f)$, we have also the following corollary, which we state for future reference.

Corollary 2.3.2: Suppose that $P$ and $M$ are two probability measures on a discrete space and that $f$ is a random variable defined on that space. Then
$$D(P_f\|M_f) \le D(P\|M).$$

The lemma, discussion, and corollaries can all be interpreted as saying that taking a measurement on a finite alphabet random variable lowers (or at least does not increase) the entropy and the relative entropy of that random variable. By choosing $U$ as $(X, Y)$ and $f(X, Y) = X$ or $Y$, the lemma yields the promised inequality of the previous lemma.
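Before turning to the proof, here is a quick numerical illustration (ours) of Corollary 2.3.2: merging alphabet symbols with a measurement $f$ can only decrease the divergence.

```python
# D(P_f || M_f) <= D(P || M) for a measurement f that merges symbols.
import numpy as np

def divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(6))          # two pmfs on a 6-symbol alphabet
M = rng.dirichlet(np.ones(6))

f = np.array([0, 0, 1, 1, 2, 2])       # a measurement merging pairs of symbols
P_f = np.array([P[f == b].sum() for b in range(3)])   # induced pmf P_f
M_f = np.array([M[f == b].sum() for b in range(3)])   # induced pmf M_f

assert divergence(P_f, M_f) <= divergence(P, M) + 1e-12
```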

Proof of Lemma: If $H_{P\|M}(\mathcal{R}) = +\infty$, the result is immediate. If $H_{P\|M}(\mathcal{Q}) = +\infty$, that is, if there exists at least one $Q_j$ such that $M(Q_j) = 0$ but $P(Q_j) \ne 0$, then there exists an $R_i \subset Q_j$ such that $M(R_i) = 0$ and $P(R_i) > 0$ and hence $H_{P\|M}(\mathcal{R}) = +\infty$. Lastly assume that both $H_{P\|M}(\mathcal{R})$ and $H_{P\|M}(\mathcal{Q})$ are finite and consider the difference

$$H_{P\|M}(\mathcal{R}) - H_{P\|M}(\mathcal{Q}) = \sum_i P(R_i) \ln \frac{P(R_i)}{M(R_i)} - \sum_j P(Q_j) \ln \frac{P(Q_j)}{M(Q_j)} = \sum_j \left[ \sum_{i : R_i \subset Q_j} P(R_i) \ln \frac{P(R_i)}{M(R_i)} - P(Q_j) \ln \frac{P(Q_j)}{M(Q_j)} \right].$$

We shall show that each of the bracketed terms is nonnegative, which will prove the first inequality. Fix $j$. If $P(Q_j)$ is 0 we are done, since then also $P(R_i)$ is 0 for all $i$ in the inner sum since these $R_i$ all belong to $Q_j$. If $P(Q_j)$ is not 0, we can divide by it to rewrite the bracketed term as

$$P(Q_j) \left[ \sum_{i : R_i \subset Q_j} \frac{P(R_i)}{P(Q_j)} \ln \frac{P(R_i)/P(Q_j)}{M(R_i)/M(Q_j)} \right],$$

where we also used the fact that $M(Q_j)$ cannot be 0 since then $P(Q_j)$ would also have to be zero. Since $R_i \subset Q_j$, $P(R_i)/P(Q_j) = P(R_i \cap Q_j)/P(Q_j) = P(R_i | Q_j)$ is an elementary conditional probability. Applying a similar argument to $M$ and dividing by $P(Q_j)$, the above expression becomes

$$\sum_{i : R_i \subset Q_j} P(R_i | Q_j) \ln \frac{P(R_i | Q_j)}{M(R_i | Q_j)},$$

which is nonnegative by Lemma 2.3.1, proving the first inequality. The second inequality follows similarly: Consider the difference

$$H_P(\mathcal{R}) - H_P(\mathcal{Q}) = \sum_j \left[ \sum_{i : R_i \subset Q_j} P(R_i) \ln \frac{P(Q_j)}{P(R_i)} \right] = \sum_j P(Q_j) \left[ -\sum_{i : R_i \subset Q_j} P(R_i | Q_j) \ln P(R_i | Q_j) \right]$$
and the result follows since the bracketed term is nonnegative, being an entropy for each value of $j$ (Lemma 2.3.2). $\Box$

The next result provides useful inequalities for entropy considered as a function of the underlying distribution. In particular, it shows that entropy is a concave (or convex $\cap$) function of the underlying distribution. Define the binary entropy function (the entropy of a binary random variable with probability mass function $(\lambda, 1-\lambda)$) by
$$h_2(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda).$$

Lemma 2.3.4: Let $m$ and $p$ denote two distributions for a discrete alphabet random variable $X$. Then for any $\lambda \in (0, 1)$,
$$\lambda H_m(X) + (1 - \lambda) H_p(X) \le H_{\lambda m + (1-\lambda) p}(X) \le \lambda H_m(X) + (1 - \lambda) H_p(X) + h_2(\lambda). \qquad (2.11)$$

Proof: We do a little extra here to save work in a later result. Define the quantities

$$I = -\sum_x m(x) \ln(\lambda m(x) + (1 - \lambda) p(x))$$
and
$$J = -\sum_x p(x) \ln(\lambda m(x) + (1 - \lambda) p(x)),$$
so that $H_{\lambda m + (1-\lambda)p}(X) = \lambda I + (1 - \lambda) J$. Since $\lambda m(x) + (1 - \lambda) p(x) \ge \lambda m(x)$ and similarly $\ge (1 - \lambda) p(x)$, we have $-\ln(\lambda m(x) + (1 - \lambda) p(x)) \le -\ln \lambda - \ln m(x)$, and therefore applying this bound to both $m$ and $p$,
$$I \le -\ln \lambda - \sum_x m(x) \ln m(x) = -\ln \lambda + H_m(X)$$
and
$$J \le -\ln(1 - \lambda) + H_p(X).$$
Combining these bounds yields the upper bound of the lemma,
$$H_{\lambda m + (1-\lambda)p}(X) = \lambda I + (1 - \lambda) J \le \lambda H_m(X) + (1 - \lambda) H_p(X) + h_2(\lambda).$$
To obtain the lower bound of the lemma observe that
$$I = -\sum_x m(x) \ln m(x) - \sum_x m(x) \ln\left(\lambda + (1 - \lambda)\frac{p(x)}{m(x)}\right) = H_m(X) - \sum_x m(x) \ln\left(\lambda + (1 - \lambda)\frac{p(x)}{m(x)}\right).$$
Using (2.5) the rightmost term is bounded below by
$$-\sum_x m(x)\left(\lambda + (1 - \lambda)\frac{p(x)}{m(x)} - 1\right) = -(\lambda + (1 - \lambda) - 1) = 0,$$
so that $I \ge H_m(X)$ and similarly $J \ge H_p(X)$. Thus
$$H_{\lambda m + (1-\lambda)p}(X) = \lambda I + (1 - \lambda) J \ge \lambda H_m(X) + (1 - \lambda) H_p(X),$$
which is the lower bound of the lemma. $\Box$
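A numerical check of (2.11) (the helper names and the 7-symbol alphabet are ours):

```python
# Entropy is concave in the pmf, and mixing costs at most h2(lambda).
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def h2(lam):
    return -lam * np.log(lam) - (1 - lam) * np.log(1 - lam)

rng = np.random.default_rng(4)
m, p = rng.dirichlet(np.ones(7)), rng.dirichlet(np.ones(7))
lam = 0.3
mix = lam * m + (1 - lam) * p

lower = lam * entropy(m) + (1 - lam) * entropy(p)
assert lower <= entropy(mix) + 1e-12                 # lower bound of (2.11)
assert entropy(mix) <= lower + h2(lam) + 1e-12       # upper bound of (2.11)
```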

The next result presents an interesting connection between combinatorics, binomial sums, and a particular entropy. We require the familiar definition of the binomial coefficient,
$$\binom{M}{k} = \frac{M!}{k!\,(M-k)!}.$$

Lemma 2.3.5: Given $\delta \in (0, \frac{1}{2}]$ and a positive integer $M$, we have
$$\sum_{i \le \delta M} \binom{M}{i} \le e^{M h_2(\delta)} \qquad (2.14)$$
and, if $p \in (0, 1)$ with $\delta \le p$,
$$\sum_{i \le \delta M} \binom{M}{i} p^i (1-p)^{M-i} \le e^{-M h_2(\delta \| p)},$$
where $h_2(\delta \| p) = \delta \ln \frac{\delta}{p} + (1 - \delta) \ln \frac{1 - \delta}{1 - p}$ is the binary divergence.

Proof: We have after some simple algebra that
$$e^{-h_2(\delta) M} = \delta^{\delta M} (1 - \delta)^{(1-\delta) M}.$$
If $\delta < 1/2$, then $\delta^k (1 - \delta)^{M-k}$ increases as $k$ decreases (since we are having more large terms and fewer small terms in the product) and hence if $i \le M\delta$,
$$\delta^{\delta M} (1 - \delta)^{(1-\delta) M} \le \delta^i (1 - \delta)^{M-i}.$$
Thus we have the inequalities
$$1 = \sum_{i=0}^{M} \binom{M}{i} \delta^i (1 - \delta)^{M-i} \ge \sum_{i \le \delta M} \binom{M}{i} \delta^i (1 - \delta)^{M-i} \ge e^{-M h_2(\delta)} \sum_{i \le \delta M} \binom{M}{i},$$
which completes the proof of (2.14). In a similar fashion we have that
$$e^{M h_2(\delta \| p)} = \left(\frac{\delta}{p}\right)^{\delta M} \left(\frac{1 - \delta}{1 - p}\right)^{(1-\delta) M}$$
and therefore after some algebra we have that if $i \le M\delta$ then
$$p^i (1 - p)^{M-i} \le \delta^i (1 - \delta)^{M-i} e^{-M h_2(\delta \| p)}.$$
Summing over $i \le M\delta$ and using the previous chain of inequalities then gives the second bound of the lemma. $\Box$
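Since the statement of the lemma was reconstructed above, a numerical spot check is worthwhile; the sketch below (ours) verifies (2.14) and the second bound for a few choices of $M$, $\delta$, and $p$.

```python
# Check the binomial-sum bound (2.14) and the binomial tail bound of
# Lemma 2.3.5 for several parameter choices.
from math import comb, exp, log

def h2(d):
    return -d * log(d) - (1 - d) * log(1 - d)

def h2_div(d, p):                       # binary divergence h2(delta || p)
    return d * log(d / p) + (1 - d) * log((1 - d) / (1 - p))

for M in (20, 50, 100):
    for delta in (0.1, 0.25, 0.5):
        tail = sum(comb(M, i) for i in range(int(M * delta) + 1))
        assert tail <= exp(M * h2(delta)) * (1 + 1e-12)            # (2.14)

        p = 0.6                                                    # needs p >= delta
        prob_tail = sum(comb(M, i) * p**i * (1 - p)**(M - i)
                        for i in range(int(M * delta) + 1))
        assert prob_tail <= exp(-M * h2_div(delta, p)) * (1 + 1e-12)
```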

The following is a technical but useful property of sample entropies. The proof follows Billingsley [15].

Lemma 2.3.6: Given a finite alphabet process $\{X_n\}$ (not necessarily stationary) with distribution $m$, let $X_k^n = (X_k, X_{k+1}, \cdots, X_{k+n-1})$ denote the random vectors giving a block of samples of dimension $n$ starting at time $k$. Then the random variables $-n^{-1} \ln m(X_k^n)$ are $m$-uniformly integrable (uniformly in $k$ and $n$).

Proof: For each nonnegative integer $r$ define the sets
$$E_r(k, n) = \left\{x : r \le -\frac{1}{n} \ln m(X_k^n(x)) < r + 1\right\}.$$
Summing over the thin cylinders that make up $E_r(k, n)$ gives
$$m(E_r(k, n)) \le \|A\|^n e^{-nr},$$
where the final step follows since there are at most $\|A\|^n$ possible $n$-tuples corresponding to thin cylinders in $E_r(k, n)$ and by construction each has probability no greater than $e^{-nr}$.

To prove uniform integrability we must show uniform convergence to 0 as $r \to \infty$ of the integral
$$\gamma_r(k, n) = \int_{\{x :\, -n^{-1} \ln m(X_k^n(x)) \ge r\}} \left(-\frac{1}{n} \ln m(X_k^n(x))\right) dm(x).$$
Breaking the region of integration into the disjoint sets $E_{r+i}(k, n)$, $i = 0, 1, 2, \cdots$, and bounding the integrand on $E_{r+i}(k, n)$ by $r + i + 1$ yields
$$\gamma_r(k, n) \le \sum_{i=0}^{\infty} (r + i + 1)\, m(E_{r+i}(k, n)) \le \sum_{i=0}^{\infty} (r + i + 1) \|A\|^n e^{-n(r+i)} = \sum_{i=0}^{\infty} (r + i + 1) e^{-n(r+i-\ln \|A\|)}.$$
Taking $r$ large enough so that $r > \ln \|A\|$, the exponential term is bounded above by the special case $n = 1$ and we have the bound
$$\gamma_r(k, n) \le \sum_{i=0}^{\infty} (r + i + 1) e^{-(r+i-\ln \|A\|)},$$

a bound which is finite and independent of $k$ and $n$. The sum can easily be shown to go to zero as $r \to \infty$ using standard summation formulas. (The exponential terms shrink faster than the linear terms grow.) $\Box$
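For a concrete feel (this check is ours, with $\|A\| = 4$ chosen arbitrarily), the bound can be evaluated numerically and seen to shrink as $r$ grows:

```python
# Evaluate sum_{i>=0} (r+i+1) e^{-(r+i-ln||A||)} for increasing r > ln||A||.
import math

def tail_bound(r, alphabet_size, terms=10_000):
    return sum((r + i + 1) * math.exp(-(r + i - math.log(alphabet_size)))
               for i in range(terms))

values = [tail_bound(r, 4) for r in (2, 5, 10, 20)]
assert all(a > b for a, b in zip(values, values[1:]))   # decreases toward 0
print(values)
```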

Variational Description of Divergence

Divergence has a variational characterization that is a fundamental property for its applications to large deviations theory [143] [31]. Although this theory will not be treated here, the basic result of this section provides an alternative description of divergence and hence of relative entropy that has intrinsic interest.

The basic result is originally due to Donsker and Varadhan [34].

Suppose now that $P$ and $M$ are two probability measures on a common discrete probability space, say $(\Omega, \mathcal{B})$. Given any real-valued random variable $\Phi$ defined on the probability space, we will be interested in the quantity
$$E_M e^{\Phi}, \qquad (2.16)$$
which is called the cumulant generating function of $\Phi$ with respect to $M$ and is related to the characteristic function of the random variable $\Phi$ as well as to the moment generating function and the operational transform of the random variable. The following theorem provides a variational description of divergence in terms of the cumulant generating function.

Theorem 2.3.2:
$$D(P\|M) = \sup_{\Phi} \left( E_P \Phi - \ln(E_M(e^{\Phi})) \right). \qquad (2.17)$$
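Before giving the proof, here is a small numerical illustration (ours): for the choice $\Phi = \ln(P/M)$ the expression inside the supremum equals $D(P\|M)$, while for arbitrary $\Phi$ it never exceeds it.

```python
# The Donsker-Varadhan variational formula, checked on random pmfs.
import numpy as np

rng = np.random.default_rng(5)
P = rng.dirichlet(np.ones(6))
M = rng.dirichlet(np.ones(6))
D = float(np.sum(P * np.log(P / M)))

def objective(phi):
    return float(np.sum(P * phi) - np.log(np.sum(M * np.exp(phi))))

assert np.isclose(objective(np.log(P / M)), D)   # the optimizing Phi attains D(P||M)
for _ in range(100):
    phi = rng.normal(size=6)                     # arbitrary bounded Phi
    assert objective(phi) <= D + 1e-12
```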

Proof: First consider the random variable $\Phi$ defined by $\Phi(\omega) = \ln(P(\omega)/M(\omega))$ and observe that
$$E_P \Phi - \ln(E_M(e^{\Phi})) = \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M(\omega)} - \ln\left(\sum_{\omega} M(\omega) \frac{P(\omega)}{M(\omega)}\right) = D(P\|M) - \ln 1 = D(P\|M).$$

This proves that the supremum over all Φ is no smaller than the divergence.

To prove the other half observe that for any bounded random variable $\Phi$,
$$E_P \Phi - \ln E_M(e^{\Phi}) = E_P\left( \ln \frac{e^{\Phi}}{E_M(e^{\Phi})} \right) = \sum_{\omega} P(\omega) \ln \frac{M^{\Phi}(\omega)}{M(\omega)},$$
where $M^{\Phi}$ is the probability measure defined by $M^{\Phi}(\omega) = M(\omega) e^{\Phi(\omega)} / E_M(e^{\Phi})$.

