5.2 Divergence


Given a probability space $(\Omega, \mathcal{B}, P)$ (not necessarily with finite alphabet) and another probability measure $M$ on the same space, define the divergence of $P$ with respect to $M$ by
$$D(P\|M) = \sup_{\mathcal{Q}} H_{P\|M}(\mathcal{Q}) = \sup_{f} D(P_f\|M_f), \qquad (5.1)$$
where the first supremum is over all finite measurable partitions $\mathcal{Q}$ of $\Omega$ and the second is over all finite alphabet measurements $f$ on $\Omega$. The two forms have the same interpretation: the divergence is the supremum of the relative entropies or divergences obtainable by finite alphabet codings of the sample space. The partition form is perhaps more common when considering divergence per se, but the measurement or code form is usually more intuitive when considering entropy and information. This section is devoted to developing the basic properties of divergence, all of which will yield immediate corollaries for the measures of information.
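To make the definition concrete, the following small numerical sketch computes $H_{P\|M}(\mathcal{Q})$ for successively finer partitions of a five-point space; the distributions and partitions are made up for illustration and are not taken from the text. The values increase under refinement, and in the finite alphabet case the partition into singletons attains the supremum in (5.1).

```python
import numpy as np

# Two made-up probability measures P and M on the five-point space {0,...,4}.
P = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
M = np.array([0.20, 0.20, 0.20, 0.20, 0.20])

def relative_entropy(P, M, partition):
    """H_{P||M}(Q) for a finite partition Q, given as a list of lists of points."""
    total = 0.0
    for cell in partition:
        p, m = P[cell].sum(), M[cell].sum()
        if p > 0:
            total += p * np.log(p / m)
    return total

coarse = [[0, 1], [2, 3, 4]]           # a two-cell coding of the space
finer  = [[0], [1], [2, 3], [4]]       # a refinement into four cells
finest = [[i] for i in range(5)]       # the partition into singletons

for name, q in [("coarse", coarse), ("finer", finer), ("finest", finest)]:
    print(name, relative_entropy(P, M, q))
# The values increase with refinement; the singleton partition attains the
# supremum in (5.1), i.e. the divergence D(P||M), for a finite alphabet.
```

For infinite alphabets no single finite partition need attain the supremum, which is why (5.1) is stated as a supremum rather than a maximum.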

The first result is a generalization of the divergence inequality that is a trivial consequence of the definition and the finite alphabet special case.

Lemma 5.2.1: The Divergence Inequality:

For any two probability measures $P$ and $M$, $D(P\|M) \geq 0$, with equality if and only if $P = M$.

Proof: Given any partition $\mathcal{Q}$, Theorem 2.3.1 implies that
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \geq 0$$
with equality if and only if $P(Q) = M(Q)$ for all atoms $Q$ of the partition. Since $D(P\|M)$ is the supremum over all such partitions, it is also nonnegative. It can be 0 only if $P$ and $M$ assign the same probabilities to all atoms in all partitions (the supremum is 0 only if the above sum is 0 for all partitions), and hence the divergence is 0 only if the measures are identical. $\Box$

As in the finite alphabet case, Lemma 5.2.1 justifies interpreting divergence as a form of distance or dissimilarity between two probability measures. It is not a true distance or metric in the mathematical sense since it is not symmetric and it does not satisfy the triangle inequality. Since it is nonnegative and equals zero only if two measures are identical, the divergence is a distortion measure as considered in information theory [51], which is a generalization of the notion of distance. This view often provides interpretations of the basic properties of divergence. We shall develop several relations between the divergence and other distance measures. The reader is referred to Csiszár [25] for a development of the distance-like properties of divergence.

The following two lemmas provide means for computing divergences and studying their behavior. The first result shows that the supremum can be confined to partitions with atoms in a generating field. This will provide a means for computing divergences by approximation or limits. The result is due to Dobrushin and is referred to as Dobrushin's theorem. The second result shows that the divergence can be evaluated as the expectation of an entropy density defined as the logarithm of the Radon-Nikodym derivative of one measure relative to the other. This result is due to Gelfand, Yaglom, and Perez. The proofs largely follow the translator's remarks in Chapter 2 of Pinsker [125] (which in turn follows Dobrushin [32]).

Lemma 5.2.2: Suppose that $(\Omega, \mathcal{B})$ is a measurable space where $\mathcal{B}$ is generated by a field $\mathcal{F}$, $\mathcal{B} = \sigma(\mathcal{F})$. Then if $P$ and $M$ are two probability measures on this space,
$$D(P\|M) = \sup_{\mathcal{Q} \subset \mathcal{F}} H_{P\|M}(\mathcal{Q}).$$

Proof: From the definition of divergence, the right-hand term above is clearly less than or equal to the divergence. If $P$ is not absolutely continuous with respect to $M$, then we can find a set $F$ such that $M(F) = 0$ but $P(F) \neq 0$, and hence the divergence is infinite. Approximating this event by a field element $F_0$ by applying Theorem 1.2.1 simultaneously to $M$ and $P$ will yield a partition $\{F_0, F_0^c\}$ for which the right-hand side of the previous equation is arbitrarily large. Hence the lemma holds for this case. Henceforth assume that $M \gg P$.

Fix $\epsilon > 0$ and suppose that a partition $\mathcal{Q} = \{Q_1, \cdots, Q_K\}$ yields a relative entropy close to the divergence, that is,
$$H_{P\|M}(\mathcal{Q}) = \sum_{i=1}^{K} P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} \geq D(P\|M) - \epsilon/2.$$
We will show that there is a partition, say $\mathcal{Q}'$, with atoms in $\mathcal{F}$ which has almost the same relative entropy, which will prove the lemma. First observe that $P(Q) \ln[P(Q)/M(Q)]$ is a continuous function of $P(Q)$ and $M(Q)$ in the sense that given $\epsilon/(2K)$ there is a sufficiently small $\delta > 0$ such that if $|P(Q) - P(Q')| \leq \delta$ and $|M(Q) - M(Q')| \leq \delta$, then provided $M(Q) \neq 0$
$$\left| P(Q) \ln \frac{P(Q)}{M(Q)} - P(Q') \ln \frac{P(Q')}{M(Q')} \right| \leq \frac{\epsilon}{2K}.$$
If we can find a partition $\mathcal{Q}'$ with atoms in $\mathcal{F}$ such that
$$|P(Q_i') - P(Q_i)| \leq \delta, \quad |M(Q_i') - M(Q_i)| \leq \delta, \quad i = 1, \cdots, K, \qquad (5.2)$$
then
$$|H_{P\|M}(\mathcal{Q}') - H_{P\|M}(\mathcal{Q})| \leq \sum_i \left| P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} - P(Q_i') \ln \frac{P(Q_i')}{M(Q_i')} \right| \leq K \frac{\epsilon}{2K} = \frac{\epsilon}{2}$$
and hence
$$H_{P\|M}(\mathcal{Q}') \geq D(P\|M) - \epsilon,$$
which will prove the lemma. To find the partition $\mathcal{Q}'$ satisfying (5.2), let $m$ be the mixture measure $P/2 + M/2$. As in the proof of Lemma 4.2.2, we can find a partition $\mathcal{Q}' \subset \mathcal{F}$ such that $m(Q_i \Delta Q_i') \leq K^2\gamma$ for $i = 1, 2, \cdots, K$, which implies that
$$P(Q_i \Delta Q_i') \leq 2K^2\gamma, \quad i = 1, 2, \cdots, K,$$
and
$$M(Q_i \Delta Q_i') \leq 2K^2\gamma, \quad i = 1, 2, \cdots, K.$$
If we now choose $\gamma$ so small that $2K^2\gamma \leq \delta$, then (5.2) and hence the lemma follow from the above and the fact that
$$|P(F) - P(G)| \leq P(F \Delta G). \qquad (5.3)\ \Box$$

Lemma 5.2.3: Given two probability measures $P$ and $M$ on a common measurable space $(\Omega, \mathcal{B})$, if $P$ is not absolutely continuous with respect to $M$, then $D(P\|M) = \infty$. If $P \ll M$ and $f = dP/dM$, then
$$D(P\|M) = \int \ln f(\omega)\, dP(\omega) = \int f(\omega) \ln f(\omega)\, dM(\omega).$$

The quantity $\ln f$ (if it exists) is called the entropy density or relative entropy density of $P$ with respect to $M$.

Proof: The first statement was shown in the proof of the previous lemma. If $P$ is not absolutely continuous with respect to $M$, then there is a set $Q$ such that $M(Q) = 0$ and $P(Q) > 0$. The relative entropy for the partition $\mathcal{Q} = \{Q, Q^c\}$ is then infinite, and hence so is the divergence.

Assume that $P \ll M$ and let $f = dP/dM$. Suppose that $Q$ is an event for which $M(Q) > 0$ and consider the conditional cumulative distribution function for the real random variable $f$ given that $\omega \in Q$:
$$F_Q(u) = \frac{M(\{f < u\} \cap Q)}{M(Q)}, \quad u \in (-\infty, \infty).$$
Observe that the expectation with respect to this distribution is
$$E_M(f|Q) = \frac{1}{M(Q)} \int_Q f\, dM = \frac{P(Q)}{M(Q)}.$$
Applying Jensen's inequality to the convex $\cup$ function $u \ln u$ yields the inequality
$$\frac{P(Q)}{M(Q)} \ln \frac{P(Q)}{M(Q)} = E_M(f|Q) \ln E_M(f|Q) \leq E_M(f \ln f\,|\,Q) = \frac{1}{M(Q)} \int_Q f \ln f\, dM.$$
We therefore have for any event $Q$ with $M(Q) > 0$ that
$$P(Q) \ln \frac{P(Q)}{M(Q)} \leq \int_Q f \ln f\, dM = \int_Q \ln f\, dP. \qquad (5.4)$$
Thus for any finite partition $\mathcal{Q} = \{Q_i\}$,
$$\sum_i P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} \leq \sum_i \int_{Q_i} \ln f\, dP = \int \ln f(\omega)\, dP(\omega),$$
where the inequality follows from (5.4) since $P(Q_i) \neq 0$ implies that $M(Q_i) \neq 0$ since $M \gg P$. This proves that
$$D(P\|M) \leq \int \ln f(\omega)\, dP(\omega).$$

To obtain the converse inequality, let $q_n$ denote the asymptotically accurate quantizers of Section 1.6. From (1.6.3)
$$\int \ln f(\omega)\, dP(\omega) = \lim_{n \to \infty} \int q_n(\ln f(\omega))\, dP(\omega).$$
For fixed $n$ the quantizer $q_n$ induces a partition $\mathcal{Q}$ of $\Omega$ into $2n2^n + 1$ atoms. In particular, there are $2n2^n - 1$ "good" atoms such that for $\omega, \omega'$ inside a common good atom $|\ln f(\omega) - \ln f(\omega')| \leq 2^{-(n-1)}$; the two remaining atoms are $\{\ln f \geq n\}$ and $\{\ln f < -n\}$. For this partition
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} = \sum_{\mathrm{good}\ Q} P(Q) \ln \frac{P(Q)}{M(Q)} + P(\ln f \geq n) \ln \frac{P(\ln f \geq n)}{M(\ln f \geq n)} + P(\ln f < -n) \ln \frac{P(\ln f < -n)}{M(\ln f < -n)}.$$
The rightmost two terms above are bounded below as
$$P(\ln f \geq n) \ln \frac{P(\ln f \geq n)}{M(\ln f \geq n)} + P(\ln f < -n) \ln \frac{P(\ln f < -n)}{M(\ln f < -n)} \geq P(\ln f \geq n) \ln P(\ln f \geq n) + P(\ln f < -n) \ln P(\ln f < -n).$$
Since $P(\ln f \geq n)$ and $P(\ln f < -n) \to 0$ as $n \to \infty$ and since $x \ln x \to 0$ as $x \to 0$, given $\epsilon$ we can choose $n$ large enough to ensure that the above term is greater than $-\epsilon$. This yields the lower bound
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \geq \sum_{\mathrm{good}\ Q} P(Q) \ln \frac{P(Q)}{M(Q)} - \epsilon.$$
Fix a good atom $Q$ and define $\bar h = \sup_{\omega \in Q} \ln f(\omega)$ and $\underline h = \inf_{\omega \in Q} \ln f(\omega)$ and note that by definition of the good atoms
$$\bar h - \underline h \leq 2^{-(n-1)}.$$
We now have that
$$P(Q)\, \bar h \geq \int_Q \ln f(\omega)\, dP(\omega)$$
and
$$M(Q)\, e^{\underline h} \leq \int_Q f(\omega)\, dM(\omega) = P(Q).$$
Combining these we have that
$$P(Q) \ln \frac{P(Q)}{M(Q)} \geq P(Q) \ln \frac{P(Q)}{P(Q) e^{-\underline h}} = P(Q)\, \underline h \geq P(Q)(\bar h - 2^{-(n-1)}) \geq \int_Q \ln f(\omega)\, dP(\omega) - P(Q)\, 2^{-(n-1)}.$$
Therefore
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \geq \sum_{\mathrm{good}\ Q} P(Q) \ln \frac{P(Q)}{M(Q)} - \epsilon \geq \sum_{\mathrm{good}\ Q} \int_Q \ln f(\omega)\, dP - 2^{-(n-1)} - \epsilon = \int_{\omega: |\ln f(\omega)| \leq n} \ln f(\omega)\, dP(\omega) - 2^{-(n-1)} - \epsilon.$$
Since this is true for arbitrarily large $n$ and arbitrarily small $\epsilon$,
$$D(P\|M) \geq \int \ln f(\omega)\, dP(\omega),$$
completing the proof of the lemma. $\Box$

It is worthwhile to point out two examples for the previous lemma. If $P$ and $M$ are discrete measures with corresponding pmf's $p$ and $q$, then the Radon-Nikodym derivative is simply $dP/dM(\omega) = p(\omega)/q(\omega)$ and the lemma gives the known formula for the discrete case. If $P$ and $M$ are both probability measures on Euclidean space $R^n$ and if both measures are absolutely continuous with respect to Lebesgue measure, then there exists a density $f$ called a probability density function or pdf such that
$$P(F) = \int_F f(x)\, dx,$$
where $dx$ means $dm(x)$ with $m$ Lebesgue measure. (Lebesgue measure assigns each set its volume.) Similarly, there is a pdf $g$ for $M$. In this case,
$$D(P\|M) = \int_{R^n} f(x) \ln \frac{f(x)}{g(x)}\, dx. \qquad (5.5)$$
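As a sanity check on the two special cases, the sketch below evaluates the discrete formula directly and evaluates (5.5) for two Gaussian measures by numerical integration; the particular distributions are illustrative assumptions, and the closed-form Gaussian divergence is quoted only for comparison.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Discrete case: D(P||M) = sum over the alphabet of p * ln(p/q).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
D_discrete = np.sum(p * np.log(p / q))

# Continuous case (5.5): P = N(0,1), M = N(1, 2^2) on the real line,
# with densities f and g with respect to Lebesgue measure.
f = norm(loc=0.0, scale=1.0).pdf
g = norm(loc=1.0, scale=2.0).pdf
D_pdf, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -20, 20)

# Closed form for Gaussians: ln(s2/s1) + (s1^2 + (m1 - m2)^2)/(2 s2^2) - 1/2.
D_closed = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(D_discrete, D_pdf, D_closed)   # D_pdf and D_closed agree
```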

The following immediate corollary to the previous lemma provides a formula that is occasionally useful for computing divergences.

Corollary 5.2.1: Given three probability distributions $M \gg Q \gg P$, then
$$D(P\|M) = D(P\|Q) + E_P\left(\ln \frac{dQ}{dM}\right).$$

Proof: From the chain rule for Radon-Nikodym derivatives (e.g., Lemma 5.7.3 of [50])
$$\frac{dP}{dM} = \frac{dP}{dQ}\, \frac{dQ}{dM},$$
and taking logarithms and then expectations with respect to $P$ using the previous lemma yields the corollary. $\Box$

The next result is a technical result that shows that given a mapping on a space, the divergence between the induced distributions can be computed from the restrictions of the original measures to the sub-$\sigma$-field induced by the mapping. As part of the result, the relation between the induced Radon-Nikodym derivative and the original derivative is made explicit.

Recall that if $P$ is a probability measure on a measurable space $(\Omega, \mathcal{B})$ and if $\mathcal{F}$ is a sub-$\sigma$-field of $\mathcal{B}$, then the restriction $P_{\mathcal{F}}$ of $P$ to $\mathcal{F}$ is the probability measure on the measurable space $(\Omega, \mathcal{F})$ defined by $P_{\mathcal{F}}(G) = P(G)$ for all $G \in \mathcal{F}$. As the next lemma shows, we can use either the probability measures on the new space or the restrictions of the probability measures on the old space to compute the divergence. This motivates considering the properties of divergences of restrictions of measures, a useful generality in that it simplifies proofs. The following lemma can be viewed as a bookkeeping result relating the divergence and the Radon-Nikodym derivatives in the two spaces.

Lemma 5.2.4: (a) Suppose that $M, P$ are two probability measures on a space $(\Omega, \mathcal{B})$ and that $X$ is a measurement mapping this space into $(A, \mathcal{A})$. Let $P_X$ and $M_X$ denote the induced distributions (measures on $(A, \mathcal{A})$) and let $P_{\sigma(X)}$ and $M_{\sigma(X)}$ denote the restrictions of $P$ and $M$ to $\sigma(X)$, the sub-$\sigma$-field of $\mathcal{B}$ generated by $X$. Then
$$D(P_X\|M_X) = D(P_{\sigma(X)}\|M_{\sigma(X)}).$$
If the Radon-Nikodym derivative $f = dP_X/dM_X$ exists (e.g., the above divergence is finite), then define the function $f(X): \Omega \to [0, \infty)$ by
$$f(X)(\omega) = f(X(\omega)) = \frac{dP_X}{dM_X}(X(\omega));$$
then with probability 1 under both $M$ and $P$
$$f(X) = \frac{dP_{\sigma(X)}}{dM_{\sigma(X)}}.$$
(b) Suppose that $P \ll M$. Then for any sub-$\sigma$-field $\mathcal{F}$ of $\mathcal{B}$, we have that
$$\frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}} = E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right).$$
Thus the Radon-Nikodym derivative for the restrictions is just the conditional expectation of the original Radon-Nikodym derivative.

Proof: The proof is mostly algebra: $D(P_{\sigma(X)}\|M_{\sigma(X)})$ is the supremum over all finite partitions $\mathcal{Q}$ with elements in $\sigma(X)$ of the relative entropy $H_{P_{\sigma(X)}\|M_{\sigma(X)}}(\mathcal{Q})$.

Each element $Q \in \mathcal{Q} \subset \sigma(X)$ corresponds to a unique set $Q' \in \mathcal{A}$ via $Q = X^{-1}(Q')$, and hence to each $\mathcal{Q} \subset \sigma(X)$ there is a corresponding partition $\mathcal{Q}' \subset \mathcal{A}$. The corresponding relative entropies are equal, however, since
$$H_{P_X\|M_X}(\mathcal{Q}') = \sum_{Q' \in \mathcal{Q}'} P_X(Q') \ln \frac{P_X(Q')}{M_X(Q')} = \sum_{Q \in \mathcal{Q}} P_{\sigma(X)}(Q) \ln \frac{P_{\sigma(X)}(Q)}{M_{\sigma(X)}(Q)} = H_{P_{\sigma(X)}\|M_{\sigma(X)}}(\mathcal{Q}).$$
Taking the supremum over the partitions proves that the divergences are equal.

If the derivative is $f = dP_X/dM_X$, then $f(X)$ is measurable since it is a measurable function of a measurable function. In addition, it is measurable with respect to $\sigma(X)$ since it depends on $\omega$ only through $X(\omega)$. For any $F \in \sigma(X)$ there is a $G \in \mathcal{A}$ with $F = X^{-1}(G)$, and
$$\int_F f(X)\, dM_{\sigma(X)} = \int_G f\, dM_X$$
from the change of variables formula (see, e.g., Lemma 4.4.7 of [50]). Thus
$$\int_F f(X)\, dM_{\sigma(X)} = P_X(G) = P_{\sigma(X)}(X^{-1}(G)) = P_{\sigma(X)}(F),$$
which proves that $f(X)$ is indeed the claimed derivative with probability 1 under $M$ and hence also under $P$.

The variation quoted in part (b) is proved by direct verification using iterated expectation. If $G \in \mathcal{F}$, then since the integrand $E_M(dP/dM\,|\,\mathcal{F})$ is $\mathcal{F}$-measurable (see, e.g., Lemma 5.3.1 of [50]),
$$\int_G E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right) dM_{\mathcal{F}} = E_M\left(1_G\, E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right)\right).$$
Invoking iterated expectation (e.g., Corollary 5.9.3 of [50]) yields
$$E_M\left(1_G\, E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right)\right) = E_M\left(1_G\, \frac{dP}{dM}\right) = P(G) = P_{\mathcal{F}}(G),$$
proving that the conditional expectation is the claimed derivative. $\Box$

Part (b) of the lemma was pointed out to the author by Paul Algoet.
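Part (b) is easy to visualize on a finite space, where a sub-$\sigma$-field is generated by a coarse partition and conditional expectation is a per-cell average. The following sketch (the pmfs and the partition are made-up illustrations) checks that the ratio of restricted probabilities on each cell coincides with the conditional expectation of $dP/dM$ under $M$.

```python
import numpy as np

# Finite sample space {0,...,5}; two made-up measures with M >> P.
P = np.array([0.10, 0.20, 0.05, 0.25, 0.30, 0.10])
M = np.array([0.15, 0.15, 0.20, 0.20, 0.15, 0.15])
f = P / M                              # dP/dM, pointwise

# Sub-sigma-field F generated by the coarse partition {0,1}, {2,3}, {4,5}.
cells = [[0, 1], [2, 3], [4, 5]]

for cell in cells:
    # Radon-Nikodym derivative of the restrictions: ratio of cell probabilities.
    rn_restriction = P[cell].sum() / M[cell].sum()
    # Conditional expectation of dP/dM given the cell, under M.
    cond_exp = (M[cell] * f[cell]).sum() / M[cell].sum()
    print(cell, rn_restriction, cond_exp)   # the two values agree (Lemma 5.2.4(b))
```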

Having argued above that restrictions of measures are useful when finding divergences of random variables, we provide a key trick for treating such restrictions.

Lemma 5.2.5: Let $M \gg P$ be two measures on a space $(\Omega, \mathcal{B})$. Suppose that $\mathcal{F}$ is a sub-$\sigma$-field and that $P_{\mathcal{F}}$ and $M_{\mathcal{F}}$ are the restrictions of $P$ and $M$ to $\mathcal{F}$, and define the measure $S$ on $(\Omega, \mathcal{B})$ by
$$S(F) = \int_F \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\, dM, \quad F \in \mathcal{B}.$$
Then $S$ is a probability measure with $M \gg S \gg P$,
$$\frac{dP}{dS} = \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}},$$
and
$$D(P\|M) = D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + D(P\|S).$$

Proof: If $M \gg P$, then clearly $M_{\mathcal{F}} \gg P_{\mathcal{F}}$ and hence the appropriate Radon-Nikodym derivatives exist. Consider the set function $S$ defined above, $S(F) = E_M(1_F\, dP_{\mathcal{F}}/dM_{\mathcal{F}})$ for $F \in \mathcal{B}$. Observe that for $F \in \mathcal{F}$, Lemma 5.2.4(b) and iterated expectation imply that
$$S(F) = E_M\left(1_F\, E_M\left(\frac{dP}{dM}\,\Big|\,\mathcal{F}\right)\right) = E_M\left(E_M\left(1_F\, \frac{dP}{dM}\,\Big|\,\mathcal{F}\right)\right) = E_M\left(1_F\, \frac{dP}{dM}\right) = P(F) = P_{\mathcal{F}}(F), \quad F \in \mathcal{F},$$
and hence in particular that $S(\Omega)$ is 1 so that $dP_{\mathcal{F}}/dM_{\mathcal{F}}$ is integrable and $S$ is indeed a probability measure on $(\Omega, \mathcal{B})$. (In addition, the restriction of $S$ to $\mathcal{F}$ is just $P_{\mathcal{F}}$.) Define
$$g = \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}}.$$
This is well defined since with $M$ probability 1, if the denominator is 0, then so is the numerator. Given $F \in \mathcal{B}$ the Radon-Nikodym theorem (e.g., Theorem 5.6.1 of [50]) implies that
$$\int_F g\, dS = \int_F g\, \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\, dM = \int_F \frac{dP}{dM}\, dM = P(F),$$
proving the first part of the lemma. The second part follows by direct verification:
$$D(P\|M) = \int \ln \frac{dP}{dM}\, dP = \int \ln \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\, dP + \int \ln \frac{dP/dM}{dP_{\mathcal{F}}/dM_{\mathcal{F}}}\, dP$$
$$= \int \ln \frac{dP_{\mathcal{F}}}{dM_{\mathcal{F}}}\, dP_{\mathcal{F}} + \int \ln \frac{dP}{dS}\, dP = D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + D(P\|S). \qquad \Box$$
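On a finite space the measure $S$ of the lemma can be written out explicitly, which gives a quick numerical check of the decomposition $D(P\|M) = D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + D(P\|S)$; the pmfs and the partition below are illustrative assumptions, not from the text.

```python
import numpy as np

P = np.array([0.10, 0.20, 0.05, 0.25, 0.30, 0.10])
M = np.array([0.15, 0.15, 0.20, 0.20, 0.15, 0.15])
cells = [[0, 1], [2, 3], [4, 5]]       # atoms generating the sub-sigma-field F

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

# dP_F/dM_F is constant on each cell; S(F) = integral over F of (dP_F/dM_F) dM.
PF = np.array([P[c].sum() for c in cells])
MF = np.array([M[c].sum() for c in cells])
S = np.empty_like(M)
for c, ratio in zip(cells, PF / MF):
    S[c] = ratio * M[c]

D_PM = kl(P, M)
D_restriction = kl(PF, MF)             # D(P_F || M_F), computed on the cells
D_PS = kl(P, S)
print(D_PM, D_restriction + D_PS)      # the two numbers agree (Lemma 5.2.5)
```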

The two previous lemmas and the divergence inequality immediately yield the following result for $M \gg P$. If $M$ does not dominate $P$, then the result is trivial.

Corollary 5.2.2: Given two measures $M, P$ on a space $(\Omega, \mathcal{B})$ and a sub-$\sigma$-field $\mathcal{F}$ of $\mathcal{B}$, then
$$D(P\|M) \geq D(P_{\mathcal{F}}\|M_{\mathcal{F}}).$$
If $f$ is a measurement on the given space, then
$$D(P\|M) \geq D(P_f\|M_f).$$

The result is obvious for finite fields F or finite alphabet measurements f from the definition of divergence. The general result for arbitrary measurable functions could also have been proved by combining the corresponding finite alphabet result of Corollary 2.3.1 and an approximation technique. As above, however, we will occasionally get results comparing the divergences of measures and their restrictions by combining the trick of Lemma 5.2.5 with a result for a single divergence.

The following corollary follows immediately from Lemma 5.2.2 since the union of a sequence of asymptotically generating sub-σ-fields is a generating field.

Corollary 5.2.3: Suppose that $M, P$ are probability measures on a measurable space $(\Omega, \mathcal{B})$, that $\mathcal{F}_n$ is an asymptotically generating sequence of sub-$\sigma$-fields, and let $P_n$ and $M_n$ denote the restrictions of $P$ and $M$ to $\mathcal{F}_n$ (e.g., $P_n = P_{\mathcal{F}_n}$). Then
$$D(P_n\|M_n) \uparrow D(P\|M).$$

There are two useful special cases of the above corollary which follow immediately by specifying a particular sequence of increasing sub-$\sigma$-fields. The following two corollaries give these results.

Corollary 5.2.4: Let $M, P$ be two probability measures on a measurable space $(\Omega, \mathcal{B})$. Suppose that $f$ is an $A$-valued measurement on the space. Assume that $q_n: A \to A_n$ is a sequence of measurable mappings into finite sets $A_n$ with the property that the sequence of fields $\mathcal{F}_n = \mathcal{F}(q_n(f))$ generated by the sets $\{q_n^{-1}(a);\ a \in A_n\}$ asymptotically generates $\sigma(f)$. (For example, if the original space is standard let $\mathcal{F}_n$ be a basis and let $q_n$ map the points in the $i$th atom of $\mathcal{F}_n$ into $i$.) Then
$$D(P_f\|M_f) = \lim_{n \to \infty} D(P_{q_n(f)}\|M_{q_n(f)}).$$

The corollary states that the divergence between two distributions of a random variable can be found as a limit of quantized versions of the random variable. Note that the limit could also be written as
$$\lim_{n \to \infty} H_{P_f\|M_f}(q_n).$$
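As an illustration, the sketch below quantizes two Gaussian distributions at increasing resolution and watches the quantized divergences climb toward the closed-form value of $D(P_f\|M_f)$. The specific quantizer construction in the code (uniform bins of width $2^{-n}$ on $[-n, n)$ plus two overload bins) is an assumption meant to mimic the quantizers of Section 1.6, not a quotation of them.

```python
import numpy as np
from scipy.stats import norm

# P = N(0,1), M = N(1, 2^2); closed-form divergence for reference.
P, M = norm(0.0, 1.0), norm(1.0, 2.0)
D_true = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

def quantized_divergence(n):
    """Divergence between the distributions of a quantized version of the variable:
    uniform bins of width 2**-n covering [-n, n], plus two overload bins."""
    edges = np.concatenate(([-np.inf], np.linspace(-n, n, int(2 * n * 2 ** n) + 1), [np.inf]))
    p = np.diff(P.cdf(edges))
    m = np.diff(M.cdf(edges))
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / m[mask]))

for n in (1, 2, 3, 4, 5):
    print(n, quantized_divergence(n), "->", D_true)
# Each partition refines the previous one, so the values increase with n
# and approach D(P_f || M_f), as Corollaries 5.2.3 and 5.2.4 assert.
```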

In the next corollary we consider increasing sequences of random variables instead of increasing sequences of quantizers, that is, more random variables (which need not be finite alphabet) instead of ever finer quantizers. The corollary follows immediately from Corollary 5.2.3 and Lemma 5.2.4.

Corollary 5.2.5: Suppose that $M$ and $P$ are measures on the sequence space corresponding to outcomes of a sequence of random variables $X_0, X_1, \cdots$ with alphabet $A$. Let $\mathcal{F}_n = \sigma(X_0, \cdots, X_{n-1})$, which asymptotically generates the $\sigma$-field $\sigma(X_0, X_1, \cdots)$. Then
$$\lim_{n \to \infty} D(P_{X^n}\|M_{X^n}) = D(P\|M).$$

We now develop two fundamental inequalities involving entropy densities and divergence. The first inequality is from Pinsker [125]. The second is an improvement of an inequality of Pinsker [125] by Csiszár [24] and Kullback [91].

The second inequality is more useful when the divergence is small. Coupling these inequalities with the trick of Lemma 5.2.5 provides a simple generalization of an inequality of [48] and will provide easy proofs of L1 convergence results for entropy and information densities. A key step in the proof involves a notion of distance between probability measures and is of interest in its own right.

Given two probability measures $M, P$ on a common measurable space $(\Omega, \mathcal{B})$, the variational distance between them is defined by
$$d(P, M) \equiv \sup_{\mathcal{Q}} \sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)|,$$
where the supremum is over all finite measurable partitions. We will proceed by stating first the end goal, the two inequalities involving divergence, as a lemma, and then state two lemmas giving the basic required properties of the variational distance. The lemmas will be proved in a different order.

Lemma 5.2.6: Let $P$ and $M$ be two measures on a common probability space $(\Omega, \mathcal{B})$ with $P \ll M$. Let $f = dP/dM$ be the Radon-Nikodym derivative and let $h = \ln f$ be the entropy density. Then
$$D(P\|M) \leq \int |h|\, dP \leq D(P\|M) + \frac{2}{e}, \qquad (5.7)$$
$$\int |h|\, dP \leq D(P\|M) + \sqrt{2 D(P\|M)}. \qquad (5.8)$$

Lemma 5.2.7: Given two probability measures $M, P$ on a common measurable space $(\Omega, \mathcal{B})$, the variational distance is given by
$$d(P, M) = 2 \sup_{F \in \mathcal{B}} |P(F) - M(F)|. \qquad (5.9)$$
Furthermore, if $S$ is a measure for which $P \ll S$ and $M \ll S$ ($S = (P + M)/2$, for example), then also
$$d(P, M) = \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS \qquad (5.10)$$
and the supremum in (5.9) is achieved by the set
$$F = \left\{ \omega : \frac{dP}{dS}(\omega) > \frac{dM}{dS}(\omega) \right\}.$$

Lemma 5.2.8:
$$d(P, M) \leq \sqrt{2 D(P\|M)}.$$
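Before turning to the proofs, a small numerical check of the three lemmas on a finite space (the pmfs are made up for illustration): the variational distance can be computed both as an $L_1$ sum, which is what (5.10) reduces to for discrete measures, and by brute force over all events as in (5.9), and the bounds (5.7), (5.8), and $d(P, M) \leq \sqrt{2 D(P\|M)}$ can be verified directly.

```python
import numpy as np
from itertools import chain, combinations

P = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
M = np.array([0.20, 0.20, 0.20, 0.20, 0.20])

D = np.sum(P * np.log(P / M))                      # divergence D(P||M)

# (5.10) for discrete measures: d(P,M) = sum over points of |p - m|.
d_L1 = np.sum(np.abs(P - M))

# (5.9): d(P,M) = 2 sup_F |P(F) - M(F)|, checked over all 2^5 events F.
subsets = chain.from_iterable(combinations(range(5), r) for r in range(6))
d_sup = 2 * max(abs(P[list(s)].sum() - M[list(s)].sum()) for s in subsets)

h = np.log(P / M)                                  # entropy density at each point
print(d_L1, d_sup)                                 # equal, by Lemma 5.2.7
print(d_L1, "<=", np.sqrt(2 * D))                  # Lemma 5.2.8
print(np.sum(P * np.abs(h)), "<=", D + 2 / np.e)   # right-hand side of (5.7)
print(np.sum(P * np.abs(h)), "<=", D + np.sqrt(2 * D))   # (5.8)
```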

Proof of Lemma 5.2.7: First observe that for any set $F$ we have for the partition $\mathcal{Q} = \{F, F^c\}$ that
$$d(P, M) \geq \sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = 2|P(F) - M(F)|$$
and hence
$$d(P, M) \geq 2 \sup_{F \in \mathcal{B}} |P(F) - M(F)|.$$

Conversely, suppose that $\mathcal{Q}$ is a partition which approximately yields the variational distance, e.g.,
$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| \geq d(P, M) - \epsilon$$
for $\epsilon > 0$. Define a set $F$ as the union of all of the $Q$ in $\mathcal{Q}$ for which $P(Q) \geq M(Q)$, and we have that
$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = P(F) - M(F) + M(F^c) - P(F^c) = 2(P(F) - M(F))$$
and hence
$$d(P, M) - \epsilon \leq \sup_{F \in \mathcal{B}} 2|P(F) - M(F)|.$$
Since $\epsilon$ is arbitrary, this proves the first statement of the lemma.

Next suppose that a measure $S$ dominating both $P$ and $M$ exists and define the set
$$F = \left\{ \omega : \frac{dP}{dS}(\omega) > \frac{dM}{dS}(\omega) \right\}$$
and observe that
$$\int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS = \int_F \left( \frac{dP}{dS} - \frac{dM}{dS} \right) dS - \int_{F^c} \left( \frac{dP}{dS} - \frac{dM}{dS} \right) dS = P(F) - M(F) - (P(F^c) - M(F^c)) = 2(P(F) - M(F)).$$
From the definition of $F$, however, $P(F) = \int_F (dP/dS)\, dS \geq \int_F (dM/dS)\, dS = M(F)$, and hence by (5.9)
$$\int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS = 2|P(F) - M(F)| \leq d(P, M).$$
To prove the reverse inequality, assume that $\mathcal{Q}$ approximately yields the variational distance, that is, for $\epsilon > 0$ we have
$$d(P, M) - \epsilon \leq \sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = \sum_{Q \in \mathcal{Q}} \left| \int_Q \left( \frac{dP}{dS} - \frac{dM}{dS} \right) dS \right| \leq \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS,$$
which, since $\epsilon$ is arbitrary, proves that
$$d(P, M) \leq \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS.$$
Combining this with the earlier inequality proves (5.10). We have already seen that this upper bound is actually achieved with the given choice of $F$, which completes the proof of the lemma. $\Box$

Proof of Lemma 5.2.8: Assume that $M \gg P$ since the result is trivial otherwise because the right-hand side is infinite. The inequality will follow from the first statement of Lemma 5.2.7 and the following inequality: Given $1 \geq p, m \geq 0$,
$$p \ln \frac{p}{m} + (1 - p) \ln \frac{1 - p}{1 - m} - 2(p - m)^2 \geq 0. \qquad (5.11)$$
To see this, suppose the truth of (5.11). Since $F$ can be chosen so that $2(P(F) - M(F))$ is arbitrarily close to $d(P, M)$, given $\epsilon > 0$ choose a set $F$ such that
$$2(P(F) - M(F))^2 \geq \frac{d(P, M)^2}{2} - \epsilon.$$
Then, writing $p = P(F)$ and $m = M(F)$ and using the definition of divergence for the partition $\{F, F^c\}$,
$$D(P\|M) - \frac{d(P, M)^2}{2} \geq p \ln \frac{p}{m} + (1 - p) \ln \frac{1 - p}{1 - m} - 2(p - m)^2 - \epsilon.$$
If (5.11) holds, then the right-hand side is bounded below by $-\epsilon$, which proves the lemma since $\epsilon$ is arbitrarily small. To prove (5.11) observe that the left-hand side equals zero for $p = m$, has a negative derivative with respect to $m$ for $m < p$, and has a positive derivative with respect to $m$ for $m > p$. (The derivative with respect to $m$ is $(m - p)[1 - 4m(1 - m)]/[m(1 - m)]$.) Thus the left-hand side of (5.11) decreases to its minimum value of 0 as $m$ tends to $p$ from above or below. $\Box$

Proof of Lemma 5.2.6: The magnitude entropy density can be written as
$$|h(\omega)| = h(\omega) + 2h^-(\omega), \qquad (5.12)$$
where $a^- = -\min(a, 0)$. This relation immediately gives the trivial left-hand inequality of (5.7). The right-hand inequality follows from the fact that
$$\int h^-\, dP = \int f\, [\ln f]^-\, dM$$
and the elementary inequality $a \ln a \geq -1/e$.

The second inequality will follow from (5.12) if we can show that
$$2 \int h^-\, dP \leq \sqrt{2 D(P\|M)}.$$
Let $F$ denote the set $\{h \leq 0\}$ and we have from (5.4) that
$$2 \int h^-\, dP = -2 \int_F h\, dP \leq -2 P(F) \ln \frac{P(F)}{M(F)},$$
and hence using the inequality $\ln x \leq x - 1$ and Lemmas 5.2.7 and 5.2.8,
$$2 \int h^-\, dP \leq 2 P(F) \ln \frac{M(F)}{P(F)} \leq 2(M(F) - P(F)) \leq d(P, M) \leq \sqrt{2 D(P\|M)},$$
completing the proof. $\Box$

Combining Lemmas 5.2.6 and 5.2.5 yields the following corollary, which generalizes Lemma 2 of [54]:

Corollary 5.2.6: Let $P$ and $M$ be two measures on a space $(\Omega, \mathcal{B})$. Suppose that $\mathcal{F}$ is a sub-$\sigma$-field and that $P_{\mathcal{F}}$ and $M_{\mathcal{F}}$ are the restrictions of $P$ and $M$ to $\mathcal{F}$. Assume that $M \gg P$. Define the entropy densities $h = \ln dP/dM$ and $h' = \ln dP_{\mathcal{F}}/dM_{\mathcal{F}}$. Then
$$\int |h - h'|\, dP \leq D(P\|M) - D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + \frac{2}{e}, \qquad (5.13)$$
and
$$\int |h - h'|\, dP \leq D(P\|M) - D(P_{\mathcal{F}}\|M_{\mathcal{F}}) + \sqrt{2 D(P\|M) - 2 D(P_{\mathcal{F}}\|M_{\mathcal{F}})}. \qquad (5.14)$$
Proof: Choose the measure $S$ as in Lemma 5.2.5 and then apply Lemma 5.2.6 with $S$ replacing $M$. $\Box$
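Continuing the finite example used for Lemma 5.2.5, the bounds (5.13) and (5.14) can be checked directly, since on a finite space $h'$ is constant on each atom of $\mathcal{F}$ and equals the log-ratio of the atom probabilities; the pmfs and the partition are again illustrative assumptions.

```python
import numpy as np

P = np.array([0.10, 0.20, 0.05, 0.25, 0.30, 0.10])
M = np.array([0.15, 0.15, 0.20, 0.20, 0.15, 0.15])
cells = [[0, 1], [2, 3], [4, 5]]                 # atoms of the sub-sigma-field F

h = np.log(P / M)                                # entropy density ln dP/dM
PF = np.array([P[c].sum() for c in cells])
MF = np.array([M[c].sum() for c in cells])
h_prime = np.empty_like(h)                       # ln dP_F/dM_F, constant on each cell
for c, r in zip(cells, np.log(PF / MF)):
    h_prime[c] = r

D = np.sum(P * h)
DF = np.sum(PF * np.log(PF / MF))
lhs = np.sum(P * np.abs(h - h_prime))
print(lhs, "<=", D - DF + 2 / np.e)              # (5.13)
print(lhs, "<=", D - DF + np.sqrt(2 * (D - DF))) # (5.14)
```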


Variational Description of Divergence

As in the discrete case, divergence has a variational characterization that is a fundamental property for its applications to large deviations theory [143] [31].

We again take a detour to state and prove the property without delving into its applications.

Suppose now that $P$ and $M$ are two probability measures on a common probability space, say $(\Omega, \mathcal{B})$, such that $M \gg P$ and hence the density
$$f = \frac{dP}{dM}$$
is well defined. Suppose that $\Phi$ is a real-valued random variable defined on the same space, which we explicitly require to be finite-valued (it cannot assume $\infty$ as a value) and to have finite cumulant generating function:
$$E_M(e^\Phi) < \infty.$$
Then we can define a probability measure $M_\Phi$ by
$$M_\Phi(F) = \int_F \frac{e^\Phi}{E_M(e^\Phi)}\, dM \qquad (5.15)$$
and observe immediately that by construction $M \gg M_\Phi$ and
$$\frac{dM_\Phi}{dM} = \frac{e^\Phi}{E_M(e^\Phi)}.$$
The measure $M_\Phi$ is called a "tilted" distribution. Furthermore, by construction $dM_\Phi/dM \neq 0$ and hence we can write
$$\frac{dM}{dM_\Phi} = \left( \frac{dM_\Phi}{dM} \right)^{-1} = e^{-\Phi}\, E_M(e^\Phi),$$
so that $M$ and $M_\Phi$ are mutually absolutely continuous.

We are now ready to state and prove the principal result of this section, a variational characterization of divergence.

Theorem 5.2.1: Suppose that $M \gg P$. Then
$$D(P\|M) = \sup_{\Phi} \left( E_P\Phi - \ln E_M(e^\Phi) \right), \qquad (5.16)$$
where the supremum is over all random variables $\Phi$ for which $\Phi$ is finite-valued and $e^\Phi$ is $M$-integrable.

Proof: First consider the random variable $\Phi$ defined by $\Phi = \ln f$ and observe that
$$E_P\Phi - \ln E_M(e^\Phi) = \int \ln f\, dP - \ln \int f\, dM = D(P\|M) - \ln \int dP = D(P\|M).$$
This proves that the supremum over all $\Phi$ is no smaller than the divergence. To prove the other half observe that for any $\Phi$,
$$D(P\|M) - \left( E_P\Phi - \ln E_M(e^\Phi) \right) = E_P\left( \ln \frac{dP/dM}{dM_\Phi/dM} \right),$$
where $M_\Phi$ is the tilted distribution constructed above. Since $M \gg M_\Phi \gg P$, we have from the chain rule for Radon-Nikodym derivatives that
$$D(P\|M) - \left( E_P\Phi - \ln E_M(e^\Phi) \right) = E_P \ln \frac{dP}{dM_\Phi} = D(P\|M_\Phi) \geq 0$$
from the divergence inequality, which completes the proof. Note that equality holds and the supremum is achieved if and only if $M_\Phi = P$. $\Box$
