

Self-Modeling as Support for Fault Localization

8.1 Entropy, relative entropy and mutual information
    8.1.1 Entropy and conditional entropy
    8.1.2 Relative entropy and mutual information
    8.1.3 Gain function for fault localization
8.2 Implementation and evaluation
    8.2.1 Entropy reduction as more measurements are collected
    8.2.2 Fault localization in an IMS network

The aim of this chapter is to demonstrate the relevance of the exploration strategy explained in the previous chapter. We review some standard results and definitions related to entropy, relative entropy, mutual information, and the gain function for fault localization. These concepts and their properties are useful for analysing the experimental results presented afterwards.

8.1 Entropy, relative entropy and mutual information

For any probability distribution, we recall the definition of a quantity called the entropy, which has many properties that agree with the intuitive notion of what a measure of information should be. This notion is extended to define mutual information, which is a measure of the amount of information one random variable contains about another.

Entropy then becomes the self-information of a random variable. Mutual information is a special case of a more general quantity called relative entropy, which is a measure of the distance between two probability distributions. All these quantities are closely related and share a number of simple properties that are presented below.

8.1.1 Entropy and conditional entropy

We will first introduce the concept of entropy, which is a measure of the uncertainty of a random variable. Let X be a discrete random variable with alphabet X and probability mass function p(x) = Pr(X = x), x ∈ X. For convenience, we simply denote the probability mass function by p(x).



Definition 10 The entropy H(X) of a discrete random variable X is defined by

H(X) = −∑_{x∈X} p(x) log p(x). (8.1)

Note that H(X) ≥ 0.
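As a simple numerical illustration, the short Python sketch below computes H(X) in bits for a discrete distribution given as a list of probabilities (the distributions used are purely illustrative):

import math

def entropy(probs):
    # Shannon entropy in bits; terms with p = 0 contribute 0 by convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # a fair coin: 1.0 bit
print(entropy([0.125, 0.875]))  # a biased coin: about 0.544 bits

The second value reappears in the worked example given after Theorem 3, where the marginal distribution of X is (1/8, 7/8).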

We also recall the conditional entropy of a random variable given another as the expected value of the entropies of the conditional distributions, averaged over the conditioning random variable.

Definition 11 If (X, Y) ∼ p(x, y), then the conditional entropy H(Y|X) is defined as

H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x) (8.2)
       = −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x) (8.3)
       = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x). (8.4)

The naturalness of the definition of joint entropy and conditional entropy is exhibited by the fact that the entropy of a pair of random variables is the entropy of one variable plus the conditional entropy of the other.

Theorem 1 (Chain rule for entropy)

H(X, Y) = H(X) + H(Y|X). (8.5)

Let X1, X2, ..., Xn be drawn according to p(x1, x2, ..., xn). Then

H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, ..., X1). (8.6)
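To make the chain rule concrete, the following sketch verifies equation (8.5) on a small, arbitrary joint distribution (the table is hypothetical and chosen only for illustration):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint law p(x, y), stored as {(x, y): probability}.
p_xy = {(1, 1): 0.5, (1, 2): 0.25, (2, 1): 0.125, (2, 2): 0.125}

# Marginal p(x).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# H(Y|X) = sum_x p(x) H(Y|X = x), as in equation (8.2).
H_Y_given_X = sum(
    px * H([p / px for (xx, y), p in p_xy.items() if xx == x])
    for x, px in p_x.items())

print(H(p_xy.values()))               # H(X, Y) = 1.75 bits
print(H(p_x.values()) + H_Y_given_X)  # H(X) + H(Y|X) = 1.75 bits as well

Both quantities coincide, as predicted by the chain rule.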

8.1.2 Relative entropy and mutual information

Here, we review two related concepts: relative entropy and mutual information.

Definition 12 The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p‖q) = ∑_{x∈X} p(x) log ( p(x) / q(x) ). (8.7)
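A direct transcription of Definition 12 is given below; it uses the usual convention 0 log 0 = 0 and assumes q(x) > 0 wherever p(x) > 0 (the two distributions are arbitrary examples):

import math

def kl_divergence(p, q):
    # D(p || q) in bits, for two probability mass functions over the same alphabet.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # about 0.21 bits
print(kl_divergence(q, p))  # a different value: D is not symmetric
print(kl_divergence(p, p))  # 0.0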


Definition 13 Consider two random variables X and Y with joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y):

I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log ( p(x, y) / (p(x) p(y)) ) (8.8)
        = D(p(x, y) ‖ p(x) p(y)) (8.9)
        = H(X) − H(X|Y). (8.10)

Thus, the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y. Using Jensen's inequality and its consequences (Cover and Thomas, 1991), some of the properties of entropy and relative entropy can be proved.
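Equations (8.8)-(8.10) can be checked numerically; the sketch below computes I(X;Y) both as the relative entropy of (8.8)-(8.9) and as the entropy difference of (8.10), on the same illustrative joint law as before:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(1, 1): 0.5, (1, 2): 0.25, (2, 1): 0.125, (2, 2): 0.125}  # illustrative

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) = D(p(x, y) || p(x)p(y)), equation (8.8).
I_kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# I(X;Y) = H(X) - H(X|Y), equation (8.10).
H_X_given_Y = sum(
    py * H([p / py for (x, yy), p in p_xy.items() if yy == y])
    for y, py in p_y.items())
I_diff = H(p_x.values()) - H_X_given_Y

print(I_kl, I_diff)  # both about 0.016 bits, and non-negative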

Theorem 2 (Information inequality) Let p(x), q(x), x ∈ X, be two probability mass functions. Then,

D(p‖q) ≥ 0. (8.11)

Corollary 4 (Non-negativity of mutual information) For any two random variables X, Y:

I(X; Y) ≥ 0. (8.12)

Theorem 3 (Conditioning reduces entropy)

H(X|Y) ≤ H(X). (8.13)

Intuitively, the theorem says that knowing another random variable Y can only reduce the uncertainty in X. Note that this is true only on the average. Specifically, H(X|Y = y) may be greater or smaller than H(X), but on the average H(X|Y) = ∑_y p(y) H(X|Y = y) ≤ H(X). For example, in a court case, specific new evidence might increase uncertainty, but on the average evidence decreases uncertainty.

Example. Let (X, Y) have the following joint distribution p(x, y):

          X = 1    X = 2
  Y = 1     0       3/4
  Y = 2    1/8      1/8

Then H(X) = H(1/8, 7/8) = 0.544 bits, H(X|Y = 1) = 0 bits and H(X|Y = 2) = 1 bit. We calculate H(X|Y) = (3/4) H(X|Y = 1) + (1/4) H(X|Y = 2) = 0.25 bits. Thus, the uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed, but the uncertainty decreases on the average. Notice again that the reduction H(X|Y) ≤ H(X) holds only on the average: for a given sample, one may have H(X|Y = y) ≥ H(X|Y) and even ≥ H(X).
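The figures in this example are easy to reproduce; the following sketch recomputes them from the joint table:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [1/8, 7/8]                # marginal of X from the table
p_y = [3/4, 1/4]                # marginal of Y

print(H(p_x))                   # H(X) ~ 0.544 bits
print(H([0.0, 1.0]))            # H(X|Y = 1) = 0 bits (X = 2 for sure when Y = 1)
print(H([0.5, 0.5]))            # H(X|Y = 2) = 1 bit
print(p_y[0] * 0 + p_y[1] * 1)  # H(X|Y) = 0.25 bits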


Theorem 4 (Chain rule for information)

I(X; Y1, Y2, ..., Yn) = I(X; Y1) + I(X; Y2|Y1) + I(X; Y3|Y1, Y2) + ... (8.14)

In particular,

H(X|Z = z, Y = y) = −∑_x p(x|y, z) log p(x|y, z), (8.15)

H(X|Z, Y = y) = ∑_z H(X|Z = z, Y = y) p(z|y), (8.16)

so that, to select the next measurement Z, we need p(z|y) and p(x|y, z), or equivalently p(x, z|y) (i.e. P(X, Z|Y = y), the full law of X and Z for fixed y).
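Under the assumption that the conditional law p(x, z | y) is available for the already-collected observation y, equation (8.16) can be evaluated directly; the table below is hypothetical:

import math

# Hypothetical p(x, z | Y = y), stored as {(x, z): probability}.
p_xz = {('ok', 0): 0.4, ('ok', 1): 0.1, ('ko', 0): 0.1, ('ko', 1): 0.4}

# p(z | y), obtained by marginalizing x out.
p_z = {}
for (x, z), p in p_xz.items():
    p_z[z] = p_z.get(z, 0.0) + p

# H(X | Z, Y = y) = sum_z p(z|y) H(X | Z = z, Y = y), equations (8.15)-(8.16).
H_X_given_Zy = 0.0
for z, pz in p_z.items():
    cond = [p / pz for (x, zz), p in p_xz.items() if zz == z]
    H_X_given_Zy += pz * -sum(q * math.log2(q) for q in cond if q > 0)

print(H_X_given_Zy)  # expected residual uncertainty about X if Z is measured next

A candidate measurement Z with a small H(X|Z, Y = y) is a good next probe, since it is expected to leave little uncertainty about the system state.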

8.1.3 Gain function for fault localization

Suppose the state of a system is described by a vector X = (X1, X2, ..., Xn) of random variables, where Xi represents the state of a node. A probabilistic model is a joint probability distribution (a prior distribution) over all the random variables Xi; hence it assigns a probability p(x) to each realization x. Given observations Y of the system state, a fault diagnosis algorithm identifies the Most Probable Explanation (MPE) of the underlying system state responsible for the observed outage, i.e.:

x* = arg max_x p(x|y).
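In its simplest form, once the posterior p(x|y) over candidate explanations is available, the MPE is a straightforward arg max; the posterior below is a hypothetical placeholder:

# Hypothetical posterior p(x | y) over candidate explanations x.
posterior = {'link_A_down': 0.62, 'server_B_overload': 0.30, 'no_fault': 0.08}
mpe = max(posterior, key=posterior.get)
print(mpe)  # 'link_A_down'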

Observations reduce the uncertainty about the system. Before we observe Y, the uncertainty about the system state can be quantified by the Shannon entropy:

H(X) = −∑_x p(x) log p(x).

If the system state is known with certainty, H(X) = 0; as the uncertainty grows, H(X) increases.

The purpose of making observations is precisely to reduce this uncertainty. Given the probe responses Y = (Y1, Y2, ..., Yn), the distribution of the system state becomes P(X|Y), so the average remaining uncertainty about the system state is:

H(X|Y) = −∑_{x, y} p(x, y) log p(x|y).

Lindley (1956) proposed a method to measure the average information provided by an experiment. Suppose the probe Yi is defined on the system state X.

Definition 14 The average amount of information gained from Yi, with prior probability distribution P(X), is:

G(Yi) = I(X; Yi) = H(X) − H(X|Yi) = ∑_{x, yi} p(x, yi) log ( p(x, yi) / (p(x) p(yi)) ). (8.17)
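A sketch of the gain of a single candidate probe, computed as H(X) − H(X|Yi) from a hypothetical joint table p(x, yi):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain(p_x_y):
    # G(Yi) = H(X) - H(X|Yi), with the joint law given as {(x, yi): probability}.
    p_x, p_y = {}, {}
    for (x, y), p in p_x_y.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    H_X_given_Y = sum(
        py * H([p / py for (x, yy), p in p_x_y.items() if yy == y])
        for y, py in p_y.items())
    return H(p_x.values()) - H_X_given_Y

# A probe whose outcome is strongly correlated with the fault state has a high gain.
print(gain({('up', 'pass'): 0.45, ('up', 'fail'): 0.05,
            ('down', 'pass'): 0.05, ('down', 'fail'): 0.45}))  # ~0.53 bits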


In some situations, such as the one we examine below, one may not be able to use the full observation set Y = (Y1, Y2, ..., Yn) directly and may have to select a subset of k observations to collect. In order to select an optimal set of k probes among the variables Y1, ..., Yn for fault localization, the goal is to find a subset E ⊆ {1, ..., n}, |E| = k, which maximizes the reduction of the Shannon entropy, i.e.:

E* = arg max_E G(Y_E), for Y_E = {Yi, i ∈ E}. (8.18)

The problem stated in (8.18) is NP-hard (Brodie et al., 2003), so efficient exact solutions are unlikely. An alternative, widely used in the literature, is the heuristic greedy approach, which iteratively adds to E the probe that yields the largest reduction of the entropy of the system state, among those not selected yet. When problems are detected, a greedy algorithm repeatedly selects, given E, the new observation Yi to collect by:

i* = arg max_{i ∉ E} I(X; Yi | Y_E). (8.19)

Actually, the constraint i ∉ E can be dropped, because I(X; Yi | Y_E) ≥ 0 and I(X; Yi | Y_E) = 0 for i ∈ E. Besides, note that this solution uses P(X, Yi | Y_E): not only is the mutual information of X and Yi measured, but the impact of the previously selected probes on Yi is also taken into account. This provides a "global view", better than techniques that only consider the system state X and the candidate probe Yi.
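The greedy selection of (8.19) can be sketched as follows; the joint law over the system state and the candidate probes is a toy example (probe y1 mirrors the state, probe y2 is pure noise), and the helper names are hypothetical:

import math
from collections import defaultdict

def cond_entropy(joint, probe_idx):
    # H(X | Y_E), where joint maps (x, y1, ..., yn) to a probability and E = probe_idx.
    groups = defaultdict(dict)
    for outcome, p in joint.items():
        x, ys = outcome[0], tuple(outcome[1 + i] for i in probe_idx)
        groups[ys][x] = groups[ys].get(x, 0.0) + p
    h = 0.0
    for cond in groups.values():
        pe = sum(cond.values())
        h += pe * -sum((q / pe) * math.log2(q / pe) for q in cond.values() if q > 0)
    return h

def greedy_probe_selection(joint, n_probes, k):
    # Repeatedly add the probe i maximizing I(X; Yi | Y_E) = H(X|Y_E) - H(X|Y_E, Yi).
    selected = []
    for _ in range(k):
        base = cond_entropy(joint, selected)
        best = max((i for i in range(n_probes) if i not in selected),
                   key=lambda i: base - cond_entropy(joint, selected + [i]))
        selected.append(best)
    return selected

joint = {('ok', 1, 1): 0.2, ('ok', 1, 0): 0.2, ('ok', 0, 1): 0.05, ('ok', 0, 0): 0.05,
         ('ko', 0, 1): 0.2, ('ko', 0, 0): 0.2, ('ko', 1, 1): 0.05, ('ko', 1, 0): 0.05}
print(greedy_probe_selection(joint, n_probes=2, k=2))  # [0, 1]: the informative probe first

This enumeration-based sketch is only illustrative; in practice the required conditional laws come from the probabilistic model of the system state introduced in Section 8.1.3 rather than from an explicit joint table.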
