
2.2 A brief review of probability theory

In the document machine learning (Page 59-65)

This section is a very brief review of the basics of probability theory, and is merely meant as a refresher for readers who may be “rusty”. Readers who are already familiar with these basics may safely skip this section.

2.2.1 Discrete random variables

The expression p(A) denotes the probability that the event A is true. For example, A might be the logical expression "it will rain tomorrow". We require that 0 ≤ p(A) ≤ 1, where p(A) = 0 means the event definitely will not happen, and p(A) = 1 means the event definitely will happen. We write p(Ā) to denote the probability of the event not A; this is defined as p(Ā) = 1 − p(A). We will often write A = 1 to mean the event A is true, and A = 0 to mean the event A is false.

We can extend the notion of binary events by defining a discrete random variable X, which can take on any value from a finite or countably infinite set X. We denote the probability of the event that X = x by p(X = x), or just p(x) for short. Here p() is called a probability mass function or pmf. This satisfies the properties 0 ≤ p(x) ≤ 1 and ∑_{x∈X} p(x) = 1.

Figure 2.1 shows two pmf's defined on the finite state space X = {1, 2, 3, 4, 5}. On the left we have a uniform distribution, p(x) = 1/5, and on the right, we have a degenerate distribution, p(x) = I(x = 1), where I() is the binary indicator function. This distribution represents the fact that X is always equal to the value 1; in other words, it is a constant.
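These two pmf's are easy to construct numerically. The following Python sketch (not from the book, whose demos are Matlab scripts) checks that both satisfy the defining properties of a pmf:

```python
# Sketch: the uniform and degenerate pmf's of Figure 2.1 on X = {1,...,5}.
states = [1, 2, 3, 4, 5]

uniform = {x: 1 / len(states) for x in states}        # p(x) = 1/5
degenerate = {x: 1 if x == 1 else 0 for x in states}  # p(x) = I(x = 1)

# Both satisfy 0 <= p(x) <= 1 and sum to 1.
for pmf in (uniform, degenerate):
    assert all(0 <= p <= 1 for p in pmf.values())
    assert abs(sum(pmf.values()) - 1) < 1e-12
```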

2.2.2 Fundamental rules

In this section, we review the basic rules of probability.

2.2.2.1 Probability of a union of two events

Given two events, A and B, we define the probability of A or B as follows:

p(A ∨ B) = p(A) + p(B) − p(A ∧ B) (2.1)
= p(A) + p(B) if A and B are mutually exclusive (2.2)

2.2.2.2 Joint probabilities

We define the probability of the joint event A and B as follows:

p(A, B) = p(A ∧ B) = p(A|B)p(B) (2.3)

This is sometimes called the product rule. Given a joint distribution on two events p(A, B), we define the marginal distribution as follows:

p(A) = ∑_b p(A, B = b) = ∑_b p(A|B = b)p(B = b) (2.4)

where we are summing over all possible states of B. We can define p(B) similarly. This is sometimes called the sum rule or the rule of total probability.
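The sum rule lends itself to a direct numerical check. In this hypothetical Python example (the numbers are invented for illustration), we recover p(A) by marginalizing over B:

```python
# Illustrative distribution over two binary events A and B (hypothetical numbers).
p_B = {0: 0.7, 1: 0.3}
p_A_given_B = {0: {0: 0.9, 1: 0.1},   # p(A | B=0)
               1: {0: 0.4, 1: 0.6}}   # p(A | B=1)

# Sum rule (2.4): p(A=a) = sum_b p(A=a | B=b) p(B=b)
p_A = {a: sum(p_A_given_B[b][a] * p_B[b] for b in p_B) for a in (0, 1)}
```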

The product rule can be applied multiple times to yield the chain rule of probability:

p(X1:D) = p(X1)p(X2|X1)p(X3|X2, X1)p(X4|X1, X2, X3) · · · p(XD|X1:D−1) (2.5)

where we introduce the Matlab-like notation 1:D to denote the set {1, 2, . . . , D}.
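The chain rule can likewise be verified on a small joint table. The Python sketch below (an illustration, not the book's code) factors a random joint over three binary variables into p(x1)p(x2|x1)p(x3|x1, x2):

```python
import itertools
import random

random.seed(0)

# A hypothetical joint p(x1, x2, x3) over three binary variables,
# stored as a normalized table.
weights = {k: random.random() for k in itertools.product((0, 1), repeat=3)}
norm = sum(weights.values())
p = {k: w / norm for k, w in weights.items()}

def marg(fixed):
    """Probability that the variables in `fixed` (index -> value) take those values."""
    return sum(v for k, v in p.items()
               if all(k[i] == x for i, x in fixed.items()))

# Chain rule (2.5): p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2),
# where each conditional is a ratio of marginals.
for (x1, x2, x3), joint in p.items():
    chain = (marg({0: x1})
             * marg({0: x1, 1: x2}) / marg({0: x1})
             * joint / marg({0: x1, 1: x2}))
    assert abs(chain - joint) < 1e-12
```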

2.2.2.3 Conditional probability

We define the conditional probability of event A, given that event B is true, as follows:

p(A|B) = p(A, B) / p(B), if p(B) > 0 (2.6)

2.2.3 Bayes rule

Combining the definition of conditional probability with the product and sum rules yields Bayes rule, also called Bayes Theorem²:

p(X = x|Y = y) = p(X = x, Y = y) / p(Y = y) = p(X = x)p(Y = y|X = x) / ∑_{x′} p(X = x′)p(Y = y|X = x′) (2.7)

2.2.3.1 Example: medical diagnosis

As an example of how to use this rule, consider the following medical diagnosis problem.

2. Thomas Bayes (1702–1761) was an English mathematician and Presbyterian minister.

Suppose you are a woman in your 40s, and you decide to have a medical test for breast cancer called a mammogram. If the test is positive, what is the probability you have cancer? That obviously depends on how reliable the test is. Suppose you are told the test has a sensitivity of 80%, which means, if you have cancer, the test will be positive with probability 0.8. In other words,

p(x = 1|y = 1) = 0.8 (2.8)

where x = 1 is the event the mammogram is positive, and y = 1 is the event you have breast cancer. Many people conclude they are therefore 80% likely to have cancer. But this is false! It ignores the prior probability of having breast cancer, which fortunately is quite low:

p(y= 1) = 0.004 (2.9)

Ignoring this prior is called the base rate fallacy. We also need to take into account the fact that the test may be a false positive or false alarm. Unfortunately, such false positives are quite likely (with current screening technology):

p(x= 1|y= 0) = 0.1 (2.10)

Combining these three terms using Bayes rule, we can compute the correct answer as follows:

p(y = 1|x = 1) = p(x = 1|y = 1)p(y = 1) / [p(x = 1|y = 1)p(y = 1) + p(x = 1|y = 0)p(y = 0)] (2.11)
= (0.8 × 0.004) / (0.8 × 0.004 + 0.1 × 0.996) = 0.031 (2.12)

where p(y = 0) = 1 − p(y = 1) = 0.996. In other words, if you test positive, you only have about a 3% chance of actually having breast cancer!³
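This calculation is easy to reproduce. The following Python snippet simply plugs the three numbers from the text into Bayes rule:

```python
# The numbers from the text: sensitivity, prior, false-positive rate.
p_pos_given_cancer = 0.8     # p(x=1 | y=1), Eq. 2.8
p_cancer = 0.004             # p(y=1), Eq. 2.9
p_pos_given_healthy = 0.1    # p(x=1 | y=0), Eq. 2.10

# Bayes rule (2.11): posterior = likelihood * prior / evidence
numer = p_pos_given_cancer * p_cancer
denom = numer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = numer / denom   # about 0.031
```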

2.2.3.2 Example: Generative classifiers

We can generalize the medical diagnosis example to classify feature vectors x of arbitrary type as follows:

p(y = c|x, θ) = p(y = c|θ)p(x|y = c, θ) / ∑_{c′} p(y = c′|θ)p(x|y = c′, θ) (2.13)

This is called a generative classifier, since it specifies how to generate the data using the class-conditional density p(x|y = c) and the class prior p(y = c). We discuss such models in detail in Chapters 3 and 4. An alternative approach is to directly fit the class posterior, p(y = c|x); this is known as a discriminative classifier. We discuss the pros and cons of the two approaches in Section 8.6.

2.2.4 Independence and conditional independence

We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the joint as the product of the two marginals (see Figure 2.2), i.e.,

X ⊥ Y ⇐⇒ p(X, Y) = p(X)p(Y) (2.14)

Figure 2.2 Computing p(x, y) = p(x)p(y), where X ⊥ Y. Here X and Y are discrete random variables; X has 6 possible states (values) and Y has 5 possible states. A general joint distribution on two such variables would require (6 × 5) − 1 = 29 parameters to define it (we subtract 1 because of the sum-to-one constraint). By assuming (unconditional) independence, we only need (6 − 1) + (5 − 1) = 9 parameters to define p(x, y).
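The parameter counting in the caption can be illustrated in code. This Python sketch (with hypothetical random marginals, not from the book) builds the 6 × 5 joint from its two marginals, so the full table is determined by only 9 free numbers:

```python
import random

random.seed(1)

def rand_pmf(n):
    """A random pmf over n states (for illustration only)."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

p_x = rand_pmf(6)   # X has 6 states: 6 - 1 = 5 free parameters
p_y = rand_pmf(5)   # Y has 5 states: 5 - 1 = 4 free parameters

# Under X ⊥ Y (Eq. 2.14), the 6x5 joint is the outer product of the marginals.
joint = [[px * py for py in p_y] for px in p_x]
total = sum(sum(row) for row in joint)
```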

In general, we say a set of variables is mutually independent if the joint can be written as a product of marginals.

Unfortunately, unconditional independence is rare, because most variables can influence most other variables. However, usually this influence is mediated via other variables rather than being direct. We therefore say X and Y are conditionally independent (CI) given Z iff the conditional joint can be written as a product of conditional marginals:

X ⊥ Y | Z ⇐⇒ p(X, Y|Z) = p(X|Z)p(Y|Z) (2.15)

When we discuss graphical models in Chapter 10, we will see that we can write this assumption as a graph X − Z − Y, which captures the intuition that all the dependencies between X and Y are mediated via Z. For example, the probability it will rain tomorrow (event X) is independent of whether the ground is wet today (event Y), given knowledge of whether it is raining today (event Z). Intuitively, this is because Z "causes" both X and Y, so if we know Z, we do not need to know about Y in order to predict X or vice versa. We shall expand on this concept in Chapter 10.
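One way to see Equation 2.15 concretely is to construct a joint that satisfies it. In the Python sketch below (all numbers are invented), the joint is built as p(z)p(x|z)p(y|z), so the conditional factorization holds by construction:

```python
# A small distribution built to satisfy X ⊥ Y | Z (hypothetical numbers).
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

# Construct the joint as p(z) p(x|z) p(y|z).
p = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
     for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Check Eq. (2.15): p(x, y | z) = p(x|z) p(y|z) for every state.
for (x, y, z), v in p.items():
    assert abs(v / p_z[z] - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
```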

Another characterization of CI is this:

Theorem 2.2.1. X ⊥ Y | Z iff there exist functions g and h such that

p(x, y|z) = g(x, z)h(y, z) (2.16)

for all x, y, z such that p(z) > 0.

3. These numbers are from (McGrayne 2011, p257). Based on this analysis, the US government decided not to recommend annual mammogram screening to women in their 40s: the number of false alarms would cause needless worry and stress amongst women, and result in unnecessary, expensive, and potentially harmful followup tests. See Section 5.7 for the optimal way to trade off risk versus reward in the face of uncertainty.

See Exercise 2.8 for the proof.

CI assumptions allow us to build large probabilistic models from small pieces. We will see many examples of this throughout the book. In particular, in Section 3.5, we discuss naive Bayes classifiers, in Section 17.2, we discuss Markov models, and in Chapter 10 we discuss graphical models; all of these models heavily exploit CI properties.

2.2.5 Continuous random variables

So far, we have only considered reasoning about uncertain discrete quantities. We will now show (following (Jaynes 2003, p107)) how to extend probability to reason about uncertain continuous quantities.

Suppose X is some uncertain continuous quantity. The probability that X lies in any interval a ≤ X ≤ b can be computed as follows. Define the events A = (X ≤ a), B = (X ≤ b) and W = (a < X ≤ b). We have that B = A ∨ W, and since A and W are mutually exclusive, the sum rule gives

p(B) =p(A) +p(W) (2.17)

and hence

p(W) =p(B)−p(A) (2.18)

Define the function F(q) ≜ p(X ≤ q). This is called the cumulative distribution function or cdf of X. This is obviously a monotonically increasing function. See Figure 2.3(a) for an example. Using this notation we have

p(a < X ≤b) =F(b)−F(a) (2.19)

Now define f(x) = d/dx F(x) (we assume this derivative exists); this is called the probability density function or pdf. See Figure 2.3(b) for an example. Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

P(a < X ≤ b) = ∫_a^b f(x) dx (2.20)

As the size of the interval gets smaller, we can write

P(x≤X≤x+dx)≈p(x)dx (2.21)

We require p(x) ≥ 0, but it is possible for p(x) > 1 for any given x, so long as the density integrates to 1. As an example, consider the uniform distribution Unif(a, b):

Unif(x|a, b) = (1/(b − a)) I(a ≤ x ≤ b) (2.22)

If we set a = 0 and b = 1/2, we have p(x) = 2 for any x ∈ [0, 1/2].
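This example can be checked numerically. The Python sketch below (illustrative only) confirms that a density value of 2 is perfectly consistent with a total mass of 1:

```python
# Unif(x | 0, 1/2): density p(x) = 2 on [0, 1/2] (Eq. 2.22), which exceeds 1
# even though the total probability mass integrates to 1.
a, b = 0.0, 0.5

def unif_pdf(x):
    return 1 / (b - a) if a <= x <= b else 0.0

# Riemann-sum check that the density integrates to (approximately) 1.
n = 100_000
dx = 1.0 / n
total = sum(unif_pdf(i * dx) * dx for i in range(n))
```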

Figure 2.3 (a) Plot of the cdf for the standard normal, N(0, 1). (b) Corresponding pdf. The shaded regions each contain α/2 of the probability mass. Therefore the nonshaded region contains 1 − α of the probability mass. If the distribution is Gaussian N(0, 1), then the leftmost cutoff point is Φ⁻¹(α/2), where Φ is the cdf of the Gaussian. By symmetry, the rightmost cutoff point is Φ⁻¹(1 − α/2) = −Φ⁻¹(α/2). If α = 0.05, the central interval is 95%, and the left cutoff is −1.96 and the right is 1.96. Figure generated by quantileDemo.

2.2.6 Quantiles

Since the cdf F is a monotonically increasing function, it has an inverse; let us denote this by F⁻¹. If F is the cdf of X, then F⁻¹(α) is the value of xα such that P(X ≤ xα) = α; this is called the α quantile of F. The value F⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values F⁻¹(0.25) and F⁻¹(0.75) are the lower and upper quartiles.

We can also use the inverse cdf to compute tail area probabilities. For example, if Φ is the cdf of the Gaussian distribution N(0, 1), then points to the left of Φ⁻¹(α/2) contain α/2 probability mass, as illustrated in Figure 2.3(b). By symmetry, points to the right of Φ⁻¹(1 − α/2) also contain α/2 of the mass. Hence the central interval (Φ⁻¹(α/2), Φ⁻¹(1 − α/2)) contains 1 − α of the mass. If we set α = 0.05, the central 95% interval is covered by the range

(Φ⁻¹(0.025), Φ⁻¹(0.975)) = (−1.96, 1.96) (2.23)

If the distribution is N(μ, σ²), then the 95% interval becomes (μ − 1.96σ, μ + 1.96σ). This is sometimes approximated by writing μ ± 2σ.
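The book generates Figure 2.3 with the Matlab script quantileDemo; a rough Python equivalent of the quantile computation, using the standard library's NormalDist, is:

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)   # N(0, 1)

alpha = 0.05
lo = std_normal.inv_cdf(alpha / 2)        # Phi^{-1}(0.025), about -1.96
hi = std_normal.inv_cdf(1 - alpha / 2)    # Phi^{-1}(0.975), about  1.96

# The central interval (lo, hi) contains 1 - alpha of the mass.
mass = std_normal.cdf(hi) - std_normal.cdf(lo)
```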

2.2.7 Mean and variance

The most familiar property of a distribution is its mean, or expected value, denoted by μ. For discrete rv's, it is defined as E[X] ≜ ∑_{x∈X} x p(x), and for continuous rv's, it is defined as E[X] ≜ ∫_X x p(x) dx. If this integral is not finite, the mean is not defined (we will see some examples of this later).

The variance is a measure of the "spread" of a distribution, denoted by σ². This is defined as follows:

var[X] ≜ E[(X − μ)²] = ∫ (x − μ)² p(x) dx (2.24)
= ∫ x² p(x) dx + μ² ∫ p(x) dx − 2μ ∫ x p(x) dx = E[X²] − μ² (2.25)

from which we derive the useful result

E[X²] = μ² + σ² (2.26)

The standard deviation is defined as

std[X] ≜ √var[X] (2.27)

This is useful since it has the same units as X itself.
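The identity E[X²] = μ² + σ² is easy to confirm on a small discrete pmf; this Python sketch (with an invented pmf) does so:

```python
# Mean and variance of a discrete pmf, verifying E[X^2] = mu^2 + sigma^2 (Eq. 2.26).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}   # hypothetical pmf

mu = sum(x * p for x, p in pmf.items())                # E[X]
var = sum((x - mu) ** 2 * p for x, p in pmf.items())   # var[X] = E[(X - mu)^2]
e_x2 = sum(x ** 2 * p for x, p in pmf.items())         # E[X^2]
std = var ** 0.5                                       # same units as X
```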
