
Assigning probabilities


4.1 Introduction

When we adopt the approach of probability theory as extended logic, the solution to any inference problem begins with Bayes’ theorem:

$$p(H_i|D,I) = \frac{p(H_i|I)\, p(D|H_i,I)}{p(D|I)}. \tag{4.1}$$

In a well-posed problem, the prior information, I, defines the hypothesis space and provides the information necessary to compute the terms in Bayes’ theorem.

In this chapter we will be concerned with how to encode our prior information, I, into a probability distribution to use for p(D|H_i, I). Different states of knowledge correspond to different probability distributions. These probability distributions are frequently called sampling distributions, a carry-over from conventional statistics literature. Recall that in inference problems, p(D|H_i, I) gives the probability of obtaining the data, D, that we actually got, under the assumption that H_i is true. Thus, p(D|H_i, I) yields how likely it is that H_i is true,¹ and hence it is referred to as the likelihood and frequently written as L(H_i).

For example, we might have two competing hypotheses H_1 and H_2 that each predicts different values of some temperature, say 1 K and 4.5 K, respectively. If the measured value is 1.2 ± 0.4 K, then it is clear that H_1 is more likely to be true. In precisely this type of situation we can use p(D|H_i, I) to compute quantitatively the relative likelihood of H_1 and H_2. We saw how to do that in one case (Section 3.6) where the likelihood was the product of N independent Gaussian distributions.

4.2 Binomial distribution

In this section, we will see how a particular state of knowledge (prior information I) leads us to the choice of likelihood, p(D|H_i, I), which is the well-known binomial distribution (derivation due to M. Tribus, 1969). In this case, our prior information is as follows:

¹Conversely, if we know that H_i is true, then we can directly calculate the probability of observing any particular data value. We will use p(D|H_i, I) in this way to generate simulated data sets in Section 5.13.


I ≡ "Proposition E represents an event that is repeated many times and has two possible outcomes represented by propositions, Q and Q̄, e.g., tossing a coin. The probability of outcome Q is constant from event to event, i.e., the probability of getting an outcome Q in any individual event is independent of the outcome for any other event."

In the Boolean algebra of propositions we can write E as

$$E = Q + \bar{Q}, \tag{4.2}$$

where Q + Q̄ is the logical sum. Then the possible outcomes of n events can be written as

$$E_1, E_2, \ldots, E_n = (Q_1 + \bar{Q}_1), (Q_2 + \bar{Q}_2), \ldots, (Q_n + \bar{Q}_n), \tag{4.3}$$

where Q_i ≡ "outcome Q occurred for the ith event." If the multiplication on the right is carried out, the result will be a logical sum of 2^n terms, each a product of n logical statements, thereby enumerating all possible outcomes of the n events. For n = 3 we find:

$$\begin{aligned}
E_1, E_2, E_3 = {} & Q_1,Q_2,Q_3 + \bar{Q}_1,Q_2,Q_3 + Q_1,\bar{Q}_2,Q_3 + Q_1,Q_2,\bar{Q}_3 \\
& + \bar{Q}_1,\bar{Q}_2,Q_3 + \bar{Q}_1,Q_2,\bar{Q}_3 + Q_1,\bar{Q}_2,\bar{Q}_3 + \bar{Q}_1,\bar{Q}_2,\bar{Q}_3.
\end{aligned} \tag{4.4}$$

The probability of the particular sequence Q_1, Q̄_2, Q̄_3 can be obtained from repeated applications of the product rule.

$$\begin{aligned}
p(Q_1, \bar{Q}_2, \bar{Q}_3) &= p(Q_1|I)\, p(\bar{Q}_2, \bar{Q}_3|Q_1, I) \\
&= p(Q_1|I)\, p(\bar{Q}_2|Q_1,I)\, p(\bar{Q}_3|Q_1,\bar{Q}_2,I).
\end{aligned} \tag{4.5}$$

Information I leads us to assign the same probability for outcome Q for each event, independent of what happened earlier or later, so Equation (4.5) becomes

$$\begin{aligned}
p(Q_1, \bar{Q}_2, \bar{Q}_3) &= p(Q_1|I)\, p(\bar{Q}_2|I)\, p(\bar{Q}_3|I) \\
&= p(Q|I)\, p(\bar{Q}|I)\, p(\bar{Q}|I) \\
&= p(Q|I)\, p(\bar{Q}|I)^2.
\end{aligned} \tag{4.6}$$

Thus, the probability of a particular outcome depends only on the number of Q's and Q̄'s in it and not on the order in which they occur. Returning to Equation (4.4), we note that:

one outcome, the first, contains three Q's; three outcomes contain two Q's; three outcomes contain only one Q; and one outcome contains no Q's.

More generally, we are going to be interested in the number of ways of getting an outcome with r Q's in n events or trials. In each event, it is possible to obtain a Q, so the question becomes: in how many ways can we select r Q's from n events where their order is irrelevant? This is given by $^nC_r$:

$${}^{n}C_{r} = \frac{n!}{r!\,(n-r)!} = \binom{n}{r}. \tag{4.7}$$

For example, $\binom{3}{2} = \frac{3!}{2!\,1!} = 3$: $\;Q,Q,\bar{Q}$; $\;Q,\bar{Q},Q$; $\;\bar{Q},Q,Q$.

Thus, the probability of getting r Q's in n events is the probability of any one sequence with r Q's and (n − r) Q̄'s, multiplied by $^nC_r$, the multiplicity of ways of obtaining r Q's in n events or trials. Therefore, we conclude that in n trials, the probability of seeing the outcome of r Q's and (n − r) Q̄'s is

$$p(r|n,I) = \frac{n!}{r!\,(n-r)!}\, p(Q|I)^{r}\, p(\bar{Q}|I)^{\,n-r}. \tag{4.8}$$

This distribution is called the binomial distribution.

Note the similarity to the binomial expansion

$$(x+y)^n = \sum_{r=0}^{n} \frac{n!}{r!\,(n-r)!}\, x^{r} y^{\,n-r}. \tag{4.9}$$

Referring back to Equation (4.4), in the algebra of propositions, we can interpret E^n to mean E carried out n times and write it in a form analogous to Equation (4.9):

$$E^n = (Q + \bar{Q})^n.$$

Example:

I ≡ "You pick up one of two coins which appear identical. One, coin A, is known to be a fair coin, while coin B is a weighted coin with p(head) = 0.2." From this information and from experimental information you will acquire from tossing the coin, compute the probability that you picked up coin A.

D ≡ "3 heads turn up in 5 tosses."

What is the probability you picked coin A?

Let

$$\text{odds} = \frac{p(A|D,I)}{p(B|D,I)} = \frac{p(A|I)\,p(D|A,I)}{p(B|I)\,p(D|B,I)} = \frac{\frac{1}{2}\,p(D|A,I)}{\frac{1}{2}\,p(D|B,I)}. \tag{4.10}$$

To evaluate the likelihoods p(D|A,I) and p(D|B,I), we use the binomial distribution, given by

$$p(r|n,I) = \frac{n!}{r!\,(n-r)!}\, p(\mathrm{head}|A,I)^{r}\, p(\mathrm{tail}|A,I)^{\,n-r},$$

where p(r|n,I) is the probability of obtaining r heads in n tosses and p(head|A,I) is the probability of obtaining a head in any single toss assuming A is true. Now

$$p(D|A,I) = \binom{n}{r}\, p(\mathrm{head}|A,I)^{r}\, p(\mathrm{tail}|A,I)^{\,n-r} = \binom{5}{3}\,(0.5)^3 (0.5)^2$$

and

$$p(D|B,I) = \binom{5}{3}\,(0.2)^3 (0.8)^2$$

$$\Rightarrow \quad \text{odds} = 6.1 = \frac{p(A|D,I)}{1 - p(A|D,I)},$$

and so p(A|D,I) = 0.86.

Thus, the probability you picked up coin A is 0.86, based on our current state of knowledge.
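As a quick numerical cross-check of this result, the following short sketch (our addition, in Python with scipy rather than the book's Mathematica; the variable names are ours) evaluates the two binomial likelihoods, the odds, and p(A|D,I):

```python
# Illustrative check of Eq. (4.10): binomial likelihoods for the fair coin A
# (p = 0.5) and the weighted coin B (p = 0.2), given 3 heads in 5 tosses.
from scipy.stats import binom

n, r = 5, 3
like_A = binom.pmf(r, n, 0.5)   # p(D|A,I)
like_B = binom.pmf(r, n, 0.2)   # p(D|B,I)

odds = like_A / like_B          # equal priors p(A|I) = p(B|I) = 1/2 cancel
p_A = odds / (1.0 + odds)       # since odds = p(A|D,I) / (1 - p(A|D,I))

print(odds, p_A)                # approximately 6.1 and 0.86, as in the text
```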

4.2.1 Bernoulli’s law of large numbers

The binomial distribution allows us to compute p(r|n,I), where r is, for example, the number of heads occurring in n tosses of a coin. According to Bernoulli's law of large numbers, the long-run frequency of occurrence tends to the probability of the event occurring in any single trial, i.e.,

$$\lim_{n \to \infty} \frac{r}{n} = p(\mathrm{head}|I). \tag{4.11}$$

We can easily demonstrate this using the binomial distribution. If the probability of a head in any single toss is p(head|I) = 0.4, Figure 4.1 shows a plot of p(r/n | n, I) versus the fraction r/n for a variety of different choices of n ranging from 20 to 1000.

Box 4.1 Mathematica evaluation of binomial distribution:

Needs["Statistics`DiscreteDistributions`"]

The line above loads a package containing a wide range of discrete distributions of importance to statistics, and the following line computes the probability of r heads in n trials where the probability of a head in any one trial is p.

PDF[BinomialDistribution[n, p], r]

→ answer = 0.205 (n = 10, p = 0.5, r = 4)

Notice that as n increases, the PDF for the frequency becomes progressively more sharply peaked, converging on a value of 0.4, the probability of a head in any single toss.
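For readers without Mathematica, a minimal Python sketch (our addition, assuming numpy and scipy are available) reproduces the behaviour plotted in Figure 4.1: the PDF for the frequency f = r/n, obtained by rescaling the binomial probabilities, narrows around p(head|I) = 0.4 as n grows.

```python
# Sketch of the calculation behind Figure 4.1 (our illustration, not the book's code).
import numpy as np
from scipy.stats import binom

p_head = 0.4
for n in (20, 100, 1000):
    r = np.arange(n + 1)
    pmf = binom.pmf(r, n, p_head)      # p(r|n,I)
    density = pmf * n                  # probability density in f = r/n (bin width 1/n)
    print(n, r[np.argmax(pmf)] / n, density.max())
    # the peak frequency stays near 0.4 while the peak density grows, i.e., the
    # distribution becomes progressively more sharply peaked about 0.4
```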

Although Bernoulli was able to derive this result, his unfulfilled quest lay in the inverse process: what could one say about the probability of obtaining a head, in a single toss, given a finite number of observed outcomes? This turns out to be a straightforward problem for Bayesian inference as we see in the next section.

4.2.2 The gambler’s coin problem

Let I ≡ "You have acquired a coin from a gambling table. You want to determine whether it is a biased coin from the results of tossing the coin many times. You specify the bias of the coin by a proposition H, representing the probability of a head occurring in any single toss. A priori, you assume that H can have any value in the range 0 → 1 with equal probability. You want to see how p(H|D,I) evolves as a function of the number of tosses."

Let D ≡ "You toss the coin 50 times and record the following results: (a) 2 heads in the first 3 tosses, (b) 7 heads in the first 10 tosses, and (c) 33 heads in 50 tosses."

From the prior information, we determine that our hypothesis space H is continuous in the range 0 → 1. As usual, our starting point is Bayes' theorem:

$$p(H|D,I) = \frac{p(H|I)\, p(D|H,I)}{p(D|I)}. \tag{4.12}$$

Since we are assuming a uniform prior for p(H|I), the action will all be in the likelihood term p(D|H,I), which, in this case, is given by the binomial distribution:

$$p(r|n,I) = \frac{n!}{r!\,(n-r)!}\, H^{r} (1-H)^{\,n-r}. \tag{4.13}$$

Note: the symbol H is being employed in two different ways. In Equation (4.13), it is acting as an ordinary algebraic variable standing for possible numerical values in the range 0 to 1. When it appears as an argument of a probability or PDF, e.g., p(H|D,I), it acts as a proposition (obeying the rules of Boolean algebra) and asserts that the true value lies in the numerical range H to H + dH.

Figure 4.2 shows the results from Equation (4.13) as a function of H in the range 0 → 1. From the figure, we can clearly see how the evolution of our state of knowledge of the coin translates into a progressively more sharply peaked posterior PDF. From this simple example, we can see how Bayes' theorem solves the inverse problem: find p(H|D,I) given a finite number of observed outcomes represented by D.
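The curves of Figure 4.2 are easy to regenerate. Here is a minimal sketch (ours, in Python; with a flat prior the normalized posterior is a Beta(r + 1, n − r + 1) density, which is just Equation (4.13) renormalized over 0 ≤ H ≤ 1):

```python
# Posterior p(H|D,I) for the coin bias under a uniform prior (our illustration).
import numpy as np
from scipy.stats import beta

H = np.linspace(0.0, 1.0, 501)
for n, r in [(3, 2), (10, 7), (50, 33)]:
    posterior = beta.pdf(H, r + 1, n - r + 1)   # normalized H^r (1-H)^(n-r)
    print(n, H[np.argmax(posterior)])           # peak at r/n; sharpens as n grows
```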

Figure 4.1 A numerical illustration of Bernoulli's law of large numbers. The PDF for the frequency of heads, r/n, in n tosses of a coin is shown for three different choices of n (n = 20, 100, 1000). As n increases, the distribution narrows about the probability of a head in any single toss = 0.4.

4.2.3 Bayesian analysis of an opinion poll

Let I ≡ "A number of political parties are seeking election in British Columbia. The questions to be addressed are: (a) what is the fraction of decided voters that support the Liberals, and (b) what is the probability that the Liberals will achieve a majority of at least 51% in the upcoming election, assuming the poll will be representative of the population at the time of the election?"

Let D ≡ "In a poll of 800 decided voters, 18% supported the New Democratic Party versus 55% for the Liberals, 19% for Reform BC and 8% for other parties."

Let the proposition H ≡ "The fraction of the voters that will support the Liberals is between H and H + dH." In this problem our hypothesis space of interest is continuous in the range 0 to 1, so p(H|D,I) is a probability density function.

Based only on the prior information as stated, we adopt a flat prior p(H|I) = 1.

Let r = the number of respondents in the poll that support the Liberals. As far as this problem is concerned, there are only two outcomes of interest: a voter either will or will not vote for the Liberals. We can therefore use the binomial distribution to evaluate the likelihood function p(D|H,I). Given a particular value of H, the binomial distribution gives the probability of obtaining D = r successes in n samples, where in this case, a success means support for the Liberals.

$$p(D|H,I) = \frac{n!}{r!\,(n-r)!}\, H^{r} (1-H)^{\,n-r}. \tag{4.14}$$

In this problem n = 800 and r = 440. From Bayes' theorem we can write

$$p(H|D,I) = \frac{p(H|I)\, p(D|H,I)}{p(D|I)} = \frac{p(D|H,I)}{p(D|I)} = \frac{p(D|H,I)}{\int_0^1 dH\, p(D|H,I)}. \tag{4.15}$$

Figure 4.2 The posterior PDF, p(H|D,I), for the bias of a coin ("Weighted Coin") determined from: (a) 3 tosses, (b) 10 tosses, and (c) 50 tosses.

Figure 4.3 shows a graph of the posterior probability of H for a variety of poll sizes including n = 800. The 95% credible region² for H is $55^{+3.4}_{-3.5}\%$. A frequentist interpretation of the same poll would express the uncertainty in the fraction of decided voters supporting the Liberals in the following way: "The poll of 800 people claims an accuracy of ±3.5%, 19 times out of 20." We will see why when we deal with frequentist confidence intervals in Section 6.6.
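As a numerical illustration (ours, and only approximate: we quote an equal-tailed 95% interval rather than the credible-region recipe of Section 3.3), the posterior for H with a flat prior is a Beta(441, 361) density, and its 95% interval reproduces the roughly ±3.5% quoted above:

```python
# Posterior for the Liberal fraction H with n = 800, r = 440 and a flat prior.
from scipy.stats import beta

n, r = 800, 440
post = beta(r + 1, n - r + 1)               # p(H|D,I) = normalized H^440 (1-H)^360
lo, hi = post.ppf(0.025), post.ppf(0.975)   # equal-tailed 95% interval
print(post.mean(), lo, hi)                  # about 0.55, with a spread of roughly +/- 0.035
```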

The second question, concerning the probability that the Liberals will achieve a majority of at least 51% of the vote, is addressed as a model selection problem. The two models are:

1. Model M_1 ≡ "the Liberals will achieve a majority." The parameter of the model is H, which is assumed to have a uniform prior in the range 0.51 ≤ H ≤ 1.0.

2. Model M_2 ≡ "the Liberals will not achieve a majority." The parameter of the model is H, which is assumed to have a uniform prior in the range 0 ≤ H < 0.51.

From Equation (3.14) we can write

$$\text{odds} = O_{12} = \frac{p(M_1|I)}{p(M_2|I)}\, B_{12}, \tag{4.16}$$

Figure 4.3 The posterior PDF for H, the fraction of voters in the province supporting the Liberals, based on polls of size n = 100, 200, 800 decided voters.

²Note: a Bayesian credible region is not the same as a frequentist confidence interval. For a uniform prior for H the 95% confidence interval has essentially the same value as the 95% credible region, but the interpretation is very different. The recipe for computing a credible region was given at the end of Section 3.3.

where

$$\begin{aligned}
B_{12} &= \frac{p(D|M_1,I)}{p(D|M_2,I)}
= \frac{\int_{0.51}^{1} dH\, p(H|M_1,I)\, p(D|M_1,H,I)}{\int_{0}^{0.51} dH\, p(H|M_2,I)\, p(D|M_2,H,I)} \\
&= \frac{\int_{0.51}^{1} dH\, (1/0.49)\, p(D|M_1,H,I)}{\int_{0}^{0.51} dH\, (1/0.51)\, p(D|M_2,H,I)}
= 87.68.
\end{aligned} \tag{4.17}$$

Based on I, we have no prior reason to prefer M_1 over M_2, so O_{12} = B_{12}. The probability that the Liberal party will win a majority is then given by (see Equation (3.18))

$$p(M_1|D,I) = \frac{1}{1 + 1/O_{12}} = 0.989. \tag{4.18}$$

Again, we emphasize that our conclusions are conditional on the assumed prior information, which includes the assumption that the poll will be representative of the population at the time of the election. Now that we have set up the equations to answer the questions posed above, it is a simple exercise to recompute the answers assuming different prior information, e.g., suppose the prior lower bound on H were 0.4 instead of 0.
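The integrals in Equation (4.17) can also be checked numerically. In the sketch below (ours, in Python rather than the book's approach), the integrals of the binomial likelihood over each model's range of H are expressed through the Beta(441, 361) distribution, whose normalization cancels in the ratio:

```python
# Numerical check of Eqs. (4.16)-(4.18) for the opinion-poll model selection.
from scipy.stats import beta

n, r = 800, 440
post = beta(r + 1, n - r + 1)        # proportional to the likelihood H^440 (1-H)^360

num = post.sf(0.51) / 0.49           # integral over 0.51 <= H <= 1, prior density 1/0.49
den = post.cdf(0.51) / 0.51          # integral over 0 <= H < 0.51, prior density 1/0.51
B12 = num / den                      # Bayes factor, about 88, cf. 87.68 in Eq. (4.17)

O12 = B12                            # equal prior odds for M1 and M2
print(B12, 1.0 / (1.0 + 1.0 / O12))  # majority probability ~ 0.989, as in Eq. (4.18)
```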

4.3 Multinomial distribution

When we throw a six-sided die there are six possible outcomes. This motivates the following question: Is there a generalization of the binomial distribution for the case where we have more than two possible outcomes? Again we can use probability theory as extended logic to derive the appropriate distribution starting from a statement of our prior information.

I ≡ "Proposition E represents an event that is repeated many times and has m possible outcomes represented by propositions, O_1, O_2, ..., O_m. The outcomes of individual events are logically independent, i.e., the probability of getting an outcome O_i in event j is independent of what outcome occurred in any other event."

E = O_1 + O_2 + O_3 + ... + O_m; then for the event E repeated n times:

$$E^n = (O_1 + O_2 + \cdots + O_m)^n.$$

The probability of any particular E^n having O_1 occurring n_1 times, O_2 occurring n_2 times, ..., O_m occurring n_m times, is

$$p(E^n|I) = p(O_1|I)^{n_1}\, p(O_2|I)^{n_2} \cdots p(O_m|I)^{n_m}.$$

Next we need to find the number of sequences having the same number of O_1, O_2, ..., O_m (the multiplicity), independent of the order. We can readily guess at the form of the multiplicity by rewriting Equation (4.7), setting the denominator r!(n − r)! = n_1! n_2!.

$$\text{multiplicity for the two-outcome case} = \frac{n!}{r!\,(n-r)!} = \frac{n!}{n_1!\, n_2!}, \tag{4.19}$$

where n_1 stands for the number of A's and n_2 for the number of Ā's. Now in the current problem, we have m possible outcomes for each event, so,

$$\text{multiplicity for the } m\text{-outcome case} = \frac{n!}{n_1!\, n_2! \cdots n_m!}, \tag{4.20}$$

where $n = \sum_{i=1}^{m} n_i$.

Therefore, the probability of seeing the outcome defined by n_1, n_2, ..., n_m, where n_i ≡ "outcome O_i occurred n_i times," is

$$p(n_1, n_2, \ldots, n_m | E^n, I) = \frac{n!}{n_1!\, n_2! \cdots n_m!} \prod_{i=1}^{m} p(O_i|I)^{n_i}. \tag{4.21}$$

This is called the multinomial distribution.

Compare this with the multinomial expansion:

$$(x_1 + x_2 + \cdots + x_m)^n = \sum \frac{n!}{n_1!\, n_2! \cdots n_m!}\, x_1^{n_1} x_2^{n_2} \cdots x_m^{n_m}, \tag{4.22}$$

where the sum is taken over all possible values of n_i, subject to the constraint that $\sum_{i=1}^{m} n_i = n$.
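As a concrete illustration (ours; scipy's multinomial distribution is assumed to be available), Equation (4.21) for a fair six-sided die thrown 12 times gives the probability that each face appears exactly twice:

```python
# Multinomial probability of the outcome (2, 2, 2, 2, 2, 2) for a fair die, Eq. (4.21).
from scipy.stats import multinomial

counts = [2, 2, 2, 2, 2, 2]          # n_1, ..., n_6, summing to n = 12
probs = [1.0 / 6] * 6                # p(O_i|I) for a fair die
print(multinomial.pmf(counts, n=12, p=probs))   # about 0.0034
```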

4.4 Can you really answer that question?

Let I ≡ "A tin contains N buttons, identical in all respects except that M are black and the remainder are white."

What is the probability that you will pick a black button on the first draw, assuming you are blindfolded? The answer is clearly M/N. What is the probability that you will pick a black button on the second draw if you know that a black button was picked on the first and not put back in the tin (sampling without replacement)?

Let B_i ≡ "A black button was picked on the ith draw."

Let W_i ≡ "A white button was picked on the ith draw."

Then

$$p(B_2|B_1,I) = \frac{M-1}{N-1},$$

because for the second draw there is one less black button and one less button in total.

Now, what is the probability of picking a black button on the second draw, p(B_2|I), when we are not told what color was picked on the first draw? In this case the answer might appear to be indeterminate, but as we shall show, questions of this kind can be answered using probability theory as extended logic.

We know that either B_1 or W_1 is true, which can be expressed as the Boolean equation B_1 + W_1 = 1. Thus we can write:

$$B_2 = (B_1 + W_1), B_2 = B_1, B_2 + W_1, B_2.$$

But according to Jaynes' consistency requirement (see Section 2.5.1), equivalent states of knowledge must be represented by equivalent plausibility assignments. Therefore

$$\begin{aligned}
p(B_2|I) &= p(B_1, B_2|I) + p(W_1, B_2|I) \\
&= p(B_1|I)\, p(B_2|B_1,I) + p(W_1|I)\, p(B_2|W_1,I) \\
&= \frac{M}{N}\,\frac{M-1}{N-1} + \frac{N-M}{N}\,\frac{M}{N-1} \\
&= \frac{M}{N}.
\end{aligned} \tag{4.23}$$

In like fashion, we can show

$$p(B_3|I) = \frac{M}{N}.$$

The probability of black at any draw, if we do not know the result of any other draw, is always the same.
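Before summarizing the method, a quick Monte Carlo check (our addition, not from the book) makes the result plausible: simulating draws without replacement, the second button is black a fraction of the time that fluctuates around M/N when we ignore what the first draw was.

```python
# Simulation of p(B2|I) = M/N for a tin of N buttons, M of them black (our sketch).
import random

M, N, trials = 3, 10, 200_000
tin = ["black"] * M + ["white"] * (N - M)

hits = sum(random.sample(tin, 2)[1] == "black" for _ in range(trials))
print(hits / trials)   # fluctuates around M/N = 0.3
```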

The method used to obtain this result is very useful.

1. Resolve the quantity whose probability is wanted into mutually exclusive sub-propositions:³
$$B_3 = (B_1 + W_1), (B_2 + W_2), B_3 = B_1,B_2,B_3 + B_1,W_2,B_3 + W_1,B_2,B_3 + W_1,W_2,B_3.$$
2. Apply the sum rule.
3. Apply the product rule.

If the sub-propositions are well chosen (i.e., they have a simple meaning in the context of the problem), their probabilities are often calculable.

While we are on the topic of sampling without replacement, let's introduce the hypergeometric distribution (see Jaynes, 2003). This gives the probability of drawing r

³In his book, Rational Descriptions, Decisions and Designs, M. Tribus refers to this technique as extending the conversation. In many problems, there are many pieces of information which do not seem to fit together in any simple mathematical formulation. The technique of extending the conversation provides a formal method for introducing this information into the calculation of the desired probability.

black buttons (blindfolded) in n tries from a tin containing N buttons, identical in all respects except that M are black and the remainder are white.

$$p(r|N,M,n) = \frac{\binom{M}{r}\binom{N-M}{n-r}}{\binom{N}{n}}, \tag{4.24}$$

where

$$\binom{M}{r} = \frac{M!}{r!\,(M-r)!}, \quad \text{etc.} \tag{4.25}$$

Box 4.2 Mathematica evaluation of hypergeometric distribution:

Needs["Statistics`DiscreteDistributions`"]

PDF[HypergeometricDistribution[n, nsucc, ntot], r]

gives the probability of r successes in n trials corresponding to sampling without replacement from a population of size ntot with nsucc potential successes.
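An equivalent sketch in Python (ours, assuming scipy; note that scipy.stats.hypergeom takes its arguments in the order population size, number of successes in the population, number of draws):

```python
# Hypergeometric probabilities of Eq. (4.24) for a tin of N = 10 buttons,
# M = 3 black, drawing n = 4 without replacement (our illustration).
from scipy.stats import hypergeom

N, M, n = 10, 3, 4
for r in range(M + 1):
    print(r, hypergeom.pmf(r, N, M, n))   # p(r|N,M,n)
```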

4.5 Logical versus causal connections

We now need to clear up an important distinction between a logical connection between two propositions and a causal connection. In the previous problem with M black buttons and N − M white buttons, it is clear that p(B_j|B_{j−1}, I) < p(B_j|I), since we know there is one less black button in the tin when we take our next pick. Clearly, what was drawn on earlier draws can affect what will happen in later draws. We can say there is some kind of partial causal influence of B_{j−1} on B_j.

Now suppose we ask the question: what is the probability p(B_{j−1}|B_j, I)? Clearly, in this case what we get on a later draw can have no effect on what occurs on an earlier draw, so it may be surprising to learn that p(B_{j−1}|B_j, I) = p(B_j|B_{j−1}, I). Consider the following simple proof (Jaynes, 2003). From the product rule we write

$$p(B_{j-1}, B_j|I) = p(B_{j-1}|B_j,I)\, p(B_j|I) = p(B_j|B_{j-1},I)\, p(B_{j-1}|I).$$

But we have just seen that p(B_j|I) = p(B_{j−1}|I) = M/N for all j, so

$$p(B_{j-1}|B_j,I) = p(B_j|B_{j-1},I), \tag{4.26}$$

or more generally,

$$p(B_k|B_j,I) = p(B_j|B_k,I), \quad \text{for all } j, k. \tag{4.27}$$

How can information about a later draw affect the probability of an earlier draw?

Recall that in Bayesian analysis, probabilities are an encoding of our state of knowledge about some question. Performing the later draw does not physically affect the number M_j of black buttons in the tin at the jth draw. However, information about the result of a later draw has the same effect on our state of knowledge about what could have been taken on the jth draw, as does information about an earlier draw. Bayesian probability theory is concerned with all logical connections between propositions, independent of whether there are causal connections.

Example 1:

I ≡ "A shooting has occurred and the police arrest a suspect on the same day."

A ≡ "Suspect is guilty of shooting."

B ≡ "A gun is found seven days after the shooting with the suspect's fingerprints on it."

Clearly, B is not a partial cause of A, but still we conclude that p(A|B,I) > p(A|I).

Example 2:

I ≡ "A virulent virus invades Montreal. Anyone infected loses their hair a month before dying."

A ≡ "The mayor of Montreal lost his hair in September."

B ≡ "The mayor of Montreal died in October."

Again, in this case, p(A|B,I) > p(A|I).

Although a logical connection does not imply a causal connection, a causal connection does imply a logical connection, so we can certainly use probability theory to address possible causal connections.

4.6 Exchangeable distributions

In the previous section, we learned that information about the result of a later draw has the same effect on our state of knowledge about what could have been taken on the jth draw, as does information about an earlier one. Every draw has the same relevance to every other draw, regardless of their time order. For example, p(B_j|B_{j−1}, B_{j−2}, I) = p(B_j|B_{j+1}, B_{j+2}, I), where again B_j is the proposition asserting a black button on the jth draw. The only thing that is significant about the knowledge


