Probability theory as extended logic

2.1 Overview

The goal of this chapter is to provide an extension of logic to handle situations where we have incomplete information so we may arrive at the relative probabilities of competing propositions (theories, hypotheses, or models) for a given state of information. We start by reviewing the algebra of logical propositions and explore the structure (syllogisms) of deductive and plausible inference. We then set off on a course to come up with a quantitative theory of plausible inference (probability theory as extended logic) based on the three desirable goals called desiderata. This amounts to finding an adequate set of mathematical operations for plausible inference that satisfies the desiderata. The two operations required turn out to be the product rule and sum rule of probability theory. The process of arriving at these operations uncovers a precise operational definition of plausibility, which is determined by the data. The material presented in this chapter is an abridged version of the treatment given by E. T. Jaynes in his book, Probability Theory – The Logic of Science (Jaynes, 2003), with permission from Cambridge University Press.

2.2 Fundamentals of logic

2.2.1 Logical propositions

In general, we will represent propositions by capital letters {A, B, C, etc.}. A proposition asserts that something is true.

e.g., $A \equiv$ “The age of the specimen is $10^6$ years.”

The denial of a proposition is indicated by a bar:

$\overline{A} \equiv$ “A is false.”

We will only be concerned with two-valued logic; thus, any proposition has a truth value of either True or False (equivalently, 1 or 0).

2.2.2 Compound propositions

$A, B$ asserts both A and B are true (logical product or conjunction).

$A, \overline{A}$ is an impossible statement, with truth value F or zero.

$A + B$ asserts A is true or B is true or both are true (logical sum or disjunction).

$A, \overline{B} + B, \overline{A}$ asserts either A is true or B is true but both are not true (exclusive form of logical sum).

2.2.3 Truth tables and Boolean algebra

Consider the two compound propositions $A = \overline{B, C}$ and $D = \overline{B} + \overline{C}$. Are the propositions A and D equal? Two propositions are equal if they have the same truth value. We can verify that $A = D$ by constructing a truth table which lays out the truth values for A and D for all the possible combinations of the truth values of the propositions B and C on which they are based (Table 2.1).

Since A and D have the same truth value for all possible truth values of propositions B and C, we can write

$A = D$ (which means they are logically equivalent).

We have thus established the relationship

$$\overline{B, C} = \overline{B} + \overline{C} \quad \text{and} \quad \overline{B, C} \neq \overline{B}, \overline{C}. \qquad (2.1)$$

In addition, the last two columns of the table establish the relationship

$$\overline{B}, \overline{C} = \overline{B + C}. \qquad (2.2)$$

Boole (1854) pointed out that the propositional statements in symbolic logic obey the rules of algebra provided one interprets them as having values of 1 or 0 (Boolean algebra). There are no operations equivalent to subtraction or division. The only operations required are multiplications (‘and’) and additions (‘or’).

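Table 2.1

| $B$ | $C$ | $B, C$ | $A = \overline{B, C}$ | $D = \overline{B} + \overline{C}$ | $B + C$ | $\overline{B + C}$ | $\overline{B}, \overline{C}$ |
|---|---|---|---|---|---|---|---|
| T | T | T | F | F | T | F | F |
| T | F | F | T | T | T | F | F |
| F | T | F | T | T | T | F | F |
| F | F | F | T | T | F | T | T |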
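Because a truth-table comparison is purely mechanical, it is easy to automate. The following is a minimal Python sketch, not part of the original treatment, that enumerates every truth assignment and checks Equations (2.1) and (2.2). The helper name `equivalent` is our own choice.

```python
from itertools import product

def equivalent(f, g, n):
    """Return True if f and g agree on all 2**n truth assignments."""
    return all(f(*v) == g(*v) for v in product([True, False], repeat=n))

# Equation (2.1): not (B and C) equals (not B) or (not C)
eq_2_1 = equivalent(lambda b, c: not (b and c),
                    lambda b, c: (not b) or (not c), 2)

# Equation (2.2): (not B) and (not C) equals not (B or C)
eq_2_2 = equivalent(lambda b, c: (not b) and (not c),
                    lambda b, c: not (b or c), 2)

# The inequality in (2.1): not (B and C) differs from (not B) and (not C)
same_as_wrong_form = equivalent(lambda b, c: not (b and c),
                                lambda b, c: (not b) and (not c), 2)

print(eq_2_1, eq_2_2, same_as_wrong_form)  # True True False
```

The same helper works for three or more propositions by changing `n`, at the cost of 2**n evaluations.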

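Box 2.1 Worked exercise:

construct a truth table to show $A, (B + C) = A, B + A, C$.

| $A$ | $B$ | $C$ | $B + C$ | $A, (B + C)$ | $A, B$ | $A, C$ | $A, B + A, C$ |
|---|---|---|---|---|---|---|---|
| T | T | T | T | T | T | T | T |
| T | F | F | F | F | F | F | F |
| T | T | F | T | T | T | F | T |
| T | F | T | T | T | F | T | T |
| F | T | T | T | F | F | F | F |
| F | F | F | F | F | F | F | F |
| F | T | F | T | F | F | F | F |
| F | F | T | T | F | F | F | F |

Since $A, (B + C)$ and $A, B + A, C$ have the same truth value for all possible truth values of propositions A, B and C, we can write

$A, (B + C) = A, B + A, C$. (This is a distributivity identity.)

One surprising result of Boolean algebra manipulations is that a given statement may take several different forms which don’t resemble one another.

For example, show that $D = A + B, C = (A + B), (A + C)$.

In the proof below, we make use of the relationships $\overline{X + Y} = \overline{X}, \overline{Y}$ (on lines 1 and 3) and $\overline{X, Y} = \overline{X} + \overline{Y}$ (on lines 2 and 4), from Equations (2.2) and (2.1).

$$\overline{D} = \overline{A + B, C} = \overline{A}, \overline{B, C}$$

$$\overline{D} = \overline{A}, (\overline{B} + \overline{C})$$

$$\overline{D} = \overline{A}, \overline{B} + \overline{A}, \overline{C} = \overline{(A + B)} + \overline{(A + C)}$$

$$\overline{D} = \overline{(A + B), (A + C)}$$

$$D = (A + B), (A + C) \quad \text{or} \quad A + B, C = (A + B), (A + C).$$

This can also be verified by constructing a truth table.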

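By the application of these identities, one can prove any number of further relations, some highly non-trivial. For example, we shall presently have use for the rather elementary “theorem”:

If $\overline{B} = A, D$, then

$$A, \overline{B} = A, A, D = A, D = \overline{B}, \quad \text{so} \quad A, \overline{B} = \overline{B}. \qquad (2.3)$$

Also, we can show that:

$$B, \overline{A} = \overline{A}. \qquad (2.4)$$

Proof of the latter follows from

$$B = \overline{A, D} = \overline{A} + \overline{D}$$

$$B, \overline{A} = \overline{A}, \overline{A} + \overline{A}, \overline{D} = \overline{A} + \overline{A}, \overline{D} = \overline{A}. \qquad (2.5)$$

Clearly, Equation (2.5) is true if $\overline{A}$ is true and false if $\overline{A}$ is false, regardless of the truth of $\overline{D}$.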
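Unlike the earlier identities, this theorem holds only under a premise. A brute-force check therefore has to restrict attention to truth assignments that satisfy the premise. The sketch below (our own illustration, assuming the premise reads "not-B equals A and D") enumerates A and D, lets the premise fix B, and tests both conclusions.

```python
from itertools import product

results = []
for a, d in product([True, False], repeat=2):
    not_b = a and d          # premise: not-B = (A and D)
    b = not not_b            # B is then fixed by the premise
    eq_2_3 = (a and not_b) == not_b      # A, not-B = not-B
    eq_2_4 = (b and (not a)) == (not a)  # B, not-A = not-A
    results.append((a, d, eq_2_3, eq_2_4))

for row in results:
    print(row)

theorem_holds = all(r[2] and r[3] for r in results)
```

Here `theorem_holds` is true because each of the four rows passes both checks. Dropping the premise (letting B vary freely) would make the identities fail for some rows.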

2.2.4 Deductive inference

Deductive inference is the process of reasoning from one proposition to another. It was recognized by Aristotle (fourth century BC) that deductive inference can be analyzed into repeated applications of the strong syllogisms:

1. If A is true, then B is true (major premise)
   A is true (minor premise)
   Therefore B is true (conclusion)

2. If A is true, then B is true
   B is false
   Therefore A is false

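Basic Boolean Identities

Idempotence: $A, A = A$

$A + A = A$

Commutativity: $A, B = B, A$

$A + B = B + A$

Associativity: $A, (B, C) = (A, B), C = A, B, C$

$A + (B + C) = (A + B) + C = A + B + C$

Distributivity: $A, (B + C) = A, B + A, C$

$A + (B, C) = (A + B), (A + C)$

Duality: If $C = A, B$, then $\overline{C} = \overline{A} + \overline{B}$

If $D = A + B$, then $\overline{D} = \overline{A}, \overline{B}$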

In Boolean algebra, the major premise of these strong syllogisms can be written as:

$$A = A, B. \qquad (2.6)$$

This equation says that the truth value of proposition $A, B$ is equal to the truth value of proposition A. It does not assert that either A or B is true. If A is true, then $A, B$ must also be true, and so B must be true. Clearly, if B is false, then the right hand side of the equation equals 0, and so A must be false. On the other hand, if B is known to be true, then according to Equation (2.6), proposition A can be true or false. Equation (2.6) is also written as the implication operation $A \Rightarrow B$.
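The reading of Equation (2.6) as an implication can be checked by listing the truth assignments it allows. The following Python sketch is our own illustration. It takes the usual material implication, (not A) or B, as the meaning of "A implies B", and confirms both strong syllogisms as well as the fact that knowing B is true leaves A undetermined.

```python
from itertools import product

def implies(a, b):
    """Material implication A => B, taken here as (not A) or B."""
    return (not a) or b

pairs = list(product([True, False], repeat=2))  # TT, TF, FT, FF

# Equation (2.6), A = A,B, holds in exactly the rows where A => B is true.
same_rows = all((a == (a and b)) == implies(a, b) for a, b in pairs)

# Keep only the truth assignments consistent with the major premise.
allowed = [(a, b) for a, b in pairs if a == (a and b)]

# Strong syllogism 1: among allowed rows with A true, B is always true.
syllogism_1 = all(b for a, b in allowed if a)
# Strong syllogism 2: among allowed rows with B false, A is always false.
syllogism_2 = all(not a for a, b in allowed if not b)
# B true leaves A undetermined: both values of A remain allowed.
a_values_when_b_true = {a for a, b in allowed if b}

print(allowed)  # [(True, True), (False, True), (False, False)]
```

Only the assignment "A true, B false" is excluded by the premise, which is why deduction works in the two directions listed above and nowhere else.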

2.2.5 Inductive or plausible inference

In almost all situations confronting us, we do not have the information required to do deductive inference. We have to fall back on weaker syllogisms:

If A is true, then B is true
B is true
Therefore A becomes more plausible

Example

$A \equiv$ “It will start to rain by 10 AM at the latest.”

$B \equiv$ “The sky becomes cloudy before 10 AM.”

Observing clouds at 9:45 AM does not give us logical certainty that rain will follow; nevertheless, our common sense, obeying the weak syllogism, may induce us to change our plans and behave as if we believed that it will rain, if the clouds are sufficiently dark.

This example also shows that the major premise “If A then B” expresses B only as a logical consequence of A and not necessarily as a causal consequence (i.e., the rain is not the cause of the clouds).

Another weak syllogism:

If A is true, then B is true
A is false
Therefore B becomes less plausible
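Although the quantitative theory is only developed later in the chapter, the direction of both weak syllogisms can be illustrated with a toy calculation. The sketch below is our own and rests on two loudly stated assumptions: the weights attached to the three assignments allowed by “if A then B” are invented for illustration, and a share of total weight is used as a stand-in for plausibility.

```python
from fractions import Fraction

# Hypothetical weights for the three (A, B) assignments allowed by "if A then B".
# The numbers are illustrative assumptions, not values from the text.
weights = {
    (True, True): Fraction(2, 10),    # A true, B true
    (False, True): Fraction(3, 10),   # A false, B true
    (False, False): Fraction(5, 10),  # A false, B false
}

def share(event, given=lambda a, b: True):
    """Fraction of the weight satisfying `given` that also satisfies `event`."""
    total = sum(w for (a, b), w in weights.items() if given(a, b))
    hit = sum(w for (a, b), w in weights.items() if given(a, b) and event(a, b))
    return hit / total

a_before = share(lambda a, b: a)                                 # 1/5
a_given_b = share(lambda a, b: a, given=lambda a, b: b)          # 2/5
b_before = share(lambda a, b: b)                                 # 1/2
b_given_not_a = share(lambda a, b: b, given=lambda a, b: not a)  # 3/8

print(a_before, a_given_b, b_before, b_given_not_a)  # 1/5 2/5 1/2 3/8
```

With these weights, learning B raises the share for A from 1/5 to 2/5, and learning that A is false lowers the share for B from 1/2 to 3/8, matching the two weak syllogisms qualitatively.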

2.3 Brief history

The early work on probability theory by James Bernoulli (1713), Rev. Thomas Bayes (1763), and Pierre Simon Laplace (1774), viewed probability as an extension of logic to the case where, because of incomplete information, Aristotelian deductive reasoning is unavailable. Unfortunately, Laplace failed to give convincing arguments to show why the Bayesian definition of probability uniquely required the sum and product rules for manipulating probabilities. The frequentist definition of probability was introduced to satisfy this point, but in the process, eliminated the interpretation of probability as extended logic. This caused a split in the subject into the Bayesian and frequentist camps. The frequentist approach dominated statistical inference throughout most of the twentieth century, but the Bayesian viewpoint was kept alive notably by Sir Harold Jeffreys (1891–1989).

In the 1940s and 1950s, G. Polya, R. T. Cox and E. T. Jaynes provided the missing rationale for Bayesian probability theory. In his book Mathematics and Plausible Reasoning, George Polya dissected our “common sense” into a set of elementary desiderata and showed that mathematicians had been using them all along to guide the early stages of discovery, which necessarily precede the finding of a rigorous proof.

When one added (see Section 2.5.1) the consistency desiderata of Cox (1946) and Jaynes, the result was a proof that, if degrees of plausibility are represented by real numbers, then there is a unique set of rules for conducting inference according to Polya’s desiderata which provides for an operationally defined scale of plausibility.

The final result was just the standard product and sum rules of probability theory, given axiomatically by Bernoulli and Laplace! The important new feature is that these rules are now seen as uniquely valid principles of logic in general, making no reference to “random variables”, so their range of application is vastly greater than that supposed in the conventional probability theory that was developed in the early twentieth century. With this came a revival of the notion of probability theory as extended logic.

The work of Cox and Jaynes was little appreciated at first. Widespread application of Bayesian methodology did not occur until the 1980s. By this time computers had become sufficiently powerful to demonstrate that the methodology could outperform standard techniques in many areas of science. We are now in the midst of a “Bayesian Revolution” in statistical inference. In spite of this, many scientists are still unaware of the significance of the revolution and the frequentist approach currently dominates statistical inference. New graduate students often find themselves caught between the two cultures. This book represents an attempt to provide a bridge.

2.4 An adequate set of operations

So far, we have discussed the following logical operations:

$A, B$ logical product (conjunction)
$A + B$ logical sum (disjunction)
$A \Rightarrow B$ implication
$\overline{A}$ negation

By combining these operations repeatedly in every possible way, we can generate any number of new propositions, such as:

$$C \equiv (A + \overline{B}), (\overline{A} + A, \overline{B}) + \overline{A}, B, (A + B). \qquad (2.7)$$

We now consider the following questions:

1. How large is the class of new propositions?

2. Is it infinite or finite?

3. Can every proposition defined from A and B be represented in terms of the above operations, or are new operations required?

4. Are the four operations already over-complete?

Note: two propositions are not different from the standpoint of logic if they have the same truth value. C, in the above equation, is logically the same statement as the implication $C = (B \Rightarrow \overline{A})$. Recall that the implication $B \Rightarrow \overline{A}$ can also be written as $B = \overline{A}, B$. This does not assert that either $\overline{A}$ or B is true; it only means that $A, B$ is false, or equivalently that $(\overline{A} + \overline{B})$ is true.

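Box 2.2 Worked exercise:

expand the right hand side (RHS) of proposition C given by Equation (2.7), and show that it can be reduced to $(\overline{A} + \overline{B})$.

$$\text{RHS} = A, \overline{A} + \overline{A}, \overline{B} + A, A, \overline{B} + A, \overline{B}, \overline{B} + \overline{A}, A, B + \overline{A}, B, B$$

Drop all terms that are clearly impossible (false), e.g., $A, \overline{A}$. Adding any number of impossible propositions to a proposition in a logical sum does not alter the truth value of the proposition. It is like adding a zero to a function; it doesn’t alter the value of the function.

$$= \overline{A}, \overline{B} + A, \overline{B} + \overline{A}, B$$

$$= \overline{A}, (\overline{B} + B) + A, \overline{B} = \overline{A} + A, \overline{B}$$

$$= \overline{\overline{\overline{A} + A, \overline{B}}} = \overline{A, \overline{A, \overline{B}}} = \overline{A, (\overline{A} + B)} = \overline{A, B}$$

$$= \overline{A} + \overline{B}.$$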

2.4.1 Examination of a logic function

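Any logic function $C = f(A, B)$ has only two possible values, and likewise for the independent variables A and B. A logic function with n variables is defined on a discrete space consisting of only $m = 2^n$ points. For example, in the case of $C = f(A, B)$, $m = 4$ points; namely those at which A and B take on the values {TT, TF, FT, FF}. The number of independent logic functions is $2^m = 16$. Table 2.2 lists these 16 logical functions.

We can show that $f_5$ through $f_{16}$ are logical sums of $f_1$ through $f_4$.

Example 1:

$$f_1 + f_3 + f_4 = A, B + \overline{A}, B + \overline{A}, \overline{B}$$

$$= B + \overline{A}, \overline{B} = (B + \overline{A}), (B + \overline{B}) \quad \text{(last step is a distributivity identity)}$$

$$= B + \overline{A}$$

$$= f_8. \qquad (2.8)$$

Example 2:

$$f_2 + f_4 = A, \overline{B} + \overline{A}, \overline{B}$$

$$= (A + \overline{A}), \overline{B} = \overline{B}$$

$$= f_{13}. \qquad (2.9)$$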

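Table 2.2 Logic functions of the two propositions A and B.

| $A, B$ | TT | TF | FT | FF | |
|---|---|---|---|---|---|
| $f_1(A, B)$ | T | F | F | F | $= A, B$ |
| $f_2(A, B)$ | F | T | F | F | $= A, \overline{B}$ |
| $f_3(A, B)$ | F | F | T | F | $= \overline{A}, B$ |
| $f_4(A, B)$ | F | F | F | T | $= \overline{A}, \overline{B}$ |
| $f_5(A, B)$ | T | T | T | T | |
| $f_6(A, B)$ | T | T | T | F | |
| $f_7(A, B)$ | T | T | F | T | |
| $f_8(A, B)$ | T | F | T | T | |
| $f_9(A, B)$ | F | T | T | T | |
| $f_{10}(A, B)$ | T | T | F | F | |
| $f_{11}(A, B)$ | T | F | T | F | |
| $f_{12}(A, B)$ | F | T | T | F | |
| $f_{13}(A, B)$ | F | T | F | T | |
| $f_{14}(A, B)$ | F | F | T | T | |
| $f_{15}(A, B)$ | T | F | F | T | |
| $f_{16}(A, B)$ | F | F | F | F | $= A, \overline{A}$ |

The method used in the two examples above (called “reduction to disjunctive normal form” in logic textbooks) will work for any n. Thus, one can verify that the three operations conjunction (logical product, AND), disjunction (logical sum, OR), and negation (NOT) suffice to generate all logic functions, i.e., form an adequate set. But the logical sum $A + B$ is the same as denying that they are both false: $A + B = \overline{\overline{A}, \overline{B}}$. Therefore AND and NOT are already an adequate set.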
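The reduction to logical sums of $f_1$ through $f_4$ can be carried out exhaustively by machine. The Python sketch below is our own illustration, and the helper names `table` and `logical_sum` are assumptions. It forms every subset of the four basic conjunctions, confirms that 16 distinct truth tables result, and reproduces the two worked examples. It also evaluates proposition C of Equation (2.7) to show which entry of Table 2.2 it corresponds to.

```python
from itertools import product

POINTS = [(True, True), (True, False), (False, True), (False, False)]  # TT, TF, FT, FF

# The four basic conjunctions f1..f4 of Table 2.2, each true at exactly one point.
minterms = [
    lambda a, b: a and b,              # f1
    lambda a, b: a and not b,          # f2
    lambda a, b: (not a) and b,        # f3
    lambda a, b: (not a) and (not b),  # f4
]

def table(f):
    """Truth values of f at TT, TF, FT, FF, written as a string such as 'TFTT'."""
    return "".join("T" if f(a, b) else "F" for a, b in POINTS)

def logical_sum(terms):
    """Logical sum (OR) of a list of two-argument logic functions."""
    return lambda a, b: any(t(a, b) for t in terms)

# Every subset of the four minterms gives one logic function: 2**4 = 16 in all.
all_tables = set()
for mask in product([False, True], repeat=4):
    chosen = [m for m, keep in zip(minterms, mask) if keep]
    all_tables.add(table(logical_sum(chosen)))

f8 = table(logical_sum([minterms[0], minterms[2], minterms[3]]))  # f1 + f3 + f4
f13 = table(logical_sum([minterms[1], minterms[3]]))              # f2 + f4

# Proposition C of Equation (2.7), written with Python's and/or/not.
c_2_7 = table(lambda a, b: ((a or not b) and ((not a) or (a and not b)))
              or ((not a) and b and (a or b)))

print(len(all_tables), f8, f13, c_2_7)  # 16 TFTT FTFT FTTT
```

The pattern FTTT is the row labelled $f_9$ in Table 2.2, which is false only when A and B are both true.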

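Is there a still smaller set? Answer: Yes.

NAND, defined as the negation of AND, is represented by $A \uparrow B$.

$$A \uparrow B \equiv \overline{A, B} = \overline{A} + \overline{B}$$

$$\overline{A} = A \uparrow A$$

$$A, B = (A \uparrow B) \uparrow (A \uparrow B)$$

$$A + B = (A \uparrow A) \uparrow (B \uparrow B).$$

Every logic function can be constructed from NAND alone.

The NOR operator is defined by:

$$A \downarrow B \equiv \overline{A + B} = \overline{A}, \overline{B}$$

and is also powerful enough to generate all logic functions.

$$\overline{A} = A \downarrow A$$

$$A + B = (A \downarrow B) \downarrow (A \downarrow B)$$

$$A, B = (A \downarrow A) \downarrow (B \downarrow B).$$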
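The NAND and NOR constructions of NOT, AND and OR are short enough to verify directly. The following is a minimal Python sketch of our own; the function names `nand` and `nor` are chosen for the illustration.

```python
from itertools import product

def nand(a, b):
    """NAND: the negation of (A and B)."""
    return not (a and b)

def nor(a, b):
    """NOR: the negation of (A or B)."""
    return not (a or b)

BOOLS = [True, False]

# NOT, AND and OR built from NAND alone.
nand_ok = all(
    nand(a, a) == (not a)
    and nand(nand(a, b), nand(a, b)) == (a and b)
    and nand(nand(a, a), nand(b, b)) == (a or b)
    for a, b in product(BOOLS, repeat=2)
)

# NOT, OR and AND built from NOR alone.
nor_ok = all(
    nor(a, a) == (not a)
    and nor(nor(a, b), nor(a, b)) == (a or b)
    and nor(nor(a, a), nor(b, b)) == (a and b)
    for a, b in product(BOOLS, repeat=2)
)

print(nand_ok, nor_ok)  # True True
```

Since AND and NOT are already an adequate set, recovering both from a single operator is enough to show that the operator is adequate by itself.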

2.5 Operations for plausible inference

We now turn to the extension of logic for a common situation where we lack the axiomatic information necessary for deductive logic. The goal, according to Jaynes, is to arrive at a useful mathematical theory of plausible inference which will enable us to build a robot (write a computer program) to quantify the plausibility of any hypothesis in our hypothesis space of interest based on incomplete information. For example, given $10^7$ observations, determine (in the light of these data and whatever prior information is at hand) the relative plausibilities of many different hypotheses about the causes at work.

We expect that any mathematical model we succeed in constructing will be replaced by more complete ones in the future as part of the much grander goal of developing a theory of common sense reasoning. Experience in physics has shown that as knowledge advances, we are able to invent better models, which reproduce more features of the real world, with more accuracy. We are also accustomed to finding that these advances lead to consequences of great practical value, like a computer program to carry out useful plausible inference following clearly defined principles (rules or operations) expressing an idealized common sense.

The rules of plausible inference are deduced from a set of three desiderata (see Section 2.5.1) rather than axioms, because they do not assert anything is true, but only state what appear to be desirable goals. We would definitely want to revise the operation of our robot or computer program if it violated one of these elementary desiderata. Whether these goals are attainable without contradiction and whether they determine any unique extension of logic are matters for mathematical analysis. We also need to compare the inference of a robot built in this way to our own reasoning, to decide whether we are prepared to trust the robot to help us with our inference problems.

2.5.1 The desiderata of Bayesian probability theory

I. Degrees of plausibility are represented by real numbers.

II. The measure of plausibility must exhibit qualitative agreement with rationality. This means that as new information supporting the truth of a proposition is supplied, the number which represents the plausibility will increase continuously and monotonically. Also, to maintain rationality, the deductive limit must be obtained where appropriate.

III. Consistency

(a) Structural consistency: If a conclusion can be reasoned out in more than one way, every possible way must lead to the same result.

(b) Propriety: The theory must take account of all information, provided it is relevant to the question.

(c) Jaynes consistency: Equivalent states of knowledge must be represented by equivalent plausibility assignments. For example, if $A, B|C = B|C$, then the plausibility of $A, B|C$ must equal the plausibility of $B|C$.

2.5.2 Development of the product rule

In Section 2.4 we established that the logical product and negation (AND, NOT) are an adequate set of operations to generate any proposition derivable from $\{A_1, \ldots, A_N\}$. For Bayesian inference, our goal is to find operations (rules) to determine the plausibility of logical conjunction and negation that satisfy the above desiderata. Start with the plausibility of $A, B$:

Let $(A, B|C) \equiv$ plausibility of $A, B$ supposing the truth of C.

Remember, we are going to represent plausibility by real numbers (desideratum I). Now $(A, B|C)$ must be a function of some combination of $(A|C)$, $(B|C)$, $(B|A, C)$, $(A|B, C)$.

There are 11 possibilities:

$(A, B|C) = F_1[(A|C), (A|B, C)]$
$(A, B|C) = F_2[(A|C), (B|C)]$
$(A, B|C) = F_3[(A|C), (B|A, C)]$
$(A, B|C) = F_4[(A|B, C), (B|C)]$
$(A, B|C) = F_5[(A|B, C), (B|A, C)]$
$(A, B|C) = F_6[(B|C), (B|A, C)]$
$(A, B|C) = F_7[(A|C), (A|B, C), (B|C)]$
$(A, B|C) = F_8[(A|C), (A|B, C), (B|A, C)]$
$(A, B|C) = F_9[(A|C), (B|C), (B|A, C)]$
$(A, B|C) = F_{10}[(A|B, C), (B|C), (B|A, C)]$
$(A, B|C) = F_{11}[(A|C), (A|B, C), (B|C), (B|A, C)]$

Box 2.3 Note on the use of the “=” sign

1. In Boolean algebra, the equals sign is used to denote equal truth value. By definition, $A = B$ asserts that A is true if and only if B is true.

2. When talking about plausibility, which is represented by a real number, $(A, B|C) = \ldots$ means equal numerically.

3. $\equiv$ means equal by definition.

Now let us examine these 11 different functions more closely. Since the order in which the symbols A and B appear has no meaning (i.e., $A, B = B, A$), it follows that

$F_1[(A|C), (A|B, C)] = F_6[(B|C), (B|A, C)]$
$F_3[(A|C), (B|A, C)] = F_4[(A|B, C), (B|C)]$
$F_7[(A|C), (A|B, C), (B|C)] = F_9[(A|C), (B|C), (B|A, C)]$
$F_8[(A|C), (A|B, C), (B|A, C)] = F_{10}[(A|B, C), (B|C), (B|A, C)]$

This reduces the number of distinct functional forms from 11 to 7. The seven functions remaining are $F_1, F_2, F_3, F_5, F_7, F_8, F_{11}$.
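The counts 11 and 7 can be reproduced by a short enumeration. In the Python sketch below (our own illustration) each candidate function is identified with the set of plausibilities it takes as arguments, written as plain string labels. Two candidates are treated as the same form when interchanging the symbols A and B maps one argument set onto the other.

```python
from itertools import combinations

# The four candidate arguments, written as plain labels.
ARGS = ["A|C", "B|C", "A|B,C", "B|A,C"]

# Interchanging the symbols A and B maps each argument to its partner.
SWAP = {"A|C": "B|C", "B|C": "A|C", "A|B,C": "B|A,C", "B|A,C": "A|B,C"}

# All argument sets with at least two members: 6 + 4 + 1 = 11.
candidates = [frozenset(c) for r in (2, 3, 4) for c in combinations(ARGS, r)]

# Group each candidate with its image under the A <-> B interchange.
classes = {frozenset({s, frozenset(SWAP[x] for x in s)}) for s in candidates}

print(len(candidates), len(classes))  # 11 7
```

Three of the seven classes contain a single argument set that is its own image under the interchange. These correspond to $F_2$, $F_5$ and $F_{11}$.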

If any function leads to an absurdity in even one example, it must be ruled out, even if for other examples it would be satisfactory. Consider

$$(A, B|C) = F_2[(A|C), (B|C)].$$

Suppose

$A \equiv$ next person will have blue left eye.

$B \equiv$ next person will have brown right eye.

$C \equiv$ prior information concerning our expectation that the left and right eye colors of any individual will be very similar.

Now $(A|C)$ could be very plausible, as could $(B|C)$, but $(A, B|C)$ is extremely implausible. We rule out functions of this form because they have no way of taking such influence into account. Our robot could not reason the way humans do, even qualitatively, with that functional form.
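The failure of the $F_2$ form can be made concrete with invented numbers. The Python sketch below is our own illustration: it uses relative weights over (left eye, right eye) as a stand-in for plausibility, and every number in it is an assumption. It describes two states of prior information that give identical values for $(A|C)$ and $(B|C)$ but very different values for $(A, B|C)$.

```python
from fractions import Fraction

# Two hypothetical states of knowledge, as relative weights over
# (left eye, right eye). All numbers are illustrative assumptions.
C1_CORRELATED = {   # eye colors almost always match
    ("blue", "blue"): Fraction(49, 100), ("blue", "brown"): Fraction(1, 100),
    ("brown", "blue"): Fraction(1, 100), ("brown", "brown"): Fraction(49, 100),
}
C2_UNRELATED = {    # left and right eye colors unrelated
    ("blue", "blue"): Fraction(1, 4), ("blue", "brown"): Fraction(1, 4),
    ("brown", "blue"): Fraction(1, 4), ("brown", "brown"): Fraction(1, 4),
}

def summarize(weights):
    """Return stand-ins for (A|C), (B|C), (A,B|C) and (B|A,C)."""
    a = sum(w for (left, _), w in weights.items() if left == "blue")
    b = sum(w for (_, right), w in weights.items() if right == "brown")
    ab = weights[("blue", "brown")]
    return a, b, ab, ab / a

s1 = summarize(C1_CORRELATED)
s2 = summarize(C2_UNRELATED)
print(s1)  # (Fraction(1, 2), Fraction(1, 2), Fraction(1, 100), Fraction(1, 50))
print(s2)  # (Fraction(1, 2), Fraction(1, 2), Fraction(1, 4), Fraction(1, 2))
```

Any function of $(A|C)$ and $(B|C)$ alone receives the same inputs in both cases and so must return the same output, yet the joint values differ by a factor of 25. The conditional quantity $(B|A, C)$ is what separates the two cases.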

Similarly, we can rule out $F_1$ for the extreme case where the conditional (given) information represented by proposition C is that “A and B are independent.” In this extreme case,

$$(A|B, C) = (A|C).$$

Therefore,

$$(A, B|C) = F_1[(A|C), (A|B, C)] = F_1[(A|C), (A|C)], \qquad (2.10)$$

which is clearly absurd because $F_1$ claims that the plausibility of $A, B|C$ depends only on the plausibility of $A|C$.

Other extreme conditions are $A = B$, $A = C$, $C = \overline{A}$, etc. Carrying out this type of analysis, Tribus (1969) shows that all but one of the remaining possibilities can exhibit qualitative violations of common sense in some extreme case. There is only one survivor, which can be written in two equivalent ways:

$$(A, B|C) = F[(B|C), (A|B, C)] = F[(A|C), (B|A, C)]. \qquad (2.11)$$

In addition, desideratum II, qualitative agreement with common sense, requires that $F[(A|C), (B|A, C)]$ must be a continuous monotonic function of $(A|C)$ and $(B|A, C)$. The continuity assumption requires that if $(A|C)$ changes only infinitesimally, it can induce only an infinitesimal change in $(A, B|C)$ or $(\overline{A}|C)$.

Now use desideratum III: “Consistency”

Suppose we want $(A, B, C|D)$.

1. Consider $B, C$ to be a single proposition at first; then we can apply Equation (2.11):

$$(A, B, C|D) = F[(B, C|D), (A|B, C, D)] = F\{F[(C|D), (B|C, D)], (A|B, C, D)\}. \qquad (2.12)$$

2. Consider $A, B$ to be a single proposition at first:

$$(A, B, C|D) = F[(C|D), (A, B|C, D)] = F\{(C|D), F[(B|C, D), (A|B, C, D)]\}. \qquad (2.13)$$

For consistency, 1 and 2 must be equal.

Let $x \equiv (A|B, C, D)$, $y \equiv (B|C, D)$, $z \equiv (C|D)$. Equating (2.12) and (2.13) gives $F\{F[z, y], x\} = F\{z, F[y, x]\}$. Since x, y and z stand for arbitrary plausibilities, interchanging the labels x and z puts this in the standard form:

$$F\{x, F[y, z]\} = F\{F[x, y], z\}. \qquad (2.14)$$

This equation has a long history in mathematics and is called “the Associativity Equation.” Aczél (1966) derives the general solution (Equation (2.15) below) without assuming differentiability; unfortunately, the proof fills 11 pages of his book. R. T. Cox (1961) provided a shorter proof, but assumed differentiability.

The solution is

$$w\{F[x, y]\} = w\{x\}\, w\{y\}, \qquad (2.15)$$

where $w\{x\}$ is any positive continuous monotonic function.
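It is straightforward to check numerically that a rule of the form (2.15) satisfies the associativity equation (2.14), even when F itself is not a plain product. The Python sketch below is our own illustration. The particular choice w(x) = x / (2 - x) is an arbitrary assumption, picked only because it is monotonic on [0, 1], positive for x > 0, and has a simple inverse.

```python
import math
from itertools import product

def w(x):
    """An illustrative monotonic map of [0, 1] onto [0, 1] (our assumption)."""
    return x / (2.0 - x)

def w_inv(u):
    """Inverse of w: solves u = x / (2 - x) for x."""
    return 2.0 * u / (1.0 + u)

def F(x, y):
    """Combination rule defined through w{F[x, y]} = w{x} w{y}."""
    return w_inv(w(x) * w(y))

grid = [0.1, 0.3, 0.5, 0.7, 0.9]

# Equation (2.14): F{x, F[y, z]} = F{F[x, y], z} at every grid point.
associative = all(
    math.isclose(F(x, F(y, z)), F(F(x, y), z), rel_tol=1e-9)
    for x, y, z in product(grid, repeat=3)
)

print(associative)              # True
print(F(0.5, 0.5), 0.5 * 0.5)   # F is not the plain product here
```

For this choice of w, F(0.5, 0.5) is about 0.2 rather than 0.25, so F differs from ordinary multiplication. It is only after mapping plausibilities through w that the rule becomes a product.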

In the case of just two propositions, A, B given the truth of C, the solution to the associativity equation becomes

$$w\{(A, B|C)\} = w\{(A|B, C)\}\, w\{(B|C)\} = w\{(B|A, C)\}\, w\{(A|C)\}. \qquad (2.16)$$

For simplicity, drop the { } brackets, but it should be remembered that the argument of w is a plausibility.

$$w(A, B|C) = w(A|B, C)\, w(B|C) = w(B|A, C)\, w(A|C). \qquad (2.17)$$

Henceforth this will be called the product rule. Recall that at this moment, $w(\cdot)$ is any positive, continuous, monotonic function.

Desideratum II: Qualitative correspondence with common sense imposes further restrictions on $w\{x\}$

Suppose A is certain given C. Then $A, B|C = B|C$ (i.e., same truth value). By our primitive axiom that propositions with the same truth value must have the same plausibility,

$$(A, B|C) = (B|C)$$

$$(A|B, C) = (A|C). \qquad (2.18)$$

Therefore, Equation (2.17), the solution to the associativity equation, becomes

$$w(B|C) = w(A|C)\, w(B|C). \qquad (2.19)$$

This is only true when $A|C$ is certain, and it must then hold no matter how plausible B is given C.

Thus we have arrived at a new constraint on $w(\cdot)$; it must equal 1 when the argument is certain.

For the next constraint, suppose that A is impossible given C. This implies

$$A, B|C = A|C$$

$$A|B, C = A|C.$$

Then

$$w(A, B|C) = w(A|B, C)\, w(B|C) \qquad (2.20)$$

becomes

$$w(A|C) = w(A|C)\, w(B|C). \qquad (2.21)$$

This must be true for any $(B|C)$. There are only two choices: either $w(A|C) = 0$ or $+\infty$.

1. $w(x)$ is a positive, increasing function $(0 \rightarrow 1)$.

2. $w(x)$ is a positive, decreasing function $(\infty \rightarrow 1)$.

They do not differ in content.

Suppose $w_1(x)$ represents impossibility by $+\infty$. We can define $w_2(x) \equiv 1/w_1(x)$, which represents impossibility by 0. Therefore, there is no loss of generality if we adopt:

$$0 \le w(x) \le 1.$$
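To see why the two choices carry the same content, the following short worked step (our own, not reproduced from the text) shows that the reciprocal of a solution of the product rule is again a solution. Starting from the product rule for $w_1$ and substituting $w_1 = 1/w_2$:

```latex
\begin{align*}
w_1(A, B|C) &= w_1(A|B, C)\, w_1(B|C) \\
\frac{1}{w_2(A, B|C)} &= \frac{1}{w_2(A|B, C)} \cdot \frac{1}{w_2(B|C)} \\
w_2(A, B|C) &= w_2(A|B, C)\, w_2(B|C)
\end{align*}
```

Certainty is represented by 1 in both scales, since 1/1 = 1, while impossibility moves from $+\infty$ to 0.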

Summary:

Using our desiderata, we have arrived at our present form of the product rule:

$$w(A, B|C) = w(A|C)\, w(B|A, C) = w(B|C)\, w(A|B, C).$$

At this point we are still not referring to $w(x)$ as the probability of x. $w(x)$ is any continuous, monotonic function satisfying:

$$0 \le w(x) \le 1,$$

where $w(x) = 0$ when the argument x is impossible and 1 when x is certain.

2.5.3 Development of sum rule

We have succeeded in deriving an operation for determining the plausibility of the logical product (conjunction). We now turn to the problem of finding an operation to determine the plausibility of negation. Since the logical sum $A + \overline{A}$ is always true, it follows that the plausibility that A is false must depend on the plausibility that A is true. Thus, there must exist some functional relation
