
Frequentist statistical inference


5.1 Overview

We now begin three chapters which are primarily aimed at a discussion of the main concepts of frequentist statistical inference. This is currently the prevailing approach to much of scientific inference, so a student should understand the main ideas in order to appreciate the current literature and understand the strengths and limitations of this approach.

In this chapter, we introduce the concept of a random variable and discuss some general properties of probability distributions before focusing on a selection of important sampling distributions and their relationships. We also introduce the very important Central Limit Theorem in Section 5.9 and examine this from a Bayesian viewpoint in Section 5.10. The chapter concludes with the topic of how to generate pseudo-random numbers of any desired distribution, which plays an important role in Monte Carlo simulations.

In Chapter 6, we address the question of what a statistic is and give some common important examples. We also consider the meaning of a frequentist confidence interval for expressing the uncertainty in parameter values. The reader should be aware that the study of different statistics is a very big field which we only touch on in this book. Some other topics normally covered in a statistics course, such as the fitting of models to data, are treated from a Bayesian viewpoint in later chapters.

Finally, Chapter 7 concludes our brief summary of frequentist statistical inference with the important topic of frequentist hypothesis testing and discusses an important limitation known as the optional stopping problem.

5.2 The concept of a random variable

Recall from Section 1.1 that conventional ''frequentist'' statistical inference and Bayesian inference employ fundamentally different definitions of probability. In frequentist statistics, when we write the probability p(A), the argument of the probability is called a random variable. It is a quantity that can be considered to take on various values throughout an ensemble or a series of repeated experiments. For example:

1. A measured quantity which contains random errors.

2. Time intervals between successive radioactive decays.


Before proceeding, we need an operational definition of a random variable. From this, we discover that the random variable is not the particular number recorded in one measurement, but rather, it is an abstraction of the measurement operation or observation that gives rise to that number.

Definition: A random variable, X, transforms the possible outcomes of an experiment (measurement operation) to real numbers.

Example: Suppose we are interested in measuring a pollutant's concentration level for each of n time intervals. The observations (procedures for producing a real number) X_1, X_2, ..., X_n form a sample of the pollutant's concentration. Before the instrument actually records the concentration level during the ith trial, the observation, X_i, is a random variable. The recorded value, x_i, is not a random variable, but the actual measured value of the observation, X_i.

Question: Why do we need to have n random variables X_i? Why not one random variable X for which x_1, x_2, ..., x_n are the realizations of the random variable during the n observations?

Answer: Because we often want to determine the joint probability of getting x_1 on trial 1, x_2 on trial 2, etc. If we think of each observation as a random variable, then we can distinguish between situations corresponding to:

1. Sampling with replacement so that no observation is affected by any other (i.e., independent X_1, X_2, ..., X_n). In this case, all observations are random variables with identical probability distributions.

2. Sampling without replacement. In this case, the observations are not independent and hence are characterized by different probability distributions. Think of an urn filled with black and white balls. When we do not replace the drawn balls, the probability of drawing, say, a black ball changes from one draw to the next.

5.3 Sampling theory

The most important aspect of frequentist statistics is the process of drawing conclusions based on sample data drawn from the population (which is the collection of all possible samples). The concept of the population assumes that, in principle, an infinite number of measurements (under identical conditions) are possible. The use of the term random variable conveys the idea of an intrinsic uncertainty in the measurement characterized by an underlying population.

Question: What does the term ‘‘random’’ really mean?

Answer: When we randomize a collection of balls in a bottle by shaking it, this is equivalent to saying that the details of this operation are not understood or are too complicated to handle. It is sometimes necessary to assume that certain complicated details, while undeniably relevant, might nevertheless have little numerical effect on the answers to certain questions, such as the probability of drawing r black balls from a bottle in n trials when n is sufficiently small.

According to E. T. Jaynes (2003), the belief that ''randomness'' is some kind of property existing in nature is a form of the Mind Projection Fallacy, which says, in effect, ''I don't know the detailed causes – therefore Nature is indeterminate.'' For example, later in this chapter we discuss how to write computer programs that generate seemingly ''random'' numbers, yet all of these programs are completely deterministic.

If you did not have a copy of the program, there would be almost no chance of discovering the underlying rule merely by examining more output from the program. The Mind Projection Fallacy might then lead to the claim that no rule exists. At scales where quantum mechanics becomes important, the prevailing view is that nature is indeterminate. In spite of the great successes of the theory of quantum mechanics, physicists readily admit that they currently lack a satisfactory understanding of the subject. The Bayesian viewpoint is that this limitation in scientific inference results from incomplete information.

In both Bayesian and frequentist statistical inference, certain sampling distributions (e.g., binomial, Poisson, Gaussian) play a central role. To the frequentist, the sampling distribution is a model of the probability distribution of the underlying population from which the sample was taken. From this point of view, it makes sense to interpret probabilities as long-run relative frequencies.

In a Bayesian analysis, the sampling distribution is a mathematical description of the uncertainty in predicting the data for any particular model because of incomplete information. It enables us to compute the likelihood p(D|H, I).

In Bayesian analysis, any sampling distribution corresponds to a particular state of knowledge. But as soon as we start accumulating data, our state of knowledge changes. The new information necessarily modifies our probabilities in a way that can be incomprehensible to one who tries to interpret probabilities as physical causations or long-run relative frequencies.

5.4 Probability distributions

Now that we have a better understanding of what a random variable is, let's restate the frequentist definition of probability more precisely. It is commonly referred to as the relative frequency definition.

Relative frequency definition of probability: If an experiment is repeated n times under identical conditions and n_x outcomes yield a value of the random variable X = x, the limit of n_x/n, as n becomes very large,¹ is defined as p(x), the probability that X = x.
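As an illustration of this limiting behaviour, the following Mathematica sketch estimates p(heads) for a simulated fair coin as n grows; the sample sizes and random seed are purely illustrative.

(* Relative frequency n_x/n for a simulated fair coin; the ratio settles near 0.5 as n grows. *)
SeedRandom[1];                                              (* illustrative seed, for reproducibility *)
flips = RandomVariate[BernoulliDistribution[0.5], 10000];   (* 1 = heads, 0 = tails *)
Table[{n, N[Total[Take[flips, n]]/n]}, {n, {10, 100, 1000, 10000}}]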

Experimental outcomes can be either discrete or continuous. Associated with each random variable is a probability distribution. A probability distribution may be quantitatively and conveniently described by two functions, p(x) and F(x), which are given below for the discrete and continuous cases.

¹ See Bernoulli's law of large numbers discussed in Section 4.2.1.

1. Discrete random variables

Probability distribution function (also called the probability mass function): p(x_i) gives the probability of obtaining the particular value of the random variable X = x_i.

(a) p(x) = p{X = x}
(b) p(x) ≥ 0 for all x
(c) Σ_x p(x) = 1

Cumulative probability function: this gives the probability that the random variable will have a value ≤ x.

Figure 5.1 shows the discrete probability distribution (binomial) describing the number of heads in ten throws of a fair coin. The right panel shows the corresponding cumulative distribution function.
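The distribution of Figure 5.1 is easy to reproduce with Mathematica's built-in binomial distribution; the following minimal sketch tabulates p(x) and F(x) and plots only the left panel.

dist = BinomialDistribution[10, 0.5];     (* number of heads in ten throws of a fair coin *)
pmf  = Table[PDF[dist, x], {x, 0, 10}];   (* p(x) *)
cdf  = Table[CDF[dist, x], {x, 0, 10}];   (* F(x) = p{X <= x} *)
DiscretePlot[PDF[dist, x], {x, 0, 10}]    (* cf. left panel of Figure 5.1 *)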

2. Continuous random variables²

Probability density function: f(x)

Figure 5.1 The left panel shows the discrete probabilities for the number of heads in ten throws of a fair coin. The right panel shows the corresponding cumulative distribution function.

² The continuous density function is defined by f(X = x) = lim_{Δx→0} [p(x < X ≤ x + Δx)/Δx].

Cumulative probability density function: F(x) = p{X ≤ x} = ∫_{−∞}^{x} f(u) du.

Figure 5.2 shows an example of a continuous probability density function (left panel) and the corresponding cumulative probability density function (right panel).

5.5 Descriptive properties of distributions

The expectation value for a function, g(X), of a random variable, X, is the weighted average of the function over all possible values of x. We will designate the expectation value of g(X) by ⟨g(X)⟩, which is given by

\[
\langle g(X)\rangle =
\begin{cases}
\sum_{\text{all } x} g(x)\, p(x) & \text{(discrete)},\\
\int_{-\infty}^{+\infty} g(x)\, f(x)\, dx & \text{(continuous)}.
\end{cases}
\tag{5.1}
\]

The result, if it exists, is a fixed number (not a function) and a property of the probability distribution of X. The expectation defined above is referred to as the first moment of the distribution of g(X). The shape of a probability distribution can be rigorously described by the values of its moments:

The rth moment of the random variable X about the origin (x = 0) is defined by

\[
\mu'_r = \langle X^r\rangle =
\begin{cases}
\sum_{x} x^r\, p(x) & \text{(discrete)},\\
\int_{-\infty}^{+\infty} x^r\, f(x)\, dx & \text{(continuous)}.
\end{cases}
\tag{5.2}
\]

Mean = μ′_1 = ⟨X⟩ = μ = first moment about the origin. This is the usual measure of the location of a probability distribution.
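As a concrete check of Equations (5.1) and (5.2), the short Mathematica sketch below evaluates the first two moments of a fair six-sided die; the die example is illustrative and not taken from the text.

p[x_] := 1/6;                            (* uniform probability mass function, x = 1..6 *)
moment[r_] := Sum[x^r p[x], {x, 1, 6}];  (* r-th moment about the origin, Equation (5.2) *)
mean = moment[1]                         (* -> 7/2 *)
variance = moment[2] - moment[1]^2       (* second moment minus square of the mean -> 35/12 *)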

The rth central moment (origin = mean, μ) of X is defined by

\[
\mu_r = \langle (X-\mu)^r\rangle =
\begin{cases}
\sum_{x} (x-\mu)^r\, p(x) & \text{(discrete)},\\
\int_{-\infty}^{+\infty} (x-\mu)^r\, f(x)\, dx & \text{(continuous)}.
\end{cases}
\tag{5.3}
\]

The distinction between μ_r and μ′_r is simply that in the calculation of μ_r the origin is shifted to the mean value of x.

Figure 5.2 The left panel shows a continuous probability density function and the right panel shows the corresponding cumulative probability density function.

First central moment: ⟨(X − μ)⟩ = ⟨X⟩ − μ = 0.

Second central moment: Var(X) = σ_x² = ⟨(X − μ)²⟩, where σ_x² is the usual measure of dispersion of a probability distribution.

\[
\begin{aligned}
\langle (X-\mu)^2\rangle &= \langle X^2 - 2\mu X + \mu^2\rangle = \langle X^2\rangle - 2\mu\langle X\rangle + \mu^2\\
&= \langle X^2\rangle - 2\mu^2 + \mu^2 = \langle X^2\rangle - \mu^2 = \langle X^2\rangle - \langle X\rangle^2.
\end{aligned}
\]

Therefore,

\[
\sigma^2 = \langle X^2\rangle - \langle X\rangle^2.
\tag{5.4}
\]

The standard deviation, σ, equal to the square root of the variance, is a useful measure of the width of a probability distribution.

It is frequently desirable to compute an estimate of σ² as the data are being acquired. Equation (5.4) tells us how to accomplish this: subtract the square of the average of the data values from the average of the squares of the data values. Later, in Section 6.3, we will introduce a more accurate estimate of σ² called the sample variance.
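A minimal Mathematica sketch of such a running estimate, based directly on Equation (5.4); the sample size and distribution are illustrative.

(* sigma^2 estimated as <x^2> - <x>^2 from running sums of x and x^2 *)
runningVariance[data_List] :=
  Module[{n = Length[data]}, Total[data^2]/n - (Total[data]/n)^2]

SeedRandom[2];
sample = RandomVariate[NormalDistribution[0, 2], 500];
runningVariance[sample]                      (* close to the true variance of 4 *)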

Box 5.1

Question: What is the variance of the random variable Y = aX + b?

Solution:

\[
\begin{aligned}
\mathrm{Var}(Y) &= \langle (Y - \mu_y)^2\rangle = \langle \{(aX + b) - (a\mu + b)\}^2\rangle\\
&= \langle \{aX - a\mu\}^2\rangle\\
&= \langle a^2X^2 - 2a^2\mu X + a^2\mu^2\rangle = a^2\left(\langle X^2\rangle - \langle X\rangle^2\right)\\
&= a^2\,\mathrm{Var}(X).
\end{aligned}
\]
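The Box 5.1 result is easy to verify numerically; the sketch below uses an arbitrary Gaussian for X, and the values of a and b are purely illustrative.

a = 3; b = 7;
distX = NormalDistribution[1, 2];                                (* Var(X) = 4 *)
distY = TransformedDistribution[a x + b, x \[Distributed] distX];
{Variance[distY], a^2 Variance[distX]}                           (* both give 36 = a^2 Var(X) *)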

Third central moment: μ_3 = ⟨(X − μ)³⟩. This is a measure of the asymmetry or skewness of the distribution. For a symmetric distribution, μ_3 = 0 and, more generally, μ_{2n+1} = 0 for any integer value of n.

Fourth central moment: μ_4 = ⟨(X − μ)⁴⟩.

μ_4 is called the kurtosis (another shape factor). It is a measure of how flat-topped a distribution is near its peak. See Figure 5.3 and the discussion in the next section for an example.

5.5.1 Relative line shape measures for distributions

The shape of a distribution cannot be entirely judged by the values of μ_3 and μ_4 because they depend on the units of the random variable. It is better to use measures relative to the distribution's dispersion.

Coefficient of skewness: α_3 = μ_3/(μ_2)^{3/2}.

Coefficient of kurtosis: α_4 = μ_4/(μ_2)².

Figure 5.3 illustrates single-peaked distributions for different α_3 and α_4 coefficients. Note: α_4 = 3 for any Gaussian distribution, so distributions with α_4 > 3 are more sharply peaked than a Gaussian, while those with α_4 < 3 are more flat-topped.
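Mathematica's built-in Skewness and Kurtosis functions return these α_3 and α_4 coefficients directly (Kurtosis is the full coefficient, not the excess); the choice of distributions below is illustrative.

{Skewness[NormalDistribution[0, 1]], Kurtosis[NormalDistribution[0, 1]]}      (* -> {0, 3} *)
{Skewness[ExponentialDistribution[1]], Kurtosis[ExponentialDistribution[1]]}  (* -> {2, 9}: positively skewed, leptokurtic *)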

5.5.2 Standard random variable

A random variable X can always be converted to a standard random variable Z using the following definition:

\[
Z = \frac{X - \mu}{\sigma_x}.
\tag{5.5}
\]

Z has a mean ⟨Z⟩ = 0, and variance ⟨Z²⟩ = σ_z² = 1.

Figure 5.3 Single-peaked distributions with different coefficients of skewness and kurtosis: α_3 > 0 (positively skewed), α_3 < 0 (negatively skewed), α_3 = 0 (symmetric), α_4 > 3 (leptokurtic, highly peaked), and α_4 < 3 (platykurtic, flat-topped).

For any particular value x of X, the quantity z = (x − μ)/σ_x indicates the deviation of x from the expected value of X in units of the standard deviation. At several points in this chapter we will find it convenient to make use of the standard random variable.
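A short Mathematica illustration of the standardization; the population parameters and sample size are illustrative.

mu = 10; sigma = 2;
SeedRandom[3];
xs = RandomVariate[NormalDistribution[mu, sigma], 1000];
zs = (xs - mu)/sigma;          (* Z = (X - mu)/sigma, using the known population values *)
{Mean[zs], Variance[zs]}       (* close to {0, 1} *)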

5.5.3 Other measures of central tendency and dispersion

Median: The median is a measure of the central tendency in the sense that half the area of the probability distribution lies to the left of the median and half to the right. For any continuous random variable, the median is defined by

\[
p(X \le \text{median}) = p(X \ge \text{median}) = 1/2.
\tag{5.6}
\]

If a distribution has a strong central peak, so that most of its area is under a single peak, then the median is an estimator of the location of the central peak. It is a more robust estimator than the mean: the median fails as an estimator only if the area in the tail region of the probability distribution is large, while the mean fails if the first moment of the tail is large. It is easy to construct examples where the first moment of the tail is large even though the area is negligible.
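A small numerical illustration of this robustness; the data values are invented for the example.

clean   = {9.8, 10.1, 9.9, 10.2, 10.0};
tainted = Append[clean, 100.];       (* one wild value far out in the tail *)
{Mean[clean], Median[clean]}         (* -> {10.0, 10.0} *)
{Mean[tainted], Median[tainted]}     (* mean jumps to 25.0; median only moves to 10.05 *)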

Mode: Defined to be a value, x_m, of X that maximizes the probability function (if X is discrete) or the probability density (if X is continuous). Note: this is only meaningful if there is a single peak.

If X is continuous, the mode is the solution to

\[
\frac{df(x)}{dx} = 0, \qquad \text{for} \quad \frac{d^2 f(x)}{dx^2} < 0.
\tag{5.7}
\]

An example of the mode, median, and mean for a particular PDF is shown in Figure 5.4.

Figure 5.4 The mode, median and mean are three different measures of this probability density function.

5.5.4 Median baseline subtraction

Suppose you want to remove the baseline variations in some data without suppressing the signal. Many automated signal detection schemes only work well if these baseline variations are removed first. The upper panel of Figure 5.5 depicts the output from a detector system with a signal profile represented by narrow Gaussian-like features sitting on top of a slowly varying baseline with noise. How do we handle this problem?

Solution: Use a running median subtraction.

One way to remove the slowly varying baseline is to subtract a running median. The signal at sample location i is replaced by the original signal at i minus the median of all values within (N − 1)/2 samples of i. N is chosen so that it is large compared to the signal profile width and small compared to the scale of the baseline changes.
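A minimal Mathematica sketch of the running median subtraction; the window half-width and names are illustrative, and near the ends of the data the window is simply truncated.

runningMedianSubtract[data_List, halfWidth_Integer] :=
  Module[{n = Length[data]},
    Table[data[[i]] -
        Median[data[[Max[1, i - halfWidth] ;; Min[n, i + halfWidth]]]],
      {i, n}]]

(* Example: a 21-sample window, i.e., halfWidth = 10              *)
(* cleaned = runningMedianSubtract[signal, 10]                    *)
(* The built-in MedianFilter[signal, 10] performs a similar task. *)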

Figure 5.5 (a) A signal profile sitting on top of a slowly varying baseline. (b) The same data with the baseline variations removed by a running median subtraction. (c) The same data with the baseline variations removed by a running mean subtraction; notice the negative bowl in the vicinity of the source profile.

Question: Why is median subtraction more robust than mean subtraction?

Answer: When the N samples include some of the signal points, both the mean value and the median will be elevated, so that when the running subtraction occurs the signal will sit in a negative bowl, as illustrated in Figure 5.5(c).

With mean subtraction, the size of the bowl will be proportional to the signal strength. With median subtraction, the size of the bowl is smaller and essentially independent of the signal strength for signals greater than the noise. To understand why, consider a running median subtraction with N = 21 and a signal profile which, for simplicity, is assumed to have a width of only 1 sample. First, imagine a histogram of the 21 sample values when no signal is present, i.e., just a Gaussian noise histogram with some median, m_0. Now suppose a signal of strength S is added to sample 11, shifting it in the direction of increasing signal strength. Let T_11 be the value of sample 11 before the signal was added. There are two cases of interest. (a) If T_11 > m_0, then T_11 + S > m_0 and the addition of the signal produces no change in the median value, i.e., the number of sample values on either side of m_0 is unchanged. (b) If T_11 < m_0, then the addition of S can cause the sample to move to the other side of m_0, thus increasing the median by a small amount to m_1. The size of S required to produce this small shift is no more than roughly the RMS noise. Once sample 11 has been shifted to the other side, no further increase in the value of S will change the median. Figure 5.5(b) shows the result of a 21-point running median subtraction. The baseline curvature has been nicely removed and there is no noticeable negative bowl in the vicinity of the source.

In the case of a running mean subtraction, the change in the mean of our 21 samples is directly proportional to the signal strength S, which gives rise to the very noticeable negative bowl that can be seen in Figure 5.5(c).

Mean deviation (an alternative measure of dispersion):

\[
\langle |X - \mu|\rangle =
\begin{cases}
\sum_{\text{all } x} |x-\mu|\, p(x) & \text{(discrete)},\\
\int_{-\infty}^{+\infty} |x-\mu|\, f(x)\, dx & \text{(continuous)}.
\end{cases}
\tag{5.8}
\]

For long-tailed distributions, values far out in the tail affect the mean deviation less than they affect the standard deviation.
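A short Mathematica comparison for a heavy-tailed sample; the Student t distribution with 3 degrees of freedom is just an illustrative choice.

meanDeviation[data_List] := Mean[Abs[data - Mean[data]]]   (* sample version of Equation (5.8) *)

SeedRandom[4];
longTail = RandomVariate[StudentTDistribution[3], 2000];   (* heavy-tailed sample *)
{meanDeviation[longTail], StandardDeviation[longTail]}     (* the mean deviation is typically the smaller of the two *)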

5.6 Moment generating functions

In Section 5.5 we looked at various useful moments of a random variable. It would be convenient if we could describe all moments of a random variable in one function.

This function is called the moment generating function. We will use it directly to compute moments for a variety of distributions. We will also employ the moment generating function in the derivation of the Central Limit Theorem, in Section 5.9, and in the proof of several theorems in Chapter 6. The moment generating function, m_x(t), of the random variable X is defined by

\[
m_x(t) = \langle e^{tX}\rangle =
\begin{cases}
\sum_{x} e^{tx}\, p(x) & \text{(discrete)},\\
\int_{-\infty}^{+\infty} e^{tx}\, f(x)\, dx & \text{(continuous)},
\end{cases}
\tag{5.9}
\]

where t is a dummy variable. The moment generating function exists if there is a positive constant, say κ, such that m_x(t) is finite for |t| ≤ κ. The moments themselves are the coefficients in a Taylor series expansion of the moment generating function (see Equation (5.12) below), which converges for |t| ≤ κ.

It can be shown that if a moment generating function exists, then it completely determines the probability distribution of X, i.e., if two random variables have the same moment generating function, they have the same probability distribution.

The rth moment about the origin (see Equation (5.2)) is obtained by taking the rth derivative of m_x(t) with respect to t and then evaluating the derivative at t = 0, as shown in Equation (5.10):

\[
\mu'_r = \left.\frac{d^r m_x(t)}{dt^r}\right|_{t=0}.
\tag{5.10}
\]

For moments about the mean (central moments), we can use the central moment generating function.

\[
m_{x-\mu}(t) = \langle \exp\{t(X-\mu)\}\rangle.
\tag{5.11}
\]

Now we use a Taylor series expansion of the exponential,

\[
\langle \exp[t(X-\mu)]\rangle = \left\langle 1 + t(X-\mu) + \frac{t^2 (X-\mu)^2}{2!} + \frac{t^3 (X-\mu)^3}{3!} + \cdots \right\rangle.
\tag{5.12}
\]

From the expansion, one can see clearly that each successive moment is obtained by taking the next higher derivative with respect to t, each time evaluating the derivative at t = 0.

Example:

Let X be a random variable with probability density function

\[
f(x) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x \ge 0,\ \theta > 0.
\tag{5.13}
\]

Determine the moment generating function and variance:

\[
\begin{aligned}
m_x(t) &= \frac{1}{\theta}\int_0^{\infty} e^{tx}\, e^{-x/\theta}\, dx\\
&= \frac{1}{\theta}\int_0^{\infty} \exp[-(1-\theta t)x/\theta]\, dx\\
&= \left.-\frac{1}{(1-\theta t)}\exp[-(1-\theta t)x/\theta]\right|_0^{\infty}\\
&= (1-\theta t)^{-1} \qquad (\text{for } t < 1/\theta)
\end{aligned}
\tag{5.14}
\]

\[
\left.\frac{dm_x(t)}{dt}\right|_{t=0} = \left.\theta(1-\theta t)^{-2}\right|_{t=0} = \theta = \langle X\rangle
\tag{5.15}
\]

\[
\left.\frac{d^2 m_x(t)}{dt^2}\right|_{t=0} = \left.2\theta^2(1-\theta t)^{-3}\right|_{t=0} = 2\theta^2 = \langle X^2\rangle.
\tag{5.16}
\]

From Equation (5.4), the variance, σ², is given by

\[
\sigma^2 = \langle X^2\rangle - \langle X\rangle^2 = 2\theta^2 - \theta^2 = \theta^2.
\tag{5.17}
\]
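The same calculation can be checked symbolically in Mathematica by carrying out the integral of Equation (5.14); this is a sketch (the symbol th stands for θ), not the book's own code.

mgf = Integrate[Exp[t x] (1/th) Exp[-x/th], {x, 0, Infinity},
   Assumptions -> th > 0 && t < 1/th]              (* -> 1/(1 - t th) *)
{D[mgf, t] /. t -> 0, D[mgf, {t, 2}] /. t -> 0}    (* -> {th, 2 th^2}, so sigma^2 = 2 th^2 - th^2 = th^2 *)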

5.7 Some discrete probability distributions

5.7.1 Binomial distribution

The binomial distribution³ is one of the most useful discrete probability distributions and arises in any repetitive experiment whose result is either the occurrence or non-occurrence of an event (only two possible outcomes, like tossing a coin). A large number of experimental measurements contain random errors which can be represented by a limiting form of the binomial distribution called the normal or Gaussian distribution (Section 5.8.1).

Let X be a random variable representing the number of successes (occurrences) out of n independent trials such that the probability of success for any one trial is p.⁴ Then X is said to have a binomial distribution with probability mass function

\[
p(x) = p(x|n,p) = \frac{n!}{(n-x)!\,x!}\, p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n;\quad 0 \le p \le 1,
\tag{5.18}
\]

which has two parameters, n and p.

³ A Bayesian derivation of the binomial distribution is presented in Section 4.2.

⁴ Note: any time the symbol p appears without an argument, it will be taken to be a number representing the probability of a success; p(x) is a probability distribution, either discrete or continuous.

Cumulative distribution function:

\[
F(x) = \sum_{i=0}^{x} p(i) = \sum_{i=0}^{x} \binom{n}{i}\, p^i (1-p)^{(n-i)},
\tag{5.19}
\]

where \(\binom{n}{i}\) is the short-hand notation for the number of combinations of n items taken i at a time.

Box 5.2 Mathematica cumulative binomial distribution:

Needs["Statistics`DiscreteDistributions`"]

The probability of more than x successes in n binomial trials is given by

1 - CDF[BinomialDistribution[n, p], x]

→ answer = 0.623 (n = 10, p = 0.5, x = 4)

Moment generating function of a binomial distribution:

We can apply Equation (5.9) to compute the moment generating function of the binomial distribution.

\[
m_x(t) = \langle e^{tX}\rangle = \sum_{x=0}^{n} e^{tx} \binom{n}{x}\, p^x (1-p)^{n-x}
\tag{5.20}
\]

\[
\begin{aligned}
m_x(t) &= \sum_{x=0}^{n} \frac{n!}{(n-x)!\,x!}\, (e^t p)^x (1-p)^{n-x}\\
&= (1-p)^n + n(1-p)^{n-1}(e^t p) + \cdots + \frac{n!}{(n-k)!\,k!}(1-p)^{n-k}(e^t p)^k + \cdots + (e^t p)^n\\
&= \text{binomial expansion of } [(1-p) + e^t p]^n.
\end{aligned}
\]

Therefore, m_x(t) = [1 − p + e^t p]^n.

From the first derivative, we compute the mean, which is given by

\[
\text{mean} = \mu'_1 = \left.\frac{dm_x(t)}{dt}\right|_{t=0} = \left. n\,[1-p+e^t p]^{n-1} e^t p\,\right|_{t=0} = np.
\]

The second derivative yields the second moment:

\[
\begin{aligned}
\mu'_2 = \left.\frac{d^2 m_x(t)}{dt^2}\right|_{t=0}
&= \left. n(n-1)[1-p+e^t p]^{n-2}(e^t p)^2 + n[1-p+e^t p]^{n-1} e^t p \right|_{t=0}\\
&= n(n-1)p^2 + np.
\end{aligned}
\]

But μ′_2 = ⟨X²⟩, and therefore the variance, σ², is given by

\[
\begin{aligned}
\sigma^2 = \langle (X-\mu)^2\rangle &= \langle X^2\rangle - \langle X\rangle^2 = \langle X^2\rangle - \mu^2\\
&= n(n-1)p^2 + np - (np)^2\\
\sigma^2 &= np(1-p) \quad \text{(variance of the binomial distribution)}.
\end{aligned}
\tag{5.21}
\]

Box 5.3 Mathematica binomial mean and variance

The same results could be obtained in Mathematica with the commands:

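(The commands themselves are not reproduced in this extract; a minimal modern equivalent, assuming the built-in BinomialDistribution, would be the following.)

Mean[BinomialDistribution[n, p]]        (* -> n p *)
Variance[BinomialDistribution[n, p]]    (* -> n (1 - p) p *)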

