
What is a statistic?


6.1 Introduction

In this chapter, we address the question ‘‘What is a statistic?’’ In particular, we look at what role statistics play in scientific inference and give some common useful examples.

We will examine their role in the two basic inference problems: hypothesis testing (the frequentist equivalent of model selection) and parameter estimation, with emphasis on the latter. Hypothesis testing will be dealt with in Chapter 7.

Recall that an important aspect of frequentist statistical inference is the process of drawing conclusions based on sample data drawn from the population (which is the collection of all possible samples). The concept of the population assumes that, in principle, an infinite number of measurements (under identical conditions) are possible. Suppose X₁, X₂, …, Xₙ are n independent and identically distributed (IID) random variables that constitute a random sample from the population, for which x₁, x₂, …, xₙ is one realization. The population is assumed to have an intrinsic probability distribution (or density function) which, if known, would allow us to predict the likelihood of the sample x₁, x₂, …, xₙ.

For example, suppose the random variable we are measuring is the time interval between successive decays of a radioactive sample. In this case, the population probability density function is a negative exponential (see Section 5.8.5), given by f(x|τ) = (1/τ) exp(−x/τ). The likelihood is given by

\[ L(x_1, x_2, \ldots, x_n|\tau) = \prod_{i=1}^{n} f(x_i|\tau). \quad (6.1) \]
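As a concrete illustration of Equation (6.1), the following sketch (our addition, not from the text) evaluates the likelihood of a small made-up sample for one trial value of τ. It assumes Mathematica's built-in ExponentialDistribution, which is parameterized by the rate λ = 1/τ; the data values are invented for illustration only.

(* Likelihood of an IID exponential sample for a trial value of τ *)
tau = 2.0;                         (* trial mean interval between decays *)
data = {0.4, 1.7, 2.9, 0.8, 3.5};  (* hypothetical measured intervals *)
(* ExponentialDistribution[λ] has density λ Exp[-λ x], so take λ = 1/τ *)
likelihood = Times @@ (PDF[ExponentialDistribution[1/tau], #] & /@ data)

Evaluating this product on a grid of τ values recovers the familiar result that the maximum likelihood value of τ is the sample mean.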

This particular population probability density function is characterized by a single parameter, τ. Another population probability distribution that arises in many problems is the normal (Gaussian) distribution, which has two parameters, μ and σ².

In most problems, the parameters of the underlying population probability distribution are not known. Without knowledge of their values, it is impossible to compute the desired probabilities. However, a population parameter can be estimated from a statistic, which is determined from the information contained in a random sample. It is for this reason that the notion of a statistic and its sampling distribution is so important in statistical inference.


Definition: A statistic is any function of the observed random variables in a sample such that the function does not contain any unknown quantities.

One important statistic is the sample mean X̄, given by
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i. \quad (6.2) \]

Note: we are using a capital X̄, which implies we are talking about a random variable.

All statistics are random variables and, to be useful, we need to be able to specify their sampling distribution.

For example, we might be interested in the mean redshift¹ of a population of cosmic gamma-ray burst (GRB) sources. This would provide information about the distances of these objects and their mean energy. GRBs are the most powerful type of explosion known in the universe. The parameter of interest is the mean redshift, which we designate μ. A parameter of a population is always regarded as a fixed and usually unknown constant. Let Z be a random variable representing GRB redshifts. Suppose the redshifts, {z₁, z₂, …, z₇}, of a sample of seven GRB sources are obtained after a great deal of effort. What can we conclude about the population mean redshift μ from our sample, i.e., how accurately can we determine μ from our sample?

This can be a fairly difficult question to answer using the individual measurements, zᵢ, because we don't know the form of the sampling distribution for GRB source redshifts. Happily, in this case, we can proceed with our objective by exploiting the Central Limit Theorem (CLT), which predicts the sampling distribution of the sample mean statistic. The way to think about this is as follows: consider a thought experiment in which we are able to obtain redshifts for a very large number of samples (a hypothetical reference set) of GRB redshifts. Each sample consists of seven redshift measurements. The means of all these samples will have a distribution. According to the CLT, the distribution of sample means tends to a Gaussian as the number n of observations tends to infinity. In practice, a Gaussian sampling distribution is often employed when n ≥ 5. Of course, we don't have the results from this hypothetical reference set, only the results from our one sample, but at least we know that the shape of the sampling distribution characterizing our sample mean statistic is approximately a Gaussian.
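The thought experiment is also easy to simulate. Here is a minimal sketch (our addition, assuming current Mathematica built-ins) that draws many samples of seven from an exponential population and histograms the resulting sample means; the distribution of means already looks roughly Gaussian:

(* Simulate the hypothetical reference set of sample means *)
tau = 2.0; nSamp = 7; nSets = 10000;
means = Table[Mean[RandomVariate[ExponentialDistribution[1/tau], nSamp]], {nSets}];
Histogram[means, Automatic, "PDF"]  (* roughly Gaussian, centered on τ = 2 *)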

This allows us to make a definite statement about the uncertainty in the population mean redshift μ, which we derive from our one sample of seven redshift measurements.

Just how we do this is discussed in detail in Section 6.6.2. In the course of answering that question, we will encounter the sample variance statistic, S², and develop the notion of a sampling distribution of a statistic.

¹ Redshift is a measure of the wavelength shift produced by the Doppler effect. In 1929, Edwin Hubble showed that we live in an expanding universe in which the velocity of recession of a galaxy is proportional to its distance. A recession velocity shifts the observed wavelength of a spectral line to longer wavelengths, i.e., to the red end of the optical spectrum.

6.2 The χ² distribution

The sampling distribution of any particular statistic is the probability distribution of that statistic that would be determined from an infinite number of independent samples, each of size n, from an underlying population. We start with a treatment of the χ² sampling distribution.² We will prove in Section 6.3 that the χ² distribution describes the distribution of the variances of samples taken from a normal distribution. The χ² distribution is a special case of the gamma distribution:

\[ f(x|\alpha, \theta) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x^{\alpha - 1} \exp\!\left(-\frac{x}{\theta}\right) \quad (6.3) \]
with θ = 2 and α = ν/2, where ν is called the degrees of freedom.

The χ² distribution has the following properties:

\[ f(x|\nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp\!\left(-\frac{x}{2}\right) \quad (6.4) \]
\[ \langle x \rangle = \nu; \qquad \mathrm{Var}[x] = 2\nu. \quad (6.5) \]

The coefficients of skewness (α₃) and kurtosis (α₄) are given by
\[ \alpha_3 = \frac{4}{\sqrt{2\nu}}; \qquad \alpha_4 = 3\left(1 + \frac{4}{\nu}\right). \quad (6.6) \]

Finally, the moment generating function of χ² with ν degrees of freedom is given by
\[ m_{\chi^2}(t) = (1 - 2t)^{-\nu/2}. \quad (6.7) \]
We now prove two useful theorems pertaining to the χ² distribution.

Theorem 1:

Let {Xᵢ} = X₁, X₂, …, Xₙ be an IID sample from a normal distribution N(μ, σ). Let
\[ Y = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^{n} Z_i^2, \]
where the Zᵢ are standard random variables. Then Y has a chi-squared (χ²ₙ) distribution with n degrees of freedom.

Proof:

Let m_Y(t) = the moment generating function (recall Section 5.6) of Y. From Equation (5.9), we can write
\[ m_Y(t) = \langle e^{tY} \rangle = \langle e^{t\sum_i Z_i^2} \rangle = \langle e^{tZ_1^2} e^{tZ_2^2} \cdots e^{tZ_n^2} \rangle. \quad (6.8) \]

² The χ² statistic plays an important role in fitting models to data using the least-squares method, which is discussed in great detail in Chapters 10 and 11.

Since the random variables Zᵢ are IID, then

\[ m_Y(t) = \langle e^{tZ_1^2} \rangle \langle e^{tZ_2^2} \rangle \cdots \langle e^{tZ_n^2} \rangle = m_{Z_1^2}(t)\, m_{Z_2^2}(t) \cdots m_{Z_n^2}(t). \quad (6.9) \]
The moment generating function for each Zᵢ is given by

\[ m_{Z^2}(t) = \langle e^{tZ^2} \rangle = \int_{-\infty}^{\infty} e^{tz^2} f(z)\, dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{z^2(1 - 2t)}{2}\right] dz, \quad (6.10) \]
where we have made use of the fact that f(z) is also a normal distribution, i.e., a Gaussian.

Multiplying and dividing Equation (6.10) by (1 − 2t)^{−1/2}, we get
\[ m_{Z^2}(t) = (1 - 2t)^{-1/2} \int_{-\infty}^{\infty} \sqrt{\frac{1 - 2t}{2\pi}}\, \exp\!\left[-\frac{(1 - 2t)z^2}{2}\right] dz = (1 - 2t)^{-1/2}, \quad (6.11) \]
since the integral of a normal distribution (here one with variance 1/(1 − 2t)) equals 1.

Therefore,
\[ m_Y(t) = (1 - 2t)^{-n/2}. \quad (6.12) \]
Comparison of Equations (6.12) and (6.7) shows that Y has a χ² distribution with n degrees of freedom, which we designate by χ²ₙ. Figure 6.1 illustrates the χ² distribution for three different choices of the number of degrees of freedom.

Example:

In Section 5.9, we showed that for any IID sampling distribution with a finite variance, (X̄ − μ)√n/σ tends to N(0, 1) as n → ∞, and therefore [(X̄ − μ)√n/σ]² is approximately χ²₁, with one degree of freedom.³

³ When sampling from a normal distribution, the distribution of (X̄ − μ)√n/σ is always N(0, 1), regardless of the value of n.
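Theorem 1 itself can be checked numerically. The following sketch (our addition, assuming the modern built-in distribution functions rather than the old Statistics package) forms many sums of n squared standard normals and compares their mean and variance with Equation (6.5):

(* Sums of n squared standard normals should follow χ² with ν = n *)
n = 5;
y = Table[Total[RandomVariate[NormalDistribution[0, 1], n]^2], {10000}];
{Mean[y], Variance[y]}  (* ≈ {ν, 2 ν} = {5, 10} *)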

Theorem 2:

If X₁ and X₂ are two independent χ²-distributed random variables with ν₁ and ν₂ degrees of freedom, then Y = X₁ + X₂ is also χ²-distributed, with ν₁ + ν₂ degrees of freedom.

Proof:

Since X₁ and X₂ are independent, the moment generating function of Y is given by
\[ m_Y(t) = m_{X_1}(t)\, m_{X_2}(t) = (1 - 2t)^{-\nu_1/2}\,(1 - 2t)^{-\nu_2/2} \quad (6.13) \]
\[ = (1 - 2t)^{-(\nu_1 + \nu_2)/2}, \quad (6.14) \]
which equals the moment generating function of a χ² random variable with ν₁ + ν₂ degrees of freedom.

6.3 Sample variance S²

We often want to estimate the variance (σ²) of a population from an IID sample taken from a normal distribution. We usually don't know the mean (μ) of the population, so we use the sample mean (X̄) as an estimate. To estimate σ², we use another random variable called the sample variance (S²), defined as follows:

\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}. \quad (6.15) \]

Figure 6.1 The χ² distribution for three different choices of the number of degrees of freedom (ν = 1, 3, and 8).

Just why we define the sample variance random variable in this way will soon be made clear. Of course, for any particular sample of n data values, the sample variance random variable would take on a particular value, designated by the lower case s².

Here is a useful theorem that enables us to estimate σ from S:

Theorem 3:

The expectation value of a quantity that has a χ² distribution with (n − 1) degrees of freedom is equal to the number of degrees of freedom (see Equation (6.5)).

Therefore, since (n − 1)S²/σ² has a χ² distribution with (n − 1) degrees of freedom,
\[ \left\langle \frac{(n-1)S^2}{\sigma^2} \right\rangle = n - 1 \quad \Rightarrow \quad \langle S^2 \rangle = \sigma^2. \quad (6.20) \]
This provides justification for our definition of S² – its expectation value is the population variance. Note: this does not mean that S² will equal σ² for any particular sample.

Note 1: We have just established Equation (6.20) when sampling from a normal distribution. We now show that Equation (6.20) is valid for IID sampling from any arbitrary distribution with finite variance. From Equation (6.16) (the identity Σᵢ(Xᵢ − X̄)² = Σᵢ(Xᵢ − μ)² − n(X̄ − μ)²), we can write

\[ \langle S^2 \rangle = \frac{\sum_i \langle (X_i - \mu)^2 \rangle - n\langle (\bar{X} - \mu)^2 \rangle}{n - 1}. \quad (6.21) \]

But ⟨(Xᵢ − μ)²⟩ = Var(Xᵢ) = σ² by definition, and ⟨(X̄ − μ)²⟩ = Var(X̄) = σ²/n from Equation (5.50). It follows that

\[ \langle S^2 \rangle = \frac{n\sigma^2 - \sigma^2}{n - 1} = \sigma^2. \quad (6.22) \]

Thus, Equation (6.22) is valid for IID sampling from an arbitrary distribution with finite variance. In the language of frequentist statistics, we say that S², as defined in Equation (6.15), is an unbiased estimator of σ².
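The unbiasedness of S² is easy to demonstrate by simulation; a sketch (our addition) with assumed population values μ = 3 and σ = 2:

(* The average of many sample variances S² (n − 1 divisor) approaches σ² *)
mu = 3.0; sigma = 2.0; n = 8;
s2 = Table[Variance[RandomVariate[NormalDistribution[mu, sigma], n]], {20000}];
Mean[s2]  (* ≈ σ² = 4; an n divisor would instead give ≈ σ²(n − 1)/n = 3.5 *)

Note that Mathematica's Variance uses the (n − 1) divisor of Equation (6.15).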

Standard error of the sample mean: We often want to quote a typical error for the mean of a population based on our sample. According to Equation (5.50), Var(X̄) = σ²/n for any distribution with finite variance. Since we do not normally know σ², the variance of the population, we use the sample variance as an estimate.

The standard error of the sample mean is defined as
\[ \frac{S}{\sqrt{n}}. \quad (6.23) \]

In Section 6.6.2 we will use a Student’s t distribution to be more precise about specifying the uncertainty in our estimate of the population mean from the sample mean.

Note 2: In a situation where we know the population mean μ but not σ², define
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}. \quad (6.24) \]

It is easily shown that, with this definition, nS²/σ² is χ²ₙ with n degrees of freedom. We lose one degree of freedom when we estimate μ from X̄.

Example:

A random sample of size n = 16 (IID sample) is drawn from a population with a normal distribution of unknown mean (μ) and variance (σ²). We compute the sample variance, S², and want to determine
\[ p(\sigma^2 < 0.49\, S^2). \quad (6.25) \]

Solution: Equation (6.25) is equivalent to
\[ p\!\left(\frac{S^2}{\sigma^2} > 2.041\right). \quad (6.26) \]

We know that the random variable X = (n − 1)S²/σ² has a χ² distribution with (n − 1) degrees of freedom. In this case, (n − 1) = 15 = ν degrees of freedom. Therefore,

\[ p\!\left(\frac{S^2}{\sigma^2} > 2.041\right) = p\!\left(\frac{(n-1)S^2}{\sigma^2} > 30.61\right). \quad (6.27) \]

Let
\[ \alpha = p\!\left(\frac{(n-1)S^2}{\sigma^2} > 30.61\right). \]

Then
\[ 1 - \alpha = p\!\left(\frac{(n-1)S^2}{\sigma^2} \le 30.61\right), \]

or more generally, 1 − α = p(X ≤ x_{1−α}), where x_{1−α} is the particular value of the random variable X for which the cumulative distribution p(X ≤ x_{1−α}) = 1 − α. x_{1−α} is called the (1 − α) quantile value of the distribution, and p(X ≤ x_{1−α}|ν) is given by
\[ p(X \le x_{1-\alpha}|\nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^{x_{1-\alpha}} t^{\nu/2 - 1} \exp\!\left(-\frac{t}{2}\right) dt = 1 - \alpha. \quad (6.28) \]

For ν = 15 degrees of freedom, 30.61 corresponds to α = 0.01, or x_{0.99}. Thus, the probability that the random variable σ² < 0.49 S² is 1%. Figure 6.2 shows the χ² distribution for ν = 15 degrees of freedom and the 1 − α = 0.99 quantile value.

Figure 6.2 The χ² distribution for ν = 15 degrees of freedom. The vertical line marks the 1 − α = 0.99 quantile value (χ²_{0.99}). The area to the left of this line corresponds to 1 − α.

We can evaluate Equation (6.28) with the following Mathematica command:

Box 6.1 Mathematica χ² significance

Needs["Statistics`ContinuousDistributions`"]

The line above loads a package containing a wide range of continuous distributions of importance to statistics, and the following line computes α, the area in the tail of the χ² distribution to the right of χ² = 30.61, for ν = 15 degrees of freedom:

(1 - CDF[ChiSquareDistribution[15], 30.61]) → answer = 0.01

In statistical hypothesis testing (to be discussed in the next chapter), α is referred to as the significance or the one-sided P-value of a statistical test.
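In current versions of Mathematica these distributions are built in (no package load is needed), and the quantile itself can be read off directly; for example:

Quantile[ChiSquareDistribution[15], 0.99]  (* → 30.5779, the x_{0.99} quantile used above *)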

6.4 The Student's t distribution

Recall that, when sampling from a normal distribution with known standard deviation, σ, the distribution of the standard random variable Z = (X̄ − μ)√n/σ is N(0, 1). In practice, σ is usually not known. The logical thing to do is to replace σ by the sample standard deviation S. The usual inference desired is that there is a specified probability that X̄ lies within ±S of the true mean.

Unfortunately, the distribution of (X̄ − μ)√n/S is not N(0, 1). However, it is possible to determine the exact sampling distribution of (X̄ − μ)√n/S when sampling from N(μ, σ) with both μ and σ² unknown. To this end, we examine the Student's t distribution.⁴ The following useful theorem pertaining to the Student's t distribution is given without proof.

Theorem 4:

Let Z be a standard normal random variable and let X be a χ² random variable with ν degrees of freedom. If Z and X are independent, then the random variable

\[ T = \frac{Z}{\sqrt{X/\nu}} \quad (6.29) \]
has a Student's t distribution with ν degrees of freedom and a probability density given by
\[ f(t|\nu) = \frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu\pi}\,\Gamma(\nu/2)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}. \quad (6.30) \]

⁴ The t distribution is named for its discoverer, William Gosset, who wrote a number of statistical papers under the pseudonym ‘‘Student.’’ He worked as a brewer for the Guinness brewery in Dublin in 1899. He developed the t distribution in the course of analyzing the variability of various materials used in the brewing process.

The Student's t distribution has the following properties: its mean is ⟨t⟩ = 0 (for ν > 1) and its variance is Var(t) = ν/(ν − 2) (for ν > 2), so it is symmetric about zero but broader than N(0, 1). When sampling from a normal distribution, (X̄ − μ)√n/S is a random variable with a Student's t distribution with n − 1 degrees of freedom. Figure 6.3 shows a comparison of a Student's t distribution for three degrees of freedom, and a standard normal. The broader wings of the Student's t distribution are clearly evident.

The (1 − α) quantile value for ν degrees of freedom, t_{1−α,ν}, is given by
\[ p(T \le t_{1-\alpha,\nu}) = \int_{-\infty}^{t_{1-\alpha,\nu}} f(t|\nu)\, dt = 1 - \alpha. \]

Example:

Suppose a cigarette manufacturer claims that one of their brands has an average nicotine content of 0.6 mg per cigarette. An independent testing organization

Figure 6.3 Comparison of a standard normal distribution and a Student's t distribution for 3 degrees of freedom.

measures the nicotine content of 16 such cigarettes and has determined the sample average and the sample standard deviation to be 0.75 and 0.197 mg, respectively. If we assume the amount of nicotine is a normal random variable, how likely is the sample result given the manufacturer’s claim?

T = (X̄ − μ)√n/S has a Student's t distribution.

x̄ = 0.75 mg, s = 0.197 mg, and n = 16, so the number of degrees of freedom ν = 15.

The manufacturer's claim μ = 0.6 mg corresponds to
\[ t = \frac{0.75 - 0.6}{0.197/\sqrt{16}} = 3.045. \quad (6.34) \]

The Student's t distribution is a continuous distribution, and thus we cannot calculate the probability of any specific t value, since there is no area under a point. The question of how likely the t value is, given the manufacturer's claim, is usually interpreted as: what is the probability by chance that T ≥ 3.045? The area of the distribution beyond the sample t value gives us a measure of how far out in the tail of the distribution the sample value resides.

Box 6.2 Mathematica solution:

We can solve the above problem with the following commands:

Needs["Statistics`ContinuousDistributions`"]

The following line computes the area in the tail of the T distribution beyond T = 3.045:

(1 - CDF[StudentTDistribution[15], 3.045]) → answer = 0.004 (ν = 15)

where CDF[StudentTDistribution[15], 3.045] stands for the cumulative distribution function of the T distribution from T = −∞ to 3.045.

Therefore, p(T > 3.045) = α = 0.004, or 0.4%, i.e., the manufacturer's claim is very improbable. The way to think of this is to imagine we could repeatedly obtain samples of 16 cigarettes and compute the value of t for each sample. The fraction of these t values that we would expect to fall in the tail area beyond t > 3.045 is only 0.4%. If the manufacturer's claim were reasonable, we would expect that the t value of our actual sample would not fall so far out in the tail of the distribution. If you are still puzzled by this reasoning, we will have a lot more to say about it in Chapter 7. We will revisit this example in Section 7.2.3.
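The repeated-sampling picture can be made concrete with a short simulation (a sketch, not from the text). Under the manufacturer's claim μ = 0.6 mg, the fraction of samples of 16 whose t value exceeds 3.045 should be about 0.4%; the assumed population σ is arbitrary because t is scale-free:

(* Fraction of t values beyond 3.045 when μ really is 0.6 *)
mu0 = 0.6; sigma = 0.2; n = 16;  (* σ chosen arbitrarily *)
tvals = Table[
   With[{x = RandomVariate[NormalDistribution[mu0, sigma], n]},
    (Mean[x] - mu0) Sqrt[n]/StandardDeviation[x]], {100000}];
N[Count[tvals, _?(# > 3.045 &)]/100000]  (* ≈ 0.004 *)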

Note: although (x̄ − μ)/s = 0.15/0.197 < 1, s is not a meaningful uncertainty for x̄ – only for the xᵢ. The usual measure of the uncertainty in x̄ is s/√n = 0.049. The quantity s/√n is called the standard error of the sample mean.

6.5 F distribution (F-test)

The F distribution is used to find out if two data sets have significantly different variances. For example, we might be interested in the effect of a new catalyst in the brewing of beer, so we compare some measurable property of a sample brewed with the catalyst to a sample from the control batch made without the catalyst. What effect has the catalyst had on the variance of this property?

Here, we develop the appropriate random variable for use in making inferences about the variances of two independent normal distributions, based on a random sample from each. Recall that inferences about σ², when sampling from a normal distribution, are based on the random variable (n − 1)S²/σ², which has a χ²ₙ₋₁ distribution.

Theorem 5:

Let X and Y be two independent χ² random variables with ν₁ and ν₂ degrees of freedom. Then the random variable

\[ F = \frac{X/\nu_1}{Y/\nu_2} \quad (6.35) \]

has an F distribution with a probability density function

\[ p(f|\nu_1, \nu_2) = \begin{cases} \dfrac{\Gamma[(\nu_1 + \nu_2)/2]}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \left(\dfrac{\nu_1}{\nu_2}\right)^{\nu_1/2} \dfrac{f^{\frac{1}{2}(\nu_1 - 2)}}{(1 + f\nu_1/\nu_2)^{\frac{1}{2}(\nu_1 + \nu_2)}}, & f > 0 \\[1ex] 0, & \text{elsewhere.} \end{cases} \quad (6.36) \]

An F distribution has the following properties:

\[ \langle F \rangle = \frac{\nu_2}{\nu_2 - 2}, \quad (\nu_2 > 2). \quad (6.37) \]

(Surprisingly, ⟨F⟩ depends only on ν₂ and not on ν₁.)
\[ \mathrm{Var}(F) = \frac{2\nu_2^2(\nu_2 + \nu_1 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}, \quad (\nu_2 > 4) \quad (6.38) \]
\[ \mathrm{Mode} = \frac{\nu_2(\nu_1 - 2)}{\nu_1(\nu_2 + 2)}. \quad (6.39) \]

Let X = (n₁ − 1)S₁²/σ₁² and Y = (n₂ − 1)S₂²/σ₂². Then,
\[ F_{\nu_1 \nu_2} = \frac{X/\nu_1}{Y/\nu_2} = \frac{X/(n_1 - 1)}{Y/(n_2 - 1)} = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}. \quad (6.40) \]

Box 6.3 Mathematica example:

The sample variance is s₁² = 16.65 for n₁ = 6 IID samples from a normal distribution with a population variance σ₁², and s₂² = 5.0 for n₂ = 11 IID samples from a second independent normal distribution with a population variance σ₂². If we assume that σ₁² = σ₂², then from Equation (6.40) we obtain f = 3.33 for ν₁ = n₁ − 1 = 5 and ν₂ = n₂ − 1 = 10 degrees of freedom. What is the probability of getting an f value ≥ 3.33 by chance if σ₁² = σ₂²?

Needs["Statistics`ContinuousDistributions`"]

The following line computes the area in the tail of the F distribution beyond f = 3.33:

(1 - CDF[FRatioDistribution[5, 10], 3.33]) → answer = 0.05

where CDF[FRatioDistribution[5, 10], 3.33] stands for the cumulative distribution function of the F distribution from f = 0 to 3.33. Another way to compute this tail area is with

FRatioPValue[fratio, n1, n2]

The F distribution for this example is shown in Figure 6.4.

What if we had labeled our two measurements of s the other way around, so that ν₁ = 10, ν₂ = 5 and s₁²/s₂² = 1/3.33? The equivalent question is: what is the probability that f ≤ 1/3.33, which we can evaluate by

CDF[FRatioDistribution[10, 5], 1/3.33] → answer = 0.05

Not surprisingly, we obtain the same probability.

Figure 6.4 The F distribution for ν₁ = 5, ν₂ = 10 degrees of freedom. The measured value of 3.33, indicated by the vertical line, corresponds to f_{0.95; 5, 10}, the 0.95 quantile value.

6.6 Confidence intervals

In this section, we consider how to specify the uncertainty of our estimate of any particular parameter of the population, based on the results of our sample. We start by considering the uncertainty in the population mean when it is known that we are sampling from a population with a normal distribution. There are two cases of interest. In the first, we will assume that we know the variance σ² of the underlying population we are sampling from. More commonly, we don't know the variance and must estimate it from the sample. This is the second case.

6.6.1 Variance σ² known

Let {Xᵢ} be an IID N(μ, σ²) random sample of n = 10 measurements from a population with unknown μ but known σ = 1. Let X̄ be the sample mean random variable, which will have a sample mean standard deviation σₘ = σ/√n = 1/√10 = 0.32, to two decimal places. The probability that X̄ will be within one σₘ = 0.32 of μ is approximately 0.68 (from Section 5.8.1). We can write this as

\[ p(\mu - 0.32 < \bar{X} < \mu + 0.32) = 0.68. \quad (6.41) \]
Since we are interested in making inferences about μ from our sample, we rearrange Equation (6.41) as follows:
\[ p(\mu - 0.32 < \bar{X} < \mu + 0.32) = p(-0.32 < \bar{X} - \mu < 0.32) \]
\[ = p(0.32 > \mu - \bar{X} > -0.32) \]
\[ = p(\bar{X} + 0.32 > \mu > \bar{X} - 0.32) = 0.68, \]
or,
\[ p(\bar{X} - 0.32 < \mu < \bar{X} + 0.32) = 0.68. \quad (6.42) \]
Suppose the measured sample mean is x̄ = 5.40. Can we simply substitute this value into Equation (6.42), which would yield

\[ p(5.08 < \mu < 5.72) = 0.68? \quad (6.43) \]
We need to be careful how we interpret Equations (6.42) and (6.43).

Equation (6.42) says that if we repeatedly draw samples of the same size from this population, and each time compute specific values for the random interval (X̄ − 0.32, X̄ + 0.32), then we would expect 68% of them to contain the unknown mean μ. In frequentist theory, a probability represents the percentage of time that something will happen. It says nothing directly about the probability that any one realization of a random interval will contain μ. The specific interval (5.08, 5.72) is but one realization of the random interval (X̄ − 0.32, X̄ + 0.32), based on the data of a single sample. Since the probability of 0.68 is with reference to the random interval (X̄ − 0.32, X̄ + 0.32), it would be incorrect to say that the probability of μ being contained in the interval (5.08, 5.72) is 0.68.

However, the 0.68 probability of the random interval does suggest that our confidence in the interval (5.08, 5.72) containing the unknown mean μ is high, and we refer to it as a confidence interval. It is only in this sense that we are willing to assign a degree of confidence to the statement 5.08 < μ < 5.72.
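This frequentist reading of the 0.68 can be demonstrated directly by simulation. A minimal sketch (our addition, with an assumed true μ = 5 that the analyst would not know):

(* Fraction of random intervals (X̄ − 0.32, X̄ + 0.32) containing μ *)
mu = 5.0; sigma = 1.0; n = 10;
hits = Table[
   With[{xbar = Mean[RandomVariate[NormalDistribution[mu, sigma], n]]},
    Boole[xbar - 0.32 < mu < xbar + 0.32]], {50000}];
N[Mean[hits]]  (* ≈ 0.68 *)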

Meaning of a confidence interval: When we write p(5.08 < μ < 5.72), we are not making a probability statement in a classical sense, but rather are expressing a degree of confidence. In general, we write p(5.08 < μ < 5.72) = 1 − α, where 1 − α is called the confidence coefficient. It is important to remember that the ‘‘68% confidence’’ refers to the probability of the test, not to the parameter.

