
What is a statistic?


6.1 Introduction

In this chapter, we address the question ‘‘What is a statistic?’’ In particular, we look at what role statistics play in scientific inference and give some common useful examples.

We will examine their role in the two basic inference problems: hypothesis testing (the frequentist equivalent of model selection) and parameter estimation, with emphasis on the latter. Hypothesis testing will be dealt with in Chapter 7.

Recall that an important aspect of frequentist statistical inference is the process of drawing conclusions based on sample data drawn from the population (which is the collection of all possible samples). The concept of the population assumes that, in principle, an infinite number of measurements (under identical conditions) are possible. Suppose X₁, X₂, …, Xₙ are n independent and identically distributed (IID) random variables that constitute a random sample from the population, for which x₁, x₂, …, xₙ is one realization. The population is assumed to have an intrinsic probability distribution (or density function) which, if known, would allow us to predict the likelihood of the sample x₁, x₂, …, xₙ.

For example, suppose the random variable we are measuring is the time interval between successive decays of a radioactive sample. In this case, the population probability density function is a negative exponential (see Section 5.8.5), given by f(x|τ) = (1/τ) exp(−x/τ). The likelihood is given by

\[ L(x_1, x_2, \ldots, x_n|\tau) = \prod_{i=1}^{n} f(x_i|\tau). \quad (6.1) \]
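As a concrete illustration of Equation (6.1), the following sketch (our addition, not from the text) evaluates the likelihood of a small made-up sample for one trial value of τ. It assumes Mathematica's built-in ExponentialDistribution, which is parameterized by the rate λ = 1/τ; the data values are invented for illustration only.

(* Likelihood of an IID exponential sample for a trial value of τ *)
tau = 2.0;                         (* trial mean interval between decays *)
data = {0.4, 1.7, 2.9, 0.8, 3.5};  (* hypothetical measured intervals *)
(* ExponentialDistribution[λ] has density λ Exp[-λ x], so take λ = 1/τ *)
likelihood = Times @@ (PDF[ExponentialDistribution[1/tau], #] & /@ data)

Evaluating this product on a grid of τ values recovers the familiar result that the maximum likelihood value of τ is the sample mean.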

This particular population probability density function is characterized by a single parameter, τ. Another population probability distribution that arises in many problems is the normal (Gaussian) distribution, which has two parameters, μ and σ².

In most problems, the parameters of the underlying population probability distribution are not known. Without knowledge of their values, it is impossible to compute the desired probabilities. However, a population parameter can be estimated from a statistic, which is determined from the information contained in a random sample. It is for this reason that the notion of a statistic and its sampling distribution is so important in statistical inference.


Definition: A statistic is any function of the observed random variables in a sample such that the function does not contain any unknown quantities.

One important statistic is the sample mean X̄, given by
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i. \quad (6.2) \]

Note: we are using a capital X̄, which implies we are talking about a random variable.

All statistics are random variables and, to be useful, we need to be able to specify their sampling distribution.

For example, we might be interested in the mean redshift¹ of a population of cosmic gamma-ray burst (GRB) sources. This would provide information about the distances of these objects and their mean energy. GRBs are the most powerful type of explosion known in the universe. The parameter of interest is the mean redshift, which we designate μ. A parameter of a population is always regarded as a fixed and usually unknown constant. Let Z be a random variable representing GRB redshifts. Suppose the redshifts, {z₁, z₂, …, z₇}, of a sample of seven GRB sources are obtained after a great deal of effort. What can we conclude about the population mean redshift μ from our sample, i.e., how accurately can we determine μ from our sample?

This can be a fairly difficult question to answer using the individual measurements, zᵢ, because we don't know the form of the sampling distribution for GRB source redshifts. Happily, in this case, we can proceed with our objective by exploiting the Central Limit Theorem (CLT), which predicts the sampling distribution of the sample mean statistic. The way to think about this is as follows: consider a thought experiment in which we are able to obtain redshifts for a very large number of samples (a hypothetical reference set) of GRB redshifts. Each sample consists of seven redshift measurements. The means of all these samples will have a distribution. According to the CLT, the distribution of sample means tends to a Gaussian as the number n of observations tends to infinity. In practice, a Gaussian sampling distribution is often employed when n ≥ 5. Of course, we don't have the results from this hypothetical reference set, only the results from our one sample, but at least we know that the shape of the sampling distribution characterizing our sample mean statistic is approximately a Gaussian.
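The thought experiment is also easy to simulate. Here is a minimal sketch (our addition, assuming current Mathematica built-ins) that draws many samples of seven from an exponential population and histograms the resulting sample means; the distribution of means already looks roughly Gaussian:

(* Simulate the hypothetical reference set of sample means *)
tau = 2.0; nSamp = 7; nSets = 10000;
means = Table[Mean[RandomVariate[ExponentialDistribution[1/tau], nSamp]], {nSets}];
Histogram[means, Automatic, "PDF"]  (* roughly Gaussian, centered on τ = 2 *)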

This allows us to make a definite statement about the uncertainty in the population mean redshift μ, which we derive from our one sample of seven redshift measurements.

Just how we do this is discussed in detail in Section 6.6.2. In the course of answering that question, we will encounter the sample variance statistic, S², and develop the notion of a sampling distribution of a statistic.

¹ Redshift is a measure of the wavelength shift produced by the Doppler effect. In 1929, Edwin Hubble showed that we live in an expanding universe in which the velocity of recession of a galaxy is proportional to its distance. A recession velocity shifts the observed wavelength of a spectral line to longer wavelengths, i.e., to the red end of the optical spectrum.

6.2 The χ² distribution

The sampling distribution of any particular statistic is the probability distribution of that statistic that would be determined from an infinite number of independent samples, each of size n, from an underlying population. We start with a treatment of the χ² sampling distribution.² We will prove in Section 6.3 that the χ² distribution describes the distribution of the variances of samples taken from a normal distribution. The χ² distribution is a special case of the gamma distribution:

\[ f(x|\alpha, \theta) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x^{\alpha - 1} \exp\!\left(-\frac{x}{\theta}\right) \quad (6.3) \]
with θ = 2 and α = ν/2, where ν is called the degrees of freedom.

The χ² distribution has the following properties:

\[ f(x|\nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp\!\left(-\frac{x}{2}\right) \quad (6.4) \]
\[ \langle x \rangle = \nu; \qquad \mathrm{Var}[x] = 2\nu. \quad (6.5) \]

The coefficients of skewness (α₃) and kurtosis (α₄) are given by
\[ \alpha_3 = \frac{4}{\sqrt{2\nu}}; \qquad \alpha_4 = 3\left(1 + \frac{4}{\nu}\right). \quad (6.6) \]

Finally, the moment generating function of χ² with ν degrees of freedom is given by
\[ m_{\chi^2}(t) = (1 - 2t)^{-\nu/2}. \quad (6.7) \]
We now prove two useful theorems pertaining to the χ² distribution.

Theorem 1:

Let {Xᵢ} = X₁, X₂, …, Xₙ be an IID sample from a normal distribution N(μ, σ). Let
\[ Y = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^{n} Z_i^2, \]
where the Zᵢ are standard random variables. Then Y has a chi-squared (χ²ₙ) distribution with n degrees of freedom.

Proof:

Let m_Y(t) = the moment generating function (recall Section 5.6) of Y. From Equation (5.9), we can write
\[ m_Y(t) = \langle e^{tY} \rangle = \langle e^{t\sum_i Z_i^2} \rangle = \langle e^{tZ_1^2} e^{tZ_2^2} \cdots e^{tZ_n^2} \rangle. \quad (6.8) \]

² The χ² statistic plays an important role in fitting models to data using the least-squares method, which is discussed in great detail in Chapters 10 and 11.

Since the random variables Zᵢ are IID, then

\[ m_Y(t) = \langle e^{tZ_1^2} \rangle \langle e^{tZ_2^2} \rangle \cdots \langle e^{tZ_n^2} \rangle = m_{Z_1^2}(t)\, m_{Z_2^2}(t) \cdots m_{Z_n^2}(t). \quad (6.9) \]
The moment generating function for each Zᵢ is given by

\[ m_{Z^2}(t) = \langle e^{tZ^2} \rangle = \int_{-\infty}^{\infty} e^{tz^2} f(z)\, dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{z^2(1 - 2t)}{2}\right] dz, \quad (6.10) \]
where we have made use of the fact that f(z) is also a normal distribution, i.e., a Gaussian.

Multiplying and dividing Equation (6.10) by (1 − 2t)^{−1/2}, we get
\[ m_{Z^2}(t) = (1 - 2t)^{-1/2} \int_{-\infty}^{\infty} \sqrt{\frac{1 - 2t}{2\pi}}\, \exp\!\left[-\frac{(1 - 2t)z^2}{2}\right] dz = (1 - 2t)^{-1/2}, \quad (6.11) \]
since the integral of a normal distribution (here one with variance 1/(1 − 2t)) equals 1.

Therefore,
\[ m_Y(t) = (1 - 2t)^{-n/2}. \quad (6.12) \]
Comparison of Equations (6.12) and (6.7) shows that Y has a χ² distribution with n degrees of freedom, which we designate by χ²ₙ. Figure 6.1 illustrates the χ² distribution for three different choices of the number of degrees of freedom.

Example:

In Section 5.9, we showed that for any IID sampling distribution with a finite variance, (X̄ − μ)√n/σ tends to N(0, 1) as n → ∞, and therefore [(X̄ − μ)√n/σ]² is approximately χ²₁, with one degree of freedom.³

³ When sampling from a normal distribution, the distribution of (X̄ − μ)√n/σ is always N(0, 1), regardless of the value of n.
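Theorem 1 itself can be checked numerically. The following sketch (our addition, assuming the modern built-in distribution functions rather than the old Statistics package) forms many sums of n squared standard normals and compares their mean and variance with Equation (6.5):

(* Sums of n squared standard normals should follow χ² with ν = n *)
n = 5;
y = Table[Total[RandomVariate[NormalDistribution[0, 1], n]^2], {10000}];
{Mean[y], Variance[y]}  (* ≈ {ν, 2 ν} = {5, 10} *)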

Theorem 2:

If X₁ and X₂ are two independent χ²-distributed random variables with ν₁ and ν₂ degrees of freedom, then Y = X₁ + X₂ is also χ²-distributed, with ν₁ + ν₂ degrees of freedom.

Proof:

Since X₁ and X₂ are independent, the moment generating function of Y is given by
\[ m_Y(t) = m_{X_1}(t)\, m_{X_2}(t) = (1 - 2t)^{-\nu_1/2}\,(1 - 2t)^{-\nu_2/2} \quad (6.13) \]
\[ = (1 - 2t)^{-(\nu_1 + \nu_2)/2}, \quad (6.14) \]
which equals the moment generating function of a χ² random variable with ν₁ + ν₂ degrees of freedom.

6.3 Sample variance S²

We often want to estimate the variance (σ²) of a population from an IID sample taken from a normal distribution. We usually don't know the mean (μ) of the population, so we use the sample mean (X̄) as an estimate. To estimate σ², we use another random variable called the sample variance (S²), defined as follows:

\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}. \quad (6.15) \]

Figure 6.1 The χ² distribution for three different choices of the number of degrees of freedom (ν = 1, 3, and 8).

Just why we define the sample variance random variable in this way will soon be made clear. Of course, for any particular sample of n data values, the sample variance random variable would take on a particular value, designated by the lower case s².

Here is a useful theorem that enables us to estimate σ from S:

Theorem 3:

The expectation value of a quantity that has a χ² distribution with (n − 1) degrees of freedom is equal to the number of degrees of freedom (see Equation (6.5)).

Therefore, since (n − 1)S²/σ² has a χ² distribution with (n − 1) degrees of freedom,
\[ \left\langle \frac{(n-1)S^2}{\sigma^2} \right\rangle = n - 1 \quad \Rightarrow \quad \langle S^2 \rangle = \sigma^2. \quad (6.20) \]
This provides justification for our definition of S² – its expectation value is the population variance. Note: this does not mean that S² will equal σ² for any particular sample.

Note 1: We have just established Equation (6.20) when sampling from a normal distribution. We now show that Equation (6.20) is valid for IID sampling from any arbitrary distribution with finite variance. From Equation (6.16) (the identity Σᵢ(Xᵢ − X̄)² = Σᵢ(Xᵢ − μ)² − n(X̄ − μ)²), we can write

\[ \langle S^2 \rangle = \frac{\sum_i \langle (X_i - \mu)^2 \rangle - n\langle (\bar{X} - \mu)^2 \rangle}{n - 1}. \quad (6.21) \]

But ⟨(Xᵢ − μ)²⟩ = Var(Xᵢ) = σ² by definition, and ⟨(X̄ − μ)²⟩ = Var(X̄) = σ²/n from Equation (5.50). It follows that

\[ \langle S^2 \rangle = \frac{n\sigma^2 - \sigma^2}{n - 1} = \sigma^2. \quad (6.22) \]

Thus, Equation (6.22) is valid for IID sampling from an arbitrary distribution with finite variance. In the language of frequentist statistics, we say that S², as defined in Equation (6.15), is an unbiased estimator of σ².
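The unbiasedness of S² is easy to demonstrate by simulation; a sketch (our addition) with assumed population values μ = 3 and σ = 2:

(* The average of many sample variances S² (n − 1 divisor) approaches σ² *)
mu = 3.0; sigma = 2.0; n = 8;
s2 = Table[Variance[RandomVariate[NormalDistribution[mu, sigma], n]], {20000}];
Mean[s2]  (* ≈ σ² = 4; an n divisor would instead give ≈ σ²(n − 1)/n = 3.5 *)

Note that Mathematica's Variance uses the (n − 1) divisor of Equation (6.15).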

Standard error of the sample mean: We often want to quote a typical error for the mean of a population based on our sample. According to Equation (5.50), Var(X̄) = σ²/n for any distribution with finite variance. Since we do not normally know σ², the variance of the population, we use the sample variance as an estimate.

The standard error of the sample mean is defined as
\[ \frac{S}{\sqrt{n}}. \quad (6.23) \]

In Section 6.6.2 we will use a Student’s t distribution to be more precise about specifying the uncertainty in our estimate of the population mean from the sample mean.

Note 2: In a situation where we know the population mean μ but not σ², define
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}. \quad (6.24) \]

It is easily shown that, with this definition, nS²/σ² is χ²ₙ with n degrees of freedom. We lose one degree of freedom when we estimate μ from X̄.

Example:

A random sample of size n = 16 (IID sample) is drawn from a population with a normal distribution of unknown mean (μ) and variance (σ²). We compute the sample variance, S², and want to determine
\[ p(\sigma^2 < 0.49\, S^2). \quad (6.25) \]

Solution: Equation (6.25) is equivalent to
\[ p\!\left(\frac{S^2}{\sigma^2} > 2.041\right). \quad (6.26) \]

We know that the random variable X = (n − 1)S²/σ² has a χ² distribution with (n − 1) degrees of freedom. In this case, (n − 1) = 15 = ν degrees of freedom. Therefore,

\[ p\!\left(\frac{S^2}{\sigma^2} > 2.041\right) = p\!\left(\frac{(n-1)S^2}{\sigma^2} > 30.61\right). \quad (6.27) \]

Let
\[ \alpha = p\!\left(\frac{(n-1)S^2}{\sigma^2} > 30.61\right). \]

Then
\[ 1 - \alpha = p\!\left(\frac{(n-1)S^2}{\sigma^2} \le 30.61\right), \]

or more generally, 1 − α = p(X ≤ x_{1−α}), where x_{1−α} is the particular value of the random variable X for which the cumulative distribution p(X ≤ x_{1−α}) = 1 − α. x_{1−α} is called the (1 − α) quantile value of the distribution, and p(X ≤ x_{1−α}|ν) is given by
\[ p(X \le x_{1-\alpha}|\nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^{x_{1-\alpha}} t^{\nu/2 - 1} \exp\!\left(-\frac{t}{2}\right) dt = 1 - \alpha. \quad (6.28) \]

For ν = 15 degrees of freedom, 30.61 corresponds to α = 0.01, or x_{0.99}. Thus, the probability that the random variable σ² < 0.49 S² is 1%. Figure 6.2 shows the χ² distribution for ν = 15 degrees of freedom and the 1 − α = 0.99 quantile value.

Figure 6.2 The χ² distribution for ν = 15 degrees of freedom. The vertical line marks the 1 − α = 0.99 quantile value (χ²_{0.99}). The area to the left of this line corresponds to 1 − α.

We can evaluate Equation (6.28) with the following Mathematica command:

Box 6.1 Mathematica χ² significance

Needs["Statistics`ContinuousDistributions`"]

The line above loads a package containing a wide range of continuous distributions of importance to statistics, and the following line computes α, the area in the tail of the χ² distribution to the right of χ² = 30.61, for ν = 15 degrees of freedom:

(1 - CDF[ChiSquareDistribution[15], 30.61]) → answer = 0.01

In statistical hypothesis testing (to be discussed in the next chapter), α is referred to as the significance or the one-sided P-value of a statistical test.
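In current versions of Mathematica these distributions are built in (no package load is needed), and the quantile itself can be read off directly; for example:

Quantile[ChiSquareDistribution[15], 0.99]  (* → 30.5779, the x_{0.99} quantile used above *)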

6.4 The Student's t distribution

Recall that, when sampling from a normal distribution with known standard deviation, σ, the distribution of the standard random variable Z = (X̄ − μ)√n/σ is N(0, 1). In practice, σ is usually not known. The logical thing to do is to replace σ by the sample standard deviation S. The usual inference desired is that there is a specified probability that X̄ lies within ±S of the true mean.

Unfortunately, the distribution of (X̄ − μ)√n/S is not N(0, 1). However, it is possible to determine the exact sampling distribution of (X̄ − μ)√n/S when sampling from N(μ, σ) with both μ and σ² unknown. To this end, we examine the Student's t distribution.⁴ The following useful theorem pertaining to the Student's t distribution is given without proof.

Theorem 4:

Let Z be a standard normal random variable and let X be a χ² random variable with ν degrees of freedom. If Z and X are independent, then the random variable

\[ T = \frac{Z}{\sqrt{X/\nu}} \quad (6.29) \]
has a Student's t distribution with ν degrees of freedom and a probability density given by
\[ f(t|\nu) = \frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu\pi}\,\Gamma(\nu/2)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}. \quad (6.30) \]

⁴ The t distribution is named for its discoverer, William Gosset, who wrote a number of statistical papers under the pseudonym ‘‘Student.’’ He worked as a brewer for the Guinness brewery in Dublin in 1899. He developed the t distribution in the course of analyzing the variability of various materials used in the brewing process.

The Student's t distribution has the following properties: its mean is ⟨t⟩ = 0 (for ν > 1) and its variance is Var(t) = ν/(ν − 2) (for ν > 2), so it is symmetric about zero but broader than N(0, 1). When sampling from a normal distribution, (X̄ − μ)√n/S is a random variable with a Student's t distribution with n − 1 degrees of freedom. Figure 6.3 shows a comparison of a Student's t distribution for three degrees of freedom, and a standard normal. The broader wings of the Student's t distribution are clearly evident.

The (1 − α) quantile value for ν degrees of freedom, t_{1−α,ν}, is given by
\[ p(T \le t_{1-\alpha,\nu}) = \int_{-\infty}^{t_{1-\alpha,\nu}} f(t|\nu)\, dt = 1 - \alpha. \]

Example:

Suppose a cigarette manufacturer claims that one of their brands has an average nicotine content of 0.6 mg per cigarette. An independent testing organization

Figure 6.3 Comparison of a standard normal distribution and a Student's t distribution for 3 degrees of freedom.

measures the nicotine content of 16 such cigarettes and has determined the sample average and the sample standard deviation to be 0.75 and 0.197 mg, respectively. If we assume the amount of nicotine is a normal random variable, how likely is the sample result given the manufacturer’s claim?

T = (X̄ − μ)√n/S has a Student's t distribution.

x̄ = 0.75 mg, s = 0.197 mg, and n = 16, so the number of degrees of freedom ν = 15.

The manufacturer's claim μ = 0.6 mg corresponds to
\[ t = \frac{0.75 - 0.6}{0.197/\sqrt{16}} = 3.045. \quad (6.34) \]

The Student's t distribution is a continuous distribution, and thus we cannot calculate the probability of any specific t value, since there is no area under a point. The question of how likely the t value is, given the manufacturer's claim, is usually interpreted as: what is the probability by chance that T ≥ 3.045? The area of the distribution beyond the sample t value gives us a measure of how far out in the tail of the distribution the sample value resides.

Box 6.2 Mathematica solution:

We can solve the above problem with the following commands:

Needs["Statistics`ContinuousDistributions`"]

The following line computes the area in the tail of the T distribution beyond T = 3.045:

(1 - CDF[StudentTDistribution[15], 3.045]) → answer = 0.004 (ν = 15)

where CDF[StudentTDistribution[15], 3.045] stands for the cumulative distribution function of the T distribution from T = −∞ to 3.045.

Therefore, p(T > 3.045) = α = 0.004, or 0.4%, i.e., the manufacturer's claim is very improbable. The way to think of this is to imagine we could repeatedly obtain samples of 16 cigarettes and compute the value of t for each sample. The fraction of these t values that we would expect to fall in the tail area beyond t > 3.045 is only 0.4%. If the manufacturer's claim were reasonable, we would expect that the t value of our actual sample would not fall so far out in the tail of the distribution. If you are still puzzled by this reasoning, we will have a lot more to say about it in Chapter 7. We will revisit this example in Section 7.2.3.
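The repeated-sampling picture can be made concrete with a short simulation (a sketch, not from the text). Under the manufacturer's claim μ = 0.6 mg, the fraction of samples of 16 whose t value exceeds 3.045 should be about 0.4%; the assumed population σ is arbitrary because t is scale-free:

(* Fraction of t values beyond 3.045 when μ really is 0.6 *)
mu0 = 0.6; sigma = 0.2; n = 16;  (* σ chosen arbitrarily *)
tvals = Table[
   With[{x = RandomVariate[NormalDistribution[mu0, sigma], n]},
    (Mean[x] - mu0) Sqrt[n]/StandardDeviation[x]], {100000}];
N[Count[tvals, _?(# > 3.045 &)]/100000]  (* ≈ 0.004 *)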

Note: although (x̄ − μ)/s = 0.15/0.197 < 1, s is not a meaningful uncertainty for x̄ – only for the xᵢ. The usual measure of the uncertainty in x̄ is s/√n = 0.049. The quantity s/√n is called the standard error of the sample mean.

6.5 F distribution (F-test)

The F distribution is used to find out if two data sets have significantly different variances. For example, we might be interested in the effect of a new catalyst in the brewing of beer, so we compare some measurable property of a sample brewed with the catalyst to a sample from the control batch made without the catalyst. What effect has the catalyst had on the variance of this property?

Here, we develop the appropriate random variable for use in making inferences about the variances of two independent normal distributions, based on a random sample from each. Recall that inferences about σ², when sampling from a normal distribution, are based on the random variable (n − 1)S²/σ², which has a χ²ₙ₋₁ distribution.

Theorem 5:

Let X and Y be two independent χ² random variables with ν₁ and ν₂ degrees of freedom. Then the random variable

\[ F = \frac{X/\nu_1}{Y/\nu_2} \quad (6.35) \]

has an F distribution with a probability density function

\[ p(f|\nu_1, \nu_2) = \begin{cases} \dfrac{\Gamma[(\nu_1 + \nu_2)/2]}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \left(\dfrac{\nu_1}{\nu_2}\right)^{\nu_1/2} \dfrac{f^{\frac{1}{2}(\nu_1 - 2)}}{(1 + f\nu_1/\nu_2)^{\frac{1}{2}(\nu_1 + \nu_2)}}, & f > 0 \\[1ex] 0, & \text{elsewhere.} \end{cases} \quad (6.36) \]

An F distribution has the following properties:

\[ \langle F \rangle = \frac{\nu_2}{\nu_2 - 2}, \quad (\nu_2 > 2). \quad (6.37) \]

(Surprisingly, ⟨F⟩ depends only on ν₂ and not on ν₁.)
\[ \mathrm{Var}(F) = \frac{2\nu_2^2(\nu_2 + \nu_1 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}, \quad (\nu_2 > 4) \quad (6.38) \]
\[ \mathrm{Mode} = \frac{\nu_2(\nu_1 - 2)}{\nu_1(\nu_2 + 2)}. \quad (6.39) \]

Let X = (n₁ − 1)S₁²/σ₁² and Y = (n₂ − 1)S₂²/σ₂². Then,
\[ F_{\nu_1 \nu_2} = \frac{X/\nu_1}{Y/\nu_2} = \frac{X/(n_1 - 1)}{Y/(n_2 - 1)} = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}. \quad (6.40) \]

Box 6.3 Mathematica example:

The sample variance is s₁² = 16.65 for n₁ = 6 IID samples from a normal distribution with a population variance σ₁², and s₂² = 5.0 for n₂ = 11 IID samples from a second independent normal distribution with a population variance σ₂². If we assume that σ₁² = σ₂², then from Equation (6.40) we obtain f = 3.33 for ν₁ = n₁ − 1 = 5 and ν₂ = n₂ − 1 = 10 degrees of freedom. What is the probability of getting an f value ≥ 3.33 by chance if σ₁² = σ₂²?

Needs["Statistics`ContinuousDistributions`"]

The following line computes the area in the tail of the F distribution beyond f = 3.33:

(1 - CDF[FRatioDistribution[5, 10], 3.33]) → answer = 0.05

where CDF[FRatioDistribution[5, 10], 3.33] stands for the cumulative distribution function of the F distribution from f = 0 to 3.33. Another way to compute this tail area is with

FRatioPValue[fratio, n1, n2]

The F distribution for this example is shown in Figure 6.4.

What if we had labeled our two measurements of s the other way around, so that ν₁ = 10, ν₂ = 5 and s₁²/s₂² = 1/3.33? The equivalent question is: what is the probability that f ≤ 1/3.33, which we can evaluate by

CDF[FRatioDistribution[10, 5], 1/3.33] → answer = 0.05

Not surprisingly, we obtain the same probability.

Figure 6.4 The F distribution for ν₁ = 5, ν₂ = 10 degrees of freedom. The measured value of 3.33, indicated by the vertical line, corresponds to f_{0.95; 5, 10}, the 0.95 quantile value.

6.6 Confidence intervals

In this section, we consider how to specify the uncertainty of our estimate of any particular parameter of the population, based on the results of our sample. We start by considering the uncertainty in the population mean when it is known that we are sampling from a population with a normal distribution. There are two cases of interest. In the first, we will assume that we know the variance σ² of the underlying population we are sampling from. More commonly, we don't know the variance and must estimate it from the sample. This is the second case.

6.6.1 Variance σ² known

Let {Xᵢ} be an IID N(μ, σ²) random sample of n = 10 measurements from a population with unknown μ but known σ = 1. Let X̄ be the sample mean random variable, which will have a sample mean standard deviation σₘ = σ/√n = 1/√10 = 0.32, to two decimal places. The probability that X̄ will be within one σₘ = 0.32 of μ is approximately 0.68 (from Section 5.8.1). We can write this as

\[ p(\mu - 0.32 < \bar{X} < \mu + 0.32) = 0.68. \quad (6.41) \]
Since we are interested in making inferences about μ from our sample, we rearrange Equation (6.41) as follows:
\[ p(\mu - 0.32 < \bar{X} < \mu + 0.32) = p(-0.32 < \bar{X} - \mu < 0.32) \]
\[ = p(0.32 > \mu - \bar{X} > -0.32) \]
\[ = p(\bar{X} + 0.32 > \mu > \bar{X} - 0.32) = 0.68, \]
or,
\[ p(\bar{X} - 0.32 < \mu < \bar{X} + 0.32) = 0.68. \quad (6.42) \]
Suppose the measured sample mean is x̄ = 5.40. Can we simply substitute this value into Equation (6.42), which would yield

\[ p(5.08 < \mu < 5.72) = 0.68? \quad (6.43) \]
We need to be careful how we interpret Equations (6.42) and (6.43).

Equation (6.42) says that if we repeatedly draw samples of the same size from this population, and each time compute specific values for the random interval (X̄ − 0.32, X̄ + 0.32), then we would expect 68% of them to contain the unknown mean μ. In frequentist theory, a probability represents the percentage of time that something will happen. It says nothing directly about the probability that any one realization of a random interval will contain μ. The specific interval (5.08, 5.72) is but one realization of the random interval (X̄ − 0.32, X̄ + 0.32), based on the data of a single sample. Since the probability of 0.68 is with reference to the random interval (X̄ − 0.32, X̄ + 0.32), it would be incorrect to say that the probability of μ being contained in the interval (5.08, 5.72) is 0.68.

However, the 0.68 probability of the random interval does suggest that our confidence in the interval (5.08, 5.72) containing the unknown mean μ is high, and we refer to it as a confidence interval. It is only in this sense that we are willing to assign a degree of confidence to the statement 5.08 < μ < 5.72.
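This frequentist reading of the 0.68 can be demonstrated directly by simulation. A minimal sketch (our addition, with an assumed true μ = 5 that the analyst would not know):

(* Fraction of random intervals (X̄ − 0.32, X̄ + 0.32) containing μ *)
mu = 5.0; sigma = 1.0; n = 10;
hits = Table[
   With[{xbar = Mean[RandomVariate[NormalDistribution[mu, sigma], n]]},
    Boole[xbar - 0.32 < mu < xbar + 0.32]], {50000}];
N[Mean[hits]]  (* ≈ 0.68 *)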

Meaning of a confidence interval: When we write p(5.08 < μ < 5.72), we are not making a probability statement in a classical sense, but rather are expressing a degree of confidence. In general, we write p(5.08 < μ < 5.72) = 1 − α, where 1 − α is called the confidence coefficient. It is important to remember that the ‘‘68% confidence’’ refers to the probability of the test, not to the parameter.

