
3.3 Continuous distributions

3.3.2 The t, F and χ2 distributions

Three other continuous distributions that we will make use of repeatedly in the remainder of this book are the t, F and χ2 distributions.

1 The number -5.32907e-16 is in scientific notation. The part e-16 specifies that the decimal point should be shifted 16 positions to the left, yielding -0.000000000000000532907 in standard notation.

The t-DISTRIBUTION is closely related to the normal distribution. It has one parameter, known as its DEGREES OF FREEDOM (often abbreviated to df). Informally, degrees of freedom can be understood as a measure of how much precision an estimate has. This parameter controls the thickness of the tails of the distribution, as illustrated in the upper left panel of Figure 3.8. The grey line represents the standard normal distribution, the solid line a t-distribution with 2 degrees of freedom, and the dashed line a t-distribution with 5 degrees of freedom. As the degrees of freedom increase, the probability density function becomes more and more similar to that of the standard normal. For 30 or more degrees of freedom, the curves are already very similar, and for more than 100 degrees of freedom, they are virtually indistinguishable. The t-distribution plays an important role in many statistical tests, and we will use it frequently in the remainder of this book. R makes the by now familiar four functions available for this distribution: dt(), pt(), qt() and rt(). Of these functions, the cumulative distribution function is the one we will use most. Here, we use it to illustrate the greater thickness of the tails of the t-distribution compared to the standard normal:

> pnorm(-3, 0, 1)
[1] 0.001349898

> pt(-3, 2)
[1] 0.04773298

The probability of observing extreme values (values less than −3 in this example) is greater for the t-distribution. This is what we mean when we say that the t-distribution has thicker tails.
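As a further rough illustration of how the tails thin out with increasing degrees of freedom (a small sketch added here; the particular degrees of freedom are chosen arbitrarily), we can compare the same left-tail probability for several t-distributions with the corresponding value for the standard normal:

> # illustrative sketch: P(T < -3) for increasing degrees of freedom;
> # the values approach pnorm(-3, 0, 1) as the degrees of freedom grow
> pt(-3, c(2, 5, 30, 100))
> pnorm(-3, 0, 1)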

There are many other continuous probability distributions besides the normal and t-distributions. We will often need two of these distributions: the F-DISTRIBUTION and the χ2-DISTRIBUTION. The F-distribution has two parameters, referred to as DEGREES OF FREEDOM 1 and DEGREES OF FREEDOM 2. The upper right panel of Figure 3.8 shows the probability density function of the F-distribution for 4 different combinations of degrees of freedom. The ratio of two variances is F-distributed, and a question that often arises in statistical testing is whether the variance in the numerator is so much larger than the variance in the denominator that we have reason to be surprised.

For instance, if the F ratio is 6, then, depending on the degrees of freedom associated with the two variances, the probability of this value may be small (surprise) or large (no surprise).

> 1 - pf(6, 1, 1)
[1] 0.2467517

> 1 - pf(6, 20, 8)
[1] 0.006905409

Here, pf() is the cumulative distribution function, which gives the probability of a ratio less than or equal to 6 (compare pt() for the t-distribution and ppois() and pbinom() for the Poisson and binomial distributions). To obtain the probability of a more extreme ratio, we take the complement probability.
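To make the link between variances and the F-distribution concrete, the following minimal simulation sketch (added here for illustration; the sample sizes 21 and 9 are arbitrary) draws two samples from the same standard normal distribution, so that their variance ratio follows an F-distribution with 20 and 8 degrees of freedom:

> # sketch: the ratio of two sample variances from the same normal distribution
> # is F-distributed with (n1 - 1, n2 - 1) degrees of freedom
> x = rnorm(21)             # numerator sample: 20 degrees of freedom
> y = rnorm(9)              # denominator sample: 8 degrees of freedom
> ratio = var(x) / var(y)   # observed F ratio
> 1 - pf(ratio, 20, 8)      # probability of an even more extreme ratio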

The lower panels of Figure 3.8 show the probability density functions for three χ2-distributions. The χ2-distribution has a single parameter, which is also referred to as its degrees of freedom. The lower left panel shows the density function for a single degree of freedom, the lower right panel gives the densities for 5 (solid line) and 10 (dashed line) degrees of freedom.

Figure 3.8: Probability density functions. Upper left: t-distributions with 2 (solid line) and 5 (dashed line) degrees of freedom, and the standard normal (grey line). Upper right: F-distributions with 5, 5 (black, solid line), 2, 1 (grey, dashed line), 5, 1 (grey, solid line) and 10, 10 (black, dashed line) degrees of freedom. Lower left: a χ2-distribution with 1 degree of freedom. Lower right: χ2-distributions with 5 (solid line) and 10 (dashed line) degrees of freedom.

The degree of non-homogeneity of a contingency table (see, e.g., Figure 2.6 in Chapter 2) can be assessed by means of a statistic named chi-squared, which, unsurprisingly given its name, follows a χ2-distribution. Given a chi-squared value and its associated degrees of freedom, we use the cumulative distribution function pchisq() to obtain the probability gauging the extent to which we have reason for surprise.

> 1 - pchisq(4, 1)
[1] 0.04550026

> 1 - pchisq(4, 5)
[1] 0.549416

> 1 - pchisq(4, 10)
[1] 0.947347

These examples illustrate that the p-value for one and the same chi-squared value (here 4) depends on the degrees of freedom. As the degrees of freedom increase, the p-values increase as well. This is also evident in the lower panels of Figure 3.8. For 1 degree of freedom, a 4 is already a rather extreme value. But for 5 degrees of freedom, a 4 is more or less in the center of the distribution, and for 10 degrees of freedom, it is in fact a rather low value instead of a very high value.
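One way to see why this happens is to note that the bulk of a χ2-distribution shifts to the right as the degrees of freedom grow. A brief sketch (added here as an aside; qchisq() supplied with the proportion 0.5 gives the median of each distribution):

> # sketch: medians of chi-squared distributions with 1, 5 and 10 degrees
> # of freedom; a chi-squared value of 4 lies far above the first median
> # but below the other two
> qchisq(0.5, c(1, 5, 10))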

3.4 Exercises

The text of Lewis Carroll's Alice's Adventures in Wonderland is available as the data set alice. In this exercise, we study the distribution of three words in this book: Alice, very, and Hare (the second noun of the collocation March Hare).

> alice[1:4]

[1] "alice’s" "adventures" "in" "wonderland"

The vector alice contains all words (defined as sequences of non-space characters) in this novel. Our goal is to partition this text into 40 equal-sized text chunks, and to study the frequencies with which our three target words occur in these 40 chunks.

A text with 25942 words cannot be divided into 40 equal-sized text chunks: we are left with a remainder of 22 tokens.

> 25942 %% 40    # %% is the remainder operator
[1] 22
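The corresponding integer division (a small added aside) shows where the number 25920 used below comes from: 40 chunks of 648 words each.

> # integer division gives the number of words per chunk
> 25942 %/% 40
[1] 648
> 648 * 40
[1] 25920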

We therefore restrict ourselves to the first 25920 tokens, and use cut() to partition the sequence of tokens into 40 equally sized chunks. The output of cut() is a factor whose levels represent the successive equal-sized chunks of data. For each element in its input vector, i.e., for each word, it specifies the chunk to which that word belongs. We combine the words and the information about their chunks into a data frame with the function data.frame().
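As a minimal illustration of how cut() works (a toy example added here, not taken from the alice data), consider ten positions divided into two chunks:

> # sketch: cut() assigns each position to one of two equal-width chunks
> cut(1:10, breaks = 2, labels = F)
 [1] 1 1 1 1 1 2 2 2 2 2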

> wonderland = data.frame(word = alice[1:25920],
+     chunk = cut(1:25920, breaks = 40, labels = F))

> wonderland[1:4, ]
        word chunk
1    alice's     1
2 adventures     1
3         in     1
4 wonderland     1

We now add a vector of truth values to this data frame to indicate which rows contain the exact string "alice".


> wonderland$alice = wonderland$word == "alice"

> wonderland[1:4, ]
        word chunk alice
1    alice's     1 FALSE
2 adventures     1 FALSE
3         in     1 FALSE
4 wonderland     1 FALSE
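Summing a logical vector counts its TRUE values, which is what makes the per-chunk counting below work. A one-line illustration (added here as an aside):

> # TRUE counts as 1 and FALSE as 0 when summed
> sum(c(TRUE, FALSE, TRUE, TRUE))
[1] 3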

We count how often the word Alice (alice) occurs in each chunk.

> countOfAlice = tapply(wonderland$alice, wonderland$chunk, sum)

> countOfAlice

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 8  7  9  8  4 10  8  8  9  7  8  7  8 14  9 11  6 10 12 14
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
13 13 15 11 11 12 10  7 12 12 16 10 13  8  7  3  8  8  5  6

Finally, we make a frequency table of these counts with xtabs().

> countOfAlice.tab = xtabs(~countOfAlice)
> countOfAlice.tab
countOfAlice
 3  4  5  6  7  8  9 10 11 12 13 14 15 16
 1  1  1  2  5  9  3  4  3  4  3  2  1  1

There is one chunk in which Alice appears only three times (chunk 36), and nine chunks in which this word occurs eight times (e.g., chunks 1 and 4).
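If desired, the chunk numbers behind these observations can be read off with which(); a brief sketch (added here):

> # locate the chunks with a given count of Alice
> which(countOfAlice == 3)   # the single chunk with only three occurrences
> which(countOfAlice == 8)   # the nine chunks with eight occurrences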

1. Create similar tables for the words hare and very.

2. Make a plot that displays by means of high-density lines how often Alice occurs in the successive chunks. Make similar plots for very and hare. What do you see?

3. Make a plot with the number of times Alice occurs in the chunks on the horizontal axis (i.e., as.numeric(names(alice.tab))), and with the proportion of chunks with that count on the vertical axis. Use high-density lines. Make similar sample density plots for very and for hare.

4. Also plot the corresponding densities under the assumption that these words follow a Poisson distribution with an estimated rate parameter λ equal to the mean of the counts in the chunks. Compare the Poisson densities with the sample densities.

5. Make quantile-quantile plots for graphical inspection of whether Alice, very and hare might follow a Poisson distribution. First create the vector of theoretical quantiles for the X-coordinates, using as percentage points 5%, 10%, 15%, ..., 100%. Supply the percentage points as a vector of proportions as first argument to qpois(). The second argument is λ, estimated by the mean count. The sample quantiles are obtained with quantile().


6. The mean count of Alice is 9.4. In chunk 39, Alice is observed only 5 times. Suppose we only have this chunk of text available. Calculate the likelihood of observing Alice more than 10 times in another chunk of similar size. Assume that Alice follows a Poisson distribution. Recalculate this probability on the basis of the mean count, and compare the expected number of chunks in which Alice occurs more than 10 times with the actual number of chunks.


Chapter 4

Basic statistical methods

The logic underlying the statistical tests described in this book is simple. A statistical test produces a TEST STATISTIC of which the distribution is known.1 What we want to know is whether the test statistic has a value that is extreme, so extreme that it is unlikely to be attributable to chance. In the traditional terminology, we pit a NULL-HYPOTHESIS, actually a straw man, according to which the test statistic does not have an extreme value, against an alternative hypothesis according to which its value is indeed extreme. Whether a test statistic has an extreme value is evaluated by calculating how far out it is in one of the tails of the distribution. Functions like pt(), pf(), and pchisq() tell us how far out we are in a tail by means of p-values, which assess what proportion of the population has even more extreme values. The smaller this proportion is, the more reason we have for surprise that our test statistic is as extreme as it actually is.

However, the fuzzy notion of what counts as extreme needs to be made more precise.

It is generally assumed that a probability begins to count as extreme by the time it drops below 0.05. However, opinions differ with respect to how significance should be assessed.

One tradition holds that the researcher should begin by defining what counts as extreme, before gathering and analysing data. The cutoff probability for considering a test statistic as extreme is referred to as the α LEVEL or SIGNIFICANCE LEVEL. The α level 0.05 is marked by one asterisk in R. More stringent α levels are 0.01 (marked by two asterisks) and 0.001 (marked by three asterisks). If the observed value of one's test statistic is extreme given this pre-defined α level, i.e., if the associated p-value (obtained with, for instance, pnorm()) is less than α, then the outcome is declared to be statistically significant. If you fix α at 0.05, the α level enforced by most linguistic and psycholinguistic journals, then all you should do is report whether p < 0.05 or p > 0.05.

However, a cut-off point like 0.05 is quite arbitrary. This is why I have disabled significance stars in summary tables when the languageR package is attached (with options(show.signif.stars=FALSE)). If an experiment that required half a year's preparation results in a p-value of 0.052, it would have failed to reveal a statistically significant effect, whereas if it had produced a p-value of 0.048, it would have succeeded in showing a statistically significant effect. Therefore, many researchers prefer to interpret p-values as a measure of surprise. Instead of reporting p < 0.10 or p < 0.05, they report p = 0.052 or p = 0.048. This allows the reader to make up her own mind about how surprising this really is. This is important, because assessing what counts as surprise often depends on many considerations that are difficult to quantify.

1 This chapter introduces tests based on what is known as FREQUENTIST statistical inference. For an introduction to the alternative school in statistics known as BAYESIAN inference, see Bolstad [2004].

For instance, although most journals will accept a significance level of 0.05, no one in his right mind would want to cross a bridge that has a mere probability of 0.05 of collapsing. Nor would anyone like to use a medicine that has fatal side effects for one out of twenty patients, or even only one out of a thousand patients. When a paper with a result that is significant at the 5% level is accepted for publication, this is only because it opens new theoretical possibilities that have a fair chance of being replicated in further studies. Such replication experiments are crucial for establishing whether a given effect is really there. The smaller the p-value is, and the greater the POWER of the experiment (i.e., the greater the number of subjects, items, repetitions, etc.), the more likely it is that replication studies will also bear witness to the effect. Nevertheless, replication studies remain essential even when p-values are very small. We also have to keep in mind that a small p-value does not imply that an observed effect is significant in the more general sense of being important or applicable. We will return to this issue below.

In practice, one's a priori assumptions about how difficult it is to find some hypothesized effect play a crucial role in thinking about what counts as statistically significant.

In physics, where it is often possible to bring a great many important factors under experimental control, p-values can be required to be very small. For an experiment to falsify an existing well-established theory, a p-value as small as 0.00001 may not be small enough.

In the social sciences, where it is often difficult if not outright impossible to obtain full experimental control of the very diverse factors that play a potential role in an experiment, a p-value of 0.05 can sensibly count as statistically significant.

One assumption that is brought explicitly into the evaluation of p-values is the expected direction of an effect. Consider, for instance, the effect of frequency of use. A long series of experiments has documented that higher-frequency words tend to be recognized faster than lower-frequency words. If we run yet another experiment in which frequency is a predictor, we expect to observe shorter latencies for higher frequencies (facilitation) and not longer latencies (inhibition). In other words, previous experience, irrespective of whether previous experience has been formalised in the form of a theory, may give rise to expectations about the direction of an effect: inhibition or facilitation. Suppose that we examine our directional expectation by means of a test statistic that follows the t-distribution. Facilitation then implies a negative t-value (the observed value is smaller than the value given by the null hypothesis), and inhibition a positive t-value (the observed value is greater). Given a t-value of −2 for 10 degrees of freedom, and given that we expect facilitation, we calculate the probability of observing a t-value of −2 or lower using the left tail of the t-distribution:

> pt(-2, 10)
[1] 0.03669402


Since this probability is fairly small, there is reason to be surprised: the observed t-value is unlikely to be this small by chance. This kind of directional test, for which you should have very good independent reasons, is known as a ONE-TAILED test.

Now suppose that nothing is known about the effect of frequency, and that it might equally well be facilitatory or inhibitory. If the only thing we want to test is that frequency might matter, one way or another, then the p-value is twice as large.

> 2 * pt(-2, 10)
[1] 0.07338803

In this case, we reason that the t-value could just as well have been positive instead of negative, so we sum the probabilities in both tails of the distribution. This is known as a two-tailed test. Since the density curve of the t-distribution is symmetrical, the probability of t being less than −2 is the same as the probability that it is greater than 2. We sum the probabilities in both tails, and therefore obtain a p-value that is twice as large. Evidently, the present example now gives us less reason for surprise. Next suppose that we observed a t-value of 2 instead of −2. Our p-value is now obtained with

> 2 * (1 - pt(2, 10))
[1] 0.07338803

Recall that pt(2, 10) is the probability that the t statistic assumes a value less than 2.

We need the complementary probability, so we subtract from 1 to obtain the probability that t has a value exceeding 2. Again, we multiply the result by 2 in order to evaluate the likelihood that our t-value is either in the left tail or in the right tail of the distribution.

We can merge the tests for negative and positive values into one generally applicable line of code by working with the absolute value of the t-value:

> 2 * (1 - pt(abs(-2), 10))
[1] 0.07338803

> 2 * (1 - pt(abs(2), 10))
[1] 0.07338803

Table 4.1 summarizes the different one-tailed and two-tailed tests that we will often use in the remainder of this book.

Any test that we run on a data set involves a statistical model, even the simplest of the standard tests described in this chapter. There are a number of basic properties of any statistical model that should be kept in mind at all times. As pointed out by Crawley [2002] (p. 17):

All models are wrong.

Some models are better than others.

The correct model can never be known with certainty.

The simpler the model, the better it is.


Table 4.1: One-tailed and two-tailed tests in R. df denotes the number of degrees of freedom, N the normal distribution.

N    one-tailed, left tail     pnorm(value, mean, sd)
     one-tailed, right tail    1 - pnorm(value, mean, sd)
     two-tailed, either tail   2 * (1 - pnorm(abs(value), mean, sd))
t    one-tailed, left tail     pt(value, df)
     one-tailed, right tail    1 - pt(value, df)
     two-tailed, either tail   2 * (1 - pt(abs(value), df))
F                              1 - pf(value, df1, df2)
χ2                             1 - pchisq(value, df)
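As a worked illustration of the two-tailed t entry in Table 4.1 (a sketch added here; the helper function name is invented for this example and is not part of R or languageR), the formula can be wrapped in a small convenience function:

> # two-tailed p-value for an observed t statistic and its degrees of freedom
> p.two.tailed = function(tval, df) 2 * (1 - pt(abs(tval), df))
> p.two.tailed(-2, 10)
[1] 0.07338803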

As a consequence, it is important to check whether the model fits the data. This part of statistical analysis is known as MODEL CRITICISM. A test may yield a very small p-value, but if the assumptions on which the test is based are violated, the p-value is quite useless.

In what follows, model criticism will therefore play an important role.

In what follows, we begin by discussing tests involving a single vector. We then proceed with tests addressing the broader range of questions that arise when you have two vectors of observations. Questions involving more than two vectors are briefly touched upon, but are discussed in detail in Chapters 5–7.

4.1 Tests for single vectors
