• Aucun résultat trouvé

The normal distribution

Dans le document A practical introduction to statistics (Page 36-39)

3.3 Continuous distributions

3.3.1 The normal distribution

The upper left panel of Figure 3.6 shows theNORMAL DISTRIBUTIONin its most simple form, the case in which its two parameters, theMEANµand theSTANDARD DEVIATION σ, are0and1respectively. This specific form of the normal distribution is known as the STANDARD NORMAL DISTRIBUTION. The mean is represented by a vertical dashed line, and intersects the curve of the probability density function where it reaches its maximum.

The dotted horizontal line segment represents the standard deviation, the parameter that controls the width of the curve. We can shift the curve to the left or right by changing the mean, as shown in the right panel, in which the mean is increased from zero to four. We can make the curve narrower or broader by changing the standard deviation, as shown in the bottom panels, where the standard deviation is0.5instead of1.0. For all four panels, the area enclosed by the horizontal axis and the density curve is equal to1. It represents the probability of observing any value. The density curves are symmetrical around the mean. Thus, the area to the left (or right) of the vertical dashed line that is enclosed by the curve the horizontal axis represents a probability of0.5. In other words, the probability that a random variable assumes a value less than the mean is0.5. Similarly, the probability that its value will be greater than the mean is0.5.

62

DRAFT

−4 −2 0 2 4

0.00.20.40.60.8

x

density

normal(0,1)

0 2 4 6 8

0.00.20.40.60.8

x

density

normal(4,1)

−4 −2 0 2 4

0.00.20.40.60.8

x

density

normal(0,0.5)

0 2 4 6 8

0.00.20.40.60.8

x

density

normal(4,0.5)

Figure 3.6: Probability density functions for four normally distributed random variables.

Plotting the density shown in the upper left panel of Figure 3.6 requires that we select a range ofx-values to plot the density for. We select

> x = seq(-4, 4, 0.1)

as values outside the interval(−4,4)have such an extremely low probability that we can ignore them for our plot. They-values are obtained with the density function for the normal distribution,dnorm().

> y = dnorm(x)

We calleddnorm()without further arguments. If you do not specify mean and standard deviation explicitly,dnorm(), and also (pnorm(),qnorm(), andrnorm()) assume that the mean is zero and the standard deviation is1. Plotting the density is now straightfor-ward.

63

DRAFT

> plot(x, y, xlab = "x", ylab = "density", ylim = c(0, 0.8), + type = "l")) # line type: the quoted character is lower case L

> mtext("normal(0, 1)", 3, 1)

We add two lines to the plot, a vertical line across all values represented on the vertical axis, and a horizontal line segment. The vertical line is easiest to produce withabline(), a function that takes an intercept as first argument and a slope as second argument, and adds the requested line to the plot. For horizontal or vertical lines, the argumentvis set to specify where a vertical line intersects with the horizontal axis. Alternatively, the argumenthis set to the point where a horizontal line is to intersect the vertical axis. Here, we set our vertical line to intersect atX= 0. We also request a dashed line withlty(line type).

> abline(v = 0, lty = 2) # the vertical dashed line

For line segments, we uselines(). This function connects the points specified by the vector ofx-coordinates (its first argument) and the vector ofy-coordinates (its second argument). AsX-coordinates, we have−1and0, asY-coordinates, we have the density forX=−1for bothX-coordinates.

> lines(c(-1, 0), rep(dnorm(-1), 2), lty = 2)

For the remaining panels of Figure 3.6, the range ofX-values and the parameters of dnorm()have to be adjusted. For instance, for the lower left panel, the density curve is obtained with

> x = seq(0, 8, 0.1)

> y = dnorm(x, mean = 4, sd = 0.5)

Figure 3.7 shows the cumulative distribution function (upper left) and the quantile function (upper right) for a standard normal random variable. As for discrete random variables, these functions are each other’s inverse:

> pnorm(-1.96) [1] 0.02499790

> qnorm(0.02499790) [1] -1.96

The lower left panel of Figure 3.7 illustrates how we calculate the probability that a standard normal random variable has a value between−1and0, usingpnorm(). Since pnorm()plots the cumulative probability, the shaded area to the left of the dashed ver-tical line represents the probability of a value in the interval from minus infinity to zero.

This area is too large, however. The appropriate area is highlighted with dark grey. The desired probability is obtained by subtracting the lightgrey area from the shaded area.

> pnorm(0) - pnorm(-1) [1] 0.3413447

64

DRAFT

−3 −2 −1 0 1 2 3

0.00.40.8

x

pnorm(x)

0.0 0.2 0.4 0.6 0.8 1.0

−3−1123

p

qnorm(p)

−3 −2 −1 0 1 2 3

0.00.40.8

x

pnorm(x)

−3 −2 −1 0 1 2 3

0.00.10.20.30.4

x

dnorm(x)

Figure 3.7: Cumulative distribution function (left panels), quantile function (upper right panel), and probability density function (lower right panel) for the standard normal dis-tribution.

The final panel of Figure 3.7 (have a look atshadenormal.fnc()and its documen-tation for how this panel was produced) returns to the probability density function. The shaded areas in the tails of the distribution each represent a probability of0.025. In other words, the shaded areas together highlight the 5% most extreme values in the distribu-tion. The remaining area under the curve that is not shaded represents the95% of values that are not extreme, given the rather arbitrary cutoff point of5% for being extreme.

A fundamental property of the normal distribution is that it is possible to transform a normal random variable with meanµ6= 0andσ6= 1into a standard normal random vari-able with meanµ= 0andσ= 1. This transformation is calledSTANDARDIZATION. Given a vectorx, standardization amounts to subtracting the mean from each of its elements, followed by division by the standard deviation.

> x = rnorm(10, 3, 0.1)

65

DRAFT

> x

[1] 2.985037 3.079029 2.895863 2.929407 2.841630 2.996799 [7] 2.934391 3.125997 3.015932 3.072539

> x - mean(x)

[1] -0.002625041 0.091366366 -0.091799655 -0.058255139 [5] -0.146032681 0.009136216 -0.053271546 0.138334988 [9] 0.028269929 0.084876563

> (x - mean(x)) / sd(x)

[1] -0.02943848 1.02462691 -1.02948603 -0.65330150 -1.63768158 [6] 0.10245798 -0.59741306 1.55135590 0.31703274 0.95184711

The functionsd()provides one’s best guess of the standard deviationσfor the vector of sampled observations. By subtracting the mean, we move the density curve along the horizontal axis so that it is centered around zero. By subsequently dividing by the stan-dard deviation, we reshape the curve to fit the curve of the stanstan-dard normal. For example, a normal random random variable with mean3and a small standard deviation of0.1is unlikely to have values below zero — in fact, it is highly unlikely to have values more than3standard deviations(0.3)away from the mean (3). After standardization, however, the new random numbers are nicely centered around the zero. The function inRfor stan-dardization isscale(). When its output is printed in the console, it also lists the mean and standard deviation as the object’s attributesscaled:centerandscaled:scale.

> scale(x) [,1]

[1,] -0.02943848 [2,] 1.02462691 [3,] -1.02948603 [4,] -0.65330150 ...

[10,] 0.95184711 attr(,"scaled:center") [1] 2.987662

attr(,"scaled:scale") [1] 0.08917038

> mean(x) == attr(x, "scaled:center") [1] TRUE

> sd(x) == attr(x1, "scaled:scale") [1] TRUE

In the past, the standard normal distribution was especially important as it was only for the standard normal distribution that tables withp-values for the cumulative distribu-tion funcdistribu-tion were available. In order to use these tables, one had to standardize first. In R, this is no longer necessary. We can usepnorm()with the mean and standard deviation of our choice,

> pnorm(0, 1, 3) - pnorm(-1, 1, 3) [1] 0.1169488

66

DRAFT

or we can standardize first, and then drop mean and standard deviation frompnorm().

> pnorm(-1/3) - pnorm(-2/3) [1] 0.1169488

In both cases, the outcome is exactly the same.

The square of the standard deviation is known as theVARIANCE. The variance is cal-culated with the functionvar().

> v = rnorm(20, 4, 2) # repeating this command

# will result in a different vector

# of random numbers

> sd(v)

[1] 2.113831 # sd of sample

> sqrt(var(v)) # square root of variance [1] 2.113831

Like the standard deviation, the variance is a measure for how much the observations vary around the mean. At first glance, one might think a measure averaging divergences from the mean would do a sensible job, but this average is zero:1

> mean(v - mean(v))

[1] -5.32907e-16 # zero

This problem is avoided by the definition of the variance as a kind of average of the squared divergences from the mean,

> var(v) [1] 4.46828

> sum( (v - mean(v))ˆ2)/(length(v) - 1) [1] 4.46828

where we divide, for technical reasons, not by the number of elements in the vector (re-turned bylength()) but by that number minus one.

Dans le document A practical introduction to statistics (Page 36-39)