

5. BASIC STATISTICAL TOOLS FOR THE ANALYTICAL CHEMIST

5.4. Measures of the central tendency and the dispersion of the data

made from them is valid only if the data follow such a distribution. The exact shape of the normal distribution, graphically represented by the well-known "bell curve", is defined by a function that has only two parameters: the mean and the standard deviation.

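The arithmetic mean of a set of n measurements x1, x2, x3, ..., xn is equal to the sum of the measurements divided by n:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The standard deviation of a set of n measurements x1, x2, x3, ..., xn is equal to the positive square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1}}$$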
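As a minimal sketch, the two definitions can be written out directly in Python. The replicate values are hypothetical and serve only as an illustration; the function names are likewise illustrative, and the sample standard deviation is assumed to use the n - 1 divisor.

```python
import math

def arithmetic_mean(values):
    """Sum of the measurements divided by their number."""
    return sum(values) / len(values)

def standard_deviation(values):
    """Positive square root of the sample variance (n - 1 divisor assumed)."""
    n = len(values)
    mean = arithmetic_mean(values)
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

# Hypothetical replicate results (mg/kg), for illustration only.
replicates = [10.1, 9.8, 10.3, 10.0, 9.9]
mean_value = arithmetic_mean(replicates)
sd_value = standard_deviation(replicates)
print(f"mean = {mean_value:.2f} mg/kg, s = {sd_value:.2f} mg/kg")
```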

A characteristic property of the normal distribution is that about 68% of all of its observations fall within a range of ±1 standard deviation from the mean, and ±2 standard deviations include about 95% of the data.
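These percentages can be checked numerically from the cumulative distribution function of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1sd = fraction_within(1)  # roughly 0.68
within_2sd = fraction_within(2)  # roughly 0.95
print(f"within 1 sd: {within_1sd:.4f}, within 2 sd: {within_2sd:.4f}")
```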

Problems may occur, or wrong conclusions may be drawn, when a test based on the normal distribution is applied to a set of data that does not follow this type of distribution. In such situations there are two ways to solve the problem. First, we can use an alternative non-parametric test, the so-called "distribution-free test". However, such tests are less powerful, and the conclusions they provide may not be definitive. Alternatively, in many cases one can still use the normal distribution-based test if the sample is large enough. As the sample size increases, the shape of the sampling distribution of the mean approaches a normal shape, even if the distribution of the variable in question is not normal.
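The effect of sample size on the sampling distribution can be illustrated with a small simulation. The sketch below is an assumption-laden example, not part of the original method: it draws samples from an exponential population (strongly skewed), and the seed, number of replicates and sample sizes are arbitrary choices. The skewness of the sample means shrinks towards zero as the sample size grows.

```python
import random

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def skewness_of_sample_means(sample_size, replicates=2000, seed=1):
    """Skewness of the means of many samples drawn from an exponential population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(replicates):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return skewness(means)

skew_small = skewness_of_sample_means(sample_size=2)
skew_large = skewness_of_sample_means(sample_size=50)
print(f"skewness of means, n=2: {skew_small:.2f}; n=50: {skew_large:.2f}")
```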

Strictly speaking, therefore, the first step in a statistical analysis should be to examine whether the data to be analysed follow a normal distribution. Several statistical tests and parameters can be used for this purpose. One of these parameters is the kurtosis. The kurtosis coefficient indicates how flat or steep the distribution of the data is compared with a normal distribution. For a normal distribution, the kurtosis coefficient is zero. When the coefficient is less than zero, the "bell curve" is flat with short tails. When the coefficient is greater than zero, the curve either is very steep at the centre or has relatively long tails.

A second parameter is the skewness, which is used to measure the symmetry or shape of the data. A skewness of zero suggests that the data are symmetrically distributed. Positive values of skewness indicate that the upper tail of the "bell curve" is longer than the lower tail; negative values indicate that the lower tail is longer.

If both the kurtosis and the skewness have values between -2 and +2, the data can be considered to follow a normal distribution.

Another statistical parameter used quite extensively when reporting results from the analyses of a number of samples, is the confidence interval of the mean. A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. These intervals may be calculated by, for example, a producer who wishes to estimate his mean daily output; a medical researcher who wishes to estimate the mean response by patients to a new drug; etc. The width of the confidence interval gives us some idea about how uncertain we are about the unknown population parameter, in this case the mean. A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter.

We calculate these intervals for different confidence levels, depending on how confident we want to be. An interval calculated at the 95% level is interpreted as follows: we are 95% confident that the interval contains the true population mean. Equivalently, 95% of all confidence intervals formed in this manner (from different samples of the population) will include the true population mean.
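The repeated-sampling interpretation can be made concrete with a simulation. The sketch below is illustrative only: it assumes a normal population with a known standard deviation and uses the normal-based interval (sample mean ± z·σ/√n); the population parameters, sample size, number of trials and seed are arbitrary assumptions. The fraction of intervals that contain the true mean should be close to the chosen confidence level.

```python
import math
import random
from statistics import NormalDist

def coverage(true_mean=50.0, sigma=2.0, n=10, level=0.95, trials=2000, seed=7):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The interval sample_mean ± half_width contains true_mean
        # exactly when the two differ by no more than half_width.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / trials

observed = coverage()
print(f"observed coverage: {observed:.3f}")
```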

In general, the confidence interval for the mean can be calculated using:

Assuming that the distribution of the data is normal, we can define the confidence interval of the mean with a 95% confidence level, as:

$$\mu = \bar{x} \pm z\,\frac{\sigma}{\sqrt{n}} \qquad (12)$$

where z (the coefficient of area under the normal curve) takes different values according to the confidence level. Thus, for a 95% confidence level z is equal to 1.96, and for a 99.7% confidence level z takes the value 2.97. To simplify calculations, z is usually rounded to 2 for a 95% confidence level.
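The quoted z values can be recovered from the inverse cumulative distribution function of the standard normal distribution. This is a short sketch using `statistics.NormalDist.inv_cdf` from the Python standard library; the helper name `z_for_confidence` is illustrative.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical value: the central area under the standard normal curve equals `level`."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

z_95 = z_for_confidence(0.95)    # approximately 1.96
z_997 = z_for_confidence(0.997)  # approximately 2.97
print(f"z(95%) = {z_95:.2f}, z(99.7%) = {z_997:.2f}")
```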

As the sample size gets smaller, the uncertainty introduced by using s (the sample standard deviation) as an estimate of σ increases. To allow for this, the equation used to calculate the confidence interval is modified to:

$$\mu = \bar{x} \pm t\,\frac{s}{\sqrt{n}} \qquad (13)$$

where t is the critical value of Student's t distribution for n - 1 degrees of freedom at the chosen confidence level. This distribution is used for a small number of data following a normal distribution.
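A minimal sketch of the t-based interval follows. The Python standard library has no Student's t distribution, so the critical value is passed in by the caller; the value 2.776 used here is the tabulated two-sided 95% value for 4 degrees of freedom. The replicate results and the function name are hypothetical.

```python
import math
import statistics

def t_confidence_interval(values, t_critical):
    """Return (mean, half_width) for the interval mean ± t * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation, n - 1 divisor
    half_width = t_critical * s / math.sqrt(n)
    return mean, half_width

# Hypothetical replicate results (mg/kg), for illustration only.
replicates = [10.1, 9.8, 10.3, 10.0, 9.9]
# Tabulated t for 95% confidence and n - 1 = 4 degrees of freedom.
mean_value, half_width = t_confidence_interval(replicates, t_critical=2.776)
print(f"{mean_value:.2f} ± {half_width:.2f} mg/kg")
```

With the same data, using z = 1.96 in place of t would give a narrower interval, which understates the uncertainty for such a small sample.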

Table XVI includes data with results for the determination of zinc in a candidate reference material for chemical analysis; these data will be used to illustrate several applications of statistical tests to analytical results.

As an example, we will calculate the parameters explained so far using the data in Table XVI.

For the calculation of the confidence interval we will assume a 95% confidence level. Applying the equations shown above, we obtain the values listed in Table XVII.

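TABLE XVI. MASS FRACTION OF ZN IN A CANDIDATE REFERENCE MATERIAL AS DETERMINED BY SEVERAL ANALYTICAL TECHNIQUES

| Data number | Analytical technique | Number of measurements | Mass fraction of Zn (mg/kg) |
|---|---|---|---|
| 1 | A | 6 | 32.8 |
| 2 | B | 5 | 32.8 |
| 3 | A | 6 | 33.5 |
| 4 | B | 6 | 33.7 |
| 5 | C | 6 | 34.4 |
| 6 | C | 6 | 34.6 |
| 7 | D | 6 | 34.7 |
| 8 | C | 6 | 34.9 |
| 9 | C | 1 | 34.9 |
| 10 | E | 4 | 36.2 |
| 11 | F | 6 | 36.4 |
| 12 | C | 6 | 36.7 |
| 13 | B | 6 | 36.8 |
| 14 | A | 6 | 37.4 |
| 15 | G | 6 | 37.9 |
| 16 | B | 6 | 38.2 |
| 17 | C | 6 | 40.8 |
| 18 | B | 6 | 41.0 |
| 19 | C | 2 | 41.2 |
| 20 | D | 6 | 41.4 |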

TABLE XVII. STATISTICAL PARAMETERS DESCRIBING THE DATA SET PRESENTED IN TABLE XVI

| Statistical parameter | Value |
|---|---|
| Count | 20 |
| Average | 36.5 |
| Variance | 7.9 |
| Standard deviation | 2.8 |
| Range | 8.6 |
| Skewness | 0.554916 |
| Kurtosis | -0.817833 |
| Confidence interval for the mean | 36.5 ± 1.3 [35.2–37.8] |

Observe the values for the skewness and kurtosis, which can be used to determine whether the sample comes from a normal distribution. As mentioned, values of these statistics outside the range of -2 to +2 indicate significant departures from normality, which would tend to invalidate any statistical test regarding the standard deviation. In this case, both the skewness and the kurtosis have values within the range expected for data from a normal distribution.
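The text does not state which estimator produced the tabulated coefficients. The sketch below assumes the bias-corrected sample skewness and excess kurtosis that many statistical packages report; the function names are illustrative. Under that assumption, the Table XVI data give a skewness of about 0.555 and an excess kurtosis of about -0.818, which agree in magnitude with the values in Table XVII and both lie within the -2 to +2 range.

```python
import math

# Mass fraction of Zn (mg/kg), the 20 results listed in Table XVI.
zn = [32.8, 32.8, 33.5, 33.7, 34.4, 34.6, 34.7, 34.9, 34.9, 36.2,
      36.4, 36.7, 36.8, 37.4, 37.9, 38.2, 40.8, 41.0, 41.2, 41.4]

def central_moment(values, order):
    """Central moment of the given order, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def sample_skewness(values):
    """Bias-corrected sample skewness (assumed estimator)."""
    n = len(values)
    g1 = central_moment(values, 3) / central_moment(values, 2) ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (assumed estimator); zero for a normal distribution."""
    n = len(values)
    g2 = central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

skew = sample_skewness(zn)
kurt = sample_excess_kurtosis(zn)
within_range = -2.0 < skew < 2.0 and -2.0 < kurt < 2.0
print(f"skewness = {skew:.4f}, kurtosis = {kurt:.4f}, within ±2: {within_range}")
```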

The interpretation of the confidence interval is that, in repeated sampling, intervals calculated in this way will contain the true mean of the population from which the data come 95.0% of the time. In practical terms, we can state with 95.0% confidence that the true mean lies somewhere between 35.2 and 37.8 mg/kg. This assumes that the population from which the sample comes can be represented by a normal distribution.
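The descriptive parameters and the confidence interval can be recomputed from the Table XVI data as a check. The sketch below assumes the t-based interval of equation (13) with the tabulated two-sided 95% value t = 2.093 for 19 degrees of freedom; the variable names are illustrative. The mean, standard deviation, range and interval agree with Table XVII after rounding to one decimal place.

```python
import math
import statistics

# Mass fraction of Zn (mg/kg), the 20 results listed in Table XVI.
zn = [32.8, 32.8, 33.5, 33.7, 34.4, 34.6, 34.7, 34.9, 34.9, 36.2,
      36.4, 36.7, 36.8, 37.4, 37.9, 38.2, 40.8, 41.0, 41.2, 41.4]

count = len(zn)
average = statistics.mean(zn)
variance = statistics.variance(zn)  # sample variance, n - 1 divisor
std_dev = statistics.stdev(zn)      # square root of the sample variance
data_range = max(zn) - min(zn)

# Tabulated Student's t for 95% confidence and n - 1 = 19 degrees of freedom.
T_95_19 = 2.093
half_width = T_95_19 * std_dev / math.sqrt(count)
lower, upper = average - half_width, average + half_width

print(f"count = {count}, average = {average:.1f}, s = {std_dev:.1f}")
print(f"variance = {variance:.2f}, range = {data_range:.1f}")
print(f"95% CI: {average:.1f} ± {half_width:.1f} [{lower:.1f}-{upper:.1f}]")
```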