
In the document Data Mining Using (Pages 53-56)

Exploratory Data Analysis

3.2 Exploring Continuous Variables

3.2.1 Descriptive Statistics

Simple descriptive statistics of continuous variables are useful in summarizing central tendency, quantifying variability, detecting extreme outliers, and checking for distributional assumptions. The SAS procedures MEANS, SUMMARY, and UNIVARIATE provide a wide range of summary and exploratory statistics. For additional information on statistical theory, formulae, and computational details, readers should refer to Schlotzhauer and Littel3 and SAS Institute.4

3.2.1.1 Measures of Location or Central Tendency

Arithmetic mean. The most commonly used measure of central tendency, the mean is equal to the sum of the values divided by the number of observations; however, the mean can be heavily influenced by a few extreme values in the tails of a distribution.

Median. The median is the mid-value of a ranked continuous variable and the number that separates the bottom 50% of the data from the top 50%; thus, half of the values in a sample will have values that are equal to or larger than the median, and half will have values that are equal to or smaller than the median. The median is less sensitive to extreme outliers than the mean; therefore, it is a better measure than the mean for highly skewed distributions. For example, the median salary is usually more informative than the mean salary when summarizing average salary. The mean value is higher than the median in positively skewed distributions and lower than the median in negatively skewed distributions.

Mode. The most frequent observation in a distribution, the mode is the most commonly used measure of central tendency with nominal data.

Geometric mean. The geometric mean is an appropriate measure of central tendency when averages of rates or index numbers are required. It is the nth root of the product of n positive values. For example, to estimate the average rate of return of a 3-year investment that earns 10% the first year, 50% the second year, and 30% the third year, the geometric mean of these three rates should be used.

Harmonic mean. The harmonic mean is the reciprocal of the average of the reciprocals. The harmonic mean of n positive numbers (x1, x2, …, xn) is equal to n/(1/x1 + 1/x2 + … + 1/xn). The harmonic mean is used to estimate the mean of sample sizes and rates. For example, when averaging rates of speed measured in miles per hour, the harmonic mean, rather than the arithmetic mean, is the appropriate measure.
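The book's examples use SAS; as an illustrative sketch only, the four measures of location above can also be computed with Python's standard statistics module. All data below are hypothetical, and the 3-year-return example averages growth factors (1 + rate), the usual way the geometric mean is applied to rates:

```python
import statistics

# Hypothetical sample with one extreme value: the mean is pulled up
# by the outlier, while the median stays near the bulk of the data.
values = [1, 2, 2, 3, 100]
print(statistics.mean(values))       # 21.6
print(statistics.median(values))     # 2

# Geometric mean of the yearly growth factors (10%, 50%, 30% returns)
# gives the average annual rate of return of the 3-year investment.
returns = [0.10, 0.50, 0.30]
growth = statistics.geometric_mean([1 + r for r in returns])
avg_rate = growth - 1                # about 0.29, i.e. ~29% per year

# Harmonic mean for averaging speeds over equal distances:
# driving 40 mph one way and 60 mph back averages 48 mph, not 50.
speeds = [40, 60]
print(statistics.harmonic_mean(speeds))   # 48.0
```

Note how the arithmetic mean of the two speeds (50) would overstate the true average rate; the harmonic mean weights the slower leg correctly because more time is spent at the lower speed.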

3456_Book.book Page 42 Wednesday, November 20, 2002 11:34 AM

3.2.1.2 Robust Measures of Location

Winsorized mean. The Winsorized mean compensates for the presence of extreme values in the mean computation by setting the tail values equal to a certain percentile value. For example, when estimating a 95% Winsorized mean, the bottom 2.5% of the values are set equal to the value corresponding to the 2.5th percentile, while the upper 2.5% of the values are set equal to the value corresponding to the 97.5th percentile.

Trimmed mean. The trimmed mean is calculated by excluding a given percentage of the lowest and highest values and then computing the mean of the remaining values. For example, by excluding the lower and upper 2.5% of the scores and taking the mean of the remaining scores, a 5% trimmed mean is computed. The median is considered the mean trimmed 100% and the arithmetic mean is the mean trimmed 0%. A trimmed mean is not as affected by extreme outliers as an arithmetic mean. Trimmed means are commonly used in sports ratings to minimize the effects of extreme ratings possibly caused by biased judges.
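The two robust measures above can be sketched in a few lines of Python. This is an illustrative implementation, not the book's SAS code; it assumes the stated percentage p is the total fraction trimmed or Winsorized, split evenly between the two tails, and the judges'-scores data are hypothetical:

```python
def trimmed_mean(data, p):
    """Drop the lowest and highest p/2 fraction of values, then average the rest."""
    xs = sorted(data)
    k = int(len(xs) * p / 2)          # number of values cut from each tail
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

def winsorized_mean(data, p):
    """Replace each tail's p/2 fraction with the nearest retained value, then average."""
    xs = sorted(data)
    k = int(len(xs) * p / 2)
    if k:
        xs[:k] = [xs[k]] * k                          # floor the bottom tail
        xs[len(xs) - k:] = [xs[len(xs) - k - 1]] * k  # cap the top tail
    return sum(xs) / len(xs)

# Hypothetical judges' scores with two extreme ratings (1 and 50).
scores = [1, 7, 8, 8, 9, 9, 9, 10, 10, 50]
print(trimmed_mean(scores, 0.2))     # drops 1 and 50, averages the rest
print(winsorized_mean(scores, 0.2))  # replaces 1 with 7 and 50 with 10
```

Both measures land near the bulk of the scores, while the plain arithmetic mean of this sample (12.1) is dragged upward by the single score of 50.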

3.2.1.3 Five-Number Summary Statistics

The five-number summary of a continuous variable consists of the minimum value, the first quartile, the median, the third quartile, and the maximum value. The median, or second quartile, is the mid-value of the sorted data. The first quartile is the 25th percentile and the third quartile is the 75th percentile of the sorted data. The range between the first and third quartiles includes half of the data. The difference between the third quartile and the first quartile is called the inter-quartile range (IQR). Thus, these five numbers display the full range of variation (from minimum to maximum), the common range of variation (from first to third quartile), and a typical value (the median).
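As a sketch (the book itself computes these with SAS PROC UNIVARIATE), the five-number summary can be assembled from Python's standard library. Quartile conventions differ between packages, so the exact Q1/Q3 values here follow Python's "exclusive" interpolation method, and the data are hypothetical:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11, 12, 15, 20]   # hypothetical sample

# Quartiles split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method='exclusive')

# Minimum, Q1, median, Q3, maximum: the five-number summary.
summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1                                # inter-quartile range
print(summary)
print(iqr)
```

The middle three numbers describe where the central half of the data sits; the outer two show the full extent, so a single glance reveals both typical values and overall spread.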

3.2.1.4 Measures of Dispersion

Range. Range is the difference between the maximum and minimum values. It is easy to compute because only two values, the minimum and maximum, are used in the estimation; however, a great deal of information is ignored, and the range is greatly influenced by outliers.

Variance. Variance is the average measure of the variation. It is computed as the average of the squared deviations from the mean; however, because variance relies on the squared differences of a continuous variable from the mean, a single outlier has greater impact on the size of the variance than does a single value near the mean.

Standard deviation. Standard deviation is the square root of the variance. In a normal distribution, about 68% of the values fall within one standard deviation of the mean, and about 95% of the values fall within two standard deviations of the mean. Both variance and standard deviation measurements take into account the difference between each value and the mean. Consequently, these measures are based on a maximum amount of information.

Inter-quartile range. The IQR is a robust measure of dispersion. It is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR is hardly affected by extreme scores; therefore, it is a good measure of spread for skewed distributions. In normally distributed data, the IQR is approximately equal to 1.35 times the standard deviation.
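The dispersion measures above, including the stated IQR-to-standard-deviation relationship for normal data, can be checked with a short Python sketch on hypothetical and simulated data (the book's own computations use SAS):

```python
import random
import statistics

data = [4, 8, 6, 5, 3, 9, 7]                 # hypothetical sample

rng = max(data) - min(data)                  # range: uses only two values
var = statistics.variance(data)              # sample variance (n - 1 divisor)
sd = statistics.stdev(data)                  # standard deviation

# On (approximately) normal data, IQR / standard deviation is close to 1.35.
random.seed(0)
normal = [random.gauss(0, 1) for _ in range(100_000)]
q1, _, q3 = statistics.quantiles(normal, n=4)
print((q3 - q1) / statistics.stdev(normal))  # close to 1.35
```

Note that `statistics.variance` uses the n - 1 (sample) divisor; `statistics.pvariance` gives the population version with divisor n.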

3.2.1.5 Standard Errors and Confidence Interval Estimates

Standard error. Standard error is the standard deviation of the sampling distribution of a given statistic. Standard errors show the amount of sampling fluctuation that exists in the estimated statistics in repeated sampling. Confidence interval estimation and statistical significance testing are dependent on the magnitude of the standard errors. The standard error of a statistic depends on the sample size; in general, the larger the sample size, the smaller the standard error.

Confidence interval. The confidence interval is an interval estimate that quantifies the uncertainty caused by sampling error. It provides a range of values that is likely to include the unknown population parameter, with the range calculated from a given set of sample data. If independent samples are taken repeatedly from the same population and a confidence interval is calculated for each sample, then a certain percentage of the intervals will include the unknown population parameter. The width of the confidence interval provides some idea about the uncertainty of the unknown parameter estimates. A very wide interval may indicate that more data must be collected before making inferences about the parameter.
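A minimal sketch of the standard error of the mean and a large-sample 95% confidence interval, using the normal-approximation multiplier 1.96 rather than a t critical value (the book's SAS procedures report these directly; the sample data here are hypothetical):

```python
import statistics

sample = [23, 25, 21, 30, 28, 24, 26, 27, 22, 29]   # hypothetical data

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / n ** 0.5            # s / sqrt(n)

# Approximate 95% confidence interval for the population mean.
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(mean, se, ci)
```

Quadrupling the sample size halves the standard error (and so the interval width), since se shrinks with the square root of n. For small samples like this one, a t-based multiplier (about 2.26 for 9 degrees of freedom) would give a somewhat wider, more accurate interval.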

3.2.1.6 Detecting Deviation from Normally Distributed Data

Skewness. Skewness is a measure that quantifies the degree of asymmetry of a distribution. A distribution of a continuous variable is symmetric if it looks the same to the left and right of the center point. Data from positively skewed (skewed to the right) distributions have values that are clustered together below the mean but have a long tail above the mean. Data from negatively skewed (skewed to the left) distributions have values that are clustered together above the mean but have a long tail below the mean. The skewness estimate for a normal distribution equals zero. A negative skewness estimate indicates that the data are skewed left (the left tail is heavier than the right tail), and a positive skewness estimate indicates that the data are skewed right (the right tail is heavier than the left tail).

Kurtosis. Kurtosis is a measure that quantifies whether the data are peaked or flat relative to a normal distribution. Datasets with large kurtosis have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Datasets with low kurtosis have a flat top near the mean rather than a sharp peak. Kurtosis can be both positive and negative. Distributions with positive kurtosis typically have heavy tails. Kurtosis and skewness estimates are very sensitive to the presence of outliers. These estimates may be influenced by a few extreme observations in the tails of the distribution; therefore, these statistics are not a robust measure of non-normality. The Shapiro–Wilks test5 and the d'Agostino–Pearson omnibus test6 are commonly used for detecting non-normal distributions.
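As an illustrative sketch (SAS PROC UNIVARIATE reports bias-corrected versions of these statistics; the simple population-style moment estimators and the data below are assumptions for demonstration), skewness and excess kurtosis can be computed as averaged powers of z-scores:

```python
import statistics

def skewness(data):
    """Population-style sample skewness: the average cubed z-score."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum(((x - m) / s) ** 3 for x in data) / n

def excess_kurtosis(data):
    """Average fourth-power z-score minus 3 (zero for a normal distribution)."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum(((x - m) / s) ** 4 for x in data) / n - 3

symmetric = [1, 2, 3, 4, 5]                # mirror-image data
right_skewed = [1, 1, 2, 2, 3, 3, 4, 10]   # long right tail
print(skewness(symmetric))       # 0.0
print(skewness(right_skewed))    # positive: skewed right
```

The flat, symmetric sample also has negative excess kurtosis, matching the description above of low-kurtosis data with a flat top and light tails.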
