NUMERICAL METHODS FOR IDENTIFYING OUTLIERS

One method of using statistics to identify outliers is to use Z-score standardization.

Often, an outlier can be identified because it lies more than 3 standard deviations from the mean and therefore has a Z-score that is either less than −3 or greater than 3. Field values with Z-scores much beyond this range probably bear further investigation to verify that they do not represent data entry errors or other issues. For example, the vehicle that takes its time (25 seconds) getting to 60 mph had a Z-score of 3.247. This value is greater than 3 (although not by much), and therefore this vehicle is identified by this method as an outlier. The data analyst may wish to investigate the validity of this data value or at least suggest that the vehicle get a tune-up!
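The Z-score rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the acceleration times are hypothetical (they are not the vehicle data set discussed in the text), and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return the Z-score of each value: (x - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (an assumption); must be nonzero
    return [(x - m) / s for x in values]

def z_score_outliers(values, threshold=3.0):
    """Return the values whose Z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical 0-to-60 mph times in seconds, with one slow vehicle (25 seconds).
times = [14, 15, 16, 15, 14, 16, 15, 15, 14, 16,
         15, 14, 16, 15, 15, 14, 16, 15, 14, 16, 25]

print(z_score_outliers(times))
```

With these made-up values, only the 25-second vehicle has a Z-score beyond 3, so it is the only value flagged.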

Unfortunately, the mean and standard deviation, both part of the formula for the Z-score standardization, are rather sensitive to the presence of outliers. That is, if an outlier is added to a data set, the values of mean and standard deviation will both be unduly affected by this new data value. Therefore, when choosing a method for evaluating outliers, it may not seem appropriate to use measures which are themselves sensitive to their presence.
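A small numerical sketch may make this sensitivity concrete. The scores below are hypothetical, and the sample standard deviation is assumed. Adding a single extreme value pulls the mean upward and inflates the standard deviation so much that the extreme value's own Z-score stays below 3.

```python
from statistics import mean, stdev

# Hypothetical exam scores, then the same scores with one extreme value added.
scores = [70, 72, 75, 78, 80]
with_outlier = scores + [200]

mean_before, sd_before = mean(scores), stdev(scores)
mean_after, sd_after = mean(with_outlier), stdev(with_outlier)

# Z-score of the extreme value, computed from the contaminated mean and SD.
z_of_outlier = (200 - mean_after) / sd_after

print(f"before: mean={mean_before:.2f}, sd={sd_before:.2f}")
print(f"after:  mean={mean_after:.2f}, sd={sd_after:.2f}")
print(f"Z-score of 200: {z_of_outlier:.2f}")
```

In this example the value 200 is clearly unusual, yet its Z-score comes out at roughly 2, so the Z-score rule would not flag it.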

Therefore, data analysts have developed more robust statistical methods for outlier detection, which are less sensitive to the presence of the outliers themselves.

One elementary robust method is to use the interquartile range. The quartiles of a data set divide the data set into four parts, each containing 25% of the data.

• The first quartile (Q1) is the 25th percentile.

• The second quartile (Q2) is the 50th percentile, that is, the median.

• The third quartile (Q3) is the 75th percentile.

The interquartile range (IQR) is a measure of variability that is much more robust than the standard deviation. The IQR is calculated as IQR = Q3 − Q1 and may be interpreted to represent the spread of the middle 50% of the data.
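As a rough illustration, the quartiles and the IQR can be computed with Python's standard library. The helper name and the data are hypothetical. Note that several conventions exist for interpolating percentiles, so the quartiles returned by `statistics.quantiles` (default "exclusive" method) may differ slightly from those produced by other software or by hand calculation.

```python
from statistics import quantiles

def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points that split the
    # data into four parts: Q1, Q2 (the median), and Q3.
    q1, q2, q3 = quantiles(values, n=4)
    return q1, q2, q3, q3 - q1

# Hypothetical data set of eight values.
data = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3, iqr = quartiles_and_iqr(data)
print(q1, q2, q3, iqr)
```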

A robust measure of outlier detection is therefore defined as follows. A data value is an outlier if:

a. It is located 1.5(IQR) or more below Q1, or

b. It is located 1.5(IQR) or more above Q3.

For example, suppose that for a set of test scores, the 25th percentile was Q1 = 70 and the 75th percentile was Q3 = 80, so that half of all the test scores fell between 70 and 80. Then the interquartile range, the difference between these quartiles, was IQR = 80 − 70 = 10.

A test score would be robustly identified as an outlier if:

a. It is lower than Q1 − 1.5(IQR) = 70 − 1.5(10) = 55, or

b. It is higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95.
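The test-score example can be checked with a short sketch. The function names are hypothetical. The comparison treats a value exactly on a boundary as an outlier, following the "1.5(IQR) or more" wording of the definition; this is an interpretive choice.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_iqr_outlier(x, q1, q3, k=1.5):
    """True if x lies k*IQR or more below Q1, or k*IQR or more above Q3."""
    lower, upper = iqr_fences(q1, q3, k)
    return x <= lower or x >= upper

# Quartiles from the test-score example: Q1 = 70, Q3 = 80.
lower, upper = iqr_fences(70, 80)
print(lower, upper)  # 55.0 95.0

for score in (50, 75, 99):
    print(score, is_iqr_outlier(score, 70, 80))
```

With these quartiles, scores of 50 and 99 are flagged, while 75 is not.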

In Chapter 3 we apply some basic graphical and statistical tools to help us begin to uncover simple patterns and trends in the data structure.

EXERCISES

1. Describe the possible negative effects of proceeding directly to mine data that has not been preprocessed.

2. Find the mean value for the income attribute of the five customers in Table 2.1 before preprocessing. What does this number actually mean? Calculate the mean income for the three values left after preprocessing. Does this value have a meaning?

3. Which of the three methods from Figures 2.2 to 2.4 do you prefer for handling missing values?

a. Which method is the most conservative and probably the safest, meaning that it fabricates the least amount of data? What are some drawbacks to this method?

b. Which method would tend to lead to an underestimate of the spread (e.g., standard deviation) of the variable? What are some benefits to this method?

c. What are some benefits and drawbacks of the method that chooses values at random from the variable distribution?

4. Make up a classification scheme that is inherently flawed and would lead to misclassification, as we find in Table 2.2: for example, classes of items bought in a grocery store.

5. Make up a data set consisting of eight scores on an exam in which one of the scores is an outlier.

a. Find the mean score and the median score, with and without the outlier.

b. State which measure, the mean or the median, the presence of the outlier affects more, and why. (Mean, median, and other statistics are explained in Chapter 4.)

c. Verify that the outlier is indeed an outlier, using the IQR method.

6. Make up a data set, consisting of the heights and weights of six children, in which one of the children is an outlier with respect to one of the variables but not the other. Then alter this data set so that the child is an outlier with respect to both variables.

7. Using your data set from Exercise 5, find the min–max normalization of the scores. Verify that each value lies between zero and 1.

Hands-on Analysis

Use the churn data set at the book series Web site for the following exercises.

8. Explore whether there are missing values for any of the variables.

9. Compare the area code and state fields. Discuss any apparent abnormalities.

10. Use a graph to determine visually whether there are any outliers among the number of calls to customer service.

11. Transform the day minutes attribute using min–max normalization. Verify using a graph that all values lie between zero and 1.

12. Transform the night minutes attribute using Z-score standardization. Using a graph, describe the range of the standardized values.
