SELECTING INTERESTING SUBSETS OF THE DATA FOR FURTHER INVESTIGATION

We may use scatter plots (or histograms) to identify interesting subsets of the data, in order to study these subsets more closely. In Figure 3.25 we see that customers with highday minutesand highevening minutesare more likely to churn. But how can we quantify this? Clementine allows the user to click and drag a select box around data points of interest, and select them for further investigation. Here we selected the records within the rectangular box in the upper right. (A better method would be to allow the user to select polygons besides rectangles.)

Thechurndistribution for this subset of records is shown in Figure 3.26. It turns out that over 43% of the customers who have both highday minutesand highevening minutesare churners. This is approximately three times the churn rate of the overall customer base in the data set. Therefore, it is recommended that we consider how we can develop strategies for keeping our heavy-use customers happy so that they do not leave the company’s service, perhaps through discounting the higher levels of minutes used.

Figure 3.25 Selecting an interesting subset of records for further investigation.

BINNING

Binning (also calledbanding) refers to the categorization of numerical or categor-ical variables into a manageable set of classes which are convenient for analysis.

For example, the number of day minutescould be categorized (binned) into three classes:low,medium, andhigh. The categorical variablestatecould be binned into

Figure 3.26 Over 43% of customers with high day and evening minutes churn.

SUMMARY 63

Figure 3.27 Churn rate for customers with low (top) and high (bottom) customer service calls.

a new variable,region, where California, Oregon, Washington, Alaska, and Hawaii would be put in thePaciﬁccategory, and so on. Properly speaking, binning is a data preparation activity as well as an exploratory activity.

There are various strategies for binning numerical variables. One approach is to make the classes of equal width, analogous to equal-width histograms. Another approach is to try to equalize the number of records in each class. You may consider yet another approach, which attempts to partition the data set into identiﬁable groups of records, which, with respect to the target variable, have behavior similar to that for other records in the same class.

For example, recall Figure 3.18, where we saw that customers with fewer than four calls to customer service had a lower churn rate than that of customers who had four or more calls to customer service. We may therefore decide to bin thecustomer service callsvariable into two classes,lowandhigh. Figure 3.27 shows that the churn rate for customers with a low number of calls to customer service is 11.25%, whereas the churn rate for customers with a high number of calls to customer service is 51.69%, more than four times higher.

SUMMARY

Let us consider some of the insights we have gained into thechurndata set through the use of exploratory data analysis.

r The fourchargeﬁelds are linear functions of theminuteﬁelds, and should be omitted.

r Thearea codefield and/or thestatefield are anomalous, and should be omitted until further clarification is obtained.

rThe correlations among the remaining predictor variables are weak, allowing us to retain them all for any data mining model.

Insights with respect tochurn:

rCustomers with the International Plan tend to churn more frequently.

rCustomers with the VoiceMail Plan tend to churn less frequently.

rCustomers with four or morecustomer service callschurn more than four times as often as do the other customers.

rCustomers with highday minutesandevening minutestend to churn at a higher rate than do the other customers.

rCustomers with both highday minutesand highevening minuteschurn about three times more than do the other customers.

rCustomers with lowday minutesand high customer service callschurn at a higher rate than that of the other customers.

rThere is no obvious association ofchurnwith the variablesday calls,evening calls,night calls,international calls,night minutes,international minutes, ac-count length, orvoice mail messages.

Note that we have not applied any data mining algorithms yet on this data set, such as decision tree or neural network algorithms. Yet we have gained considerable insight into the attributes that are associated with customers leaving the company, simply by careful application of exploratory data analysis. These insights can easily be formulated into actionable recommendations, so that the company can take action to lower the churn rate among its customer base.

REFERENCES

1. C. L. Blake and C. J. Merz,Churn Data Set, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/∼mlearn/MLRepository.html.University of Cali-fornia, Department of Information and Computer Science, Irvine, CA, 1998.

2. Daniel Larose,Data Mining Methods and Models, Wiley-Interscience, Hoboken, NJ (to ap-pear 2005).

EXERCISES

1. Describe the possible consequences of allowing correlated variables to remain in the model.

a. How can we determine whether correlation exists among our variables?

b. What steps can we take to remedy the situation? Apart from the methods described in the text, think of some creative ways of dealing with correlated variables.

c. How might we investigate correlation among categorical variables?

2. For each of the following descriptive methods, state whether it may be applied to categorical data, continuous numerical data, or both.

a. Bar charts b. Histograms c. Summary statistics

EXERCISES 65 d. Cross-tabulations

e. Correlation analysis

f. Scatter plots (two- or three-dimensional) g. Web graphs

h. Binning

3. Why do we need to perform exploratory data analysis? Why shouldn’t we simply pro-ceed directly to the modeling phase and start applying our high-powered data mining software?

4. Make up a ﬁctional data set (attributes with no records is ﬁne) with a pair of anomalous attributes. Describe how EDA would help to uncover the anomaly.

5. Describe the beneﬁts and drawbacks of using normalized histograms. Should we ever use a normalized histogram without reporting it as such? Why not?

6. Describe how scatter plots can uncover patterns in two dimensions that would be invisible from one-dimensional EDA.

7. Describe the beneﬁts and drawbacks of the three methods of binning described in the text. Which methods require little human interaction? Which method does warrant human supervision? Which method might conceivably be used to mislead the public?

Hands-on Analysis

Use theadultdata set at the book series Web site for the following exercises. The target variable isincome, and the goal is to classify income based on the other variables.

8. Which variables are categorical and which are continuous?

9. Using software, construct a table of the ﬁrst 10 records of the data set, to get a feel for the data.

10. Investigate whether there are any correlated variables.

11. For each of the categorical variables, construct a bar chart of the variable, with an overlay of the target variable. Normalize if necessary.

a. Discuss the relationship, if any, each of these variables has with the target variables.

b. Which variables would you expect to make a signiﬁcant appearance in any data mining classiﬁcation model that we work with?

12. For each pair of categorical variables, construct a cross-tabulation. Discuss your salient results.

13. (If your software supports this.) Construct a web graph of the categorical variables. Fine tune the graph so that interesting results emerge. Discuss your ﬁndings.

14. Report on whether anomalous ﬁelds exist in this data set, based on your EDA, which ﬁelds these are, and what we should do about it.

15. Report the mean, median, minimum, maximum, and standard deviation for each of the numerical variables.

16. Construct a histogram of each numerical variable, with an overlay of the target variable income. Normalize if necessary.

a. Discuss the relationship, if any, each of these variables has with the target variables.

b. Which variables would you expect to make a signiﬁcant appearance in any data mining classiﬁcation model we work with?

17. For each pair of numerical variables, construct a scatter plot of the variables. Discuss your salient results.

18. Based on your EDA so far, identify interesting subgroups of records within the data set that would be worth further investigation.

19. Apply binning to one of the numerical variables. Do it such as to maximize the effect of the classes thus created (following the suggestions in the text). Now do it such as to minimize the effect of the classes, so that the difference between the classes is diminished.

Comment.

20. Refer to Exercise 19. Apply the other two binning methods (equal width and equal number of records) to the variable. Compare the results and discuss the differences. Which method do you prefer?

21. Summarize your salient EDA ﬁndings from Exercises 19 and 20 just as if you were writing a report.

C H A P T E R

4

STATISTICAL APPROACHES TO

Dans le document An Introduction to Data Mining (Page 80-86)