APPLICATION OF k-MEANS CLUSTERING USING SAS ENTERPRISE MINER

Next, we turn to the powerful SAS Enterprise Miner [3] software for an application of the k-means algorithm on the churn data set from Chapter 3 (available at the book series Web site; also available from http://www.sgi.com/tech/mlc/db/). Recall that the data set contains 20 variables’ worth of information about 3333 customers, along with an indication of whether or not each customer churned (left the company).

The following variables were passed to the Enterprise Miner clustering node:

• Flag (0/1) variables
  ◦ International Plan and VoiceMail Plan
• Numerical variables
  ◦ Account length, voice mail messages, day minutes, evening minutes, night minutes, international minutes, and customer service calls
  ◦ After applying min–max normalization to all numerical variables
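As a minimal sketch of the min–max normalization applied to the numerical variables, the following Python function rescales a list of values to the range 0 to 1 using (x - min) / (max - min). The function name and the sample values are illustrative assumptions and are not taken from the churn data set or from Enterprise Miner.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map every value to 0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical customer service call counts (illustrative only).
calls = [0, 1, 2, 4, 9]
normalized = min_max_normalize(calls)
print(normalized)
```

After this step every numerical variable lies on the same 0 to 1 scale as the flag variables, which is what allows them to be combined in a single distance calculation.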

The Enterprise Miner clustering node uses SAS’s FASTCLUS procedure, a version of the k-means algorithm. The number of clusters was set to k = 3. The three clusters uncovered by the algorithm varied greatly in size, with tiny cluster 1 containing 92 records, large cluster 2 containing 2411 records, and medium-sized cluster 3 containing 830 records.
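For readers without access to Enterprise Miner, the basic k-means idea can be sketched in a few lines of Python. This is a plain textbook-style implementation under stated assumptions, not the FASTCLUS procedure itself: the initial centers are supplied by hand, and the toy records (International Plan flag, VoiceMail Plan flag, one normalized numerical value) are hypothetical.

```python
import math

def assign(points, centers):
    """Return the index of the nearest center (Euclidean) for each point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Recompute each center as the mean of its members.
    A cluster with no members keeps its previous center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def k_means(points, centers, max_passes=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(float(x) for x in c) for c in centers]
    labels = assign(points, centers)
    for _ in range(max_passes):
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(points, centers)
    return labels, centers

# Hypothetical records: (intl plan flag, vmail plan flag, normalized day minutes).
points = [(1, 1, 0.55), (1, 1, 0.60),
          (0, 0, 0.50), (0, 0, 0.45), (0, 0, 0.52),
          (0, 1, 0.50), (0, 1, 0.58)]
initial_centers = [points[0], points[2], points[5]]  # k = 3, chosen by hand

labels, centers = k_means(points, initial_centers)
print(labels)
print(centers)
```

Even in this toy example, the records separate according to the two flag variables, because a flag difference of 1 is far larger than any difference in the normalized numerical value.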

Some basic cluster profiling will help us to learn about the types of records falling into each cluster. Figure 8.7 provides a look at the clustering results window of Enterprise Miner, containing a pie chart profile of the International Plan membership across the three clusters. All members of cluster 1, a fraction of the members of cluster 2, and no members of cluster 3 have adopted the International Plan. Note that the leftmost pie chart represents all records, and is similar to cluster 2.

Next, Figure 8.8 illustrates the proportion of VoiceMail Plan adopters in each cluster. (Note the confusing color reversal for yes/no responses.) Remarkably, clusters 1 and 3 contain only VoiceMail Plan adopters, while cluster 2 contains only nonadopters of the plan. In other words, this field was used by the k-means algorithm to create a “perfect” discrimination, dividing the data set perfectly among adopters and nonadopters of the VoiceMail Plan.

Figure 8.7 Enterprise Miner profile of International Plan adopters across clusters.

Figure 8.8 VoiceMail Plan adopters and nonadopters are mutually exclusive.

It is clear from these results that the algorithm is relying heavily on the categorical variables to form clusters. The comparison of the means of the numerical variables across the clusters in Table 8.5 shows relatively little variation, indicating that the clusters are similar across these dimensions. Figure 8.9, for example, illustrates that the distribution of customer service calls (normalized) is relatively similar in each cluster. If the analyst is not comfortable with this domination of the clustering by the categorical variables, he or she can choose to stretch or shrink the appropriate axes, as mentioned earlier, which will help to adjust the clustering algorithm to a more suitable solution.

TABLE 8.5 Comparison of Variable Means Across Clusters Shows Little Variation

Cluster  Freq.  AcctLength_m  VMailMessage  DayMins_mm
1        92     0.4340639598  0.5826939471  0.5360015616
2        2411   0.4131940041  0             0.5126334451
3        830    0.4120730857  0.5731159934  0.5093940185

Cluster  EveMins_mm    NightMins_mm  IntMins_mm    CustServCalls
1        0.5669029659  0.4764366069  0.5467934783  0.1630434783
2        0.5507417372  0.4773586813  0.5119784322  0.1752615328
3        0.5564095259  0.4795138596  0.5076626506  0.1701472557

Figure 8.9 Distribution of customer service calls is similar across clusters.
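One way to picture the stretching or shrinking of axes mentioned above is as a weighted Euclidean distance, in which each variable is multiplied by a weight before the distance is computed. The sketch below is only an illustration: the two records and the weight of 0.25 on the flag variables are hypothetical, and it does not describe any particular Enterprise Miner setting.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance after multiplying each axis by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical records: (International Plan flag, VoiceMail Plan flag,
# normalized day minutes, normalized customer service calls).
adopter = (0, 1, 0.50, 0.10)
non_adopter = (0, 0, 0.55, 0.20)

equal_weights = (1.0, 1.0, 1.0, 1.0)
shrunk_flags = (0.25, 0.25, 1.0, 1.0)  # illustrative weights, not recommended values

d_equal = weighted_distance(adopter, non_adopter, equal_weights)
d_shrunk = weighted_distance(adopter, non_adopter, shrunk_flags)
print(d_equal, d_shrunk)
```

With equal weights, the single flag difference accounts for almost all of the distance between the two records. Shrinking the flag axes reduces that contribution, so the numerical variables have more influence on which records end up in the same cluster.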

The clusters may therefore be summarized, using only the categorical variables, as follows:

• Cluster 1: Sophisticated Users. A small group of customers who have adopted both the International Plan and the VoiceMail Plan.

• Cluster 2: The Average Majority. The largest segment of the customer base, some of whom have adopted the International Plan but none of whom have adopted the VoiceMail Plan.

• Cluster 3: Voice Mail Users. A medium-sized group of customers who have all adopted the VoiceMail Plan but not the International Plan.
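A profile of this kind amounts to computing, for each cluster, the proportion of records that have each flag set. The following sketch shows one way to do that in Python. The field names and the seven records are hypothetical; they are constructed only to mirror the pattern described for Figures 8.7 and 8.8 and are not the actual churn data.

```python
from collections import defaultdict

def flag_proportions(records, flag):
    """Share of records with the given 0/1 flag set, per cluster."""
    totals = defaultdict(int)
    adopters = defaultdict(int)
    for rec in records:
        totals[rec["cluster"]] += 1
        adopters[rec["cluster"]] += rec[flag]
    return {c: adopters[c] / totals[c] for c in sorted(totals)}

# Hypothetical cluster assignments and plan flags (illustrative only).
records = [
    {"cluster": 1, "intl_plan": 1, "vmail_plan": 1},
    {"cluster": 2, "intl_plan": 0, "vmail_plan": 0},
    {"cluster": 2, "intl_plan": 1, "vmail_plan": 0},
    {"cluster": 2, "intl_plan": 0, "vmail_plan": 0},
    {"cluster": 2, "intl_plan": 0, "vmail_plan": 0},
    {"cluster": 3, "intl_plan": 0, "vmail_plan": 1},
    {"cluster": 3, "intl_plan": 0, "vmail_plan": 1},
]

intl_profile = flag_proportions(records, "intl_plan")
vmail_profile = flag_proportions(records, "vmail_plan")
print(intl_profile)
print(vmail_profile)
```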

Figure 8.10 Churn behavior across clusters for International Plan adopters and nonadopters.


Figure 8.11 Churn behavior across clusters for VoiceMail Plan adopters and nonadopters.

A more detailed clustering profile, including both categorical and numerical variables, is given in Chapter 9.

Using Cluster Membership to Predict Churn

Suppose, however, that we would like to apply these clusters to assist us in the churn classification task. We may compare the proportions of churners directly among the various clusters, using graphs such as Figure 8.10. Here we see that overall (the leftmost column of pie charts), the proportion of churners is much higher among those who have adopted the International Plan than among those who have not. This finding was uncovered in Chapter 3. Note that the churn proportion is higher in cluster 1, which contains only International Plan adopters, than in cluster 2, which contains a mixture of adopters and nonadopters, and higher still than in cluster 3, which contains no adopters of the International Plan. Clearly, the company should look at the plan to see why the customers who have it are leaving the company at a higher rate.

Now, since we know from Chapter 3 that the proportion of churners is lower among adopters of the VoiceMail Plan, we would expect that the churn rate for cluster 3 would be lower than for the other clusters. This expectation is confirmed in Figure 8.11.
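The comparison of churn proportions across clusters can also be made numerically rather than graphically. The sketch below computes the churn rate per cluster from two parallel lists. The function name and the ten sample records are hypothetical and chosen only to illustrate the calculation; they do not reproduce the proportions shown in Figures 8.10 and 8.11.

```python
def churn_rate_by_cluster(clusters, churned):
    """Proportion of churners in each cluster.
    clusters: cluster label per customer; churned: True/False per customer."""
    counts = {}
    for c, ch in zip(clusters, churned):
        n, k = counts.get(c, (0, 0))
        counts[c] = (n + 1, k + (1 if ch else 0))
    return {c: k / n for c, (n, k) in sorted(counts.items())}

# Hypothetical cluster labels and churn indicators (illustrative only).
clusters = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
churned = [True, False, True, False, False, False,
           False, False, False, False]

rates = churn_rate_by_cluster(clusters, churned)
print(rates)
```

If cluster membership carries information about churn, these rates will differ noticeably between clusters, which is the pattern the pie charts are used to reveal.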

In Chapter 9 we explore using cluster membership as input to downstream data mining models.

REFERENCES

1. J. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297, University of California Press, Berkeley, CA, 1967.

2. Andrew Moore, k-Means and Hierarchical Clustering, Course Notes, http://www-2.cs.cmu.edu/~awm/tutorials/, 2001.

3. The SAS Institute, Cary, NC, www.sas.com.

EXERCISES

1. To which cluster for the 90210 zip code would you prefer to belong?

2. Describe the goal of all clustering methods.

3. Suppose that we have the following data (one variable). Use single linkage to identify the clusters. Data: 0 0 1 3 3 6 7 9 10 10

4. Suppose that we have the following data (one variable). Use complete linkage to identify the clusters. Data: 0 0 1 3 3 6 7 9 10 10

5. What is an intuitive idea for the meaning of the centroid of a cluster?

6. Suppose that we have the following data:

a      b      c      d      e      f      g      h      i      j
(2,0)  (1,2)  (2,2)  (3,2)  (2,3)  (3,3)  (2,4)  (3,4)  (4,4)  (3,5)

Identify the clusters by applying the k-means algorithm, with k = 2. Try using initial cluster centers as far apart as possible.

7. Refer to Exercise 6. Show that the ratio of the between-cluster variation to the within-cluster variation increases with each pass of the algorithm.

8. Once again identify the clusters in the Exercise 6 data, this time by applying the k-means algorithm, with k = 3. Try using initial cluster centers as far apart as possible.

9. Refer to Exercise 8. Show that the ratio of the between-cluster variation to the within-cluster variation increases with each pass of the algorithm.

10. Which clustering solution do you think is preferable? Why?

Hands-on Analysis

Use the cereals data set, included at the book series Web site, for the following exercises. Make sure that the data are normalized.

11. Using all of the variables except name and rating, run the k-means algorithm with k = 5 to identify clusters within the data.

12. Develop clustering profiles that clearly describe the characteristics of the cereals within each cluster.

13. Rerun the k-means algorithm with k = 3.

14. Which clustering solution do you prefer, and why?

15. Develop clustering profiles that clearly describe the characteristics of the cereals within each cluster.

16. Use cluster membership to predict rating. One way to do this would be to construct a histogram of rating based on cluster membership alone. Describe how the relationship you uncovered makes sense, based on your earlier profiles.
