
COMPARISON OF THE C5.0 AND CART ALGORITHMS APPLIED TO REAL DATA


Next, we apply decision tree analysis using Clementine on a real-world data set.

The data set adult was abstracted from U.S. census data by Kohavi [5] and is available online from the University of California at Irvine Machine Learning Repository [6].

Here we are interested in classifying whether or not a person's income is less than $50,000, based on the following set of predictor fields:

Numerical variables: age, years of education, capital gains, capital losses, hours worked per week

Categorical variables: race, gender, work class, marital status

The numerical variables were normalized so that all values ranged between zero and 1. Some collapsing of low-frequency classes was carried out on the work class and marital status categories. Clementine was used to compare the C5.0 algorithm (an update of the C4.5 algorithm) with CART, examining a training set of 24,986 records. The decision tree produced by the CART algorithm is shown in Figure 6.8.
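A roughly comparable workflow can be sketched in Python with scikit-learn, whose DecisionTreeClassifier grows CART-style binary trees. This is only an illustrative sketch, not the Clementine procedure: the data frame, column names, and preprocessing choices below are assumptions, and scikit-learn uses the Gini index rather than the criterion of equation (6.1), so the resulting splits need not match Figure 6.8.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Assume df is a pandas DataFrame holding the adult records, with column
# names matching the predictor fields listed above (names are assumptions).
numeric = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical = ["race", "sex", "workclass", "marital-status"]

# Min-max normalize the numeric predictors so all values lie in [0, 1].
X_num = (df[numeric] - df[numeric].min()) / (df[numeric].max() - df[numeric].min())

# One-hot encode the categorical predictors, since scikit-learn trees
# cannot split directly on string-valued categories.
X = pd.concat([X_num, pd.get_dummies(df[categorical])], axis=1)
y = (df["income"] == ">50K")        # target: income above $50,000

# Grow a CART-style tree (binary splits throughout).
cart = DecisionTreeClassifier(max_depth=4, random_state=0)
cart.fit(X, y)
```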


Figure 6.8 CART decision tree for the adult data set.

Here, the tree structure is displayed horizontally, with the root node at the left and the leaf nodes on the right. For the CART algorithm, the root node split is on marital status, with a binary split separating married persons from all others (Marital Status in [“Divorced”, “Never married”, “Separated”, “Widowed”]). That is, this particular split on marital status maximized the CART split selection criterion [equation (6.1)]:

\Phi(s \mid t) = 2 P_L P_R \sum_{j=1}^{\#\text{classes}} \bigl| P(j \mid t_L) - P(j \mid t_R) \bigr|
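A minimal sketch of how this criterion can be evaluated for a candidate split, assuming the class labels of the records sent to the left and right branches are available (the record counts below are made up for illustration):

```python
import numpy as np

def cart_split_criterion(y_left, y_right):
    """Equation (6.1): Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|."""
    n_left, n_right = len(y_left), len(y_right)
    p_l = n_left / (n_left + n_right)      # proportion of records sent left
    p_r = n_right / (n_left + n_right)     # proportion of records sent right
    classes = np.union1d(np.unique(y_left), np.unique(y_right))
    diff = sum(abs(np.mean(y_left == j) - np.mean(y_right == j)) for j in classes)
    return 2.0 * p_l * p_r * diff

# Hypothetical root-node split: married records left, all others right.
y_married    = np.array([">50K"] * 400 + ["<=50K"] * 600)
y_nonmarried = np.array([">50K"] * 64  + ["<=50K"] * 936)
print(cart_split_criterion(y_married, y_nonmarried))
```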

Note that the mode classification for each branch is ≤$50,000. The married branch leads to a decision node, with several further splits downstream. However, the nonmarried branch is a leaf node, with a classification of ≤$50,000 for the 13,205 such records, with 93.6% confidence. In other words, of the 13,205 persons in the data set who are not presently married, 93.6% of them have incomes below $50,000.

The root node split is considered to indicate the most important single variable for classifying income. Note that the split on the Marital Status attribute is binary, as are all CART splits on categorical variables. All the other splits in the full CART decision tree shown in Figure 6.8 are on numerical variables. The next decision node is education-num, representing the normalized number of years of education. The split occurs at education-num < 0.8333 (mode ≤$50,000) versus education-num > 0.8333 (mode >$50,000). However, what is the actual number of years of education that the normalized value of 0.8333 represents? The normalization, carried out automatically using Insightful Miner, was of the form

X^* = \frac{X}{\mathrm{range}(X)} = \frac{X}{\max(X) - \min(X)}

a variant of min–max normalization. Therefore, denormalization is required to identify the original field values. Years of education ranged from 16 (maximum) to 1 (minimum), for a range of 15. Denormalizing, we have X = range(X) · X* = 15(0.8333) = 12.5. Thus, the split occurs right at 12.5 years of education. Those with at least some college education tend to have higher incomes than those who do not.
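A small sketch of this normalization variant and the corresponding denormalization, using the field ranges quoted in the text:

```python
def normalize(x, x_min, x_max):
    """Variant of min-max normalization used in the text: x* = x / range(x)."""
    return x / (x_max - x_min)

def denormalize(x_star, x_min, x_max):
    """Recover the original value: x = range(x) * x*."""
    return (x_max - x_min) * x_star

# education-num ranges from 1 to 16, so range(X) = 15
print(denormalize(0.8333, 1, 16))       # ~12.5 years of education
# capital gains range from 0 to 99,999
print(denormalize(0.0685, 0, 99_999))   # ~$6,850
```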

Interestingly, for both education groups, capital gains and capital loss represent the next two most important decision nodes. Finally, for the lower-education group, the last split is again on education-num, whereas for the higher-education group, the last split is on hours-per-week.

Now, will the information-gain splitting criterion and the other characteristics of the C5.0 algorithm lead to a decision tree that is substantially different from or largely similar to the tree derived using CART’s splitting criteria? Compare the CART decision tree above with Clementine’s C5.0 decision tree of the same data displayed in Figure 6.9.

Differences emerge immediately at the root node. Here, the root split is on the capital-gain attribute, with the split occurring at the relatively low normalized level of 0.0685. Since the range of capital gains in this data set is $99,999 (maximum = 99,999, minimum = 0), this is denormalized as X = range(X) · X* = 99,999(0.0685) = $6850. More than half of those with capital gains greater than $6850 have incomes above $50,000, whereas more than half of those with capital gains of less than $6850 have incomes below $50,000.

Figure 6.9 C5.0 decision tree for the adult data set.

This is the split that was chosen by the information-gain criterion as the optimal split among all possible splits over all fields. Note, however, that there are 23 times more records in the low-capital-gains category than in the high-capital-gains category (23,921 versus 1065 records).

For records with lower capital gains, the second split occurs on capital loss, with a pattern similar to the earlier split on capital gains. Most people (23,165 records) had low capital loss, and most of these have incomes below $50,000. Most of the few (756 records) who had higher capital loss had incomes above $50,000.

For records with low capital gains and low capital loss, consider the next split, which is made on marital status. Note that C5.0 provides a separate branch for each field value, whereas CART was restricted to binary splits. A possible drawback of C5.0's strategy for splitting categorical variables is that it may lead to an overly bushy tree, with many leaf nodes containing few records. In fact, the decision tree displayed in Figure 6.9 is only an excerpt from the much larger tree provided by the software.

To avoid this problem, analysts may alter the algorithm’s settings to require a certain minimum number of records to be passed to child decision nodes. Figure 6.10 shows a C5.0 decision tree from Clementine on the same data, this time requiring each decision node to have at least 300 records. In general, a business or research decision may be rendered regarding the minimum number of records considered to be actionable. Figure 6.10 represents the entire tree.
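Clementine exposes this as a setting on the tree-building node. As a loose analogue only (not the Clementine interface), a scikit-learn CART-style tree can be constrained in a similar spirit with min_samples_leaf:

```python
from sklearn.tree import DecisionTreeClassifier

# Require every leaf to hold at least 300 records, roughly analogous to the
# minimum-records setting described above; X and y are the predictors and
# target prepared earlier.
tree = DecisionTreeClassifier(min_samples_leaf=300, random_state=0)
# tree.fit(X, y)
```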

Again, capital gains represents the root node split, with the split occurring at the same value. This time, however, the high-capital-gains branch leads directly to a leaf node, containing 1065 records, and predicting with 98.3% confidence that the proper classification for these persons is income greater than $50,000. For the other records,

Figure 6.10 C5.0 decision tree with a required minimum number of records at each decision node.

the second split is again on the same value of the same attribute as earlier, capital loss. For high capital loss, this leads directly to a leaf node containing 756 records predicting high income with only 70.8% confidence.

For those with low capital gains and low capital loss, the third split is again marital status, with a separate branch for each field value. Note that for all values of the marital status field except “married,” these branches lead directly to child nodes predicting income of at most $50,000 with various high values of confidence. For married persons, further splits are considered.

Although the CART and C5.0 decision trees do not agree in the details, we may nevertheless glean useful information from the broad areas of agreement between them. For example, the most important variables are clearly marital status, education, capital gains, capital loss, and perhaps hours per week. Both models agree that these fields are important, but disagree as to the ordering of their importance. Much more modeling analysis may be called for.

For a soup-to-nuts application of decision trees to a real-world data set, from data preparation through model building and decision rule generation, see Reference 7.

REFERENCES

1. Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, Classification and Regression Trees, Chapman & Hall/CRC Press, Boca Raton, FL, 1984.

2. Ruby L. Kennedy, Yuchun Lee, Benjamin Van Roy, Christopher D. Reed, and Richard P. Lippman, Solving Data Mining Problems through Pattern Recognition, Pearson Education, Upper Saddle River, NJ, 1995.

3. J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, 1992.

4. Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience, Hoboken, NJ, 2003.

5. Ronny Kohavi, Scaling up the accuracy of naive Bayes classifiers: A decision tree hybrid, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996.

6. C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/mlearn/MLRepository.html, University of California, Department of Information and Computer Science, Irvine, CA, 1998.

7. Daniel Larose, Data Mining Methods and Models, Wiley-Interscience, Hoboken, NJ (to appear 2005).

EXERCISES

1. Describe the possible situations when no further splits can be made at a decision node.

2. Suppose that our target variable is continuous numeric. Can we apply decision trees directly to classify it? How can we work around this?

3. True or false: Decision trees seek to form leaf nodes to maximize heterogeneity in each node.

4. Discuss the benefits and drawbacks of a binary tree versus a bushier tree.


TABLE E6.4 Decision Tree Data

Occupation    Gender    Age    Salary
Service       Female    45     $48,000
Service       Male      25     $25,000
Service       Male      33     $35,000
Management    Male      25     $45,000
Management    Female    35     $65,000
Management    Male      26     $45,000
Management    Female    45     $70,000
Sales         Female    40     $50,000
Sales         Male      30     $40,000
Staff         Female    50     $40,000
Staff         Male      25     $25,000

Consider the data in Table E6.4. The target variable is salary. Start by discretizing salary as follows:

Less than $35,000: Level 1
$35,000 to less than $45,000: Level 2
$45,000 to less than $55,000: Level 3
Above $55,000: Level 4
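For readers who wish to check their hand work with software, one way to carry out this discretization is with pandas (a sketch; the bin edges follow the levels above):

```python
import pandas as pd

# Salaries from Table E6.4, in row order.
salary = pd.Series([48000, 25000, 35000, 45000, 65000, 45000,
                    70000, 50000, 40000, 40000, 25000])

# Left-closed bins: [0, 35000), [35000, 45000), [45000, 55000), [55000, inf).
level = pd.cut(salary,
               bins=[0, 35000, 45000, 55000, float("inf")],
               labels=["Level 1", "Level 2", "Level 3", "Level 4"],
               right=False)
print(level)
```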

5. Construct a classification and regression tree to classify salary based on the other variables. Do as much as you can by hand, before turning to the software.

6. Construct a C4.5 decision tree to classify salary based on the other variables. Do as much as you can by hand, before turning to the software.

7. Compare the two decision trees and discuss the benefits and drawbacks of each.

8. Generate the full set of decision rules for the CART decision tree.

9. Generate the full set of decision rules for the C4.5 decision tree.

10. Compare the two sets of decision rules and discuss the benefits and drawbacks of each.

Hands-on Analysis

For the following exercises, use the churn data set available at the book series Web site. Normalize the numerical data and deal with the correlated variables.
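A starting-point sketch for this preprocessing in pandas (the DataFrame df and its column types are assumptions; deciding which correlated fields to drop is left to the analyst):

```python
# Assume df is a pandas DataFrame holding the churn records.
numeric_cols = df.select_dtypes(include="number").columns

# Min-max normalize each numeric field to [0, 1].
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].min()) / (
    df[numeric_cols].max() - df[numeric_cols].min()
)

# Inspect pairwise correlations; strongly correlated pairs (for example,
# minutes and charge fields) are candidates for keeping only one of the two.
print(df[numeric_cols].corr().round(2))
```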

11. Generate a CART decision tree.

12. Generate a C4.5-type decision tree.

13. Compare the two decision trees and discuss the benefits and drawbacks of each.

14. Generate the full set of decision rules for the CART decision tree.

15. Generate the full set of decision rules for the C4.5 decision tree.

16. Compare the two sets of decision rules and discuss the benefits and drawbacks of each.
