

DECISION AND REGRESSION TREES

Tree‐based methods have long been popular in predictive modeling.

Their popularity is owed, in no small part, to their simplicity and predictive power, along with the small number of assumptions they require.

Tree-based methods of modeling have been around since the 1960s, with CART, C4.5, and CHAID among the most common implementations. Classification and regression trees (CART) were first presented by Leo Breiman and his coauthors in their 1984 book Classification and Regression Trees. The initial CART algorithm has been improved and extended in a number of ways through the years, but the overall principles remain.

Begin with a set of training data and look at the input variables and their corresponding levels to determine the possible split points.

Each split point is then evaluated to determine which split is “best.”

“Best” in this case is creating the purest set of subgroups. The process is then repeated with each subgroup until a stopping condition has been met. The main distinction between decision trees and regression trees is that decision trees assign a category from a finite or nominal list of choices, such as gender or color. Regression trees, in contrast, predict a continuous value for each region, such as income or age.
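As a rough sketch of this greedy search, the snippet below scans every candidate split point of a single numeric input and scores each split by the weighted share of the majority class in the two subgroups. The function names (purity, best_split) and the toy height/gender data are illustrative, not from any particular library.

```python
# Evaluate every candidate split point of one numeric input variable and
# keep the split that produces the "purest" pair of subgroups.
from collections import Counter

def purity(labels):
    """Fraction of observations that belong to the majority class."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)

def best_split(values, labels):
    """Return (split_point, weighted_purity) for the best binary split."""
    n = len(values)
    best = (None, -1.0)
    for point in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < point]
        right = [lab for v, lab in zip(values, labels) if v >= point]
        if not left or not right:
            continue  # a split must create two non-empty subgroups
        score = (len(left) * purity(left) + len(right) * purity(right)) / n
        if score > best[1]:
            best = (point, score)
    return best

# Toy example: height in cm as the input, gender as the target.
heights = [150, 155, 160, 172, 180, 185, 190]
genders = ["F", "F", "F", "M", "M", "M", "M"]
print(best_split(heights, genders))   # (172, 1.0): a perfect split here
```

The same scan is repeated for every input variable, and the winning variable and split point become the next branch of the tree.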

Consider the problem of trying to predict the gender of students in a general elective course at a university. You could build a decision tree that predicts gender by dividing a representative population by height, weight, birth date, last name, or first name. Last name and birth date should have no predictive power. Height and weight, however, would likely be able to predict gender in some cases (the very tall are likely to be male and the very short likely to be female). More predictive still would be first names, as there are fewer names than heights or weights that are gender-neutral. A variable that is hard to capture but probably very predictive is hair length. This example shows that with the right attribute, it is often possible to get near-perfect separation with a single variable. In most real-world problems, that single “magic” variable does not exist, is too costly to collect, or is something that could not be known beforehand for the purposes of scoring new cases.12

Decision and regression trees both recursively partition the input space one variable at a time. Often a single binary split is used at each level, since a multiway split can be expressed as a sequence of binary splits.

Figure 5.18 shows a tree where the original data set (region A) is first split on variable x by whether it is less than 2 (region B) or greater than or equal to 2 (region C). Region C is then further split on variable y into two regions, D (y < 5) and E (y >= 5).

These same regions are shown in Figure 5.19 as partitioned input space. The partitioned input space shows only the leaves of the tree (a leaf is a node with no further splits, i.e., a terminal node). Nodes that have been split are called branches; in this tree the branches are A and C.
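Read as code, the tree in Figure 5.18 is just a pair of nested comparisons. A minimal sketch of assigning a new observation to one of the leaf regions, assuming the split points shown in the figures:

```python
def region(x, y):
    """Assign an observation to a leaf region of the tree in Figure 5.18."""
    if x < 2:
        return "B"   # left branch of the root split on x
    elif y < 5:
        return "D"   # right branch (region C), then lower split on y
    else:
        return "E"   # right branch (region C), upper split on y

print(region(1, 10))  # "B"
print(region(3, 4))   # "D"
print(region(3, 7))   # "E"
```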

Building a tree happens in two phases. First, a large tree is created by recursively splitting the input data set, using a metric to determine the best split.

12 Do not forget that the purpose of building a predictive model is to anticipate a behavior or attribute.

Figure 5.18 Basic Decision Tree (root node A splits on x < 2 into region B and x >= 2 into region C; C splits on y < 5 into region D and y >= 5 into region E)


Growth continues until a stopping condition is reached; conditions might include insufficient data for additional splits, additional splits that do not improve the model fit, or the tree reaching a specified depth.

After the tree has been built, it is often overfit (which makes it poor for prediction). That is, it matches the input data too well, capturing all of the small fluctuations that are present in the input data set but not in the general population. Thus, the overlarge tree is then pruned back until it is predictive in general, not just for the particular input data set.

Pruning is particularly effective if a second, validation data set is used to determine how well the tree performs on general data. This comes at the cost of taking some data that would have been used to grow the tree. The C4.5 splitting method attempts to prune based on estimated error rates instead of using a separate data set.
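A minimal sketch of this grow-then-prune process using scikit-learn, assuming a simple 70/30 split of the data: the full tree's cost-complexity pruning path supplies candidate pruning levels, and the level that scores best on the held-out validation set is kept.

```python
# A sketch of validation-based pruning using scikit-learn's
# cost-complexity pruning path; the data split is illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Grow a large tree, then compute the candidate pruning levels (alphas).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Refit at each pruning level and keep the tree that scores best on the
# held-out validation data rather than on the training data.
best_alpha, best_score = 0.0, 0.0
for alpha in alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_valid, y_valid)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```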

There are a large number of methods to determine the “goodness” of a split. A few of the main ones are discussed here: misclassification rate, entropy, Gini, Chi-square, and variance. The interested reader is encouraged to consult a survey book, such as Data Mining with Decision Trees by Lior Rokach and Oded Maimon. As they mention, no one “goodness” method will be the best in every scenario, so there is benefit in being familiar with a number of them.

Consider the Iris data that was used as an example in the Neural Networks section. After the data is partitioned using the same 70% for training and 30% for validation, all the split points from the four input variables are evaluated to see which gives the purest split. If the petal width is less than 7.5 millimeters (mm), then the Iris is a setosa. If the petal width is greater than 7.5 mm or missing, then it is either versicolor or virginica. A further split is needed to create a distinction among all three groups.

Figure 5.19 Basic Decision Tree Shown as a Partition Region (leaf regions B, D, and E, separated by the boundaries x = 2 and y = 5)


All of the variables and their split points are then considered for how to refine the groups for purity. In this case the petal width is again the best variable to create pure groups. This time the split point is a petal width of less than 17.5 mm or missing (and greater than 7.5 mm, because those observations have already been removed). About 90% of those Iris are versicolor. On the other side of the branch it is almost 100% virginica. The tree is not split further, even though a few flowers are wrongly classified (4% of the validation data); this is because there is no split that improves the classification without violating a stopping rule.

The English rules for classification of future Iris flowers appear as follows:

Number of Observations = 34
  Predicted: species=setosa = 1.00

Number of Observations = 36
  Predicted: species=setosa = 0.00

Number of Observations = 33
  Predicted: species=setosa = 0.00
  Predicted: species=versicolor = 0.03
  Predicted: species=virginica = 0.97
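A sketch of the same exercise using scikit-learn is shown below. Note that scikit-learn's copy of the Iris data is measured in centimeters rather than millimeters, so the printed thresholds will differ from the values quoted above, and the exact splits depend on the random 70/30 partition.

```python
# Fit a small decision tree to the Iris data and print its rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_valid, y_train, y_valid = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)  # 70% / 30%

# Two levels are enough: one split isolates setosa, a second split
# separates versicolor from virginica, mirroring the description above.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=list(iris.feature_names)))
print("validation accuracy:", tree.score(X_valid, y_valid))
```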


Misclassification rate is based on the predicted value of a nominal target (i.e., an unordered set of values, such as Cincinnati, Chattanooga, and Spokane), usually the most common value in the region. Misclassification asks how many observations in each region (N_r observations in each region, with N_R observations in the combined region) were mispredicted (N_m), and thus is

M = \frac{1}{N_R} \sum_{r} N_m

Variance, used for regression trees, is similar to misclassification rate if the predicted value for the region (\hat{t}_r) is simply the average value of the target in that region. The variance is the usual variance formula, summed over the observations in each of the regions separately. If t_i is the target value in observation i:

V = \sum_{r} \frac{1}{N_r} \sum_{i \in r} \left( t_i - \hat{t}_r \right)^2

Entropy, used with nominal targets, uses the “information” of a split. The goodness metric is then the gain of information (decrease of entropy). The entropy for a set of regions, where N_t observations in a region have the value t for the target, is

E = -\sum_{r} \frac{N_r}{N_R} \sum_{t} \frac{N_t}{N_r} \log_2 \frac{N_t}{N_r}

Gini, also for nominal targets, is similar:

G = \sum_{r} \frac{N_r}{N_R} \left( 1 - \sum_{t} \left( \frac{N_t}{N_r} \right)^2 \right)
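Under the definitions above, these split criteria can be computed directly. The sketch below represents each region of a candidate split as a list of target values (nominal for the first three measures, numeric for variance); the function names are illustrative.

```python
import math

def misclassification_rate(regions):
    """Share of all observations not equal to their region's most common value."""
    total = sum(len(r) for r in regions)
    wrong = sum(len(r) - max(r.count(v) for v in set(r)) for r in regions)
    return wrong / total

def entropy(regions):
    """Weighted entropy (in bits) of the target within each region."""
    total = sum(len(r) for r in regions)
    e = 0.0
    for r in regions:
        for v in set(r):
            p = r.count(v) / len(r)
            e -= (len(r) / total) * p * math.log2(p)
    return e

def gini(regions):
    """Weighted Gini impurity of the target within each region."""
    total = sum(len(r) for r in regions)
    return sum((len(r) / total) *
               (1 - sum((r.count(v) / len(r)) ** 2 for v in set(r)))
               for r in regions)

def variance(regions):
    """Average squared deviation from each region's mean, summed over regions."""
    total_var = 0.0
    for r in regions:
        mean = sum(r) / len(r)
        total_var += sum((t - mean) ** 2 for t in r) / len(r)
    return total_var

# Example: the two regions of a candidate split on a nominal target.
split = [["a", "a", "a", "b"], ["b", "b", "b", "b", "a"]]
print(misclassification_rate(split), entropy(split), gini(split))
print(variance([[1.0, 2.0, 3.0], [10.0, 12.0]]))
```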

Methods based on Chi-square, including CHAID, view the split as a two-way contingency table. For instance, Table 5.4 shows a three-region split with two target levels. Each cell in the table is the number of observations within each region with each target value.

Table 5.4 Observations

                  Region A    Region B    Region C
Target value α        97          75          23
Target value β         3          50         175


The p-value of the resulting Chi-square sum determines the goodness of the split and also serves as a threshold to end growth.
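For example, the split in Table 5.4 can be evaluated with a standard contingency-table Chi-square test; a small p-value indicates that the regions separate the two target levels well. The sketch below uses SciPy's chi2_contingency for the calculation.

```python
# Chi-square evaluation of the split in Table 5.4.
from scipy.stats import chi2_contingency

observed = [[97, 75, 23],    # target value alpha, Regions A / B / C
            [3, 50, 175]]    # target value beta,  Regions A / B / C

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
```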

The finished tree can be used to predict the value of a data set, understand how the data is partitioned, and determine variable importance. The tree structure makes it easy to see what factors were most important to determining the value of the target variable. The more important variables will be split more often and will generally be higher in the tree, where more data is being partitioned. The amount of data that a variable partitions determines how important that variable is. This importance can then be used to guide further analysis, such as regression.
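A sketch of reading this kind of importance from a fitted tree, using scikit-learn's feature_importances_ attribute on the Iris data (the depth limit here is illustrative):

```python
# The importances reflect how much data each variable partitions
# across its splits in the fitted tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```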

Although a single decision tree is easy to understand, it can be sensitive to fluctuations in the input data despite the pruning process. For instance, say two variables differ on only 2 out of 1,000 observations and therefore partition the data almost identically, and suppose both provide the best split of the data. Depending on how the data is used, one variable or the other may be selected, and the other will be ignored, since it will likely not help partition the data much further. This is advantageous in that correlated variables are compensated for automatically. However, which variable is selected as a split in the tree may change between runs based on very small changes in the data. Additionally, if the fit is not almost perfect, the later splits may be wildly different depending on which variable is chosen, resulting in considerably different trees.

Because of this, in addition to building a single decision tree, there are a number of methods that build upon using many decision trees. Random Forest™ uses bootstrap methods to create an ensemble of trees, one for each bootstrap sample. Additionally, the variables eligible to be used in splitting are randomly varied in order to decorrelate the trees. Once the forest of trees is created, they “vote” to determine the predicted value of input data. Additive methods, including AdaBoost and gradient boosting, use a number of very basic trees (often just a single split) together to predict the value. The sum of all of the trees’ predictions is used to determine the final prediction, using a mapping method such as a logit function for nominal targets.

Unlike forest models, the trees in additive models are created in series, in such a way as to correct the mispredictions of the previous trees. This is accomplished by adjusting observation weights to make misclassified observations more important.
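A sketch of the two ensemble styles using scikit-learn's implementations of the general ideas (a bagged, voting forest and additive boosted trees), with cross-validated accuracy as a rough comparison; the model settings are illustrative.

```python
# Ensembles of trees: a bagged forest that votes, and additive boosting
# models built from many shallow trees fit in series.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "forest (bootstrap samples, random variable subsets, voting)":
        RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost (reweights misclassified observations)":
        AdaBoostClassifier(n_estimators=200, random_state=0),
    "gradient boosting (each tree corrects the previous trees)":
        GradientBoostingClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```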