
How Decision Trees Work


Every tree is made up of nodes. Each node is associated with one of the input variables, and the edges leading from a node represent the possible values of that variable. A leaf represents the predicted value, determined by the values of the input variables along the path from the root node to that leaf. Because a picture paints a thousand words, see Figure 3-1 for an example.

Figure 3-1: A decision tree. The root node tests Age (<55 or >55); customers younger than 55 go to the Home Owner? node and those older than 55 go to the Good Credit? node, and each Y/N branch ends in a Loan or No Loan leaf.

Decision trees always start with a root node and end on a leaf. Notice that the trees don’t converge at any point; they split their way out as the nodes are processed.

Figure 3-1 shows a decision tree that classifies a loan decision. The root node is "Age" and has two branches coming from it, depending on whether the customer is younger or older than 55.

The age of the client determines what happens next. If the person is younger than 55, the tree prompts you to find out whether he or she is a homeowner. If the client is older than 55, you are prompted to check his or her credit rating.
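To make the traversal concrete, here is one way the logic of Figure 3-1 could be written as plain Java conditionals; the decideLoan method and its parameters are invented for illustration, but the branching mirrors the figure.

public class LoanDecision {

    // Mirrors Figure 3-1: Age is the root test, then Home Owner? or Good Credit?
    public static String decideLoan(int age, boolean homeOwner, boolean goodCredit) {
        if (age < 55) {
            return homeOwner ? "Loan" : "No Loan";
        } else {
            return goodCredit ? "Loan" : "No Loan";
        }
    }

    public static void main(String[] args) {
        System.out.println(decideLoan(40, true, false));  // Loan
        System.out.println(decideLoan(60, false, true));  // Loan
    }
}

Each path from the root node to a leaf corresponds to one branch of the conditionals.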

With this type of machine learning, you are using supervised learning to deduce the optimal method of making a prediction; what I mean by "supervised learning" is that you give the classifier data along with the outcomes. The real question is, "What's the best node to start with as the root node?" The next section examines how that calculation is done.

Building a Decision Tree

Decision trees are built around the following basic algorithm:

1. Check the model for the base cases.

2. Iterate through all the attributes (attr).

3. Get the normalized information gain from splitting on attr.

4. Let best_attr be the attribute with the highest information gain.

5. Create a decision node that splits on the best_attr attribute.

6. Work on the sublists that are obtained by splitting on best_attr and add those nodes as child nodes.

That’s the basic outline of what happens when you build a decision tree.

Depending on the algorithm type, like the ones previously mentioned, there might be subtle differences in the way things are done.
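As a rough, illustrative sketch of that outline (not the exact ID3 or C4.5 implementation), the recursive build could look something like the following in Java. The SimpleId3 class, its Node type, and the attribute names in main are all invented for this example; rows are simple maps from attribute name to true/false value, with "purchased" as the target.

import java.util.*;

public class SimpleId3 {

    static class Node {
        String attribute;                 // attribute tested at this node (null for a leaf)
        Boolean label;                    // predicted outcome (set only for a leaf)
        Map<Boolean, Node> children = new HashMap<>();
    }

    static double log2(double v) {
        return v <= 0 ? 0 : Math.log(v) / Math.log(2);
    }

    // Entropy of the target column over the given rows
    static double entropy(List<Map<String, Boolean>> rows, String target) {
        long pos = rows.stream().filter(r -> r.get(target)).count();
        double p = (double) pos / rows.size();
        double n = 1.0 - p;
        return -p * log2(p) - n * log2(n);
    }

    static Node build(List<Map<String, Boolean>> rows, Set<String> attrs, String target) {
        Node node = new Node();
        long pos = rows.stream().filter(r -> r.get(target)).count();
        // Base cases: no attributes left, or every row has the same outcome
        if (attrs.isEmpty() || pos == 0 || pos == rows.size()) {
            node.label = pos * 2 >= rows.size();
            return node;
        }
        // Find the attribute with the highest information gain
        String best = null;
        double bestGain = -1;
        for (String attr : attrs) {
            double after = 0;
            for (boolean v : new boolean[]{true, false}) {
                List<Map<String, Boolean>> subset = new ArrayList<>();
                for (Map<String, Boolean> r : rows) if (r.get(attr) == v) subset.add(r);
                if (!subset.isEmpty())
                    after += (double) subset.size() / rows.size() * entropy(subset, target);
            }
            double gain = entropy(rows, target) - after;
            if (gain > bestGain) { bestGain = gain; best = attr; }
        }
        // Split on the best attribute and recurse on each sublist
        node.attribute = best;
        Set<String> remaining = new HashSet<>(attrs);
        remaining.remove(best);
        for (boolean v : new boolean[]{true, false}) {
            List<Map<String, Boolean>> subset = new ArrayList<>();
            for (Map<String, Boolean> r : rows) if (r.get(best) == v) subset.add(r);
            if (!subset.isEmpty()) node.children.put(v, build(subset, remaining, target));
        }
        return node;
    }

    public static void main(String[] args) {
        List<Map<String, Boolean>> rows = List.of(
                Map.of("hasAccount", true,  "readsReviews", true,  "purchased", true),
                Map.of("hasAccount", false, "readsReviews", false, "purchased", false),
                Map.of("hasAccount", true,  "readsReviews", false, "purchased", false),
                Map.of("hasAccount", true,  "readsReviews", true,  "purchased", true));
        Node root = build(rows, new HashSet<>(Set.of("hasAccount", "readsReviews")), "purchased");
        System.out.println("Root attribute: " + root.attribute); // readsReviews
    }
}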

Manually Walking Through an Example

If you are interested in the basic mechanics of how the algorithm works and want to follow along, this section walks through the basics of calculating entropy and information gain. If you want to get to the hands-on part of the chapter, then you can skip this section.

The method of using information gain based on pre- and post-attribute entropy is the key method used within the ID3 and C4.5 algorithms. As these are commonly used algorithms, this section concentrates on that basic method of finding out how the decision tree is built.

With machine learning–based decision trees, you can get the algorithm to do all the work for you. It will figure out which is the best node to use as the root node. This requires finding out the purity of each node. Consider Table 3-1, which contains only true/false values describing some user purchases.

There are four nodes in the table:

Does the customer have an account?

Did the customer read previous product reviews?

Is the customer a returning customer?

Did the customer purchase the product?

At the start of calculating the decision tree there is no awareness of the node that will give the best result. You’re looking for the node that can best predict the outcome. This requires some calculation. Enter entropy.

Calculating Entropy

Entropy is a measure of uncertainty; it is measured in bits and, for a two-class problem, comes out as a number between 0 and 1 (entropy bits are not the same bits as used in computing terminology). Basically, you are looking for the unpredictability in a random variable.
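For the two-class case used in this example, the entropy of a set containing p positive and n negative outcomes works out to the following (writing log2() for the base-2 logarithm, as the code below does):

Entropy(p, n) = -(p/(p+n)) * log2(p/(p+n)) - (n/(p+n)) * log2(n/(p+n))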

You need to calculate the gain for the positive and negative cases. I’ve written a quick Java program to do the calculating:

package chapter3;

public class InformationGain {

    // Log base 2, defined as 0 for non-positive input to avoid a NaN result
    private double calcLog2(double value) {
        if (value <= 0.) {
            return 0.;
        }
        return Math.log10(value) / Math.log10(2.);
    }

    // Standard two-class entropy for the given counts of positive and negative cases;
    // this matches the worked values that follow
    public double calcGain(double positive, double negative) {
        double sum = positive + negative;
        return -(positive / sum) * calcLog2(positive / sum)
                - (negative / sum) * calcLog2(negative / sum);
    }

    public static void main(String[] args) {
        InformationGain ig = new InformationGain();
        System.out.println(ig.calcGain(2, 3)); // prints 0.9709...
    }
}

Looking back at the table of customers, there are three with credit accounts and two without. So, calculating the gain with these variables, you get the following result:

Gain(3,2) = -(3/5)*log2(3/5) - (2/5)*log2(2/5)

= 0.97

log2() refers to the calculation in the calcLog2() method in the code snippet. If you don’t want to type or compile the code listing, then try copying and pasting the gain equation into www.wolframalpha.com and you’ll see the answer there.

The outcomes of the Reads Reviews attribute, linked back to the Has Credit Account attribute, are the following:

Reads reviews = [Y, Y, N]

Does not read reviews = [N, Y]

You can now calculate the entropy of the split based on the first attribute:

Gain(2,1) = -(2/3)*log2(2/3) - (1/3)*log2(1/3)

= 0.91

Gain(1,1) = -(1/2)*log2(1/2) - (1/2)*log2(1/2)

= 1

The net gain is finally calculated, weighting each subset's entropy by the subset's share of the five customers:

Net gain(attribute = has credit account)

= (3/5) * 0.91 + (2/5) * 1

= 0.95

So, you have two gains: one before the split (0.97) and one after the split (0.95).

You're nearly done on this attribute. You just have to calculate the information gain.

Information Gain

When you know the gain before and after the split on the attribute, you can calculate the information gain. For the Has Credit Account attribute, your calculation will be the following:

InformationGain = Gain(before the split) – Gain(after the split) = 0.97 – 0.95

= 0.02

So, the information gain on the Has Credit Account attribute is 0.02.
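If you would rather reproduce these numbers in code than by hand, a quick sketch using the InformationGain class from the earlier listing could look like this; the counts passed to calcGain() come straight from the subset lists above.

InformationGain ig = new InformationGain();

// Entropy before the split: three customers with accounts, two without
double before = ig.calcGain(3, 2);              // 0.97

// Entropy after the split, weighted by the size of each subset
double after = (3.0 / 5) * ig.calcGain(2, 1)    // [Y, Y, N]
             + (2.0 / 5) * ig.calcGain(1, 1);   // [N, Y]

System.out.println(before - after);             // roughly 0.02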

Rinse and Repeat

The previous two sections covered the calculation of information gain for one attribute, Has Credit Account. You need to work on the other two attributes to find their information gain.

Reads Reviews:

Gain(3,2) = 0.97

Net Gain = 0.4

Information Gain = 0.57

Previous Customer:

Gain(4,1) = 0.72

Net Gain = 0.486

Information Gain = 0.234

With the values of information gain for all the attributes, you can now make a decision on which node to start with in the tree.

ATTRIBUTE               INFORMATION GAIN
Has Credit Account      0.02
Reads Reviews           0.57
Is Previous Customer    0.234

Now things are becoming clearer: the Reads Reviews attribute has the highest information gain and therefore should be the root node of the tree; then comes the Is Previous Customer node, followed by Has Credit Account.

The order of information gain determines where the node will appear in the decision tree model. The node with the highest gain becomes the root node.
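To make that rule concrete, here is a small, hypothetical Java snippet that picks the root by taking the attribute with the largest information gain, using the values from the table above.

import java.util.Map;

public class RootChooser {

    public static void main(String[] args) {
        // Information gain per attribute, taken from the table above
        Map<String, Double> gains = Map.of(
                "Has Credit Account", 0.02,
                "Reads Reviews", 0.57,
                "Is Previous Customer", 0.234);

        // The attribute with the highest gain becomes the root node
        String root = gains.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get()
                .getKey();

        System.out.println("Root node: " + root); // Reads Reviews
    }
}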

That’s enough of the basic theory of how decision trees work. The best way to learn is to get something working, which is described in the next section.
