
Predictive Modeling for Classification

4.1 Decision Tree Classification

Decision tree classification is one of the most widely used and practical methods for inductive inference. Decision tree learning is robust to noisy data and is capable of learning both conjunctive and disjunctive expressions. It is generally used to approximate discrete-valued target functions. Mitchell [59] characterizes problems suited to decision trees as follows (presentation courtesy of Hamilton et al. [39]):

• Instances are composed of attribute-value pairs.

- Instances are described by a fixed set of attributes (e.g., temperature) and their values (e.g., hot).

- The easiest situation for decision tree learning occurs when each attribute takes on a small number of disjoint possible values (e.g., hot, mild, cold).

- Extensions to the basic algorithm allow handling real-valued attributes as well (e.g., temperature).

• The target function has discrete output values.

- A decision tree assigns a classification to each example. Boolean classification (with only two possible classes) is the simplest. Methods can easily be extended to learning functions with multiple (> 2) possible output values.

- Learning target functions with real-valued outputs is also possible (though significant extensions to the basic algorithm are necessary);

these are commonly referred to as regression trees.

• Disjunctive descriptions may be required (since decision trees naturally represent disjunctive expressions).

• The training data may contain errors. Decision tree learning methods are robust to errors - both errors in classifications of the training examples and errors in the attribute values that describe these examples.

• The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., temperature is known for only some of the examples).

The model built by the algorithm is represented by a decision tree - hence the name. A decision tree is a sequential arrangement of tests (an appropriate test is prescribed at every step in an analysis). The leaves of the tree predict the class of the instance. Every path from the tree root to a leaf corresponds to a conjunction of attribute tests. Thus, the entire tree represents a disjunction of conjunctions of constraints on the attribute values of instances. This tree can also be represented as a set of if-then rules. This adds to the readability and intuitiveness of the model.

For instance, consider the weather dataset shown in Table 4.1. Figure 4.1 shows one possible decision tree learned from this dataset. New instances are classified by sorting them down the tree from the root node to some leaf node, which provides the classification of the instance. Every interior node of the tree specifies a test of some attribute of the instance; each branch descending from that node corresponds to one of the possible values of this attribute.

So, an instance is classified by starting at the root node of the decision tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute. This process is then repeated at the node on this branch, and so on, until a leaf node is reached. For example, the instance {sunny, hot, normal, FALSE} would be classified as "Yes" by the tree in Figure 4.1.
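To make this traversal concrete, the following minimal Python sketch classifies an instance by walking from the root to a leaf. The nested-dictionary encoding and the exact branch values are assumptions made for illustration (the tree is shaped like the one in Figure 4.1, with Outlook at the root); it is not code from the text.

# A minimal sketch: a decision tree encoded as nested dictionaries.
# Interior nodes are {"attribute": ..., "branches": {value: subtree}};
# leaves are plain class labels. The structure below is assumed to
# resemble Figure 4.1 and is illustrative only.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "No", "normal": "Yes"},
        },
        "overcast": "Yes",
        "rainy": {
            "attribute": "windy",
            "branches": {"TRUE": "No", "FALSE": "Yes"},
        },
    },
}

def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches
    the instance's value for the attribute tested at each node."""
    while isinstance(node, dict):          # interior node: keep descending
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node                            # leaf: the predicted class

instance = {"outlook": "sunny", "temperature": "hot",
            "humidity": "normal", "windy": "FALSE"}
print(classify(tree, instance))            # -> "Yes", as in the text's example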

Table 4.1. The Weather Dataset (rows not recovered in this extraction; the attributes include outlook, temperature, humidity, and wind, with a yes/no class)

While many possible trees can be learned from the same set of training data, finding the optimal decision tree is an NP-complete problem. Occam's Razor (specialized to decision trees) is used as a guiding principle: "The world is inherently simple. Therefore the smallest decision tree that is consistent with the samples is the one that is most likely to identify unknown objects correctly." Rather than building all possible trees, measuring the size of each, and choosing the smallest tree that best fits the data, several heuristics can be used to build a good tree directly.

Quinlan's ID3 [72] algorithm is based on an information theoretic heuristic. It is appealingly simple and intuitive. As such, it is quite popular for constructing a decision tree. The seminal papers in Privacy Preserving Data Mining [4, 57] proposed solutions for constructing a decision tree using ID3 without disclosure of the data used to build the tree.

The basic ID3 algorithm is given in Algorithm 1. An information theoretic heuristic is used to decide the best attribute on which to split the tree. The subtrees are built by recursively applying the ID3 algorithm to the appropriate subset of the dataset.

Fig. 4.1. A decision tree learned from the weather dataset (the root tests Outlook; the subtrees test Humidity and Wind, with leaves labeled Yes or No)

Building an ID3 decision tree is a recursive process, operating on the decision attributes R, the class attribute C, and the training entities T. At each stage, one of three things can happen:

1. R might be empty; i.e., the algorithm has no attributes on which to make a choice. In this case, a decision on the class must be made simply on the basis of the transactions. A simple heuristic is to create a leaf node with the class of the leaf being the majority class of the transactions in T.

2. All the transactions in T may have the same class c. In this case, a leaf is created with class c.

3. Otherwise, we recurse:

a) Find the attribute A that is the most effective classifier for transactions in T, specifically the attribute that gives the highest information gain.

b) Partition T based on the values a_i of A.

c) Return a tree with root labeled A and edges a_i, with the node at the end of edge a_i constructed by calling ID3 with R - {A}, C, T(a_i).

In step 3a, information gain is defined as the change in the entropy relative to the class attribute. Specifically, the entropy is

$$H_C(T) = -\sum_{c \in C} \frac{|T(c)|}{|T|} \log\left(\frac{|T(c)|}{|T|}\right).$$

Analogously, the entropy after classifying with A is

$$H_C(T|A) = \sum_{a \in A} \frac{|T(a)|}{|T|} \, H_C(T(a)).$$
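As an illustration of these two quantities, the following sketch computes H_C(T) and H_C(T|A) for a small hypothetical set of (attribute value, class) pairs. The data, the function names, and the choice of a base-2 logarithm are assumptions made for the example; the text leaves the logarithm base unspecified.

import math
from collections import Counter

def class_entropy(classes):
    """H_C(T): entropy of the class labels over a set of transactions."""
    total = len(classes)
    counts = Counter(classes)
    # Base-2 logarithm assumed here.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conditional_entropy(pairs):
    """H_C(T|A): weighted entropy after partitioning T on attribute A.
    `pairs` is a list of (attribute_value, class_label) tuples."""
    total = len(pairs)
    by_value = {}
    for a, c in pairs:
        by_value.setdefault(a, []).append(c)
    return sum(len(subset) / total * class_entropy(subset)
               for subset in by_value.values())

# Hypothetical toy data: (outlook, play) pairs, not Table 4.1.
T = [("sunny", "No"), ("sunny", "No"), ("overcast", "Yes"),
     ("rainy", "Yes"), ("rainy", "No"), ("overcast", "Yes")]
print(class_entropy([c for _, c in T]))   # H_C(T)
print(conditional_entropy(T))             # H_C(T | outlook)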


Information gain due to the attribute A is now defined as

$$\mathrm{Gain}(A) \stackrel{\text{def}}{=} H_C(T) - H_C(T|A).$$

The goal, then, is to find the A that maximizes Gain(A). Since H_C(T) is fixed for any given T, this is equivalent to finding the A that minimizes H_C(T|A). Expanding, we get:

$$
\begin{aligned}
H_C(T|A) &= \sum_{a \in A} \frac{|T(a)|}{|T|} \, H_C(T(a)) \\
&= -\sum_{a \in A} \frac{|T(a)|}{|T|} \sum_{c \in C} \frac{|T(a,c)|}{|T(a)|} \log\left(\frac{|T(a,c)|}{|T(a)|}\right) \\
&= \frac{1}{|T|} \left( -\sum_{a \in A} \sum_{c \in C} |T(a,c)| \log\left(|T(a,c)|\right) + \sum_{a \in A} |T(a)| \log\left(|T(a)|\right) \right)
\end{aligned}
\tag{4.1}
$$
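As a sanity check on this rearrangement, the sketch below evaluates both the direct definition of H_C(T|A) and the right-hand side of (4.1) on hypothetical counts |T(a,c)| and confirms that they agree. The counts, the class labels, and the base-2 logarithm are assumptions for illustration only.

import math

# Hypothetical counts |T(a, c)| for attribute values a and classes c.
counts = {("sunny", "Yes"): 2, ("sunny", "No"): 3,
          ("overcast", "Yes"): 4, ("overcast", "No"): 0,
          ("rainy", "Yes"): 3, ("rainy", "No"): 2}

T_a = {}                               # |T(a)| per attribute value
for (a, _), n in counts.items():
    T_a[a] = T_a.get(a, 0) + n
T = sum(T_a.values())                  # |T|

def xlogx(x):
    """x log2 x, with the usual convention that 0 log 0 = 0."""
    return x * math.log2(x) if x > 0 else 0.0

# Direct definition: H_C(T|A) = sum_a |T(a)|/|T| * H_C(T(a)).
direct = sum(
    (T_a[a] / T) * -sum(
        xlogx(counts[(a, c)] / T_a[a]) for c in ("Yes", "No"))
    for a in T_a)

# Rearranged form (4.1): every term now has the shape x log x.
rearranged = (1 / T) * (-sum(xlogx(n) for n in counts.values())
                        + sum(xlogx(n) for n in T_a.values()))

# The two values coincide (up to floating-point rounding).
print(direct, rearranged)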

Algorithm 1 ID3(R, C, T) tree learning algorithm
Require: R, the set of attributes
Require: C, the class attribute
Require: T, the set of transactions
1: if R is empty then
2:   return a leaf node, with class value assigned to most transactions in T
3: else if all transactions in T have the same class c then
4:   return a leaf node with the class c
5: else
6:   Determine the attribute A that best classifies the transactions in T
7:   Let a_1, ..., a_m be the values of attribute A. Partition T into the m partitions T(a_1), ..., T(a_m) such that every transaction in T(a_i) has the attribute value a_i.
8:   Return a tree whose root is labeled A (this is the test attribute) and has m edges labeled a_1, ..., a_m such that for every i, the edge a_i goes to the tree ID3(R - {A}, C, T(a_i)).
9: end if
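For concreteness, the following Python sketch mirrors Algorithm 1 on nominal attributes. It is a plain, non-private illustration with invented function names and toy data; it is not the privacy-preserving protocol of [4, 57], and a base-2 logarithm is assumed.

import math
from collections import Counter

def entropy(rows, class_attr):
    """H_C(T) over the transactions in `rows` (a list of dicts)."""
    total = len(rows)
    counts = Counter(r[class_attr] for r in rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def id3(attributes, class_attr, rows):
    """A sketch of Algorithm 1: returns either a class label (leaf)
    or a pair (test_attribute, {value: subtree}) for an interior node."""
    classes = [r[class_attr] for r in rows]
    if not attributes:                       # steps 1-2: no attributes left
        return Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:               # steps 3-4: all the same class
        return classes[0]
    # Step 6: pick the attribute with the highest information gain,
    # i.e. the attribute that minimizes H_C(T|A).
    def conditional_entropy(attr):
        parts = {}
        for r in rows:
            parts.setdefault(r[attr], []).append(r)
        return sum(len(p) / len(rows) * entropy(p, class_attr)
                   for p in parts.values())
    best = min(attributes, key=conditional_entropy)
    # Steps 7-8: partition T on the best attribute and recurse on R - {A}.
    partitions = {}
    for r in rows:
        partitions.setdefault(r[best], []).append(r)
    remaining = [a for a in attributes if a != best]
    return (best, {v: id3(remaining, class_attr, part)
                   for v, part in partitions.items()})

# Hypothetical toy transactions (not the rows of Table 4.1).
rows = [
    {"outlook": "sunny", "windy": "FALSE", "play": "No"},
    {"outlook": "sunny", "windy": "TRUE", "play": "No"},
    {"outlook": "overcast", "windy": "FALSE", "play": "Yes"},
    {"outlook": "rainy", "windy": "FALSE", "play": "Yes"},
    {"outlook": "rainy", "windy": "TRUE", "play": "No"},
]
print(id3(["outlook", "windy"], "play", rows))

On this toy data the sketch splits on outlook first and then on windy within the rainy branch, producing a nested structure analogous to Figure 4.1.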
