
6.2 Supervised Learning of Univariate Decision Trees

Several systems for learning decision trees have been proposed. Prominent among these are ID3 and its new version, C4.5 [Quinlan, 1986, Quinlan, 1993], and CART [Breiman, et al., 1984]. We discuss here only batch methods, although incremental ones have also been proposed [Utgoff, 1989].

6.2.1 Selecting the Type of Test

As usual, we have n features or attributes. If the attributes are binary, the tests are simply whether the attribute's value is 0 or 1. If the attributes are categorical, but non-binary, the tests might be formed by dividing the attribute values into mutually exclusive and exhaustive subsets. A decision tree with such tests is shown in Fig. 6.4. If the attributes are numeric, the tests might involve "interval tests," for example 7 ≤ x_i ≤ 13.2.
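To make the three kinds of tests concrete, here is a minimal Python sketch (not from the original text; the function names, attribute indices, and example values are illustrative only):

    # Illustrative univariate tests of the three kinds described above.
    # Attribute indices, subsets, and thresholds are arbitrary examples.

    def binary_test(pattern, i):
        """Binary attribute: is the value of attribute i equal to 1?"""
        return pattern[i] == 1

    def subset_test(pattern, i, subset):
        """Categorical attribute: does the value of attribute i fall in a chosen subset?"""
        return pattern[i] in subset

    def interval_test(pattern, i, low, high):
        """Numeric attribute: does the value of attribute i lie in [low, high]?"""
        return low <= pattern[i] <= high

    # For example, the interval test 7 <= x_i <= 13.2 mentioned above:
    print(interval_test((3.0, 9.5), 1, 7, 13.2))   # True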

Figure 6.3: A Decision Tree Implementing a Decision List

6.2.2 Using Uncertainty Reduction to Select Tests

The main problem in learning decision trees for the binary-attribute case is selecting the order of the tests. For categorical and numeric attributes, we must also decide what the tests should be (besides selecting the order).

Several techniques have been tried; the most popular one is, at each stage, to select the test that maximally reduces an entropy-like measure.

We show how this technique works for the simple case of tests with binary outcomes. Extension to multiple-outcome tests is straightforward computationally but gives poor results because entropy is always decreased by having more outcomes.

The entropy or uncertainty still remaining about the class of a pattern, knowing that it is in some set, Ξ, of patterns, is defined as:

H(Ξ) = −∑_i p(i | Ξ) log₂ p(i | Ξ)

where p(i | Ξ) is the probability that a pattern drawn at random from Ξ belongs to class i, and the summation is over all of the classes. We want to select tests at each node such that as we travel down the decision tree, the uncertainty about the class of a pattern becomes less and less.

Figure 6.4: A Decision Tree with Categorical Attributes (each internal node tests one categorical attribute and splits its values into mutually exclusive, exhaustive subsets; for example, the root test on x3 = a, b, c, or d branches on the subsets {a, c}, {b}, and {d})

Since we do not in general have the probabilities p(i | Ξ), we estimate them by sample statistics. Although these estimates might be errorful, they are nevertheless useful in estimating uncertainties. Let p̂(i | Ξ) be the number of patterns in Ξ belonging to class i divided by the total number of patterns in Ξ. Then an estimate of the uncertainty is:

Ĥ(Ξ) = −∑_i p̂(i | Ξ) log₂ p̂(i | Ξ)

For simplicity, from now on we'll drop the "hats" and use sample statistics as if they were real probabilities.
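As a concrete illustration (a sketch, not part of the original text), the sample-statistics estimate of the uncertainty can be computed directly from the class labels of the patterns in Ξ:

    import math
    from collections import Counter

    def entropy(labels):
        """Estimate H(Xi) = -sum_i p(i|Xi) log2 p(i|Xi), with p(i|Xi) taken to be
        the fraction of patterns in the set that belong to class i."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    # A set with six patterns of class 0 and two of class 1 (as in the example below):
    print(entropy([0, 0, 0, 0, 0, 0, 1, 1]))   # about 0.81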

If we perform a test, T, having k possible outcomes on the patterns in Ξ, we will create k subsets, Ξ_1, Ξ_2, …, Ξ_k. Suppose that n_i of the patterns in Ξ are in Ξ_i for i = 1, …, k. (Some n_i may be 0.) If we knew that T applied to a pattern in Ξ resulted in the j-th outcome (that is, we knew that the pattern was in Ξ_j), the uncertainty about its class would be:

H(Ξ_j) = −∑_i p(i | Ξ_j) log₂ p(i | Ξ_j)

and the reduction in uncertainty (beyond knowing only that the pattern was in Ξ) would be:

H(Ξ) − H(Ξ_j)

Of course we cannot say that the test T is guaranteed always to produce that amount of reduction in uncertainty because we don't know that the result of the test will be the j-th outcome. But we can estimate the average uncertainty over all the Ξ_j by:

E[H_T(Ξ)] = ∑_j p(Ξ_j) H(Ξ_j)

where by H_T(Ξ) we mean the average uncertainty after performing test T on the patterns in Ξ, p(Ξ_j) is the probability that the test has outcome j, and the sum is taken from 1 to k. Again, we don't know the probabilities p(Ξ_j), but we can use sample values. The estimate p̂(Ξ_j) of p(Ξ_j) is just the number of those patterns in Ξ that have outcome j divided by the total number of patterns in Ξ. The average reduction in uncertainty achieved by test T (applied to patterns in Ξ) is then:

R_T(Ξ) = H(Ξ) − E[H_T(Ξ)]
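Continuing the sketch above (again illustrative, reusing the entropy helper defined earlier), E[H_T(Ξ)] and R_T(Ξ) can be estimated from the class labels of the patterns falling into each outcome subset:

    def average_uncertainty(subsets):
        """E[H_T(Xi)]: the entropies H(Xi_j) of the outcome subsets, weighted by the
        sample estimates p(Xi_j) = |Xi_j| / |Xi|."""
        total = sum(len(s) for s in subsets)
        return sum((len(s) / total) * entropy(s) for s in subsets if s)

    def uncertainty_reduction(labels, subsets):
        """R_T(Xi) = H(Xi) - E[H_T(Xi)], where `subsets` partition the labels by outcome."""
        return entropy(labels) - average_uncertainty(subsets)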

An important family of decision tree learning algorithms selects for the root of the tree that test that gives maximum reduction of uncertainty, and then applies this criterion recursively until some termination condition is met (which we shall discuss in more detail later). The uncertainty calculations are particularly simple when the tests have binary outcomes and when the attributes have binary values. We'll give a simple example to illustrate how the test selection mechanism works in that case.
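A minimal sketch of such a greedy procedure for binary attributes, reusing the helpers above, might look as follows (the function names are mine, and the stopping rule here is the simplest possible: stop when the set is pure or no attributes remain):

    def majority(labels):
        return max(set(labels), key=labels.count)

    def build_tree(patterns, labels, attributes, default=0):
        """Greedily pick the attribute whose test maximally reduces uncertainty,
        then recurse on the two resulting subsets."""
        if not labels:
            return default                       # no patterns reached this node
        if entropy(labels) == 0 or not attributes:
            return majority(labels)              # leaf: predict the majority class

        def split(a):
            return ([(p, c) for p, c in zip(patterns, labels) if p[a] == 0],
                    [(p, c) for p, c in zip(patterns, labels) if p[a] == 1])

        def reduction(a):
            left, right = split(a)
            return uncertainty_reduction(labels, [[c for _, c in left], [c for _, c in right]])

        best = max(attributes, key=reduction)
        rest = [a for a in attributes if a != best]
        left, right = split(best)
        return {best: {0: build_tree([p for p, _ in left],  [c for _, c in left],  rest, majority(labels)),
                       1: build_tree([p for p, _ in right], [c for _, c in right], rest, majority(labels))}}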

Suppose we want to use the uncertainty-reduction method to build a decision tree to classify the following patterns:

pattern      class
(0, 0, 0)      0
(0, 0, 1)      0
(0, 1, 0)      0
(0, 1, 1)      0
(1, 0, 0)      0
(1, 0, 1)      1
(1, 1, 0)      0
(1, 1, 1)      1

Figure 6.5: Eight Patterns to be Classified by a Decision Tree (the patterns at the corners of the cube with axes x1, x2, x3; the split made by the test x1 is indicated)

What single test, x1, x2, or x3, should be performed first? The illustration in Fig. 6.5 gives geometric intuition about the problem.

The initial uncertainty for the set, Ξ, containing all eight points is:

H(Ξ) = −(6/8) log₂(6/8) − (2/8) log₂(2/8) = 0.81

Next, we calculate the uncertainty reduction if we perform x1 first. The left-hand branch has only patterns belonging to class 0 (we call them the set Ξ_l), and the right-hand branch (Ξ_r) has two patterns in each class. So, the uncertainty of the left-hand branch is:

H_x1(Ξ_l) = −(4/4) log₂(4/4) − (0/4) log₂(0/4) = 0

(taking 0 log₂ 0 = 0). And the uncertainty of the right-hand branch is:

H_x1(Ξ_r) = −(2/4) log₂(2/4) − (2/4) log₂(2/4) = 1

Half of the patterns "go left" and half "go right" on test x1. Thus, the average uncertainty after performing the x1 test is:

(1/2) H_x1(Ξ_l) + (1/2) H_x1(Ξ_r) = 0.5

Therefore the uncertainty reduction on Ξ achieved by x1 is:

R_x1(Ξ) = 0.81 − 0.5 = 0.31

By similar calculations, we see that the test x3 achieves exactly the same uncertainty reduction, but x2 achieves no reduction whatsoever. Thus, our "greedy" algorithm for selecting a first test would select either x1 or x3. Suppose x1 is selected. The uncertainty-reduction procedure would then select x3 as the next test. The decision tree that this procedure creates thus implements the Boolean function f = x1 x3.

See [Quinlan, 1986, sect. 4] for another example.

6.2.3 Non-Binary Attributes

If the attributes are non-binary, we can still use the uncertainty-reduction technique to select tests. But now, in addition to selecting an attribute, we must select a test on that attribute. Suppose, for example, that the value of an attribute is a real number and that the test to be performed is to set a threshold and check whether the number is greater than or less than that threshold. In principle, given a set of labeled patterns, we can measure the uncertainty reduction achieved by every possible threshold (there are only a finite number of thresholds that give different test results if there are only a finite number of training patterns).
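A sketch of this enumeration (illustrative only; the function names are mine, and the helpers from the earlier sketches are reused): with finitely many training patterns, only the midpoints between consecutive distinct attribute values need to be tried.

    def candidate_thresholds(values):
        """Midpoints between consecutive distinct sorted values; any other threshold
        gives exactly the same test results on the training patterns."""
        vs = sorted(set(values))
        return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

    def best_threshold(values, labels):
        """Pick the threshold t whose test 'value <= t' maximizes uncertainty reduction.
        Assumes at least two distinct attribute values."""
        def reduction(t):
            left  = [c for v, c in zip(values, labels) if v <= t]
            right = [c for v, c in zip(values, labels) if v > t]
            return uncertainty_reduction(labels, [left, right])
        return max(candidate_thresholds(values), key=reduction)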

Similarly, if an attribute is categorical (with a finite number of categories), there are only a finite number of mutually exclusive and exhaustive subsets into which the values of the attribute can be split. We can calculate the uncertainty reduction for each split.
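For the categorical case, here is a sketch restricted to two-way splits (a simplification of mine; the text allows splits into more than two subsets, but enumerating all such partitions is more involved):

    from itertools import combinations

    def two_way_splits(categories):
        """Enumerate mutually exclusive, exhaustive two-way splits of a categorical
        attribute's values: each split is a nonempty subset and its complement."""
        cats = sorted(categories)
        splits = []
        for r in range(1, len(cats) // 2 + 1):
            for subset in combinations(cats, r):
                rest = tuple(c for c in cats if c not in subset)
                if 2 * r == len(cats) and rest < subset:
                    continue          # this split was already listed with the halves swapped
                splits.append((set(subset), set(rest)))
        return splits

    print(two_way_splits(['a', 'b', 'c', 'd']))
    # 7 splits: ({'a'}, {'b', 'c', 'd'}), ({'b'}, {'a', 'c', 'd'}), ..., ({'a', 'b'}, {'c', 'd'}), ...

The uncertainty reduction of each candidate split can then be evaluated exactly as in the numeric case, using the subsets of class labels it induces.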