
Interpreting the Model


Each Microsoft data mining algorithm in SQL Server 2005 is associated with a content viewer. The viewer is specifically designed to display the patterns discovered by the associated algorithm.

Figure 5.5 is a screenshot of the Microsoft Decision Tree Viewer, displaying the classification tree model of CollegePlans. The tree is laid out horizontally with the root node on the left side. Each node contains a histogram bar with different colors, representing various states. In this case, there are two colors in the histogram bar: the dark color represents College Plan = Yes, and the white color represents College Plan = No. The bottom part of the screen is a dockable window displaying the node legend of the selected node. The decision tree patterns are very easy to interpret: each path from the root to a given node forms a rule. In the figure, the selected node represents the node path Parent Encouragement = Encouraged and IQ >= 110 and Parent Income >= 49632.238 and Gender = Male. From the Node Legend window, you can see that 537 cases are classified to this node, and the probability of College Plan = Yes in this node is 88.54%.
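If you prefer to read these statistics outside the viewer, you can also query the model content directly with DMX. The following query is only a sketch, assuming the model is named CollegePlans; the flattened NODE_DISTRIBUTION column holds the per-state support and probability that the Node Legend window displays.

// Browse the content of the (assumed) CollegePlans tree model
SELECT FLATTENED NODE_CAPTION, NODE_SUPPORT, NODE_DISTRIBUTION
FROM [CollegePlans].CONTENT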

Figure 5.4 Content for decision tree model (Model → Tree 1 … Tree n → intermediate nodes → leaf nodes)

NOTE You may notice that the probability of the Missing state in Figure 5.5 is 0.01%, even though none of the 437 students classified in this node has a missing state. This is because a prior is considered when calculating the probability for each state. Usually, this favors states with low support.
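The exact prior the algorithm uses is internal to the product, but its effect can be sketched with a standard Laplace-style smoothing formula, where m is the weight given to the prior distribution (both m and the prior here are illustrative, not the product's actual values):

P(state) = (support of state + m * prior probability of state) / (node support + m)

Because the numerator never reaches zero, a state with zero support still receives a small nonzero probability, such as the 0.01% shown for the Missing state.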

There are a number of buttons and drop-down lists on the toolbar, including buttons for Zoom in, Zoom out, and Zoom to fit. The Tree drop-down list is used for making tree selections. Remember, a model may contain a set of trees.

The bar at the left of each entry in this drop-down list is a handy indicator of the size of the associated tree. The Histograms combo box allows you to specify the number of states (colors) to display in the histogram bar of each tree node. For example, if a predictable attribute has 10 states, you can use this drop-down list to display the five most important states. The other states are all grouped and shown in gray.

You may want to find the nodes representing the highest probability of College Plan = Yes. You can do so by looking at the histogram bar of each node. But when there are many states of the predictable attribute, it is not obvious which are the desired nodes. The Background drop-down list is very useful for this purpose. It controls the background color of the tree nodes. By default, the tree node background color represents the number of cases classified in each node.

The darker a node is, the more cases it contains. You can also pick a particular state of the predictable attribute in this drop-down list — for example, College Plan = Yes. In this case, the background color represents the probability for the selected state. If a node is a dark color, it has high probability associated with the given state.

Figure 5.5 Decision tree viewer

TIP Take a closer look at the node labeled Parent Income >= 49652: it splits on the Gender attribute, but its two children have very similar probability distributions. This split doesn't give additional information; it is a case of oversplitting. To avoid this, you can raise the value of COMPLEXITY_PENALTY.
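You set this parameter when you define the model. The following DMX statement is only a sketch: the column names and the value 0.9 are illustrative rather than taken from the book's sample database, but it shows where COMPLEXITY_PENALTY goes. Values closer to 1 penalize tree growth more heavily and therefore produce smaller trees.

// Hypothetical model definition; raise COMPLEXITY_PENALTY to discourage oversplitting
CREATE MINING MODEL CollegePlansTree
(
    StudentID           LONG KEY,
    Gender              TEXT DISCRETE,
    IQ                  LONG CONTINUOUS,
    ParentEncouragement TEXT DISCRETE,
    ParentIncome        LONG CONTINUOUS,
    CollegePlans        TEXT DISCRETE PREDICT
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.9)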

All Microsoft data mining viewers in SQL Server 2005 have multiple tabs, which display the patterns from different angles. Figure 5.6 shows the Dependency Network tab of the Decision Tree viewer. The dependency network displays the relationships among attributes derived from the decision tree model's content.

Each node in the figure represents one attribute, and each edge represents the relationship between two nodes. An edge has a direction, pointing from the input attribute (node) to the predictable attribute (node). An edge can be bidirectional, which means two nodes can predict each other. In the figure, Parental Encouragement, IQ, Parents' Income, and Gender can all predict College Plans.

An edge has a weight, and the slider at the left side of the pane filters edges by that weight. The heavier the weight, the stronger the predictor. The weight is derived from the tree's statistics, mainly the split score. In this example, if we move the slider down, we can see that the most important attribute for predicting College Plans is Parental Encouragement and the weakest one is Gender.

The Dependency Network viewer can be very useful when there are lots of predictable attributes, for example, a model with a predictable nested table. In this case, each node in the dependency network represents a tree root. You can think of this graph as a bird's-eye view of a forest. It provides extremely useful information for exploratory data analysis.

Figure 5.6 Dependency network pane of Decision Tree viewer

TIP In most cases, not all the input attributes are used for splitting a tree. The unselected attributes generally have less impact on the predictable attribute, and they are not displayed in the dependency network.

But be careful. In some cases, important attributes don’t appear in the tree split. For example, suppose that a tree predicts HouseOwnership. Education and Income are two attributes that have a major impact on HouseOwnership.

These two attributes are highly correlated; for example, higher education is usually associated with higher income. When the tree splits on one attribute, it is almost equivalent to splitting on the other. After the split on Income, Education is no longer an important attribute in the subtrees. As a consequence, there is no tree split based on Education. Thus Education is not displayed in the dependency network, even though it is a good predictor for HouseOwnership. This is a weakness of decision trees. We recommend building an additional model using Naïve Bayes, which doesn't have this shortcoming.
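An additional Naïve Bayes model can be defined with a single DMX statement. The sketch below uses hypothetical column names based on the example above; note that Microsoft Naïve Bayes does not accept continuous inputs, so Income is declared as DISCRETIZED.

// Hypothetical companion model; Naive Bayes evaluates every input independently,
// so correlated attributes such as Education and Income are both reported
CREATE MINING MODEL HouseOwnershipNB
(
    CustomerID     LONG KEY,
    Education      TEXT DISCRETE,
    Income         LONG DISCRETIZED,
    HouseOwnership TEXT DISCRETE PREDICT
)
USING Microsoft_Naive_Bayes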

Figure 5.7 displays a regression tree model predicting IQ. The regressor is Parents' Income. The overall view of the regression tree is similar to that of a classification tree. However, the tree nodes are different. There is no histogram bar in a regression tree; instead, each node contains a diamond bar representing the distribution of the predictable variable (a continuous value). The diamond represents the value distribution of the given node: it is centered on the mean of the predictable attribute for the cases classified in the current node, and its width is twice the standard deviation. The thinner the diamond, the more precise the prediction at this node.

Figure 5.7 Visualizing a regression tree

As explained previously, each node of a regression tree model contains a regression formula. For example, the selected node in Figure 5.7 contains the following regression formula:

IQ = 98.777+0.00008*(Parent Income-42,150.301)
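As a quick worked example (the income figure is illustrative, not a value from the book's data set), a case falling into this node with Parent Income = 50,000 would be scored as

IQ = 98.777 + 0.00008 * (50,000 - 42,150.301) ≈ 99.4

so the prediction stays close to the node's mean IQ, adjusted slightly by how far the case's Parent Income lies from the node's income mean.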

Summary

The decision tree is a very popular data mining algorithm. Microsoft Decision Trees can be used for three different data mining tasks: classification, regression, and association.

In this chapter, you were given an overview of the decision tree and the principles of the Microsoft Decision Tree algorithm. By now, you should know the concept of entropy, which is one of the popular tree-splitting scores. You also know how to use the set of algorithm parameters to tune the accuracy of a tree model. For example, COMPLEXITY_PENALTY is a parameter you will frequently use to avoid overtraining or undertraining.

You've learned how to build mining models using the decision tree algorithm for three types of data mining tasks with DMX queries. You also learned how to write prediction queries for the three different types of tree models.

The last part of the chapter taught you how to interpret a tree model using the Decision Tree viewer. Viewing a tree is not as simple as it may seem; you've learned how to identify whether a tree is overtrained and how to determine the importance of each input attribute (predictor).

Microsoft Time Series

Suppose that you are a retail store manager. You want to know the sales forecast for the next few weeks for each product category, each subcategory, and each individual item. This information can help you optimize your inventory.

You don't want to have items out of stock when there is strong customer demand, nor do you want to overstock items. When holidays approach, the sales of certain items may peak. For example, Easter triggers the sale of Easter eggs, chocolate, jelly beans, and so on. You want to know when and how much of each of these products to order.

The Microsoft Time Series algorithm is designed to solve this kind of puzzle. Let's have a closer look at this data mining technique in this chapter.

In this chapter, you will learn about:

■ The principles of the Microsoft Time Series algorithm
■ Setting the algorithm parameters
■ DMX queries related to time series
■ Interpreting a Microsoft Time Series model

