
ADVANCED DATA MINING ALGORITHMS

Interactive Trees

The STATISTICA Interactive Trees (I-Trees) module builds trees for predicting dependent variables that are continuous (estimation) or categorical (classification). The program supports the classic Classification and Regression Tree (CART) algorithm (Breiman et al., 1984; see also Ripley, 1996) as well as the CHAID algorithm (Chi-Square Automatic Interaction Detector; Kass, 1980). The module can build trees using automatic algorithms, user-defined rules, criteria specified via an interactive graphical user interface (brushing tools), or a combination of those methods. This enables users to try various predictors and splitting criteria in combination with almost all the functions of automatic tree building.
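The heart of the CART approach described above is a splitting criterion: for each candidate threshold on a predictor, the algorithm scores how "pure" the resulting child nodes would be and keeps the best split. The following is a minimal, self-contained sketch of that idea using the Gini impurity (the data and variable names are hypothetical; this illustrates the criterion only, not STATISTICA's actual implementation):

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return (threshold, weighted_impurity) of the best binary split x <= t."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # skip degenerate splits with an empty child
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if w < best[1]:
            best = (t, w)
    return best

# Toy data: one continuous predictor, two classes, cleanly separable.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # -> (3.0, 0.0): splitting at 3.0 yields two pure children
```

CHAID differs in that it uses chi-square tests rather than an impurity measure to choose splits, but the overall search over candidate splits is analogous.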

Figure 8.2 displays the tree results layout in the I-Trees module. The main results screen is shown in Figure 8.3.

Manually Building the Tree

The I-Trees module does not build trees by default, so when you first display the Trees Results dialog, no trees have been built. (If you click the Tree Graph button at this point, a single root node is displayed, as in Figure 8.4.)

The Tree Browser

The final tree results are displayed in the workbook tree browser, which clearly identifies the number of splitting nodes and terminal nodes of the tree (Figure 8.5).

To review the statistics and other information (e.g., splitting rule) associated with each node, simply highlight it and review the summary graph in the right pane. The split nodes can be collapsed or expanded in the manner that most users are accustomed to from standard MS Windows-style tree browser controls. Another useful feature of the workbook tree browser is the ability to quickly review the effect of consecutive splits on the resulting child nodes in an animation-like manner.
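Conceptually, each entry in such a tree browser is a node carrying its case count, class distribution, and (for split nodes) its splitting rule. The sketch below mimics that per-node summary for a hand-built two-level tree; the structure and numbers are made up for illustration (loosely echoing the Iris example of Figure 8.5) and are not STATISTICA output:

```python
# Hypothetical tree: a root split and two terminal children, each with summary stats.
tree = {
    "rule": "PetalLength <= 2.45", "n": 150,
    "classes": {"setosa": 50, "other": 100},
    "left":  {"rule": None, "n": 50,  "classes": {"setosa": 50}},
    "right": {"rule": None, "n": 100, "classes": {"other": 100}},
}

def browse(node, depth=0):
    """Return indented one-line summaries, like expanding entries in a tree browser."""
    pad = "  " * depth
    kind = "split on " + node["rule"] if node.get("rule") else "terminal"
    lines = [f"{pad}n={node['n']}, {kind}, classes={node['classes']}"]
    for child in ("left", "right"):
        if child in node:
            lines += browse(node[child], depth + 1)
    return lines

print("\n".join(browse(tree)))
```

Highlighting a node in the actual browser surfaces exactly this kind of information, plus the summary graph, in the right pane.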

Advantages of I-Trees

• I-Trees is particularly optimized for very large data sets, and in many cases the raw data do not have to be stored locally for the analyses.

• It is more flexible in the handling of missing data. Because the Interactive Trees module does not support ANCOVA-like design matrices, it can, for example, handle predictors one at a time in CHAID analyses to determine the best (next) split; in the General CHAID (GCHAID) Models module, by contrast, observations with missing data for any categorical predictor are eliminated from the analysis.

• You can perform “what-if” analyses to gain better insights into your data by interactively deleting individual branches, growing other branches, and observing various results statistics for the different trees (tree models).

154 8. ADVANCED ALGORITHMS FOR DATA MINING

FIGURE 8.2 Layout of part of the I-Trees interface in STATISTICA Data Miner.

FIGURE 8.3 The results screen of the I-Trees module.

II. THE ALGORITHMS IN DATA MINING AND TEXT MINING

FIGURE 8.4 Initial tree graph showing only one node (no splitting yet).

FIGURE 8.5 Final tree results from I-Trees, using the example of the Iris data set.


• You can automatically grow some parts of the tree but manually specify splits for other branches or nodes.

• You can define specific splits and select alternative important predictors other than those chosen automatically by the program.

• You can quickly copy trees into new projects to explore alternative splits and methods for growing branches.

• You can save entire trees (projects) for later use.

Building Trees Interactively

Building trees interactively has proven popular in applied research. This style of data exploration draws on experts' knowledge about the domain under investigation and relies on their interactive choices (for how to grow the tree) to arrive at "good" (valid) models for prediction or predictive classification. In other words, instead of building trees automatically, using sophisticated algorithms for choosing good predictors and splits (for growing the branches of the tree), a user may want to determine manually which variables to include in the tree and how to split those variables to create the branches. This enables the user to experiment with different variables and scenarios and, ideally, to derive a better understanding of the phenomenon under investigation by combining his or her expertise with the analytic capabilities and options for building the tree (see also the next section).
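One way to think about a manual split is that it can be scored with the same impurity measure the automatic algorithm uses, so the analyst can see exactly what (if anything) is given up by overriding the program's choice. A minimal sketch, with hypothetical data and thresholds:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_score(xs, ys, t):
    """Weighted Gini impurity of splitting predictor xs at threshold t (lower is better)."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = ["a", "a", "b", "a", "b", "b"]

# Automatic choice: the threshold minimizing weighted impurity (excluding the max,
# which would leave one child empty).
auto_t = min(sorted(set(xs))[:-1], key=lambda t: split_score(xs, ys, t))
manual_t = 3  # an analyst's domain-driven choice

print("automatic:", auto_t, round(split_score(xs, ys, auto_t), 3))
print("manual:   ", manual_t, round(split_score(xs, ys, manual_t), 3))
```

Comparing the two scores makes the cost of the manual override explicit, which is precisely the kind of "what-if" reasoning interactive tree building supports.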

Combining Techniques

In practice, it may often be most useful to combine the automatic methods for building trees with educated guesses and domain-specific expertise. You may want to grow some portions of the tree using automatic methods and then refine and modify the choices made by the program (for how to grow the branches of the tree) based on your expertise. Another common situation in which this type of combination is called for arises when some variables chosen automatically for some splits are not easily observable because they cannot be measured reliably or economically (i.e., obtaining such measurements would be too expensive). For example, suppose the automatic analysis at some point selects a variable Income as a good predictor for the next split; however, you may not be able to obtain reliable data on income from the new sample to which you want to apply the results of the current analysis (e.g., for predicting some behavior of interest, such as whether or not a person will purchase something from your catalog). In this case, you may want to select a surrogate variable, i.e., a variable that you can observe easily and that is likely related or similar to Income with respect to its predictive power. For example, Number of years of education may be related to Income and have similar predictive power; while most people are reluctant to reveal their level of income, they are more likely to report their level of education, so the latter variable is more easily measured.
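A simple way to judge a candidate surrogate is to check how often a split on the surrogate sends each case to the same child node as the original split. The sketch below does this for the Income/Education example; all data values and thresholds are hypothetical:

```python
# Hypothetical sample: Income is hard to measure reliably, Education is easy.
income    = [20, 25, 30, 60, 70, 80]   # thousands per year (unobservable in new data)
education = [10, 12, 12, 16, 16, 18]   # years of schooling (easy to obtain)

primary   = [x <= 40 for x in income]      # original split: Income <= 40
surrogate = [x <= 13 for x in education]   # candidate surrogate: Education <= 13

# Fraction of cases routed to the same side by both splits.
agreement = sum(p == s for p, s in zip(primary, surrogate)) / len(primary)
print(agreement)  # -> 1.0: here the surrogate reproduces the split exactly
```

In real data the agreement would typically be below 1.0, and one would pick the observable variable whose split agrees best with the original.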

The I-Trees module provides a large number of options to enable users to interactively determine all aspects of the tree-building process. You can select the variables to use for each split (branch) from a list of suggested variables, determine how and where to split a variable, interactively grow the tree branch by branch or level by level, grow the entire tree automatically, delete ("prune back") individual branches of trees, and more. All of these options are provided in an efficient graphical user interface in which you can "brush" the current tree, i.e., select a specific node to grow a branch, delete a branch, etc. As in all modules for predictive data mining, the decision rules contained in the final tree built for regression or classification prediction can optionally be saved in a variety of ways for deployment in data mining projects, including C/C++, STATISTICA Visual Basic, or Predictive Model Markup Language (PMML). Hence, final trees computed via this module can quickly and efficiently be turned into solutions for predicting or classifying new observations.
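The deployment idea is straightforward: a fitted tree is just a nested set of if/else rules, so it can be flattened into standalone code in a target language. The following toy sketch generates C-like rules from a hypothetical one-split tree (this illustrates the concept only, not STATISTICA's actual code generator or the PMML format):

```python
# Hypothetical fitted tree: one split, two terminal leaves.
tree = {
    "var": "PetalLength", "thr": 2.45,
    "left": {"label": "setosa"},
    "right": {"label": "other"},
}

def to_rules(node, depth=1):
    """Recursively render a tree dict as indented C-like if/else rules."""
    pad = "    " * depth
    if "label" in node:  # terminal node: emit the predicted class
        return f'{pad}return "{node["label"]}";\n'
    code = f'{pad}if ({node["var"]} <= {node["thr"]}) {{\n'
    code += to_rules(node["left"], depth + 1)
    code += f"{pad}}} else {{\n"
    code += to_rules(node["right"], depth + 1)
    code += f"{pad}}}\n"
    return code

print("const char *classify(double PetalLength) {\n" + to_rules(tree) + "}")
```

The emitted function has no dependency on the tree-building software, which is what makes deployed trees fast to apply to new observations.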