
Multivariate Adaptive Regression Splines (MARSplines)

We'll use the STATISTICA Data Miner software tool to describe the MARSplines algorithm, but the ideas described can be applied to whatever software package you use.

Note: Many of the paragraphs in this section are adapted from the STATISTICA online help: StatSoft, Inc. (2008). STATISTICA (data analysis software system), version 8.0. www.statsoft.com.

The STATISTICA Multivariate Adaptive Regression Splines (MARSplines) module is a generalization of techniques (called MARS) popularized by Friedman (1991) for solving regression- and classification-type problems, with the goal of predicting the value of a set of dependent or outcome variables from a set of independent or predictor variables. MARSplines can handle both categorical and continuous variables (whether responses or predictors). With categorical responses, MARSplines will treat the problem as a classification problem; with continuous dependent variables, as a regression problem. MARSplines will automatically determine that for you.

MARSplines is a nonparametric procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, it constructs the model from a set of coefficients and basis functions that are entirely "driven" by the data. In a sense, the method follows decision trees in being based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression or classification equation. This makes MARSplines particularly suitable for problems with higher input dimensions (i.e., with more than two variables), where the curse of dimensionality would likely create problems for other techniques.

The MARSplines technique has become particularly popular in data mining because it does not assume or impose any particular type or class of relationship (e.g., linear, logistic) between the predictor variables and the dependent (outcome) variable of interest. Instead, useful models (i.e., models that yield accurate predictions) can be derived even in situations in which the relationships between the predictors and the dependent variables are non-monotone and difficult to approximate with parametric models. For more information about this technique and how it compares to other methods for nonlinear regression (or regression trees), see Hastie et al. (2001).

In linear regression, the response variable is hypothesized to depend linearly on the predictor variables. It's a parametric method, which assumes that the nature of the relationships (but not the specific parameters) between the dependent and independent variables is known a priori (e.g., is linear). By contrast, nonparametric methods do not make any such assumption as to how the dependent variables are related to the predictors. Instead, they allow the model function to be "driven" directly by the data.


Multivariate Adaptive Regression Splines (MARSplines) constructs a model from a set of coefficients and features or "basis functions" that are determined from the data. You can think of the general "mechanism" by which the MARSplines algorithm operates as multiple piecewise linear regression, where each breakpoint (estimated from the data) defines the "region of application" for a particular (very simple) linear equation.

Basis Functions

Specifically, MARSplines uses two-sided truncated functions of the form shown in Figure 8.6 as basis functions for the linear or nonlinear expansion that approximates the relationships between the response and predictor variables.

Figure 8.6 shows a simple example of two basis functions, (t – x)+ and (x – t)+ (adapted from Hastie et al., 2001, Figure 9.9). Parameter t is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots (t parameters) are also determined from the data. The plus (+) signs next to the terms (t – x) and (x – t) simply denote that only positive results of the respective equations are considered; otherwise, the respective functions evaluate to zero. This can also be seen in the illustration.
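To make the hinge behavior concrete, here is a minimal NumPy sketch of the two truncated functions; the knot value t = 0.5 is an arbitrary choice for illustration (in MARSplines the knots are estimated from the data):

```python
import numpy as np

def hinge_pos(x, t):
    """(x - t)+: the positive part of (x - t); zero wherever x <= t."""
    return np.maximum(0.0, x - t)

def hinge_neg(x, t):
    """(t - x)+: the positive part of (t - x); zero wherever x >= t."""
    return np.maximum(0.0, t - x)

x = np.linspace(0.0, 1.0, 11)
t = 0.5  # illustrative knot; a fitted model would estimate this from the data
print(hinge_neg(x, t))  # nonzero only to the left of the knot
print(hinge_pos(x, t))  # nonzero only to the right of the knot
```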

The MARSplines Model

The basis functions together with the model parameters (estimated via least squares estimation) are combined to produce the predictions given the inputs. The general MARSplines model equation (see Hastie et al., 2001, equation 9.19) is given as

$$ y = f(X) = b_0 + \sum_{m=1}^{M} b_m h_m(X) $$

where the summation is over the M nonconstant terms in the model. To summarize, y is predicted as a function of the predictor variables X (and their interactions); this function consists of an intercept parameter (b_0) and the weighted (by b_m) sum of one or more basis functions h_m(X), of the kind illustrated earlier. You can also think of this model as "selecting" a weighted sum of basis functions from the set of (a large number of) basis functions that span all values of each predictor (i.e., that set would consist of one basis function, and parameter t, for each distinct value of each predictor variable). The MARSplines algorithm then searches over the space of all inputs and predictor values (knot locations t) as well as interactions between variables. During this search, an increasingly larger number of basis functions is added to the model (selected from the set of possible basis functions) to maximize an overall least squares goodness-of-fit criterion. As a result of these operations, MARSplines automatically determines the most important independent variables as well as the most significant interactions among them. The details of this algorithm are further described in Hastie et al. (2001).

[FIGURE 8.6: MARSplines basis functions for linear and nonlinear analysis.]
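To make the model equation concrete, the following NumPy sketch evaluates a hypothetical fitted model with two hinge basis functions on a single predictor; the knot, intercept, and coefficient values are invented for illustration:

```python
import numpy as np

def hinge(x, t, sign):
    """Two-sided truncated basis: (x - t)+ if sign = +1, (t - x)+ if sign = -1."""
    return np.maximum(0.0, sign * (x - t))

# Hypothetical fitted model: y = b0 + b1*(x - 0.4)+ + b2*(0.4 - x)+
b0 = 1.0
b = np.array([2.5, -1.2])       # weights b_m for the basis functions
basis = [(0.4, +1), (0.4, -1)]  # (knot t, side) defining each h_m

def predict(x):
    H = np.column_stack([hinge(x, t, s) for t, s in basis])  # columns are h_m(x)
    return b0 + H @ b            # y = b0 + sum over m of b_m * h_m(x)

print(predict(np.array([0.1, 0.4, 0.9])))
```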

Categorical Predictors

MARSplines is well suited for tasks involving categorical predictor variables. Different basis functions are computed for each distinct value of each predictor, and the usual techniques for handling categorical variables are applied. Therefore, categorical variables (with class codes rather than continuous or ordered data values) can be accommodated by this algorithm without requiring any further modifications.
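As a sketch of that usual handling, a categorical predictor with class codes can be expanded into 0/1 indicator columns before basis functions are formed (the variable and class names here are invented):

```python
import numpy as np

# Hypothetical categorical predictor holding class codes
region = np.array(["north", "south", "south", "east"])

# One indicator column per distinct class code
classes = np.unique(region)  # ['east', 'north', 'south']
indicators = (region[:, None] == classes).astype(float)
print(indicators)  # each column behaves like a simple basis function for its class
```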

Multiple Dependent (Outcome) Variables

The MARSplines algorithm can be applied to multiple dependent (outcome) variables, whether continuous or categorical. When the dependent variables are continuous, the algorithm will treat the task as regression; otherwise, as a classification problem. When the outputs are multiple, the algorithm will determine a common set of basis functions in the predictors but estimate different coefficients for each dependent variable. This method of treating multiple outcome variables is not unlike some neural network architectures, where multiple outcome variables can be predicted from common neurons and hidden layers; in the case of MARSplines, multiple outcome variables are predicted from common basis functions, with different coefficients.
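A minimal sketch of this idea, assuming a basis matrix H has already been built from the predictors: the same columns of H serve every outcome, and a single least squares solve yields a separate coefficient vector per dependent variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 1, n)

# Shared basis matrix: intercept plus two hinge functions at an illustrative knot t = 0.5
H = np.column_stack([np.ones(n), np.maximum(0, x - 0.5), np.maximum(0, 0.5 - x)])

# Two continuous outcomes, simulated here purely for illustration
Y = np.column_stack([2 + 3 * np.maximum(0, x - 0.5) + rng.normal(0, 0.1, n),
                     1 - 2 * np.maximum(0, 0.5 - x) + rng.normal(0, 0.1, n)])

# Each column of B holds the coefficients for one outcome, over the same basis
B, *_ = np.linalg.lstsq(H, Y, rcond=None)
print(B.round(2))  # common basis functions, different coefficients per outcome
```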

MARSplines and Classification Problems

Because MARSplines can handle multiple dependent variables, it is easy to apply the algorithm to classification problems as well. First, it will code the classes in the categorical response variable into multiple indicator variables (e.g., 1 = observation belongs to class k, 0 = observation does not belong to class k); then MARSplines will fit a model and compute predicted (continuous) values or scores; and finally, for prediction, it will assign each case to the class for which the highest score is predicted (see also Hastie et al., 2001, for a description of this procedure).
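A sketch of the final score-to-class step, assuming a matrix of predicted (continuous) scores with one column per indicator-coded class:

```python
import numpy as np

classes = np.array(["low", "medium", "high"])  # illustrative class labels

# Hypothetical predicted scores: rows are cases, columns follow the classes above
scores = np.array([[ 0.9, 0.2, -0.1],
                   [ 0.1, 0.7,  0.3],
                   [-0.2, 0.4,  0.8]])

# Each case is assigned to the class with the highest predicted score
print(classes[np.argmax(scores, axis=1)])  # ['low' 'medium' 'high']
```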


Model Selection and Pruning

In general, nonparametric models are adaptive and can exhibit a high degree of flexibility that may ultimately result in overfitting if no measures are taken to counteract it. Although overfit models can achieve zero error on training data (provided they have a sufficiently large number of parameters), they will almost certainly perform poorly when presented with new observations or instances (i.e., they do not generalize well to the prediction of "new" cases). MARSplines tends to overfit the data as well. To combat this problem, it uses a pruning technique (similar to that in classification trees) to limit the complexity of the model by reducing the number of its basis functions.
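A minimal sketch of such backward pruning, assuming a full basis matrix H (with the intercept in column 0) has already been built: at each step, drop the basis function whose removal increases the residual error the least. Production implementations score candidate subsets with a generalized cross-validation (GCV) criterion rather than raw training error, which additionally penalizes model complexity.

```python
import numpy as np

def sse(H, y, cols):
    """Residual sum of squares of a least squares fit on the given basis columns."""
    b, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
    r = y - H[:, cols] @ b
    return r @ r

def prune(H, y, target_terms):
    """Greedily remove the basis function whose removal hurts the fit least."""
    cols = list(range(H.shape[1]))
    while len(cols) > target_terms:
        # Consider dropping each nonconstant column (always keep the intercept)
        candidates = [[c for c in cols if c != drop] for drop in cols[1:]]
        cols = min(candidates, key=lambda subset: sse(H, y, subset))
    return cols
```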

MARSplines as a Predictor (Feature) Selection Method

The selection and pruning of basis functions in MARSplines makes this method a very powerful tool for predictor selection. The MARSplines algorithm will pick up only those basis functions (and those predictor variables) that make a "sizeable" contribution to the prediction. The Results dialog of the Multivariate Adaptive Regression Splines (MARSplines) module will clearly identify (highlight) only those variables associated with basis functions that were retained for the final solution (model).
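As a sketch, once the retained basis functions are known, the selected predictors can simply be read off them (the record structure and variable names here are invented for illustration):

```python
# Hypothetical basis functions retained after pruning:
# each entry records (predictor name, knot t, hinge side)
retained = [("income", 42000.0, +1),
            ("income", 42000.0, -1),
            ("age",    35.0,    +1)]

# The selected predictors are the variables appearing in any retained term
print(sorted({var for var, _, _ in retained}))  # ['age', 'income']
```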

Applications

MARSplines has become very popular recently for finding predictive models for "difficult" data mining problems, i.e., when the predictor variables do not exhibit simple and/or monotone relationships to the dependent variable of interest. Because of the specific manner in which MARSplines selects predictors (basis functions) for the model, it generally does well in situations in which regression-tree models are also appropriate, i.e., where hierarchically organized successive splits on the predictor variables yield accurate predictions. In fact, this technique is as much a generalization of regression trees as it is of multiple regression. The "hard" binary splits are replaced by "smooth" basis functions.

A large number of graphs can be computed to evaluate the quality of the fit and to aid with the interpretation of results. Various code generator options are available for saving estimated (fully parameterized) models for deployment in C/C++/C#, Visual Basic, or PMML.

The MARSplines Algorithm

Implementing MARSplines involves a two-step procedure that is applied successively until a desired model is found. In the first step, we build the model (increase its complexity) by repeatedly adding basis functions until a user-defined maximum level of complexity is reached. (We start with the simplest model, the constant; then we iteratively add, from all possible terms, the one that most reduces the training error.) Once we have built a very complex model, we begin a backward procedure to iteratively remove the least significant basis functions from the model, i.e., those whose removal leads to the smallest reduction in the (least squares) goodness of fit.
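A compact sketch of the forward (model-building) step under simplifying assumptions: a single predictor, candidate knots restricted to the observed data values, and plain least squares error as the selection criterion. The backward pruning step described earlier would then trim this deliberately oversized basis.

```python
import numpy as np

def forward_pass(x, y, max_pairs=3):
    """Greedily add the pair of hinge basis functions that most reduces training SSE."""
    H = np.ones((len(x), 1))                 # start with the constant model
    for _ in range(max_pairs):
        best_sse, best_pair = np.inf, None
        for t in np.unique(x):               # candidate knots: observed values
            pair = np.column_stack([np.maximum(0, x - t), np.maximum(0, t - x)])
            Htry = np.hstack([H, pair])
            b, *_ = np.linalg.lstsq(Htry, y, rcond=None)
            r = y - Htry @ b
            if r @ r < best_sse:
                best_sse, best_pair = r @ r, pair
        H = np.hstack([H, best_pair])        # commit the winning pair of terms
    return H

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.5, 2 * x, 1 - x) + rng.normal(0, 0.05, 200)
print(forward_pass(x, y).shape)  # (200, 7): constant plus 3 pairs of hinge terms
```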


MARSplines is a local nonparametric method that builds a piecewise linear regression model; it uses separate regression slopes in distinct intervals of the predictor variable space (Figure 8.7).

The slope of the piecewise regression line is allowed to change from one interval to the next as the two "knot" points are crossed; knots mark the end of one region and the beginning of another. Like CART, its structure is found first by overfitting and then pruning back.

The major advantage of MARSplines is that it automates all those aspects of regression modeling that are difficult and time-consuming to conduct by hand:

• Selecting which predictors to use for building models;

• Transforming variables to account for nonlinear relationships;

• Detecting interactions that are important;

• Self-testing to ensure that the model will work on future data.

The result is a more accurate and more complete model than could be built by hand, especially by inexperienced modelers.