
Supervised Learning Methods: Prediction

5.3 Multiple Linear Regression Modeling

In MLR, the association between two sets of variables is described by a linear equation that predicts the response variable from a function of the predictor variables. The estimated MLR model contains regression parameters that are estimated by the least-squares criterion in such a way that prediction is optimized. In most situations, MLR models merely provide useful approximations of the true unknown model; however, even in cases where theory is lacking, an MLR model may provide an excellent predictive equation if the model is carefully formulated from a large, representative database. The major conceptual limitation of MLR modeling based on observational studies is that one can only ascertain relationships and can never be sure about the underlying causal mechanism.
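In its general form, an MLR model with k predictors can be written as

Yi = b0 + b1X1i + b2X2i + … + bkXki + ei

where the least-squares criterion chooses the estimates of b0, b1, …, bk that minimize the sum of the squared residuals, Σ(Yi − Ŷi)².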

A significant regression relationship does not imply a cause-and-effect relationship in uncontrolled observational studies; nevertheless, MLR modeling is one of the most widely used techniques across all disciplines.

The statistical theory, methods, and computational aspects of MLR are presented in detail elsewhere.1,2


5.3.1 MLR Key Concepts and Terminology

5.3.1.1 Overall Model Fit

In MLR, the statistical significance of the overall fit is determined by an F test that compares the regression model variance to the error variance. The R² estimate is an indicator of how well the model fits the data (e.g., an R² close to 1.0 indicates that the model has accounted for almost all of the variability with the variables specified in the model). The concept of R² can be examined visually in an overlay plot of the ordered and centered response variable (describing the total variation) and the corresponding residuals (describing the residual variation) vs. the ascending observation sequence. The area of total variation not covered by the residual variation illustrates the variation explained by the model. Other model and data violations also show up in the explained-variation plot (see Figures 5.13 and 5.24 for examples of explained-variation plots). Whether a given R² value is considered large or small depends on the context of the particular study. The R² is not recommended for selecting the best model because it does not account for the presence of redundant predictor variables; the adjusted R² is recommended for model selection because the sample size and the number of predictors are used in adjusting the R² estimate. Caution must be taken in interpreting R² for models with no intercept term. As a general rule, no-intercept models should be fit only when theoretical justification exists and the data appear to fit a no-intercept framework.
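In terms of the regression (SSR), error (SSE), and total (SST) sums of squares, with n observations and p predictors, these overall fit statistics are

F = (SSR/p) / (SSE/(n − p − 1))

R² = SSR/SST = 1 − SSE/SST

R²(adjusted) = 1 − (1 − R²)(n − 1)/(n − p − 1)

which shows how the adjusted R² penalizes redundant predictors through n and p.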

5.3.1.2 Regression Parameter Estimates

In MLR, the regression model is estimated by the least-squares criterion, which finds the best-fitting line by minimizing the error in the regression. The regression model contains a Y intercept and a regression coefficient (bi) for each predictor variable. The bi measure the partial contributions of each predictor variable to the prediction of the response. Thus, each bi estimates the amount by which the mean response changes when that predictor is changed by one unit while all the other predictors are unchanged. However, if the model includes interactions or higher-order terms, it may not be possible to interpret individual regression coefficients. For example, if the equation includes both linear and quadratic terms for a given variable, we cannot physically change the value of the linear term without also changing the value of the quadratic term. To interpret the direction of the relationship between a predictor variable and the response, look at the sign (plus or minus) of its regression or b coefficient. If a b coefficient is positive, then the relationship of this variable with the response is positive; if the b coefficient is negative, then the relationship is negative. In an observational study where the true model form is unknown, interpretation of parameter estimates becomes even more complicated. A parameter estimate can be interpreted as the expected difference in response between two observations that differ by one unit on the predictor in question and have the same values for all other predictors. We cannot make inferences about changes in an observational study because we have not actually changed anything.

It may not even be possible, in principle, to change one predictor independently of all the others, nor can we draw conclusions about causality without experimental manipulation.
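As a purely illustrative example with hypothetical coefficients, consider the fitted model Ŷ = 5 + 2X1 − 3X2: here b1 = 2 means that two observations differing by one unit in X1, but with the same X2, are expected to differ by two units in mean response, while the negative sign of b2 indicates an inverse relationship between X2 and the response.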

5.3.1.3 Standardized Regression Coefficients

Two regression coefficients in the same model can be compared directly only if the predictors are measured in the same units. Standardized regression coefficients are sometimes used to compare the effects of predictors measured in different units. Standardizing the variables (to zero mean and unit standard deviation) effectively makes the standard deviation the unit of measurement. This makes sense only if the standard deviation is a meaningful quantity, which is usually the case only if the observations are sampled from well-defined databases.
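In SAS, one way to obtain standardized coefficients is the STB option of PROC REG; the following is a minimal sketch in which the data set and variable names are placeholders:

proc reg data=mydata;
   /* STB prints standardized regression coefficients, i.e., the
      estimates obtained when all variables are scaled to zero
      mean and unit standard deviation */
   model y = x1 x2 / stb;
run;
quit;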

5.3.1.4 Significance of Regression Parameters

The statistical significance of the regression parameters is determined from the partial sums of squares (SS2) and the t statistic, which is derived by dividing each parameter estimate by its standard error. If higher-order model terms such as quadratic and cross-product terms are included in the regression model, the p values based on SS2 are incorrect for the linear and main effects of the parameters. Under these circumstances, correct significance tests for the linear and main effects can be determined using the sequential sums of squares (SS1).
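Both types of sums of squares can be requested in PROC REG, as in the following sketch (names are hypothetical; the cross-product variable x1x2 is assumed to have been created in a prior DATA step, because PROC REG does not construct product terms itself):

proc reg data=mydata;
   /* SS1 = sequential (Type I) and SS2 = partial (Type II)
      sums of squares for each parameter estimate */
   model y = x1 x2 x1x2 / ss1 ss2;
run;
quit;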

Although the p value based on a t test provides the statistical significance of a given variable in predicting the response in that sample, it does not necessarily measure the importance of a predictor. An important predictor can have a large (nonsignificant) p value if the sample is small, if the predictor is measured over a narrow range, if there are large measurement errors, or if another closely related predictor is included in the equation. Conversely, an unimportant predictor can have a very small p value in a large sample.

Computing a confidence interval for a parameter estimate provides more useful information than the p value alone, but confidence intervals do not solve the problems of measurement error in the predictors or of highly correlated predictors.
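Confidence limits for the parameter estimates can be requested with the CLB option of PROC REG (a sketch with placeholder names):

proc reg data=mydata;
   /* CLB adds confidence limits for the parameter estimates;
      the confidence level is controlled with ALPHA= */
   model y = x1 x2 / clb alpha=0.05;
run;
quit;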

5.3.1.5 Model Estimation in MLR with Categorical Variables

When categorical variables are used as predictors, separate regression models are estimated for each level, or combination of levels, of the categorical variables included in the model. One of the levels is treated as the baseline, and the differences in the intercept and slope estimates of all other levels relative to this baseline level are estimated and tested for significance. The main effects of the categorical variables and the interactions between the categorical variables and the continuous predictors must be specified in the model statement to estimate differences in the intercepts and slopes, respectively.3 MLR models with categorical variables can be modeled more efficiently in the SAS general linear model (GLM) procedure,3 where GLM generates the suitable design matrix when the categorical variables are listed in the class statement. The influence of the categorical variables on the response can be examined graphically in scatterplots between the response and each predictor variable by each categorical variable. The significance of a categorical variable and the need for fitting a heterogeneous-slopes model can be checked visually by examining the interaction between the predictor and the categorical variable (see Figures 5.20 and 5.21 for examples of regression diagnostic plots suitable for testing the significance of a categorical variable). For additional details regarding the fitting of MLR with categorical variables, refer to Freund and Littell3 and SAS Institute.4
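A hedged sketch of such a model in SAS, with hypothetical data set and variable names:

proc glm data=mydata;
   class group;                      /* categorical predictor */
   /* GROUP estimates intercept differences and X*GROUP estimates
      slope differences (heterogeneous slopes), each relative to
      the baseline (last) level of GROUP */
   model y = group x x*group / solution;
run;
quit;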

5.3.1.6 Predicted and Residual Scores

After the model has been fit, predicted and residual values are usually estimated. The regression line expresses the best prediction of the response for given values of the predictor variables. The deviation of a particular observed value from the regression line (its predicted value) is called the residual value. The smaller the variability of the residual values around the regression line relative to the overall variability, the better the prediction.

Standard errors of the residuals and the studentized residual, which is the ratio of a residual to its standard error, are also useful in modeling. The studentized residual is useful in detecting outliers, because an observation with a studentized residual greater than 2.5 in absolute value can be treated as an outlier. The predicted residual for the ith observation is defined as the residual for the ith observation based on the regression model that results from dropping the ith observation from the parameter estimates. The sum of squares of the predicted residual errors is called the PRESS statistic. Another R² statistic, called the R² prediction, is useful in estimating the predictive power of the regression based on the PRESS. A large drop in the R² prediction value from the R², or a negative R² prediction value, is an indication of a very unstable regression model with low predictive potential. There are two kinds of interval estimates for the predicted value: for a given level of confidence, the confidence interval provides an interval estimate for the mean value of the response, whereas the prediction interval is an interval estimate for an individual value of a response. These interval estimates are useful in developing scorecards for observations in the database.
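These diagnostic quantities can be written to a data set with the OUTPUT statement of PROC REG; the following is a sketch with placeholder names:

proc reg data=mydata;
   model y = x1 x2;
   /* Saves predicted values, raw and studentized residuals,
      predicted (PRESS) residuals, confidence limits for the mean
      response (LCLM/UCLM), and prediction limits for an
      individual response (LCL/UCL) */
   output out=diag p=yhat r=resid student=stud press=presid
          lclm=lowmean uclm=upmean lcl=lowpred ucl=uppred;
run;
quit;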

5.3.2 Exploratory Analysis Using Diagnostic Plots

Simple scatterplots are very useful for exploring the relationship between a response and a single predictor variable in simple linear regression, but these plots are not effective in revealing the complex relationships among predictor variables or data problems in multiple linear regression. However, partial scatterplots are considered useful for detecting influential observations and multiple outliers, nonlinearity and model specification errors, multicollinearity, and heteroscedasticity problems.5 These partial plots illustrate the partial effects, or the effects of a given predictor variable after adjusting for all other predictor variables in the regression model.
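In PROC REG, one such display, the partial regression leverage plot, can be requested with the PARTIAL option (a sketch with placeholder names):

proc reg data=mydata;
   /* PARTIAL produces a partial regression leverage plot for
      each predictor in the model */
   model y = x1 x2 / partial;
run;
quit;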

Two types of partial scatterplots are considered superior for detecting regression model problems. The mechanics of these two partial plots are described below using a two-variable MLR model:

1. Augmented partial residual plot6 between the response (Y) and the predictor variable X1

Step 1. Fit a quadratic regression model:

Yi = b0 + b1X1 + b2X2 + b3X1² + ei        (Eq. 5.1)

Step 2. Add the X1 linear (b1X1) and the X1 quadratic (b3X1²) components back to the residual (ei):

APR = ei + b1X1 + b3X1²        (Eq. 5.2)

Step 3. Fit a simple linear regression between the augmented partial residual (APR) and the predictor variable X1.
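A hedged SAS sketch of these three steps (the data set ONE and all variable names are hypothetical, and the quadratic term must be created in a DATA step because PROC REG does not build product terms):

data one;
   set one;
   x1sq = x1*x1;                    /* quadratic term for Eq. 5.1 */
run;

proc reg data=one outest=est;
   model y = x1 x2 x1sq;            /* fit Eq. 5.1 */
   output out=resids r=e;           /* residuals ei */
run;
quit;

data apr;
   /* OUTEST= stores the coefficients in variables named after the
      regressors; read them once and retain them across rows */
   if _n_ = 1 then set est(keep=x1 x1sq rename=(x1=b1 x1sq=b3));
   set resids;
   apr = e + b1*x1 + b3*x1sq;       /* Eq. 5.2 */
run;

proc sgplot data=apr;               /* Step 3: plot and fit APR vs. X1 */
   scatter y=apr x=x1;
   reg y=apr x=x1;
run;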
