
5.3.4 Violations of Regression Model Assumptions 11

If sample regression data violates one or more of the MLR assumptions, the results of the analysis may be incorrect or misleading. The assumptions for a valid MLR are:

Model parameters are correctly specified.

Residuals from the regression are independent and have zero mean, constant variance, and normal distribution.

Influential outliers are absent.

Multicollinearity is not present.

5.3.4.1 Model Specification Error

When important predictor variables or significant higher-order model terms (quadratic and cross-product) are omitted from the regression model, the residual error term no longer has the random error property. The augmented partial residual plot is very efficient in detecting the need for a nonlinear (quadratic) term. The need for an interaction between any two predictor variables could be evaluated in the "interaction test" plot. Simple scatterplots between a predictor and the response variable by an indicator variable could indicate the need for an interaction term between a predictor and the indicator variable. The significance of any omitted predictor variable can only be evaluated by including it in the model and following the usual diagnostic routine.
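As an illustrative sketch (not the author's macro), the following PROC REG step screens for a quadratic and a cross-product term; the data set and variable names (work.mydata, y, x1, x2) are hypothetical:

/* Hypothetical data set and variable names: work.mydata, y, x1, x2. */
data work.aug;
   set work.mydata;
   x1sq = x1*x1;   /* candidate quadratic term */
   x1x2 = x1*x2;   /* candidate cross-product (interaction) term */
run;

proc reg data=work.aug;
   base:      model y = x1 x2;
   augmented: model y = x1 x2 x1sq x1x2 / partial;  /* PARTIAL requests partial regression plots */
run;
quit;

If the quadratic or cross-product coefficients are significant and the partial plots straighten out, the added terms belong in the model.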

5.3.4.2 Serial Correlation Among the Residuals

In time series or spatially correlated data, the residuals are usually not independent and are positively correlated among adjacent observations. This condition is known as serial correlation or first-order autocorrelation. When the first-order autocorrelation is severe (>0.3), the standard errors for the parameter estimates are underestimated. The significance of the first-order autocorrelation could be evaluated by the Durbin–Watson test and an approximate test based on the 2/n critical value criterion. The cyclic pattern observed in the case of significant positive autocorrelation can be evaluated by examining the trend plot of the residuals against the observation sequence (see Figure 5.14 for an example of an autocorrelation detection plot). The SAS AUTOREG procedure available in the SAS/ETS12 module provides an effective method of adjusting for autocorrelation. A user-friendly SAS macro available in the author's SAS macro page13 adjusts for autocorrelation using the ETS/AUTOREG procedure.
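As a minimal sketch of this workflow (the time series data set work.ts and the regressors y, x1, x2 are assumed for illustration), the Durbin–Watson statistic can be requested in PROC REG and an AR(1) error adjustment fitted with PROC AUTOREG:

proc reg data=work.ts;
   model y = x1 x2 / dw;              /* DW prints the Durbin-Watson statistic */
   output out=work.res r=resid;       /* residuals for the trend plot against the observation sequence */
run;
quit;

proc autoreg data=work.ts;
   model y = x1 x2 / nlag=1 dwprob;   /* AR(1) error model with Durbin-Watson p-values */
run;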

5.3.4.3 Influential Outliers

The presence of significant outliers produces biased regression estimates and reduces the predictive power of the regression model. An influential outlier may act as a high-leverage point, distorting the fitted equation and causing the model to fit poorly. The SAS REG procedure has many powerful influence diagnostic statistics.14 If the absolute value of the Student residual for a given observation is greater than 2.5, that observation could be treated as a significant outlier. High-leverage data points are highly influential and have large hat values. The DFFITS statistic shows the impact of each data point by estimating the change in the predicted value, in standardized units, when the ith observation is excluded from the model; a DFFITS statistic greater than 1.5 could be used as a cutoff value for detecting influential observations. An outlier detection bubble plot of the Student residuals against the hat values identifies outliers when they fall outside the 2.5 boundary line and flags influential points when the diameter of the bubble, which is proportional to DFFITS, is relatively large. Robust regression methods based on iteratively reweighted least squares are available to minimize the impact of influential outliers.15 A user-friendly SAS macro available in the author's SAS macro page13 adjusts for influential observations based on robust regression by the HUBER and TUKEY methods.
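A hedged sketch of these diagnostics follows; the cutoffs mirror the text, while the data set and variable names (work.mydata, y, x1, x2) and the use of PROC ROBUSTREG in place of the author's macro are assumptions:

proc reg data=work.mydata;
   model y = x1 x2 / influence r;                   /* prints hat values, DFFITS, and residual diagnostics */
   output out=work.diag student=stud h=lev dffits=dff;
run;
quit;

data work.flagged;                                  /* apply the 2.5 and 1.5 cutoffs from the text */
   set work.diag;
   outlier     = (abs(stud) > 2.5);
   influential = (abs(dff)  > 1.5);
run;

proc robustreg data=work.mydata method=m(weightfunction=huber);  /* Huber M-estimation; BISQUARE gives the Tukey weights */
   model y = x1 x2;
run;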

5.3.4.4 Multicollinearity

When a predictor variable is nearly a linear combination of other predictors in the model, the affected estimates are unstable and have high standard errors. If multicollinearity among the predictors is strong, the partial regression estimates may have the wrong sign or size and are unstable.

If a predictor involved in a collinear relationship is removed from the model, the sign and size of the remaining predictor can change dramatically. The fitting of higher-order polynomials of a predictor variable with a mean not equal to zero can create difficult multicollinearity problems.

PROC REG provides the VIF and COLLINOINT options for detecting multicollinearity. Condition indices >30 indicate the presence of severe multicollinearity. The VIF option provides the variance inflation factors, which measure the inflation in the variances of the parameter estimates due to the multicollinearity that exists among the predictor variables. A VIF value greater than 10 is usually considered significant. The presence of severe multicollinearity could be detected graphically in the VIF plot when the partial leverage points shrink and form a cluster near the mean of the predictor variable relative to the partial residual. One of the remedial measures for multicollinearity is to redefine highly correlated variables. For example, if X and Y are highly correlated, they could be replaced in a linear regression by X + Y and X – Y without changing the fit of the model or the statistics for the other predictors. User-friendly SAS macros available in the author's SAS macro page13 adjust for multicollinearity based on ridge regression15 and incomplete principal component regression15 methods.
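A minimal sketch of the detection step and a ridge trace is given below; the data set and variable names are hypothetical, and the ridge constants are arbitrary illustration values:

proc reg data=work.mydata outest=work.rdg outvif ridge=0 to 0.1 by 0.01;
   model y = x1 x2 x3 / vif collinoint;   /* variance inflation factors and intercept-adjusted condition indices */
run;
quit;

proc print data=work.rdg;                 /* ridge trace: watch the estimates stabilize as the ridge constant grows */
run;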

5.3.4.5 Heteroscedasticity in Residual Variance

Nonconstancy of the error variance occurs in MLR when the residual variance is not constant and shows a trend with the change in the predicted value. The standard errors of the parameters become incorrect, resulting in incorrect significance tests and confidence interval estimates. A fan pattern, like the profile of a megaphone, with a noticeable flare either to the right or to the left in the plot of residuals against predicted values is an indication of significant heteroscedasticity. The Breusch–Pagan test,11 based on the significance of a linear model using the squared absolute residuals as the response and all combinations of the predictor variables as predictors, is recommended for detecting heteroscedasticity. However, the presence of significant outliers and non-normality may be confounded with heteroscedasticity and may interfere with its detection. If both nonlinearity and unequal variances are present, employing a transformation on the response may have the effect of simultaneously improving linearity and promoting equality of the variances. User-friendly SAS macros available in the author's SAS macro page13 adjust for heteroscedasticity based on Box–Cox11 regression and heterogeneity regression models using the MIXED model approach.16
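As a rough sketch, the fan pattern can be inspected by saving residuals and predicted values, and a Box–Cox power transformation of the response can be searched with PROC TRANSREG; the data set and variable names are assumed, and these steps stand in for the author's macros:

proc reg data=work.mydata;
   model y = x1 x2;
   output out=work.res p=pred r=resid;    /* save predicted values and residuals */
run;
quit;

proc sgplot data=work.res;                /* a megaphone-shaped scatter signals heteroscedasticity */
   scatter x=pred y=resid;
   refline 0 / axis=y;
run;

proc transreg data=work.mydata;
   model boxcox(y) = identity(x1 x2);     /* searches for a variance-stabilizing power of the response */
run;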

5.3.4.6 Non-Normality of Residuals

Multiple linear regression models are fairly robust against violations of the normality assumption, especially in large samples. Signs of non-normality are significant skewness (lack of symmetry) and/or kurtosis (light-tailedness or heavy-tailedness). The normal probability plot (normal quantile–quantile [Q–Q] plot), along with a normality test,17 can provide information on the normality of the residual distribution. In the case of non-normality, fitting generalized linear models based on the SAS GENMOD18 procedure or employing a transformation on the response or on one or more predictor variables may result in a more powerful test. However, if only a small number of data points (<32) is available, non-normality can be difficult to detect. If the sample size is large (>300), the normality test may detect statistically significant but trivial departures from normality that will have no real effect on the multiple linear regression tests (see Figures 5.14 and 5.15 for examples of model violation detection plots).
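A brief sketch of this check, assuming the residuals have been saved to a data set work.res as in the earlier examples, uses PROC UNIVARIATE for the normality tests and the Q–Q plot:

proc univariate data=work.res normal;          /* NORMAL requests Shapiro-Wilk and related tests */
   var resid;
   qqplot resid / normal(mu=est sigma=est);    /* normal Q-Q plot with estimated mean and sigma */
run;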
