
VERIFYING MODEL ASSUMPTIONS


Before a model can be implemented, the requisite model assumptions must be verified. Using a model whose assumptions are not verified is like building a house whose foundation may be cracked. Making predictions using a model where the assumptions are violated may lead to erroneous and overoptimistic results, with costly consequences when deployed.

These assumptions (linearity, independence, normality, and constant variance) may be checked using a normal probability plot of the residuals (Figure 4.8) and a plot of the standardized residuals against the fitted (predicted) values (Figure 4.9). One evaluates a normal probability plot by judging whether systematic deviations from linearity exist in the plot; if they do, one concludes that the plotted values (here, the residuals) are not drawn from the distribution in question (here, the normal distribution). We detect no systematic deviations from linearity in the normal probability plot of the standardized residuals, and we therefore conclude that our normality assumption is intact.

The plot of the residuals versus the fits (Figure 4.9) is examined for discernible patterns. If obvious curvature exists in the scatter plot, the linearity assumption is violated. If the vertical spread of the points in the plot is systematically nonuniform, the constant variance assumption is violated. We detect no such patterns in Figure 4.9 and therefore conclude that the linearity and constant variance assumptions are intact for this example.

The independence assumption makes sense for this data set, since we would not expect that the rating for one particular cereal would depend on the rating for another cereal. Time-dependent data can be examined for order independence using a runs test or a plot of the residuals versus ordering.

Having checked that the assumptions are not violated, we may proceed with the multiple regression analysis. Minitab provides us with the multiple regression output shown in Figure 4.10.
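For readers working outside Minitab, the following is a minimal sketch of how an analogous model could be fit in Python using pandas and statsmodels. The file name cereals.csv and the column names (rating, calories, protein, fat, sodium, fiber, carbo, sugars, vitamins) are assumptions made for illustration, not something taken from the text.

    # Sketch: fit the multiple regression model for nutritional rating.
    # Assumes a local file "cereals.csv" with the column names used below.
    import pandas as pd
    import statsmodels.formula.api as smf

    cereals = pd.read_csv("cereals.csv")

    # Regress rating on the eight predictors discussed in the text.
    model = smf.ols(
        "rating ~ calories + protein + fat + sodium + fiber"
        " + carbo + sugars + vitamins",
        data=cereals,
    ).fit()

    # The summary is the analogue of the Minitab output in Figure 4.10:
    # coefficients, p-values, R-squared, and the standard error of the estimate.
    print(model.summary())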

Figure 4.8 Normal plot of the residuals.

Figure 4.9 Plot of standardized residuals versus fitted (predicted) values.
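Plots analogous to Figures 4.8 and 4.9 could be produced as sketched below, using scipy and matplotlib; the sketch refits the model so that it stands alone, and it again relies on the assumed file and column names rather than anything specified in the text.

    # Sketch: residual diagnostics analogous to Figures 4.8 and 4.9.
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf
    from scipy import stats

    cereals = pd.read_csv("cereals.csv")          # assumed file name
    model = smf.ols(
        "rating ~ calories + protein + fat + sodium + fiber"
        " + carbo + sugars + vitamins",
        data=cereals,
    ).fit()

    # Internally standardized (studentized) residuals.
    std_resid = model.get_influence().resid_studentized_internal

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Figure 4.8 analogue: normal probability plot of the standardized residuals.
    stats.probplot(std_resid, dist="norm", plot=ax1)
    ax1.set_title("Normal plot of the residuals")

    # Figure 4.9 analogue: standardized residuals versus fitted values.
    ax2.scatter(model.fittedvalues, std_resid)
    ax2.axhline(0, linestyle="--")
    ax2.set_xlabel("Fitted value")
    ax2.set_ylabel("Standardized residual")
    ax2.set_title("Residuals versus fits")

    plt.tight_layout()
    plt.show()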

Let us examine these very interesting results carefully. The estimated regression equation is as follows:

Rating = 55.9 − 0.225 Calories + 2.88 Protein − 2.00 Fat − 0.0546 Sodium + 2.57 Fiber + 1.08 Carbohydrates − 0.823 Sugars − 0.0514 Vitamins

In words, the estimated nutritional rating equals 55.9, minus 0.225 times the number of calories, plus 2.88 times the grams of protein, minus 2.00 times the grams of fat, minus 0.0546 times the milligrams of sodium, plus 2.57 times the grams of fiber, plus 1.08 times the grams of carbohydrates, minus 0.823 times the grams of sugar, minus 0.0514 times the percent RDA of vitamins.

Figure 4.10 Minitab multiple regression output.

This is the equation we may use to perform point estimation and prediction for the nutritional rating of new cereals. For example, suppose that there is a new cereal with 80 calories, 2 grams of protein, no fat, no sodium, 3 grams of fiber, 16 grams of carbohydrates, no sugars, and 0% RDA of vitamins (similar to Shredded Wheat). Then the predicted nutritional rating is 55.9 − 0.225(80) + 2.88(2) − 2.00(0) − 0.0546(0) + 2.57(3) + 1.08(16) − 0.823(0) − 0.0514(0) = 68.62, using the unrounded coefficients provided by Minitab. This prediction is remarkably close to the actual nutritional rating for Shredded Wheat of 68.2359, so that the prediction error is y − ŷ = 68.2359 − 68.62 = −0.3841.
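Readers who want to verify the arithmetic can use a short sketch like the one below, which plugs the Shredded Wheat-like values into the rounded coefficients quoted above. Because the coefficients are rounded, the result (about 68.65) differs slightly from the 68.62 obtained from Minitab's unrounded coefficients.

    # Sketch: point prediction using the rounded coefficients from the text.
    coef = {
        "Intercept": 55.9, "Calories": -0.225, "Protein": 2.88, "Fat": -2.00,
        "Sodium": -0.0546, "Fiber": 2.57, "Carbohydrates": 1.08,
        "Sugars": -0.823, "Vitamins": -0.0514,
    }
    new_cereal = {
        "Calories": 80, "Protein": 2, "Fat": 0, "Sodium": 0,
        "Fiber": 3, "Carbohydrates": 16, "Sugars": 0, "Vitamins": 0,
    }

    predicted = coef["Intercept"] + sum(
        coef[name] * value for name, value in new_cereal.items()
    )
    print(round(predicted, 2))            # about 68.65 with rounded coefficients

    # Prediction error against the actual Shredded Wheat rating of 68.2359:
    print(round(68.2359 - predicted, 2))  # about -0.41 here; -0.38 with unrounded coefficients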

Of course, point estimates have drawbacks, so, analogous to the simple linear regression case, we can find confidence intervals and prediction intervals in multiple regression as well. We can find a 95% confidence interval for the mean nutritional rating of all such cereals (with characteristics similar to those of Shredded Wheat: 80 calories, 2 grams of protein, etc.) to be (67.914, 69.326). Also, a 95% prediction interval for the nutritional rating of a randomly chosen cereal with characteristics similar to those of Shredded Wheat is (66.475, 70.764). As before, the prediction interval is wider than the confidence interval.
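A sketch of how such intervals could be obtained in statsmodels is given below. It again assumes the cereals.csv file and column names used in the earlier sketches, and the exact interval endpoints will depend on the data file actually used.

    # Sketch: 95% confidence and prediction intervals for a new cereal.
    import pandas as pd
    import statsmodels.formula.api as smf

    cereals = pd.read_csv("cereals.csv")          # assumed file name
    model = smf.ols(
        "rating ~ calories + protein + fat + sodium + fiber"
        " + carbo + sugars + vitamins",
        data=cereals,
    ).fit()

    # A Shredded Wheat-like cereal.
    new_cereal = pd.DataFrame({
        "calories": [80], "protein": [2], "fat": [0], "sodium": [0],
        "fiber": [3], "carbo": [16], "sugars": [0], "vitamins": [0],
    })

    pred = model.get_prediction(new_cereal).summary_frame(alpha=0.05)
    # mean_ci_lower / mean_ci_upper: 95% confidence interval for the mean rating
    # obs_ci_lower / obs_ci_upper:   95% prediction interval for a single cereal
    print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
                "obs_ci_lower", "obs_ci_upper"]])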

Here follow further comments on the multiple regression results given in Figure 4.10. The R² value of 99.5% is extremely high, nearly equal to the maximum possible R² of 100%. This indicates that our multiple regression model accounts for nearly all of the variability in the nutritional ratings. The standard error of the estimate, s, has a value of about 1, meaning that our typical prediction error is about one point on the nutritional rating scale, and that about 95% of our predictions (based on the normal distribution of the errors) will be within two points of the actual value. Compare this with the s value of about 9 for the simple linear regression model in Figure 4.5. Using more of the available information (that is, more predictor variables) in our regression model has allowed us to reduce our typical prediction error by a factor of about 9.

Note also that the p-values (under P) for all the predictor variables equal zero (actually, they are rounded to zero), indicating that each of the variables, including carbohydrates, belongs in the model. Recall that earlier it appeared that carbohydrates did not have a very high correlation with rating, so some modelers may have been tempted to eliminate carbohydrates from the model based on this exploratory finding.

However, as we mentioned in Chapter 3, it is often best to allow variables to remain in the model even if the EDA does not show an obvious association with the target. Here, carbohydrates was found to be a significant predictor of rating in the presence of the other predictors. Eliminating carbohydrates as a predictor in the regression produced a point estimate of 68.805 for the nutritional rating of a Shredded Wheat-like cereal, more distant from the actual rating of 68.2359 than the prediction from the model that included carbohydrates. Further, the model without carbohydrates had a decreased R² value and an s value that more than doubled, to 2.39 (not shown).

Eliminating this variable because of its seeming lack of association in the EDA phase would have been a mistake, reducing the functionality of the model and impairing its estimation and prediction precision.
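This comparison could be reproduced along the lines sketched below, once more assuming the cereals.csv file and column names introduced earlier; the R² and s values printed should be close to, though not necessarily identical with, the Minitab figures quoted in the text.

    # Sketch: compare the full model with a model that omits carbohydrates.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    cereals = pd.read_csv("cereals.csv")          # assumed file name

    full = smf.ols(
        "rating ~ calories + protein + fat + sodium + fiber"
        " + carbo + sugars + vitamins",
        data=cereals,
    ).fit()
    reduced = smf.ols(
        "rating ~ calories + protein + fat + sodium + fiber"
        " + sugars + vitamins",
        data=cereals,
    ).fit()

    for label, fit in [("with carbohydrates", full), ("without carbohydrates", reduced)]:
        s = np.sqrt(fit.mse_resid)                # standard error of the estimate
        print(f"{label}: R-squared = {fit.rsquared:.3f}, s = {s:.2f}")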


EXERCISES

1. Explain why measures of spread are required when summarizing a data set.

2. Explain the meaning of the term standard deviation to a layman who has never read a statistics or data mining book.

3. Give an example from your own experience, or from newspapers, of the use of statistical inference.

4. Give an example from your own experience, or from newspapers, of the idea of sampling error.

5. What is the meaning of the term margin of error?

6. Discuss the relationship between the width of a confidence interval and the confidence level associated with it.

7. Discuss the relationship between the sample size and the width of a confidence interval. Which is better, a wide interval or a tight interval? Why?

8. Explain clearly why we use regression analysis and for which type of variables it is appropriate.

9. Suppose that we are interested in predicting weight of students based on height. We have run a regression analysis with the resulting estimated regression equation as follows: “The estimated weight equals (−180 pounds) plus (5 pounds times the height in inches).”

a. Suppose that one student is 3 inches taller than another student. What is the estimated difference in weight?

b. Suppose that a given student is 65 inches tall. What is the estimated weight?

c. Suppose that the regression equation above was based on a sample of students ranging in height from 60 to 75 inches. Now estimate the weight of a 48-inch-tall student. Comment.

d. Explain clearly the meaning of the 5 in the equation above.

e. Explain clearly the meaning of the −180 in the equation above.

Hands-On Analysis

Use the cereals data set, included at the book series Web site, for the following exercises. Use regression to estimate rating based on fiber alone.
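As a starting point, here is a minimal sketch that fits the simple regression of rating on fiber; it assumes the cereals data have been saved as cereals.csv with columns named rating, fiber, and sugars, which is an assumption for illustration only. The same pattern, with the formula changed as noted in the final comment, serves for the multiple regression exercises that follow.

    # Sketch: simple linear regression of rating on fiber for the hands-on exercises.
    import pandas as pd
    import statsmodels.formula.api as smf

    cereals = pd.read_csv("cereals.csv")          # assumed file name

    model = smf.ols("rating ~ fiber", data=cereals).fit()
    print(model.summary())    # slope, intercept, R-squared, s, and p-values

    # For the later exercises, change the formula to "rating ~ fiber + sugars".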


10. What is the estimated regression equation?

11. Explain clearly the value of the slope coefficient you obtained in the regression.

12. What does the value of the y-intercept mean for the regression equation you obtained? Does it make sense in this example?

13. What would be a typical prediction error obtained from using this model to predict rating? Which statistic are you using to measure this? What could we do to lower this estimated prediction error?

14. How closely does our model fit the data? Which statistic are you using to measure this?

15. Find a point estimate for the rating for a cereal with a fiber content of 3 grams.

16. Find a 95% confidence interval for the true mean rating for all cereals with a fiber content of 3 grams.

17. Find a 95% prediction interval for a randomly chosen cereal with a fiber content of 3 grams.

18. Based on the regression results, what would we expect a scatter plot of rating versus fiber to look like? Why?

For the following exercises, use multiple regression to estimate rating based on fiber and sugars.

19. What is the estimated regression equation?

20. Explain clearly and completely the value of the coefficient for fiber you obtained in the regression.

21. Compare the R² values from the multiple regression and the regression done earlier in the exercises. What is going on? Will this always happen?

22. Compare the s values from the multiple regression and the regression done earlier in the exercises. Which value is preferable, and why?

CHAPTER 5

k-NEAREST NEIGHBOR
