
BIVARIATE METHODS: SIMPLE LINEAR REGRESSION

From the document An Introduction to Data Mining (pages 94-98)

Figure 4.2 Summary statistics of customers with both the International Plan and VoiceMail Plan and with more than 200 day minutes.

We are 95% confident that the population mean number of customer service calls for all customers who have both plans and who have more than 220 minutes of day use falls between 0.875 and 2.339 calls. The margin of error for this specific subset of customers is 0.732, which indicates that our estimate of the mean number of customer service calls for this subset of customers is much less precise than for the customer base as a whole.
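The margin of error quoted above can be checked directly from the interval endpoints. The following is a minimal sketch, assuming the interval is symmetric about the sample mean (as a t-interval for a mean is); the variable names are illustrative only.

```python
# Endpoints of the 95% confidence interval quoted in the text.
lower, upper = 0.875, 2.339

# For a symmetric interval, the point estimate is the midpoint
# and the margin of error is half the interval width.
point_estimate = (lower + upper) / 2
margin_of_error = (upper - lower) / 2

print(f"point estimate  = {point_estimate:.3f}")
print(f"margin of error = {margin_of_error:.3f}")
```

The half-width works out to 0.732, matching the margin of error stated in the text.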

Confidence interval estimation can be applied to any desired target parameter.

The most widespread interval estimates are for the population mean, the population standard deviation, and the population proportion of successes.

BIVARIATE METHODS: SIMPLE LINEAR REGRESSION

So far we have discussed estimation measures for one variable at a time. Analysts, however, are often interested in bivariate methods of estimation, for example, using the value of one variable to estimate the value of a different variable.

To help us learn about regression methods for estimation and prediction, let us get acquainted with a new data set, cereals. The cereals data set, included at the book series Web site courtesy of the Data and Story Library [2], contains nutrition information for 77 breakfast cereals and includes the following variables:

• Cereal name
• Cereal manufacturer
• Type (hot or cold)
• Calories per serving
• Grams of protein
• Grams of fat
• Percentage of recommended daily allowance of vitamins (0%, 25%, or 100%)
• Weight of one serving
• Number of cups per serving
• Shelf location (1 = bottom, 2 = middle, 3 = top)
• Nutritional rating, calculated by Consumer Reports

Table 4.4 provides a peek at eight of these fields for the first 16 cereals. We are interested in estimating the nutritional rating of a cereal given its sugar content.

Figure 4.3 shows a scatter plot of the nutritional rating versus the sugar content for the 77 cereals, along with the least-squares regression line.

The regression line is written in the form $\hat{y} = b_0 + b_1 x$, called the regression equation or the estimated regression equation (ERE), where:

• $\hat{y}$ is the estimated value of the response variable
• $b_0$ is the y-intercept of the regression line
• $b_1$ is the slope of the regression line
• $b_0$ and $b_1$, together, are called the regression coefficients

TABLE 4.4 Excerpt from Cereals Data Set: Eight Fields, First 16 Cereals

| Cereal Name | Manuf. | Sugars | Calories | Protein | Fat | Sodium | Rating |
|---|---|---|---|---|---|---|---|
| 100% Bran | N | 6 | 70 | 4 | 1 | 130 | 68.4030 |
| 100% Natural Bran | Q | 8 | 120 | 3 | 5 | 15 | 33.9837 |
| All-Bran | K | 5 | 70 | 4 | 1 | 260 | 59.4255 |
| All-Bran Extra Fiber | K | 0 | 50 | 4 | 0 | 140 | 93.7049 |
| Almond Delight | R | 8 | 110 | 2 | 2 | 200 | 34.3848 |
| Apple Cinnamon Cheerios | G | 10 | 110 | 2 | 2 | 180 | 29.5095 |
| Apple Jacks | K | 14 | 110 | 2 | 0 | 125 | 33.1741 |
| Basic 4 | G | 8 | 130 | 3 | 2 | 210 | 37.0386 |
| Bran Chex | R | 6 | 90 | 2 | 1 | 200 | 49.1203 |
| Bran Flakes | P | 5 | 90 | 3 | 0 | 210 | 53.3138 |
| Cap’n’Crunch | Q | 12 | 120 | 1 | 2 | 220 | 18.0429 |
| Cheerios | G | 1 | 110 | 6 | 2 | 290 | 50.7650 |
| Cinnamon Toast Crunch | G | 9 | 120 | 1 | 3 | 210 | 19.8236 |
| Clusters | G | 7 | 110 | 3 | 2 | 140 | 40.4002 |
| Cocoa Puffs | G | 13 | 110 | 1 | 1 | 180 | 22.7364 |


Figure 4.3 Scatter plot of nutritional rating versus sugar content for 77 cereals.

In this case the ERE is given as $\hat{y} = 59.4 - 2.42(\text{sugars})$, so that $b_0 = 59.4$ and $b_1 = -2.42$. This estimated regression equation can then be interpreted as: “The estimated cereal rating equals 59.4 minus 2.42 times the sugar content in grams.” The regression line and the ERE are used as a linear approximation of the relationship between the $x$ (predictor) and $y$ (response) variables, that is, between sugar content and nutritional rating. We can use the regression line or the ERE to make estimates or predictions.

For example, suppose that we are interested in estimating the nutritional rating for a new cereal (not in the original data) that contains $x = 1$ gram of sugar. Using the ERE, we find the estimated nutritional rating for a cereal with 1 gram of sugar to be $\hat{y} = 59.4 - 2.42(1) = 56.98$. Note that this estimated value for the nutritional rating lies directly on the regression line, at the location $(x = 1, \hat{y} = 56.98)$, as shown in Figure 4.3. In fact, for any given value of $x$ (sugar content), the estimated value for $y$ (nutritional rating) lies precisely on the regression line.
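The estimate above can be reproduced with a few lines of code. This is a minimal sketch that hard-codes the two coefficients quoted in the text; the function name `predict_rating` is a hypothetical helper, not something from the book.

```python
B0 = 59.4    # y-intercept of the estimated regression equation (from the text)
B1 = -2.42   # slope of the estimated regression equation (from the text)


def predict_rating(sugars: float) -> float:
    """Estimated nutritional rating for a cereal with the given grams of sugar."""
    return B0 + B1 * sugars


# A new cereal with 1 gram of sugar.
print(round(predict_rating(1), 2))
```

Any value of sugar content can be passed in, although estimates far outside the range of sugar values in the data should be treated with caution.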

Now, there is one cereal in our data set that does have a sugar content of 1 gram, Cheerios. Its nutritional rating, however, is 50.765, not 56.98 as we estimated above for the new cereal with 1 gram of sugar. Cheerios’ point in the scatter plot is located at $(x = 1, y = 50.765)$, within the oval in Figure 4.3. The upper arrow in Figure 4.3 points to the location on the regression line directly above the Cheerios point. This is where the regression equation predicted the nutritional rating to be for a cereal with a sugar content of 1 gram. The prediction was too high by $56.98 - 50.765 = 6.215$ rating points, which represents the vertical distance from the Cheerios data point to the regression line. This vertical distance of 6.215 rating points, in general $(y - \hat{y})$, is known variously as the prediction error, estimation error, or residual.
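The residual for Cheerios can be computed as a short check. The sketch below assumes the sign convention $(y - \hat{y})$, actual minus predicted, so a prediction that is too high gives a negative residual; the helper name `residual` is illustrative.

```python
B0, B1 = 59.4, -2.42   # regression coefficients quoted in the text


def residual(sugars: float, actual_rating: float) -> float:
    """Prediction error y - y_hat for one observation."""
    predicted = B0 + B1 * sugars
    return actual_rating - predicted


# Cheerios: 1 gram of sugar, actual rating 50.765.
cheerios_residual = residual(1, 50.765)
print(f"residual = {cheerios_residual:.3f}")
print(f"vertical distance = {abs(cheerios_residual):.3f}")
```

The magnitude of the residual is the 6.215 rating points described above; the negative sign records that the point lies below the regression line.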

We of course seek to minimize the overall size of our prediction errors. Least-squares regression works by choosing the unique regression line that minimizes the sum of squared residuals over all the data points. There are alternative methods of choosing the line that best approximates the linear relationship between the variables, such as median regression, although least squares remains the most common method.
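To make the least-squares idea concrete, here is a minimal sketch using the standard closed-form solution for simple linear regression. It is fitted only to the 15 cereals shown in Table 4.4, not the full 77, so the coefficients it prints are not expected to match 59.4 and $-2.42$. The function names are illustrative.

```python
# Sugars (grams) and ratings for the 15 cereals listed in Table 4.4.
sugars = [6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13]
ratings = [68.4030, 33.9837, 59.4255, 93.7049, 34.3848, 29.5095, 33.1741,
           37.0386, 49.1203, 53.3138, 18.0429, 50.7650, 19.8236, 40.4002,
           22.7364]


def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(sugars, ratings)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"SSE at the fitted line: {sse(sugars, ratings, b0, b1):.1f}")
# Nudging either coefficient away from the fit can only increase the SSE.
print(f"SSE with slope nudged:  {sse(sugars, ratings, b0, b1 + 0.5):.1f}")
```

Comparing the two SSE values illustrates the defining property: no other line has a smaller sum of squared residuals on the same data.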

The y-intercept $b_0$ is the location on the y-axis where the regression line intercepts the y-axis, that is, the estimated value for the response variable when the predictor variable equals zero. Now, in many regression situations, a value of zero for the predictor variable would not make sense. For example, suppose that we were trying to predict elementary school student weight ($y$) based on student height ($x$). The meaning of height = 0 is unclear, so the literal meaning of the y-intercept would not make interpretive sense in this case.

However, for our data set, a value of zero for the sugar content does make sense, as several cereals contain zero grams of sugar. Therefore, for our data set, the y-intercept $b_0 = 59.4$ simply represents the estimated nutritional rating for cereals with zero sugar content. Note that none of the cereals containing zero grams of sugar have this estimated nutritional rating of exactly 59.4. The actual ratings, along with the prediction errors, are shown in Table 4.5. Note that all the predicted ratings are the same, since all these cereals had identical values for the predictor variable ($x = 0$).

The slope of the regression line indicates the estimated change in $y$ per unit increase in $x$. We interpret $b_1 = -2.42$ to mean the following: “For each increase of 1 gram in sugar content, the estimated nutritional rating decreases by 2.42 rating points.” For example, cereal A with 5 more grams of sugar than cereal B would have an estimated nutritional rating $5(2.42) = 12.1$ rating points lower than cereal B.
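The cereal A versus cereal B comparison follows directly from the regression equation, because the intercept cancels when two estimates are subtracted. As a short worked step, writing $x_A$ and $x_B$ (labels introduced here for illustration) for the sugar contents of the two cereals, with $x_A - x_B = 5$:

```latex
\hat{y}_A - \hat{y}_B
  = (b_0 + b_1 x_A) - (b_0 + b_1 x_B)
  = b_1 (x_A - x_B)
  = -2.42 \times 5
  = -12.1
```

The difference depends only on the gap in sugar content, not on how much sugar cereal B itself contains.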

The correlation coefficient $r$ for rating and sugars is $-0.76$, indicating that the nutritional rating and the sugar content are negatively correlated. It is not a coincidence that $r$ and $b_1$ are both negative. In fact, the correlation coefficient $r$ and the regression slope $b_1$ always have the same sign.
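One way to see why the signs must agree is the standard identity $b_1 = r \, (s_y / s_x)$, which is not spelled out in the text: the ratio of standard deviations is always positive, so $b_1$ inherits the sign of $r$. The sketch below checks this on the 15 cereals of Table 4.4 only, so the value of $r$ it prints is not expected to equal the $-0.76$ reported for all 77 cereals.

```python
import math

# Sugars (grams) and ratings for the 15 cereals listed in Table 4.4.
sugars = [6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13]
ratings = [68.4030, 33.9837, 59.4255, 93.7049, 34.3848, 29.5095, 33.1741,
           37.0386, 49.1203, 53.3138, 18.0429, 50.7650, 19.8236, 40.4002,
           22.7364]

n = len(sugars)
x_bar = sum(sugars) / n
y_bar = sum(ratings) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in sugars)
s_yy = sum((y - y_bar) ** 2 for y in ratings)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sugars, ratings))

r = s_xy / math.sqrt(s_xx * s_yy)   # Pearson correlation coefficient
b1 = s_xy / s_xx                    # least-squares slope
sd_ratio = math.sqrt(s_yy / s_xx)   # s_y / s_x, always positive

print(f"r = {r:.3f}, b1 = {b1:.3f}, r * (s_y / s_x) = {r * sd_ratio:.3f}")
```

Both quantities share the numerator `s_xy`, and their denominators are positive, which is the algebraic reason the signs can never differ.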

TABLE 4.5 Actual Ratings, Predicted Ratings, and Prediction Errors for Cereals with Zero Grams of Sugar

| Cereal | Actual Rating | Predicted Rating | Prediction Error $(y - \hat{y})$ |
|---|---|---|---|
| Quaker Oatmeal | 50.8284 | 59.4 | -8.5716 |
| All-Bran with Extra Fiber | 93.7049 | 59.4 | 34.3049 |
| Cream of Wheat (Quick) | 64.5338 | 59.4 | 5.1338 |
| Puffed Rice | 60.7561 | 59.4 | 1.3561 |
| Puffed Wheat | 63.0056 | 59.4 | 3.6056 |
| Shredded Wheat | 68.2359 | 59.4 | 8.8359 |
| Shredded Wheat ’n’Bran | 74.4729 | 59.4 | 15.0729 |
| Shredded Wheat Spoon Size | 72.8018 | 59.4 | 13.4018 |
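The prediction errors in Table 4.5 can be recomputed from the actual ratings. This is a minimal sketch, assuming the error is defined as actual minus predicted, $(y - \hat{y})$; under that convention Quaker Oatmeal, the only zero-sugar cereal rated below 59.4, has a negative error.

```python
B0 = 59.4  # predicted rating for every cereal with zero grams of sugar

# Actual ratings of the zero-sugar cereals, copied from Table 4.5.
actual_ratings = {
    "Quaker Oatmeal": 50.8284,
    "All-Bran with Extra Fiber": 93.7049,
    "Cream of Wheat (Quick)": 64.5338,
    "Puffed Rice": 60.7561,
    "Puffed Wheat": 63.0056,
    "Shredded Wheat": 68.2359,
    "Shredded Wheat 'n'Bran": 74.4729,
    "Shredded Wheat Spoon Size": 72.8018,
}

# Prediction error = actual rating minus predicted rating.
errors = {name: rating - B0 for name, rating in actual_ratings.items()}

for name, err in errors.items():
    print(f"{name:<28} {err:9.4f}")
```

Seven of the eight errors are positive, so for this zero-sugar subset the regression line tends to underestimate the rating.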
