Figure 4.2 Summary statistics of customers with both the International Plan and VoiceMail Plan and with more than 200 day minutes.
We are 95% confident that the population mean number of customer service calls for all customers who have both plans and who have more than 220 minutes of day use falls between 0.875 and 2.339 calls. The margin of error for this specific subset of customers is 0.732, which indicates that our estimate of the mean number of customer service calls for this subset of customers is much less precise than for the customer base as a whole.
Confidence interval estimation can be applied to any desired target parameter.
The most widespread interval estimates are for the population mean, the population standard deviation, and the population proportion of successes.
BIVARIATE METHODS: SIMPLE LINEAR REGRESSION
So far we have discussed estimation measures for one variable at a time. Analysts, however, are often interested in bivariate methods of estimation, for example, using the value of one variable to estimate the value of a different variable.
To help us learn about regression methods for estimation and prediction, let us get acquainted with a new data set,cereals. Thecerealsdata set, included at the book series Web site courtesy of theData and Story Library[2], contains nutrition information for 77 breakfast cereals and includes the following variables:
r Cereal name r Cereal manufacturer r Type (hot or cold) r Calories per serving r Grams of protein
rGrams of fat
rPercentage of recommended daily allowance of vitamins (0% 25%, or 100%) rWeight of one serving
rNumber of cups per serving
rShelf location (1=bottom, 2=middle, 3=top) rNutritional rating, calculated byConsumer Reports
Table 4.4 provides a peek at the eight of these fields for the first 16 cereals. We are interested in estimating the nutritionalratingof a cereal given itssugarcontent.
Figure 4.3 shows a scatter plot of the nutritional rating versus the sugar content for the 77 cereals, along with the least-squares regression line.
The regression line is written in the form ˆy=b0+b1x, called theregression equationor theestimated regression equation(ERE), where:
r yˆis the estimated value of the response variable rb0is they-interceptof the regression line rb1is theslopeof the regression line
rb0andb1, together, are called theregression coefficients
TABLE 4.4 Excerpt fromCerealsData Set: Eight Fields, First 16 Cereals
Cereal Name Manuf. Sugars Calories Protein Fat Sodium Rating
100% Bran N 6 70 4 1 130 68.4030
100% Natural Bran Q 8 120 3 5 15 33.9837
All-Bran K 5 70 4 1 260 59.4255
All-Bran Extra Fiber K 0 50 4 0 140 93.7049
Almond Delight R 8 110 2 2 200 34.3848
Apple Cinnamon Cheerios G 10 110 2 2 180 29.5095
Apple Jacks K 14 110 2 0 125 33.1741
Basic 4 G 8 130 3 2 210 37.0386
Bran Chex R 6 90 2 1 200 49.1203
Bran Flakes P 5 90 3 0 210 53.3138
Cap’n’Crunch Q 12 120 1 2 220 18.0429
Cheerios G 1 110 6 2 290 50.7650
Cinnamon Toast Crunch G 9 120 1 3 210 19.8236
Clusters G 7 110 3 2 140 40.4002
Cocoa Puffs G 13 110 1 1 180 22.7364
BIVARIATE METHODS: SIMPLE LINEAR REGRESSION 77
Figure 4.3 Scatter plot of nutritional rating versus sugar content for 77 cereals.
In this case the ERE is given as ˆy=59.4−2.42(sugars), so thatb0=59.4 andb1 = −2.42. This estimated regression equation can then be interpreted as: “The estimated cereal rating equals 59.4 minus 2.42 times the sugar content in grams.” The regression line and the ERE are used as alinear approximationof the relationship between thex(predictor) andy(response) variables, that is, between sugar content and nutritional rating. We can use the regression line or the ERE to make estimates or predictions.
For example, suppose that we are interested in estimating the nutritional rating for a new cereal (not in the original data) that containsx=1 gram of sugar. Using the ERE, we find the estimated nutritional rating for a cereal with 1 gram of sugar to be
ˆ
y=59.4−2.42(1)=56.98.Note that this estimated value for the nutritional rating lies directly on the regression line, at the location (x=1,yˆ =56.98), as shown in Figure 4.3. In fact, for any given value ofx(sugar content), the estimated value fory (nutritional rating) lies precisely on the regression line.
Now, there is one cereal in our data set that does have a sugar content of 1 gram, Cheerios. Its nutrition rating, however, is 50.765, not 56.98 as we estimated above for the new cereal with 1 gram of sugar. Cheerios’ point in the scatter plot is located at (x =1,y=50.765), within the oval in Figure 4.3. Now, the upper arrow in Figure 4.3 is pointing to a location on the regression line directly above the Cheerios point. This is where the regression equation predicted the nutrition rating to be for a cereal with a sugar content of 1 gram. The prediction was too high by 56.98−50.765=6.215 rating points, which represents the vertical distance from the Cheerios data point to the regression line. This vertical distance of 6.215 rating points, in general (y−y),ˆ is known variously as theprediction error,estimation error, orresidual.
We of course seek to minimize the overall size of our prediction errors. Least-squares regressionworks by choosing the unique regression line that minimizes the sum of squared residuals over all the data points. There are alternative methods of choosing the line that best approximates the linear relationship between the vari-ables, such as median regression, although least squares remains the most common method.
The y-interceptb0 is the location on they-axis where the regression line in-tercepts the y-axis, that is, the estimated value for the response variable when the predictor variable equals zero. Now, in many regression situations, a value of zero for the predictor variable would not make sense. For example, suppose that we were trying to predict elementary school student weight (y) based on student height (x).
The meaning ofheight=0 is unclear, so that the denotative meaning of they-intercept would not make interpretive sense in this case.
However, for our data set, a value of zero for the sugar content does make sense, as several cereals contain zero grams of sugar. Therefore, for our data set, the y-interceptb0 =59.4 simply represents the estimated nutritional rating for cereals with zero sugar content. Note that none of the cereals containing zero grams of sugar have this estimated nutritional rating of exactly 59.4. The actual ratings, along with the prediction errors, are shown in Table 4.5. Note that all the predicted ratings are the same, since all these cereals had identical values for the predictor variable (x=0).
The slope of the regression line indicates the estimated change in yper unit increase inx. We interpretb1= −2.42 to mean the following: “For each increase of 1 gram in sugar content, the estimated nutritional rating decreases by 2.42 rat-ing points.” For example, cereal A with 5 more grams of sugar than cereal B would have an estimated nutritional rating 5(2.42)=12.1 ratings points lower than cereal B.
The correlation coefficientrforratingandsugarsis−0.76, indicating that the nutritional rating and the sugar content are negatively correlated. It is not a coincidence that both r andb1 are both negative. In fact, the correlation coefficientr and the regression slopeb1always have the same sign.
TABLE 4.5 Actual Ratings, Predicted Ratings, and Prediction Errors for Cereals with Zero Grams of Sugar
Cereal Actual Rating Predicted Rating Prediction Error
Quaker Oatmeal 50.8284 59.4 −8.5716
All-Bran with Extra Fiber 93.7049 59.4 34.3049
Cream of Wheat (Quick) 64.5338 59.4 5.1338
Puffed Rice 60.7561 59.4 1.3561
Puffed Wheat 63.0056 59.4 3.6056
Shredded Wheat 68.2359 59.4 8.8359
Shredded Wheat ’n’Bran 74.4729 59.4 15.0729
Shredded Wheat Spoon Size 72.8018 59.4 13.4018