PROBABILITY AND STATISTICS FOR ENGINEERS AND SCIENTISTS
Third Edition
tion of the accompanying code (“the product”) cannot and do not warrant the performance or results that may be obtained by using the product. The product is sold “as is” without warranty of merchantability or fitness for any particular purpose. AP warrants only that the magnetic diskette(s) on which the code is recorded is free from defects in material and faulty workmanship under the normal use and service for a period of ninety (90) days from the date the product is delivered. The purchaser’s sole and exclusive remedy in the event of a defect is expressly limited to either replacement of the diskette(s) or refund of the purchase price, at AP’s sole discretion.
In no event, whether as a result of breach of contract, warranty, or tort (including negligence), will AP or anyone who has been involved in the creation or production of the product be liable to purchaser for any damages, including any lost profits, lost savings, or other incidental or consequential damages arising out of the use or inability to use the product or any modifications thereof, or due to the contents of the code, even if AP has been advised on the possibility of such damages, or for any claim by any other party.
Any request for replacement of a defective diskette must be postage prepaid and must be accompanied by the original defective diskette, your mailing address and telephone number, and proof of date of purchase and purchase price. Send such requests, stating the nature of the problem, to Academic Press Customer Service, 6277 Sea Harbor Drive, Orlando, FL 32887, 1-800-321-5068. AP shall have no obligation to refund the purchase price or to replace a diskette based on claims of defects in the nature or operation of the product.
Some states do not allow limitation on how long an implied warranty lasts, nor exclu- sions or limitations of incidental or consequential damage, so the above limitations and exclusions may not apply to you. This warranty gives you specific legal rights, and you may also have other rights, which vary from jurisdiction to jurisdiction.
The re-export of United States original software is subject to the United States laws under the Export Administration Act of 1969 as amended. Any further sale of the product shall be in compliance with the United States Department of Commerce Administra- tion regulations. Compliance with such regulations is your responsibility and not the responsibility of AP.
PROBABILITY AND STATISTICS FOR ENGINEERS AND SCIENTISTS
■
Third Edition
■Sheldon M. Ross
Department of Industrial Engineering and Operations Research University of California, Berkeley
Amsterdam Boston Heidelberg London New York Oxford Paris San Diego San Francisco Singapore Sydney Tokyo
84 Theobald’s Road, London WC1X 8RR, UK This book is printed on acid-free paper.
Copyright © 2004, Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected].
You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting
“Customer Support” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data Application submitted
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library ISBN: 0-12-598057-4 (Text)
ISBN: 0-12-598059-0 (CD-ROM)
For all information on all Academic Press publications visit our Web site at www.academicpress.com Printed in the United States of America
04 05 06 07 08 09 9 8 7 6 5 4 3 2 1
Preface . . . xiii
Chapter 1 Introduction to Statistics . . . 1
1.1 Introduction . . . 1
1.2 Data Collection and Descriptive Statistics . . . 1
1.3 Inferential Statistics and Probability Models . . . 2
1.4 Populations and Samples . . . 3
1.5 A Brief History of Statistics . . . 3
Problems . . . 7
Chapter 2 Descriptive Statistics . . . 9
2.1 Introduction . . . 9
2.2 Describing Data Sets . . . 9
2.2.1 Frequency Tables and Graphs . . . 10
2.2.2 Relative Frequency Tables and Graphs . . . 10
2.2.3 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots . . . 14
2.3 Summarizing Data Sets . . . 17
2.3.1 Sample Mean, Sample Median, and Sample Mode . . . 17
2.3.2 Sample Variance and Sample Standard Deviation . . . 22
2.3.3 Sample Percentiles and Box Plots . . . 24
2.4 Chebyshev’s Inequality . . . 27
2.5 Normal Data Sets . . . 31
2.6 Paired Data Sets and the Sample Correlation Coefficient . . . 33
Problems . . . 41
Chapter 3 Elements of Probability . . . 55
3.1 Introduction . . . 55
3.2 Sample Space and Events . . . 56
3.3 Venn Diagrams and the Algebra of Events . . . 58
3.4 Axioms of Probability . . . 59
3.5 Sample Spaces Having Equally Likely Outcomes . . . 61
3.6 Conditional Probability . . . 67
3.7 Bayes’ Formula . . . 70 vii
3.8 Independent Events . . . 76
Problems . . . 80
Chapter 4 Random Variables and Expectation . . . 89
4.1 Random Variables . . . 89
4.2 Types of Random Variables . . . 92
4.3 Jointly Distributed Random Variables . . . 95
4.3.1 Independent Random Variables . . . 101
*4.3.2 Conditional Distributions . . . 105
4.4 Expectation . . . 107
4.5 Properties of the Expected Value . . . 111
4.5.1 Expected Value of Sums of Random Variables . . . 115
4.6 Variance . . . 118
4.7 Covariance and Variance of Sums of Random Variables . . . 121
4.8 Moment Generating Functions . . . 126
4.9 Chebyshev’s Inequality and the Weak Law of Large Numbers . . . 127
Problems . . . 130
Chapter 5 Special Random Variables . . . 141
5.1 The Bernoulli and Binomial Random Variables . . . 141
5.1.1 Computing the Binomial Distribution Function . . . 147
5.2 The Poisson Random Variable . . . 148
5.2.1 Computing the Poisson Distribution Function . . . 155
5.3 The Hypergeometric Random Variable . . . 156
5.4 The Uniform Random Variable . . . 160
5.5 Normal Random Variables . . . 168
5.6 Exponential Random Variables . . . 175
*5.6.1 The Poisson Process . . . 179
*5.7 The Gamma Distribution . . . 182
5.8 Distributions Arising from the Normal . . . 185
5.8.1 The Chi-Square Distribution . . . 185
*5.8.1.1 The Relation Between Chi-Square and Gamma Random Variables . . . 187
5.8.2 Thet-Distribution . . . 189
5.8.3 TheF-Distribution . . . 191
*5.9 The Logistics Distribution . . . 192
Problems . . . 194
Chapter 6 Distributions of Sampling Statistics . . . 201
6.1 Introduction . . . 201
6.2 The Sample Mean . . . 202
6.3 The Central Limit Theorem . . . 204
6.3.1 Approximate Distribution of the Sample Mean . . . 210
6.3.2 How Large a Sample is Needed? . . . 212
6.4 The Sample Variance . . . 213
6.5 Sampling Distributions from a Normal Population . . . 214
6.5.1 Distribution of the Sample Mean . . . 215
6.5.2 Joint Distribution ofXandS2 . . . 215
6.6 Sampling from a Finite Population . . . 217
Problems . . . 221
Chapter 7 Parameter Estimation . . . 229
7.1 Introduction . . . 229
7.2 Maximum Likelihood Estimators . . . 230
*7.2.1 Estimating Life Distributions . . . 238
7.3 Interval Estimates . . . 240
7.3.1 Confidence Interval for a Normal Mean When the Variance is Unknown . . . 246
7.3.2 Confidence Intervals for the Variances of a Normal Distribution . . . 251
7.4 Estimating the Difference in Means of Two Normal Populations . . . 253
7.5 Approximate Confidence Interval for the Mean of a Bernoulli Random Variable . . . 260
*7.6 Confidence Interval of the Mean of the Exponential Distribution . . . 265
*7.7 Evaluating a Point Estimator . . . 266
*7.8 The Bayes Estimator . . . 272
Problems . . . 277
Chapter 8 Hypothesis Testing . . . 291
8.1 Introduction . . . 291
8.2 Significance Levels . . . 292
8.3 Tests Concerning the Mean of a Normal Population . . . 293
8.3.1 Case of Known Variance . . . 293
8.3.2 Case of Unknown Variance: Thet-Test . . . 305
8.4 Testing the Equality of Means of Two Normal Populations . . . 312
8.4.1 Case of Known Variances . . . 312
8.4.2 Case of Unknown Variances . . . 314
8.4.3 Case of Unknown and Unequal Variances . . . 318
8.4.4 The Pairedt-Test . . . 319
8.5 Hypothesis Tests Concerning the Variance of a Normal Population . . . . 321
8.5.1 Testing for the Equality of Variances of Two Normal Populations . . . 322
8.6 Hypothesis Tests in Bernoulli Populations . . . 323
8.6.1 Testing the Equality of Parameters in Two Bernoulli Populations . . . 327
8.7 Tests Concerning the Mean of a Poisson Distribution . . . 330
8.7.1 Testing the Relationship Between Two Poisson Parameters . . . 331
Problems . . . 334
Chapter 9 Regression . . . 351
9.1 Introduction . . . 351
9.2 Least Squares Estimators of the Regression Parameters . . . 353
9.3 Distribution of the Estimators . . . 355
9.4 Statistical Inferences about the Regression Parameters . . . 361
9.4.1 Inferences Concerningβ . . . 362
9.4.1.1 Regression to the Mean . . . 366
9.4.2 Inferences Concerningα . . . 370
9.4.3 Inferences Concerning the Mean Responseα+βx0 . . . 371
9.4.4 Prediction Interval of a Future Response . . . 373
9.4.5 Summary of Distributional Results . . . 375
9.5 The Coefficient of Determination and the Sample Correlation Coefficient . . . 376
9.6 Analysis of Residuals: Assessing the Model . . . 378
9.7 Transforming to Linearity . . . 381
9.8 Weighted Least Squares . . . 384
9.9 Polynomial Regression . . . 391
*9.10 Multiple Linear Regression . . . 394
9.10.1 Predicting Future Responses . . . 405
9.11 Logistic Regression Models for Binary Output Data . . . 410
Problems . . . 413
Chapter 10 Analysis of Variance . . . 439
10.1 Introduction . . . 439
10.2 An Overview . . . 440
10.3 One-Way Analysis of Variance . . . 442
10.3.1 Multiple Comparisons of Sample Means . . . 450
10.3.2 One-Way Analysis of Variance with Unequal Sample Sizes . . . 452
10.4 Two-Factor Analysis of Variance: Introduction and Parameter Estimation . . . 454
10.5 Two-Factor Analysis of Variance: Testing Hypotheses . . . 458
10.6 Two-Way Analysis of Variance with Interaction . . . 463
Problems . . . 471
Chapter 11 Goodness of Fit Tests and Categorical Data Analysis . . . 483
11.1 Introduction . . . 483
11.2 Goodness of Fit Tests When all Parameters are Specified . . . 484
11.2.1 Determining the Critical Region by Simulation . . . 490
11.3 Goodness of Fit Tests When Some Parameters are Unspecified . . . 493
11.4 Tests of Independence in Contingency Tables . . . 495
11.5 Tests of Independence in Contingency Tables Having Fixed
Marginal Totals . . . 499
*11.6 The Kolmogorov–Smirnov Goodness of Fit Test for Continuous Data . . . 504
Problems . . . 508
Chapter 12 Nonparametric Hypothesis Tests . . . 515
12.1 Introduction . . . 515
12.2 The Sign Test . . . 515
12.3 The Signed Rank Test . . . 519
12.4 The Two-Sample Problem . . . 525
12.4.1 The Classical Approximation and Simulation . . . 529
12.5 The Runs Test for Randomness . . . 533
Problems . . . 537
Chapter 13 Quality Control . . . 545
13.1 Introduction . . . 545
13.2 Control Charts for Average Values: TheX-Control Chart . . . 546
13.2.1 Case of Unknownµandσ . . . 549
13.3 S-Control Charts . . . 554
13.4 Control Charts for the Fraction Defective . . . 557
13.5 Control Charts for Number of Defects . . . 559
13.6 Other Control Charts for Detecting Changes in the Population Mean . . . 563
13.6.1 Moving-Average Control Charts . . . 563
13.6.2 Exponentially Weighted Moving-Average Control Charts . . . 565
13.6.3 Cumulative Sum Control Charts . . . 571
Problems . . . 573
Chapter 14* Life Testing . . . 581
14.1 Introduction . . . 581
14.2 Hazard Rate Functions . . . 581
14.3 The Exponential Distribution in Life Testing . . . 584
14.3.1 Simultaneous Testing — Stopping at therth Failure . . . 584
14.3.2 Sequential Testing . . . 590
14.3.3 Simultaneous Testing — Stopping by a Fixed Time . . . 594
14.3.4 The Bayesian Approach . . . 596
14.4 A Two-Sample Problem . . . 598
14.5 The Weibull Distribution in Life Testing . . . 600
14.5.1 Parameter Estimation by Least Squares . . . 602
Problems . . . 604
Appendix of Tables . . . 611
Index . . . 617
* Denotes optional material.
The third edition of this book continues to demonstrate how to apply probability theory to gain insight into real, everyday statistical problems and situations. As in the previous editions, carefully developed coverage of probability motivates probabilistic models of real phenomena and the statistical procedures that follow. This approach ultimately results in an intuitive understanding of statistical procedures and strategies most often used by practicing engineers and scientists.
This book has been written for an introductory course in statistics, or in probability and statistics, for students in engineering, computer science, mathematics, statistics, and the natural sciences. As such it assumes knowledge of elementary calculus.
ORGANIZATION AND COVERAGE
Chapter 1presents a brief introduction to statistics, presenting its two branches of descrip- tive and inferential statistics, and a short history of the subject and some of the people whose early work provided a foundation for work done today.
The subject matter of descriptive statistics is then considered inChapter 2.Graphs and tables that describe a data set are presented in this chapter, as are quantities that are used to summarize certain of the key properties of the data set.
To be able to draw conclusions from data, it is necessary to have an understanding of the data’s origination. For instance, it is often assumed that the data constitute a
“random sample” from some population. To understand exactly what this means and what its consequences are for relating properties of the sample data to properties of the entire population, it is necessary to have some understanding of probability, and that is the subject of Chapter 3. This chapter introduces the idea of a probability experi- ment, explains the concept of the probability of an event, and presents the axioms of probability.
Our study of probability is continued inChapter 4, which deals with the important concepts of random variables and expectation, and inChapter 5,which considers some special types of random variables that often occur in applications. Such random variables as the binomial, Poisson, hypergeometric, normal, uniform, gamma, chi-square, t, and F are presented.
In Chapter 6, we study the probability distribution of such sampling statistics as the sample mean and the sample variance. We show how to use a remarkable theoretical result of probability, known as the central limit theorem, to approximate the probability distribution of the sample mean. In addition, we present the joint
xiii
probability distribution of the sample mean and the sample variance in the impor- tant special case in which the underlying data come from a normally distributed population.
Chapter 7shows how to use data to estimate parameters of interest. For instance, a scientist might be interested in determining the proportion of Midwestern lakes that are afflicted by acid rain. Two types of estimators are studied. The first of these estimates the quantity of interest with a single number (for instance, it might estimate that 47 percent of Midwestern lakes suffer from acid rain), whereas the second provides an esti- mate in the form of an interval of values (for instance, it might estimate that between 45 and 49 percent of lakes suffer from acid rain). These latter estimators also tell us the “level of confidence” we can have in their validity. Thus, for instance, whereas we can be pretty certain that the exact percentage of afflicted lakes is not 47, it might very well be that we can be, say, 95 percent confident that the actual percentage is between 45 and 49.
Chapter 8 introduces the important topic of statistical hypothesis testing, which is concerned with using data to test the plausibility of a specified hypothesis. For instance, such a test might reject the hypothesis that fewer than 44 percent of Midwestern lakes are afflicted by acid rain. The concept of the p-value, which measures the degree of plausibility of the hypothesis after the data have been observed, is introduced. A variety of hypothesis tests concerning the parameters of both one and two normal populations are considered. Hypothesis tests concerning Bernoulli and Poisson parameters are also presented.
Chapter 9 deals with the important topic of regression. Both simple linear regression — including such subtopics as regression to the mean, residual analysis, and weighted least squares — and multiple linear regression are considered.
Chapter 10introduces the analysis of variance. Both one-way and two-way (with and without the possibility of interaction) problems are considered.
Chapter 11is concerned with goodness of fit tests, which can be used to test whether a proposed model is consistent with data. In it we present the classical chi-square goodness of fit test and apply it to test for independence in contingency tables. The final section of this chapter introduces the Kolmogorov–Smirnov procedure for testing whether data come from a specified continuous probability distribution.
Chapter 12deals with nonparametric hypothesis tests, which can be used when one is unable to suppose that the underlying distribution has some specified parametric form (such as normal).
Chapter 13considers the subject matter of quality control, a key statistical technique in manufacturing and production processes. A variety of control charts, including not only the Shewhart control charts but also more sophisticated ones based on moving averages and cumulative sums, are considered.
Chapter 14deals with problems related to life testing. In this chapter, the exponential, rather than the normal, distribution, plays the key role.
NEW TO THIS EDITION
New exercises and real data examples have been added throughout, including:
• The One-sided Chebyshev Inequality for Data (Section 2.4)
• The Logistics Distribution and Logistic Regression (Sections 5.4 and 9.11)
• Estimation and Testing in proofreader problems (Examples 7.2B and 8.7g)
• Product Form Estimates of Life Distributions (Section 7.2.1)
• Observational Studies (Example 8.6e)
About the CD
Packaged along with the text is a PC disk that can be used to solve most of the statistical problems in the text. For instance, the disk computes thep-values for most of the hypothesis tests, including those related to the analysis of variance and to regression. It can also be used to obtain probabilities for most of the common distributions. (For those students without access to a personal computer, tables that can be used to solve all of the problems in the text are provided.)
One program on the disk illustrates the central limit theorem. It considers random variables that take on one of the values 0, 1, 2, 3, 4, and allows the user to enter the probabilities for these values along with an integern. The program then plots the probability mass function of the sum ofnindependent random variables having this distribution. By increasingn, one can “see” the mass function converge to the shape of a normal density function.
ACKNOWLEDGEMENTS
We thank the following people for their helpful comments on the Third Edition:
• Charles F. Dunkl, University of Virginia, Charlottesville
• Gabor Szekely, Bowling Green State University
• Krzysztof M. Ostaszewski, Illinois State University
• Micael Ratliff, Northern Arizona University
• Wei-Min Huang, Lehigh University
• Youngho Lee, Howard University
• Jacques Rioux, Drake University
• Lisa Gardner, Bradley University
• Murray Lieb, New Jersey Institute of Technology
• Philip Trotter, Cornell University
INTRODUCTION TO STATISTICS
1.1 INTRODUCTION
It has become accepted in today’s world that in order to learn about something, you must first collect data.Statisticsis the art of learning from data. It is concerned with the collection of data, its subsequent description, and its analysis, which often leads to the drawing of conclusions.
1.2 DATA COLLECTION AND DESCRIPTIVE STATISTICS
Sometimes a statistical analysis begins with a given set of data: For instance, the government regularly collects and publicizes data concerning yearly precipitation totals, earthquake occurrences, the unemployment rate, the gross domestic product, and the rate of inflation.
Statistics can be used to describe, summarize, and analyze these data.
In other situations, data are not yet available; in such cases statistical theory can be used to design an appropriate experiment to generate data. The experiment chosen should depend on the use that one wants to make of the data. For instance, suppose that an instruc- tor is interested in determining which of two different methods for teaching computer programming to beginners is most effective. To study this question, the instructor might divide the students into two groups, and use a different teaching method for each group.
At the end of the class the students can be tested and the scores of the members of the different groups compared. If the data, consisting of the test scores of members of each group, are significantly higher in one of the groups, then it might seem reasonable to suppose that the teaching method used for that group is superior.
It is important to note, however, that in order to be able to draw a valid conclusion from the data, it is essential that the students were divided into groups in such a manner that neither group was more likely to have the students with greater natural aptitude for programming. For instance, the instructor should not have let the male class members be one group and the females the other. For if so, then even if the women scored significantly higher than the men, it would not be clear whether this was due to the method used to teach them, or to the fact that women may be inherently better than men at learning
1
programming skills. The accepted way of avoiding this pitfall is to divide the class members into the two groups “at random.” This term means that the division is done in such a manner that all possible choices of the members of a group are equally likely.
At the end of the experiment, the data should be described. For instance, the scores of the two groups should be presented. In addition, summary measures such as the aver- age score of members of each of the groups should be presented. This part of statistics, concerned with the description and summarization of data, is calleddescriptive statistics.
1.3 INFERENTIAL STATISTICS AND PROBABILITY MODELS
After the preceding experiment is completed and the data are described and summarized, we hope to be able to draw a conclusion about which teaching method is superior. This part of statistics, concerned with the drawing of conclusions, is calledinferential statistics.
To be able to draw a conclusion from the data, we must take into account the possibility of chance. For instance, suppose that the average score of members of the first group is quite a bit higher than that of the second. Can we conclude that this increase is due to the teaching method used? Or is it possible that the teaching method was not responsible for the increased scores but rather that the higher scores of the first group were just a chance occurrence? For instance, the fact that a coin comes up heads 7 times in 10 flips does not necessarily mean that the coin is more likely to come up heads than tails in future flips. Indeed, it could be a perfectly ordinary coin that, by chance, just happened to land heads 7 times out of the total of 10 flips. (On the other hand, if the coin had landed heads 47 times out of 50 flips, then we would be quite certain that it was not an ordinary coin.)
To be able to draw logical conclusions from data, we usually make some assumptions about the chances (orprobabilities) of obtaining the different data values. The totality of these assumptions is referred to as aprobability modelfor the data.
Sometimes the nature of the data suggests the form of the probability model that is assumed. For instance, suppose that an engineer wants to find out what proportion of computer chips, produced by a new method, will be defective. The engineer might select a group of these chips, with the resulting data being the number of defective chips in this group. Provided that the chips selected were “randomly” chosen, it is reasonable to suppose that each one of them is defective with probabilityp, wherepis the unknown proportion of all the chips produced by the new method that will be defective. The resulting data can then be used to make inferences aboutp.
In other situations, the appropriate probability model for a given data set will not be readily apparent. However, careful description and presentation of the data sometimes enable us to infer a reasonable model, which we can then try to verify with the use of additional data.
Because the basis of statistical inference is the formulation of a probability model to describe the data, an understanding of statistical inference requires some knowledge of
the theory of probability. In other words, statistical inference starts with the assumption that important aspects of the phenomenon under study can be described in terms of probabilities; it then draws conclusions by using data to make inferences about these probabilities.
1.4 POPULATIONS AND SAMPLES
In statistics, we are interested in obtaining information about a total collection of elements, which we will refer to as thepopulation. The population is often too large for us to examine each of its members. For instance, we might have all the residents of a given state, or all the television sets produced in the last year by a particular manufacturer, or all the households in a given community. In such cases, we try to learn about the population by choosing and then examining a subgroup of its elements. This subgroup of a population is called asample.
If the sample is to be informative about the total population, it must be, in some sense, representative of that population. For instance, suppose that we are interested in learning about the age distribution of people residing in a given city, and we obtain the ages of the first 100 people to enter the town library. If the average age of these 100 people is 46.2 years, are we justified in concluding that this is approximately the average age of the entire population? Probably not, for we could certainly argue that the sample chosen in this case is probably not representative of the total population because usually more young students and senior citizens use the library than do working-age citizens.
In certain situations, such as the library illustration, we are presented with a sample and must then decide whether this sample is reasonably representative of the entire population.
In practice, a given sample generally cannot be assumed to be representative of a population unless that sample has been chosen in a random manner. This is because any specific nonrandom rule for selecting a sample often results in one that is inherently biased toward some data values as opposed to others.
Thus, although it may seem paradoxical, we are most likely to obtain a representative sample by choosing its members in a totally random fashion without any prior consid- erations of the elements that will be chosen. In other words, we need not attempt to deliberately choose the sample so that it contains, for instance, the same gender percentage and the same percentage of people in each profession as found in the general population.
Rather, we should just leave it up to “chance” to obtain roughly the correct percentages.
Once a random sample is chosen, we can use statistical inference to draw conclusions about the entire population by studying the elements of the sample.
1.5 A BRIEF HISTORY OF STATISTICS
A systematic collection of data on the population and the economy was begun in the Italian city states of Venice and Florence during the Renaissance. The termstatistics, derived from the wordstate, was used to refer to a collection of facts of interest to the state. The idea of
collecting data spread from Italy to the other countries of Western Europe. Indeed, by the first half of the 16th century it was common for European governments to require parishes to register births, marriages, and deaths. Because of poor public health conditions this last statistic was of particular interest.
The high mortality rate in Europe before the 19th century was due mainly to epidemic diseases, wars, and famines. Among epidemics, the worst were the plagues. Starting with the Black Plague in 1348, plagues recurred frequently for nearly 400 years. In 1562, as a way to alert the King’s court to consider moving to the countryside, the City of London began to publish weekly bills of mortality. Initially these mortality bills listed the places of death and whether a death had resulted from plague. Beginning in 1625 the bills were expanded to include all causes of death.
In 1662 the English tradesman John Graunt published a book entitled Natural and Political Observations Made upon the Bills of Mortality. Table 1.1, which notes the total number of deaths in England and the number due to the plague for five different plague years, is taken from this book.
TABLE 1.1 Total Deaths in England
Year Burials Plague Deaths
1592 25,886 11,503
1593 17,844 10,662
1603 37,294 30,561
1625 51,758 35,417
1636 23,359 10,400
Source: John Graunt, Observations Made upon the Bills of Mortality.
3rd ed. London: John Martyn and James Allestry (1st ed. 1662).
Graunt used London bills of mortality to estimate the city’s population. For instance, to estimate the population of London in 1660, Graunt surveyed households in certain London parishes (or neighborhoods) and discovered that, on average, there were approxi- mately 3 deaths for every 88 people. Dividing by 3 shows that, on average, there was roughly 1 death for every 88/3 people. Because the London bills cited 13,200 deaths in London for that year, Graunt estimated the London population to be about
13,200×88/3=387,200
Graunt used this estimate to project a figure for all England. In his book he noted that these figures would be of interest to the rulers of the country, as indicators of both the number of men who could be drafted into an army and the number who could be taxed.
Graunt also used the London bills of mortality — and some intelligent guesswork as to what diseases killed whom and at what age — to infer ages at death. (Recall that the bills of mortality listed only causes and places at death, not the ages of those dying.) Graunt then used this information to compute tables giving the proportion of the population that
TABLE 1.2 John Graunt’s Mortality Table
Age at Death Number of Deaths per 100 Births
0–6 36
6–16 24
16–26 15
26–36 9
36–46 6
46–56 4
56–66 3
66–76 2
76 and greater 1
Note: The categories go up to but do not include the right-hand value. For instance, 0–6 means all ages from 0 up through 5.
dies at various ages. Table 1.2 is one of Graunt’s mortality tables. It states, for instance, that of 100 births, 36 people will die before reaching age 6, 24 will die between the age of 6 and 15, and so on.
Graunt’s estimates of the ages at which people were dying were of great interest to those in the business of selling annuities. Annuities are the opposite of life insurance in that one pays in a lump sum as an investment and then receives regular payments for as long as one lives.
Graunt’s work on mortality tables inspired further work by Edmund Halley in 1693.
Halley, the discoverer of the comet bearing his name (and also the man who was most responsible, by both his encouragement and his financial support, for the publication of Isaac Newton’s famousPrincipia Mathematica), used tables of mortality to compute the odds that a person of any age would live to any other particular age. Halley was influential in convincing the insurers of the time that an annual life insurance premium should depend on the age of the person being insured.
Following Graunt and Halley, the collection of data steadily increased throughout the remainder of the 17th and on into the 18th century. For instance, the city of Paris began collecting bills of mortality in 1667; and by 1730 it had become common practice throughout Europe to record ages at death.
The term statistics, which was used until the 18th century as a shorthand for the descriptive science of states, became in the 19th century increasingly identified with numbers. By the 1830s the term was almost universally regarded in Britain and France as being synonymous with the “numerical science” of society. This change in meaning was caused by the large availability of census records and other tabulations that began to be systematically collected and published by the governments of Western Europe and the United States beginning around 1800.
Throughout the 19th century, although probability theory had been developed by such mathematicians as Jacob Bernoulli, Karl Friedrich Gauss, and Pierre-Simon Laplace, its use in studying statistical findings was almost nonexistent, because most social statisticians
at the time were content to let the data speak for themselves. In particular, statisticians of that time were not interested in drawing inferences about individuals, but rather were concerned with the society as a whole. Thus, they were not concerned with sampling but rather tried to obtain censuses of the entire population. As a result, probabilistic inference from samples to a population was almost unknown in 19th century social statistics.
It was not until the late 1800s that statistics became concerned with inferring conclusions from numerical data. The movement began with Francis Galton’s work on analyzing hereditary genius through the uses of what we would now call regression and correlation analysis (see Chapter 9), and obtained much of its impetus from the work of Karl Pearson.
Pearson, who developed the chi-square goodness of fit tests (see Chapter 11), was the first director of the Galton Laboratory, endowed by Francis Galton in 1904. There Pearson originated a research program aimed at developing new methods of using statistics in inference. His laboratory invited advanced students from science and industry to learn statistical methods that could then be applied in their fields. One of his earliest visiting researchers was W. S. Gosset, a chemist by training, who showed his devotion to Pearson by publishing his own works under the name “Student.” (A famous story has it that Gosset was afraid to publish under his own name for fear that his employers, the Guinness brewery, would be unhappy to discover that one of its chemists was doing research in statistics.) Gosset is famous for his development of thet-test (see Chapter 8).
Two of the most important areas of applied statistics in the early 20th century were population biology and agriculture. This was due to the interest of Pearson and others at his laboratory and also to the remarkable accomplishments of the English scientist Ronald A. Fisher. The theory of inference developed by these pioneers, including among others
TABLE 1.3 The Changing Definition of Statistics
Statistics has then for its object that of presenting a faithful representation of a state at a determined epoch. (Quetelet, 1849)
Statistics are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man. (Galton, 1889)
Statistics may be regarded (i) as the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data. (Fisher, 1925)
Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology. (E. Pearson, 1936)
Statistics is the name for that science and art which deals with uncertain inferences — which uses numbers to find out something about nature and experience. (Weaver, 1952)
Statistics has become known in the 20th century as the mathematical tool for analyzing experimental and observational data. (Porter, 1986)
Statistics is the art of learning from data. (this book, 2004)
Karl Pearson’s son Egon and the Polish born mathematical statistician Jerzy Neyman, was general enough to deal with a wide range of quantitative and practical problems. As a result, after the early years of the 20th century a rapidly increasing number of people in science, business, and government began to regard statistics as a tool that was able to provide quantitative solutions to scientific and practical problems (see Table 1.3).
Nowadays the ideas of statistics are everywhere. Descriptive statistics are featured in every newspaper and magazine. Statistical inference has become indispensable to public health and medical research, to engineering and scientific studies, to marketing and quality control, to education, to accounting, to economics, to meteorological forecasting, to polling and surveys, to sports, to insurance, to gambling, and to all research that makes any claim to being scientific. Statistics has indeed become ingrained in our intellectual heritage.
Problems
1. An election will be held next week and, by polling a sample of the voting population, we are trying to predict whether the Republican or Democratic candidate will prevail. Which of the following methods of selection is likely to yield a representative sample?
(a) Poll all people of voting age attending a college basketball game.
(b) Poll all people of voting age leaving a fancy midtown restaurant.
(c) Obtain a copy of the voter registration list, randomly choose 100 names, and question them.
(d) Use the results of a television call-in poll, in which the station asked its listeners to call in and name their choice.
(e) Choose names from the telephone directory and call these people.
2. The approach used in Problem 1(e) led to a disastrous prediction in the 1936 presidential election, in which Franklin Roosevelt defeated Alfred Landon by a landslide. A Landon victory had been predicted by theLiterary Digest. The maga- zine based its prediction on the preferences of a sample of voters chosen from lists of automobile and telephone owners.
(a) Why do you think theLiterary Digest’s prediction was so far off?
(b) Has anything changed between 1936 and now that would make you believe that the approach used by theLiterary Digestwould work better today?
3. A researcher is trying to discover the average age at death for people in the United States today. To obtain data, the obituary columns of theNew York Timesare read for 30 days, and the ages at death of people in the United States are noted. Do you think this approach will lead to a representative sample?
4. To determine the proportion of people in your town who are smokers, it has been decided to poll people at one of the following local spots:
(a) the pool hall;
(b) the bowling alley;
(c) the shopping mall;
(d) the library.
Which of these potential polling places would most likely result in a reasonable approximation to the desired proportion? Why?
5. A university plans on conducting a survey of its recent graduates to determine information on their yearly salaries. It randomly selected 200 recent graduates and sent them questionnaires dealing with their present jobs. Of these 200, however, only 86 were returned. Suppose that the average of the yearly salaries reported was
$75,000.
(a) Would the university be correct in thinking that $75,000 was a good approxi- mation to the average salary level of all of its graduates? Explain the reasoning behind your answer.
(b) If your answer to part (a) is no, can you think of any set of conditions relat- ing to the group that returned questionnaires for which it would be a good approximation?
6. An article reported that a survey of clothing worn by pedestrians killed at night in traffic accidents revealed that about 80 percent of the victims were wearing dark- colored clothing and 20 percent were wearing light-colored clothing. The conclu- sion drawn in the article was that it is safer to wear light-colored clothing at night.
(a) Is this conclusion justified? Explain.
(b) If your answer to part (a) is no, what other information would be needed before a final conclusion could be drawn?
7. Critique Graunt’s method for estimating the population of London. What implicit assumption is he making?
8. The London bills of mortality listed 12,246 deaths in 1658. Supposing that a survey of London parishes showed that roughly 2 percent of the population died that year, use Graunt’s method to estimate London’s population in 1658.
9. Suppose you were a seller of annuities in 1662 when Graunt’s book was published.
Explain how you would make use of his data on the ages at which people were dying.
10. Based on Graunt’s mortality table:
(a) What proportion of people survived to age 6?
(b) What proportion survived to age 46?
(c) What proportion died between the ages of 6 and 36?
DESCRIPTIVE STATISTICS
2.1 INTRODUCTION
In this chapter we introduce the subject matter of descriptive statistics, and in doing so learn ways to describe and summarize a set of data. Section 2.2 deals with ways of describing a data set. Subsections 2.2.1 and 2.2.2 indicate how data that take on only a relatively few distinct values can be described by using frequency tables or graphs, whereas Subsection 2.2.3 deals with data whose set of values is grouped into different intervals.
Section 2.3 discusses ways of summarizing data sets by use of statistics, which are numerical quantities whose values are determined by the data. Subsection 2.3.1 considers three statistics that are used to indicate the “center” of the data set: the sample mean, the sample median, and the sample mode. Subsection 2.3.2 introduces the sample variance and its square root, called the sample standard deviation. These statistics are used to indicate the spread of the values in the data set. Subsection 2.3.3 deals with sample percentiles, which are statistics that tell us, for instance, which data value is greater than 95 percent of all the data. In Section 2.4 we present Chebyshev’s inequality for sample data. This famous inequality gives a lower bound to the proportion of the data that can differ from the sample mean by more thanktimes the sample standard deviation. Whereas Chebyshev’s inequality holds for all data sets, we can in certain situations, which are discussed in Section 2.5, obtain more precise estimates of the proportion of the data that is withink sample standard deviations of the sample mean. In Section 2.5 we note that when a graph of the data follows a bell-shaped form the data set is said to be approximately normal, and more precise estimates are given by the so-called empirical rule. Section 2.6 is concerned with situations in which the data consist of paired values. A graphical technique, called the scatter diagram, for presenting such data is introduced, as is the sample correlation coefficient, a statistic that indicates the degree to which a large value of the first member of the pair tends to go along with a large value of the second.
2.2 DESCRIBING DATA SETS
The numerical findings of a study should be presented clearly, concisely, and in such a manner that an observer can quickly obtain a feel for the essential characteristics of
9
the data. Over the years it has been found that tables and graphs are particularly useful ways of presenting data, often revealing important features such as the range, the degree of concentration, and the symmetry of the data. In this section we present some common graphical and tabular ways for presenting data.
2.2.1 Frequency Tables and Graphs
A data set having a relatively small number of distinct values can be conveniently presented in afrequency table. For instance, Table 2.1 is a frequency table for a data set consisting of the starting yearly salaries (to the nearest thousand dollars) of 42 recently graduated students with B.S. degrees in electrical engineering. Table 2.1 tells us, among other things, that the lowest starting salary of $47,000 was received by four of the graduates, whereas the highest salary of $60,000 was received by a single student. The most common starting salary was
$52,000, and was received by 10 of the students.
TABLE 2.1 Starting Yearly Salaries Starting Salary Frequency
47 4
48 1
49 3
50 5
51 8
52 10
53 0
54 5
56 2
57 3
60 1
Data from a frequency table can be graphically represented by aline graphthat plots the distinct data values on the horizontal axis and indicates their frequencies by the heights of vertical lines. A line graph of the data presented in Table 2.1 is shown in Figure 2.1.
When the lines in a line graph are given added thickness, the graph is called abar graph.
Figure 2.2 presents a bar graph.
Another type of graph used to represent a frequency table is thefrequency polygon, which plots the frequencies of the different data values on the vertical axis, and then connects the plotted points with straight lines. Figure 2.3 presents a frequency polygon for the data of Table 2.1.
2.2.2 Relative Frequency Tables and Graphs
Consider a data set consisting ofnvalues. Iff is the frequency of a particular value, then the ratiof/nis called itsrelative frequency. That is, the relative frequency of a data value is
12
10
8
6
4
2
0
Frequency
Starting salary
47 48 49 50 51 52 53 54 56 57 60
FIGURE 2.1 Starting salary data.
12
10
8
6
4
2
0
Frequency
Starting salary
47 48 49 50 51 52 53 54 56 57 60
FIGURE 2.2 Bar graph for starting salary data.
Starting salary
Frequency
12
10
8
6
4
2
0
47 48 49 50 51 52 53 54 56 57 60
FIGURE 2.3 Frequency polygon for starting salary data.
the proportion of the data that have that value. The relative frequencies can be represented graphically by a relative frequency line or bar graph or by a relative frequency polygon.
Indeed, these relative frequency graphs will look like the corresponding graphs of the absolute frequencies except that the labels on the vertical axis are now the old labels (that gave the frequencies) divided by the total number of data points.
EXAMPLE 2.2a Table 2.2 is a relative frequency table for the data of Table 2.1. The rela- tive frequencies are obtained by dividing the corresponding frequencies of Table 2.1 by 42, the size of the data set. ■
Apie chartis often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors; one for each distinct type of data value. The relative frequency of a data value is indicated by the area of its sector, this area being equal to the total area of the circle multiplied by the relative frequency of the data value.
EXAMPLE 2.2b The following data relate to the different types of cancers affecting the 200 most recent patients to enroll at a clinic specializing in cancer. These data are represented in the pie chart presented in Figure 2.4. ■
TABLE 2.2
Starting Salary Frequency
47 4/42=.0952
48 1/42=.0238
49 3/42
50 5/42
51 8/42
52 10/42
53 0
54 5/42
56 2/42
57 3/42
60 1/42
Melanoma 4.5%
Bladder 6%
Lung 21%
Breast 25%
Colon 16%
Prostate 27.5%
FIGURE 2.4
Type of Cancer Number of New Cases Relative Frequency
Lung 42 .21
Breast 50 .25
Colon 32 .16
Prostate 55 .275
Melanoma 9 .045
Bladder 12 .06
2.2.3 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots
As seen in Subsection 2.2.2, using a line or a bar graph to plot the frequencies of data values is often an effective way of portraying a data set. However, for some data sets the number of distinct values is too large to utilize this approach. Instead, in such cases, it is useful to divide the values into groupings, orclass intervals, and then plot the number of data values falling in each class interval. The number of class intervals chosen should be a trade-off between (1) choosing too few classes at a cost of losing too much information about the actual data values in a class and (2) choosing too many classes, which will result in the TABLE 2.3 Life in Hours of 200 Incandescent Lamps
Item Lifetimes
1,067 919 1,196 785 1,126 936 918 1,156 920 948
855 1,092 1,162 1,170 929 950 905 972 1,035 1,045
1,157 1,195 1,195 1,340 1,122 938 970 1,237 956 1,102
1,022 978 832 1,009 1,157 1,151 1,009 765 958 902
923 1,333 811 1,217 1,085 896 958 1,311 1,037 702
521 933 928 1,153 946 858 1,071 1,069 830 1,063
930 807 954 1,063 1,002 909 1,077 1,021 1,062 1,157
999 932 1,035 944 1,049 940 1,122 1,115 833 1,320
901 1,324 818 1,250 1,203 1,078 890 1,303 1,011 1,102
996 780 900 1,106 704 621 854 1,178 1,138 951
1,187 1,067 1,118 1,037 958 760 1,101 949 992 966
824 653 980 935 878 934 910 1,058 730 980
844 814 1,103 1,000 788 1,143 935 1,069 1,170 1,067
1,037 1,151 863 990 1,035 1,112 931 970 932 904
1,026 1,147 883 867 990 1,258 1,192 922 1,150 1,091
1,039 1,083 1,040 1,289 699 1,083 880 1,029 658 912
1,023 984 856 924 801 1,122 1,292 1,116 880 1,173
1,134 932 938 1,078 1,180 1,106 1,184 954 824 529
998 996 1,133 765 775 1,105 1,081 1,171 705 1,425
610 916 1,001 895 709 860 1,110 1,149 972 1,002
frequencies of each class being too small for a pattern to be discernible. Although 5 to 10 class intervals are typical, the appropriate number is a subjective choice, and of course, you can try different numbers of class intervals to see which of the resulting charts appears to be most revealing about the data. It is common, although not essential, to choose class intervals of equal length.
The endpoints of a class interval are called the class boundaries. We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point. Thus, for instance, the class interval 20–30 contains all values that are both greater thanor equal to20 and less than 30.
Table 2.3 presents the lifetimes of 200 incandescent lamps. A class frequency table for the data of Table 2.3 is presented in Table 2.4. The class intervals are of length 100, with the first one starting at 500.
TABLE 2.4 A Class Frequency Table
Frequency
(Number of Data Values in
Class Interval the Interval)
500–600 2
600–700 5
700–800 12
800–900 25
900–1000 58
1000–1100 41
1100–1200 43
1200–1300 7
1300–1400 6
1400–1500 1
5 6 7 8 9 10 11 12 13 14 15 Life in units of 100 hours
Number of occurrences
60 50 40 30 20 10 00
FIGURE 2.5 A frequency histogram.
1.0
0.8
0.6
0.4
0.2
0
Lifetimes
500 700 900 1,100 1,300 1,500
FIGURE 2.6 A cumulative frequency plot.
A bar graph plot of class data, with the bars placed adjacent to each other, is called ahistogram. The vertical axis of a histogram can represent either the class frequency or the relative class frequency; in the former case the graph is called afrequency histogramand in the latter arelative frequency histogram. Figure 2.5 presents a frequency histogram of the data in Table 2.4.
We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph. A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it. A cumulative relative frequency plot of the data of Table 2.3 is given in Figure 2.6. We can conclude from this figure that 100 percent of the data values are less than 1,500, approximately 40 percent are less than or equal to 900, approximately 80 percent are less than or equal to 1,100, and so on. A cumulative frequency plot is called anogive.
An efficient way of organizing a small- to moderate-sized data set is to utilize astem and leaf plot. Such a plot is obtained by first dividing each data value into two parts — its stem and its leaf. For instance, if the data are all two-digit numbers, then we could let the stem part of a data value be its tens digit and let the leaf be its ones digit. Thus, for instance, the value 62 is expressed as
Stem Leaf
6 2
and the two data values 62 and 67 can be represented as Stem Leaf
6 2, 7
EXAMPLE 2.2c Table 2.5 gives the monthly and yearly average daily minimum tempera- tures in 35 U.S. cities.
The annual average daily minimum temperatures from Table 2.5 are represented in the following stem and leaf plot.
7 0.0 6 9.0
5 1.0, 1.3, 2.0, 5.5, 7.1, 7.4, 7.6, 8.5, 9.3
4 0.0, 1.0, 2.4, 3.6, 3.7, 4.8, 5.0, 5.2, 6.0, 6.7, 8.1, 9.0, 9.2 3 3.1, 4.1, 5.3, 5.8, 6.2, 9.0, 9.5, 9.5
2 9.0, 9.8
2.3 SUMMARIZING DATA SETS
Modern-day experiments often deal with huge sets of data. For instance, in an attempt to learn about the health consequences of certain common practices, in 1951 the medical statisticians R. Doll and A. B. Hill sent questionnaires to all doctors in the United Kingdom and received approximately 40,000 replies. Their questions dealt with age, eating habits, and smoking habits. The respondents were then tracked for the ensuing 10 years and the causes of death for those who died were monitored. To obtain a feel for such a large amount of data, it is useful to be able to summarize it by some suitably chosen measures. In this section we present some summarizing statistics, where a statistic is a numerical quantity whose value is determined by the data.
2.3.1 Sample Mean, Sample Median, and Sample Mode
In this section we introduce some statistics that are used for describing the center of a set of data values. To begin, suppose that we have a data set consisting of the nnumerical valuesx1,x2,. . .,xn. The sample mean is the arithmetic average of these values.
Definition Thesample mean, designated byx, is defined by¯
¯ x=
n i=1
xi/n
The computation of the sample mean can often be simplified by noting that if for constants aandb
yi=axi+b, i=1,. . .,n
TABLE 2.5 Normal Daily Minimum Temperature — Selected Cities
[In Fahrenheit degrees. Airport data except as noted. Based on standard 30-year period, 1961 through 1990]
Annual State Station Jan. Feb. Mar. Apr. May June July Aug. Sept. Oct. Nov. Dec. avg.
AL Mobile . . . . 40.0 42.7 50.1 57.1 64.4 70.7 73.2 72.9 68.7 57.3 49.1 43.1 57.4 AK Juneau . . . . 19.0 22.7 26.7 32.1 38.9 45.0 48.1 47.3 42.9 37.2 27.2 22.6 34.1 AZ Phoenix . . . . 41.2 44.7 48.8 55.3 63.9 72.9 81.0 79.2 72.8 60.8 48.9 41.8 59.3 AR Little Rock . . . . 29.1 33.2 42.2 50.7 59.0 67.4 71.5 69.8 63.5 50.9 41.5 33.1 51.0 CA Los Angeles . . . . 47.8 49.3 50.5 52.8 56.3 59.5 62.8 64.2 63.2 59.2 52.8 47.9 55.5 Sacramento . . . . 37.7 41.4 43.2 45.5 50.3 55.3 58.1 58.0 55.7 50.4 43.4 37.8 48.1 San Diego . . . . 48.9 50.7 52.8 55.6 59.1 61.9 65.7 67.3 65.6 60.9 53.9 48.8 57.6 San Francisco . . . . 41.8 45.0 45.8 47.2 49.7 52.6 53.9 55.0 55.2 51.8 47.1 42.7 49.0 CO Denver . . . . 16.1 20.2 25.8 34.5 43.6 52.4 58.6 56.9 47.6 36.4 25.4 17.4 36.2 CT Hartford . . . . 15.8 18.6 28.1 37.5 47.6 56.9 62.2 60.4 51.8 40.7 32.8 21.3 39.5 DE Wilmington . . . . 22.4 24.8 33.1 41.8 52.2 61.6 67.1 65.9 58.2 45.7 37.0 27.6 44.8 DC Washington . . . . 26.8 29.1 37.7 46.4 56.6 66.5 71.4 70.0 62.5 50.3 41.1 31.7 49.2 FL Jacksonville . . . . 40.5 43.3 49.2 54.9 62.1 69.1 71.9 71.8 69.0 59.3 50.2 43.4 57.1 Miami . . . . 59.2 60.4 64.2 67.8 72.1 75.1 76.2 76.7 75.9 72.1 66.7 61.5 69.0 GA Atlanta . . . . 31.5 34.5 42.5 50.2 58.7 66.2 69.5 69.0 63.5 51.9 42.8 35.0 51.3 HI Honolulu . . . . 65.6 65.4 67.2 68.7 70.3 72.2 73.5 74.2 73.5 72.3 70.3 67.0 70.0 ID Boise . . . . 21.6 27.5 31.9 36.7 43.9 52.1 57.7 56.8 48.2 39.0 31.1 22.5 39.1 IL Chicago . . . . 12.9 17.2 28.5 38.6 47.7 57.5 62.6 61.6 53.9 42.2 31.6 19.1 39.5 Peoria . . . . 13.2 17.7 29.8 40.8 50.9 60.7 65.4 63.1 55.2 43.1 32.5 19.3 41.0 IN Indianapolis . . . . 17.2 20.9 31.9 41.5 51.7 61.0 65.2 62.8 55.6 43.5 34.1 23.2 42.4 IA Des Moines . . . . 10.7 15.6 27.6 40.0 51.5 61.2 66.5 63.6 54.5 42.7 29.9 16.1 40.0 KS Wichita . . . . 19.2 23.7 33.6 44.5 54.3 64.6 69.9 67.9 59.2 46.6 33.9 23.0 45.0 KY Louisville . . . . 23.2 26.5 36.2 45.4 54.7 62.9 67.3 65.8 58.7 45.8 37.3 28.6 46.0 LA New Orleans . . . . 41.8 44.4 51.6 58.4 65.2 70.8 73.1 72.8 69.5 58.7 51.0 44.8 58.5 ME Portland . . . . 11.4 13.5 24.5 34.1 43.4 52.1 58.3 57.1 48.9 38.3 30.4 17.8 35.8 MD Baltimore . . . . 23.4 25.9 34.1 42.5 52.6 61.8 66.8 65.7 58.4 45.9 37.1 28.2 45.2 MA Boston . . . . 21.6 23.0 31.3 40.2 49.8 59.1 65.1 64.0 56.8 46.9 38.3 26.7 43.6 MI Detroit . . . . 15.6 17.6 27.0 36.8 47.1 56.3 61.3 59.6 52.5 40.9 32.2 21.4 39.0 Sault Ste. Marie . . . . 4.6 4.8 15.3 28.4 38.4 45.5 51.3 51.3 44.3 36.2 25.9 11.8 29.8 MN Duluth . . . . −2.2 2.8 15.7 28.9 39.6 48.5 55.1 53.3 44.5 35.1 21.5 4.9 29.0 Minneapolis-St. Paul . . . 2.8 9.2 22.7 36.2 47.6 57.6 63.1 60.3 50.3 38.8 25.2 10.2 35.3 MS Jackson . . . . 32.7 35.7 44.1 51.9 60.0 67.1 70.5 69.7 63.7 50.3 42.3 36.1 52.0 MO Kansas City . . . . 16.7 21.8 32.6 43.8 53.9 63.1 68.2 65.7 56.9 45.7 33.6 21.9 43.7 St. Louis . . . . 20.8 25.1 35.5 46.4 56.0 65.7 70.4 67.9 60.5 48.3 37.7 26.0 46.7 MT Great Falls . . . . 11.6 17.2 22.8 31.9 40.9 48.6 53.2 52.2 43.5 35.8 24.3 14.6 33.1 Source: U.S. National Oceanic and Atmospheric Administration, Climatography of the United States, No. 81.
then the sample mean of the data sety1,. . .,ynis
¯ y=
n i=1
(axi+b)/n= n
i=1
axi/n+ n i=1
b/n=ax¯+b
EXAMPLE 2.3a The winning scores in the U.S. Masters golf tournament in the years from 1982 to 1991 were as follows:
284, 280, 277, 282, 279, 285, 281, 283, 278, 277 Find the sample mean of these scores.
SOLUTION Rather than directly adding these values, it is easier to first subtract 280 from each one to obtain the new valuesyi=xi−280:
4, 0,−3, 2,−1, 5, 1, 3,−2,−3 Because the arithmetic average of the transformed data set is
¯ y=6/10 it follows that
¯
x= ¯y+280=280.6 ■
Sometimes we want to determine the sample mean of a data set that is presented in a frequency table listing thekdistinct valuesv1,. . .,vk having corresponding frequencies f1,. . .,fk. Since such a data set consists ofn= k
i=1fi observations, with the valuevi appearingfi times, for eachi=1,. . .,k, it follows that the sample mean of thesendata values is
¯ x=
k i=1
vifi/n By writing the preceding as
¯ x=f1
nv1+f2
nv2+ · · · +fk nvk
we see that the sample mean is aweighted averageof the distinct values, where the weight given to the value vi is equal to the proportion of the n data values that are equal to vi,i=1,. . .,k.