• Aucun résultat trouvé

LEARNING STATISTICS THROUGH BASEBALL Introduction

Several years ago I wrote a little book entitled Learning statistics through playing cards (Knapp, 1996), in which I tried to explain the fundamental concepts of statistics (both descriptive and inferential) by using an ordinary deck of playing cards for generating the numbers. In 2003 Jim Albert wrote Teaching statistics through baseball. What follows can be thought of as a possible sequel to both books, with its emphasis on descriptive statistics and the tabletop dice game "Big League Baseball".

The reason I am restricting most of this presentation to descriptive statistics is that there is no random sampling in baseball (more about this later), and random sampling is the principal justification for generalizing from a sample (a part) to a population (the whole). But it has been said, by the well-known statistician John Tukey and others, that there has been too much emphasis on inferential statistics anyhow. See his classic 1977 book, Exploratory data analysis (EDA) and/or any of the popular computer packages that have implemented EDA.

The price you might have to pay for reading this chapter is not in money (it's free) but in the time and effort necessary to understand the game of baseball.

(Comedian Bob Newhart's satirical routine about the complications of baseball is hilarious!) In the next couple of sections I will provide the basics. If you think you need to know more, watch a few games at your local Little League field or a few Major League games on TV (especially if Los Angeles Dodgers broadcaster Vin Scully is the announcer).

How the game is played

As many of you already know, there are nine players on each team and the teams take turns batting and fielding. The nine players are:

1. The pitcher (who throws the ball that the batter tries to hit)

2. The catcher (who catches any ball the batter doesn't hit and some others) 3. The first baseman (who is "the guardian" of the base that the batter must first run to after hitting the ball)

4. The second baseman (the "guardian" of the next base)

5. The shortstop (who helps to guard second base, among other things) 6. The third baseman (the "guardian" of that base)

7. The left fielder (who stands about 100 feet behind third base and hopes to catch any balls that are hit nearby)

8. The center fielder (who is positioned similarly behind second base) 9. The right fielder (likewise, but behind first base).

The object of the game as far as the batters are concerned is to run counter-clockwise around the bases (from first to second to third, and back to "home plate" where the ball is batted and where the catcher is the "guardian"). The object of the game as far as the fielders are concerned is to prevent the batters from doing that.

Some specifics:

1. There are nine "innings" during which a game is played. An inning consists of each team having the opportunity to bat until there are three "outs", i.e., three unsuccessful opportunities for the runners to reach the bases before the ball (thrown by the fielders) gets there. If the runner reaches the base before the ball does, he is credited with a "hit".

2. Each batter can choose to swing the bat or to not swing the bat at the ball thrown by the pitcher. If the batter chooses to swing, he has three opportunities to try to bat the ball; failure on all three opportunities results in three "strikes" and an "out". If the batter chooses to not swing and the ball has been thrown where he should have been able to bat it, that also constitutes a strike. But if he chooses to not swing and the pitcher has not thrown the ball well, the result is called a "ball". He (the batter) is awarded first base if he refrains from swinging at four poor pitches. (Check out what Bob Newhart has to say about why there are three strikes and four balls!)

In reality there are often many combinations of swings and non-swings that result in successes or failures. For example, it is quite common for a batter to swing and miss at two good pitches, to not swing at two bad pitches, and to eventually swing, bat the ball, run toward first base, and get there either before or after the ball is caught and thrown to the first baseman.

3. The team that scores the more "runs" (encirclings of the bases by the batters) after the nine innings are played is the winner.

There are several other technical matters that I will discuss when necessary.

What does this have to do with statistics?

My favorite statistic is a percentage (see Chapter 15 of this book), and

percentages abound in baseball. For example, one matter of great concern to a batter is the percentage of time that he bats a ball and arrives safely at first base (or even beyond) before the ball gets there. If a batter gets one successful hit every four times at bat, he is said to have a "batting average" of 1/4 or 25% or .

very close to .260 for many, many years. (See the late paleontologist Stephen Jay Gould's 1996 book, Full house, and his 2004 book, Triumph and tragedy in Mudville.)

Similarly, a matter of great concern to the pitcher is that same percentage of the time that a batter is successful against him. (One of the neat things about baseball is the fact that across all of the games played, "batting average of" for the batters must be equal to "batting average against" for the pitchers.) Batters who bat successfully much higher than .250 and pitchers who hold batters to averages much lower than .250 are usually the best players.

Other important percentages are those for the fielders. If they are successful in

"throwing out" the runners 95 percent or more of the time (fielding averages of . 950 or better) they are doing their jobs very well.

Some other statistical aspects of baseball

1. In a previous paragraph I pointed out that the average batting average has remained around .260. The average standard deviation (the most frequently used measure of variability...see below) has decreased steadily over the years.

It's now approximately .030. (See Gould, 1996 and 2004, about that also.)

2. One of the most important concepts in statistics is the correlation between two variables such as height and weight, age and pulse rate, etc. Instead of, or in addition to, "batting average against", baseball people often look at a pitcher's

"earned run average", which is calculated by multiplying by nine the number of earned runs given up and dividing by the number of innings pitched. (See

Charles M. Schulz's 2004 book, Who's on first, Charlie Brown? , page 106, for a cute illustration of the concept.) Those two variables, "batting average against"

and "earned run average", correlate very highly with one another, not

surprisingly, since batters who don't bat very well against a pitcher are unlikely to score very many runs against him.

3. The matter of "weighting" certain data is very common in statistics and

especially common in baseball. For example, if a player has a batting average of .250 against left-handed pitchers and a batting average of .350 against right-handed pitchers, it doesn't necessarily follow that his overall batting average is . 300 (the simple average of the two averages), since he might not have batted against left-handed pitchers the same number of times as he did against right-handed pitchers. This is particularly important in trying to understand something called Simpson's Paradox (see below).

4. "Unit of analysis" is a very important concept in statistics. In baseball the unit of analysis is sometimes the individual player, sometimes the team, sometimes the league itself. Whenever measurements of various aspects of baseball are taken, they should be independent of one another. For example, if the team is

the unit of analysis and we find that there is a strong correlation between the number of runs scored and the number of hits made, the correlation between those same two variables might be higher and might be lower (it's usually lower) if the individual player is taken as the unit of analysis, and the number of

"observations" (pieces of data) might not be independent in the latter case, since player is "nested" within team.

5. "Errors" arise in statistics (measurement errors, sampling errors, etc.) and, alas, are unfortunately also fairly common in baseball. For example, when a ball is batted to a fielder and he doesn't catch it, or he catches it and then throws wildly to the baseman, thereby permitting the batter to reach base, that fielder is charged with an error, which can sometimes be an important determinant of a win or a loss.

A "simulated" game

In his 2003 book, Jim Albert displays the following tables for simulating a game of baseball, pitch by pitch, using a set of three dice (one red die and two white dice). This approach, called [tabletop] Big League Baseball was marketed by Sycamore Games in the 1960s.

Result of rolling the red die in “Big League Baseball."

Red die Pitch result 1, 6 Ball in play

2, 3 Ball

4, 5 Strike

Result of rolling the two white dice in “Big League Baseball."

Second die

1 2 3 4 5 6

First die 1 Single Out Out Out Out Error

2 Out Double Single Out Single Out

5 Out Single Out Out Out Single

6 Error Out Out Out Single Home run

In the first Appendix to this chapter I have inserted an Excel file that lists the results of over 200 throws I made of those dice in order to generate the findings for a hypothetical game between two teams, call them Team A and Team B. [We retirees have all kinds of time on our hands to do such things, but I "cheated" a little by using RANDOM.ORG's virtual dice roller rather than actually throwing one red die and two white dice.] Those findings will be used throughout the rest of this chapter to illustrate percentages, correlations, and other statistical

concepts that are frequently encountered in real-life research. "Big League Baseball" does not provide an ideal simulation, as Albert himself has

acknowledged (e.g., it regards balls and strikes as equally likely, which they are not), but as I like to say, "it's close enough for government work".

You might want to print out the raw data in that Appendix in order to trace where all of the numbers in the next several sections come from.

Some technical matters regarding the simulated game:

1. I previously mentioned that balls and strikes are treated as equally likely, although they are not. Similarly, the probabilities associated with “white die 1” and “white die 2” do not quite agree with what actually happens in baseball, but once again they’re close enough.

2. You might have noticed that Batter A1 followed Batter A9 after each of the latter’s appearances. B1 likewise followed B9, etc. throughout the game.

But it is quite often the case that players are replaced during a game for various reasons (injury, inept play, etc.). Once a player is replaced he is not permitted to re-enter the game (unlike in basketball and football).

3. There were a couple of occasions where a batter swung at a pitch when he already had three straight balls. That is unusual. Most batters would prefer to not swing at that fourth pitch, hoping it might also be a ball.

4. There was at least one occasion where a runner advanced one base when the following batter got a single. Sometimes runners can advance more than one base in such situations.

5. The various permutations in the second table do not allow for somewhat unusual events such as a batter actually getting struck by a pitch or a batter hitting into a “double play” in which both he and a runner already on base are out.

6. Most of the time a ball in play (a roll of a 1 or a 6 with the red die) resulted in an out, which is in fact usually the case. We don’t know, however, how those outs were made. For example, was the ball hit in the air and was caught by the left fielder? Was the ball hit on the ground to the second

As far as the score goes, it doesn’t really matter. But it does matter to the players and to the “manager” (every team has a manager who decides who is chosen to play, who bats in what order, and the like).

7. We also don’t know who made the errors. As far as the score goes, that doesn’t really matter either. But it does matter to the individual players who made the errors, since it affects their fielding averages.

8. A word needs to be said about the determination of balls and strikes. In the game under consideration, if a strike was the result of a pitch thrown by the pitcher we don’t know if the batter swung and missed (which would automatically be a strike) or if he “took” a pitch at which he should have swung. It is the “umpire” of the game (every game has at least one umpire) who determines whether or not a pitch was good enough for a batter to hit.

9. Although it didn’t happen in this game, sometimes the teams are tied after nine innings have been played. If so, one or more innings must be played until one of the teams gets ahead and stays ahead. Again, unlike in basketball and football, there is no time limit in baseball.

10.If the team that bats second is ahead after 8 ½ innings there is no reason for them to bat in the last half of the 9th inning, since they have already won the game. That also didn’t happen in this particular game, but it is very common.

Basic Descriptive Statistics 1. Frequency distributions

A frequency distribution is the most important concept in descriptive statistics. It provides a count of the number of times that each of several events took place.

For example, in the simulated data in the Appendix we can determine the following frequency distribution of the number of hits made by the players on Team A in their game against Team B:

Number of hits Frequency

0 3

1 2

2 4

3 or more 0

Do you see how I got those numbers? Players A5, A6, and A7 had no hits;

Players A4 and A9 had one each (A4 had a double in the seventh inning; A9 had a single in the second inning); and Players A1, A2, A3, and A8 had two each (A1