Baseball Toaster was unplugged on February 4, 2009.
(Or my infallibility already put to the test)
There was an interesting query circulating on the SABR online newsletter do-hickey. It reads as follows:
Subject: Hitting Streak Probabilities
I need some math help. I understand that to compute the chance of an n-game hitting streak you take the probability of a player getting a hit in a game (p) and raise it to the power of the streak you are interested in (n).
So how would you figure out the probability of a player having an n-game streak during a given season (and wouldn't that be a useful number to know)? Would it be by multiplying the chance of an n-game streak by the number of n-game stretches there are in a season?
For example, there are 99 56-game stretches in a 154-game season (games 1-56 through games 99-154), 119 44-game stretches in a 162-game season, and 162 1-game stretches in a 162-game season.
I understand enough to know that the probability of a player getting a Hit in a game is not the same as the number of hits per game a player averages. A player who plays 162 games and gets 162 hits averages 1 hit per game, but that doesn't mean the probability of him getting a hit in a given game is 1. Unfortunately, I don't understand enough to figure out the probability of a player getting a hit in a given game.
But I do know that even a player going 1/500 has a greater-than-zero probability of getting a hit in a given game (doesn't he?), and squaring this (for a two-game streak) is still greater than zero, and multiplying this by 161 (the number of two-game stretches in a season) is still greater than zero.
But for this player, wouldn't a 2-game streak need to have a probability of 0 for the formula to work in the "real world" (rather than theoretically)? And wouldn't having a 1-game streak somewhere in a season need to be the same probability (1) for a player who gets 1 hit in a season and player who gets 257? And don't these points contradict the idea that to get the probability of a player putting together an n-game streak in a season you multiply the chance of an n-game streak by the number of n-game stretches there are in a season?
So do we need to define the season formula so that you are not allowed to have an n when n is greater than the number of hits a player actually had? Or add something about the probability of a given hit-distribution of h-hits over n-games (there is zero probability of distributing one hit over two games)? Or what?
Please forgive any mathematical errors in the above. I was an English major.
Well, I thought I'd try my hand, and here's what I whipped up:
I was a Math major, so let's see if I can be of assistance.
Let's start from scratch: Assume H is the number of hits for a player in a year, and PA is the number of plate appearances in a year for the same player. The probability that a player gets a hit in a given plate appearance will be represented as P(A) and is defined as P(A) = H/PA.
We will assume that the sample space is defined by the probabilities based on the player's stats for a given year. This is preferable to a player's entire career stats since it relates directly to the player's ability at that point in time. It is also preferable to a day's or a week's stats, since those samples would be too small to represent the player's probabilities precisely.
Also, we base the probability on plate appearances, not at-bats, since the player could walk, get hit by a pitch, etc. A player who walks three times and gets a single in his one official at-bat had the same opportunity to record a hit as a player who went 1-for-4.
So the probability of a player getting a hit in a given game, which we will call P(B), is defined as P(B) = 1 - (1 - P(A))^pa, where "pa" is the number of plate appearances in a game (and "^" is the only way I can represent "to the power of" in an email). Basically, P(B) is derived by calculating the probability that the player does not get a hit (i.e., P(not B)) and subtracting it from 1 (the total probability of all outcomes, 100%). Since P(A) is the probability that the player gets a hit in a given plate appearance, 1 - P(A) is the probability that he will NOT get a hit in a given plate appearance. Given that there are pa plate appearances for a player in a given game, (1 - P(A))^pa is the probability that the player does NOT get a hit in a given game (P(not B)). And P(B) = 1 - P(not B), or 1 - (1 - P(A))^pa.
For example, a player who gets a base hit in .250 of his plate appearances (not one who has a .250 batting average) and who is at the plate four times has the following probability of getting a hit in that game:
P(B) = 1 - (1 - .25)^4
= 1 - (.75)^4
= 1 - 0.31640625
= 0.68359375 or 68.4%
So the player would be expected to get a hit in the game 68.4% of the time.
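For anyone who would rather check this with a few lines of Python than by hand, here is a minimal sketch of the per-game formula. The function name hit_prob_in_game is just my label for illustration, and it assumes every plate appearance is an independent trial with the same chance of a hit.

```python
def hit_prob_in_game(p_hit_per_pa: float, pa_per_game: float) -> float:
    """P(B) = 1 - (1 - P(A))^pa: chance of at least one hit in a game.

    Assumes each plate appearance is independent with the same hit chance.
    """
    p_no_hit_in_game = (1.0 - p_hit_per_pa) ** pa_per_game  # P(not B)
    return 1.0 - p_no_hit_in_game


# The worked example: hits in .250 of plate appearances, four trips to the plate.
print(hit_prob_in_game(0.25, 4))
```

Because the exponent can be fractional, the same function also accepts an average such as 4.12 plate appearances per game.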
Let's plug in DiMaggio's 1941 numbers to see what we get. He had 193 hits in 139 games and 572 plate appearances. That means that he averaged 4.12 plate appearances per game. So in an average game:
P(B) = 1 - (1 - (193/572))^(572/139) = 0.81617846 or 81.62%
Now, in X consecutive games, what is the probability that the player will get a hit in each game? Let's call that P(C) which is P(B) ^ X. That is, the probability that he will get a hit in a given game raised to the number of games.
To illustrate why, let's use a coin toss. What are the odds that three coin tosses in a row will result in three heads (H)? Let's define the sample space. Here are the possible results:
HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
There are eight possibilities but only one with three heads, HHH. So the answer is one out of eight. But how did we get eight anyway? Given that a coin toss is 50-50, there are two possible results for each coin toss (H or T). There are three tosses, so there are 2^3 = 8 equally likely sequences, and the chance of any one of them is (1/2)^3 = 1/8. P(HHH) = P(H) ^ 3.
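The coin-toss sample space is small enough to enumerate directly. This is only an illustrative sketch (the variable names are mine), using itertools.product from the standard library to list every sequence of three tosses and compare the counted probability with (1/2)^3.

```python
from itertools import product

# Every possible sequence of three tosses, e.g. "HHT".
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]
print(outcomes)

# Probability by counting: favorable sequences over all sequences.
p_by_counting = outcomes.count("HHH") / len(outcomes)

# Probability by multiplying: P(H) ^ 3 for a fair coin.
p_by_power = 0.5 ** 3

print(p_by_counting, p_by_power)
```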
So the probability that DiMaggio would get a hit in 56 straight games in 1941 was P(B)^56. From above, P(B) = 0.81617846.
So the answer is 0.81617846 ^ 56 = 0.0000114806697177021, or about .0011%, or roughly one in 87,103.
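Here is the whole chain for DiMaggio as a short Python sketch, so the arithmetic can be re-run. The inputs are the figures quoted above (193 hits, 572 plate appearances, 139 games), the variable names are my own, and the independence assumptions are the same as in the formula.

```python
# Figures as quoted in this post for DiMaggio's 1941 season.
hits = 193
plate_appearances = 572
games_played = 139
streak_length = 56

p_pa = hits / plate_appearances               # P(A): hit in one plate appearance
pa_per_game = plate_appearances / games_played  # average trips to the plate per game
p_game = 1 - (1 - p_pa) ** pa_per_game        # P(B): at least one hit in a game
p_streak = p_game ** streak_length            # P(C): hits in 56 particular games in a row

print(f"P(B) = {p_game:.6f}")
print(f"P(C) = {p_streak:.3e}")
print(f"about 1 in {1 / p_streak:,.0f}")
```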
But that is based on any 56 games. The 1941 season was 154 games long, but DiMaggio only played 139 and since this is a personal stat he would not break the streak if he sat out a game.
If we wanted to know the likelihood of his hitting in 56 of those games, we would use combinatorics. But we want to know how many discrete 56-game streaks are possible, so we are left to fancy ciphering, as Jethro would say. As you indicated in your email, there are 99 possible 56-game chunks for a 56-game hitting streak in a 154-game schedule. This is derived by 154 - 56 + 1 (total games - streak games + 1). For example, the number of nine-game streaks in a ten-game period is two:
HHHHHHHHHN
NHHHHHHHHH
(where N = no hit and H = hit).
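The chunk-counting rule (total games - streak games + 1) is easy to check by listing the chunks. A small sketch, with function names that are mine and purely illustrative:

```python
def window_count(total_games: int, streak_length: int) -> int:
    """Number of consecutive-game chunks: total games - streak games + 1."""
    return max(total_games - streak_length + 1, 0)


def windows(total_games: int, streak_length: int) -> list[tuple[int, int]]:
    """List each chunk as (first game, last game), numbering games from 1."""
    return [
        (start, start + streak_length - 1)
        for start in range(1, total_games - streak_length + 2)
    ]


print(windows(10, 9))         # the two nine-game chunks in a ten-game period
print(window_count(154, 56))  # 56-game chunks in a 154-game season
print(window_count(162, 44))  # 44-game chunks in a 162-game season
```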
In DiMaggio's case, he only played 139 games, so the number of possible 56-game chunks is 139 - 56 + 1 = 84. So P(D) is the probability that DiMaggio will have a 56-game hit streak in 1941. P(D) = P(C) * 84, on the reasoning that the probability that at least one of several events occurs equals the sum of their individual probabilities. (Strictly, that holds only when the events cannot happen together; overlapping 56-game chunks can, so the sum overstates the true figure.)
By that reckoning, the probability of DiMaggio getting a 56-game hit streak in 1941 was 0.000964376 or 0.0964376% or about 1 in 1037.
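Since the 84 chunks share games with one another, it is worth setting the sum beside an exact calculation under the same assumptions (every game an independent trial with the same P(B)). The sketch below is my own addition, not part of the email and not Freiman's method: it tracks the chance of sitting on a streak of each length, game by game, and adds up the chance of ever reaching 56. The function names are hypothetical.

```python
def union_sum(p_game: float, streak: int, games: int) -> float:
    """The estimate used above: P(C) times the number of chunks."""
    chunks = games - streak + 1
    return chunks * p_game ** streak


def prob_streak_at_least(p_game: float, streak: int, games: int) -> float:
    """Exact chance of at least one hit streak of `streak` games in `games`
    games, assuming independent games with the same per-game hit chance."""
    # state[k]: chance the player is on a k-game streak right now
    # and has not yet completed a streak of the full length.
    state = [1.0] + [0.0] * (streak - 1)
    reached = 0.0
    for _ in range(games):
        reached += state[-1] * p_game              # a hit today completes the streak
        nxt = [sum(state) * (1.0 - p_game)]        # a hitless game resets to zero
        nxt += [s * p_game for s in state[:-1]]    # a hit extends the streak by one
        state = nxt
    return reached


p_game = 0.81617846  # P(B) from this post
print(union_sum(p_game, 56, 139))
print(prob_streak_at_least(p_game, 56, 139))
```

With these inputs the exact figure comes out smaller than the sum over chunks, since the sum counts a season with hits in several overlapping chunks more than once.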
I thought it sounded pretty good. I got numbers and everything. But Carroll Zahn wrote me the following:
Unfortunately, Michael Freiman in the most recent Baseball Research Journal says that Dimaggio had only a 1 in 9545 chance of a 56 game streak in 1941. He has the math and lots of specific examples. Take a look. I[t] did not go through your analysis so I do not know where you differ from Freiman. Good luck.
The bastard! I checked out the article and Freiman starts with hits per plate appearance as the basis but instead of using fractional at-bats, he calculates the odds in 4 at-bats and 5 at-bats and prorates them. I have a couple of issues with his approach but they will have to wait for another day.