There was an interesting query circulating on the SABR online newsletter do-hickey. It reads as follows:
Subject: Hitting Streak Probabilities
I need some math help. I understand that to compute the chance of an n-game hitting streak you take the probability of a player getting a hit in a game (p) and raise it to the power of the streak you are interested in (n).
So how would you figure out the probability of a player having an n-game streak during a given season (and wouldn't that be a useful number to know)? Would it be by multiplying the chance of an n-game streak by the number of n-game stretches there are in a season?
For example, there are 99 56-game stretches in a 154-game season (games 1-56 through games 99-154), 119 44-game stretches in a 162-game season, and 162 1-game stretches in a 162-game season.
I understand enough to know that the probability of a player getting a Hit in a game is not the same as the number of hits per game a player averages. A player who plays 162 games and gets 162 hits averages 1 hit per game, but that doesn't mean the probability of him getting a hit in a given game is 1. Unfortunately, I don't understand enough to figure out the probability of a player getting a hit in a given game.
But I do know that even a player going 1/500 has a greater-than-zero probability of getting a hit in a given game (doesn't he?), and squaring this (for a two-game streak) is still greater than zero, and multiplying this by 161 (the number of two-game stretches in a season) is still greater than zero.
But for this player, wouldn't a 2-game streak need to have a probability of 0 for the formula to work in the "real world" (rather than theoretically)? And wouldn't having a 1-game streak somewhere in a season need to be the same probability (1) for a player who gets 1 hit in a season and player who gets 257? And don't these points contradict the idea that to get the probability of a player putting together an n-game streak in a season you multiply the chance of an n-game streak by the number of n-game stretches there are in a season?
So do we need to define the season formula so that you are not allowed to have an n when n is greater than the number of hits a player actually had? Or add something about the probability of a given hit-distribution of h-hits over n-games (there is zero probability of distributing one hit over two games)? Or what?
Please forgive any mathematical errors in the above. I was an English major.
Well I thought I'd try my hand and here's what I whipped up:
I was a Math major, so let's see if I can be of assistance.
Let's start from scratch: Assume H is the number of hits for a player in a year, and PA is the number of plate appearances in a year for the same player. The probability that a player gets a hit in a given plate appearance will be represented as P(A) and is defined as P(A) = H/PA.
We will assume that the sample space is defined by the probabilities based on the player's stats for a given year. This is preferable to a player's entire career stats since it is directly related to the player's ability at that point in time. It is also preferable to a day's or a week's stats, since those samples would be too small to represent the player's probabilities precisely.
Also, we base the probability on plate appearances, not at-bats, since the player could walk, get hit by a pitch, etc. A player who walks three times and gets a single in his one official at-bat had the same opportunity to record a hit as a player who went 1-for-4.
So the probability of a player getting a hit in a given game, which we will call P(B), is defined as P(B) = 1 - (1 - P(A))^pa, where "pa" is the number of plate appearances in a game (and "^" is the only way I can represent "to the power of" in an email). Basically, P(B) is derived by calculating the probability that the player does not get a hit (i.e., P(not B)) and subtracting it from 1 (which is the total probability of all outcomes, or 100%). Since P(A) is the probability that the player gets a hit in a given plate appearance, 1 - P(A) is the probability that he will NOT get a hit in a given plate appearance. Given that there are pa plate appearances for a player in a given game, (1 - P(A))^pa is the probability that the player does NOT get a hit in a given game (P(not B)). And P(B) = 1 - P(not B), or 1 - (1 - P(A))^pa.
For example, a player who gets a base hit in .250 of his plate appearances (not one who has a .250 batting average) and who comes to the plate four times has the following probability of getting a hit in that game:
P(B) = 1 - (1 - 0.250)^4 = 1 - 0.31640625 = 0.68359375
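The per-game formula is easy to check by machine. Here is a minimal Python sketch; the function name prob_hit_in_game is made up for illustration, and it assumes every plate appearance is independent with the same chance of a hit.

```python
def prob_hit_in_game(p_hit_per_pa: float, pa_per_game: float) -> float:
    """P(B) = 1 - (1 - P(A))^pa: chance of at least one hit in a game."""
    p_no_hit_in_game = (1.0 - p_hit_per_pa) ** pa_per_game  # P(not B)
    return 1.0 - p_no_hit_in_game


# The example from the text: a hit in .250 of plate appearances, four trips to the plate.
p_b = prob_hit_in_game(0.250, 4)
print(f"{p_b:.8f}")  # 0.68359375
```

Because pa_per_game is a float, the same function also accepts a fractional number of plate appearances per game.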
Now, in X consecutive games, what is the probability that the player will get a hit in each game? Let's call that P(C) which is P(B) ^ X. That is, the probability that he will get a hit in a given game raised to the number of games.
To illustrate why, let's use a coin toss. What is the probability that three coin tosses in a row will result in three heads (H)? Let's define the sample space, writing T for tails. Here are the possible results:
HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
There are eight possibilities but only one with three heads, HHH. So the answer is one out of eight. But how did we get eight anyway? Each coin toss has two equally likely results (H or T), and there are three tosses, so there are 2^3 = 8 possible sequences, each with probability (1/2)^3 = 1/8. That is, P(HHH) = P(H)^3.
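For anyone who wants to see the coin-toss sample space spelled out by machine, here is a small Python sketch. It assumes a fair coin and independent tosses, and the variable names are only illustrative.

```python
from itertools import product

# Every possible sequence of three tosses, H for heads and T for tails.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]
print(outcomes)

# Count the sequences that are all heads and compare with (1/2)^3.
all_heads = outcomes.count("HHH")
print(all_heads, "out of", len(outcomes))
print(all_heads / len(outcomes) == (1 / 2) ** 3)  # True
```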
So the probability that DiMaggio would get a hit in 56 straight games in 1941 was P(B)^56. From above P(B) = 0.81617846.
So the answer is 0.81617846^56 = 0.0000114806697177021, or about .0011%, or roughly one in 87,103.
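The arithmetic can be reproduced in a couple of lines of Python. This sketch simply takes the per-game figure of 0.81617846 quoted in the text as given and assumes each game is independent of the others.

```python
p_b = 0.81617846   # per-game hit probability, taken from the text
streak = 56        # length of the streak in games

# P(C) = P(B)^X: a hit in each of 56 specific consecutive games.
p_c = p_b ** streak
print(p_c)         # roughly 0.0000115
print(1 / p_c)     # roughly one in 87,000
```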
But that is based on any 56 games. The 1941 season was 154 games long, but DiMaggio played in only 139 of them, and since this is a personal stat, sitting out a game would not break the streak.
If we wanted to know the likelihood of his hitting in any 56 of those games, we would use combinatorics. But we want to know how many distinct runs of 56 consecutive games are possible, so we are left to fancy ciphering, as Jethro would say. As you indicated in your email, there are 99 possible 56-game chunks for a 56-game hitting streak in a 154-game schedule. This is derived by 154 - 56 + 1 (total games - streak games + 1). For example, the number of nine-game streaks in a ten-game period is two:
HHHHHHHHHN
NHHHHHHHHH
(where N = no hit and H = hit).
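The "total games - streak games + 1" rule can be sketched in Python as follows. The function name count_windows is made up for illustration.

```python
def count_windows(total_games: int, streak_games: int) -> int:
    """Number of distinct runs of streak_games consecutive games in total_games."""
    return max(total_games - streak_games + 1, 0)


print(count_windows(154, 56))  # 99
print(count_windows(139, 56))  # 84
print(count_windows(10, 9))    # 2

# The same count, listing the first and last game of each possible run.
runs = [(start, start + 9 - 1) for start in range(1, 10 - 9 + 2)]
print(runs)  # [(1, 9), (2, 10)]
```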
In DiMaggio's case, he played in only 139 games, so the number of possible 56-game chunks is 139 - 56 + 1 = 84. Let P(D) be the probability that DiMaggio will have a 56-game hit streak in 1941. P(D) = P(C) * 84, given that the probability that at least one of two events occurs equals the probability of one plus the probability of the other.
So the probability of DiMaggio getting a 56-game hit streak in 1941 was 0.000964376 or 0.0964376% or about 1 in 1037.
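One caution on that last step: adding probabilities is exact only when the events cannot happen together, and overlapping 56-game chunks can, so P(C) * 84 is best read as an upper bound. The Python sketch below reproduces the multiplication and, for comparison, tracks the current run of hit games one game at a time. The function name prob_streak_at_least is made up for illustration, and the sketch assumes independent games with the same P(B) every day; it is not Freiman's method.

```python
def prob_streak_at_least(n_games: int, streak: int, p: float) -> float:
    """Chance of at least one run of `streak` straight hit games in n_games."""
    # state[r] = probability the current run is r games long and no full streak has happened yet
    state = [0.0] * streak
    state[0] = 1.0
    done = 0.0  # probability a full streak has already been completed
    for _ in range(n_games):
        new = [0.0] * streak
        new[0] = (1.0 - p) * sum(state)      # a hitless game resets the run
        for r in range(streak - 1):
            new[r + 1] = p * state[r]        # a hit game extends the run
        done += p * state[streak - 1]        # a hit game completes the streak
        state = new
    return done


p_b = 0.81617846                  # per-game hit probability, taken from the text
additive = (p_b ** 56) * 84       # the multiplication used in the text
tracked = prob_streak_at_least(139, 56, p_b)
print(additive)                   # about 0.000964
print(tracked)                    # smaller than the additive figure
```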
I thought it sounded pretty good. I got numbers and everything. But Carroll Zahn wrote me the following:
Unfortunately, Michael Freiman in the most recent Baseball Research Journal says that Dimaggio had only a 1 in 9545 chance of a 56 game streak in 1941. He has the math and lots of specific examples. Take a look. I[t] did not go through your analysis so I do not know where you differ from Freiman. Good luck.
The bastard! I checked out the article and Freiman starts with hits per plate appearance as the basis but instead of using fractional at-bats, he calculates the odds in 4 at-bats and 5 at-bats and prorates them. I have a couple of issues with his approach but they will have to wait for another day.