Does reliever over-use lead to poor subsequent performance?

by Mitchel Lichtman
December 19, 2007

In a Nov. 30 article in the Philadelphia Daily News by Paul Hagen, Braves senior advisor and former big-league manager Jim Fregosi had this to say about relievers who have a good year followed by a bad year or vice versa:

The biggest reason is, when they have a good year, they’re overused. When they have a bad year, they’re not used at all. So then they can come back and have a good year. It’s that simple. They just get tired from overuse. If they’re in 80 games and they warm up 120 times, that’s a lot. The only one [teams] really take care of in the bullpen anymore is the closer. They always bring him in to start the ninth and if he pitches 50, 60 innings, that’s where he’s at. He’s in 60 games where you have a chance to win.

What Fregosi is suggesting, a view which is apparently (according to the article) shared by other baseball “insiders,” is that relievers who throw a lot of innings in one year will suffer an “over-use effect” such that they tend to perform worse the next year, and relievers who pitch comparatively fewer innings will tend to be “fresh” or perhaps less prone to injury in the subsequent year, and will tend to pitch better.

Part of Fregosi’s theory is that pitchers who pitch a lot of innings in any given year tend to be having a good year and pitchers who pitch fewer innings in any one year tend to be having a bad year. In other words, if you have been pitching well, you tend to get “pushed” by your manager, and if you have been pitching poorly, you tend to languish in the pen. That part makes sense and is in fact true.

Let’s say that at the All-Star Break, you are a pitcher with great numbers thus far and have thrown 40 innings. In the second half, you will probably get lots of chances to pitch and will likely amass a season total of 80 or 90 innings, depending on your role, your team’s needs, etc. If you pitched poorly for the first half of the season, you will tend to get fewer innings in the second half and you might even be released or sent to the minors.

In any case, there is no doubt about two things: Relievers who throw a lot of innings in any one season will tend to have had good seasons and they will tend to be better pitchers in general than their low-inning counterparts. The reason for the second part is that if you are already an established good pitcher, you will tend to throw a lot of innings in any given year even if you are not having a particularly good year. The same is true, in reverse, if you are an established mediocre or poor pitcher.

Now, let’s take a look at what we have so far. First, we have a group of pitchers who are good, are having a collectively good season, and who have thrown a lot of innings in year X, all three of these things being related. Next, we have another group of pitchers who have not thrown that many innings in year X. These guys will tend to be poorer pitchers in general, and they will tend to be having a poor year collectively, at least poorer than the pitchers who have thrown lots of innings.

What happens next year, in year X+1, to both groups? Fregosi and other baseball people have observed, and correctly so, that the pitchers who threw a lot of innings the previous year and had a collectively good ERA, see an increase in their ERA the subsequent year. Likewise, pitchers who threw fewer innings and had a relatively poor ERA in year X will show a decrease or not as marked an increase in their ERA in year X+1. Why is that?

As those of you who know at least a little about statistics or sabermetrics have probably already surmised, any group of pitchers who show a better than average (for them) ERA in one year is expected to display a worse ERA in any other year (with a few caveats of course), including the subsequent one. Similarly, pitchers who sport a worse than average (again, for them) ERA in any one year (or any time period) are expected to present a better ERA in any other time period, assuming that their true talent or the context has not changed much due to age, injury, park effects, etc. The reason is regression toward the mean. If you are not familiar with that concept, I suggest you Google it or break out a college-level book on statistics.

Of course, Fregosi and his ilk likely have no idea what regression toward the mean is (with no disrespect intended toward them). All they see are pitchers that throw a lot of innings in one year having their ERAs climb the next year and vice versa for pitchers who throw fewer innings in year X (or so they say). So they naturally conclude that a large number of innings thrown in any one year causes a decline in performance the next year, and that fewer innings pitched causes an increase in performance the next year. In fact, in the Fregosi quote above, he says, “It’s that simple.” If only it were.

Actually, their conclusions might be valid if the year X ERA’s for both groups of pitchers were around the same. In that case, there might be either little regression toward the mean (if those ERA’s were already close to the mean) for both groups, or the regression might be about the same. However, even then (if both groups’ year X ERA were the same), we still might not expect both groups to have the same ERA in year X+1, absent any “use effects.” It might be that each group regresses toward a different mean or it might be that each group regresses a different amount towards the same mean due to the fact that they throw different numbers of innings (remember from Sabermetrics 202, that the amount we regress a player’s performance toward the mean in projecting future performance is, among other things, a function of the sample size, in this case, innings, of that player’s sample performance).

For example, let’s say that the group of relievers who throw lots of innings in any given year are part of a population of pitchers who have a mean ERA of 3.50 and that the group of pitchers who throw relatively few innings in a season are part of a population of pitchers who have an average ERA of 4.00. Well, even if we sample these two groups or some subset of these two groups in any given year and both groups happen to have the same ERA in year X, they would not be expected to have the same ERA in year X+1, because their ERA in year X+1 would be their year X ERA regressed toward a different mean.

An example of the second phenomenon, whereby both groups’ ERA was the same in year X and they both come from a population of pitchers with the same true (mean) ERA, would be if group A, the low innings pitchers, averaged 40 innings each, and group B, the high IP relievers, averaged 80 innings each. In that case, we would regress group B pitchers less than group A pitchers. So, for example, if both groups had an ERA of 3.50 in year X, but their mean ERA was 4.00, group A would have an ERA in year X+1 of, maybe 3.90 and group B, maybe 3.80. Group A is getting regressed more toward 4.00 than group B because they had fewer innings pitched per player in year X.
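Both examples come down to one line of arithmetic. Here is a minimal Python sketch of it. The function name is made up for illustration, and the 80% and 60% regression fractions are simply the values implied by the "maybe 3.90" and "maybe 3.80" figures above, not estimates from any data.

```python
def regress(observed_era, mean_era, fraction):
    """Move observed_era the given fraction of the way toward mean_era."""
    return observed_era + fraction * (mean_era - observed_era)

# Second example: same 3.50 ERA in year X, same 4.00 population mean,
# but the 40-inning group is regressed more than the 80-inning group.
group_a = regress(3.50, 4.00, 0.80)  # low IP, assumed 80% regression
group_b = regress(3.50, 4.00, 0.60)  # high IP, assumed 60% regression
print(round(group_a, 2), round(group_b, 2))  # 3.9 3.8

# First example: same year X ERA and same fraction, different means.
# The 3.75 ERA and the 50% fraction are made-up inputs for illustration.
toward_350 = regress(3.75, 3.50, 0.50)
toward_400 = regress(3.75, 4.00, 0.50)
print(toward_350, toward_400)  # 3.625 3.875
```

In both cases the two groups start from the same year X ERA and end up with different expected ERAs in year X+1, with no "use effect" anywhere in the calculation.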

So basically there are several regression-toward-the-mean phenomena that would cause relievers who threw a lot of innings in year X to have a different change in performance from one year to the next than pitchers who threw fewer innings in year X, and these phenomena would have nothing to do with a “use effect.” In fact, because of regression toward the mean, it is indeed a “simple fact” that pitchers who throw lots of innings in one year will tend to see their ERA climb in the next year and pitchers who throw fewer innings in year X will tend to see their ERA go down in year X+1, or at least not rise as much, regardless of whether there is a “use effect” or not. Of course that does not mean that there isn’t such an effect also, only that we would still see a phantom “use effect” even if none existed, due to regression toward the mean.

The study

I looked at all relief pitchers who pitched from about 1970 to 2007. I only included those who had no starts in back-to-back years. First I split each pitcher included in the study (back-to-back years with no starts) into two groups: One, less than 70 innings in year X (and any number of innings in year X+1), and two, 70 or more innings in year X. There were 3113 pitcher “seasons” (back-to-back seasons actually, with some of them overlapping) in the low innings group (with 1168 unique pitchers) and 1171 in the high innings group (451 unique pitchers).

Group I, the low innings pitchers, averaged 33.7 innings per pitcher in year X and 35.5 in year X+1. In group II, the high innings pitchers, each pitcher averaged 86.7 innings in year X and 64.9 innings in year X+1.

The year X collective ERA for group I was 4.06, and their ERA in year X+1 was 3.97. So it appears that as a group, relievers who throw fewer than 70 innings in a season (average of only 33.7 innings) either get a little better the next year or regress toward some mean which is lower than their year X ERA, or a combination. If you know nothing about the phenomenon of regression toward the mean, you would have to conclude that these pitchers simply “got better” in year X+1. You might also conclude that the “reason” they got better was that they threw so few innings in year X.

Now, just to be clear, even with regression toward the mean, I am not saying that these pitchers did not pitch better in year X+1 than in year X. They certainly did. It is just that if regression toward the mean were the only “force” at work here, their performance in year X would have been, by definition, worse than their true talent —in other words, these pitchers as a whole, or collectively, were a little unlucky in year X and then regressed to or at least toward their true talent in year X+1.

The high innings, or group II, pitchers had an ERA of 3.17 in year X and 3.46 in year X+1. So they appeared to pitch worse in year X+1, and they did. Again, at face value, and if you knew nothing about regression toward the mean, you might conclude that they “got worse” in year X+1, and you might further conclude that the reason they got worse was that they pitched so many innings in year X (86.7, which is indeed a lot for a reliever) and suffered from some kind of “over-use effect.”

           N pitchers        N pitcher seasons     Avg. IP Year X  Avg. ERA Year X  Avg. IP Year X+1    Avg. ERA Year X+1
Group I    1168              3113                  34              4.06             36                  3.97
Group II   451               1171                  87              3.17             65                  3.46

However, if you are familiar with the concept of regression, you would know that if you sample a group of pitchers and you find that their collective ERA is less than that of their true mean (the mean of the population of pitchers from whence they come), they will automatically revert back towards that mean and appear to get worse in another sample (in this case, the subsequent year). Likewise, if your sample of pitchers has a collective ERA above their true mean, they will also revert back towards that mean and appear to get better in another sample.

The average ERA of all of our relievers (those who had back-to-back years with no starts in either year) in the years sampled was 3.89. That is approximately the population mean of both groups combined, although not necessarily the mean of each group separately (one group is a sample of the population of relievers that pitches a lot of innings, and the other is from a population of pitchers who throw fewer innings; they probably have different true mean ERAs).

Assuming that that is the mean to regress towards, it is no wonder that the group I pitchers decreased their ERAs in year X+1 from 4.06 to 3.97, or 53% toward the mean, and group II pitchers increased their ERA’s from 3.17 to 3.46, or 40% toward the mean. In fact, since group II pitchers threw more than twice as many IP as group I pitchers (86.7 to 33.7) in year X, we would expect that they would regress less, which in fact they did, 40% versus 53%. (Remember that the larger the sample, the less you expect to regress in another sample of the same players’ performance.) There is no evidence here of a “use effect.” Both groups appear to have pitched pretty much as we might have expected in year X+1.
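For anyone who wants to check the percentages, here is a short Python sketch of the calculation. The function name is made up for illustration; the inputs are the ERAs quoted above and the 3.89 combined mean.

```python
def fraction_regressed(era_x, era_x1, mean_era):
    """Share of the gap between year X ERA and the mean that closed in year X+1."""
    return (era_x1 - era_x) / (mean_era - era_x)

MEAN_ERA = 3.89  # combined mean for all relievers in the sample

group_1 = fraction_regressed(4.06, 3.97, MEAN_ERA)  # low innings
group_2 = fraction_regressed(3.17, 3.46, MEAN_ERA)  # high innings
print(f"{group_1:.0%} {group_2:.0%}")  # 53% 40%
```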

However, while the direction and approximate magnitude of the results were in fact expected due to this phenomenon of regression toward the mean, that does not answer the question of whether there is also some small “use effect,” such that in the subsequent year, the high innings pitchers might pitch a little worse than expected, after accounting for regression, and that the low innings pitchers might pitch a little better.

Let’s see if we can answer the question of whether there is in fact any small but significant “use effect” with regard to reliever innings in any one season. (Note: I say “significant” because trying to disprove the existence of a very small effect in a sample of data where the sample size is less than infinite is fruitless. Even if the analysis of the data indicates “no effect,” because of sampling error, you can never eliminate the possibility that there is actually a small effect and you made what statisticians call a Type II error. And the smaller the effect you are trying to rule out, the more likely it is that you made a Type II error in ruling it out.)

The first thing we have to do is correct a methodological problem in collecting the above data. Now, I didn’t just realize that I made an error. I wanted to simplify the issue at the outset of this article, so I purposely used a shortcut in organizing and presenting the data.

Looking at the collective ERA of a sample of pitchers in one year and then looking at those same pitchers’ collective ERA in another year is not quite the right way to organize the data for an analysis of this sort. We really need to weight the data such that each pitcher has the same weight in each year. For example, let’s say that we have only two pitchers in our sample, and that one pitcher has an ERA in year X of 3.00 in 80 innings and the other pitcher is 5.00 in 10 innings. Now let’s say that in year X+1, our 3.00 ERA pitcher is still 3.00 but only has 10 innings and our 5.00 pitcher is still 5.00 but with 80 innings. Nothing has changed, other than each pitcher’s IP from year X to year X+1. They both pitched exactly the same in both years, rate-wise. If we were trying to ascertain whether something affected our pitchers such that their collective performance should have gotten better or worse, we would have to conclude that there was no such effect, right (disregarding sample size issues)?

But, using my initial method of organizing and presenting the data, the collective ERA of our two pitchers in year X would be 3.22 (the 3.00 pitcher has many more IP), and in year X+1 it would be 4.78, and we might conclude that something caused our pitchers to perform a lot worse! The problem of course is that we are weighting our pitchers differently in each year. When doing a study where you are comparing collective performances of players from one time period to another, such as when you are doing aging studies, you have to weight each pitcher in each sample of performance the same.
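The two-pitcher example is easy to verify. This is a minimal Python sketch; `pooled_era` is a name made up for illustration, and it computes the collective ERA the way the first table did, weighting each pitcher by his innings in that year.

```python
def pooled_era(pitchers):
    """IP-weighted collective ERA for a list of (era, innings) pairs."""
    total_ip = sum(ip for _, ip in pitchers)
    return sum(era * ip for era, ip in pitchers) / total_ip

year_x = [(3.00, 80), (5.00, 10)]   # good pitcher gets most of the innings
year_x1 = [(3.00, 10), (5.00, 80)]  # same ERAs, innings swapped

print(round(pooled_era(year_x), 2), round(pooled_era(year_x1), 2))  # 3.22 4.78
```

Neither pitcher changed, yet the collective ERA rose by more than a run and a half purely because the weights changed.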

Another way of stating that is that you have to look at the difference in performance for each pitcher from one sample to the other and weight that by the harmonic mean of the two IP samples (the number of IP in year X and the number of IP in year X+1), and then take the weighted average of all those values. The result is the average change in performance of all your players, which is exactly what we want to look at in this study.

The harmonic mean of two values is basically the lesser of the two values “stretched” a bit towards the larger value. (The harmonic mean of two or more equal values is always equal to the regular mean or simple average of the values.) For example, the harmonic mean of 10 and 100 is 18.2. The harmonic mean of 20 and 40 is 26.7. The actual formula for the harmonic mean of two numbers, x and y, is 1/((1/x+1/y)/2), which is also 2*x*y/(x+y), or in English, “The inverse of the simple average of the inverse of both numbers.” The method for computing the harmonic mean of more than two values is the same.
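Here is a minimal Python sketch of the harmonic mean and of the weighted average change described above. The function names and the tuple layout are made up for illustration, and every pitcher is assumed to have more than zero innings in both years.

```python
def harmonic_mean(x, y):
    """Harmonic mean of two positive numbers: 2xy / (x + y)."""
    return 2 * x * y / (x + y)

def average_change(pitchers):
    """Average year-to-year ERA change, weighted by harmonic mean of IP.

    pitchers: list of (era_x, ip_x, era_x1, ip_x1) tuples.
    """
    weights = [harmonic_mean(ip_x, ip_x1) for _, ip_x, _, ip_x1 in pitchers]
    deltas = [era_x1 - era_x for era_x, _, era_x1, _ in pitchers]
    return sum(w * d for w, d in zip(weights, deltas)) / sum(weights)

print(round(harmonic_mean(10, 100), 1), round(harmonic_mean(20, 40), 1))  # 18.2 26.7

# The two-pitcher example: ERAs unchanged, innings swapped.
two_pitchers = [(3.00, 80, 3.00, 10), (5.00, 10, 5.00, 80)]
print(average_change(two_pitchers))  # 0.0
```

With this weighting the two-pitcher example correctly shows no change, because each pitcher carries the same weight in both years.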

If we use this more correct method in our reliever problem, we get the following results:

All relievers with back-to-back seasons and no starts in either season:

Year X: 3.60
Year X+1: 3.88

Why do we see such a decrease in performance among these relievers? Typically pitcher performance does not change much from year to year, as some get better with age and experience, and others get worse. We might see a slight decrease due to the fact that the league itself generally gets a tad better each year.

However, we have selective sampling here, or a “drop-out” effect. Relievers who throw back-to-back years with no starts are not representative of relievers, or even pitchers, in general. Here is what is happening: In year X, you have a group of exclusive relievers. Let’s say that their collective true ERA is 3.88, and in fact, each individual pitcher has a true ERA of 3.88 also. Some of them will pitch better than that and some will pitch worse. Some of the worse ones will not pitch the next year. They will retire, they will get sent to the minors, etc.

Of course, some of the ones who pitch well in year X will also not pitch the next year, but more of the bad ones will drop out than the good ones. So what do we have left in year X+1? We have a group of pitchers who pitched in year X and year X+1 (the exact group of pitchers we sampled for this study) and who got a little lucky, on the average, in year X (whenever you remove a group of unlucky players from an unbiased sample of players, you will always be left with a group of lucky ones, and vice versa). In fact, the collective actual ERA (as opposed to their true ERA) of these pitchers, once we know that they didn’t drop out in year X+1, is going to be less than their true ERA, hence the 3.60 and 3.88.
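A small simulation shows the effect. This Python sketch is a toy model and not the data from the study: every pitcher has a true ERA of 3.88, his observed ERA each year is that plus random noise, and anyone whose year X ERA is above a cutoff drops out. The noise level, the cutoff, and the hard drop-out rule are all assumptions chosen for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate_dropout(n=20000, true_era=3.88, noise_sd=1.0, cutoff=4.50, seed=1):
    """Return (year X ERA, year X+1 ERA) averaged over pitchers who survive."""
    rng = random.Random(seed)
    survivors_x, survivors_x1 = [], []
    for _ in range(n):
        era_x = rng.gauss(true_era, noise_sd)   # same talent, different luck
        era_x1 = rng.gauss(true_era, noise_sd)
        if era_x <= cutoff:                     # assumed rule: bad years drop out
            survivors_x.append(era_x)
            survivors_x1.append(era_x1)
    return mean(survivors_x), mean(survivors_x1)

era_x, era_x1 = simulate_dropout()
print(round(era_x, 2), round(era_x1, 2))
```

The survivors' year X ERA comes out well below 3.88 while their year X+1 ERA sits close to it, even though nobody's true talent changed.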

In fact, if we look at exclusive relievers (no starts) who pitched in year X but not in X+1 (these pitchers are not part of our sample), we find that they had an ERA of 4.96. So our theory of selective sampling is correct. There were actually more relievers who did not pitch in year X+1 (2,002 of them) than who did (1,619), although the 2,002 who dropped out averaged only 20.9 innings in year X. These relievers were a combination of bad and unlucky.

So right off the bat, we expect that because of selective sampling, the ERA of both groups of pitchers, the ones with high innings pitched in year X and the ones with low innings pitched in year X, is going to increase, without accounting for regression toward the mean (although this is a form of regression) and without accounting for a possible “use effect.” In fact, this is exactly what we see when we look at the group I and II data, using our second, better, method of organizing and presenting the data:

Group I (low IP)

Year X: 4.04
Year X+1: 4.20

Group II (high IP)

Year X: 3.11
Year X+1: 3.54

Interestingly, for Fregosi, or anyone else, to say that relievers who throw fewer innings in one year will appear to improve the next year, is disingenuous, at least according to the parameters of the above model and the data set I am using. It is still true, however, that the high innings pitchers appear to pitch worse in year X+1. But again, we expect that on two fronts. One, that these pitchers were “lucky” in year X, and that is at least partially why they pitched so many innings, and two, that our selective sampling process (“drop-out” effect), as with the low innings pitchers, eliminated some of the unluckier pitchers in year X (they did not pitch in year X+1), such that the remaining sample was a little luckier still. In other words, the high innings pitchers, group II, were a little lucky in year X for two reasons, and were destined to regress toward their true, higher ERA, in year X+1.

Still, how can we determine whether there may be a “use effect” as well? That is turning out to be not so easy given all of the selective sampling and regression issues that are wreaking havoc with our data. Let’s see if we can set up some kind of a controlled “experiment” whereby we compare pitchers who pitch a lot of innings in year X to those who pitch fewer innings, while holding constant as much else as we can, such as prior year’s or years’ ERA and prior IP, both of which can be proxies for true talent ERA.

Here is what I did. I restricted the study to relievers who had three consecutive years with no games started in any of the three years (exclusive relievers). I then split them into two groups as before: one group pitched fewer than 70 innings in year X and the other group 70 or more innings in year X. This time, however, I only looked at pitchers who threw fewer than 70 innings in year X-1 and had an ERA of between 3.50 and 4.50 in year X-1. Presumably the pitchers in the two groups were essentially the same pitchers, with around the same true talent ERA. In fact, looking at all the data for the 3 years, the only difference between the two groups appears to be the number of innings pitched in year X. Well, there was one more difference. In year X+1, the group that pitched more innings in year X (group II) pitched a few more innings in year X+1 than the other group (group I). And of course, as you will see and as we would expect, the ERA’s of both groups in year X+1 are not nearly the same. (In the interest of full disclosure, the pitchers in group I averaged 31.6 years old in year X, and in group II, only 30.4.)

Let’s look at some of the data:

           N pitchers        N pitcher seasons     Avg. IP Year X-1  Avg. ERA Year X-1  Avg. IP Year X  Avg. ERA Year X
Group I    199               254                   47                3.99               41              4.04
Group II   66                69                    53                3.94               80              3.16

So what we have are very similar pitchers in year X-1. They have almost identical ERA’s, and numbers of innings that are very close. Group II pitchers pitched a few more innings in year X-1 (six per pitcher) and pitched a little better (by .05 runs in ERA). Either they are slightly better pitchers, or had a slightly better year by luck alone (and that is why they pitched a few more innings), or some combination thereof, probably the latter. In either case, it doesn’t make much difference since the numbers are so close. Essentially we have pitchers who are around 3.90 to 4.00 in true talent ERA. (Their true talent ERA might be a little higher due to the selective sampling process I wrote about earlier, namely that the luckier pitchers tend to go on to pitch another year, in this case year X. So the 3.99 and 3.94 might represent a little bit of luck.)

As you can see from the above chart, group II pitchers threw a lot of innings in year X, despite having thrown comparatively few innings in year X-1, and despite having a mediocre to average season in year X-1 (ERA of 3.94). What does that tell us? It tells us that they probably pitched so many innings, i.e. they were “pushed,” because they were having a great year (and ended up with a great ERA of course). There is little prior evidence that these pitchers were actually great pitchers, regardless of their great year. If there were, their prior or year X-1 ERA would be excellent as well.

It is possible of course, in fact likely, that these pitchers as a whole improved their true talent, but that is not important right now. (Whenever a player has a performance that is better than his prior performance, and hence our prior estimate of his true talent, we must assume that two things have necessarily occurred—one, that the player was “lucky” and two, that the player’s true talent improved.)

Now, this is a perfect opportunity to test Fregosi’s theory. We have two essentially equal groups of pitchers. One had a great year in year X and was ridden hard, amassing 80 innings per pitcher, almost 30 more than the previous year. The other group had a typical season and thus amassed a typical number of innings (for them), around 45. This is exactly the scenario that Fregosi talks about. Not only do we have a group of 66 pitchers (in 69 pitcher seasons) who were used a lot (80 innings), but we also have a group of relievers who were not used to being worked so hard, at least as far as we can tell from their prior year workload (53 innings).

So what happened in year X+1? This is the $64,000 question. Before I reveal the answer, let’s not assume that if there is no “use effect,” that group II will revert back to their year X-1 level (3.94 ERA). Remember I told you that when a group of pitchers posts a better than average ERA in any year, their expected or projected ERA will change for the better, even though a portion of that better performance was by definition, luck-driven.

Let’s look at the group I pitchers first: the ones who only threw 41 innings each in year X. Their year X-1 ERA was 3.99 and their year X ERA was 4.04. They pitched a little worse in year X and that is probably one reason why they were not given such a heavy workload. There is also the selective sampling issue that I discussed earlier (only the slightly luckier pitchers get to pitch another year). And, keep in mind that the average ERA for all exclusive relievers in the years of this study is 3.86. That is a number that all of our relievers will tend to regress toward.

So we have a 3.99 group of relievers in year X-1. They will tend to regress toward 3.86 in year X+1, but because of selective sampling (the “drop-out” effect discussed earlier) and because we chose them because they were not worked so hard (and therefore they probably were having relatively poor seasons), they probably would also appear to get worse in year X. The net result? A 4.19 ERA in year X+1.

In table form, here is what we see for group I pitchers:

Avg. IP Year X-1  Avg. ERA Year X-1  Avg. IP Year X  Avg. ERA Year X  Avg. IP Year X+1  Avg. ERA Year X+1
47                3.99               41              4.04             40                4.19

If we take a weighted average of the year X-1 and year X ERA’s, we get 4.02. If we regress that 80% (which is appropriate for a total of 80 IP) toward a mean of 3.86, we get a “projection” of 3.89. However, there appears to be an increase in ERA of around .2 runs for all pitchers from year X to X+1 due to the “drop-out” effect and aging (pitchers with at least two years under their belt tend to be older pitchers). Since these pitchers will tend to have more “drop-outs” than the group II pitchers, we will increase their ERAs by .3 runs. So that gives us an expected ERA in year X+1 of 4.19. We see an ERA of 4.19, exactly what we expected! There certainly is no evidence that these pitchers benefited from only pitching 40 innings in year X.

Finally, let’s look at the group II pitchers, the ones who were ridden hard in year X because they were having great years: a collective ERA of 3.16.

We’ll do a “Marcel-type” projection for them, as we did with the group I pitchers. In year X-1, they had an ERA of 3.94 for 53 innings each. In year X, a 3.16 ERA with an average of 80 innings pitched. Using the same 2/1 weights we used for the group I pitchers, that is a weighted average of 3.35. If we regress that 70% toward the mean (we have a bigger sample), we get a projection of 3.71. In this group we don’t get as much drop-out, so we will only add .1 run to that, for a final projection of 3.81 in year X+1. What do we actually see? 3.62, .19 runs better than our projection. Again, there is no evidence of any “over-use” effect. In fact, if anything, these pitchers appear to pitch slightly better than we expected, barring any “use-effect” at all.

Here are the group II pitchers in table form:

Avg. IP Year X-1  Avg. ERA Year X-1  Avg. IP Year X  Avg. ERA Year X  Avg. IP Year X+1  Avg. ERA Year X+1
53                3.94               80              3.16             56                3.62
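The two “Marcel-type” projections can be reproduced with a few lines of Python. This is a sketch under stated assumptions and not the exact procedure used for the study: the function name is made up, and the weighting (each season weighted by its innings, with year X counted twice as heavily as year X-1) is an inference from the fact that it returns the 4.02 and 3.35 weighted averages quoted above. The regression fractions, the drop-out adjustments and the 3.86 mean are the values given in the text.

```python
def marcel_type_projection(era_prev, ip_prev, era_cur, ip_cur,
                           regress_fraction, dropout_adjustment,
                           mean_era=3.86):
    """Return (weighted ERA, regressed ERA, final projected ERA)."""
    # Assumed weighting: innings times a 1/2 recency weight.
    w_prev, w_cur = ip_prev * 1, ip_cur * 2
    weighted = (era_prev * w_prev + era_cur * w_cur) / (w_prev + w_cur)
    # Regress the weighted ERA toward the mean for all exclusive relievers.
    regressed = weighted + regress_fraction * (mean_era - weighted)
    # Add the drop-out and aging adjustment from the text.
    return weighted, regressed, regressed + dropout_adjustment

group_1 = marcel_type_projection(3.99, 47, 4.04, 41, 0.80, 0.3)
group_2 = marcel_type_projection(3.94, 53, 3.16, 80, 0.70, 0.1)

print([round(v, 2) for v in group_1])  # [4.02, 3.89, 4.19]
print([round(v, 2) for v in group_2])  # [3.35, 3.71, 3.81]
```

Set against the actual year X+1 ERAs of 4.19 and 3.62, group I lands on its projection and group II beats its projection by about .19 runs.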

So, we have learned several things today. One, Fregosi’s observations are not even correct. Relievers who throw few innings in any one year do not see their ERA decrease the next year, at least to the extent of the parameters in this study. Two, while he is correct that relief pitchers who are worked hard during a season tend to see their ERA’s increase markedly (in our case, almost half a run), such an increase is fully expected due to two things – regression towards the mean and a “drop-out” or selective sampling effect, such that any pitcher who is allowed to pitch a subsequent year will have had a tendency to have gotten a little lucky in the current year, the same problematic phenomenon we see in aging studies. Finally, if there is a small “use-effect” such that relievers who are worked hard tend to suffer in subsequent seasons as compared to relievers who don’t throw as many innings, it is not evident from the data in this study.
