Should we even try to predict future runs allowed for relievers?

A few weeks ago, I made this statement about predicting ERA or runs allowed (RA9) for relievers:

I’m more concerned with projecting starters than relievers. Relievers are one of the most, if not the most, volatile positions in baseball. Turnover at the position is pretty high, and simply using strikeout percentage seems to work really well, by itself, in predicting future runs allowed for relievers.

I made that statement based on a small amount of research, which was not enough evidence to support my argument about projecting relievers into the future.

Obviously, major league teams want to predict future reliever performance, or at least attempt to predict it. However, using just one season’s worth of innings for a reliever is probably never a good idea. The average healthy reliever’s workload ranges between 30 and 80 innings per season, which is a ridiculously small sample. Using that small of a sample to predict runs allowed in the subsequent season, a sample of similar size, is just not the best idea.

Over the last few weeks, I’ve been working on re-weighting the popular fielding independent pitching (FIP) formula to make it a more predictive statistic. When creating those weights, I concerned myself only with starting pitchers and came up with this equation for predictive FIP (pFIP for short):

(17.5*HR + 7*BB – 9*K)/PA + Constant

I expected the weights for this statistic to be different when controlling for just relievers, as opposed to starting pitchers. So, for this article I set out to re-weight pFIP into a form that was most predictive for relievers.
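As a quick illustration, the starter formula above can be wrapped in a small Python function. The stat line below is made up, and the default constant is only a placeholder, since the article gives a constant (~5.00) only for the reliever version:

```python
# Starter pFIP: (17.5*HR + 7*BB - 9*K)/PA + constant.
# The constant defaults to 5.0 as a placeholder; the article only
# reports ~5.00 for the reliever version of the formula.
def pfip_starter(hr, bb, k, pa, constant=5.0):
    return (17.5 * hr + 7 * bb - 9 * k) / pa + constant

# Hypothetical starter season: 20 HR, 50 BB, 180 K over 800 PA
print(round(pfip_starter(20, 50, 180, 800), 2))  # -> 3.85
```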

The Study

I took a sample of relievers who had at least 30 innings pitched in Year x and at least 30 innings in Year x+1, over the years 2007 to 2012, and regressed their Year x predictors against their runs allowed in Year x+1. The predictors I tested were:

- Strikeouts minus walks (K-BB)
- Strikeout percentage (K%)
- Fielding independent pitching (FIP)
- Expected fielding independent pitching (xFIP)
- Skill-interactive ERA (SIERA)
- pFIP

To interpret the results, please read this brief overview of the regression measures:

The r-squared tells us the percent variation in RA9 in Year x+1 that is explained by variation in the predictor in Year x.

The root-mean-square error (RMSE) of the estimate also tells us about the strength of the predictor. It works somewhat like a standard deviation; thus, the lower the RMSE, the better the model.
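A minimal numpy sketch, with made-up numbers, of how both measures are computed for a simple one-predictor regression:

```python
import numpy as np

# Made-up toy data: a Year x predictor vs. Year x+1 RA9 for six pitchers.
predictor = np.array([3.2, 4.1, 2.8, 5.0, 3.6, 4.4])
ra9_next = np.array([4.0, 4.5, 3.1, 4.8, 4.2, 3.9])

# Fit a simple linear regression and recover the fitted values.
slope, intercept = np.polyfit(predictor, ra9_next, 1)
fitted = slope * predictor + intercept
residuals = ra9_next - fitted

# r-squared: share of Year x+1 variance explained by the Year x predictor.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((ra9_next - ra9_next.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# RMSE: typical prediction error, in runs (lower is better).
rmse = np.sqrt(np.mean(residuals ** 2))
```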

Here are the results (n=624):

Predictor   r^2     RMSE
pFIP        8.19%   1.267
K%          7.73%   1.270
SIERA       6.99%   1.275
FIP         6.80%   1.277
xFIP        6.34%   1.280
K-BB        5.99%   1.282

My starter version of pFIP was the most predictive of the statistics tested; however, this was not the most interesting thing about these results.

The idea that, on a year-to-year basis, relievers’ future runs are very difficult to predict is backed by the r-squareds found in this sample. For each predictor, over 90 percent of the variance in future runs is left unaccounted for.

These results also back the idea that strikeout percentage predicts future runs as well as anything else. Simply dividing strikeouts by batters faced was the second most predictive statistic, beating out more complicated and accepted measures like SIERA and xFIP.

The most interesting finding, though, centered on relievers and walks. I’ve shown that when you control for starting pitchers, a simple predictor of strikeouts minus walks does very well, but here, when controlling for relievers, it is the worst predictor.

When comparing strikeouts divided by batters faced (K/PA) and strikeouts minus walks divided by batters faced ((K-BB)/PA), these results show that, for relievers, including walks does not make the predictor better.

Does the inclusion of walks hurt a projection of future runs for relievers… or is this simply a case of selection bias?

The goal of this piece was to re-weight pFIP into a form that is most predictive for relievers. So, as I did for the starter version of pFIP, I ran a multiple regression with K/PA, HR/PA and BB/PA as three separate predictors against runs allowed in the subsequent season.

When the regression was run, I found only strikeouts and home runs to be significant predictors; the p-value for walks was 0.865. Typically, for a predictor to be considered statistically significant, its p-value should be under .05. Walks were well above this threshold, while home runs and strikeouts had p-values well below .01.
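The significance test described here can be sketched as follows. The data are synthetic, not the article’s sample: the generating coefficients loosely echo the reliever pFIP weights, and walks are deliberately given no true effect, so their p-value should come out large.

```python
import numpy as np
from scipy import stats

# Synthetic rate stats for 200 hypothetical relievers (NOT the article's data).
rng = np.random.default_rng(42)
n = 200
k_pa = rng.normal(0.22, 0.05, n)    # K/PA
hr_pa = rng.normal(0.025, 0.01, n)  # HR/PA
bb_pa = rng.normal(0.09, 0.02, n)   # BB/PA
# Walks get no true effect, mimicking the article's first-sample finding.
ra9 = 5.0 - 6.3 * k_pa + 17.0 * hr_pa + rng.normal(0, 0.5, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), k_pa, hr_pa, bb_pa])
beta, _, _, _ = np.linalg.lstsq(X, ra9, rcond=None)

# Standard errors, t-statistics and two-sided p-values per predictor.
resid = ra9 - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
p_values = 2 * stats.t.sf(np.abs(beta / se), dof)
# p_values holds [intercept, K/PA, HR/PA, BB/PA]; the BB/PA entry
# should be large, since walks carry no signal in this fake data.
```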

Thus, this multiple regression resulted in a pFIP for relievers that completely ignored walks, because they were not significant and did not improve the overall r-squared in any way (they actually hurt the adjusted r-squared). The equation for the reliever version of pFIP that I found was:

(17*HR – 6.3*K)/PA + Constant
Note: the constant in this case was ~5.00.

This equation is fairly similar to the starter version of pFIP, except that it ignores walks. Reliever pFIP’s r-squared was 9.42 percent and its RMSE was 1.26, which beat the other predictors tested.
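For illustration, the reliever equation can be wrapped in a small function; the stat line below is hypothetical, and the constant comes from the ~5.00 noted with the formula:

```python
# Reliever pFIP: (17*HR - 6.3*K)/PA + constant, with the article's ~5.00 constant.
def pfip_reliever(hr, k, pa, constant=5.0):
    return (17.0 * hr - 6.3 * k) / pa + constant

# Hypothetical reliever season: 5 HR, 70 K over 260 PA
print(round(pfip_reliever(5, 70, 260), 2))  # -> 3.63
```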

These results were very surprising; so surprising, in fact, that I was almost positive selection bias could be rearing its ugly head.

So, I tested whether these results would hold up out of sample. I took another sample of relievers with a minimum of 30 innings pitched in Year x and in Year x+1, this time for the years 2002-2007, and tested the same predictors, except that this time I added the reliever pFIP found in the first sample (R pFIP) and labeled the starter pFIP as S pFIP.

Here are the results (n=636):

Predictor   r^2      RMSE
K-BB        10.46%   1.390
S pFIP       9.88%   1.395
K%           9.01%   1.402
SIERA        7.92%   1.410
R pFIP       7.82%   1.411
xFIP         6.87%   1.418
FIP          6.10%   1.424

In this sample, my original pFIP and strikeout percentage again did very well. The main point of this test, though, was to check for selection bias, not to see how well strikeout percentage would do. The top predictor, strikeouts minus walks, which was the worst predictor in the original sample, almost screams that the results of the first sample were subject to selection bias. Also, the reliever pFIP that I found in the original sample, which excludes walks, plummeted down the list.

So, I ran a multiple regression for this sample, with walks, strikeouts and home runs as the predictors, to see if walks had now become significant. For this sample, both strikeouts and walks had p-values well under .01 (thus significant), while home runs had a p-value of .531 (thus not statistically significant). Adding a pFIP-esque weight to strikeouts and walks while ignoring home runs only increased the overall r-squared of simple K-BB from 10.46 percent to 10.56 percent.

Mainly for fun, I combined the two samples into one large sample covering the last decade (2002-12) and re-ran the multiple regression. With this sample, all three predictors became significant, but the overall r-squared was still not very good (10.56 percent), and strikeouts had a standard error half the size of walks’ and one-fifth the size of home runs’.


The variability in reliever runs is massive. Craig Kimbrel, a workhorse reliever, threw almost 140 innings over the last two seasons. That is fewer innings than a full season for a healthy starter, and a full season of innings for a starter is still a small sample in relative terms. I think the results of this study just reiterate the fact that year-to-year numbers for relievers are incredibly hard to predict because of the sample size.

This is why, when I originally set out to re-weight FIP, I concerned myself with just starting pitchers. The pFIP I found for starters did better than most statistics at predicting relievers’ runs allowed. However, the variability I found in two of the main components of pFIP (home runs and walks) made me question whether those factors should even be included in an analysis of future reliever runs.

For those who care, when using the entire sample, the equation I found for a reliever pFIP was as follows:

(11.5*HR + 3*BB – 7.5*K)/PA + Constant

I would in no way be a proponent of this formula, though. Given that these weights are subject to selection bias, and that strikeout percentage by itself does almost the entire job, the equation is essentially useless. The r-squared I found for this equation was 10.56 percent, which is not much better than the one I found for strikeout percentage, 9.48 percent.

The best equation I could find still left almost 90 percent of the variance in future runs unexplained, which doesn’t sit well with me. Thus, I would almost always recommend using multiple seasons to project runs allowed for relievers, but I would also say that one should not expect those results to be too pretty either, because the outcome year is still such a small sample that almost anything can happen.

If you’re stuck with, or choose to use, just one statistic from one season of data to predict future runs for a reliever, just use strikeout percentage; it’d be foolish to complicate matters any further.

References & Resources
All statistics come courtesy of FanGraphs

10 years ago

Well done Glenn, even with the negative results.  A few questions that occur to me: 1) Could the increase in K% over the time period have an effect upon your results?  2) Could the 30+ innings limit bias the results by eliminating certain types of (bad) pitchers?  3) Looking at the variation of the rates could you come up with numbers of PAs at which you would regress 50% to the mean for that stat?  4) Would these be different from the numbers people have found for starters?  5)  When trying to use something like Marcel to predict future performance, would the weights applied to past seasons and the amount of regression to the mean be different for relievers than for starters?

Keep up the good work!

10 years ago

Glenn, are closers easier to predict? Or are relievers with higher average pLI more predictable than those with lower-leverage innings? I would agree that perhaps we should question whether we can predict relievers, but perhaps there is more reason to predict higher-performing, higher-impact relievers.

Glenn DuPaul
10 years ago

@kds I think the increase in K% is huge here. If I had looked at a sample of, say, 1992-2002, then I don’t think K-rate would’ve been as effective a predictor. I think eliminating the limit would just hurt the predictive measures. Why would we care to project a reliever who had a FIP over 10 or something in Year X, because he probably won’t be pitching the next season. I think these numbers do need to be regressed to the mean to some extent, because the sample is just that small. I haven’t done enough research to say how much though. I can’t speak for it, but I think Marcel uses different amounts of regression for starters as opposed to relievers.

That’s a really good question.  I would say for the most part, teams would care more about high leverage relievers than guys throwing in mop-up duty. I can’t speak to whether or not those closers would be easier to predict though.