Predictive FIP

Over the last few weeks, I’ve written about the predictive ability of a simple estimator: strikeouts minus walks. In the latest article, I showed that for starting pitchers with at least 120 innings pitched in year x and at least 100 IP in year x+1, strikeouts minus walks (K-BB) was the best predictor of year x+1 runs allowed (RA9).

A few people asked me to expand on this sample in different ways. I’d like to share my findings for two separate questions, before getting into the meat of this week’s article.

First, I was asked to dial the minimum number of innings down to 40 IP and re-run the numbers. This change resulted in a sample of 1,025 qualified pitchers. Here are the results:

Predictor   r^2      RMSE
SIERA       0.1706   1.137
K-BB        0.1487   1.152
K%          0.1471   1.153
FIP         0.1221   1.17
xFIP        0.1198   1.171
RA9         0.1067   1.18

Here is my explanation for how one should interpret the r-squared and root-mean square error of the residual from last week:

The r-squared tells us the percentage of the variation in RA9 in Year x+1 that is explained by variation in the predictor in Year x.

The root-mean squared error of the estimate also tells us about the strength of the predictor. It works sort of like a standard deviation; thus, the lower the standard deviation (or RMSE), the better the model.
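To make those two measures concrete, here is a minimal sketch of how they can be computed for a one-predictor regression. The function names are my own, and dividing the squared residuals by n (rather than by n-2) is an assumption about how the RMSE in the tables was calculated.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared_and_rmse(xs, ys):
    """Return (r^2, RMSE) for the least-squares line through (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    ss_res = sum(e ** 2 for e in residuals)        # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)    # total variation
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(ys))             # assumption: divide by n
    return r2, rmse
```

Here xs would hold the Year x predictor (K-BB, FIP and so on) and ys the Year x+1 RA9 for the same pitchers.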

These results jibe well with a larger study that Matt Swartz ran when he was testing SIERA.

SIERA came out on top of that study, just ahead of kwERA (a modified form of K-BB), the same as we see here. Swartz explained on Twitter why SIERA would perform better than strikeouts and walks when relievers are included:

SIERA’s advantage is on periphery- extreme K, GB, FB. More of those are RP.

One other interesting thing we can see from these results is the continued success of simple strikeout percentage, especially with relievers now included.

I was also asked to change the 2008-2012 sample from the 120 IP minimum in Year x to include Year x-1, and set that minimum to 250 innings. For example, a pitcher who qualified would need at least 250 innings over the course of 2007-08 and at least 100 innings in 2009. A total of 283 starters qualified for this sample:

Predictor   r^2      RMSE
FIP         0.2356   0.7773
RA9         0.1973   0.7966
K-BB        0.1595   0.8151
SIERA       0.153    0.8182
K%          0.1503   0.8196
xFIP        0.149    0.8202

For this sample, FIP (fielding independent pitching) and RA9 came out ahead of strikeouts and walks. Why?

Colin Wyers has shown that as the sample of innings pitched gets larger, ERA (or in this case RA9) and FIP become the most predictive. This is largely because successful pitchers usually have staying power with their team; they pitch in front of a similar defense and in the same home park, which can have large positive effects on the predictability of both FIP and RA9.

FIP also benefits from the fact that home runs begin to stabilize between two and three years for starters. The home run component of their FIP is less volatile in this sample than it is in the year-to-year test.

An idea

In my last article, I left out the actual linear regression equations for each predictor. An emailer, RPL, asked to see those equations. The combination of his comments about those equations and the results from the two tests above led me down an interesting path, which I’d like to share.

First, here are the linear regression equations for each predictor for last week’s sample:

K-BB: -7.51 (K-BB) + 5.175
FIP: 0.508 (FIP) + 2.355
K%: -7.5 (K%) + 5.75
SIERA: 0.573 (SIERA) + 2.055
RA9: 0.338 (RA9) + 2.92
xFIP: 0.557 (xFIP) + 2.153

When K-BB is scaled up to RA9 form, the equation is: 0.578 (kwRA9) + 2.038.
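As a quick illustration of how these equations are used, the sketch below plugs a Year x value into the matching equation to get a predicted Year x+1 RA9. The slopes and intercepts are copied from the list above; the function name and the sample inputs are made up, and treating K-BB and K% as fractions of batters faced (0.12 rather than 12) is my assumption about the units.

```python
# (slope, intercept) pairs copied from the regression equations in the article.
EQUATIONS = {
    "K-BB": (-7.51, 5.175),
    "FIP": (0.508, 2.355),
    "K%": (-7.5, 5.75),
    "SIERA": (0.573, 2.055),
    "RA9": (0.338, 2.92),
    "xFIP": (0.557, 2.153),
}


def predict_next_ra9(predictor, value):
    """Predicted Year x+1 RA9 from a Year x value of the named predictor."""
    slope, intercept = EQUATIONS[predictor]
    return slope * value + intercept


# Hypothetical pitcher: K-BB of 0.12 (assumed to be a per-batter rate), FIP of 3.50.
print(round(predict_next_ra9("K-BB", 0.12), 2))  # 4.27
print(round(predict_next_ra9("FIP", 3.50), 2))   # 4.13
```

Note how the small FIP slope pulls the prediction toward the intercept: a 3.50 FIP maps to a predicted RA9 above 4.00.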

Those equations are listed in order of predictive ability (from most predictive to least). RPL smartly pointed out that while FIP was the second-most predictive, its slope (0.508) was smaller than that of SIERA (0.573) or xFIP (0.557), both of which were less predictive. In other words, the regression leaned on FIP less than it leaned on those two.

This was a very subtle, but important finding. And by finding, what I really mean is a reminder:

FIP is a descriptive statistic that works fairly well as a predictive statistic, not the other way around.

The weights that are used in the FIP formula ((13*HR + 3*BB – 2*K)/IP + constant) come from a within-season regression. FIP correlates walks, home runs and strikeouts in year x to runs in year x, not year x+1. FIP is really an attempt to tell us how a pitcher actually performed within a season, while ignoring other factors, namely defense. It just happens to work out that FIP does a reasonable job of predicting expected runs for a pitcher in future seasons.
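For reference, the descriptive formula above can be written as a short function. This is only a sketch: the default constant of 3.10 is a placeholder I picked for illustration, since the real constant is recalculated for each league and season, and the sample season line is invented.

```python
def fip(hr, bb, k, ip, constant=3.10):
    """Descriptive FIP: (13*HR + 3*BB - 2*K) / IP + constant.

    The default constant is a hypothetical placeholder, not a real
    league-season value.
    """
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


# Invented season line: 20 HR, 50 BB, 180 K in 200 IP.
print(round(fip(hr=20, bb=50, k=180, ip=200), 2))  # 3.35
```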

So, I ran a multiple regression with strikeouts per inning pitched (K/IP), walks per inning pitched (BB/IP) and home runs per inning pitched (HR/IP) as three separate factors. I did this to not only strip FIP’s three components of their usual weights and test for predictive ability, but also to try to calculate a different set of weights that were more predictive.
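For readers who want to try something similar, here is a rough sketch of a three-factor least-squares fit in plain Python. It is not the code behind the numbers in this article: the function names are mine, the data rows are invented, and solving the normal equations directly is just one convenient way to get the coefficients.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Invented Year x rates (K/IP, BB/IP, HR/IP) and invented Year x+1 RA9 values.
rates = [
    (0.90, 0.30, 0.10),
    (1.00, 0.40, 0.12),
    (0.70, 0.35, 0.15),
    (0.80, 0.25, 0.08),
    (1.10, 0.30, 0.13),
    (0.60, 0.45, 0.11),
]
next_ra9 = [4.1, 4.3, 5.0, 4.2, 3.9, 5.1]

intercept, w_k, w_bb, w_hr = fit_multiple(rates, next_ra9)
print(intercept, w_k, w_bb, w_hr)
```

The fitted w_k, w_bb and w_hr play the role of the per-inning weights, and the intercept plays the role of the constant. With only six made-up rows the output means nothing; the point is the mechanics.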

I also ran a second multiple regression that excluded home runs (just strikeouts and walks) for comparison:

Multiple Predictors   r^2      RMSE
K, BB, HR             0.2307   0.7875
K and BB              0.1404   0.8312

The multiple regression that included home runs blew all of the other predictors out of the water, and its r was almost 0.5 (the square root of 0.2307 is roughly 0.48).

This result was really exciting for me. I thought, for a moment, that I had found new weights for FIP that made it more predictive than its current descriptive weights. I then took the results to Tom Tango, who swiftly pointed out the error in my original excitement.

The weights for home runs, walks and strikeouts that I found were ~8.6, ~1.15 and ~1.5, respectively. Tango noted that there was almost certainly a selection bias issue when just using starters for 2008-2012. He made this assertion because with descriptive FIP, home runs are weighted 4.3 times walks, but these results had home runs weighted about 7.5 times walks.

In an attempt to account for this selection bias, I expanded the sample back to 2004 (704 starting pitchers). The K, BB, HR multiple regression on this sample was even more predictive than the original, and it produced more reliable, or at least more acceptable, weights.

The simple equation that I came up with for what I’ll call predictive FIP (pFIP) is: ((7*HR) + (1.6*BB) - (2*K))/IP + constant

For this sample, the constant was 4.73. Please note that the constant is higher than FIP’s for two reasons: this statistic is meant to be an RA9 predictor, not an ERA predictor, and strikeouts carry a larger negative weight relative to home runs and walks.
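Written out as a function, the prototype looks like the sketch below. It assumes the whole weighted sum is divided by innings pitched, as in descriptive FIP, and it uses the 4.73 constant from this 2004-12 starter sample as the default; the season line is invented.

```python
def pfip(hr, bb, k, ip, constant=4.73):
    """Predictive FIP prototype: (7*HR + 1.6*BB - 2*K) / IP + constant.

    The default constant is the value reported for the 2004-12 starter
    sample; a different sample would need its own constant.
    """
    return (7 * hr + 1.6 * bb - 2 * k) / ip + constant


# Invented season line: 20 HR, 50 BB, 180 K in 200 IP.
print(round(pfip(hr=20, bb=50, k=180, ip=200), 2))  # 4.03
```

The result is on the RA9 scale, so it should be compared with next year's runs allowed per nine innings, not with ERA.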

Here’s how pFIP stacked up against the other estimators for this sample:

Predictor   r^2      RMSE
pFIP        0.2466   0.8434
K-BB        0.1908   0.8734
SIERA       0.1841   0.8826
FIP         0.1752   0.8874
xFIP        0.1698   0.8903

pFIP was clearly the most predictive statistic for this sample, beating out the descriptive FIP, xFIP, SIERA and simple strikeouts minus walks.

Keys to pFIP

What is this statistic attempting to do?

The goal of predictive FIP is to use the three true outcomes as a predictive tool instead of a descriptive tool. Descriptive FIP weights home runs 6.5 times as heavily as strikeouts (13/2) and about 4.3 times as heavily as walks (13/3). Predictive FIP holds the strikeout weight constant, but lowers the weights of home runs and walks. As predictive FIP currently stands, home runs are weighted only 3.5 times as heavily as strikeouts (7/2) and ~4.4 times as heavily as walks (7/1.6).

I think this is an important stipulation. The weighting of FIP that is supposed to work as a describer puts the lowest weight on strikeouts, but strikeouts by themselves are the most predictive statistic. pFIP attempts to account for this fact by giving strikeouts relatively more weight. The way FIP works as it stands is a much better describer than this statistic would be, but my goal is prediction, not description.
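The relative weights can be checked with a few lines of arithmetic. The dictionary layout and function name below are mine; the weights are the published FIP weights and the rounded pFIP weights from this article, stored as magnitudes (the strikeout weight is subtracted in both formulas).

```python
# Weight magnitudes for each formula; K is subtracted, HR and BB are added.
WEIGHTS = {
    "FIP": {"HR": 13, "BB": 3, "K": 2},
    "pFIP": {"HR": 7, "BB": 1.6, "K": 2},
}


def hr_ratios(name):
    """Return (HR weight / K weight, HR weight / BB weight) for a formula."""
    w = WEIGHTS[name]
    return w["HR"] / w["K"], w["HR"] / w["BB"]


for name in WEIGHTS:
    hr_to_k, hr_to_bb = hr_ratios(name)
    print(f"{name}: HR/K = {hr_to_k:.1f}, HR/BB = {hr_to_bb:.1f}")
# FIP: HR/K = 6.5, HR/BB = 4.3
# pFIP: HR/K = 3.5, HR/BB = 4.4
```

The home run to walk ratio barely moves between the two formulas; the big change is how much smaller home runs and walks become relative to strikeouts.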

I also want to say that it’s important that people realize this statistic is still in a “prototype” stage. I regressed the numbers only for a sample of 704 starters from 2004-12. That is hardly conclusive, and could be subject to severe selection bias.

At the same time, I don’t want to belittle this sample, because I used it for a few reasons.

First off, I’m more concerned with projecting starters than relievers. Relief pitcher is one of the most volatile positions in baseball, if not the most volatile. Turnover at the position is pretty high, and strikeout percentage by itself seems to work really well in predicting future runs allowed for relievers.

I also picked this sample because the game is changing. Strikeout rates are at an all-time high, and my good friend James Gentile of Beyond the Box Score showed this week that the three true outcomes (K, BB, HR) are also increasing to record-breaking rates.

pFIP is based on the three true outcomes and weights strikeouts more heavily than the descriptive statistic.

I’d also like to make the point that I am in no way trying to step away from the Occam’s Razor approach that I’ve been stressing lately. I still think that when it comes to projecting pitching statistics, simple is better.

And pFIP is simple. It is just a predictive variation of the usual FIP and, like FIP, it can be calculated by hand. pFIP is more complex than just using strikeouts and walks, but not by very much, and it does a better job of predicting runs in the next year.

Finally, I’d like to leave this idea with the public. As I said, I tested this statistic only on starting pitchers from 2004-12, which is not enough. I’d like to continue to test pFIP, but I first wanted to hear from everyone in the sabermetric community.

Do you think I should continue down the pFIP path? I think it could be really interesting.

References & Resources
All statistics come courtesy of our friends at FanGraphs.

10 years ago

In a word-Yes! Great Job! Why not split the starters and relievers into 2 buckets and see what u get?

Glenn DuPaul
10 years ago

I did run the numbers for the 40 IP sample from 2008-12. The only issue with it is that I didn’t have time to expand to 2004, and it still included starters.

But in case you were wondering, the weights for that sample were 6.3, 1.35, 1.9 for HR,BB,SO.  And those weights beat all other predictors.

10 years ago

In order to be “fair”, you have to test out of sample.

The other metrics were not “tuned” to predict, while yours is.  So, you need to create your metric with whatever sample data you want, but then you have to test it against out-of-sample.

I can “improve” linear weights as a descriptive metric by making the run value of a double 0.66 in order to win the RMSE race, but 0.66 is illogical.

10 years ago

I guess I don’t really see the point of a special stat here. Get a player’s Marcel K, BB, and HR* rates; use those as inputs to generate marcel FIP. Why would you generate a “predictive” stat that ignores predictive information?

(*For simplicity, this could be 5 times current season, plus 4 times previous, plus 3 times previous previous plus 1200 IP of league average for starter/reliever as relevant with no age adjustment. You might want to fiddle with the amount of regression or incorporate an additional past season of data early in the year, but this seems like the way to predict future ERA to me.)

10 years ago

Very interesting. The Marcel suggestion above is something I favor. I do not think that going back 8 years is good; the gameplay changes faster than that. 8 years ago, the current ‘hot potato’ passing at second didn’t exist – thus only 1 out was obtained for such infield hits. Nowadays it’s 2 outs and that is a significant change. I suspect that pitching has changed in a similar sense, but it is not so easy to follow.

Good predictors, even on only a yearly basis, are far more useful than using different paintbrushes to describe yesterday.

10 years ago

Two quick questions:
-Does it matter to your statistical tests that SIERA and xFIP are ERA estimators and not RA9 estimators?

-Since you’re finding a best fit for your K, BB, and HR inputs in relation to each sample you test against, shouldn’t we expect it to outperform the other metrics that have static weightings?

Glenn DuPaul
10 years ago


You bring up two really important points. I think it does matter that I’m testing against RA9 instead of ERA; the problem is I don’t know how much. Obviously, the weights of pFIP would change for ERA as well, but I plan on testing the same sample against ERA.

Also, you’re right about the second question. We would expect the weights I found to do the best, because those weights came from the sample. I’m currently testing a similar sample but with 1996-2004, to see if we get the same results.

I plan to have some answers for next week’s article.