# Reinforcing the power of predictive FIP

In October, I introduced a metric called Predictive FIP (or pFIP for short). This metric is a slightly modified version of Tom Tango’s commonly used fielding independent pitching (FIP) statistic.

Tango’s version of FIP is meant to describe a pitcher’s performance in terms of the three true outcomes (walks, strikeouts and home runs). The FIP equation weights each of those three outcomes in a descriptive manner:

**FIP = (13*HR + 3*BB – 2*K)/IP + Constant (typically ~3.20)**
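The descriptive formula above can be sketched as a one-line function; the sample inputs here are hypothetical, and the ~3.20 constant is the typical value cited in the text.

```python
def fip(hr, bb, k, ip, constant=3.20):
    """Descriptive FIP: weight HR, BB and K per inning pitched, then rescale to ERA."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant

# Hypothetical season line: 20 HR, 50 BB, 180 K over 200 IP
print(f"{fip(hr=20, bb=50, k=180, ip=200):.2f}")  # 3.45
```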

FIP works fairly well as a predictor of future ERA or runs allowed per nine innings (RA9), so many use the statistic to predict even though it was not designed for that. A good way to think about FIP is as what a pitcher’s ERA *should* have been, or better yet, what his ERA would have been based solely on Ks, BBs and HRs. FIP is not meant to tell us what a pitcher’s ERA will be in the future.

I set out to convert FIP from its descriptive form into a predictive metric.

After a few tests and some advice, I changed some of the methodology behind FIP. First, the FIP weights and constant are meant to describe ERA; I decided to make pFIP a predictor of runs allowed per nine innings (RA9) rather than ERA. Second, I made plate appearances (or batters faced) the denominator of the statistic rather than innings pitched.

The result was this equation:

**pFIP = (17.5*HR + 7*BB – 9*K)/PA + Constant (typically ~5.18)**
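The same kind of sketch works for pFIP; note the denominator is plate appearances rather than innings pitched, and the constant scales the result to RA9. The inputs are again hypothetical:

```python
def pfip(hr, bb, k, pa, constant=5.18):
    """Predictive FIP: re-weighted three true outcomes per plate appearance,
    scaled to RA9 rather than ERA."""
    return (17.5 * hr + 7 * bb - 9 * k) / pa + constant

# Hypothetical season line: 20 HR, 50 BB, 180 K over 850 PA
print(f"{pfip(hr=20, bb=50, k=180, pa=850):.2f}")  # 4.10
```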

The major differences between FIP and pFIP come in the weighting of strikeouts and home runs. Strikeouts become more important when predicting future runs, while home runs become less important.

pFIP held up very well against other more commonly accepted “ERA estimators” (including descriptive FIP). That being said, just because something works fairly well does not mean one should not at least attempt to improve it.

A while back, I attempted to refine pFIP by regressing each of its components (Ks, BBs, HRs) to the mean. Strikeouts and walks are less volatile over one- to two-year samples, so their regression was not nearly as significant as the regression for home runs. Interestingly, regressing the components to the mean did not improve the metric.

My next idea to improve pFIP was to focus only on the home run component of the statistic.

Dave Studeman, the leader of the Hardball Times, converted Tango’s FIP into a version known as expected fielding independent pitching (xFIP).

According to the THT Glossary, xFIP is:

> An experimental stat that adjusts FIP and “normalizes” the home run component. Research has shown that home runs allowed are pretty much a function of fly balls allowed and home park, so xFIP is based on the average number of home runs allowed per outfield fly. Theoretically, this should be a better predictor of a pitcher’s future ERA.

The FanGraphs Sabermetrics Library explains how xFIP is calculated:

> (xFIP) is calculated in the same way as FIP, except it replaces a pitcher’s home run total with an estimate of how many home runs he should have allowed. This estimate is calculated by taking the league-average home run to fly ball rate (~9-10 percent depending on the year) and multiplying it by a pitcher’s fly ball rate.

Over most small-to-medium samples, xFIP is a better predictor of future ERA than FIP; thus, I decided to apply this concept to pFIP.

xFIP simply inserts the expected number of home runs directly into the FIP equation:

**xFIP = ((13*(FB% * League-average HR/FB rate))+(3*(BB+HBP))-(2*K))/IP + constant**
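The substitution can be sketched by first estimating home runs from fly balls. In this sketch `fb` is the pitcher’s fly balls allowed, so `fb * lg_hr_fb` yields the expected home run total; the variable names and the 9.5 percent league HR/FB rate are illustrative assumptions:

```python
def xfip(fb, bb, hbp, k, ip, lg_hr_fb=0.095, constant=3.20):
    """xFIP: FIP with actual HR replaced by expected HR (fly balls * league HR/FB)."""
    expected_hr = fb * lg_hr_fb  # estimated home runs allowed
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Hypothetical season line: 200 fly balls, 50 BB, 5 HBP, 180 K over 200 IP
print(f"{xfip(fb=200, bb=50, hbp=5, k=180, ip=200):.2f}")  # 3.46
```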

I decided against inserting the expected number of home runs into the pFIP equation with its current weights.

### An attempt to contrive an xpFIP

I took a sample of starting pitchers who had at least 100 innings in Year X and at least 100 innings in Year X+1 for the years 2007-12 (n = 479).

Then, I ran a multiple regression with strikeouts, walks and fly-ball percentage times the league-average HR/FB rate in Year X against RA9 for each starter in Year X+1. This regression produced the following regressed, or xpFIP, equation:

**xpFIP = (5*(FB% * League-average HR/FB rate) + 9*BB – 9*K)/PA + Constant**

In this case, the constant was 5.23.
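A sketch of the regressed equation, with strikeouts entering negatively as in pFIP, and again treating `fb` as fly balls allowed so that `fb * lg_hr_fb` is the expected home run total (the names and the 9.5 percent league rate are illustrative assumptions):

```python
def xpfip(fb, bb, k, pa, lg_hr_fb=0.095, constant=5.23):
    """xpFIP: pFIP with home runs replaced by an expected-HR estimate.
    Note the HR coefficient (5) is now about half the K/BB weights (9)."""
    expected_hr = fb * lg_hr_fb  # estimated home runs allowed
    return (5 * expected_hr + 9 * bb - 9 * k) / pa + constant

# Hypothetical season line: 200 fly balls, 50 BB, 180 K over 850 PA
print(f"{xpfip(fb=200, bb=50, k=180, pa=850):.2f}")  # 3.97
```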

With the home run total estimated rather than observed, the home run coefficient is only about half the weight of the K and BB coefficients, as opposed to being weighted roughly twice as much as those two in the original equation.

Then, this xpFIP equation was tested against these other ERA estimators:

- pFIP
- FIP
- xFIP
- kwERA
- SIERA

I ran a linear regression, on the same sample, between each starter’s ERA estimator in Year X and his RA9 in Year X+1.

I used r-squared as the measure of the predictive value of each estimator, and found these results:
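The r-squared measure here is just the squared Pearson correlation between each estimator in Year X and RA9 in Year X+1. A minimal sketch, with made-up numbers for five hypothetical starters:

```python
def r_squared(x, y):
    """Squared Pearson correlation: share of Year X+1 RA9 variance
    explained by a Year X estimator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical Year X pFIP and Year X+1 RA9 for five starters
est = [3.8, 4.2, 4.5, 3.5, 5.0]
ra9 = [3.9, 4.6, 4.3, 3.7, 4.8]
print(f"{r_squared(est, ra9):.2%}")
```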

| Predictor | r^2 |
|---|---|
| pFIP | 18.50% |
| xpFIP | 17.78% |
| kwERA | 17.73% |
| SIERA | 15.63% |
| FIP | 15.33% |
| xFIP | 14.82% |

This new xpFIP equation did fairly well, beating almost all of the other estimators tested. However, regressing the home run component hurt predictive ability relative to the original pFIP, which was the strongest predictor.

Before scrapping the idea of regressed home runs in pFIP completely, I tested the equation on a different sample. I used the same minimum requirements (100 IP) and the same estimators and ran the same linear regression for the years 2002-07 and found these results:

| Predictor | r^2 |
|---|---|
| pFIP | 19.19% |
| SIERA | 16.56% |
| FIP | 16.33% |
| xpFIP | 16.03% |
| kwERA | 15.79% |
| xFIP | 15.29% |

The xpFIP equation did not predict future RA9 nearly as well for this sample. My original pFIP equation did significantly better than the other ERA estimators at predicting future RA9.

*Why does the pFIP with a regressed home run component do worse than the non-regressed pFIP?*

It’s interesting that the statistic that uses actual home runs is more predictive than the regressed version, despite the random variation that affects home run numbers.

My best guess for the reason behind this finding has to do with survivor bias. It has been shown that some pitchers have the ability to suppress home runs and consistently have lower than average HR/FB rates. I think it is entirely possible that a fair number of pitchers who are allowed to throw 200+ innings over the course of two seasons have some ability to control their home run rates.

Also, there is the issue of park factors. The majority of these pitchers did not change teams during the span of two seasons. It makes intuitive sense that a pitcher who made half of his starts in a park that suppressed home runs would have a lower-than-average home run rate over those two seasons, and vice versa for a pitcher in a home run-friendly park.

I think it’s well within the realm of possibility that regressing the home run component of pFIP would benefit the statistic when looking at pitchers who change teams between Year X and Year X+1.

### pFIP vs. ZIPS

At this point, I’m pretty confident in the strength of pFIP as a predictor.

However, I had always simply assumed that projection systems were more useful, as they consider many more factors than just the three true outcomes when attempting to project future runs for pitchers. (Although this Matt Swartz article made me a little uncertain about that opinion.)

So, mainly for fun, I compared pFIP’s RA9 projections for last year (2012) to the RA9 projections of the popular ZIPS projection system.

First, I looked at a sample of every pitcher who threw at least 100 innings in 2011 and at least *one* inning in 2012 (n=137) and compared how well each system (or metric) did at projecting future RA9:

| Predictor | r^2 |
|---|---|
| pFIP | 17.72% |
| ZIPS | 14.65% |

Much to my surprise, pFIP explained over three percentage points more of the variation in RA9 than ZIPS. However, my minimum inning threshold for 2012 (one!) was admittedly silly.

Thus, to eliminate some outliers and converted relievers, I set the minimum threshold in 2012 to be at least five games started in the season (n=118). I found these results:

| Predictor | r^2 |
|---|---|
| pFIP | 19.84% |
| ZIPS | 17.20% |

This change improved the predictive ability of both systems, and closed the gap slightly between pFIP and ZIPS. Interestingly though, pFIP still came out ahead of the much more sophisticated system.

This is very obviously a small sample: I looked at starting pitchers in only one season, so it could have been pure luck that pFIP was a better predictor of future runs than ZIPS. Also (and more importantly), ZIPS and other projection systems are built to predict many more quantities (IP, GS, Ks, BBs, etc.) than just runs.

At the same time, I think these two short studies (regressing home runs and comparing to ZIPS) do a fair job of reinforcing the strength of this simple predictive re-weighting of the FIP equation.

**References & Resources**

All data comes courtesy of FanGraphs.

**Comments**

Glenn,

One question on the ZIPS comparison. Were any 2012 data used in determining the coefficients in the pFIP model used in the comparison?

When I determined the coefficients I used 1996-2012, so short answer is yes.

That’s part of why I admitted the comparison was more fun than anything else. At the same time, over those years tested the coefficients were fairly stable, so if 2012 wasn’t included the pFIP model would be almost exactly (if not exactly) the same.

It will be interesting to see how pFIP holds up against other projection models for the 2013 season.

Looking at the pFIP formula, the HR weight is about twice that of BB and K. Why not simplify the formula and just use (K-2HR-BB)/PA ? There’s no big reason it has to be on the RA scale.

@dcs The constant is what really puts pFIP on the RA9 scale. You can use the same weights and regress to ERA and you’ll see similar results. As for simplifying the weights, using ((2*HR + BB) – K)/PA will give you almost identical results: 07-12: the r-squared goes from 18.50 to 18.67 (the simple model improves it); 02-07: r^2 moves from 19.19 to 18.75. The difference is essentially negligible. The more complicated weights are meant to improve the model slightly, as home runs aren’t exactly twice as important as Ks, and Ks are slightly more important than walks. However, I think… Read more »

The more I see these articles defending FIP and related matters, the more I find your research interesting. Nice job! My stats is very rusty, so maybe you can explain if I understand things correctly. Firstly, I understand that by your methodology, your pFIP is better than most other, some by a little, but a lot better than FIP or xFIP. The very interesting point there is that xFIP is less descriptive, suggesting that HR/FB is not the standard that was thought. Second, no matter which system is used, predictive ability is less then 20% (this is where my stats… Read more »

“The constant is what really puts pFIP on the RA9 scale.”

Yes, it is a phony number, added to another MUCH SMALLER number to somehow make the sum look more baseballish, whatever that means.

@rubes not sure what you’re trying to say? FIP is a very small number that doesn’t look baseballish until you add a constant.

That small number can tell you a lot, even before the constant is added, it also can tell you a lot more than a pitcher’s RA9 in the previous season.

Hi Glenn, Consider my two posts a friendly flame on behalf of the Miguel Cabrera for MVP crowd, all the right and just Bissingerians of the baseball universe. I am making the complete criticism of the FIP stat and your work with FIP over the fact that all results by and large closely resemble the constant, and that there really isn’t a reason for the constant. Effectively, you are comparing Zips results with the numbers 3 and 5. Likewise plugging in FIP to calculate pitchers WAR is pretty much starting out with a fake data set. The constant IS the… Read more »

@ogc Yes, no matter what statistic you use, even pFIP, over 80 percent of the variation in future RA9 is left unexplained. I’ve found that closing that 80 percent gap is nearly impossible. Park factors, defense and strength of opponents could go some of the way in explaining that unexplained portion, but I don’t know how far they would go. The problem is explaining one season of RA9 is very difficult, because the sample is so small, random variation plays a large part. There’s the issue of BABIP, which in a 150+IP 75 percent of the number is random. Defense… Read more »

@rubes I honestly don’t know what to say. You can run the comparison for pFIP with ZIPS or with the other ERA estimators without the constant and simply use the coefficients and you’ll get the exact same result. The constant is just a scaling point, that is an attempt to make the information look like ERA or RA9. FIP is the same statistic whether or you add 3.2 to it or not. It is no way a fake data set. Tango’s FIP takes what actually happened (HRs, BBs, Ks) and weights those outcomes by their run values to get a… Read more »

Glenn, Sorry, no – thanks again for your reply, in any case. I fully appreciate you have re-jiggered the calculation to be basis Plate Appearances Against vs. the other denominator which is IP (How many outs achieved). Yes, one doesn’t even have to consider whether Glenn has done awesome work or not with the changing around of the component multipliers,… Because the simple fact is the results have gone from 3something to 5something, mainly reflecting the change in this size of the constant more than anything else. Changing the size of the constant is much more significant than changing 3Walks… Read more »

Yes, well Bissingerian law technically prevents me from asking, but if it could be explained how Glenn arrived at the constant 5.18 in one instance, vs. 5.23 in another? A point in the right direction would do, rather than a technical explanation an average fan can’t understand. I haven’t run the numbers on say Matt Cain’s 2012 through these various calculations, but I don’t have to. He goes from 3something to 5something. And much like an old-time scout discounting a RHP due to lack of height, these models don’t really do any more than FIP to clarify the future for… Read more »

Yeah, you might want to actually run those numbers. Hint: even though pFIP is on the RA9 scale, which is ~1.08x higher than the FIP/ERA scale, his pFIP from 2012 stats still starts with a 3.

Yes, I guess a simpler way to say it is that it is fine to scale the various results of whatever FIP calculation using a constant, but when the results are applied/compared elsewhere, the constant, which is a lot larger number than the calculation result, becomes the data.

Glenn’s adding 2 to the constant is by far the most significant / impactful part of this exercise.