# FIP, In Context

For decades, pitchers have been measured by their earned run average (ERA): the number of “earned” runs charged to them per nine innings pitched. Pitcher ERAs are incredibly inconsistent from season to season, though, largely because so much of what causes runs to score is outside a pitcher’s control.

More recently, baseball researchers have focused on so-called defense-independent pitching statistics (DIPS) to try to better isolate the factors that a pitcher can help control. Voros McCracken is credited with starting the movement, but Tom Tango is responsible for the most widely used DIPS-type equation: Fielding Independent Pitching (FIP). FIP, in turn, has spawned a legion of other estimators that try to improve upon its simple formula, often with different ends in mind. Well-known derivatives include xFIP and SIERA; other recent efforts include TIPS and BERA.

However, none of these metrics is able to consider the context of each underlying event. They don’t account for each batter the pitcher faced, the number of times the pitcher faced that batter over a season, the catcher to whom the pitcher threw, or the umpire behind the plate. They also don’t consider how each event was affected by the stadium in which it occurred, the handedness of the pitcher and the batter, or the effect of home-team advantage. Nor do they account for a pitcher throwing in a loaded division, as opposed to a pitcher running up his stats against lesser competition. This limits both their overall effectiveness and, in particular, their usefulness with smaller sample sizes.

To help address these issues, this article introduces Contextual FIP, or cFIP. (I recognize that “cFIP” is a commonly used shorthand for the FIP constant that puts FIP on an ERA scale. Unfortunately, I haven’t thought of a better name, so cFIP it is. Sorry.) Building on the mixed-model approach we developed at Baseball Prospectus for Called Strikes above Average (CSAA), cFIP seeks to provide this missing context. Each underlying event in the FIP equation — be it a home run, strikeout, walk, or hit by pitch — is modeled to adjust for, as appropriate, the effect of the individual batter, catcher and umpire; the stadium; home-field advantage; umpire bias; and the handedness relationship between pitcher and batter present during each individual plate appearance.

cFIP has multiple advantages: (1) it is more predictive than other pitcher estimators, especially in smaller samples; (2) it is calculated on a batter-faced basis, rather than innings pitched; (3) it is park-, league-, and opposition-adjusted; and (4) in a particularly important development, cFIP is equally accurate as a descriptive and predictive statistic.

The last characteristic makes cFIP something we have not seen before: a true pitcher quality estimator that actually approximates the pitcher’s current ability. I recommend both its use and its further refinement.

### FIP, and Its Variants

The generally-accepted equation for FIP, published at FanGraphs, is as follows:
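For reference, the standard form of the equation is reproduced below. Here $C$ stands for the league-specific FIP constant mentioned earlier, which puts FIP on an ERA scale; the symbol itself is just shorthand chosen for this display.

$$
\mathrm{FIP} = \frac{13 \cdot \mathrm{HR} + 3 \cdot (\mathrm{BB} + \mathrm{HBP}) - 2 \cdot \mathrm{K}}{\mathrm{IP}} + C
$$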

FIP is quite useful in its current form. Among starters who qualified for the ERA title last season, on a scale of 0.0 (none) to 1.0 (perfect), the weighted correlation of their FIP to their ERA was 0.71. Among pitchers who threw 40 innings or more, the correlation was 0.70. When trying to decide if a pitcher’s terrific stretch is “for real” or a likely fluke, FIP is one of the first places knowledgeable baseball fans look.

FIP has also been extended in various ways. xFIP, conceived by Dave Studeman, operates on the premise that while a pitcher’s flyball rate is a skill, the number of fly balls that leave the stadium often is not. So, xFIP replaces a pitcher’s home run rate with his flyball rate times the league-average home run-per-fly ball rate. xFIP has less descriptive value but more predictive value than FIP. Other researchers have developed estimators that apply different weights to strikeouts, reward sequencing, and add additional components, such as batted-ball data. SIERA, developed by Matt Swartz, is the most widely followed of these and functions reasonably well in predicting future runs allowed.
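To make the xFIP substitution concrete, here is a minimal R sketch. The season totals, the FIP constant, and the league home run-per-fly ball rate are all made-up illustrative values, and the function names are hypothetical; only the weights (13, 3, -2) come from the standard FIP formula.

```r
# Season totals for one hypothetical pitcher (illustrative numbers only).
hr <- 20; bb <- 50; hbp <- 5; k <- 180; ip <- 200; fb <- 220

# Assumed league values, not taken from the article.
fip_constant <- 3.10   # puts FIP on an ERA scale
lg_hr_per_fb <- 0.10   # league-average home runs per fly ball

# Standard FIP: weighted home runs, walks plus hit batters, and strikeouts per inning.
fip <- function(hr, bb, hbp, k, ip, constant) {
  (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant
}

# xFIP swaps actual home runs for fly balls times the league HR/FB rate.
xfip <- function(fb, bb, hbp, k, ip, constant, lg_hr_per_fb) {
  fip(fb * lg_hr_per_fb, bb, hbp, k, ip, constant)
}

fip_value  <- fip(hr, bb, hbp, k, ip, fip_constant)
xfip_value <- xfip(fb, bb, hbp, k, ip, fip_constant, lg_hr_per_fb)
```

In this example the pitcher allowed 20 home runs but "should" have allowed 22 given his fly balls, so his xFIP comes out slightly higher than his FIP.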

But these metrics also share important limitations. First, as currently designed, their formulas target a “deserved” ERA or some similar measure of runs allowed. This is a traditional, but increasingly questionable, goal. Earned run average, of course, charges a pitcher only for runs that were “earned” in the opinion of the official scorer. Earned or not, the runs-allowed system charges pitchers with the full weight of runners they put on base but that a subsequent pitcher allowed to score; the latter pitcher is charged nothing.

This distinction made sense decades ago when we couldn’t allocate the likelihood of run-scoring between two pitchers. That is no longer true. Moreover, runs-allowed metrics use innings pitched as their denominator rather than total batters faced. Although this distinction does not always make a difference, a pitcher who consistently allows four or five runners over the course of three outs is simply not as good, and will not be as successful in the long term, as a pitcher who usually retires all three batters he faces. In summary, we should not be calibrating our run estimation metrics to ERA if we can avoid it.

Instead, we should be using RE24, a sabermetric improvement that unfortunately sounds like a pharmaceutical. Rather than whack one pitcher or the other with the entire consequence of a handed-off runner, RE24 debits a departing pitcher solely for the run expectancy of the situation left behind, and similarly debits a reliever only for the runs scored in light of that pre-existing expectancy. The reliever who gets out of an inherited jam will be credited accordingly. And in RE24, runs are runs, regardless of whether they are “earned” or not.
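A rough R sketch of that bookkeeping follows. The run-expectancy values are invented for illustration, `re24_charge` is a hypothetical helper, and the sign convention here (positive means runs charged against the pitcher) is an assumption; published pitcher RE24 figures may use the opposite sign.

```r
# Illustrative run-expectancy values (assumed, not from the article).
re_start_of_inning <- 0.48   # bases empty, nobody out
re_left_behind     <- 0.90   # runners on first and second, one out

# Runs charged for a stretch of pitching:
# change in run expectancy plus runs that actually scored.
re24_charge <- function(re_before, re_after, runs_scored) {
  (re_after - re_before) + runs_scored
}

# Starter exits mid-inning having allowed no runs yet.
starter_charge <- re24_charge(re_start_of_inning, re_left_behind, runs_scored = 0)

# Reliever allows one inherited runner to score, then ends the inning
# (run expectancy is 0 once the third out is recorded).
reliever_charge <- re24_charge(re_left_behind, 0, runs_scored = 1)

# The two charges sum to the run scored minus the inning's starting expectancy.
total_charge <- starter_charge + reliever_charge
```

Under ERA-style accounting the starter would absorb the entire run. Here most of the damage still lands on the starter, who created the jam, while the reliever is charged only for doing slightly worse than the situation he inherited.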

RE24 is not perfect, either. It does not consider defense and holds a pitcher fully responsible for everything that happens on the field on a play. But these shortcomings are equally true of ERA, and are no reason to avoid RE24. In this article, we will use RE24 per plate appearance (RE24/PA), not ERA, to compare the abilities of these metrics to one another. RE24 is published at FanGraphs.

The second and larger problem is the focus of this article: that, as discussed above, the underlying statistics used by FIP and its brethren lack context. Home runs are the same in Petco Park as they are at Coors Field. A walk to a .250 wOBA hitter is the same as a walk to a .400 wOBA hitter. Striking out Matt Carpenter is as impressive as striking out Javier Baez. And brushing a batter who crowds the plate is the same as drilling an ordinary hitter in the back.

The inability to consider each event individually also limits the effectiveness of existing attempts to make park and league adjustments. Variants like “FIP-,” “xFIP-,” and “ERA+/-” do try to account for park and league. But because they can’t consider each plate appearance individually, they are forced to make broad assumptions across a pitcher’s seasonal statistics. As far as I can tell, these statistics assume that half of each pitcher’s innings were pitched at home (not necessarily true), and that the remainder of his games were played at stadiums whose run-scoring environments cancel each other out (also not necessarily true).

These metrics, therefore, apply a park scoring factor to half of a pitcher’s innings commensurate to his home stadium and assume the remaining events all occurred in a league-average scoring environment. These assumptions may be close enough for missile work, and we’ll see below that they do produce a slight improvement in the results as compared to their original metrics. But we can now do better, and we should.
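The half-home, half-neutral assumption described above can be written out in a few lines of R. This is only a sketch of the assumption as I have described it, not the published formula for any of these metrics; `effective_park_factor` and the park factor of 1.20 are hypothetical.

```r
# Effective park factor under the assumption described above:
# a share of the pitcher's innings at home, the rest in a neutral (1.0) environment.
effective_park_factor <- function(home_factor, home_share = 0.5) {
  home_share * home_factor + (1 - home_share) * 1.0
}

# Hypothetical home park that inflates scoring by 20 percent.
assumed_factor <- effective_park_factor(1.20)

# The same pitcher if 60 percent of his innings actually came at home.
actual_factor <- effective_park_factor(1.20, home_share = 0.60)
```

Even in this toy case the assumed and actual factors differ, and the gap only grows once the road parks are not truly neutral.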

Recently, Baseball Prospectus published an article I wrote with Harry Pavlidis and Dan Brooks that introduced Called Strikes above Average (CSAA) to help measure catcher framing. The article endorsed the use of mixed models to account for the context of each plate appearance involved in a player’s season. Mixed models allowed us to dramatically improve our understanding of catcher framing, and mixed models can open new doors in other areas of baseball research as well.

With a mixed model, we can introduce context to FIP while retaining much of its simplicity. A mixed model can simultaneously weight every plate appearance in a season to determine whether a pitcher is truly home-run prone or primarily a victim of stadium and schedule. We find out whether a pitcher is actually a strikeout master or just carving up shark bait. We find out these things in far fewer plate appearances than other metrics require. And what we will find, in the end, is that cFIP harmonizes descriptive and predictive DIPS, allowing us to estimate the pitcher’s true pitching talent during a particular season.

### The Approach

For an explanation of how mixed models work, please review our CSAA article at Baseball Prospectus. The models I created here are similar to the CSAA models, and the specifications for each of them are in the Appendix for those who wish to try out the code for themselves.

As with CSAA, I selected a generalized linear mixed model. I used the free R computing environment (3.1.2) and the freely-available **lme4** package. I then downloaded, from Retrosheet, every plate appearance from the 2011 through 2014 seasons and excluded those that did not include a terminal batter event. I specified models for each current component of FIP — home runs, walks, strikeouts and hit by pitch — and applied a mixed model to readjust those components for each pitcher in light of the adjusted circumstances of each plate appearance.

The revised numbers for each pitcher, as compared to a league-average (“null”) probability, were then multiplied by the standard FIP coefficients and summed. The resulting number was then converted to a “minus-style” metric on which 100 is league average, with a forced standard deviation of 15. For the 2014 season, all 183,929 plate appearances were modeled, and the context-adjusted FIP (cFIP) of each pitcher was collected. The model takes about 15 minutes to run on a season’s worth of data.
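As a minimal sketch of that last step, assuming the rescaling is a plain z-score (the exact procedure, including any weighting by batters faced, is not spelled out here), the combination and conversion might look like this in R. The four pitchers and their component values are invented.

```r
# Hypothetical per-pitcher probabilities above (+) or below (-) the league
# "null" probability for each FIP component, per plate appearance.
deltas <- data.frame(
  pitcher = c("A", "B", "C", "D"),
  hr  = c(-0.004,  0.000,  0.003,  0.006),
  bb  = c(-0.010,  0.002,  0.004,  0.012),
  hbp = c( 0.000, -0.001,  0.001,  0.002),
  k   = c( 0.060,  0.010, -0.020, -0.040)
)

# Standard FIP weights: 13 for HR, 3 for BB and HBP, -2 for K.
raw <- with(deltas, 13 * hr + 3 * (bb + hbp) - 2 * k)

# Assumed rescaling: z-score the raw values, then center at 100 with SD 15.
# Lower is better, as with other "minus" metrics.
z <- (raw - mean(raw)) / sd(raw)
cfip <- 100 + 15 * z
```

Pitcher A, who suppresses home runs and walks while adding strikeouts, lands well below 100; pitcher D lands well above it.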

### cFIP in Action

Because cFIP is on a 100 “minus” scale, 100 is perfectly average, scores below 100 are better, and scores above 100 are worse. Because cFIP has a forced standard deviation of 15, we can divide the pitchers into general and consistent categories of quality. Here is how that divides up for the 2014 season, with some representative examples:

**Representative Examples, 2014 Season**

| cFIP Range | Z Score | Pitcher Quality | Examples |
|---|---|---|---|
| <70 | <-2 | Superb | Aroldis Chapman (36/best), Sean Doolittle (49), Clayton Kershaw (57), Chris Sale (63) |
| 70–85 | <-1 | Great | Zach Duke (72), Jon Lester (75), Mark Melancon (75), Zack Greinke (82) |
| 85–95 | <-.33 | Above Avg. | Hyun-jin Ryu (87), Francisco Rodriguez (88), Johnny Cueto (89), Joba Chamberlain (90) |
| 95–105 | -.33 to +.33 | Average | Tyson Ross (95), Sonny Gray (96), Matt Barnes (99), Brad Ziegler (104) |
| 105–115 | >.33 | Below Avg. | Brian Wilson (106), Tanner Roark (107), Nick Greenwood (111), Ubaldo Jimenez (112) |
| 115–130 | >1 | Bad | Edwin Jackson (116), Jim Johnson (120), Kyle Kendrick (124), Aaron Crow (125) |
| 130+ | >2 | Awful | Brad Penny (130), Paul Maholm (131), Mike Pelfrey (132/worst), Anthony Ranaudo (132/worst) |

I’ve provided a mix of starters and relievers for each approximate category. Obviously, it is more impressive for a starter to achieve each category than a reliever, because a starter pitches so many more innings. Any cFIPs under 70 are, for starters, basically your Cy Young candidates, provided they pitch enough innings: Kershaw, Sale, Corey Kluber, Yu Darvish, Jose Fernandez, and Max Scherzer. We might as well include Phil Hughes and David Price, who checked in at exactly 70.
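For anyone who wants to apply these tiers to the posted results, a small R helper along these lines would work. The function name is hypothetical, and treating each lower bound as inclusive is an assumption, chosen because it matches the examples in the table (95 is Average, 130 is Awful).

```r
# Map cFIP scores to the quality tiers in the table above.
# right = FALSE makes each interval include its lower bound, e.g. [70, 85).
cfip_category <- function(cfip) {
  cut(cfip,
      breaks = c(-Inf, 70, 85, 95, 105, 115, 130, Inf),
      labels = c("Superb", "Great", "Above Avg.", "Average",
                 "Below Avg.", "Bad", "Awful"),
      right = FALSE)
}

# Chapman, Hughes, Ross, Ziegler, Penny
tiers <- cfip_category(c(36, 70, 95, 104, 130))
```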

But, go ahead and check out the cFIP scores for yourself. I’ve posted the results for every pitcher in baseball for 2011, 2012, 2013, and 2014. Review, compare, and discuss to your heart’s content.

Where does cFIP disagree with other DIPS metrics the most? Those comparisons are easiest to make with FIP- and xFIP-, as they are not only also on a “minus” 100 scale, but also represent two of the most popular metrics that try to account for park and league. Compared to FIP-, here are a few significant disagreements among pitchers with 170-plus batters faced. Remember, lower numbers are better, and higher numbers are worse.

**Significant Differences Between FIP- and cFIP-**

| Player | FIP- | cFIP- | Difference |
|---|---|---|---|
| Ernesto Frieri | 151 | 90 | -61 |
| CC Sabathia | 123 | 94 | -29 |
| Brett Anderson | 69 | 99 | +30 |

By far the biggest gulf belongs to Frieri, whom cFIP identifies as incredibly unlucky last year, with his struggles better explained by the quality of his opposition and ballparks. The team that saw through this and signed him as a bounce-back candidate was — surprise! — the Rays. At $800,000 plus incentives, the Rays seem poised to capitalize on yet another inefficiency.

Sabathia is a similarly interesting candidate. Although his performance last year was ugly, at a FIP- of 123, cFIP sees him, even blind to his injury issues, as a still-above-average pitcher who ran into a buzz saw of circumstances.

On the other end of the spectrum, there is Anderson, signed with some fanfare by the Dodgers this offseason. Anderson had a sparkling 2.99 FIP in limited action, which looks impressive at first glance when you consider it was achieved with the Rockies. However, cFIP is not buying it, viewing him as a purely average (99) pitcher considering his opponents and even his ballparks. cFIP thus sees Mr. Friedman as having probably overpaid for Mr. Anderson, but the Dodgers, for whatever reason, seem to be trusting his FIP to combine with some actual good luck on injuries.

The disagreements between cFIP and xFIP- are less extreme, which is unsurprising given that xFIP operates on a tighter distribution than FIP. Nonetheless, there are still some notable disagreements, particularly with starters:

**Significant Differences Between xFIP- and cFIP-**

| Player | xFIP- | cFIP- | Difference |
|---|---|---|---|
| Max Scherzer | 83 | 68 | -15 |
| Phil Hughes | 84 | 70 | -14 |
| Hisashi Iwakuma | 75 | 88 | +13 |

xFIP is a funny thing: while it often grants appropriate compassion to victims of bad flyball luck, it also refuses (by design) to credit pitchers who excel at minimizing flyball damage. That certainly seems to describe Scherzer and Hughes, both of whom were punished by xFIP’s typical regression toward league average on fly balls. cFIP does not see it that way, and in fact finds their 2014 performances to be downright exceptional in light of the competition and ballparks they faced.

The same cannot be said for Iwakuma, about whom cFIP is more skeptical. Granted, a cFIP of 88 is nothing to sneeze at; Iwakuma is still an above-average pitcher. But cFIP seems to feel Iwakuma should have performed much better given his pitcher-friendly home ballpark (Safeco Field) and some of the terrible in-division teams (Rangers and Astros) he got to face last year.

### The Proof

cFIP certainly talks a good game. Context-adjusted, 100-scale, batters faced as a denominator: these all sound promising. But what is the best way to compare its effectiveness to current DIPS statistics? The answer to that question is, for me, the most fascinating part of this study.

Let’s start with the basics. As most of you know, statistics are commonly divided into two general categories: descriptive and inferential. Descriptive statistics describe what has happened in the past, whereas inferential (a.k.a. “predictive”) statistics are focused on drawing inferences about the future from the limited information we have now. Until now, pitcher metrics have forced baseball researchers to choose between those two characteristics.

That is about to change. To understand why, we need to review the descriptive and predictive abilities of all these metrics.

### Descriptive Power

To compare descriptive power, let’s look at the average in-season performance of the various estimators, correlating to FanGraphs’ RE24/PA. At the suggestion of Tom Tango, I’ll also include kwERA (formula: (K - (BB - IBB + HBP)) / PA). We’ll average a four-year sample for each:

**Average In-Season Estimator Performance, 2011-2014**

| Metric | 2011 | 2012 | 2013 | 2014 | Mean |
|---|---|---|---|---|---|
| RA9 | -0.92 | -0.92 | -0.94 | -0.93 | -0.93 |
| ERA- | -0.92 | -0.93 | -0.94 | -0.92 | -0.93 |
| ERA | -0.89 | -0.90 | -0.92 | -0.90 | -0.90 |
| kwERA | -0.84 | -0.81 | -0.83 | -0.86 | -0.84 |
| FIP- | -0.72 | -0.73 | -0.73 | -0.70 | -0.72 |
| FIP | -0.70 | -0.70 | -0.72 | -0.69 | -0.70 |
| cFIP | -0.66 | -0.62 | -0.62 | -0.61 | -0.63 |
| SIERA | -0.64 | -0.63 | -0.60 | -0.62 | -0.62 |
| xFIP- | -0.62 | -0.61 | -0.59 | -0.62 | -0.61 |
| xFIP | -0.61 | -0.60 | -0.59 | -0.62 | -0.61 |

This chart considers all pitchers with 170-plus batters faced, which is approximately equivalent to at least 40 innings pitched. The pitching metrics all have an inverse correlation with FanGraphs’ RE24, so remember -1.0 is the highest possible score, showing a perfect negative correlation, with 0.0 still the worst (meaning no correlation).
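For readers who want to reproduce this kind of comparison, here is a small base-R sketch of a weighted correlation. Weighting by batters faced is an assumption on my part for this example, `weighted_cor` is a hypothetical helper, and the five pitchers are invented.

```r
# Weighted Pearson correlation: observations with larger weights count for more.
weighted_cor <- function(x, y, w) {
  w  <- w / sum(w)                 # normalize weights to sum to 1
  mx <- sum(w * x)                 # weighted means
  my <- sum(w * y)
  cov_xy <- sum(w * (x - mx) * (y - my))
  cov_xy / sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
}

# Hypothetical pitchers: an ERA-scale metric, RE24 per PA, and batters faced.
metric  <- c(2.8, 3.4, 3.9, 4.5, 5.1)
re24_pa <- c(0.030, 0.012, 0.001, -0.015, -0.020)
tbf     <- c(850, 640, 300, 120, 15)

r_weighted <- weighted_cor(metric, re24_pa, tbf)
```

With equal weights the function reduces to the ordinary `cor()`; with batters-faced weights, the pitcher with 15 batters faced barely moves the result.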

Not surprisingly, RA9 (raw runs allowed per nine innings) and ERA’s park- and league-adjusted cousin, ERA-, tie for the tightest correlation to RE24/PA at -.93. Because runs-allowed metrics tell you simply how many runs crossed the plate, this is to be expected. After ERA, which does only slightly worse, we have kwERA at -.84, with FIP- and FIP checking in at -.72 and -.70, respectively. cFIP registers an average of -.63, whereas SIERA and the xFIPs bring up the rear by a small amount.

If we do a weighted correlation of all pitchers to RE24, including those who faced as few as one batter, here are the results:

**Average In-Season Estimator Performance Correlated to RE24, 2011-2014**

| Metric | 2011 | 2012 | 2013 | 2014 | Mean |
|---|---|---|---|---|---|
| RA9 | -0.91 | -0.88 | -0.93 | -0.88 | -0.90 |
| ERA- | -0.90 | -0.86 | -0.92 | -0.86 | -0.89 |
| ERA | -0.88 | -0.85 | -0.91 | -0.86 | -0.88 |
| FIP- | -0.71 | -0.72 | -0.72 | -0.72 | -0.72 |
| FIP | -0.70 | -0.71 | -0.72 | -0.72 | -0.71 |
| xFIP- | -0.58 | -0.60 | -0.60 | -0.60 | -0.60 |
| xFIP | -0.57 | -0.59 | -0.59 | -0.60 | -0.59 |
| SIERA | -0.55 | -0.56 | -0.57 | -0.54 | -0.56 |
| kwERA | -0.56 | -0.53 | -0.55 | -0.52 | -0.54 |
| cFIP | -0.57 | -0.53 | -0.56 | -0.49 | -0.54 |

The pecking order is very similar to before, except cFIP and kwERA now move to the bottom in performance. Is this concerning? Actually, it’s not, for reasons that we’ll see in a moment.

### Predictive Power

So, let’s talk about the other side of the coin: the ability to draw inferences about the future. For most people, this is where the rubber hits the road. There is something to this sentiment. We already know how many runs came across the plate. What we usually want to know is if there is any reason the pitcher’s results should change. And so, much of the discussion of DIPS metrics tends to revolve around each estimator’s ability to predict runs allowed by a pitcher in future seasons.

We’ll use the so-called “Year+1” test for these metrics: how well they do in predicting RE24 per plate appearance (RE24/PA) in the pitcher’s next season. I took the harmonic mean of the total batters faced from consecutive seasons for each pitcher. Again, we are looking for a range of -1.0 (best) down to 0.0 (worst). Here are the results, first for pitchers with 170-plus batters faced, then with all pitchers in a given season:

**Predicting RE24/PA, 2011-2014, Min. 170 TBF**

| Metric | 2011-12 | 2012-13 | 2013-14 | Mean |
|---|---|---|---|---|
| cFIP | -0.46 | -0.37 | -0.37 | -0.40 |
| SIERA | -0.42 | -0.35 | -0.38 | -0.38 |
| kwERA | -0.45 | -0.37 | -0.32 | -0.38 |
| xFIP- | -0.36 | -0.34 | -0.36 | -0.35 |
| FIP- | -0.42 | -0.29 | -0.32 | -0.34 |
| xFIP | -0.37 | -0.30 | -0.36 | -0.34 |
| FIP | -0.41 | -0.23 | -0.33 | -0.32 |
| ERA- | -0.36 | -0.23 | -0.25 | -0.28 |
| RA9 | -0.36 | -0.20 | -0.26 | -0.27 |
| ERA | -0.35 | -0.20 | -0.25 | -0.27 |

**Predicting RE24/PA, 2011-2014, All Pitchers**

| Metric | 2011-12 | 2012-13 | 2013-14 | Mean |
|---|---|---|---|---|
| cFIP | -0.36 | -0.32 | -0.28 | -0.32 |
| SIERA | -0.31 | -0.29 | -0.28 | -0.29 |
| xFIP- | -0.29 | -0.31 | -0.28 | -0.29 |
| xFIP | -0.28 | -0.29 | -0.28 | -0.28 |
| FIP- | -0.30 | -0.24 | -0.23 | -0.26 |
| FIP | -0.29 | -0.20 | -0.24 | -0.24 |
| RA9 | -0.25 | -0.16 | -0.19 | -0.20 |
| ERA- | -0.24 | -0.17 | -0.17 | -0.19 |
| ERA | -0.24 | -0.16 | -0.18 | -0.19 |
| kwERA | -0.21 | -0.17 | 0.03 | -0.12 |

cFIP is the winner here, which is what we would expect from a metric that benefits from considering the context of each plate appearance. So, going forward, cFIP appears to be a better choice than any other metric for predicting future runs allowed.
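The mechanics of the Year+1 test can be sketched in R as follows. Using the harmonic mean of batters faced as the correlation weight is my reading of the setup rather than something spelled out above, and the data frames, column names, and values are all hypothetical.

```r
# Hypothetical data: a "minus"-scale metric in year 1, RE24/PA in year 2,
# and batters faced (tbf) in each season.
y1 <- data.frame(pitcher = c("A", "B", "C", "D"),
                 metric  = c(75, 92, 104, 121),
                 tbf     = c(820, 260, 700, 180))
y2 <- data.frame(pitcher = c("A", "B", "C", "E"),
                 re24_pa = c(0.025, 0.004, -0.006, 0.010),
                 tbf     = c(790, 310, 150, 400))

# Keep only pitchers who appear in both seasons.
both <- merge(y1, y2, by = "pitcher", suffixes = c(".y1", ".y2"))

# Harmonic mean of batters faced: pulled toward the smaller of the two samples.
both$weight <- 2 * both$tbf.y1 * both$tbf.y2 / (both$tbf.y1 + both$tbf.y2)

# Weighted correlation between the year-1 metric and year-2 RE24/PA.
wc <- cov.wt(both[, c("metric", "re24_pa")], wt = both$weight, cor = TRUE)
year_plus_one <- wc$cor[1, 2]
```

The harmonic mean keeps a pitcher with 700 batters faced one year and 150 the next from being treated as a large-sample observation.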

At the same time, none of these performances is something to write home about. A -0.40 isn’t bad, necessarily, but it may not be that much better than simply picking random numbers or just projecting everyone to be average.

Why does every metric do so poorly in predicting next-season RE24/PA? Much of it probably is the time-honored concept of regression to the mean: those who went up will generally come down, and vice versa. Predicting exactly which players will obey this rule and how severely they will obey it is quite difficult.

I would like to propose a different approach for predicting future performance. I believe the ideal goal of a pitcher estimator should be to estimate the inherent quality of the pitcher, not to specifically estimate the pitcher’s future runs allowed. Future results, after all, are a combination of pitcher quality + circumstances. (And random variation, but that is always present.)

I would argue the best way to account for the latter is through projection systems, not pitching estimators. Projection systems like PECOTA, Steamer and ZIPS are designed to take into account circumstances like changes in team, stadiums, injury history, and the like. They can also explicitly incorporate a general regression factor as part of these complicated adjustments.

Thus, if we are looking for an accurate estimator of pitcher ability, what we should be considering is not how the estimator predicts future run expectancy, but how the estimator correlates with itself in consecutive seasons. After all, we already know from our descriptive analysis how well these metrics correlate in-season to run expectancy; what we really want to know in projecting future results is whether the metric is accurately assessing the same qualities in the pitcher. We find that by testing the metric against itself out of sample. The hypotenuse of that analytic triangle — the translation of the pitcher metric to future run expectancy — then can be implemented in an actual projection system.

To do this, I compared the season-to-season correlation of the various run estimators for all pitchers with at least 170 batters faced. These figures will be positive, because we are correlating the metric to itself. So, the best score is 1; the worst is 0. Here is what I found:

**Season-to-Season Correlation, Run Estimators, 2011-2014**

| Metric | 2011-12 | 2012-13 | 2013-14 | Mean |
|---|---|---|---|---|
| cFIP | 0.64 | 0.64 | 0.63 | 0.64 |
| kwERA | 0.61 | 0.63 | 0.60 | 0.61 |
| SIERA | 0.58 | 0.58 | 0.58 | 0.58 |
| xFIP | 0.55 | 0.53 | 0.55 | 0.54 |
| xFIP- | 0.55 | 0.54 | 0.54 | 0.54 |
| FIP- | 0.48 | 0.42 | 0.45 | 0.45 |
| FIP | 0.49 | 0.38 | 0.45 | 0.44 |
| RA9 | 0.39 | 0.24 | 0.26 | 0.30 |
| ERA | 0.36 | 0.26 | 0.23 | 0.28 |
| ERA- | 0.35 | 0.28 | 0.22 | 0.28 |

cFIP is the clear winner, followed by kwERA, SIERA, the xFIPs, the FIPs, and last of all by RA9 and the ERAs, the latter of which, as most of you already knew, have very little value in predicting future run expectancy.

cFIP’s victory in predicting future performance is an impressive feat. Remember, SIERA and xFIP are both designed to excel at prediction, because they disregard the most volatile factor of all (past home runs) and replace it with a regression component. cFIP, however, describes only the actual events that happened, including home runs, and still beats both SIERA and xFIP. kwERA is also quite impressive.

But that’s not all. Predictability is usually most important in small sample sizes. ‘X’ pitcher has had two good months. Will he probably continue to pitch that well? This is where cFIP really shines. Including pitchers with as few as one batter faced, look at the year-to-year correlation of cFIP to itself, as compared to other metrics:

**Season-to-Season Correlation, cFIP, 2011-2014**

| Metric | 2011-12 | 2012-13 | 2013-14 | Mean |
|---|---|---|---|---|
| cFIP | 0.55 | 0.55 | 0.51 | 0.54 |
| SIERA | 0.29 | 0.34 | 0.31 | 0.31 |
| xFIP- | 0.28 | 0.35 | 0.27 | 0.30 |
| xFIP | 0.27 | 0.35 | 0.27 | 0.30 |
| kwERA | 0.27 | 0.42 | 0.16 | 0.28 |
| FIP | 0.16 | 0.23 | 0.29 | 0.23 |
| FIP- | 0.15 | 0.23 | 0.27 | 0.22 |
| ERA | 0.06 | 0.08 | 0.13 | 0.09 |
| ERA- | 0.05 | 0.08 | 0.14 | 0.09 |
| RA9 | 0.05 | 0.07 | 0.13 | 0.08 |

cFIP crushes the other metrics. Far better than any other estimator, cFIP predicts how capable a pitcher will be in the near future as compared to his current performance. Because of its contextual adjustments, cFIP retains the vast majority of its strength even when low-sample pitchers are included. The predictive value of cFIP is clear.

### The Pitcher’s True Talent

Having explored descriptive and predictive tendencies, it’s time to move on to the next step. Before I go further, it’s important to note that I don’t think the author of any current estimator — even xFIP or SIERA — would claim they are purporting to estimate any pitcher’s true talent in their metrics. Rather, they are focused on better describing either what caused a pitcher’s runs allowed, or predicting how his current results will regress in the future. cFIP, however, allows us to be bolder: it permits us to estimate the pitcher’s true talent in the components we are measuring.

When is a pitcher quality estimator actually isolating true talent? My answer is this: when there is a substantial similarity between the estimator’s descriptive and predictive power. If an estimator is truly isolating a pitcher’s talent, there should not be much difference between the two. If an estimator is doing well in one aspect and poorly in another, then it is not estimating a pitcher’s true ability: rather, it is over-fitting past results to better explain what happened (primarily descriptive) or under-fitting past results to minimize future error (primarily predictive).

There is nothing wrong with choosing statistics that skew one way or the other on the descriptive-predictive spectrum, particularly when the author is transparent about which way the statistic swings. But a statistic that is notably skewed one way or the other is not accurately evaluating pitchers’ actual ability.

The degree of similarity between a metric’s descriptive and predictive power reduces to simply taking the mathematical difference between the two. Note that I am defining “predictive” through my preferred “estimator predicting itself” method. I’ll use absolute values to keep things simple, and please remember that a lower differential is better. Here is how our estimators stack up with pitchers who have faced 170-plus batters:

**Descriptive vs. Predictive Power, Min. 170 TBF**

| Metric | Descriptive | Predictive | Differential |
|---|---|---|---|
| cFIP | 0.63 | 0.64 | 0.01 |
| SIERA | 0.62 | 0.58 | 0.04 |
| xFIP | 0.61 | 0.54 | 0.06 |
| xFIP- | 0.61 | 0.54 | 0.07 |
| kwERA | 0.84 | 0.61 | 0.22 |
| FIP | 0.70 | 0.44 | 0.26 |
| FIP- | 0.72 | 0.45 | 0.27 |
| ERA | 0.90 | 0.28 | 0.62 |
| RA9 | 0.93 | 0.30 | 0.63 |
| ERA- | 0.93 | 0.28 | 0.64 |

And here is the same comparison for all pitchers, regardless of sample size:

**Descriptive vs. Predictive Power, All Pitchers**

| Metric | Descriptive | Predictive | Differential |
|---|---|---|---|
| cFIP | 0.54 | 0.54 | 0.00 |
| SIERA | 0.56 | 0.31 | 0.24 |
| kwERA | 0.54 | 0.28 | 0.26 |
| xFIP | 0.59 | 0.30 | 0.29 |
| xFIP- | 0.60 | 0.30 | 0.30 |
| FIP | 0.71 | 0.23 | 0.49 |
| FIP- | 0.72 | 0.22 | 0.50 |
| ERA | 0.88 | 0.09 | 0.79 |
| ERA- | 0.89 | 0.09 | 0.80 |
| RA9 | 0.90 | 0.08 | 0.82 |

cFIP is the winner and is by far the most consistent in its descriptive and predictive assessments of pitcher performance. In other words, at all times, cFIP does by far the best job at assessing a pitcher’s true underlying ability (within the components it considers). The other statistics consistently overfit past performance relative to each player’s true talent when evaluating in-season performance. Again, there is nothing wrong with that: they are trying to explain what happened and, to varying degrees, doing a good job of it. But what they are **not** doing is consistently revealing the true talent of the pitcher on the mound, particularly in small samples.
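The differential itself is a one-line calculation. The R sketch below recomputes it from the rounded mean correlations reported for pitchers with 170-plus batters faced; the data frame and column names are just illustrative.

```r
# Mean correlations copied from the tables above (minimum 170 batters faced).
estimators <- data.frame(
  metric      = c("cFIP", "SIERA", "xFIP", "xFIP-", "kwERA",
                  "FIP", "FIP-", "ERA", "RA9", "ERA-"),
  descriptive = c(-0.63, -0.62, -0.61, -0.61, -0.84,
                  -0.70, -0.72, -0.90, -0.93, -0.93),
  predictive  = c(0.64, 0.58, 0.54, 0.54, 0.61,
                  0.44, 0.45, 0.28, 0.30, 0.28)
)

# Compare magnitudes only, since the descriptive correlations are negative.
estimators$differential <- abs(abs(estimators$descriptive) - estimators$predictive)

# Smallest gap first.
estimators <- estimators[order(estimators$differential), ]
```

Because these inputs are rounded to two decimals, a few results differ by 0.01 from the differentials in the table, which appear to have been computed from unrounded values.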

### Conclusion

Although cFIP is an exciting development, I consider it to be the beginning, not the end, of our efforts to bring better context to baseball statistics. If CSAA brought mixed models out into the open, then cFIP demonstrates we have many other potential applications for them. In that regard, there is no reason why xFIP, SIERA, and other promising efforts like TIPS and BERA cannot themselves be reworked within a mixed model framework. When so reinforced, they may very well surpass cFIP. This is especially true of kwERA, particularly if researchers are comfortable classifying its ability to project true talent as arising solely from its strong predictive ability. I hope other baseball researchers make these efforts, and to help them do this, I have provided the model specifications for the underlying cFIP components in the Appendix.

I look forward to a robust discussion of what cFIP means and how it can make baseball analysis better. For the time being, cFIP gives us a glimpse of the world into which we are headed.

### References & Resources

- Special thanks to Tom Tango, Harry Pavlidis and Dan Turkenkopf, all of whose suggestions made this a much better paper. Any remaining errors are solely mine.
- Bates D, Maechler M, Bolker B and Walker S (2014). _lme4: Linear mixed-effects models using Eigen and S4_. R package version 1.1-7.
- R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- The indispensable Retrosheet.

### Appendix: The Models

For home runs, I used the following equation:

```r
library(lme4)  # provides glmer()

HR.2014.glmer <- glmer(HR ~ stands*throws + stadium + (1|batter) + (1|pitcher),
                       data = HR.2014, family = binomial(link = 'probit'), nAGQ = 0)
```

This is a generalized linear mixed model. The output is whether the plate appearance ended up in a home run (1,0). The fixed-effect variables that were considered included **stands*throws** (an interaction between the batter’s side of the box and the pitcher’s handedness) and **stadium** (the park in which the home run took place). The random effects are the batter and pitcher involved.

The mixed model computed a conditional mode for each pitcher, in each season, as to whether they made a home run more or less likely than average, and to what extent. As with CSAA and the other models in this paper, the probability of a home run for each pitcher was subtracted from the null probability of a home run under average circumstances, to isolate the net home run probability contributed by each pitcher above or below average for the season. The same process was used for all the other components evaluated.
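The subtraction step can be illustrated without fitting anything. In the R sketch below, the intercept and the pitcher effects are invented stand-ins for what a fitted model would supply (with lme4, via `fixef()` and `ranef()`), and treating the intercept alone as the “null” linear predictor is a simplification that ignores the stadium and handedness terms.

```r
# Hypothetical values standing in for a fitted probit model:
# with lme4 these would come from fixef(model) and ranef(model)$pitcher.
intercept     <- -1.95                              # probit-scale baseline ("null") term
pitcher_modes <- c(A = -0.10, B = 0.00, C = 0.08)   # conditional modes per pitcher

# Probit link: probabilities are the standard normal CDF of the linear predictor.
null_prob    <- pnorm(intercept)
pitcher_prob <- pnorm(intercept + pitcher_modes)

# Net home run probability per plate appearance attributable to each pitcher.
# Negative means fewer home runs than average.
net_prob <- pitcher_prob - null_prob
```

Pitcher B, whose conditional mode is zero, contributes nothing beyond the null probability; pitchers A and C move it down and up, respectively.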

For (unintentional) walks, I used the following model:

```r
BB.2014.glmer <- glmer(BB ~ bat_home + stands*throws + stadium + (1|batter) + (1|pitcher) + (1|catcher) + (1|umpire),
                       data = BB.2014, family = binomial(link = 'probit'), nAGQ = 0)
```

For hit-by-pitch events, I used this model:

```r
HBP.2014.glmer <- glmer(HBP ~ bat_home + stands*throws + stadium + (1|batter) + (1|pitcher) + (1|catcher),
                        data = HBP.2014, family = binomial(link = 'probit'), nAGQ = 0)
```

Finally, for strikeouts, this model was used:

K.2014.glmer <- glmer(K ~ bat_home + stands*throws + stadium + (1|batter) + (1|pitcher) + (1|catcher), data=K.2014, family=binomial(link="probit"), nAGQ=0)

This is remarkable, fascinating research. Thank you for this!

Considering this metric is scaled to 100, would it be correct to read cFIP similar to other metrics like wRC+, as in “Player X has a cFIP of 85, making him 15% better than average”?

Thanks much. As to cFIP, that is correct. 85 is 15% above league average and 115 is 15% below league average.

People often make that same statement about wRC+, as you did, but I think it technically may not be correct. This is because wRC+ does not, as I understand it, force everyone onto the same standard deviation like cFIP does. As a result, having a wRC+ of 115 does not necessarily mean you are the same amount different from league average as a wRC+ of 85 would in the other direction. I hope that makes sense.

I really like the way you have forced a standard deviation. This has been missing in plus/minus stats. I tend to prefer a percentile approach. On the other hand, this approach is closer to how the community is accustomed to viewing these metrics. So I like your approach.

Mind. Blown.

SEASON-TO-SEASON CORRELATION, RUN ESTIMATORS, 2011-2014

What exactly is this one telling us? Predicting league average performance for every pitcher for every year would give a y2y correlation of 1.0. That would be a bad estimator, though.

I’m not sure I follow your reasoning. Since the pitchers actually would proceed in the following season to diverge from average quite a bit among themselves in actual results, the correlation would not be 1.0 or anything close to it.

I think I’m just confused what you’re correlating here. Is the first cell (0.64) showing the correlation between observed cFIP in 2011 and observed cFIP in 2012?

Correct, yes. I apologize if that was confusing.

So I just created a new metric called ERA4. ERA4 = 4.00 for every pitcher in our dataset in 2011. Also, ERA4 = 4.00 for every pitcher in 2012. When I correlate ERA4 from 2011 to ERA4 from 2012, I get a correlation of 1.0. What does that tell us about ERA4?

Additionally, say I have another hypothetical metric where correlation between 2011 and 2012 is 0.99. Is that good? Is 0.9 good? Is 0.64 good?

*(actually I guess the correlation would be undefined because SD(ERA4) = 0, but I think you catch my drift)
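The footnote's point can be checked directly. The following is a minimal Python sketch of a Pearson correlation that reports the zero-variance case instead of dividing by zero; the function name and the five-pitcher sample are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # the correlation is undefined, not 1.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

era4_2011 = [4.00] * 5  # every pitcher gets the same value
era4_2012 = [4.00] * 5
print(pearson(era4_2011, era4_2012))  # None
```

A metric that never varies has nothing to correlate, so year-to-year correlation only says something about metrics that actually separate pitchers from one another.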

Well, it probably tells us that ERA4 is a shitty metric because it can’t tell the difference between Clayton Kershaw and Clay Buchholz.

This work is great. I can’t wait to see future developments of this type of research.

The first sentence of your conclusion stated exactly what I was thinking.

“Although cFIP is an exciting development, I consider it to be the beginning, not the end, of our efforts to bring better context to baseball statistics.”

Very very cool.

I’m interested in how much of the improvement is due to controlling for context (batter, park, handedness) and how much of it is due to regressing/shrinking each component the proper amount. So, one problem with xFIP (and I say this as a fan) is that it has one component it shrinks all the way and two it doesn’t shrink at all. The mixed model shrinks everything an appropriate amount given the sample size.

Basically, I’m wondering how would cFIP perform if you had the shrinkage part:

K.2014.glmer <- glmer(K ~ (1|pitcher), data=K.2014, family=binomial(link="probit"), nAGQ=0)

and how would it perform if you just had the controlling for context part (meaning that you made pitchers a fixed effect):

K.2014.glmer <- glmer(K ~ pitcher + bat_home + stands*throws + stadium + (1|batter) + (1|catcher), data=K.2014, family=binomial(link="probit"), nAGQ=0)
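The "shrinkage" half of that question can be illustrated, very roughly, outside a mixed model: pull each pitcher's observed rate toward the league rate, with less pull as the sample grows. This Python sketch is a simplified stand-in and not what glmer computes; the function name, the league rate, and the prior weight `k` are all hypothetical.

```python
def shrink(successes, trials, league_rate, k):
    """Blend an observed rate with the league rate.

    k behaves like k extra plate appearances at the league-average rate.
    """
    return (successes + k * league_rate) / (trials + k)

league_k_rate = 0.20  # hypothetical league strikeout rate
k = 100               # hypothetical prior weight, in plate appearances

small = shrink(15, 50, league_k_rate, k)    # 30% observed over 50 PA
large = shrink(240, 800, league_k_rate, k)  # 30% observed over 800 PA
print(round(small, 3), round(large, 3))     # 0.233 0.289
```

Both pitchers struck out 30% of batters, but the 50-PA pitcher is pulled most of the way back to the league rate while the 800-PA pitcher barely moves. A fixed-effect pitcher term, by contrast, applies no such pull.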

Thanks Jared.

I have wondered about the best way to isolate the effects on various aspects myself. So far, I think it probably consists of comparing various types of pitchers with various PAs and outcomes and seeing how much they get shrunk.

My concern about making pitcher a fixed effect is that the standard errors could get pretty wild given how small some of the samples are. The mixed model deals with them more efficiently and without the same limitations that fixed effects have.

I’ll continue thinking about ways to isolate the various effects. What we know for now is that this method seems to work better than others. Figuring out exactly *why* it ends up being better, and then tweaking to take advantage of those aspects would be an important next step.

For assessing predictive value/accuracy, why not use cross-validation? My guess is that it is too computationally intensive for mixed models (given that the model already takes 15 min to run).

Absolutely. LOOCV would take weeks. 10-fold CV wouldn’t be much better. Ditto with the alternative of simulation. Computing power still has a way to go in this regard . . .

This is excellent work, Jonathan. Since mixed models let you control for a lot of different things, I’m wondering whether hits can be brought into the equation. You could perhaps use Base Runs instead of FIP. The hits part of the equation still has to be shared with fielders, but controlling for team, ballpark and batters can help us separate the two a little more.

Thanks for taking the time to do all of this research.

I guess we could say that Palmer and Thorn cared about context because I think in the original Total Baseball adjusted OPS (or production as they called it) was compared to the league average but it took into account the fact that batters did not face their own pitchers (and for adjusted ERA the reverse was assumed for pitchers)

So it looks like you are refining that quite a bit.

Two questions

In (K – (BB – IBB + HBP)) / PA

are IBBs included in the PA count? What about sacrifice hits?

If we used cFIP to calculate pitcher’s WAR, how much would the numbers change? Would some pitchers go up a lot and would some go down a lot? It might be interesting to see who is affected the most.

On your questions, yes on both; I applied the formula that was suggested. I suppose those could be excluded but wouldn’t expect it to make much difference.

I don’t know the effect on WAR because I don’t know how FanGraphs does it, exactly. But I expect the pitcher would end up getting a lot less “responsibility” for various events than they currently are being assigned.

Thanks

How about sFIP = situational FIP?

Interesting work, but the 2014 numbers had something like 2/3rds of pitchers being average (cFIP=100) or worse (251 under 100, 650+ total). Also, the non-pitcher pitchers all seemed to do pretty well: J P Arencibia, Adam Dunn, etc.

I also wonder if there is an effect from AL DH vs. NL – 2014 seems to show that also.

On the one hand, it makes sense for starting pitchers because AL lineups which an SP faces won’t have a pitcher batting. On the other hand, this exaggeration should be removed for RPs because RPs face pitchers far, far less than “normal” – which is why I’d think there would be a lot more even distribution of NL relief pitchers in the top ranks.

How about sFIP, situation FIP?

Nice name…

This may be a dumb question but are there guys that would continuously outperform their cFIP? Mark Buehrle and RA Dickey come to mind off the top of my head.

Great stuff! Three questions:

1) Is there any improvement if you interact batter-handedness and park? Seems intuitive to me that pitching to a right-hander in Yankee Stadium is very different from pitching to a lefty in Yankee Stadium.

2) Why are umpires in walks and not strikeouts?

3) Do you see any noticeable changes if you add catchers and umpires to the HR model? It seems intuitive to me that if they affect the probability of a ball or strike, that effect would also affect the probability of a HR (it’s harder to hit a home run from 0-2). That said, the effect is likely very small, so it might not affect things noticeably even if it’s measurable.

Thanks Frank.

(1) Good idea on the proposed interaction. I’ll have a look.

(2) Umpires affected walks and not strikeouts (interestingly).

(3) Nope: catchers and umpires had no noticeable effect on the likelihood of a home run occurring.

Overall, I started out with the same master set of predictors for all of the equations and took out the ones that appeared to have no effect on the outcome.

Great questions and thoughts.

“Overall, I started out with the same master set of predictors for all of the equations and took out the ones that appeared to have no effect on the outcome. ”

That’s pretty much the definition of over-fitting. What if over the next four years umpires appear to have an effect on BBs but not Ks? Baseball will have changed, or your orig. model was overfit to the data?

*Ks, not BBs

Excellent work. What a fun read.

I’m not a statistician by any stretch, so please excuse any ignorance in the following train of thought.

I just wondered if there was a better way to evaluate each metric’s descriptive vs. predictive powers. I understand that calculating the difference between the two correlations will tell you how balanced it is, but it doesn’t really tell you how effective it is at doing both. For instance, if you take it to the extreme, if I came up with a metric that had correlation coefficients of .03 and .03, it is perfectly balanced in its ability to describe and predict, but it sucks at doing both. Wouldn’t adding them up and THEN subtracting the difference take care of this problem?
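Read literally, the proposal is "sum of the two correlations minus their absolute difference." Here is a minimal Python sketch of that reading; the function name is hypothetical, and apart from the .03/.03 case above the input values are invented for illustration.

```python
def combined_score(descriptive, predictive):
    """Sum of the two correlations minus their absolute difference."""
    return descriptive + predictive - abs(descriptive - predictive)

# The balanced-but-useless metric from the example above.
print(combined_score(0.03, 0.03))

# A hypothetical metric that is decent at both, but not perfectly balanced.
print(combined_score(0.60, 0.50))
```

Algebraically this is just twice the smaller of the two correlations, so under this reading a metric is scored by whichever job it does worse.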

Thanks, and once again, great work!

Thanks Phil.

You are correct; one could conceive of a nonsense metric that would “fool” the system a bit. By limiting it to what I considered legitimate metrics, I didn’t encounter that problem.

Your proposal is an interesting one. One could also try to consider descriptive, predictive, and differential as part of a three-dimensional Euclidean distance from some zero and/or one, or square some of the directions but not others for a squared Euclidean distance. Arguably, though, it all amounts to finding a systematic way to give one a result that seems right for whatever reason. I’ll continue thinking about this, but I appreciate the suggestion.

Thanks for reading.

Craig Kimbrel… cFIP 21

this is great stuff…

Quick question on stats that use 100 as average. So the 21 he posted is relative to the average in which he is included. What if we were to compare himself to the league average excluding his season? Is it meaningful?

It’ll be interesting to observe how the statistically inclined baseball fan community receives improved stats like these… I have no prediction other than that established interests of any kind are default resistant to change.

Bill, there are 189k plate appearances in mlb over a season, so removing one player, particularly one with a fairly limited number of batters faced, is unlikely to make a difference. I wouldn’t rule out the notion that a truly transformational player with a ton of plate appearances could tilt a needle a bit for everything else, but I haven’t checked, say, the effect of excluding Barry Bonds or something like that. Interesting idea, though.

That’s a whole lot of work. Well done. Been fiddling with pitching metrics myself for a while, and I think that you really got something here. I actually used your cFIP metric to figure out the Twins pitching situation (here if interested: http://tenthinningstretch.blogspot.com/2015/03/looks-like-twins-had-only-one-awful.html) and I think that really makes sense.

So good job 🙂 Would love to see a formula to quick-calculate it though…

I’ll have a look; thanks for using it.

There can’t be a quick way to calculate it because it needs to consider all available plate appearances in baseball during the subject time period to adjust them for each other. What I *could* do is post the full R code for cFIP on something like GitHub for others to play with.

The problem is that the Retrosheet querying uses the function provided by Jim Albert and Max Marchi in their terrific book “Analyzing Baseball Data with R,” and without their permission, I’m not inclined to post that, in part because I think people should be buying their book and supporting their work. Maybe I’ll think of a compromise.

Very impressive work, congratulations and thank you.

Out of curiosity, is there any reason you chose to work with FIP specifically over any of the other metrics?

Given SIERA’s performance in the descriptive vs predictive comparison, I’d be really curious to see what a contextual version of it might look like.

This really is fantastic work, congratulations and thank you.

Just curious, was the choice to use FIP as the base metric because it’s the most popular or because you personally felt it the best estimator?

I’d be very interested in seeing a contextual SIERA, given how well it performed relative to the other metrics in the Predictive/Descriptive comparison.

Thanks much.

I used FIP just because it struck me as the most obvious place to start. I do believe cFIP is highly useful as is, but it also is meant to function as a “proof of concept” for further extensions to other estimators.

As I noted in the article, I’m also interested to see how the adjustments would affect these other estimators, although as we start to incorporate batted ball events, which are more volatile, we may start losing our symmetry with descriptive and predictive power. But, maybe that fear is unfounded. We’ll see.

Makes the Cards look like geniuses getting rid of Shelby Miller and Joe Kelly, leaving them with a starting staff with all below 100.

i made a histogram of the 2014 cfip data to see what the distribution looked like because i was unsure of what forcing a std dev would do. The distribution is non-normal and has a long tail towards lower values. doesn’t the non-normal distribution make the z-scores confusing to interpret? i think most people would use z-scores/std devs from the mean to indicate relative rarity, but with skewed data, that interpretation doesn’t stand. for instance, for 2014, there are 79 pitchers with cFIPs between 115 and 130, but only 50 pitchers between 70 and 85. I’d much rather see cfip on the runs allowed scale and have the actual avg and std dev reported.

All the minus stats have this problem, but the actual ERA, RA/9, FIP and RE24 values are relatively normal. Are you reporting Pearson correlations? I’m curious what would happen if you used Spearman instead.

Great comments. Thank you.

The cFIP values are definitely skewed. As with all individual accomplishments in baseball, they are not normally distributed in the population. We have a minority of stars who drive stats up, we have a number of replacement-level types who help drag them down, and we have a bunch of guys in the middle who end up around “the average” when the outliers are taken into account.

The question of how best to present their relationships to each other is an important one, and every option has disadvantages. Certainly, I don’t think rating people by logs is going to win any converts. And although the skew might support taking more of a percentile / median approach, I’m not sure that ultimately tells people more of what they want to know. My approach uses the PitchIQ method employed by Dan Brooks and PitchInfo over at Brooksbaseball, and while it forces the data into a standard deviation of 15, that at least in my mind is better than the other 100-scale stats, which do not force any standard deviation, and thus leave people thinking that the two sides of 100 are symmetrical when they in fact are not.
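As a rough illustration of what forcing a mean of 100 and a standard deviation of 15 involves, here is a minimal Python sketch. It is an unweighted, population-SD version with invented input values; the actual PitchIQ-style calculation may weight pitchers or differ in other details, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def to_100_scale(values, target_mean=100.0, target_sd=15.0):
    """Linearly rescale values to the target mean and standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    return [target_mean + target_sd * (v - mu) / sd for v in values]

# Hypothetical raw run estimates with a long tail toward low (good) values.
raw = [2.4, 3.1, 3.6, 3.9, 4.2, 4.3, 4.5]
scaled = to_100_scale(raw)
print([round(s) for s in scaled])
```

The rescaling is linear, so it fixes the spread but leaves the shape of the distribution, including any skew, exactly as it was.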

I was doing weighted Pearson correlations, and I agree that Spearman is worth a look to double-check.

At the very least, may I suggest using a capital C (CFIP) to distinguish it from cFIP while you’re trying to think of something better?


Thanks so much for publishing.

I suggest changing the name to sFIP-

Don’t you think xxFIP from breaking blue is better ERA estimator?

http://www.breakingblue.ca/2013/12/18/one-estimator-to-rule-them-all-xxfip-part-3/

Sanity spot check with Frieri fails. It shows that his parks were def. not hitter’s parks on average:

http://www.fangraphs.com/statsd.aspx?playerid=5178&position=P

In fact, that is one of the more pitcher-friendly schedules of parks an RP could have.

As for quality of opposition, Frieri had an opponent OPS of .705. Average opp. OPS for all relief pitchers with 30+ IP and 0 GS was ~ .703. I.e., he did not face a difficult schedule of batters.

http://www.baseballprospectus.com/sortable/index.php?cid=1816827

What gives?

Or does your system somehow reward Ps who give up big hits to good hitters and get bad hitters out? And those who get blown up disproportionately at hitter’s parks (vs. park-adjusted expectations)? That would be odd.

Jonathan-

There was mention of a follow-up article re: CSAA and interesting cases like Javy Lopez? Is that still in the works? Also, when is BBpro going to have the CSAA data for catchers implemented?

Finally, as a related question to what somebody asked earlier: how will the CSAA data affect pitcher WAR at BBpro (as opposed to FIP based versions)? If a pitcher threw 20% of the pitches received by a plus 20-run framer, do we subtract 4 runs saved from the pitcher or regress first? How will this work?

Hey Matthew,

Could you go ahead and post this question on the BP CSAA article? We can get back to you specifically. In general, I expect it to go live before the season starts.

I would, but don’t you have to pay for membership to post comments?

5,000 words is about ten times what one would expect in an explanation of a new stat.

Sorry, I couldn’t follow it.

It’s a highly complex stat. Too many moving parts rarely work well together. And 5,000 words? If it takes that long to explain a new stat, it’s not going to catch on.

The 40 mostly positive comments above disagree with you.

Great work. I’ve been dreaming about what I’ve called “everything-adjusted” metrics for decades.

Like others, I can’t wait to see this methodology applied to SIERA and the like. My question is, in the meantime, how predictive will, e.g.,

xFIP- + (cFIP – FIP-)

be of a future cxFIP?

Unfortunately, I don’t think you can add them together like that, as they live on different scales. But, it is a nice thought. And, since cFIP performs better than those other two, I wouldn’t bother using them at all for any predictive purpose.

I just read this. I want to play and tinker with it so bad.

This comment feels very sensual.

I expect cFIP will be up in a github repository before too long, and then you can do whatever things you desire to the code.

Is there anywhere we can find cFIP stats for the current season as it unfolds? Please don’t make me do all the math myself… 🙂 Great work, by the way.

John, here you go: http://www.baseballprospectus.com/sortable/index.php?cid=1826545