# Standard deviation and ERA estimators

Sabermetrics is my passion, but that does not mean it always has been.

One of the main reasons I became interested in studying baseball statistics was fielding independent pitching (FIP) and ERA estimators. Over the past few months, my interest in ERA estimators turned into an obsession. During that time, I developed predictive FIP (pFIP), an ERA estimator of my own.

In almost every test that I ran, pFIP came out ahead of other, more established ERA estimators (i.e., it had a higher correlation with next-season ERA). This led me to believe that pFIP was the best ERA estimator currently available, but that in no way meant the metric was without its own flaws.

Below are the standard deviations of pFIP, FIP, ERA and two other ERA estimators (SIERA and xFIP) for pitchers who threw at least 100 innings in a season from 2010-12:

Metric | STDEV |
---|---|
ERA | 0.862 |
FIP | 0.652 |
SIERA | 0.510 |
xFIP | 0.502 |
pFIP | 0.387 |

Unsurprisingly, on a single season basis, ERA has the widest distribution, while pFIP has the tightest.

Quite honestly, the fact that pFIP’s standard deviation is significantly smaller than those of xFIP and SIERA (which are known for typically having small standard deviations) was a cause for concern.
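To make the spread comparison concrete, here is a minimal Python sketch of the kind of standard-deviation calculation behind the table above. The numbers are purely illustrative (made-up ERA-scale values for five hypothetical pitchers), not from the actual 2010-12 sample:

```python
from statistics import pstdev

# Hypothetical single-season values for five pitchers (illustrative only).
era = [2.50, 3.10, 3.80, 4.40, 5.20]   # actual ERAs: wide spread
pfip = [3.40, 3.60, 3.80, 4.00, 4.20]  # estimator values: tight spread

spread_era = pstdev(era)    # population standard deviation
spread_pfip = pstdev(pfip)
# A heavily regressed estimator compresses everyone toward the mean,
# so its standard deviation is much smaller than ERA's.
```

Even though both lists share the same mean, the estimator's spread is a fraction of ERA's, which is exactly the pattern in the table above.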

In Colin Wyers’ piece on SIERA and other ERA estimators, this issue is discussed in great detail:

> In a real sense, that’s what we do whenever we use a skill-based metric like xFIP or SIERA. We are using a proxy for regression to the mean that doesn’t explicitly account for the amount of playing time a pitcher has had. We are, in essence, trusting in the formula to do the right amount of regression for us. And like using fly balls to predict home runs, the regression to the mean we see is a side effect, not anything intentional.
>
> Simply producing a lower standard deviation doesn’t make a measure better at predicting future performance in any real sense; it simply makes it less able to measure the distance between good pitching and bad pitching. And having a lower RMSE based upon that lower standard deviation doesn’t provide evidence that skill is being measured. In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on.

My understanding of Colin’s argument is that metrics like xFIP and SIERA “crudely” regress each pitcher to the mean, which leads to a higher correlation (lower RMSE) but may not accurately measure a pitcher’s true talent level.

It is evident that of the four ERA estimators discussed in this piece, pFIP regresses each individual the most toward the mean. This brings me to a question on whose answer I’ve found myself switching sides countless times.

### What is the point of an ERA estimator?

There are two answers to that question that could hold serious weight in an argument:

- To be the best at predicting (highest correlation with) future ERA
- To be the best representation of a pitcher’s true talent level

It would be nice if an ERA estimator came along that could fulfill both of those requirements, but I would argue that no current estimator does.

For the first (high correlation with future, or next-season, ERA), the estimate should be heavily regressed to the mean. But when one is attempting to estimate a pitcher’s true talent level, should that regression be as harsh? At lower innings-pitched totals there should be some regression, but not nearly as much as when the goal is simply to predict future ERA.
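The innings-dependent regression described above can be sketched as a simple weighted average of the observation and the league mean. The ballast value here is hypothetical, chosen only to illustrate the mechanism, not a published constant:

```python
def regress_to_mean(observed, league_mean, ip, ballast):
    """Shrink an observed ERA-scale number toward the league mean.

    `ballast` (a hypothetical stabilization point, in innings) is the
    workload at which we trust the observation and the mean equally:
    more innings -> more weight on the observation, less regression.
    """
    w = ip / (ip + ballast)
    return w * observed + (1 - w) * league_mean

# A 2.80 ERA over 100 IP gets pulled most of the way back to a 4.00 league mean.
light_sample = regress_to_mean(2.80, 4.00, 100, 300)  # 3.70
# The same 2.80 over 400 IP keeps more of its distance from the mean.
heavy_sample = regress_to_mean(2.80, 4.00, 400, 300)
```

The larger the innings total, the closer the estimate stays to the observed number, which is the behavior one would want from a true-talent estimator.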

My main issue with the true-talent idea for ERA estimators is how difficult that number actually is to calculate. An ERA estimator that reflected a pitcher’s skill would have to account for all of the factors within the pitcher’s control and correctly weed out everything outside it. The problem is that this is nearly impossible to do.

In the extreme, relievers throw so few innings on a per-season basis that by the time they throw enough innings for us to get a fair idea of their ERA talent, years will have passed. And in all likelihood, their true talent will have changed. Even for starters who throw more innings, their true talent level is tough to decipher out of all the different factors that go into run prevention.

The consensus at this point is that the estimator with a standard deviation as wide as that of true-talent ERA and a high correlation with future ERA is the best at measuring true talent. However, there are issues with this approach, too.

Pinning down an exact number for the standard deviation of a pitcher’s true-talent ERA is difficult. This issue was raised in a FanGraphs Community post by Steve Staude. He showed that from 200 to 1,000 innings pitched, the standard deviation of ERAs ranges from 0.8 down to 0.5 as the innings increase.

I think most would agree that true talent does not reveal itself at the 200-inning mark, but then where? 500? 750? 1,000?

Most pitchers never reach 1,000 career innings; many do not reach 500. It also takes most pitchers at least three seasons to reach 500 innings, and it seems reasonable that an individual’s talent level could change significantly over the course of those seasons.

For a moment though, let’s ignore that and look to Wyers’ original piece to see that ERA true talent seems to be revealed somewhere between 400 and 500 innings. According to Staude’s study, the standard deviation of ERA between 400 and 500 innings ranges from about 0.65 to 0.6; thus, it would make logical sense that an ERA estimator with a high correlation with future ERA and a standard deviation of around 0.6 or 0.65 would be the best true talent estimator.

Interestingly, the standard deviation that I found for FIP in this article falls right in line with that logic. FIP has a higher correlation (in small to medium samples) with future ERA than ERA does, and it has a standard deviation similar to that of “true talent” ERA. This assumption also falls in line with a trend we often see: A pitcher’s career FIP lines up fairly closely with his career ERA.

The fact that most logic would lead one to conclude that FIP is the best true-talent ERA estimator we have available fascinates me.

Why? Because the structure of FIP is in no way meant to predict future ERA.

FIP is commonly used in that fashion because it does a fairly good job of predicting future ERA, but that is not the statistic’s purpose. FIP is meant to be a **describer** of a pitcher’s performance that is scaled to look like ERA. It’s best described as what a pitcher’s ERA **should have** been. That type of description may make FIP sound similar to a true-talent evaluator, but it was never designed to describe future performance.
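For reference, FIP’s standard formula can be written out in a few lines. The constant (about 3.10, but recomputed each season) simply scales FIP so that the league-average FIP matches the league-average ERA:

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching: (13*HR + 3*(BB + HBP) - 2*K) / IP + C.

    The constant C (about 3.10, varying by season) is set so that the
    league-average FIP equals the league-average ERA.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# A pitcher with 20 HR, 50 BB, 5 HBP and 200 K over 200 IP:
print(fip(20, 50, 5, 200, 200))  # 3.225
```

Note that nothing in the formula is regressed: home runs, walks and strikeouts enter at face value, which is exactly why FIP describes what happened rather than predicting what will.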

This idea brought about the birth of pFIP.

pFIP regresses the components of FIP (strikeouts, walks, home runs) to predict future performance rather than describe past performance. In plain English, that idea sounds great, and interestingly, the math works out, too.

FIP’s more volatile components (walks and homers) receive a fair amount of regression, while strikeouts (the least volatile) receive little or no regression. These regressions result in a stronger correlation with next-season ERA than simply using FIP.

But is pFIP really saying anything about a pitcher’s true talent level?

I would argue that it may give one some indication of a pitcher’s talent, but it is not a true-talent evaluator. If you look at the pFIP equation, the standard deviation of pFIP or an individual’s numerical pFIP, what the statistic actually does becomes very clear.

Essentially, pFIP starts each individual’s ERA projection at the same point (the mean ERA) and then moves each number slightly away from the mean based on the player’s individual peripherals. This strategy works great when your goal is to predict with the highest rate of success, but it does not give you a great idea of a pitcher’s actual true ERA or skill.
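That behavior has a simple mathematical consequence: shrinking every pitcher toward the mean by a fixed factor shrinks the standard deviation of the projections by exactly that factor. A minimal sketch, with hypothetical numbers (the shrinkage weight here is invented for illustration):

```python
from statistics import pstdev

league_mean = 4.00
observed = [2.50, 3.20, 4.00, 4.80, 5.50]  # hypothetical single-season ERAs

# Start every pitcher at the league mean, then move each projection only a
# fraction k of the way out toward his observed number.
k = 0.4  # hypothetical shrinkage weight
projected = [league_mean + k * (x - league_mean) for x in observed]

# The spread of the projections is exactly k times the observed spread,
# which is why a heavily regressed estimator shows a small standard deviation.
```

So a standard deviation of 0.387 is not a mystery; it is the direct arithmetic footprint of heavy regression.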

Thus, when one evaluates pFIP as a statistic, one must return to the original question: What is the purpose of an ERA estimator?

pFIP is essentially useless if you’d like to evaluate a pitcher’s talent level, but if your goal is to predict next season’s ERA, pFIP will serve you well. However, if predicting future ERA is pFIP’s only real purpose, is there any real reason for the statistic?

Projection systems are a very real thing, and their goal (at least from what I understand) is to do exactly what pFIP does: project future performance. I’ve shown before that pFIP is fairly comparable to projection systems when looking at overall correlation with next-season runs (or ERA). Although I’m fairly certain that simple correlation with the next season is **not** the best way to test how well a projection system works, let’s say for a moment that it is.

Is pFIP really better or equivalent to a projection system?

The short answer is quite obviously no, but the evidence behind that assessment is fairly educational.

I’m not saying this is true, but consider a fantasy scenario where pFIP has exactly the same correlation with future ERA as an average projection system. How would we test which one was actually doing a better job? A good starting point would be to consider the standard deviations of the projected ERAs.

I looked at the standard deviations of ERA projections for three projection systems (Marcel, Bill James and ZIPS) for the years 2010-2012 for pitchers who were projected to have at least 100 innings in that season:

Projection System | STDEV |
---|---|
Bill James | 0.514 |
Marcel | 0.520 |
ZIPS* | 0.657 |

*ZIPS does not project playing time, so Marcel’s playing time projections were used for the pitchers in the sample.

Under the assumption that pFIP and the projection systems have similar or equivalent correlations, it would seem that projection systems do a better job at really projecting future performance/skill as their distributions are wider.
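One way to see why spread matters even when correlation is held fixed: a projection that is just a linear rescaling of another has exactly the same correlation with actual results, yet says much less about the distance between good and bad pitchers. A sketch with hypothetical values:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

actual = [2.90, 3.40, 3.70, 4.30, 4.70]  # hypothetical next-season ERAs
wide = [2.80, 3.30, 3.90, 4.20, 4.80]    # hypothetical projection, full spread
# Halve every projection's distance from the mean: same ordering, half the spread.
narrow = [3.80 + 0.5 * (p - 3.80) for p in wide]

# Both versions correlate identically with the actual ERAs, but only the wide
# one preserves the distance between good and bad pitchers.
```

This is the heart of Wyers’ objection: correlation alone cannot distinguish the two, so the spread has to be judged on its own terms.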

This should not really be too surprising, as projection systems take a great deal more information into account than pFIP does. Projection systems are also more useful in that they project playing time and counting stats as well as rate stats (like ERA, FIP, etc.).

This all brings me back again to my original question: What is the point of an ERA estimator? Or to be more clear, if we have projection systems, then what is the point of ERA estimators?

I can think of only two answers to that question.

The first is that some people don’t trust or find utility in projection systems and thus find sanctuary in using much simpler ERA estimators, which are still fairly predictive and easier to understand. The second is that ERA estimators should be a reflection of a pitcher’s true talent level, which, of course, is almost impossible to define.

**References & Resources**

All statistics come courtesy of FanGraphs.

**Comments**

Well done, Glenn.

I wrestled with the question of the point of an ERA estimator too. My conclusion was that it should be to estimate true talent level… but I also think that good prediction of future ERAs is a major sign that you have a grasp on their true talent level. There’s a lot of noise there, though, so it can’t be the sole basis.

I think over a career, an estimator should also line up well with career ERA, as with enough IP, career ERA becomes an even better indicator of true talent level (as you say).

That was the logic(?) behind the BERA ERA estimator in my article that you linked, anyway—I regressed against a combination of next-year and career ERA (y0 BERA vs. y1 ERA, and career BERA vs. career ERA).

But you can’t get a good feel for true talent level based off of one season, really. So a projection system that takes multiple years of data into account is going to be better than an ERA estimator that uses only 1. But an ERA estimator that uses multiple years… that might not be too far off from a projection system.

Any thoughts?

Thanks for the response, Steve. I’ve toyed with the creation of an ERA estimator that takes multiple years into account. The only issue with it though, is that it came out very similar (huge regression) to pFIP, because I was regressing the numbers upon one season. I think in order to create a true talent ERA estimator, one would have to regress the components upon 400 innings (or so) of data. This would result in far less regression and possibly an idea of how to project true talent ERA. I imagine this is similar to what projection systems do though.

Thanks, this article comes along at just the right moment. I’ve been thinking of some of the aspects of what you mention in the article as I shirk my own personal responsibilities. The origin of this has to do with moves made by the Angels this offseason. My original preference was for them to re-sign Greinke rather than sign Hamilton, and I viewed the contracts as fairly equal at $25 million per year. However, being open to different strategies, I’m wondering if the moves made hadn’t increased the possible number of wins vs. signing the new Dodger. In other words, I read an article that asks: if Blanton gives up 4.5 runs a game but does exactly that each and every game, is that better than a lower-ERA pitcher who is more volatile in outcomes? The article didn’t answer that question, but it made me curious. Can’t say I’ve figured out the solution yet.

@MGL

Thanks for dropping by. There’s a lot here to digest, but I’ll do my best to get to all of your main points.

1.) Anytime I mention correlation or correlation with future ERA, I’m referring to next-season ERA; I apologize for not being clearer. I also agree that the best predictor of future performance and true talent level are, for all intents and purposes, the same thing. However, as you note with defenses, park factors, etc., I’m not positive that a metric or projection system meant to project one season of data can really reflect true talent level, as it takes more than one season for a pitcher’s true talent to reveal itself.

2.) My main idea with looking for the standard deviation of ERA true talent was to find a rough estimate. While I probably did not go about finding that rough estimate in the best fashion, I’m glad I came close to the number you’ve found.

3.) It’s hard for me to say whether I think pFIP is a “statistic” or not. The only difference between pFIP and FIP is that pFIP regresses K, BB, and HR against future runs rather than runs within the season. The problem with this regression, though, as you point out, is that walks and home runs will be regressed a good deal in the equation; thus, two of the three components have non-zero, less-than-100 percent regressions.

4.) Your last point is really the question that I was trying to bring up, as I don’t really have an answer to it.

What is the tradeoff between correlation and spread in a projection?

I think your explanation of how you decide whether you regress enough when making your projections makes a great deal of sense and is the type of interesting/educational response that I was hoping to provoke.

Anyways, thanks again for the in-depth response, MGL.

I don’t understand why these stats can’t be easily figured and used by coaches. I coach high school baseball and use these stats during the season as a predictor of future performance for that season and the next. This would be a great stat, but I majored in mass media management and education…

MGL, you’re a master. I’d love to hear your thoughts on my FanGraphs Community Research article that Glenn talks about here.

I came to the same conclusion you did regarding the ideal standard deviation in an ERA estimator—about 0.5 earned runs. I said that 0.5 implied that about 2.2% of pitchers have a true ERA under ~3.1, which sounds about right to me (maybe the SD could be a tiny bit higher, though).

Glenn—yeah, you can’t completely rely on single season data for either the basis or the comparison for a formula, I think—there’s just as much noise in the season you’re trying to match up with as the one you’re forecasting from. I think also taking the big picture data into account can help weed out some of the nonsense.

Glenn, nice job bringing up and discussing some interesting questions!

Here are some comments I have regarding these issues/questions:

1) Please, please do not use the term “correlation” without explaining or mentioning what you are correlating with what!

For example, “In almost every test that I ran, pFIP came out ahead (higher correlation) of other more established ERA estimators.”

I assume that you mean “higher correlation” with next year’s ERA, but it is by no means obvious. It could mean correlation with the same year’s ERA or next year’s pFIP (or FIP or whatever metric you are discussing). You do the same thing a little later. Later again, you do mention “correlation with next year’s ERA” but it is again not obvious that is what you are referring to earlier.

“To be the best at predicting (highest correlation with) future ERA.

To be the best representation of a pitcher’s true talent level.”

Those are exactly the same things. Except they are not. But they are.

For all practical purposes, when we do a projection, we are trying to estimate a player’s true talent. So in that sense, those two things are indeed the same thing.

The ONLY reason why they are not the same thing is that future ERA includes park factors, defense, change in opponent talent, aging (and perhaps a change in true talent due to injury, a learning curve, etc.), etc.

But, again, for all intents and purposes, they are the same thing. A lot of what you discuss suggests that they are quite different. I reject that notion, unless you are talking about the aforementioned park effects, defense, etc.

2) The idea of the spread in true talent of pitchers is a tricky one. First of all, you need to be more specific. Of course you are referring to major league pitchers; however, you need to specify whether you are including fringe pitchers – ones who pitch a few innings here and there and it is determined that they are not major league caliber. Are you including them in your sample/definition? How are you weighting them? Are you assuming that the best and worst pitchers in true talent represent the tail ends of a 3 SD normal curve?

Looking at empirical data is probably not the best way to estimate the spread in pitching talent in MLB. If you look at all pitchers who threw at least 500 innings in their careers, you have two problems. One, 500 innings still includes lots of random fluctuation, which will increase your variance of talent (remember that variance in any finite group is luck variance + talent variance, and in measuring observed variance, you are measuring the sum of the two). Two, even with a minimum of 500 innings, you are eliminating those fringe pitchers who probably have a worse true talent level than any of the pitchers in your sample.

You can mitigate the first problem by looking at 1000 inning career min pitchers, but then you exacerbate problem two and introduce another problem, which is that talent changes over a career (the longer a career, the more it changes) so your spread is going to be larger than it would be if each pitcher’s career was one static true talent. For example, almost no pitcher over 1000 IP is going to have an ERA 1.5 runs or so less than league average even though that is a legitimate talent level (for a few peak years at least).

Honestly, the best thing to do is a rough estimate. For example, my experience in doing projections for 20-some-odd years and in closely watching and analyzing baseball for almost 30 years is that true talent is right around league average plus or minus 1.5 runs, which happens to be an SD of true talent of around .5 runs! So you did actually get around the right answer – sort of by semi-accident, though!

3) A pitching (or batting, or defensive, etc.) stat which regresses components is not really a stat. I mean, come on! That is how we do projections! We take a player’s multi-year stats and regress the components (we can use one year as well). That is not a “stat” per se, is it? Can I go even further and age adjust my stat and call THAT a stat? Even though FIP is in essence a “regressed” stat, at least it regresses some things 100% (BABIP) and other things nothing (HR, K, and BB), thus we can still call it a “stat”. But to take a stat and then apply different (not zero and not 100%) regressions to different components and then call that a “stat?” And then to tout your stat as better than all other stats? Nah! Of course your stat is going to be better! It is a projection algorithm! Can ZIPS and Pecota and Oliver call their projections a “stat?” If they did, it would be the best stat out there, at least as far as prediction (projection) purposes!

TBC…

4) Finally, the whole idea of the spread of your stat or projection versus its accuracy in terms of correlation or RMSE is a very tricky issue. It is true that you want as wide a variance as possible in your projection or projection stat while at the same time wanting as high a correlation or as low an RMSE as possible, but by no means is it clear what kind of balance is best. Perhaps that was your point; I am not sure.

Sure, if your stat has too small a variance, it is not a good one (the ultimate bad stat that might have a decent correlation with what you are trying to predict is a league-average one), even if the correlation or RMSE is good. Similarly, if the variance is large but the correlation is very bad, then the stat is probably worthless. And yes, you never want a stat or projection where the variance is higher than the variance of the true talent of the population, especially if all the players in your projections are projected for some major league playing time (if you include lots of minor league players who have little chance of making it in the majors, then the variance of your projections might legitimately be larger than the estimated variance of true talent in the population).

For example, if my projected major league pitchers have a variance of .75 runs and it is a reasonably normal distribution, then I am essentially saying that there are some pitchers with true-talent ERAs of 1.25 and 5.75 (in a league where the mean ERA is 3.50), which we know is not true (remember that a valid projection is NOT supposed to project any specific player to have a good or bad performance that is luck-based – it is only supposed to represent the spread of expected true talent).

So, for example, if the ZIPS pitcher projections truly represent players that are expected to pitch at least 50 IP in the majors, then his variance is definitely too high. A simple solution to that is to adjust his regressions (however he does them) to make them more aggressive.

In fact, each year, I look at all my projections after I break them into groups. For example, I will break my pitchers into five groups: very low ERA, low ERA, average, high and very high. Say the average in each group is 2.5, 3, 3.5, 4 and 4.5. If the actual ERA in each group was 2.7, 3.3, 3.5, 3.8 and 4.3, then I am not regressing enough, and I make adjustments to my model. If the actuals were 2.3, 2.8, 3.5, 4.2 and 4.7, then I am likely regressing too much (which is a good problem to have!). Or I can simply look at the SD of my projections. If it is larger than the SD of true talent (estimated as .5 or so), then I am not regressing enough. If it is smaller, then I am regressing too much or my model is not that great. If the SD is only a little too small, then I will wait and see how my groups do.

Anyway, that is all I have to say for now. This is a difficult subject to discuss on many levels. As I said, you did a pretty good job bringing up some interesting areas of discussion and research…