# Evaluating The 2014 Projection Systems

The 2014 season is in the books. The San Francisco Giants once again reign as the World Series Champions. Most baseball people are looking toward the offseason and 2015. Projections are key to our understanding of both the offseason and upcoming 2015 season. There are a lot of systems to choose from, and if you’re like me you have used all at one point or another, often interchangeably. We should, however, be sure we have a good understanding of each system and how they actually work and perform. I will look back at 2014 and evaluate which projection system can be crowned the 2014 champion. First, a little background on each of the projection systems I examined.

### Background

Most projection systems work largely the same way, with only minor variations. For instance, most use three to four seasons of data to calculate their forecast. However, there are nuances to each that make them unique. FanGraphs does an excellent job explaining the differences among the projection systems. I will briefly summarize.

#### Marcel

Created by Tom Tango, Marcel is the simplest among the projection systems. Marcel uses only major league data, giving heavier weights to more recent seasons. It takes into account age and regression towards the mean. Marcel does not project players with no major league experience. Marcel gives explicit instructions to assign league average projections to unprojected players. Because of this, Marcel projects far fewer players than the other systems.

#### ZiPS

Created by Dan Szymborski, ZiPS uses weighted averages of the previous seasons. It takes into account batting average on balls in play when regressing player performance. It adjusts for age by finding historical player comparisons.

#### Oliver

Created by Brian Cartwright at The Hardball Times, Oliver also uses weighted averages to project players. Oliver differs in that it calculates major league equivalencies by taking in the raw numbers and adjusting based on park and league.

#### PECOTA

PECOTA is the property of Baseball Prospectus and was developed by Nate Silver. Rather than using weighted averages, it uses a system of historical player comparisons to calculate its projections.

#### Steamer

Created by Jared Cross, Dash Davidson and Peter Rosenbloom, Steamer also uses a system of weighted averages. Steamer differs in that it weights different components differently and regresses some more heavily than others. Steamer does not explicitly take aging into account.

### Methodology

#### Data Sources

It is worth mentioning where these data came from. I downloaded the 2014 actual data via FanGraphs and removed all players who pitched that season, apologies to Adam Dunn. ZiPS and Oliver both came to me via Will Larson and the Baseball Projection Project. Due to rounding issues with third party data sources, Jared Cross himself provided me with the Steamer projections. Marcel forecasts, no longer produced by Tom Tango, are unofficially maintained by David Pinto who makes them publicly available. PECOTA was simply downloaded from Baseball Prospectus.

#### What Metric?

The first step was to find a common metric to look at. Given the common statistics projected by all systems considered, wOBA seemed like the obvious choice. Some projections were kind enough to include sac flies, but most did not, leaving us with just walks, hit by pitch, singles, doubles, triples, home runs, and plate appearances. Using the 2014 wOBA coefficients, I arrived at this simplified formula:

*(BB(.69) + HBP(.72) + S(.89) + D(1.28) + T(1.64) + HR(2.14))/(PA)*
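As a rough illustration, the simplified formula can be written as a small function. This is a minimal sketch: the weights are the rounded values from the formula above, while the dictionary keys and the sample stat line are made up for the example.

```python
# Rounded 2014 linear weights, copied from the simplified formula above.
WOBA_WEIGHTS = {"BB": 0.69, "HBP": 0.72, "1B": 0.89, "2B": 1.28, "3B": 1.64, "HR": 2.14}


def simple_woba(line):
    """Simplified wOBA: weighted sum of batting events divided by plate appearances."""
    if line["PA"] <= 0:
        raise ValueError("PA must be positive")
    total = sum(weight * line.get(event, 0) for event, weight in WOBA_WEIGHTS.items())
    return total / line["PA"]


# Hypothetical stat line, not a real player.
example = {"PA": 600, "BB": 60, "HBP": 5, "1B": 100, "2B": 30, "3B": 3, "HR": 20}
print(round(simple_woba(example), 3))
```

Because sac flies are left out of both the numerator and the denominator, the same function can be applied to every system and to the actual 2014 lines.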

#### Merges and Missing Players

Next I had to assign unique identification to all players. This is always an arduous task, ahem Chris Young, but I was able to match most players. There was a distinction to be made between players who simply weren’t projected and players I failed to correctly match to the actual 2014 data. The percentage of total plate appearances that were not matched was pretty small. Players who were not projected or not matched were given a wOBA projection 20 points below league average. This is close to the actual mean for that subgroup of players. For Marcel, unprojected or unmatched players were instead given a projection of league average performance. This table summarizes the results of the merges.

Projection Systems, Merges and Misses, 2014

| System | Players Unprojected | PA per player | PA Unprojected | PA total | Share | Projection given |
|---|---|---|---|---|---|---|
| Actual | 0 | 0 | 0 | 0 | 0.00% | N/A |
| ZiPS | 34 | 78 | 2,645 | 177,967 | 1.49% | Mean-.020 |
| Steamer | 27 | 76 | 2,042 | 177,967 | 1.15% | Mean-.020 |
| Oliver | 20 | 78 | 1,562 | 177,967 | 0.88% | Mean-.020 |
| Marcel | 152 | 137 | 20,830 | 177,967 | 11.70% | Mean |
| PECOTA | 62 | 74 | 4,602 | 177,967 | 2.59% | Mean-.020 |

The fact that more players were missing from PECOTA could be my fault for not matching well enough; it could also be that these players just weren’t projected by PECOTA. Take it for what you will, but this is a potential source of bias. The overall portion of plate appearances not matched was so small that whatever we projected these players at hardly affected the results.
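The fill-in step for unprojected or unmatched players can be sketched as follows. This is an illustration under assumptions, not the code used for the article: the function name, the player ids, and the sample numbers are hypothetical, and the only details taken from the text are the 20-point penalty and the share of plate appearances left unprojected.

```python
def fill_missing(actual_pa, projected, fill_value):
    """Give every player who batted a projection and report the unprojected PA share.

    actual_pa:  dict of player id -> actual 2014 plate appearances
    projected:  dict of player id -> projected wOBA (some players may be absent)
    fill_value: wOBA assigned to players with no matched projection
    """
    filled = {}
    missing_pa = 0
    for pid, pa in actual_pa.items():
        if pid in projected:
            filled[pid] = projected[pid]
        else:
            filled[pid] = fill_value
            missing_pa += pa
    share = missing_pa / sum(actual_pa.values())
    return filled, share


# Hypothetical three-player league; player "c" has no projection.
actual_pa = {"a": 600, "b": 350, "c": 50}
projected = {"a": 0.350, "b": 0.310}
league_mean = 0.315
filled, share = fill_missing(actual_pa, projected, league_mean - 0.020)
print(filled["c"], share)
```

For Marcel, the same function would be called with `fill_value=league_mean`, following its instruction to treat unprojected players as league average.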

#### Common Baseline

Correctly merging my data sets was the bulk of the work, yet there was still more to be done before we could have fun with it. To compare systems we need to adjust them to a common mean, because we only care how a player was projected relative to that projection’s league average. If Mike Trout was projected to hit .425 in a league with a projected .340 mean, we want to count this as the same as if he were projected to hit .400 in a league with a projected .315 mean. In 2014, disregarding pitchers, the properly weighted league wOBA was .315. To do this, I first calculated each system’s mean over the players who actually played in 2014, weighting by their 2014 plate appearances. Then I filled in the missing players with a projection 20 points below that projected weighted mean. Finally, the mean was recalculated and scaled to .315.
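One way to read the adjustment is as a constant shift, which is what the Trout example implies (.425 against a .340 mean counts the same as .400 against a .315 mean). The sketch below assumes that additive reading; if "scaled" was meant multiplicatively, the shift would be replaced by a ratio. Function names and sample data are hypothetical.

```python
def weighted_mean(values, weights):
    """Plate-appearance-weighted mean of a dict of wOBA values."""
    total_weight = sum(weights[pid] for pid in values)
    return sum(values[pid] * weights[pid] for pid in values) / total_weight


def recenter(projected, actual_pa, target=0.315):
    """Shift every projection by one constant so the PA-weighted mean equals target."""
    shift = target - weighted_mean(projected, actual_pa)
    return {pid: woba + shift for pid, woba in projected.items()}


# Hypothetical projections from a system whose league mean runs high.
projected = {"trout": 0.425, "b": 0.330, "c": 0.310}
actual_pa = {"trout": 700, "b": 500, "c": 300}
adjusted = recenter(projected, actual_pa)
print(round(weighted_mean(adjusted, actual_pa), 4))
```

A constant shift leaves the gaps between players untouched, so the spread of each system is preserved and only its baseline moves.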

### Results

And now the part you’ve been waiting for, the results. This part was simple enough; I calculated the mean absolute error, weighted by plate appearances, for each of the five systems. This is how they fared projecting the league population.

Overall

| System | Mean absolute error |
|---|---|
| Actual | 0.0000 |
| ZiPS | 0.0274 |
| Steamer | 0.0277 |
| PECOTA | 0.0279 |
| Oliver | 0.0280 |
| Marcel | 0.0289 |

The projection systems all did pretty well and, as usual, are relatively close together. ZiPS takes home the crown with the lowest mean absolute error. While Marcel comes in last, the result demonstrates Marcel’s original intended purpose; it shows us that a simple projection system will get us most of the way there in the aggregate.
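For readers who want to reproduce the metric, a plate-appearance-weighted mean absolute error can be sketched like this. The function name and the two-player sample are hypothetical; the weighting by actual 2014 plate appearances follows the description above.

```python
def weighted_mae(projected, actual, pa):
    """Mean absolute error between projected and actual wOBA, weighted by PA."""
    total_pa = sum(pa[pid] for pid in actual)
    weighted_error = sum(abs(projected[pid] - actual[pid]) * pa[pid] for pid in actual)
    return weighted_error / total_pa


# Hypothetical two-player sample.
actual = {"a": 0.340, "b": 0.300}
projected = {"a": 0.320, "b": 0.310}
pa = {"a": 600, "b": 200}
print(round(weighted_mae(projected, actual, pa), 4))
```

Weighting by plate appearances means a 20-point miss on an everyday player counts far more than the same miss on a September call-up.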

What is more interesting, however, is how the models performed on different subsets of players. I have split the players into groups based on experience and age, as well as the binary identifiers breakout and breakdown.

#### Experience

At the heart of it, what to do with past performance is the question all projection systems are trying to solve. Thus it is natural to group based on career playing time. I bunched players into three categories: rookies (0-300 plate appearances), middlers (300-1,800 PA), and veterans (1,800+ PA).
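The grouping itself is a simple threshold rule. In this sketch the handling of the boundary values (exactly 300 or 1,800 plate appearances) is an assumption, since the ranges above overlap at their endpoints, and career plate appearances are assumed to be counted before the 2014 season.

```python
def experience_group(career_pa):
    """Bucket a hitter by career plate appearances (assumed: entering 2014)."""
    if career_pa < 0:
        raise ValueError("career_pa cannot be negative")
    if career_pa < 300:
        return "rookie"
    if career_pa < 1800:  # assumption: exactly 300 counts as a middler
        return "middler"
    return "veteran"  # assumption: exactly 1,800 counts as a veteran


for pa in (0, 299, 300, 1799, 1800, 5000):
    print(pa, experience_group(pa))
```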

Rookies (n=126)

| System | Mean absolute error | Mean wOBA |
|---|---|---|
| Actual | 0.0000 | 0.2909 |
| Steamer | 0.0282 | 0.2977 |
| ZiPS | 0.0290 | 0.2973 |
| Oliver | 0.0293 | 0.2973 |
| PECOTA | 0.0303 | 0.2973 |
| Marcel | 0.0346 | 0.3069 |

The rookies are interesting because for the most part these projections were going off minor league data, so we want to see who best translated minor league performance to major league performance. Steamer was able to break away from the pack here and projected these unknown quantities quite impressively. On the whole, Steamer was not far off its overall performance. Meanwhile, due to the fact that it does not take into account minor league data at all, Marcel unsurprisingly over-projected this group of players and performed the worst.

Middlers (n=217)

| System | Mean absolute error | Mean wOBA |
|---|---|---|
| Actual | 0.0000 | 0.3127 |
| PECOTA | 0.0273 | 0.3092 |
| ZiPS | 0.0278 | 0.3080 |
| Oliver | 0.0282 | 0.3107 |
| Steamer | 0.0285 | 0.3094 |
| Marcel | 0.0296 | 0.3102 |

The middling players are those with some major league experience, but not the full range that most of the projections like to use. PECOTA did well here, perhaps because history is a better indicator than the other systems’ algorithms on such a small sample of major league data. Again, without the full three seasons of data available, Marcel lagged behind the pack.

Veterans (n=296)

| System | Mean absolute error | Mean wOBA |
|---|---|---|
| Actual | 0.0000 | 0.3221 |
| ZiPS | 0.0269 | 0.3239 |
| Steamer | 0.0271 | 0.3230 |
| Marcel | 0.0271 | 0.3203 |
| Oliver | 0.0276 | 0.3221 |
| PECOTA | 0.0278 | 0.3231 |

The veterans have over 1,800 career plate appearances and for the most part have played the full three seasons to be used by the projections. This is where Marcel got to shine. Marcel slips right into the middle of the pack here, tied with Steamer. Notice that the best, ZiPS, and the worst, PECOTA, are now only .0009 away from each other, compared to a best-to-worst gap of .0064 with the rookies and .0023 with the middlers. When players rack up larger sample sizes, we can project them much more accurately.

No one system consistently beats ZiPS, our overall winner, across the brackets: Steamer edged it among rookies and PECOTA among middlers, but ZiPS finished first or second in every group. When it comes to experienced players you can’t go wrong, so use whichever system is available to you. When it comes to the rookies, steer clear of Marcel and opt for Steamer or ZiPS if you can.

#### Age

Age is another interesting subgroup to look at because different projections handle player aging differently. PECOTA and ZiPS rely on historical comparisons while Marcel uses an age factor. Steamer does not explicitly take aging into account at all. I examined how the projections did for really young players, really old players, and each age in between.

Mean Absolute Error

| Age | Actual | PECOTA | Marcel | Oliver | Steamer | ZiPS |
|---|---|---|---|---|---|---|
| 24 and Below | 0 | 0.0297 | 0.0311 | 0.0315 | 0.0288 | 0.0294 |
| 25 | 0 | 0.0292 | 0.0317 | 0.0267 | 0.0311 | 0.0288 |
| 26 | 0 | 0.0319 | 0.0343 | 0.0323 | 0.0318 | 0.0329 |
| 27 | 0 | 0.0292 | 0.0325 | 0.029 | 0.0285 | 0.0283 |
| 28 | 0 | 0.0283 | 0.0261 | 0.0238 | 0.0281 | 0.0271 |
| 29 | 0 | 0.0309 | 0.0343 | 0.0331 | 0.0308 | 0.0324 |
| 30 and Above | 0 | 0.0252 | 0.0252 | 0.026 | 0.0249 | 0.0243 |

Mean wOBA

| Age | Actual | PECOTA | Marcel | Oliver | Steamer | ZiPS |
|---|---|---|---|---|---|---|
| 24 and Below | 0.3108 | 0.3066 | 0.3135 | 0.3133 | 0.3100 | 0.3135 |
| 25 | 0.3016 | 0.3040 | 0.2995 | 0.3030 | 0.3062 | 0.3010 |
| 26 | 0.3230 | 0.3090 | 0.3085 | 0.3096 | 0.3094 | 0.3074 |
| 27 | 0.3143 | 0.3148 | 0.3148 | 0.3164 | 0.3166 | 0.3154 |
| 28 | 0.3202 | 0.3179 | 0.3197 | 0.3187 | 0.3152 | 0.3173 |
| 29 | 0.3159 | 0.3157 | 0.3133 | 0.3092 | 0.3115 | 0.3123 |
| 30 and Above | 0.3168 | 0.3215 | 0.3194 | 0.3191 | 0.3205 | 0.3201 |

There are a few things to take away from examining the breakdown by age. The first is that projections do better on older players. Given our results above, this should come as no surprise. It appears that for all systems the challenges of dealing with player aging are more than outweighed by the advantage gained by the added data to draw on.

Among the oldest players ZiPS takes the cake, with Steamer coming in second and PECOTA and Marcel tied for third. There seems to be no direct connection between performance among these players and the specific method used to account for aging. ZiPS did the best by using historical equivalences, but Steamer came in second without explicitly looking at age at all. PECOTA, also using historical equivalences, tied exactly with Marcel, which uses a simple age factor.

Interestingly Oliver, which does better than Marcel overall, struggled with the extremes. Oliver came in dead last among the youngest and among the oldest players. This indicates that perhaps Oliver needs to revise the way it takes into account age.

Overall ZiPS does the best on older players and Steamer does the best on youngsters. However, the difference isn’t large enough for me to want to go out of my way to use two different projection systems for young and old players. I’ll stick to our overall winner and use ZiPS.

#### Breakout and Breakdown Players

Another thing we might be interested in is how each system did at predicting the extremes. I examined how each system fared in projecting breakout players. I defined a breakout player as a player whose wOBA increased by 30 points or more from 2013 to 2014. These could be young guys coming into their own or veterans coming off of an injury-plagued season.

Breakout

| System | Mean absolute error | Mean wOBA |
|---|---|---|
| Actual | 0.0000 | 0.3411 |
| PECOTA | 0.0374 | 0.3080 |
| Marcel | 0.0409 | 0.3054 |
| Oliver | 0.0413 | 0.3031 |
| Steamer | 0.0373 | 0.3081 |
| ZiPS | 0.0381 | 0.3078 |

All projections will be cautious to project (or won’t project at all) something drastically different from what a player has done in the recent past. Thus, none did a very good job predicting a breakout. This is one area where it may be better to use subjective measures.

We do, however, see the two systems that use historical equivalences, ZiPS and PECOTA, do well, although Steamer did the best without using them. With both equivalency-based systems near the top, we might start to think there is something to such systems doing better at predicting large swings in performance.

Similarly, I looked at how each system did on the opposite type of players, players who experienced breakdown seasons. A player was flagged to have a breakdown season if his 2014 wOBA had decreased 30 points or more.
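The breakout and breakdown flags can be sketched as a single function of the year-over-year change. This is an illustration only: the function name and sample values are hypothetical, it assumes a player has a wOBA in both 2013 and 2014, and it compares whole points (thousandths) to sidestep floating-point noise at exactly 30 points.

```python
def swing_flag(woba_2013, woba_2014, threshold_points=30):
    """Label a season 'breakout', 'breakdown', or None from the wOBA change in points."""
    change_points = round((woba_2014 - woba_2013) * 1000)
    if change_points >= threshold_points:
        return "breakout"
    if change_points <= -threshold_points:
        return "breakdown"
    return None


# Hypothetical players.
print(swing_flag(0.310, 0.345))  # up 35 points
print(swing_flag(0.340, 0.300))  # down 40 points
print(swing_flag(0.320, 0.330))  # up 10 points
```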

Breakdown

| System | Mean absolute error | Mean wOBA |
|---|---|---|
| Actual | 0.0000 | 0.2885 |
| PECOTA | 0.0383 | 0.3120 |
| Marcel | 0.0408 | 0.3128 |
| Oliver | 0.0397 | 0.3158 |
| Steamer | 0.0384 | 0.3122 |
| ZiPS | 0.0378 | 0.3136 |

Again we see the same three systems as the top performers, this time with ZiPS taking the number one spot, and PECOTA as the runner up. This is more evidence that when looking for big advances or declines in players, using equivalences might be the way to go.

As a whole, the systems did about as well at predicting breakdowns as breakouts (three of the five had a slightly lower error on breakdowns), which is less of an edge than I would have expected. Intuitively it makes sense that a breakdown is easier to predict than a breakout, but in reality both are challenges for algorithms that tend towards the mean. It appears that systems like ZiPS and PECOTA will do slightly better for this type of player. If you are looking at players you think are about to do something unusual, I would tend toward ZiPS or PECOTA.

### Variation

One note in defense of Oliver is that Oliver’s projections have the largest variation, which may be preferable to some when choosing a projection system. Imagine you have one system that projects all players to be league average and another that varies. If they both had the same absolute error, you would prefer the one that includes more variation; the one with no variation essentially tells you nothing (and if you adjust to league average like I did, it *does* tell you nothing). This final table summarizes the variation of the adjusted projection systems. The mean for all the systems was .3152 after being adjusted to league average; what is important here is the spread in the projections.

Adjusted Projections Summary

| System | St. Dev. |
|---|---|
| Actual | 0.0444 |
| ZiPS | 0.0285 |
| Steamer | 0.0287 |
| Oliver | 0.0317 |
| Marcel | 0.0266 |
| PECOTA | 0.0280 |
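The spread of a system can be measured with a standard deviation. The sketch below weights by plate appearances, which is an assumption; the table above does not say whether its standard deviations were weighted. The function name and sample values are hypothetical.

```python
import math


def weighted_std(values, weights):
    """PA-weighted population standard deviation of a dict of wOBA values."""
    total = sum(weights[pid] for pid in values)
    mean = sum(values[pid] * weights[pid] for pid in values) / total
    variance = sum(weights[pid] * (values[pid] - mean) ** 2 for pid in values) / total
    return math.sqrt(variance)


pa = {"a": 500, "b": 500}
flat_system = {"a": 0.315, "b": 0.315}    # projects everyone at league average
varied_system = {"a": 0.295, "b": 0.335}  # same mean, but it separates the players
print(weighted_std(flat_system, pa), round(weighted_std(varied_system, pa), 4))
```

The flat system has zero spread, which is the sense in which a league-average projection tells you nothing about individual players.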

### Conclusions

Overall, projection systems may be improving as their creators revise their algorithms. In Tango’s study on 2007-2010 projections Marcel was right in the mix with the others, but four years later we see some more separation. No matter how you slice it, all these projections do a fine job and the differences between them are subtle, but not negligible. We saw that among inexperienced players Marcel struggles and should be avoided. For players who have racked up lots of at-bats and years, it’s hard to go wrong, but ZiPS performed the best. ZiPS was also the best on breakdowns and finished close behind Steamer and PECOTA on breakouts. This gives some credence to the idea that historical equivalences are especially useful for predicting players who are about to break from a normal trajectory.

Given all of the above we can say that in 2014 ZiPS did the best job. It performed the best overall and finished first or second in most of our subsets, slipping into the bottom half only once (among 26-year-olds). You can’t go wrong with any of these systems, but as we look ahead to 2015, I’ll be using ZiPS.

### References and Resources

- ZiPS and Oliver provided by the Baseball Projection Project.
- PECOTA provided by Baseball Prospectus.
- Marcel provided by David Pinto.
- Steamer provided by Jared Cross.
- The general structure of my evaluation was based on a 2010 study on projections by Tom Tango.

Do you know how the average of the projections held up?

Good question. I just calculated the average and re-ran the numbers. Here are the overall results:

Actual 0.0000

ZIPS 0.0274

Steamer 0.0277

PECOTA 0.0279

Oliver 0.0280

Marcel 0.0289

Average 0.2702

When I calculated the average I only averaged players who were projected by all systems. Players missing in at least one system were given the 20 points below league average projection.

That’s 0.02702 for the average, right? That’s another data point in favor of averaging projections.

Dan,

Nice work. I have a concordance table of player IDs for Retrosheet, BRef, MLB, BP, Davenport and DMB, if that would help with your matching up players.

-Rob

You should have included Clay Davenport projections (www.claydavenport.com). Over at Tango’s site, his 2014 projections were the best at projecting the Team Standings.

Quick note on Steamer. I wrote Steamer does not directly take age into account, but that is no longer the case. The current version does!

Yeah, I can’t remember whether we always (in the very first year, 2007, I think) had an aging curve but that was an early addition so we’ve had that for a while since at least, I think, 2008.

Scratch that. I have my years wrong! In 2009 we didn’t use an aging curve and started to in 2010.

Is there a reason to only adjust the mean and not for st dev at the start? It seems like you’re halfway to comparing z-scores, and by only using the mean the data gets misrepresented.

That would be my question as well…

Thanks for the comment. There have been good studies done using z-scores in prior years, but I decided not to go down that route. I didn’t want to use z-scores mainly to maintain interpretability. Also, the variation within each projection is important, but their baseline mean is not. It’s a matter of preference and I chose to keep things in terms of wOBA.

Are pitcher results coming in a future piece?

That would require a whole new wave of merges to deal with, so nothing is on the docket, but perhaps down the road.

Would have been interesting to see how Bill James did, since he’s the main guy for the Sox.

Nice work. Really enjoyed this. Have you thought about comparing the projections for metrics other than wOBA — such as K%, BB%, ISO, etc? Might be interesting to see if one system outperforms the others for any of these specific stats.

Thanks Chris! wOBA is my favorite and captures most of the common elements projected by all the systems I looked at. If I find time this evening I can run it for walk rate and see what comes up. I imagine looking at ISO won’t be all that much more insightful than looking at wOBA.

Thanks a ton for doing all of the legwork–very compelling!

Interested in why you chose mean absolute error as the metric of choice for evaluating the systems. Was there any thought to using RMSE or R-squared?

Thanks Joshua, I chose the absolute difference mainly for the interpretability. I just ran the regression of each projection against the actual (weighting by actual plate appearances). Here are the R-squared values.

Actual 1.000

ZIPS 0.330

Steamer 0.316

PECOTA 0.304

Oliver 0.307

Marcel 0.240

Average 0.318

Strikes me that the headline for this ought to read: “Best projection system explains one-third of observed variance.” I don’t see the difference in R-squared between three systems (.33 to .30) being particularly meaningful.

Very interesting. I’m actually surprised at how low, relatively speaking, the R-squared values are.

Yes it is disappointing, but not surprising. There will always be a lot of random variation and previous work has already shown that most projection systems are all pretty similar. Any differences are slight.

Great article. An interesting followup for 2015 might be to examine how the various projection systems fared in regressing 2014 breakouts compared with those players’ actual 2015 performance.

What about the (non)projection system of predicting everyone to be league average?

Then, you could use the league average projection as ‘replacement level’ and then create a Value Over Replacement Projection metric 😉

For those who think you should adjust for the spread as well: that’s wrong, you can’t.

Imagine that you have one system that has a spread of wOBA forecasts of .280 to .380 and another has a spread that is twice as wide, .230 to .430.

So, if you align them the same, you’d compress one and/or expand the other.

Indeed, imagine you have a third one that has a range of .320 to .340.

You CANNOT adjust the spread, which is tantamount to adjusting the slope, which is also why you CANNOT do r-squared.

RMSE, yes. Correlation, no. The slope cannot change.

What do you mean you “cannot do R-squared”? It’s a regression; R-squared is the standard metric for determining the proportion of variance explained by a model.

I’ve read your comment on your blog Tom. I am completely unconvinced.

You posit two models, both with a R-squared of 1.0, but one with a functional form of the actual result to the predicted value of y=x and the other with a functional form of the actual result to the predicted value of y=f(x). It is true that the first model is superior to the second, but this is PURELY because it is more parsimonious.

Imagine, however, that the first model, the simple one, instead had an R-squared of 0.6. Good, but with a great deal of unmeasured variance. Even though its RMSE might be lower than RMSE of the initial output of the second model, the output of the second model transformed through y=f(x) will still give a superior prediction (because R-squared equals 1.0).

A regression model is attempting to explain variance. R-squared is the appropriate statistic to determine how well this is accomplished.

Tangotiger – right. I would think right approach is taking the actual vs projected differentials and RMSE the “line of consistency/model” within each projection system. Lowest the value, most consistent, best job

There were 20 MLB hitters that Oliver did not have a projection for? I’d appreciate if you could send me a list.

Nice work Dan.

When computing the average of all the systems you should not substitute 20 less than mean for any player who is not projected by ALL the systems. That really waters down the result (depending upon how many players that is of course).

You should only do that for players who are not found on ANY system.

I wonder how just using 2013 data compares?

Maybe I’m just speaking as a fantasy player here but it seems like this stuff always comes in a batch stat–wOBA here. If I want to know which system does the best at, say, average or steals wOBA tells me little, and those are areas where some systems probably take less care.

R and RBI are important for Fantasy, and these forecasting systems aren’t necessarily designed for that.

Indeed, these systems aren’t even necessarily designed to forecast plate appearances or IP.

Nice work. Very interesting.

I know what you’re talking about in the difficulty of combining these things.

I maintain a “Player ID Map” that helps make the combining of players easier. The map gives the IDs of each player from most of the major systems. For example, I keep Fangraphs, BP, and others. Or you might also check out the Chadwick register or Crunchtime Baseball.

We all maintain slightly different info and update at different frequencies.

PECOTA projects a large number of players who are not included on the spreadsheet. And feel free to contact me anytime for help with mapping issues.