# What is WAR good for?

The quick answer to the title’s question is, of course, *not* absolutely nothing. What wins above replacement, more commonly known as WAR, is actually good for is a more complicated and involved question.

This past weekend, I had the honor and pleasure to present a paper at the 2012 Saber Seminar, a charity baseball conference in Boston that raises money to benefit the Jimmy Fund. I’ve cut down that paper a good deal and made some modifications to it based on the feedback I received at the conference. Essentially, I converted the paper into a form that would work well as an article at The Hardball Times.

Almost three years ago, the managing editor of FanGraphs, Dave Cameron, wrote a post entitled, “WAR: It Works,” that showed the correlation between 2009 projected wins based on WAR and actual wins. The correlation was high (r=.83), and Cameron used that finding to back this interesting statement:

WAR isn’t perfect. But given the known limitations and the variations in how contextual situations impact final record, it does an awfully impressive job of projecting wins and losses.

Keep Cameron’s quote in mind; I’ll get back to it.

I think it is really exciting that WAR has begun to get some mainstream attention. Baseball-Reference’s WAR is found directly on ESPN’s main statistical leaderboards. WAR has even been featured on SportsCenter.

It might just be an inherent quality of the sabermetric community’s “cult” mentality, but in my opinion there has been an inverse relationship between WAR’s mainstream acceptance and the weight that the statistic holds among sabermetricians. When Cameron wrote the post I referred to earlier, he noted that WAR faced detractors within the sabermetric community, even at that time:

(WAR) faces a decent amount of skepticism from people who don’t trust various components for a variety of reasons – they don’t like the numbers that UZR spits out for defense, they don’t believe in replacement level, or they believe that pitchers do have control over their BABIP rates.

WAR has been gaining acceptance, but some of the internet’s best sabermetric minds are distancing themselves from the statistic more than ever, especially as a single-season metric.

The goal of my paper for the Saber Seminar was to evaluate the ability of WAR to describe performance in a given season, as well as to predict future performances in a subsequent season.

The first analysis was the same as Cameron’s original study, with a few modifications. I expanded the sample and used Baseball-Reference’s WAR instead of FanGraphs’ version.

First, here’s a quick reference to the differences between B-R’s WAR and FG’s. Secondly, I modified Cameron’s sample to include five randomly selected teams per season (80 teams in total) during the full-season Wild Card era, 1996-2011 (1995 was a slightly shortened season). I then took the cumulative WAR for each of those teams and regressed it against their actual win total.

Baseball-Reference’s version of WAR uses a baseline winning percentage of .320; thus, a team with zero WAR, or an entirely replacement-level team, would expect to win roughly 52 games over the course of a 162-game season. Essentially, if WAR does a correct job at explaining where wins come from, the linear regression equation should be close to or exactly:

*WINS = 52 + 1.0 × WAR*

So, for each WAR a player contributed to his team, the team should win one more game above the 52-win baseline. Simple, but effective.
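As a quick check on the arithmetic, here is a minimal Python sketch of the baseline calculation. The function names are my own for illustration; only the .320 winning percentage and the 162-game season come from the text above.

```python
GAMES = 162
REPLACEMENT_WIN_PCT = 0.320  # Baseball-Reference baseline cited in the article


def replacement_level_wins(games=GAMES, win_pct=REPLACEMENT_WIN_PCT):
    """Wins expected from a team made up entirely of replacement-level players."""
    return games * win_pct


def expected_wins(team_war, games=GAMES, win_pct=REPLACEMENT_WIN_PCT):
    """Expected wins if every WAR adds exactly one win to the baseline."""
    return replacement_level_wins(games, win_pct) + team_war


print(round(replacement_level_wins(), 2))  # 51.84, which rounds to the 52-win baseline
print(round(expected_wins(40.0), 2))       # 91.84 for a hypothetical 40-WAR team
```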

Here are the results of the 80-team sample regression:

The first thing that jumps out from the regression is how well the samples fit to the projected linear equation. I expected the slope of the trendline to be around 1.0, and it came out to be 0.97, while I expected the intercept to be around 52 and it came out to 52.7—very close.

The correlation coefficient, r, from this sample (.91) was higher than the one from Cameron’s study (.83). Also, that correlation can be converted into an r^2 of .83, which simply means that 83 percent of the variance in wins is accounted for by WAR. That is amazing.
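For readers who want to try this kind of fit themselves, the sketch below shows one way to compute a slope, intercept and correlation with ordinary least squares in plain Python. It is not the tool used for the paper, and the four data points are made up so that they fall exactly on the expected line; they are not the 80-team sample.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up points constructed to lie exactly on WINS = 52 + 1.0*WAR.
# They are NOT the article's data.
team_war = [20.0, 30.0, 40.0, 50.0]
wins = [72.0, 82.0, 92.0, 102.0]
slope, intercept, r = fit_line(team_war, wins)
print(slope, intercept, r)  # 1.0 52.0 1.0

# Converting the article's correlation (r = .91) into variance explained.
print(round(0.91 ** 2, 2))  # 0.83
```

Real team data will scatter around the line, so the slope, intercept and r will only approximate these ideal values.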

Some of the detractors to Cameron’s original study argued that projected wins based on WAR weren’t useful, mainly because he showed that Pythagorean Record had a .91 correlation with wins, which was higher than Cameron’s WAR correlation. Interestingly, the correlation that WAR had in my study was identical to the one for Cameron’s Pythagorean record.

Also, Cameron’s study calculated one standard deviation of difference between WAR and actual wins to be over six wins (6.4), but my sample’s standard deviation is under three wins (2.91). In this sample, 42 of the 80 teams were within three wins of their projected WAR total. Cameron noted that 18 of his 30 teams (67 percent) were within six wins, while in this sample 67 of the 80 teams (84 percent) were within six wins.
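The "within three wins" and "within six wins" counts can be reproduced with a few lines of Python. This is a sketch under my own assumptions: the helper names are hypothetical, "within" is treated as inclusive, the projection is the simple 52-plus-WAR line, and the four teams are invented for illustration.

```python
def projection_errors(team_war, wins, baseline=52.0):
    """Actual wins minus WAR-projected wins (baseline + team WAR) for each team."""
    return [w - (baseline + war) for war, w in zip(team_war, wins)]


def share_within(errors, tolerance):
    """Fraction of teams whose absolute error is at most `tolerance` wins.

    Treating the boundary as inclusive is an assumption, not something the
    article specifies.
    """
    return sum(1 for e in errors if abs(e) <= tolerance) / len(errors)


# Invented example teams, not taken from either study.
team_war = [30.0, 35.0, 40.0, 45.0]
wins = [80, 90, 97, 97]
errors = projection_errors(team_war, wins)
print(errors)                   # [-2.0, 3.0, 5.0, 0.0]
print(share_within(errors, 3))  # 0.75
print(share_within(errors, 6))  # 1.0
```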

The main reason sabermetricians have been distancing themselves from the importance of single-season WAR values is that single-season defensive metrics have a crazy amount of variability, so many people don’t trust them. The defensive statistic that has received the most criticism this season has to be Defensive Runs Saved (DRS), a statistic published by Baseball Info Solutions.

DRS data are used in calculating Baseball-Reference’s WAR, but that data only dates back to 2003; thus, my sample had WAR data which used two different defensive metrics. The 1996-2002 portion of the sample used Sean Smith’s Total Zone Rating (TZR).

I checked to see if there was a significant difference between the WARs based on the two different metrics, mainly because the critiques of DRS put enough doubt in my head that there was a good chance that the DRS portion of WAR had thrown off the sample:

TZR (1996-2002): WINS = .94*(WAR) + 53.37, r = .88; r^2 = .78; p < .001

DRS (2003-2011): WINS = .99*(WAR) + 52.1, r = .94; r^2 = .88; p < .001

The results for both samples are very good. They show that DRS, a heavily criticized defensive metric and a major component of WAR, does not render the statistic useless. Instead, the DRS-era sample had a very high correlation and almost exactly the *WINS = 52 + 1.0 × WAR* regression equation I was looking for.

WAR does a very good job of describing what has happened in a given season; however, that isn’t always very useful. Predicting outcomes in future seasons is almost always more important (valuable) than describing what has happened before, so I decided to test the predictive value of single-season WAR.

I took a random sample of 30 teams (five per season) from 2006-2011 and summed their WARs in the previous season to project their win totals in the subsequent season. For example, I would calculate the cumulative 2010 WAR of the 2011 Toronto Blue Jays’ roster and then regress that total against their actual wins in 2011.

Some critical assumptions to the model were an assumed half-WAR (0.5) reduction for players who were declining (>age 30 in the outcome season) as well as a replacement-level or 0.0 WAR assumption for rookies. The replacement-level assumption may seem a little flawed, because rookies like Brett Lawrie come up and put up 3.0-plus win seasons in their first big league campaign. But for every Lawrie, there are a dozen rookies who play at or below replacement-level in their rookie campaign.
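The assumptions above can be written out as a small Python sketch. The class, the constant names and the three-player roster are hypothetical. The sketch also assumes the half-win reduction applies to every player over 30 regardless of his prior WAR, which the description above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional

AGE_CUTOFF = 30      # players older than this in the outcome season are "declining"
AGING_PENALTY = 0.5  # half-WAR reduction described in the article
ROOKIE_WAR = 0.0     # rookies are assumed to be replacement level


@dataclass
class Player:
    name: str
    age_outcome_season: int
    prior_war: Optional[float]  # None marks a rookie with no prior season


def projected_player_war(player):
    """Prior-season WAR adjusted by the rookie and aging assumptions."""
    if player.prior_war is None:
        return ROOKIE_WAR
    if player.age_outcome_season > AGE_CUTOFF:
        return player.prior_war - AGING_PENALTY
    return player.prior_war


def projected_team_war(roster):
    """Sum of adjusted prior-season WAR across the outcome-season roster."""
    return sum(projected_player_war(p) for p in roster)


# Invented roster for illustration only.
roster = [
    Player("Veteran", 33, 4.0),  # 4.0 - 0.5 = 3.5
    Player("Prime", 27, 2.5),    # unchanged
    Player("Rookie", 22, None),  # 0.0
]
print(projected_team_war(roster))  # 6.0
```

The team total is then the predictor that gets regressed against actual wins in the outcome season.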

Here are the results of predictor year WAR’s (ex., 2010 Blue Jays) ability to project outcome year wins (ex., 2011 Blue Jays):

The results were statistically significant, with a decent correlation of .59. That correlation also means that only 35 percent of the variance in outcome year win totals is accounted for by the previous year’s WAR total. Also, the linear regression equation was nowhere near the expected WINS = 52 + 1.0*WAR. Instead, the equation had an intercept of almost 64 wins, with a slope of just 0.68.
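To see what that flatter line implies, the sketch below compares the expected one-for-one equation with the fitted predictive equation. The intercept is rounded to 64.0 here (the text only says "almost 64"), so the outputs are approximate, and the function names are my own.

```python
def expected_line(war):
    """The one-for-one model: 52 wins plus one win per WAR."""
    return 52.0 + 1.0 * war


def fitted_predictive_line(war, intercept=64.0, slope=0.68):
    """Approximate fitted equation from the prior-year regression (rounded)."""
    return intercept + slope * war


def crossover_war(intercept=64.0, slope=0.68):
    """Prior-year WAR total at which the two lines give the same win total."""
    return (intercept - 52.0) / (1.0 - slope)


for war in (20.0, 37.5, 50.0):
    print(war, round(expected_line(war), 1), round(fitted_predictive_line(war), 1))

print(round(crossover_war(), 1))
```

With these rounded coefficients the lines cross near 37.5 WAR. Below that point the fitted line projects more wins than the one-for-one line, and above it fewer, which is the pull toward the middle you would expect from a slope under 1.0.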

To reiterate WAR’s descriptive strength, I ran a regression between the outcome-year WAR (ex., 2011 Blue Jays) and outcome-year wins (ex., 2011 Blue Jays), for this sample, in the same manner as the original sample of 80 teams:

The results of this regression were essentially identical to the results of the original study; the correlation, r, stayed at .91, and the linear regression was very close to expected, with a slope of 1.02 and an intercept of 51.93 wins.

Single-season WAR is quite obviously much better at describing what has happened than what will happen. I keep emphasizing the fact that some sabermetricians have begun to put very little weight or trust into single-season WAR results, but at the same time there are many sabermetricians who may overuse or overvalue the statistic. I’ve read time and time again in either trade or contract analyses that a certain player is going to provide a 3.0-5.0 win improvement for his new team based on his WAR from the year before. That conclusion is most likely incorrect. Take for instance, this extreme example:

The Red Sox signed Carl Crawford prior to the 2011 season. Crawford was worth over six wins (6.6 WAR) in 2010, and many wrote that Crawford would be a six-win improvement for Boston in 2011. As we all know, Crawford underperformed considerably for the Sox, accumulating zero wins over the course of the 2011 season, a perfectly replacement-level season. I think this example reiterates the point that baseball is so difficult to predict.

On a season-to-season basis, there is too much variability and uncertainty to say that what happened the season prior is definitely what will occur in the subsequent season. Weighted projection systems like Oliver, PECOTA, ZiPS and others are much more capable of looking at the full picture and projecting into the future than one season of data is.

### Conclusion

The fact of the matter is that the small samples behind certain metrics that go into WAR do not represent the true talent level of a player, especially defensively. This fact has caused many to claim that single-season WAR is useless, because it doesn’t reflect the true talent level of an individual.

But why should a single season of WAR reveal to us the true talent level of any player? How many times is one season of work enough to uncover the actual talent of any baseball player?

**Never**. Fluke seasons happen all of the time and are part of what makes baseball great.

While some are correct in saying that certain metrics aren’t a reflection of true talent, others make the claim that single-season defensive metrics are utterly useless because they are largely based on context and sequence of fringe defensive plays; thus, they tend to fluctuate greatly from season to season. But quite honestly, that is true of any baseball statistic. Traditional statistics like ERA, RBI, and even advanced statistics like wOBA and FIP have large sequence and context factors.

Consider this scenario:

Player A: An average defensive player over his career randomly has a great defensive season.

Player B: An average offensive player over his career puts up gaudy offensive numbers out of nowhere.

Player A’s WAR from that defensive season will be dismissed by the vast majority as being useless and incorrect, and his WAR will be ignored because of the “bad or incorrect data” being used to measure his defense. By contrast, Player B’s WAR will be accepted as hard fact, and his numbers will be considered either a fluke or a “breakout” campaign. This doesn’t make a whole lot of sense.

The misconception of true talent level versus where wins come from is where the analysis of WAR as a statistic falls apart.

The last sentence brings us full circle back to Cameron’s quote, which I cited earlier. Here it is another time:

WAR isn’t perfect. But given the known limitations and the variations in how contextual situations impact final record, it does an awfully impressive job of projecting wins and losses.

This quote jibes very well with my point, except with one little modification to the wording. Cameron says WAR does an impressive job of “projecting” wins and losses. I would reword this part to what I think he meant: WAR does an awfully impressive job of *describing* where individual wins come from for a team.

Single-season WAR does a phenomenal job of doing what it says it does. Single-season WAR should not be used to predict win totals or even WAR in a subsequent season. Single-season WAR also is not supposed to reflect the true talent level of a player, which I think is far and away the largest flaw in the way people interpret the statistic. If WAR did reflect true talent, every player would post the same WAR every single year, a number that perfectly encompassed how much value his talent should bring to his team.

Even in the various definitions of WAR, the words “true talent level” never pop up:

-Our definition at THT is:

(WAR) is a metric that combines a player’s contributions on offense and defense and then compares him to the appropriate replacement level for his position.

-FanGraphs’ definition of WAR:

Wins Above Replacement (WAR) is an attempt by the sabermetric baseball community to summarize a player’s total contributions to their team in one statistic.

-Baseball Prospectus’ definition of WAR(P):

Wins Above Replacement Player is Prospectus’ attempt at capturing a player’s total value. This means considering playing time, position, batting, baserunning, and defense for batters, and role, innings pitched, and quality of performance for pitchers…Prospectus’ definition of replacement level contends that a team full of such players would win a little over 50 games.

-Baseball-Reference’s definition of WAR:

The idea behind the WAR framework is that we want to know how much better a player is than what a team would typically have to replace that player.

The consensus seems to be that WAR is how much value (WINS!!) a player contributes to his team over the baseline of a player who could replace him. WAR does not reflect the true talent level of a player, but instead it describes how many wins an individual player contributes on the actual field, and in that aspect it works spectacularly well.

**References & Resources**

All WAR data comes courtesy of Baseball-Reference

Dr. George J. DuPaul was a co-author on the original paper

“Some critical assumptions to the model were an assumed half-WAR (0.5) reduction for players who were declining (>age 30 in the outcome season) as well as a replacement-level or 0.0 WAR assumption for rookies.”

Was any consideration given to a more sophisticated approach to calculating the expected WAR for the next year? I would think that enough WAR data by age would be available to generate a table of WAR aging factors by age.

That was definitely something that I considered. The issue is with variability and with players who are over 37 or so. I think with aging curves we see a slight decline that begins between 28 and 31 and then gets steep really quickly. The reason I think the deductions get so steep, though, is that only great players are still around MLB at that age, so they’re coming down from seriously high WARs.

Also, as I think I showed and many others have shown, WAR varies greatly from year to year, so I tried to put together a model that was as simple as possible. I also took a random sample of a little over 200 players over the age of 30 during that time span; the average WAR decline was 0.44, and I rounded to 0.5 to keep things simple.

You’re probably right that using different aging factors could’ve made single-season WAR more predictive, but I’d be surprised if it affected the results by a great amount.

There’s probably a good chance

You could use those several factors to predict WAR instead of just using what they did in the previous season(s). You could use age (as mentioned above), average games played/missed due to injury, and overall totals (to show how much wear and tear… sometimes age is not as big a factor, i.e. someone who has played full time in MLB from 21 years of age may wear down faster than someone who became a regular at the age of 25/26). You could do all of this using Multiple Regression Analysis, and it will give you a handy-dandy calculation. I don’t mind someone else working on this calculation and running with it, as long as I get credit for its use. I would do the calculations, but I’m on vacation for a few more days and do not have access to Minitab at the moment.

That was essentially my point, though. Projection systems do weight things like playing time, debut, park environment, injury history and a lot more to get a handy-dandy calculation. And I think, for the most part, they do a very good job of projecting WAR.

My point was that using simple WAR addition isn’t enough, which I think we see far too often, like I said with the Crawford example.

Also it looks like my first comment got cut off, the full statement was “there’s probably a good chance that the results would’ve been better, but by only a negligible amount”

Glenn – In your graph entitled “Outcome Year WAR Regression” I am not sure what the blue diagonal line is supposed to represent but if it is supposed to be the regression line it is clearly in the wrong place. And as a general point, when you are presenting multiple graphs that are illustrating similar relationships as you do here, don’t change the scale of the axes as you do with the x-axis in these graphs.

“I also took a random sample of a little over 200 players over the age of 30 during that time span and the average WAR decline was 0.44 and I rounded to 0.5 to keep things simple.”

It is wrong to presume that WAR decline is only going to happen over the age of 30 for two reasons. First, actual decline of skills for both pitching and fielding has been shown to occur beginning at age 23 and with some offensive skills before age 30. More importantly, survivor bias is going to cause a decrease in average year 2 WAR for players of all ages that are still remaining on year 2 MLB rosters.

“there’s probably a good chance that the results would’ve been better, but by only a negligible amount”

It’s not good form to dismiss the intelligent criticism of your commenters when you have no idea whether their suggestions would result in a negligible improvement or a substantial improvement without further study.

Peter, thanks for the comments. The regression line that you’re talking about was fitted by hand, which is why it’s slightly off from where the actual regression line is supposed to be.

My mistake on the different axes. I have changed the graphs and will submit them to the editors if you think this is necessary.

Also, the survivor bias comment makes sense. Again, I was trying to keep the model simple, and maybe I shouldn’t have gone with the decline, but instead just kept the single-season WAR as the predictor for everyone, because that was the point I was trying to make.

You’re right that I shouldn’t have said that there would be negligible improvement because that’s just my opinion. I didn’t think I dismissed his criticism, because I think his suggestion would improve the results, just not sure by how much.

Thanks for the comments though, much appreciated.

Interesting article.

In a paper I wrote over 20 years ago for the Proceedings of the Casualty Actuarial Society, I examined MLB won-lost records from 1901 to 1960.

One minor result was that the previous year’s winning percentage had about 65% credibility for estimating the following year’s winning percentage. In the current context that would be:

next year’s wins = (0.65)(prev. year’s wins) + 28.35.

A season is a bit longer, 162 rather than 154 games, and other things have also changed. So the 65% credibility factor may have changed a little. But this should still be a good base case predictor to which to compare more complicated predictors.

Your use of the previous year’s WAR includes the effects of roster changes from one year to the next, something the naive method I described does not take into account. Thus it should do somewhat better than the naive method.

One thing worth mentioning is the average error or root mean squared error of using your regression equation to predict the next year’s wins. The naive method had a root mean squared error between 11 and 12 games.
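A minimal Python sketch of the naive predictor and the error measure described in this comment follows. The constant 28.35 matches 35 percent of an 81-win (.500) season, so the sketch writes the formula as a credibility-weighted blend of last year’s wins and 81. That reading, the function names and the sample numbers are my assumptions rather than anything taken from the original paper.

```python
from math import sqrt


def credibility_projection(prev_wins, credibility=0.65, league_avg_wins=81.0):
    """Blend last year's wins with a league-average season (81-81 assumed)."""
    return credibility * prev_wins + (1.0 - credibility) * league_avg_wins


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual win totals."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


# A hypothetical 100-win team is pulled back toward 81 wins.
print(round(credibility_projection(100), 2))  # 93.35

# Invented predictions and outcomes, each off by three wins.
print(rmse([90, 80], [87, 83]))  # 3.0
```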

I actually ran a multiple regression that began with previous year wins for the paper that was presented at the Saber Seminar, but I thought it was a little too math-y for a THT article.

Interestingly, with just a sample of 30 teams I got almost the exact same output as you, with 63% credibility: Next Year Wins = (.63)*(prev. year wins) + 28.96. This sample was small so it probably would have become closer to your results had I expanded the sample. Also when predictor year wins was added in as a block in the multiple regression, the r only went from .58 to .64, which wasn’t significant.

In the last comment I meant to say that predictor year WAR, not wins, did not affect the r by a significant amount.

I appreciate the enlightening article, but a bit off topic from the article, what do you think of the seemingly growing tendency of people in the media to try to use WAR as almost an “MVP-measure”? Do you feel WAR is best used for this type of analysis between given players in a certain year? Where do you think WAR fits into an MVP discussion versus other statistical analysis? Would appreciate any insight or opinion you have.

Oops. Sorry, Glenn. Just came across your other article “Andrew McCutchen: more valuable than we may have thought”. I think this probably answers most of the questions I just asked in my previous post, eh?

Joe,

My McCutchen article wasn’t exactly a new definition of how I think the MVP should be selected. I would say that WAR is the starting point, but you should also look at who surrounds the player (% of total team WAR), and I’m also a huge advocate that the player’s team should make the postseason. The award is most valuable player, not most outstanding player, and I don’t see how a player could be most valuable if his team doesn’t make the playoffs (although I know there are a ton of factors involved in making the playoffs).

I will add though that if the Angels don’t make the playoffs this year (a distinct possibility) that I’d have a hard time picking an MVP that wasn’t Mike Trout for the AL

Amazing article. Very good for trying to explain to people why WAR shouldn’t be as hyped up as people think it is. I’m not anti-sabermetrics, but I think SOME sabermetrics can be useful for baseball. However, I still believe (call me old-fashioned, if you will) that one thing that sabermetrics will never be able to predict or quantify is all the subjective variables that exist within baseball. No two at-bats are ever the same. How can sabermetrics weigh the importance of different at-bats? How can they differentiate between an AB in the last game of the season in the bottom of the 9th and one mid-season in the 4th inning?

Baseball is a subjective sport. Emotion plays a crucial part in it. As Hawk Harrelson said (Though this might be extreme): “The only stat that matters is TWTW (The Will To Win)”.