﻿ Win Probability Added Above Replacement | The Hardball Times

Zach Britton was great in 2016, but was he really the American League’s best pitcher? (via Keith Allison)

Short Version

WPA, used as a value metric, is incomplete. You have to build in replacement level, including bullpen chaining, to get the full story. These adjustments, which are commonly accepted parts of WAR, shift value to starting pitchers relative to high-leverage relievers, while keeping the win expectancy framework of WPA at the heart of the value metric. With readily available data we can use some basic algebra to convert WPA to WPA-Above-Replacement, or WPAAR.

Much Longer Version

Thanks to Zach Britton’s near-perfect season, a reliever received real Cy Young support. In the saber community, his candidacy was supported by WPA (Win Probability Added), where he led the majors with +6.1 wins (Andrew Miller finished second with +4.8 wins). WPA-backed arguments mimic those that tout Mike Trout as the obvious MVP: “whoever helped their team win the most games is the most valuable player.” The problem with WPA-based MVP arguments, however, lies in the assumption that the WPA leader is the player who helped his team win the most games. Even after deciding that the win probability framework is the one you want to use, WPA is just a win probability metric, not the win probability metric, and, as I’ll lay out below, it’s an incomplete win probability metric.

Background: Win Probability and Leverage

Win probability (or win expectancy) is the baseball version of the little percentages next to the cards in televised poker. Given the state of the game, and an expectation of “all” the possibilities that could occur the rest of the game, how likely is each team to win? Win probability added is the change in win probability after an event occurs. When Jose Bautista hits a home run, the Blue Jays are more likely to win, and that change in expectancy is credited to Bautista. Add up all these little changes over a full season, and you have a player’s WPA.
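As a minimal sketch of that bookkeeping, assume a hypothetical list of plate appearances, each with the batting team's win expectancy before and after the event. The numbers and the `total_wpa` helper are made up for illustration; real WPA uses a full win expectancy table.

```python
from collections import defaultdict

# Hypothetical plate appearances: (batter, win expectancy before, after).
# The numbers are illustrative, not taken from real games.
events = [
    ("Bautista", 0.50, 0.62),  # home run in a tie game
    ("Bautista", 0.62, 0.59),  # strikeout later on
    ("Teammate", 0.59, 0.66),  # double
]

def total_wpa(events):
    """Sum each player's changes in win expectancy."""
    totals = defaultdict(float)
    for player, we_before, we_after in events:
        totals[player] += we_after - we_before
    return dict(totals)

wpa = total_wpa(events)
```

Each entry in `wpa` is that player's net change in win expectancy over the listed events, which is all a season WPA total is.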

WPA is very similar to WAA, Wins Above Average, except for how the wins are tallied: WPA uses win probabilities, while WAA uses linear weights. In the middle is REW (Run Expectancy Wins). Run expectancy, like win expectancy, uses the game situation to calculate the value of each play. The difference is that run expectancy only takes into account runners on base and number of outs, while win expectancy also accounts for inning and score. Linear weights doesn’t care about the context of an event at all, using the average value across all possible contexts. REW, like linear weights, uses a runs-per-win converter to translate runs into wins. Win probability starts with wins as the unit.

To summarize:

Metric Summary

Linear weights
• Context/leverage: None
• Question answered: On average, across all situations a PA might occur in, how many runs does a single add?
• Common stats: wRAA, RAA, WAA (converted to wins)

Run expectancy
• Context/leverage: Runners on base, outs
• Question answered: How many more runs do we expect to score this inning because of this single?
• Common stats: RE24, REW (converted to wins)

Win probability
• Context/leverage: Inning, score, runners, outs
• Question answered: How much more likely are we to win this game because of this single?
• Common stats: WPA

Championship probability
• Context/leverage: Standings, inning, score, runners, outs
• Question answered: How much more likely are we to win the World Series because of this single?
• Common stats: cWPA

cWPA References:
B: http://baseballanalysts.com/archives/2009/04/championship_wp.php
C: http://www.hardballtimes.com/the-top-10-plays-of-2016-according-to-championship-wpa/

In all of these expectancy metrics, there is an inherent assumption that some situations are more important than others. For example, an at-bat in a tied game in the ninth inning matters more than in a six-run game in the fifth. It matters more because the outcome of the at-bat has a bigger influence on the outcome of the game. Mathematically, the average change in win expectancy is larger in the first example – there are wider swings. The difference between a strikeout and a home run is quite wide in a tied game in the ninth, while the difference is negligible in a six-run game in the fifth. And you know that intuitively, because your heart is racing. This “average change in win expectancy” is known as leverage. Every situation can be assigned a leverage value using similar math to expectancy metrics. Each expectancy metric has its own version of leverage, according to the context it cares about.
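To make the idea concrete, here is a rough sketch that treats leverage as the expected absolute swing in win expectancy, scaled so that an average situation comes out at 1. The outcome probabilities, the swings, and the league-average figure are all hypothetical; published leverage indexes are built from full win expectancy tables.

```python
# Hypothetical outcome distributions: (probability, change in win expectancy).
# All numbers are made up for illustration.
tied_ninth = [(0.70, -0.06), (0.27, 0.05), (0.03, 0.40)]
six_run_fifth = [(0.70, -0.004), (0.27, 0.004), (0.03, 0.02)]

# Assumed average swing across all situations (hypothetical).
LEAGUE_AVERAGE_SWING = 0.035

def average_swing(outcomes):
    """Expected absolute change in win expectancy for one situation."""
    return sum(p * abs(delta) for p, delta in outcomes)

def leverage_index(outcomes, league_average_swing=LEAGUE_AVERAGE_SWING):
    """Swing relative to the average situation, so average leverage is 1."""
    return average_swing(outcomes) / league_average_swing

li_ninth = leverage_index(tied_ninth)
li_fifth = leverage_index(six_run_fifth)
```

With these made-up inputs the tied ninth inning comes out well above 1 and the six-run fifth well below it, matching the racing-heart intuition.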

If you’ve heard of leverage, it’s most likely the one associated with win expectancy, but there’s also base-out leverage, championship leverage, etc. (Linear weights does not have an associated leverage, since outcomes have no context in linear weights.) FanGraphs reports a few aggregated stats measuring win expectancy leverage. pLI is a pitcher’s average leverage across all of his plate appearances. inLI is the average leverage at the start of each inning he began. gmLI is the average leverage at the moment he entered each game. exLI is the average leverage at the moment he exited. When calculating reliever WAR, wins above average based on linear weights (or FIP or ERA) is multiplied by LI to give relievers who pitch more important innings more credit for their runs prevented.

Background: Bullpen Leverage Chaining

While closers pitch high-leverage innings and deserve a lot of credit for doing so, their replacements aren’t replacement-level relievers, but instead are setup guys. When a closer goes down, the guy added from Triple-A is given mop-up duty, not the closer role, while everyone else moves one step higher on the ladder. The closer is replaced by the setup guy, the setup guy is replaced by the 7th inning guy, all the way down the line. All those little changes add up to yield the actual value of the closer. To account for this, we give half credit for the higher leverage innings of good relievers. Why half? Because that’s what makes the math work out – there’s a longer explanation and an example calculation here if you are interested in said math. Closers usually deserve to close because they’re excellent relievers, but replacing them with setup guys doesn’t hurt the team as much as their raw leverage and WPA numbers suggest.

Background: Replacement Level

Again, what all these probability/expectancy stats have in common is that they are relative to average. You can interpret that as the league summing to zero net wins, or that each player is compared to an average player. But we don’t use wins-above-average very often, because it’s incomplete. It doesn’t account for the value that an average player provides over a replacement level player. It says that a 0 WAA player over 10 plate appearances was just as valuable as a 0 WAA player over 600 plate appearances. But you’d rather have the second player, because the first requires you to find another 590 plate appearances at league-average rate. That’s not easy, and not cheap. That’s the reason why we usually use WAR (Wins Above Replacement), building in the value of an average player above and beyond that of a replacement level player. This can be more than a two-win difference for full-time players.
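A small sketch of the 10 PA versus 600 PA comparison, assuming for illustration that the gap between average and replacement is 2 wins per 600 plate appearances. That constant and the function name are assumptions for this example, not a published WAR formula.

```python
# Assumed gap between an average and a replacement-level player,
# expressed per 600 plate appearances (illustrative value only).
REPLACEMENT_WINS_PER_600_PA = 2.0

def wins_above_replacement(waa, plate_appearances):
    """Add playing-time credit for average-over-replacement value to WAA."""
    credit = REPLACEMENT_WINS_PER_600_PA * plate_appearances / 600
    return waa + credit

part_timer = wins_above_replacement(0.0, 10)    # average over 10 PA
full_timer = wins_above_replacement(0.0, 600)   # average over 600 PA
```

Both players have 0 WAA, but under this assumption the full-timer is worth about two wins more than the part-timer once replacement level is built in.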

As you can probably guess, these adjustments comparing an average player to a replacement level player significantly decrease the value of high-leverage relievers when judged solely by WPA. But these are all adjustments that we already make in WAR and are commonly accepted. By using win probability above replacement, we’re still giving bullpen aces lots of credit for their higher-leverage performances, just not as much as raw WPA claims.

The New Stuff: Converting WPA to WPAAR

So, what’s the solution? I’m going to call it Win Probability Added Above Replacement, and calculate it using the 2016 versions of Zach Britton and Jon Lester (the top starter by WPA) as examples. The two main adjustments are for bullpen chaining and differing replacement levels of starters and relievers.

• Start with WPA. For Britton, this is +6.1. For Lester, this is +4.6. Because the former is a higher number than the latter, many people make claims like “WPA says Zach Britton was more valuable than Jon Lester.” The purpose of this article has been to highlight the context missing in that interpretation of the two numbers. I’d go so far as to say it’s plain wrong.
• Adjust pLI (leverage index) halfway toward 1. This is the bullpen chaining adjustment. For Britton, a pLI of 1.8 becomes 1.4. For Lester, .94 stays at .94, since he’s a starter. WPA is giving Britton full credit for the situations he pitched in, when he really only deserves half.
• Move WPA toward average (zero) by the ratio of LI_adj/LI. For Britton, that ratio is 1.4/1.8 = 78%, and 78% * 6.1 = 4.8. For Jon Lester, no change from +4.6. Because Britton only deserves half credit for the high-leverage situations he finds himself in, his WPA is adjusted down.
• Credit the player for the value of league-average performance over replacement level. For starters, that’s about 2 wins per 180 innings. So Jon Lester gains 2*202/180 = 2.2, for a total of 6.8 WPAAR. But since reliever replacement level is approximately league average, there’s no extra credit for Britton. He stays at +4.8.
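The four steps above can be sketched in a few lines of Python. This is a rough illustration rather than the exact calculation behind the leaderboard: the function name `wpaar` and the `is_starter` flag are made up for this example, and the inputs are the rounded figures quoted in the bullets.

```python
def wpaar(wpa, pli, innings, is_starter):
    """Convert WPA to WPA-above-replacement under the article's assumptions."""
    # Chaining: a reliever's pLI is moved halfway toward 1; starters are unchanged.
    pli_adj = pli if is_starter else (pli + 1) / 2
    # Scale WPA toward zero (or away from it) by the ratio of adjusted to raw leverage.
    chained_wpa = wpa * pli_adj / pli
    # Replacement level: about 2 wins per 180 IP for starters, none for relievers.
    replacement_credit = 2.0 * innings / 180 if is_starter else 0.0
    return chained_wpa + replacement_credit

britton = wpaar(wpa=6.1, pli=1.8, innings=67, is_starter=False)
lester = wpaar(wpa=4.6, pli=0.94, innings=202, is_starter=True)
```

With these rounded inputs the sketch gives roughly 4.7 for Britton and 6.8 for Lester. The small gap from the 4.8 quoted above comes from rounding the ratio to 78% before multiplying.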

In total, Jon Lester gains 2.2 wins due to replacement level, while Britton loses 1.4 wins due to replacement level and chaining. Britton’s 1.5 win lead in WPA over Lester becomes a 2.0 win deficit in WPAAR. Here’s the top 25 leaderboard from 2016.

Name Team IP WPA WPAAR delta
Jon Lester CHC 202 4.6 6.8  2.2
Johnny Cueto SF 219 3.8 6.2  2.4
Max Scherzer WAS 228 3.6 6.1  2.5
Kyle Hendricks CHC 190 3.9 6.0  2.1
Justin Verlander DET 227 3.5 6.0  2.5
Clayton Kershaw LAD 149 4.2 5.8  1.7
Jose Fernandez MIA 182 3.2 5.2  2.0
Tanner Roark WAS 210 2.9 5.2  2.3
Aaron Sanchez TOR 192 2.9 5.1  2.1
Masahiro Tanaka NYY 199 2.6 4.8  2.2
Chris Sale CHW 226 2.3 4.8  2.5
Zach Britton BAL  67 6.1 4.8 -1.4
Jose Quintana CHW 208 2.4 4.7  2.3
Madison Bumgarner SF 226 1.9 4.4  2.5
J.A. Happ TOR 195 2.2 4.4  2.2
Rick Porcello BOS 223 1.8 4.3  2.5
Noah Syndergaard NYM 183 2.1 4.1  2.0
Corey Kluber CLE 215 1.7 4.0  2.4
Cole Hamels TEX 200 1.8 4.0  2.2
Julio Teheran ATL 188 1.9 3.9  2.1
Andrew Miller – – –  74 4.8 3.8 -1.0
Marco Estrada TOR 176 1.8 3.8  2.0
Jake Arrieta CHC 197 1.6 3.8  2.2
Carlos Martinez STL 195 1.6 3.7  2.2
Rich Hill – – – 110 2.4 3.6  1.2

If you want to see the whole list, which displays more of the data, you can see it here.

Now, I don’t actually suggest using WPA for starting pitchers, as their leverage is heavily dependent on run support and timing of the runs scored in the game, which are clearly not pitching skills (for more on not using WPA for starting pitchers, read these three pieces at The Book blog). A better approach is to use a different, more traditional WAR metric for starting pitchers, even if you want to compare them to the WPAAR numbers of relievers. If we remove starting pitchers from the WPAAR leaderboard above, here’s how relievers stack up:

2016 WPAAR Leaders, Full-Time Relief Pitchers
Name Team IP WPA WPAAR delta
Zach Britton BAL 67 6.1 4.8 -1.4
Andrew Miller – – – 74 4.8 3.8 -1.0
Sam Dyson TEX 70 3.6 2.6 -0.9
Dan Otero CLE 70 2.1 2.5  0.4
Mark Melancon – – – 71 3.1 2.4 -0.7
Jeremy Jeffress – – – 58 2.9 2.3 -0.6
Roberto Osuna TOR 74 2.8 2.2 -0.5
Aroldis Chapman – – – 58 2.7 2.2 -0.5
Robbie Ross Jr. BOS 55 1.8 2.0  0.2
Will Harris HOU 64 2.3 1.9 -0.4
Mychal Givens BAL 74 1.9 1.9 -0.1
Seung Hwan Oh STL 79 2.2 1.8 -0.4
Joe Blanton LAD 80 1.9 1.7 -0.1
Blake Treinen WAS 67 1.7 1.7  0.0
Cody Allen CLE 68 2.1 1.7 -0.4
A.J. Ramos MIA 64 2.1 1.6 -0.5
Mauricio Cabrera ATL 38 1.9 1.6 -0.3
Ryan Buchter SD 63 1.7 1.6  0.0
Addison Reed NYM 77 1.8 1.5 -0.2
Brad Hand SD 89 1.7 1.5 -0.2
Tyler Lyons STL 48 1.1 1.5  0.3
Kenley Jansen LAD 68 1.8 1.5 -0.3
Kelvin Herrera KC 72 1.7 1.5 -0.3
Nate Jones CHW 70 1.9 1.5 -0.4
Tyler Thornburg MIL 67 1.9 1.4 -0.4
Matt Bush TEX 61 1.6 1.4 -0.2
Peter Moylan KC 44 1.1 1.4  0.3
Jeurys Familia NYM 77 1.8 1.4 -0.5

Additionally, WPA does a poor job of parsing defensive credit between pitching and fielding (as in, it doesn’t do it). A great play by a fielder is credited to the pitcher under WPA, when really the pitcher should be held accountable for the quality of the batted balls he gave up, while the fielder is credited or debited value from that point depending on whether he makes the play. With the growing popularity and availability of Statcast data, this splitting of WPA credit between pitchers and fielders might be possible.

Conclusion

After adjusting WPA to account for replacement level and bullpen chaining, Zach Britton remains one of the top five most valuable pitchers in the American League in 2016, and only Justin Verlander is significantly ahead of him. But the lead he held in WPA has disappeared. WPA is a fine metric, but it’s incomplete. You can’t forget replacement level and all of its repercussions. With WPAAR, I think we have a metric that is more closely aligned with a pitcher’s true value.

Bill
6 years ago

I just want to say that the bullpen chaining argument, as presented here (I haven’t yet read the linked piece that describes it) sounds utterly bogus to me.

The argument, as I follow it, is that because an AAA replacement is given the lowest leverage innings, and the setup guy is thus given the highest leverage innings, the closer should effectively be compared against a “setup guy-replacement” as opposed to a “AAA replacement” as is typically meant. It’s effectively shifting your defined replacement level up because of the other guys in the bullpen.

But the reason that argument fails is that, in this idealized bullpen, the setup guy and all the guys behind him are now worse than the guy they just replaced one rung higher on the ladder. So when the closer is lost and the entire ladder is shifted up one, even if the AAA replacement is at the bottom of the ladder, *every rung on the ladder still loses run-value* compared to having the nominal closer at the top.

Let’s examine this using hypothetical metrics and values. Let’s say we have 8 bullpen pitchers in a ladder, where each rung on the ladder provides 1/8 less “leverage-adjusted value per inning above replacement” for some arbitrary value metric, where the bottom rung is 1/8 and the top closer is at 1. Thus we have a bullpen that looks like this:

1
7/8
6/8
5/8
4/8
3/8
2/8
1/8

Now when we knock off the closer, the ladder is shifted up:

7/8
6/8
5/8
4/8
3/8
2/8
1/8
0

Instead of crediting the closer with “1/8 above replacement”, he should indeed be credited for the 1/8 lost at every rung, or 1.
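A quick sketch of that ladder arithmetic, using exact fractions and the same hypothetical values as above. It assumes, as the ladder does, that every rung carries the same workload.

```python
from fractions import Fraction

# Hypothetical ladder: closer at 1, each rung 1/8 lower, down to 1/8.
before = [Fraction(k, 8) for k in range(8, 0, -1)]

# Closer removed: everyone moves up one rung, replacement (0) joins at the bottom.
after = before[1:] + [Fraction(0)]

# Value lost at each rung, and the total across the bullpen.
loss_per_rung = [b - a for b, a in zip(before, after)]
total_loss = sum(loss_per_rung)
```

Every rung loses 1/8, and the eight losses add up to 1, the closer's full value over replacement in this simplified setup.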

…………………………………………..

Okay, having read the linked piece, I agree that the argument there does, theoretically account for value lost at all rungs of the ladder, but I still have serious issues with the “ERA weighted by LI” methodology. LI is just an index-valued measurement of leverage, not some way to adjust run production. Just because a situation was subject to 1.8 higher times WPA-shifts in either direction doesn’t mean that 1 run production there is worth 1.8 run production at a average-leverage time (in other words, I’m saying WPA-shift magnitude *isn’t* a linear function of the relative importance of runs at the time, or rather the inverse function). Furthermore, the part where he uses leveraged-runs to compute WAR on a 10 runs/win basis is completely meaningless, since the *whole point* is that high leverage runs prevented are more valuable than 1/10th of a win. Finally, the formula of regressing LI towards 1 is completely meaningless, derived as some sort of heuristic that “chaining is somewhere between chaining and replacement”, and leads to something like a replacement reliever being given a 0.8 LI instead being given *more credit* simply by being put into meaningless game situations.

I agree that starter WPA is meaningless (since they always start games, they can’t choose what LI they pitch into, unlike relievers (where “they” of course means the manager)), and I also agree that a replacement reliever is a significantly different beast than a replacement starter, but I’m not buying “regress reliever LI towards the mean to adjust for chaining” as an argument, or perhaps more saliently, even if I do buy it as an argument, I certainly do not accept that rattletrap hand waving of “half the LI for relievers!” as a suitable adjustment.

Overall I’m quite intrigued by your article here and I definitely would like to see further research into applying WPA to relief pitching(/defense of course), but I think the bullpen chaining thing needs to be addressed in much greater detail, and with more mathematically sound reasoning, than the BBS article from 2009 uses.

Mike
6 years ago

The reason for the chaining has to do with a different effect, I think. This is something that was discussed on Fangraphs a lot during the playoffs in regards to when to use Andrew Miller.

Assume for a moment a game where the score is 2-0 after the 1st inning. The starter goes 6 innings, then 7th, 8th, 9th are each individual relievers.

Because each new inning closes a higher percentage of remaining opportunity to score, the WPA for each pitcher will go 9th, 8th, 7th regardless of who they face. Think about it like this:

The perfect 7th inning removes 1/3 of the opposition’s remaining outs to come back with.
The perfect 8th removes 1/2 of the opposition’s remaining outs.
The perfect 9th removes all of the opposition’s remaining opportunities.

This means that the 9th inning will produce higher WPA numbers regardless of the fact that all 3 pitchers did the same thing (and in a face-nearly-the-minimum scenario like this, it’s probably the 7th inning guy who has the hardest job, as he’s most likely to face spots 1-5 in the batting order in a low-baserunner game).

Because of this, replacement level for each role has to reflect the fact that the WPA is not distributed evenly.

This isn’t the author’s given reason for his adjustment, but it’s probably a better one than the one he explains.

studes
6 years ago

Right. Number of outs left is another way of saying that leverage increases during the game, as measured by leverage index. I think the author is assuming people understand what leverage index is and how it works.

Sky Kalkman
6 years ago

I agree that the ability to give different quality relief pitchers different average LI’s is what requires the chaining. That’s not different from what I wrote in the article, but it could have been called out more directly.

I’ll point out that what you say about leverage increasing as the game goes on is true *given the assumption that the game remains close*. If each reliever gives up one run instead of none, the leverage is likely to get lower over the rest of the game. And if the game doesn’t start close, leverage will go down over the rest of the game if the score doesn’t change.

Sky Kalkman
6 years ago

1. Glad you came around on the chaining 😉 You’re exactly right that rep level for a closer isn’t the setup man, but the combined shifting of all relievers up the ladder.
2. “Just because a situation was subject to 1.8 higher times WPA-shifts in either direction doesn’t mean that 1 run production there is worth 1.8 run production at a average-leverage time” — yes, that’s actually exactly what it means! And you, yourself say it two sentences later: “the *whole point* is that high leverage runs prevented are more valuable than 1/10th of a win.” You can either multiply runs by LI and then convert to wins, or you can convert runs to wins and then multiply by LI. Same thing. (WPA already measures leveraged wins directly, of course.)
3. “Finally, the formula of regressing LI towards 1 is completely meaningless, derived as some sort of heuristic that “chaining is somewhere between chaining and replacement”, and leads to something like a replacement reliever being given a 0.8 LI instead being given *more credit* simply by being put into meaningless game situations.” I agree the 50% regression is imprecise, but that doesn’t make it inaccurate. There is room for theoretical improvement. The second sentence is wrong — a bad reliever (with a negative WPA) will be LESS valuable as he’s given higher LI appearances, because the negative number is multiplied by a larger LI.
4. This “half the LI” thing is what BRef and FG use in their WAR calculations for relievers, FYI. You can go read about their rationale for it at their sites. There’s also a bunch of stuff out there at Tangotiger’s old blog, and some by Rally/Sean Smith.

Sky Kalkman
6 years ago

In point #2, you’re right that it’s the inverse; I misread what you wrote. But that just means we’re in agreement, I think. Do you think I implied the other way somewhere?

studes
6 years ago

Chaining can be a complicated subject. As an example, it can be applied to position players too. For instance:

Let’s say your catcher is a 3.0 talent and his backup is a 1.0 talent, both over 600 PAs. And to make it realistic, let’s say that your catcher has 400 PAs and his backup has 200 PAs in Season A, for 2.33 WAR in total from catchers.

What if your primary catcher is injured all of Season B? Based on the catcher’s WAR, you might say that the team will lose 2.0 WAR. But the backup catcher becomes the primary catcher and a replacement catcher becomes his backup. The result, based on same playing time, is 0.67 WAR, or 1.67 less for the team. Not 2.0.

That’s the impact of chaining. There is a difference in playing time between the primary catcher and the backup catcher, which means you can’t just say that 3-1=2.
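The catcher arithmetic can be checked with a short sketch. The `team_war` helper is hypothetical and simply prorates each player's 600-PA talent by his playing time, with the replacement catcher set to 0.

```python
def team_war(players):
    """players: list of (WAR talent per 600 PA, plate appearances)."""
    return sum(talent * pa / 600 for talent, pa in players)

# Season A: 3.0-talent starter gets 400 PA, 1.0-talent backup gets 200 PA.
season_a = team_war([(3.0, 400), (1.0, 200)])

# Season B: backup is promoted to 400 PA, a replacement (0.0) takes 200 PA.
season_b = team_war([(1.0, 400), (0.0, 200)])

loss = season_a - season_b
```

Here `season_a` is about 2.33, `season_b` about 0.67, and `loss` about 1.67, not the 2.0 that the starter's own WAR would suggest.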

Sky is applying the same logic to relievers. Since we’re using WPA to begin with, you should have bought into the concept that certain runs allowed mean more than other runs allowed (if not, you probably don’t belong here at all). On a WPA scale, the runs that relievers allow don’t all have the same impact and Leverage Index is the way we capture the difference. It’s the impact of runs allowed vs. playing time; LI vs. PA.

Perhaps there is a better way to capture this chaining impact, but LI minus 1 seems to work pretty well. I’d be interested to know how else you’d approach it. After all, there is hardly a consensus about replacement level in general.

Sky Kalkman
6 years ago

And there’s leverage considerations for hitters, too! Middle-of-the-order lineup spots tend to hit in slightly higher leverage situations, because there are likely to be more runners on base in front of them. WAR doesn’t give them this leverage credit. Should it? I’d argue yes, because just like a dominant reliever deserves to pitch in higher leverage situations, a top hitter deserves to hit in the middle of the lineup. Should it give them all of the leverage credit? Probably not. How much credit, then? Uhh, err, I’ll get back to you.

studes
6 years ago

Just use RE24 for batters! That’s my preference, anyway.

Andy
6 years ago

Now we’re starting down a slippery slope. Players on teams with great offense also hit with runners on base more often, on average, than players on lesser offensive teams [think Josh Donaldson vs Mike Trout, 2015]. That is probably going to translate into higher average leverage, though it depends on the average runs allowed by the team, too. In this case, higher leverage production clearly correlates with Runs and RBI, context-dependent stats that I thought no saber would be caught dead praising.

tz
6 years ago

Studes, for your catcher chaining example, I assume we should be using the value of the average backup catcher in the league for the replacement-level calculation, correct?

Peter Jensen
6 years ago

Sky – I don’t think you should be multiplying WPA by leverage index. WPA already is computed with the exact leverage for the situation. So relievers are already correctly valued using WPA. Leverage Index was devised to be able to correct for other metrics that don’t include all the factors that WPA uses.

Studes – Using RE24 for batters is correct for descriptive metrics but you can actually calculate run value added by line up position if you want to correct for the variation of opportunities by lineup position. Linear weights is better for predictive metrics unless you are a believer of clutch hitting.

Sky
6 years ago

Peter, I’m multiplying WPA by [(LI+1)/2]/LI. If LI is 1.8, that’s [(1.8+1)/2]/1.8 = 1.4/1.8 = .78. For any LI above 1, this ratio is less than 1, shifting WPA towards zero. The reasoning is that while WPA correctly values the change in WP for any event, relievers don’t deserve full leverage when determining their value to the team, because their replacements would also be “given” those leverage opportunities. (It’s the same logic as the LI adjustment in reliever WAR.) Think of it as WPA/LI, but instead of removing all the leverage of each PA, just a bit is removed.

Peter Jensen
6 years ago

Thanks for the explanation Sky. I knew you were multiplying by (LI+1)/2, but I had missed that you were then dividing the result by LI.

studes
6 years ago

Thanks Peter. To me, WAR is a descriptive stat, which is why I’m all in on RE24. On Twitter, Sky asked me if I like RE24/boLI. That’s one way of correcting for variations. Unfortunately, my memory doesn’t work so well. I remember that I wasn’t a fan of boLI, but I can’t remember why. Getting old sucks.

John Autin
6 years ago

I haven’t fully digested the theory, and I tend to doubt there can be one metric that lets RPs be gauged against both RPs and SPs. But I admire the effort, and the clear explanation. Thanks, Sky!

jsolid
6 years ago

In the move from WPA to WPAAR, there is a major change that went unmentioned but deserves to be highlighted. Team WPA is the sum of individual WPA, but this is not the case with WPAAR. This same shift happened in the move from Win Shares to WAR.

I am waiting, like you, for StatCast-based WPA, or SCWPA, or SCWPAAR. I imagine we are pretty much there; every launch angle/exit velo should by now have a pretty well established probability distribution of outcomes, allowing us to calculate the expected value of a batted ball.

Actually, we probably could go one step further, and build those probability distributions for the pitches themselves, based on location, speed and movement. Naturally you encounter some problems around the extremes; there is only one Zach Britton slider. It would be interesting to see if it’s possible to build a model to predict what the value of those extreme/unique pitches should be.

This would lead to another interesting Delta, the difference in expected value based on incoming pitches versus outgoing batted balls. This would provide a quantitative metric for which pitchers are hit more (or less) than their stuff would suggest.
