Which Career Metric Predicts Hall-of-Famers Best?

Despite being a controversial Hall of Fame selection, Win Shares thinks Harold Baines is worthy. (via Keith Allison)

One of my regular acquisitions as a fan and student of baseball is the annual Bill James Handbook. Helping to expand the pages of the 2019 edition was James’ introduction of a new Hall of Fame Value Standard statistic. This system combines James’ Win Shares metric with Wins Above Replacement in an effort to determine which baseball players deserve to be in the Hall of Fame.

This piece is not about that system.

It is, partly, about Win Shares. James’ player valuation system predated WAR, helped to inspire WAR, then became a casualty of WAR, as that alternative system leapfrogged it in popularity. (The Hardball Times carried the Win Shares standard for a time, but has not for some years.) Win Shares has become a niche stat.

Though James has stuck with his system, he was not comfortable using it as a sole arbiter of who deserves entry into Cooperstown. He does not even claim it is better than WAR for that purpose. Quoting from his piece in the 2019 Bill James Handbook: “Career WAR predicts Hall of Fame entry about as well as Win Shares do, I would guess; maybe a little better, maybe a little worse, I honestly don’t know. I don’t know that anyone has ever studied that.”

You can see where this is heading. If nobody had studied it before, somebody has now.

Filling the Ranks

Two of the player value metrics I’d test Win Shares against were obvious: the FanGraphs and Baseball-Reference versions of WAR. I had hoped to include Baseball Prospectus’s WARP as another contender for the honors, especially as their new Deserved Run Average and newer Deserved Runs Created metrics add fresh if controversial wrinkles to the question. However, their current WARP tables do not include early players such as Ty Cobb, Honus Wagner, and Tris Speaker, so I had to pass on WARP’s inclusion.

In its place, I chose Jay Jaffe’s well-known JAWS (Jaffe WAR Score) system. JAWS may seem to have an advantage in that it was conceived specifically to value a player for Hall of Fame consideration. As Jaffe himself would say, though, the system is prescriptive rather than predictive.

The JAWS numbers I use will be based on bWAR. I considered calculating fWAR versions of JAWS as a further candidate metric—FanGraphs currently doesn’t offer its own version—but decided I couldn’t. I would need to determine the best seven fWAR seasons not just for Hall members, but for every player who could plausibly have a better fJAWS score than the lowest-ranking Hall member. This was too daunting for me.

As compensation, I give you this trivia tidbit I unearthed while testing whether to attempt fJAWS’ inclusion. Hank Aaron’s best major league season was worth 8.9 fWAR; his second best was 8.4 fWAR. Aaron’s 15th-best season was worth 6.8 fWAR. This level of consistent excellence should leave us all slack-jawed, and I hope I’ll be forgiven for my slackness about JAWS. (And for that pun.)

For purposes of the study, I counted as Hall of Famers those people who had been inducted as players (including the class elected last week), and not by the various Negro League committees. The latter restriction creates a couple of uncomfortable borderline cases—Larry Doby was a “regular” inductee, while Monte Irvin was not—but is a necessary dividing line.

Being inducted as a player produces its own sticky borderline cases, as a few honorees may have had their managing work push them over the line. (I’m thinking of Frank Chance, Hughie Jennings, and Red Schoendienst.) Reverse cases also exist. Clark Griffith was a good enough pitcher that he could well have been inducted as a player, but as a manager and then owner of the Washington Senators he got in by another route. Joe Torre had a decent Cooperstown argument for his playing days, but his glory run as Yankees manager took that argument out of the veterans’ committee’s hands. These examples underscore what we already knew, that playing value and Hall of Fame induction as a player are not perfectly linked.

Once I had the Hall of Fame roster set, I divided it into pitchers and position players. The two types are evaluated by different methods, both by the value metrics and by Hall of Fame voters. (I considered separating catchers the same way, but decided against it.) There are again borderline cases where a player performed as both, such as Bob Caruthers and John Montgomery Ward. Fortunately, a clear majority of their values came from one side, as measured by all metrics (Caruthers as a pitcher, Ward in the field).

That done, I then had to calculate where each HoF player ranks all-time in the various career metrics I was examining. This meant looking up Win Shares, WAR, and JAWS for each player at or above the level of the lowest inductee as a position player or pitcher. I counted total career value, which included batting for pitchers and any adventures on the mound for position players. (This arose primarily with fWAR, which counts pitching and batting separately.)

From that list, I excluded players who are still active, or have not been retired long enough that they could have reached the writers’ ballot for the Hall of Fame. This takes out of consideration anyone who has not had a chance to be inducted. For the same reason, I excluded banned players who are not eligible for Hall consideration, such as Pete Rose and Joe Jackson. I did not enforce the 10-year minimum career rule. It would have added much tedious work for a rule that has been waived in the past for a candidate thought worthy: Addie Joss, who died before he could pitch his 10th season.


This left me able to compare the actual Hall of Famers to all the players who performed at least as well, by the varying metrics, as the least worthy Hall of Famer. This means I could measure which members were, by those metrics, deserving of their places, along with which fell short and by what proportion. Thus, I could see which systems made better calls about players being worthy of Cooperstown.

A Few Views

I chose a number of measurement methods to get a broad view of the question. One simple method is to count how many Hall of Famers in each category—pitchers and position players—make the top-X list in each metric. There are now 159 position players in the Hall of Fame, and 73 pitchers. This means that, for example, every Hall pitcher in the top 73 of JAWS’ pitcher rankings earns a point for JAWS.

Another simple approach tallies up the Hall members who fall outside a list twice as long as the Hall contingent (318 for position players; 146 for pitchers), counting them as “mistakes” by that system. I will use both methods separately, and also in tandem by subtracting “mistakes” from “hits.”
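
The hit-and-mistake tally described above can be sketched in a few lines of Python. This is my reading of the method, not the author’s actual code, and the ranks in the example are hypothetical.

```python
# A minimal sketch of the hit/mistake tally, assuming we already know each
# Hall member's all-time rank (1 = best) under a given metric.
def tally(hof_ranks, contingent_size):
    hits = sum(1 for r in hof_ranks if r <= contingent_size)         # inside the top N
    mistakes = sum(1 for r in hof_ranks if r > 2 * contingent_size)  # outside 2N
    return hits, mistakes, hits - mistakes

# Three hypothetical Hall pitchers ranked 10th, 80th, and 200th
# (the pitcher contingent is 73):
print(tally([10, 80, 200], 73))  # → (1, 1, 0)
```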

Another angle is to look at the severity of each miss: how far a Hall member finishes outside a system’s list of the statistically Hall-worthy. This penalizes a system that has Hall of Fame members finishing far down its list, one that makes big misses. In this way, a metric that gives decent ratings to inductees widely considered blunders, like Tommy McCarthy or Lefty Gomez, comes out ahead by providing some plausible support, or at least muted opposition, to their place in Cooperstown. Higher scores, of course, will be worse scores.

I did this in two ways. My first method was to assign a “miss” one point for each increment, equal to half the size of the Hall contingent, by which it missed the Hall-worthy ranks. (E.g., the 74th-best pitcher would get one point, the 146th-best would get two, the 147th-best three.) Since the two classes both have odd numbers, this produced a number of untidy half-point results. Partly due to this messiness, I tried again by making it one point for every full contingent size by which a player missed.
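
As a sketch, the two increment scorings might look like this in Python. I’ve assumed a ceiling on fractional increments, which matches the worked example above (74th, 146th, 147th), though the half-point totals in the tables suggest the original handling of boundaries differed in some detail.

```python
import math

# Hedged sketch of the increment scoring: a Hall member ranked past the
# Hall-worthy line earns one point per increment (or fraction thereof) by
# which it missed. The increment is half or all of the contingent size.
def increment_score(hof_ranks, contingent_size, increment):
    total = 0.0
    for r in hof_ranks:
        if r > contingent_size:
            total += math.ceil((r - contingent_size) / increment)
    return total

ranks = [74, 146, 147]                     # the worked example in the text
half = increment_score(ranks, 73, 73 / 2)  # 1 + 2 + 3 points
full = increment_score(ranks, 73, 73)      # 1 + 1 + 2 points
print(half, full)  # → 6.0 4.0
```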

This, I hope, provides enough differing views to give us some confidence about which valuation system is best, or worst, at predicting Hall of Famers. After all of that setup, it’s time for some results. I will start with the position players.

Hall-Prediction Ratings for Player Value Metrics (Position)
Measure              Win Shares   bWAR   bJAWS   fWAR
Hits                        108    112     110    111
Mistakes                     18     13      14     12
Hits minus Mistakes          90     99      96     99
Half Increments           116.5    111     109    103
Full Increments              75     68      71     67

(As a brief reminder, lower “Increments” scores are better scores.)

The one clear pattern is that Win Shares lags the field, having the worst score in all five measures. While the pack is close on hits, Win Shares leads clearly in mistakes, categorizing the most Hall members as clearly outside the top ranks. It flops in the incremental measures too, even though its worst miss—Tommy McCarthy, the whipping boy for detractors of the Hall’s induction methods—was closer to the ranks of the deserving than the worst miss of each competing metric.

(For the record, bWAR had the biggest whiff, ranking McCarthy as the 967th-best of all qualified position players, against a Hall contingent of 159. Win Shares put him all the way up at 614th.)

Rankings are much closer with the WAR-based measures, but a narrow consensus emerges for fWAR as the best of them. bWAR does marginally better than JAWS, the system based on its numbers. All three of them lead in at least one of the five measures I used, so nobody is a clear-cut winner. It’s only Win Shares’ loss that’s clear-cut.

But that’s only for position players. Looking at pitchers flips the script.

Hall-Prediction Ratings for Player Value Metrics (Pitcher)
Measure              Win Shares   bWAR   bJAWS   fWAR
Hits                         52     51      50     45
Mistakes                      8     10      11     13
Hits minus Mistakes          44     41      39     32
Half Increments              46     71    78.5    104
Full Increments              30     42      46     60

This time, Win Shares leads by all five measures, and convincingly by the incremental ones. fWAR ends up last on all five scorecards, again often by convincing margins. bWAR retains its slight lead over JAWS in the middle of the pack. Once more, Win Shares had the least embarrassing of misfires, with just one mistake worse than two full lengths out of the top 73.

I have some good theories regarding the big swings in predictive power we see here, but I’ll save them for now. First, I’ll combine the previous two tables to show how the value metrics performed overall in predicting Hall of Fame members.

Hall-Prediction Ratings for Player Value Metrics (Total)
Measure              Win Shares   bWAR   bJAWS   fWAR
Hits                        160    163     160    156
Mistakes                     26     23      25     25
Hits minus Mistakes         134    140     135    131
Half Increments           162.5    182   187.5    207
Full Increments             105    110     117    127

By the measures of hits and mistakes, bWAR consistently comes out on top, though never by dominating margins. By the measures of how bad its mistakes are, Win Shares finishes best, and by the half-increment measure its victory is a rout. Its huge margin among pitchers easily cancels out its poor showing for position players. In every measure, bWAR outperforms fWAR. Also in every measure, JAWS retreats at least a little from its forebear bWAR’s strong showing.

Were I to arrange this like the Olympics, fWAR would be off the podium, and JAWS would have bronze. Win Shares and bWAR are close enough, each with its area of strength, that a clear call is tough to make. Should it be bWAR, which never finishes worse than second, or should it be Win Shares, which is always at least close even when last in a category and pulls off the greatest domination of any one measure? My narrow choice is for bWAR to take the gold, leaving silver to Win Shares.

Whys and Wherefores

Bill James guessed that career WAR would be about as good a predictor as career Win Shares, and at least as far as bWAR was concerned, he was on the money. He didn’t foretell the divergence of effectiveness between position players and pitchers. On that matter, and a couple of others, I have some thoughts.

First, the gap between bWAR and fWAR arises mainly from fWAR’s poorer performance with pitchers. There is a clear reason why that would be so. fWAR bases its pitcher ratings on FIP, which seeks to measure underlying performance rather than outward results. bWAR instead uses the actual runs allowed by the pitchers, and does not account for the strength or weakness of the defenses behind them. Thus bWAR hews closer to results on the field and the scoreboard, which historically is how voters and committees have judged most candidates for the Hall of Fame. bWAR is, in a sense, made to judge Hall of Fame pitchers better than fWAR is.
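
For reference, FanGraphs computes FIP from only strikeouts, walks, hit batters, and home runs. The league constant varies by season; the ~3.10 below is a typical modern value I’ve assumed for illustration.

```python
# FIP per the FanGraphs definition; the constant (an assumed 3.10 here)
# is set each season so that league-average FIP matches league-average ERA.
def fip(hr, bb, hbp, k, ip, constant=3.10):
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# A hypothetical season: 20 HR, 50 BB, 5 HBP, 200 K in 200 innings.
print(fip(20, 50, 5, 200, 200))
```

Note that nothing here depends on the defense behind the pitcher, which is exactly why fWAR diverges from the results-based record that Hall voters have historically judged.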

The difference between Win Shares and the two flavors of WAR is a more general one. James himself observed that Win Shares tends to favor a long career, while WAR gives more credit for excellent seasons even if the overall career is shorter. This is because Win Shares works from a replacement level significantly lower than the one the WAR metrics use. A player who would compile a negative WAR for a poor season could still register a few Win Shares. Indeed, there is no such thing as a negative Win Shares score: you can add to your career total or hold level, but never subtract. (I will sit on my jokes about how Albert Pujols probably prefers Win Shares.)
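
A toy illustration of that asymmetry, with entirely hypothetical season values: the WAR seasons can subtract from the career total, while the Win Shares seasons only ever add.

```python
# Hypothetical career: a short peak followed by two poor hang-around years.
war_seasons = [6.0, 5.5, -0.5, -1.0]   # WAR can go negative in a bad year
ws_seasons  = [25, 23, 4, 2]           # Win Shares bottom out at zero

career_war = sum(war_seasons)  # the poor years subtract: 10.0
career_ws  = sum(ws_seasons)   # every year adds: 54
print(career_war, career_ws)   # → 10.0 54
```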

A couple of examples will show these differences in action. Sandy Koufax famously had a compact period of brilliance in a short career (for a Hall of Famer). While none of the metrics shows Koufax among the top 73 pitchers, only Win Shares has him low enough, at 156th, to qualify as an outright mistake. On the other hand, recent controversial inductee Harold Baines ran a slow, steady, and long race in the majors. All the WAR-based metrics consider Baines to be short of Hall-worthiness, with JAWS calling him a mistake. Win Shares has him 139th, inside the line at 159.

Rankings of the Aforementioned Players
Player          Class Size   Win Shares   bWAR   JAWS   fWAR
Sandy Koufax            73          156    109     82     83
Harold Baines          159          139    316    410    311

Does this mean Win Shares is objectively wrong about Koufax, or right about Baines? No. This is a matter of how closely the systems track with Hall of Fame admission practices, which themselves have produced enough questionable results that nobody thinks them infallible. There is room for idiosyncrasy to produce a better result, which leads me to Win Shares’ approach to pitchers.

Win Shares is old-fashioned, arguably obsolete, in that it assigns actual values to the wins, losses, and saves awarded to pitchers. No current player value formula would include such a thing. James also included a further adjustment for the saves and (to a lesser extent) holds a reliever earned, which was his early attempt to measure and value leverage. While this might do poorly to gauge true reliever value, as a gauge for how likely a reliever is to make Cooperstown it does much better.

Of all four metrics studied here, only one put Mariano Rivera within the bounds of Hall-worthy pitchers, and that was Win Shares. Only one other primary relief pitcher in the Hall of Fame was found Hall-worthy by any of the systems: Hoyt Wilhelm, again by Win Shares. All the systems put Dennis Eckersley in the upper rank, but his long stretch as a starting pitcher accumulated the value that got their attention over the likes of Rollie Fingers and Goose Gossage. For all eight relievers in Cooperstown, their ranking by Win Shares is the best of the four systems.

Hall Relievers’ Rankings by Player Value Systems
Reliever           Win Shares   bWAR   JAWS   fWAR
Dennis Eckersley           37     51     69     43
Mariano Rivera             53     76    115    161
Hoyt Wilhelm               70    121    170    399
Goose Gossage             112    157    172    262
Lee Smith                 150    281    341    362
Rollie Fingers            171    345    408    329
Trevor Hoffman            172    297    366    368
Bruce Sutter              226    372    355    577

The idiosyncrasy of Win Shares makes it uniquely able among these four to value relief pitchers highly, notably those relievers whom the Hall of Fame has valued highly. The crediting of wins likewise gives the system a leg up on valuing starters in a way close to how the writers and committees have done so. That oddity of the system, which would be considered a weakness in this age of scorn for the win and a hunt for pure statistical objectivity, ends up a strength in this particular way.

Conclusion

I’ve shown how well four of the main player value systems predict who gets admission to the Hall of Fame. Is this past performance indicative of future results? Probably not.

The BBWAA is a moving target, changing in two parallel ways. Not only are older voters who lean toward older methods of player valuation being replaced in the ranks by writers more attuned to modern analytics, but remaining voters are taking more heed of the newer and shinier measures of excellence. This is probably less true for the committees, but that means the effect is muted, not nullified. Increasingly, the voters will be looking to the metrics themselves for guidance, which will make those metrics self-fulfilling prophecies to a degree.

Given this, I can offer a few predictions of my own. bWAR is likely to separate itself a bit from Win Shares over the coming years, simply because the WAR metrics are more popular than Win Shares. JAWS could make up some ground against bWAR, being a metric specifically designed to look at worthiness for the Hall of Fame. fWAR could begin to climb out of its cellar if voters start to prefer FIP over ERA, process over results, but I judge that to be less likely, at least in the short term.

This assumes, of course, that the value metrics don’t also move. There could be some modification of the various WARs, or a tweaking of JAWS, that alters the equations. The new Deserved Runs Created version of WARP very well might do likewise. There might even be a Win Shares 2.0. It seems unlikely to me that Bill James would throw a few years of work into such a project—but if Harold Baines can get into the Hall of Fame, who’s to say what’s impossible?

References and Resources

  • Jay Jaffe, The Cooperstown Casebook
  • Bill James, Win Shares
  • The Bill James Handbook, 2010-2014 and 2019 editions, from Baseball Info Solutions
  • Baseball-Reference, primarily the Play Index and Bullpen
  • FanGraphs
  • Bill James Online


A writer for The Hardball Times, Shane has been writing about baseball and science fiction since 1997. His stories have been translated into French, Russian and Japanese, and he was nominated for the 2002 Hugo Award.
Comments
Takiar:

I guess it’s no surprise the oldest system scored well. Most HoFers were elected on traditional stats and awards (no pitcher won a Cy Young based on FIP), and that is even now mostly the case. Jaffe routinely references James’s HoF index for that reason. The reason Baines is so perplexing is that his traditional numbers are nothing exceptional, other than longevity of good batting. He never dominated any category or built big milestones.

tramps like us:

Nice to see Bill James getting so much love!

Cool Lester Smooth:

Quick correction:

rWAR explicitly does account for the strength or weakness of the defense, albeit questionably well, due to its use of season-long DRS numbers to do so.

I think it would be interesting to see how FG’s RA9-WAR correlates – that’s purely based on runs allowed.

The Kudzu Kid:

I’d be curious to see how WAA does here. You can accumulate a bunch of WAR just by hanging around for a long time, but you can’t do that with WAA.

Jetsy Extrano:

The JAWS peak component is similar in flavor to WAA, and probably hurts it here since the real voters seem to like accumulators. And Harold Baines.

Cool Lester Smooth:

Eh. “The real voters” didn’t like Baines at all – his highest vote total was 6.1%.

Paul G.:

One of the quirks of Win Shares is that 52% goes to defense as opposed to a standard 50/50 split. The reasoning for this is pitchers were not getting enough shares at 50/50. It is possible that Win Shares has a bias towards pitching, at least compared to WAR.