Can we objectively evaluate advanced fielding data?

Colin has an article at Baseball Prospectus today that is required Sabermetrics 101 reading. It’s been linked already in the THTLinks Twitter feed this morning, but it deserves more prominence and more discussion. This is an article of fundamental importance for the baseball analysis community. Anyone who is evaluating fielding, which is almost everyone in these heady days of Wins Above Replacement (WAR) statistics, needs to read and understand what Colin is saying.

He states an approach that should have been tackled by the analytical community before advanced defensive metrics started to gain such widespread acceptance. How did we ever come to accept such statistics without ever objectively testing them?

Now that they are being tested objectively, it should not surprise us that problems are being found. That does not invalidate the metrics. It is the path to knowledge. We are, after all, on a search for objective knowledge about baseball, are we not? Openly and objectively testing defensive metrics is not the quest of those who want to destroy baseball knowledge, as some will tell you when this topic is broached. It is a path well-worn by sabermetric pioneers, though “small is the gate and narrow the road that leads to life, and only a few find it.”

We want to know what we know and why we know, when we trust it and when we don’t trust it. We want to know what the sources and ranges of the errors are. This way lies improved fielding metrics and the ability to silence critics with facts that can be demonstrated in a way that is convincing, not one that demands blind faith that sabermetricians know what they are doing.

This also does not mean that we should stop using advanced fielding data today or that it has zero utility in objective sabermetrics. First and foremost, this is a clarion call to the community to turn its research efforts toward cracking this problem. Secondly, it’s a wake-up call to understand and quantify the uncertainty in our measurements related to fielding, and to the derivative statistics like WAR, not a call to abandon them altogether.

Scientific inquiry has always operated in an environment of measurements made with uncertainty. This has led scientists to devote great effort to estimating the bounds of that uncertainty in order to determine their confidence in their measurements, and thus their confidence in the conclusions based on those measurements. There is no need to abandon the “science” of fielding measurement. Far from it. There is a need for the application of the time-tested sabermetric approach.

Doubt is not something to be feared. When its source is based on facts, doubt is healthy. Colin’s doubt, which I share, is healthy. Let’s take this opportunity as an analytical community and turn doubt into growth.

13 years ago

“the subjectivist (i.e. Bayesian) states his judgements, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science”


Mike Fast
13 years ago

One correction: My sentence, “How did we ever come to accept such statistics without ever objectively testing them?” would be more accurately stated as “How did we ever come to accept such statistics without thoroughly testing them?”

There has been some objective testing of fielding metrics before, and I don’t mean to diminish that work.  It’s just that I believe we need to do a much more thorough job of testing than we have done.

13 years ago


Too often people take stats at face value and don’t test them. Too many times I’ve commented along the lines of “Bob is a bad fielder” and someone will say, “But UZR says he’s good, and besides, fielding has less effect on WAR than hitting.” All those things are true, but they do not necessarily mean that A) Bob is a good fielder or B) fielding is less important than hitting. What they really mean is that A) our best fielding metric, which may not be any good, disagrees with you and B) our model of player value does not consider fielding valuable.

Sabermetrics is a science. Part of science is continually testing hypotheses and improving models.

I have one idea that I think would be very interesting. I think it should be very easy to test WAR, assuming I understand the stat correctly. I think WAR was determined retrospectively by looking at past performance and win totals, but it should be easy to test prospectively. If you establish the number of games a team of replacement-level players should win over the course of a season, then add the total WAR for all players on the roster for that team for the year, shouldn’t you get a reasonable approximation of the team win total? You get thirty data points a year, and if you don’t get a good correlation there is something wrong with the model. Has anybody ever done that?

Colin Wyers
13 years ago

Very nice post, Mike. And thanks for the kind words.

MikeS – Validating WAR (whichever version of it you want to name) takes more than measuring agreement with team win totals; showing that WAR(P) agrees with team wins is a necessary but not sufficient condition. You also need to show that you’re correctly distributing those team wins to individual players.

It’s actually possible that once you get past a certain level of agreement at the team level for an “uberstat,” a metric that’s less predictive of team wins may be preferable, if it does a better job handling split credit.

13 years ago

Thanks Colin.  My stats education is about 20 years behind me so I may be wrong but how’s this line of reasoning sound:

If you add up all the individual WAR and get the team’s win total you may have a stat that works well for evaluating individual ballplayers.

If you get a wildly and inconsistently different number, your stat probably does not reflect the individual contributions very accurately. (Incidentally, this is how UZR and other fielding metrics look to me. A player will be great one year and bad the next. Much less consistent than offense or pitching stats.)

If you are consistently above or below predictions (especially if it is by a consistent amount) you may have a good stat, but need to figure in a correction factor.

So this can’t completely confirm that WAR is accurate on the individual level, but it can tell you if it has big problems. For instance, if the total WAR of the Orioles (or Yankees, for that matter) suggests they should have won 85 games, you’ve probably got a problem. Obviously there will be outliers. I think I remember that you can learn a lot about the usefulness of your model by studying the outliers, but I digress. You would need large samples, but you do get 30 data points/year.
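
The three cases above (agreement, inconsistent misses, consistent misses) can be told apart by looking at the residuals, actual wins minus predicted wins. This is a rough sketch; `residual_summary` is a hypothetical helper name, and what counts as a "large" bias or spread is left to the reader.

```python
from statistics import mean, stdev


def residual_summary(actual_wins: list[float], predicted_wins: list[float]) -> tuple[float, float]:
    """Return (bias, spread) of actual minus predicted wins.

    bias   = mean residual; a consistent offset suggests a correction factor.
    spread = sample standard deviation; large scatter suggests the stat
             is not tracking team wins well.
    """
    residuals = [a - p for a, p in zip(actual_wins, predicted_wins)]
    return mean(residuals), stdev(residuals)
```

A bias near zero with a small spread is the first case, a large spread is the second, and a sizable bias with a small spread is the third.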

Does that make sense to someone who knows more math than me?

John Walsh
13 years ago

I’ve been thinking about these same issues over the last couple of years.  So much so that I posted a “mailbag question” on Tango’s blog asking about how we know we can trust UZR or any other defensive metric.

MGL responded with a suggestion on how to test a defensive metric, but he also suggested that UZR was simple enough so that we can just “look at it” and know that it makes sense.  I’m paraphrasing here, you can see the details here:

13 years ago

UZR does not “just make sense.” Anybody who says it does is trying to force it down your throat.

UZR for Paul Konerko since 2002:
-3.1, 2.4, -6.5, 2.8, 0.0, 0.4, -0.9, 1.9, -4.0
Sometimes good, sometimes bad.  For a first baseman.  How does that make sense?

UZR for Alfonso Soriano since 2006 (all LF):
6.7, 31.6, 15.9, -3.1, 4.8
From a little above average to great to below average? Ask any Cub fan if Soriano was ever acceptable as a LF. His best position is the batter’s box.

UZR for Derrek Lee since 2002:
-4.8, -1.3, 1.8, -1.4, 1.0, 0.6, 6.8, 3.6, 3.3
I thought Derrek Lee is regarded as a good fielder? Wouldn’t know it from that.

UZR is anything but intuitive. Until we have good fielding metrics, it is not fair to discount fielding as less important than offense or pitching. Maybe it’s not, but if you are using current fielding metrics to back up your argument, tell your story walking.
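
To put a number on the season-to-season swings in the three UZR lists above, here is a small Python sketch. The choice of summary statistics (mean, sample standard deviation, range) is an assumption made for illustration, and the Lee series is read as nine values, treating "-1.4 1.0" as two seasons.

```python
from statistics import mean, stdev

# Seasonal UZR values exactly as listed in the comment above.
uzr_by_season = {
    "Konerko (since 2002)": [-3.1, 2.4, -6.5, 2.8, 0.0, 0.4, -0.9, 1.9, -4.0],
    "Soriano (LF, since 2006)": [6.7, 31.6, 15.9, -3.1, 4.8],
    # Assumes the unseparated "-1.4 1.0" in the list is two seasons.
    "Lee (since 2002)": [-4.8, -1.3, 1.8, -1.4, 1.0, 0.6, 6.8, 3.6, 3.3],
}


def summarize(values: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, and max-minus-min range of a series."""
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "range": max(values) - min(values),
    }


for player, values in uzr_by_season.items():
    s = summarize(values)
    print(f"{player}: mean {s['mean']:+.1f}, stdev {s['stdev']:.1f}, range {s['range']:.1f}")
```

A large spread relative to the mean is consistent with the complaint here, but it does not by itself separate real changes in fielding from measurement noise.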

Laurent Courtines
13 years ago

I am more of a lurker in these debates and have minimal math knowledge.  I care about baseball and am fascinated by the work done by Hardball Times,  Baseball ThinkFactory, Baseball Pro, Fangraphs and our populist mouthpiece Rob Neyer.
I think the biggest step forward will come when the MLB installs the full field in play effects data.  It is my understanding that they are beginning to get these cameras up and will be able to collect data on the balls in play with ball flight paths, player positions and speeds of the ball in flight.  While the data may be difficult to disseminate and may start in only one or two ball parks at a time,  this data will lead the next revolution in fielding metrics. 
I hope that MLB does not try and lock it in a safety deposit box and hide it from the Sabermetrics community. 
When we are able to get data on individual players’ ability to judge balls by speed, that will be amazing. Can you imagine knowing that Carl Crawford is amazing at balls that leave the bat at over 120 MPH but terrible at anything coming his way at 80? (Or something like that.) It will come, and I hope the data is set free!

John Walsh
13 years ago


The variation in year-to-year UZR does not necessarily make it incorrect.  It’s possible (actually, I think it’s established) that it takes several years’ worth of data for UZR to stabilize.

I don’t think MGL is trying to shove anything down our throats. I believe he really finds UZR intuitive, and in some sense it is. The underlying method is straightforward. But even though it might be straightforward in concept, there are many details. And in any case, until you check something against real data, you can’t really tell if it’s working or not.

13 years ago

For 2009, if you take actual wins and subtract hitting WAR and pitching WAR (Fangraphs Totals) you get a number anywhere from 34.9 (TB) to 53.0 (Cin) for replacement wins. That’s an 18.1 difference between the high and low.

Colin Wyers
13 years ago

Mark, there’s also a league difference issue to account for in Fangraphs WAR, as they don’t have pitcher hitting sum to zero for the league.

13 years ago

AL range: 34.9 to 50.8 (15.9 difference)
NL range: 37.8 to 53.0 (15.2 difference)

AL average: 43.66
NL average: 48.57

Cle: 29.8 WAR, 65 actual wins
Bal: 22.5 WAR, 64 actual wins

LAA led the AL in wins over WAR, at 50.8
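
For anyone following the arithmetic, the "replacement wins" figures in this thread are actual wins minus team WAR. A short Python sketch using only the two team lines quoted above; the function name is made up for illustration.

```python
def implied_replacement_wins(actual_wins: float, team_war: float) -> float:
    """Wins left over after subtracting team WAR from actual wins."""
    return actual_wins - team_war


# (actual wins, team WAR) for 2009, as quoted in the comment above.
teams_2009 = {"Cle": (65, 29.8), "Bal": (64, 22.5)}

for team, (wins, war) in teams_2009.items():
    print(f"{team}: {implied_replacement_wins(wins, war):.1f}")
# prints:
# Cle: 35.2
# Bal: 41.5
```

If every team shared one replacement level and WAR were exact, these leftovers would all be the same number, so the gap between them is a rough gauge of how much is going unexplained.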