Evaluating the Evaluators

by David Gassko
February 3, 2006

This article is part of a large study I have posted on my blog. I wanted to post some background on all the systems I looked at here at THT, but I strongly encourage you to read my blog article as well. It gets a bit technical at times, but it’s probably the most extensive comparison of various defensive systems ever done, and there is a lot of interesting stuff there. The comparison is here.

When most baseball fans look back at this offseason, maybe they’ll think about outrageous free agent signings, changes in MLB’s drug policy, or the first World Baseball Classic. But what I’ve found most memorable this offseason has nothing to do with what the real teams are doing, but rather the work of many dedicated baseball analysts: The push towards better quantifying defense.

This fall, we’ve seen David Pinto’s Probabilistic Model of Range (PMR) work, Chris Dial’s Zone Rating (ZR) translations, and my own Range system. Each uses different data, and to some extent, each system accomplishes something different. But overall, they’re all trying to do the same thing: measure defense.

Defense is in some ways the final frontier for baseball statisticians. Sure, there are disagreements about which metric is best to measure offense, but no one’s going to argue that David Eckstein is a better hitter than Derek Jeter. When it comes to their defense, though, you’ll get plenty of questions.

The main problem is estimating chances. Once we know how many outs a player should have caught, we can judge his defense. But figuring out how many outs a player is expected to make is not so easy. Different systems try to get there differently. Here’s a brief overview of the five defensive systems most often mentioned by analysts.

Ultimate Zone Rating (UZR)

It would not be an overstatement to suggest that Mitchel Lichtman (also known by his initials, MGL) revolutionized player evaluation when he began publishing UZR ratings in the late 90s. UZR was the first system to calculate how many outs each player was expected to make based on play-by-play data, convert that information into run values, and then compare that number to the average.

UZR starts off by looking at the probability that a ball in play will be turned into an out, based on the type of batted ball it is (ground ball, fly ball, line drive, etc.), the zone that ball was hit to, and how hard it was hit. UZR then adjusts for everything under the sun, including:

1. Park Factors

2. Batter Handedness

3. The pitcher’s groundball-to-flyball ratio

4. Base/Out Situation

This way, UZR is able to capture almost everything that could impact a fielder’s chances of catching a ball—his positioning, how well hit the ball is, how much ground the fielder has to cover, etc. UZR is generally considered to be the gold standard of fielder ratings, and most sabermetrically-advanced teams, like the A’s or Red Sox, use similar systems to internally judge fielders.

Sadly, Lichtman is now working for the Cardinals, so UZRs are no longer publicly available, and baseball fans are forced to go to other sources to get good defensive ratings.

Probabilistic Model of Range (PMR)

When Lichtman reported that he would not be releasing full UZR ratings after the 2004 season, uber baseball blogger David Pinto stepped up to the plate (THT policy states, “No more than one bad pun per article,” so I’m done), and started publishing PMR. What I like most about the system, perhaps, is its name. The word “probability” really describes what PMR, and UZR for that matter, is doing: It finds the probability of each ball in play being caught by a certain fielder.

Pinto’s approach is very different from Lichtman’s, though what the two systems are trying to do is very similar. Under the UZR system, a ball is assigned a probability of being caught by a certain fielder, and then that probability is adjusted based on the various factors listed. PMR uses empirical probabilities, meaning that it looks at each ball in play that was the same type of batted ball, hit in the same direction, with the same “hardness”, in the same park, thrown by a pitcher of the same handedness, and hit by a hitter of the same handedness, and assigns its ratings based on the probability of that specific type of ball in play being made into an out by each fielder.
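To make the empirical-probability idea concrete, here is a minimal Python sketch. It is not Pinto’s code: the bucket fields, the park code "FEN", the toy data, and the helper names `out_probabilities` and `expected_outs` are all invented for illustration. It assumes each ball in play is tagged with its full context and with the position that turned it into an out, if any.

```python
from collections import defaultdict

# A bucket stands in for (ball type, direction, hardness, park,
# pitcher hand, batter hand). All values here are invented.
BUCKET_1 = ("GB", "56", "medium", "FEN", "R", "R")
BUCKET_2 = ("GB", "4", "hard", "FEN", "L", "R")

# League-wide balls in play: (bucket, position that made the out or None for a hit).
league_balls = [
    (BUCKET_1, "SS"), (BUCKET_1, "SS"), (BUCKET_1, "3B"), (BUCKET_1, None),
    (BUCKET_2, "2B"), (BUCKET_2, None),
]

def out_probabilities(balls):
    """Empirical P(out made by position | bucket) from observed balls in play."""
    totals = defaultdict(int)
    outs = defaultdict(lambda: defaultdict(int))
    for bucket, fielder in balls:
        totals[bucket] += 1
        if fielder is not None:
            outs[bucket][fielder] += 1
    return {
        bucket: {pos: n / totals[bucket] for pos, n in by_pos.items()}
        for bucket, by_pos in outs.items()
    }

def expected_outs(balls, probs, position):
    """Sum the league out probability for this position over a team's balls in play."""
    return sum(probs.get(bucket, {}).get(position, 0.0) for bucket, _ in balls)

probs = out_probabilities(league_balls)

# Balls in play behind one (hypothetical) team while its shortstop was on the field.
team_balls = [(BUCKET_1, "SS"), (BUCKET_1, "SS"), (BUCKET_2, None)]

ss_expected = expected_outs(team_balls, probs, "SS")
ss_actual = sum(1 for _, fielder in team_balls if fielder == "SS")
print(f"expected {ss_expected:.1f}, actual {ss_actual}")
```

The difference between actual and expected outs is the raw material for the rating. The finer the buckets, the fewer balls fall into each one, which is the sample-size tradeoff between the two approaches.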

Each approach has its advantages: Lichtman’s gives him larger sample sizes, so that his probabilities are “smoother”, while Pinto’s approach might allow him to better capture any “weirdnesses” in the data, which could arise due to a stadium’s environment. However, there are some issues with PMR that UZR clearly handles better:

1. PMR does not look at how far a ball travels, which can create problems in the outfield. The hope is that how hard a ball is hit will capture how far it travels, but it would obviously be better if PMR did track distance.

2. PMR uses four years of data. On one hand, this provides for a larger sample size, which is good. On the other hand, there’s a lot of variation from year-to-year in the probability of a ball in play being caught, so the four-year baselines can distort information, and make for “weird” ratings.

3. This isn’t really a problem anymore, but it had been one. In his original model (which Pinto still publishes), Pinto included line drives and infield pop-ups for infielders. That caused all kinds of weird ratings, because the number of line drives a player catches is largely a function of luck (how close he was standing to where the ball flew) and the number of pop-ups an infielder catches is largely a function of how much of a ball hog he is (i.e., Orlando Hudson waves everyone off on almost every infield fly). Pinto is now publishing a ground-balls only model, which is what you want to look at for infielders. UZR, ZR, and Range also only look at ground balls for infielders.

Despite these shortcomings, PMR is a great system, and good for evaluating player defense. The fact that it uses play-by-play data alone makes it a worthwhile endeavor. I would suggest, however, that if you want to interpret PMR ratings, it’s best to visit “Chone” Smith’s blog where “Chone” has graciously done the work of converting PMR ratings into runs above average.

Zone Rating (ZR)

Zone Rating has been around for almost two decades, and while it is often ridiculed, ZR is actually a very useful system. Here’s how Zone Rating works: Every zone where a fielder has a better than 50% chance of catching a ball is assigned to that position, so straight away center field (Zone N) is the center fielder’s zone. Every ball hit into a player’s zones is counted as a chance, as is every ball he catches outside of his zones.

And that right there is the main problem with Zone Rating. Let’s say we have two players, each of whom is on the field for six balls in play. Five are within their zones, and one is not. Player A catches all five balls within his zone and allows the sixth to go in for a hit. Player B catches four balls in his zone, and the one out of it. But what are their Zone Ratings going to be? Player A will have a 5/5 = 1.000 ZR, while Player B will have a (4 + 1)/(5 + 1) = .833 ZR, even though they made the same number of plays in the same number of chances.
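The arithmetic for the two players can be checked with a short Python sketch. The function name `zone_rating` and its parameters are my own labels, not part of any official Zone Rating definition. It simply counts out-of-zone plays in both the numerator and the denominator, as described above.

```python
def zone_rating(in_zone_plays, in_zone_chances, out_of_zone_plays):
    """Plays made divided by chances; out-of-zone plays count as both."""
    plays = in_zone_plays + out_of_zone_plays
    chances = in_zone_chances + out_of_zone_plays
    return plays / chances

# Both players see six balls in play and turn five of them into outs.
player_a = zone_rating(in_zone_plays=5, in_zone_chances=5, out_of_zone_plays=0)
player_b = zone_rating(in_zone_plays=4, in_zone_chances=5, out_of_zone_plays=1)

print(f"Player A: {player_a:.3f}")  # 1.000
print(f"Player B: {player_b:.3f}")  # 0.833
```

The out-of-zone ball that Player A lets drop never enters his denominator, which is why his rating comes out perfect.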

Because of this, Zone Rating will tend to overrate sure-handedness, as players with good hands will make all the plays they’re expected to, while underrating rangy players. Nevertheless, Zone Rating generally is in agreement with UZR and is actually a good system.

Zone Rating itself is tough to understand, but it’s easy to convert to runs. Chris Dial has done that for the 2005 season, actually. Dial has in fact done a lot of work with Zone Rating, and I’d recommend reading his stuff. You can start here.

Davenport Fielding Translations (DFTs)

DFTs are Clay Davenport’s fielder ratings on Baseball Prospectus. While I don’t exactly know what goes into them, I do know that Davenport tries to estimate a fielder’s chances based on the number of outs made by the outfield and by the infield (as a proxy for groundball-to-flyball ratio), the number of innings pitched by lefties on the staff, and various other factors. It’s basically an adjusted Range Factor, and is expressed in Runs Above Average.

What’s cool about BP’s fielding stats is that they are available for every player in major league history. They’re decently reliable everywhere but at first base and in right field, so a lot of historical arguments and questions can be settled with the use of DFTs. However, there are much better systems for evaluating fielding freely available today, which makes DFTs kind of pointless when it comes to evaluating today’s fielders.

Also, because Davenport has never fully explained how the ratings are derived, and occasionally changes his method without warning, it’s tough to judge how good a system he has beyond the final results it produces, which are okay, but nothing special.

Range

Okay, so this is my metric. The math is explained here. The basic idea is this: I estimate each fielder’s chances based on the number of ground balls, outfield fly balls, and line drives his team allows, as well as the number of left- and right-handed batters it faces. Knowing these inputs gives me very good estimates of chances, as you will see if you read the article linked to at the top.

Range, in my opinion, complements Zone Rating really well. Dial and I have gotten into arguments over this, but I think I finally have proof that I was right. My argument has been that each system covers the other system’s weakness: Zone Rating uses exact ball-in-play data rather than estimating chances, while Range doesn’t have the bias against rangy players that Zone Rating does. The test I did on my blog confirms that, sort of.

Both Zone Rating and Range correlate very well with UZR, but not so well with each other. What we can infer from that is that they capture different parts of UZR, and that a combination of the two would really give you good results. And in fact, if we weight ZR at 2/3 and Range at 1/3, we get a very good metric, one that correlates almost perfectly with UZR.
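For anyone who wants to try the blend on their own numbers, here is a rough Python sketch. It assumes both ratings have already been converted to runs above average on the same scale, which is my assumption for the example. The names `blend` and `pearson` and all of the sample values are invented, and they do not reproduce the results of the study.

```python
from math import sqrt

def blend(zr_runs, range_runs):
    """Weight Zone Rating at 2/3 and Range at 1/3."""
    return (2 * zr_runs + range_runs) / 3

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented runs-above-average figures for five hypothetical fielders.
zr_runs = [9.0, -4.0, 2.0, 12.0, -7.0]
range_runs = [3.0, -10.0, 8.0, 6.0, -1.0]
uzr_runs = [8.0, -6.0, 5.0, 10.0, -5.0]

blended = [blend(z, r) for z, r in zip(zr_runs, range_runs)]

print("ZR vs UZR:     ", round(pearson(zr_runs, uzr_runs), 3))
print("Range vs UZR:  ", round(pearson(range_runs, uzr_runs), 3))
print("Blend vs UZR:  ", round(pearson(blended, uzr_runs), 3))
```

Swapping in real ratings for the three lists is all it takes to run the same comparison.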

Concluding Thoughts

UZR is clearly the best metric out there, though even UZR can be improved. Beyond that, PMR, Zone Rating, and Range are all about the same in terms of how well they track UZR. PMR has the better data, but not the best design; otherwise it would better correlate with UZR. Range and ZR are reliable separately, and spectacular together. DFTs are nice, but not really useful with all these other metrics around.

Defensive evaluation is moving forward, and it won’t be long before it’s accepted just as much as more mainstream sabermetric measures, such as VORP or Runs Created. And the glut of metrics out there is a good thing: There’s a little to be learned from each, and the latest improvements force others to improve their metrics further, and teach us a little more about defensive evaluation each time they do.
