Defensive Regression Analysis: Part Two
Editor’s Note: Last year, Michael Humphreys introduced a revolutionary new fielding statistic called Defensive Regression Analysis (DRA), an entirely new way of thinking about fielding stats. DRA uses stats that are available throughout baseball history, so it can be used to evaluate fielders of any era. We consider it a significant improvement over fielding Win Shares.
The original DRA article (pdf) submitted to Baseball Primer is now available. Also, Web Archive has the original Primer articles, with the correct formatting — Parts One, Two, and Three.
In this series of three articles, Michael will explain DRA, use it to evaluate major league fielders from 2001-2003, and compare it to zone-based systems such as Zone Rating and Ultimate Zone Rating in order to verify its accuracy.
The first article has already been published; the third will appear tomorrow.
Explanation of DRA Testing Methodology
A. DRA Test Results for 1999-2001
The November 2003 DRA article on Baseball Primer described in general terms how DRA was developed, but did not include the actual formulas or all of the ideas necessary to replicate the system, because I was considering selling DRA to a team for minor league player evaluation. For a variety of personal reasons, I have now decided instead to publish a book containing the formulas, a complete explanation of their derivation, ratings of all of the best fielders in baseball history, and new (some will probably find them radical) time-line adjustments for changing talent pools.
As the DRA article did not provide explicit formulas, I thought the best way to promote interest in the system would be to show how its results compared with results from what I consider to be the best system, UZR, and to report historical DRA results I had already prepared for 1974-2001. The UZR-DRA comparison included all players who, anytime between 1999 and 2001, played at least two full seasons (at least 130 games at one position, without splitting a season between teams; call these “Full Seasons”). Average per-player DRA ratings over that time period had an approximately 0.7 correlation with corresponding UZR ratings, and approximately the same “scale” (as measured by standard deviation of ratings) as UZR.
Although I believed UZR was the best reference point for testing DRA, I knew it wasn’t perfect, for reasons I’ll explain below. When I adjusted UZR ratings to reflect zone-based DM commentary during that period (DM actually has more detailed individual fielder evaluations, including non-Gold-Glove-quality fielders, for the 1999-2000 seasons than for the 2001-2003 seasons), the correlation rose to slightly over 0.8. A BTF poster called AED (a Ph.D. who said he had also developed a regression-based fielding model, though he has not published it, and I do not believe it provides ratings integrated with pitcher ratings, or ratings denominated in runs) explained that the appropriate thing to do if UZR ratings differed from DM evaluations was just to delete the UZR (and corresponding DRA) ratings from the sample. When I did so, the correlation was still slightly greater than 0.8.
After publishing this study, I came across two new pieces of information that were encouraging. First, Tangotiger wrote that there is so much noise in even the best proprietary zone-based fielding data that one really needs at least two years of full-time UZR ratings to get a reliable rating. So my approach of comparing the average of two Full Seasons (or, if available, three Full Seasons) of UZR and DRA ratings over 1999-2001 made sense. Second, Ken Ross, baseball fan and former President of the Mathematical Association of America, was, in his own words, “fearless” enough to actually provide a “rough” estimate of what a “strong” positive correlation should in general be: 0.7 (A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans, Pi Press, 2004, p. 127). So I felt more comfortable saying that DRA had a strong correlation (at least 0.7) with UZR ratings (and probably a 0.8 correlation with “correct” UZR ratings) derived from an appropriate sample (players with at least two Full Seasons of ratings).
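For readers who want to replicate this kind of comparison, here is a minimal sketch in Python. The player names and ratings below are hypothetical placeholders, not the actual 1999-2001 sample: each player’s Full Season ratings are averaged first, and the two sets of averages are then compared by Pearson correlation and by standard deviation (the “scale” measure used above).

```python
# Minimal sketch of the DRA-UZR comparison described above.
# The player names and ratings are hypothetical placeholders.
from statistics import mean, pstdev

# Per-player Full Season ratings (runs saved): each player
# contributes two or three qualifying seasons from 1999-2001.
ratings = {
    "Player A": {"uzr": [12, 18], "dra": [10, 15]},
    "Player B": {"uzr": [-8, -14, -5], "dra": [-11, -9, -6]},
    "Player C": {"uzr": [3, -2], "dra": [5, 1]},
}

# Average each player's seasons before comparing systems.
uzr = [mean(p["uzr"]) for p in ratings.values()]
dra = [mean(p["dra"]) for p in ratings.values()]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(f"correlation: {pearson_r(uzr, dra):.2f}")
# Similar standard deviations mean the two systems spread players
# across a similar range of runs (the "scale" comparison above).
print(f"UZR sd: {pstdev(uzr):.1f}  DRA sd: {pstdev(dra):.1f}")
```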
B. Motivation for New DRA Test
As I began compiling 1893-2003 Retrosheet data for the historical ratings section of my book, I thought I should try one more time to see if I could improve DRA, which seemed to have some difficulties rating third basemen and right fielders in the 2001-2003 study. I theorized that the problem was that traditional baseball statistics don’t provide ground out and fly out data for left-handed and right-handed pitchers separately, but that there were a few potential indirect approaches to addressing the problem. (ESPN may report actual ground ball and fly ball data for contemporary pitchers, so perfect lefty ground ball/fly ball and righty ground ball/fly ball calculations can be done for recent seasons, but the goal of DRA has always been to provide a system that works throughout major league history, as well as for minor league and non-U.S. prospects.)
Furthermore, close analysis of 1993-2003 Retrosheet data suggested that the indirect effect of left-handed pitching had drastically declined after 1995 in the National League, reflecting the increased use of LOOGies and ROOGies, as well as the shortage of utility and platoon players, so the left-handed pitching adjustment (independent of ground ball / fly ball adjustments) should have been different in 1999-2001. (As you probably know, left-handed pitching doesn’t “cause” more ground outs to the left side and more fly outs to the right side; rather, if an opposing team has enough players to platoon, left-handed pitchers will face more right-handed batters, who will generate more ground outs to the left side and fly outs to the right side than left-handed batters, regardless of the handedness of the pitcher.)
I made some interesting discoveries, which I’ll discuss in the book, including a few that seem to have improved ratings at third base, but there are still lingering problems in right field (though the same is true for ZR and DFT, which seem to do even worse there). At first base, the fundamental problem for all but zone-based systems, identified a few years ago by Charlie Saeger and Bill James, is that we not only don’t know the number of fielding opportunities at first base (batted balls fieldable at first — that problem is true at all positions for non-zone-based systems); we don’t even know precisely the number of batted balls fielded by first basemen, still less the number of ground balls fielded by first basemen (pop-ups and most fly outs caught by infielders are usually discretionary chances). We can make estimates, and they’re getting better, but non-zone-based first base ratings are probably only really reliable over time periods longer than two or three years.
C. Accuracy and Reliability of UZR
UZR is the best system out there, and Mitchel Lichtman deserves the thanks of all baseball fans for paying the multi-thousand-dollar price for the necessary data and generating zone-based run-savings ratings when no one else in baseball was willing to. However, UZR ratings are derived from extremely large and complicated data sets, and Mitchel was not then part of an organization that could provide auditing and debugging backup. So it’s not surprising that there have been errors in the past in how the UZR data has been processed. (I know; I helped Mitchel fix a formula that had almost eliminated the year-to-year consistency of per-player UZR ratings.) Even now, a few individual UZR ratings are clearly wrong. I am certain that I have made errors in compiling and analyzing my data; we’re all human, and we’re not getting paid to do this.
A few examples might illustrate the need to double-check UZR. As mentioned above, DM uses zone data similar to UZR’s and actually provides (in its team essays) commentary for most full-time fielders, whether good, bad or indifferent, for the 1999 and 2000 seasons, whereas the 2001-03 “Gold Glove” essays primarily address only the best fielders. So there is a large sample of 1999-2000 full-time fielders who have both a UZR rating and a highly specific DM evaluation based primarily on high-quality zone data.
The 1999 UZR rating for shortstop Rey Ordonez is +39 runs. That would translate into about 55 plays above average. Yet in its detailed evaluation of Rey Ordonez’s 1999 season, DM says, “Error totals aren’t usually a good indication of fielding prowess, but the four errors charged against Ordonez were impressive nonetheless.” Not a word about what would be possibly the greatest range performance at shortstop in baseball history, if UZR were correct. And DM acknowledges elsewhere that marginal plays measured using the best zone data can be quite high:
In a typical season, the top fielders at each position make 25-30 more plays than the average. Exceptional fielders have posted marks as high as 40-60 net plays, but those are fairly uncommon. Recent examples include Darin Erstad in 2002, Scott Rolen just about every year, and Andruw Jones in his better seasons. The worst fielders tend to be in the minus 25-40 range. DM, “Evaluating Defense.”
The 2001 UZR rating for Ordonez, albeit after an injury-shortened 2000, is –6. DM (correctly) identifies Rey Sanchez as the best shortstop in 1999 (UZR +31, DRA +33).
The 2000 UZR rating for centerfielder Doug Glanville is -32 runs: a truly catastrophic performance. This is what DM says about that season: “Glanville’s speed is still his best asset. He is a very good base stealer and has been a strong performer in center field, though 2000 was far from his best year defensively. Glanville made just four errors in 2000 and threw out nine runners.” It is true that speed alone won’t make a good outfielder (Lou Brock was a poor outfielder and Tim Raines nothing special). But an established “strong” (presumably well-above-average) performer in centerfield whose stolen base attempts and success rate have not declined will not cost his team 32 runs, even if his performance is “far from his best.” The 2000 Glanville DRA rating (not including arm) is -3. Another quick example in centerfield: the 2000 Gerald Williams UZR rating is -42. Yes, minus forty-two runs, close to sixty plays below average. DM says, “Playing average-at-best defense.” DRA rating: +3.
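For readers checking the arithmetic: the runs-to-plays conversion is never stated explicitly here, but the figures above are consistent with a factor of roughly 0.7 runs per net play (about the linear-weights value of turning a hit into an out):

$$\text{net plays} \approx \frac{\text{runs saved}}{0.7}, \qquad \frac{+39}{0.7} \approx +55, \qquad \frac{-42}{0.7} \approx -60.$$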
A close review of UZR ratings and detailed DM commentary for 1999 and 2000 suggests that about a third of the single-season centerfielder assessments (I haven’t checked left and right field) and about a quarter of the single-season infielder assessments are clearly inconsistent, meaning that UZR and DM are effectively well more than 10 runs apart, i.e., UZR gives a poor rating (say, -15 or -25), whereas DM says the player was average or only subpar; or UZR gives a Gold-Glove type rating (+15 to +25 or more) and DM had nothing positive to say. Remember: DM looks at the same kind of zone data used by UZR. Yet Tangotiger was onto something when he suggested using at least two years of UZR ratings, because after two years, UZR is (at least as far as I can tell) very, very close to the broad consensus of other zone-like evaluations, such as DM, PMR or ZR, about 90% of the time.
D. Procedure for Determining Data Points to Delete
In spite of these difficulties, I still must say (for at least the third time) that UZR is the best 2001-03 resource for fans. It just has to be considered with care. In trying to determine when a UZR (and corresponding DRA, DFT and ZR) rating needed to be deleted from the sample, I finally settled upon the following principles, based on AED’s advice (a brief sketch in code follows the list):
(1) If a UZR rating differs from DRA by more than one (UZR) standard deviation (12 runs), the rating is flagged.
(2) If non-UZR zone-based assessments agree with DRA more than with UZR, the player’s rating is deleted from the sample, otherwise it remains.
(3) In evaluating the non-UZR zone-based assessments, I consider the following sources of information, in descending order of reliability and relevance: (a) DM “Gold Glove” essays for 2001-03, (b) DM individual player evaluations in 1999 and 2000, (c) PMR ratings in 2003 and 2004, (d) ZR in 2001-03, and, as a last resort if no other information is available (e) anything I can think of, including rates or changes in rates of hitting triples, attempting steals, or getting caught stealing.
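Here is the promised sketch of the flag-and-delete procedure. The 12-run threshold comes straight from principle (1); reducing principle (2)’s qualitative “agrees with” judgment to simple numeric closeness is a simplification for illustration, and the values in the example are hypothetical.

```python
# Minimal sketch of the flag-and-delete procedure described above.
UZR_SD = 12  # one UZR standard deviation, in runs (principle (1))

def keep_rating(uzr, dra, zone_assessments):
    """Return True if the player's ratings stay in the sample.

    zone_assessments: non-UZR zone-based run estimates (DM, PMR,
    ZR, ...), consulted in descending order of reliability.
    """
    # Principle (1): flag a rating only when UZR and DRA disagree
    # by more than one UZR standard deviation.
    if abs(uzr - dra) <= UZR_SD:
        return True
    # Principle (2): delete a flagged rating if the independent
    # zone-based assessments side with DRA rather than with UZR.
    # "Sides with" is simplified here to numeric closeness.
    if not zone_assessments:
        return True  # no outside evidence, so the rating stays
    consensus = sum(zone_assessments) / len(zone_assessments)
    return abs(consensus - uzr) <= abs(consensus - dra)

# Hypothetical flagged case: UZR +39, DRA +5, with outside
# zone-based estimates suggesting a merely good season.
print(keep_rating(39, 5, [8, 12]))  # False: delete from the sample
```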
This will all become clearer as we get down to cases. The important things to bear in mind are:
(1) All results are disclosed, including the ones I believe should be deleted.
(2) All deletions are explained.
(3) I considered and analyzed many “rules” for deletion, including some that might be considered more rigorous by a trained statistician, but I wanted a procedure that would require minimal explanation for purposes of this article (I may, depending on the patience of editors, provide a more rigorous but far more elaborate procedure for the book).
(4) Under every deletion methodology I considered, the overall DRA correlation with UZR (excluding first base) was always between 0.69 and 0.79; DFT’s was always between 0.54 and 0.66; ZR’s was always right around 0.67. In every case, DRA had a higher correlation than ZR and DFT. For example, when I followed the AED algorithm but applied it to DFT (deleting UZR ratings inconsistent with DFT and other zone-based systems, without regard to DRA), the overall DFT correlation did not increase meaningfully (it rose only to 0.60), and though the DRA correlation did decline (to 0.73), it was still clearly higher than DFT’s. (A sketch of this check follows the list.)
(5) I have probably spent more time studying all full-time player UZR, DFT, ZR, PMR and DM evaluations from 1999-2004 than anyone else, certainly any sane person, including the creators/authors of each of those systems. My firm conviction is that the DRA “correlation” with the “truth” is closer to 0.8 than 0.7. If you’re inclined to disagree, all the information I have relied upon is at your disposal, so you can make your own assessment. What is undeniable, however, is that DRA’s standard deviation numbers, as you will see, are clearly more in line with UZR than those of any other system (though still slightly more conservative than UZR), which leads me to my last point.
(6) FWS and FLW are not included in the comparison, because they are less accurate than DFT on inspection. FLW outfield ratings, as a casual look through Total Baseball will reveal, are at least 50% too “compressed” compared with infielder ratings, which are getting better but still effectively double-count double plays and errors and overemphasize putouts. All FWS ratings are compressed about 50% too much. In addition, the outfielder ratings have a poor correlation with UZR, based on a 2003 study available at Raindrops.
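Returning to point (4): the sensitivity check amounts to re-running the correlation after each deletion rule. Here is a minimal sketch reusing pearson_r and keep_rating from the sketches above; the sample rows would be the real per-player data, and the field names are illustrative.

```python
# Sketch of the point-(4) sensitivity check. Assumes pearson_r and
# keep_rating from the earlier sketches; each row of `sample` holds
# one player's averaged ratings plus outside zone-based estimates,
# e.g. {"uzr": 12.0, "dra": 9.5, "dft": 7.0, "zone": [10, 14]}.
def correlations_after_deletion(sample, reference_key):
    """Apply the deletion rule with `reference_key` ('dra' or 'dft')
    as the reference system, then return each candidate system's
    correlation with the surviving UZR ratings."""
    kept = [row for row in sample
            if keep_rating(row["uzr"], row[reference_key], row["zone"])]
    uzr = [row["uzr"] for row in kept]
    return {key: pearson_r(uzr, [row[key] for row in kept])
            for key in ("dra", "dft")}
```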
Part Three will be published tomorrow and will include complete DRA test results for 2001-03.
References & Resources
I’d like to thank Dick Cramer for his support in the past, Mitchel Lichtman for creating UZR, and the baseball analyst Tangotiger for making detailed UZR output available in a convenient form. I’d especially like to thank the folks at Retrosheet:
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.
There is one more absolutely necessary acknowledgement: my own fallibility. In creating DRA and tracking the results of other fielding systems, I had to do a tremendous amount of cutting and pasting and hand-coding of data. I have done my best, but I’m sure there are some errors, though I don’t believe any of them are significant.
I look forward to hearing from you. Don’t hesitate to e-mail with questions, criticisms and corrections.