Putting the scissor to defense (Part 1)
Does the world need another defensive metric? Well, I needed one, at least. I will admit that it’s heavily inspired by what has preceded it, and as a system in its infancy it has some ground to make up before it can join the rest. I think there are a few clever little bits in the system that may interest you, though.
Moreover, I hope this is the last time anyone ever has to commence building such a thing from scratch, at least with the data at hand. Why? Because I’ll be providing full source code at the end. So if you’ve ever needed your own defensive metric as well, congratulations, you have one, too.
Two for the price of one
But wait, there’s more! Simple Zone Rating (pronounce it scissor, if you like) is not really one defensive evaluation system but two. SZR began with a simple premise: Try to devise the simplest play-by-play metric possible that would work with all years of Retrosheet data. (The availability of certain parts of the play-by-play data in Retrosheet varies greatly from season to season, particularly the further back in time you go.)
Then I got to wondering: How much of that system could I approximate using simply official fielding statistics? I then proceeded to cheat and start using official batting and pitching stats as well, but the result is still a fielding system that works without play-by-play data. I found this pretty exciting, to be honest.
Basic concepts
Either version of SZR works based upon the idea of judging a fielder’s efficacy by this formula:
Plays made/chances
Defining plays made is relatively trivial: We have a record of putouts and assists that provides a pretty good idea of when an infielder made a play on a batted ball (and that understanding can be enhanced by looking at the play-by-play data Retrosheet provides). Defining a chance proves to be much more bedeviling. So let’s break it down. A player’s fielding chances are:
- His plays made.
- A ball he would have made a play on, if not for an error.
- A ball that was batted cleanly for a hit, but was fieldable by the player.
Assigning hits to fielders is the hard part. What we want to do, in an ideal world, is assign partial credit for a hit to those players most capable of fielding it. For instance, with a ground ball hit between the shortstop and third baseman, we’d assign part credit to each of them. And then ideally we’d compare a fielder to how other fielders at his position would have done.
For an infielder, what we’d ideally like to know about a batted ball when assigning credit is:
- Where it was hit—at the very least, whether it went to left field, right field or center field. Ideally we have some sort of a vector or zone breakdown.
- What type of ball is it? Ground ball? Fly ball? Line drive? Ideally we’d just get a distance and hang time measurement and be left to our own devices here, but such is life.
- Where an infielder was standing when the ball was hit. Nobody I’m aware of tracks this systematically, but it’d be very nice to have.
Unfortunately, we have almost none of that for the majority of the Retrosheet era. In a rather cruel twist (for our purposes, at least), the typical practice prior to about 1989 or so was to record batted ball type for outs but not hits. No, I don’t get the sense of that, either. We also don’t have location data, not even an idea of which outfielder fielded the ball. (Again, that was recorded for outs, but not for hits.) And we obviously don’t have that data for the years before Retrosheet. So what are we to do?
We estimate. In other words, we look at what we do know and see what it can tell us about what we don’t know. Typically this works fine, if for no other reason than that we rely on the typical relationship between what we do know and what we are trying to estimate. As we get to less typical cases, we of course lose accuracy in our estimates. The general hope is that over time, the inaccuracies in our estimates wash out. This won’t always be the case, of course, so it pays to be cautious when applying the results.
My belief is that the perfect is the enemy of the good. Yes, these results are flawed. But so long as we bear that in mind, they’re better than no results. This of course doesn’t absolve me, or anyone, from fixing mistakes or making improvements when possible. But I still think that it’s worthwhile to try, even when we know that perfection isn’t possible.
Play-by-play data
For the Retrosheet years, a play is made when the fielder who originally handled the ball is awarded a putout or assist on a ground ball. That is pretty straightforward and requires no estimation on our part.
Here is how hits are charged to players in SZR for years in which we have play-by-play data. First, the assumption is made that responsibility for hits on balls in play (that is, excluding home runs) is distributed among the positions in proportion to the share of plays recorded by the fielders at each position. This is of course not entirely correct, but any correction to this probably would introduce as much potential error into our figuring as it would remove.
So, for each season, the percentage of plays made at each position was calculated based upon the handedness of the batter and the pitcher—so the percentages with a lefty on the mound facing a righty are different from a lefty facing a lefty or a righty facing a righty. Then, each hit is credited to the fielders who were on the field at the time, based upon this percentage. Let’s look at the table from 1961, for instance.
BAT_HAND | PIT_HAND | FLD1 | FLD2 | FLD3 | FLD4 | FLD5 | FLD6 | FLD7 | FLD8 | FLD9 |
L | L | 0.10 | 0.03 | 0.16 | 0.22 | 0.08 | 0.11 | 0.09 | 0.11 | 0.10 |
L | R | 0.07 | 0.02 | 0.14 | 0.22 | 0.08 | 0.11 | 0.11 | 0.14 | 0.10 |
R | L | 0.07 | 0.03 | 0.07 | 0.12 | 0.15 | 0.19 | 0.09 | 0.15 | 0.14 |
R | R | 0.08 | 0.03 | 0.07 | 0.13 | 0.17 | 0.20 | 0.09 | 0.13 | 0.11 |
So if a left-handed hitter gets a hit off a right-handed pitcher, we assign 14 percent of the credit to the first baseman, 22 percent to the second baseman, 11 percent to the shortstop and 8 percent to the third baseman. Let’s call a player’s totals in this regard his partial hit credits.
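To make the bookkeeping concrete, here is a minimal sketch of how partial hit credits could be tallied from the 1961 table. It is an illustration, not the SZR source code: the function name `partial_hit_credits`, the shape of the `hits` input and the player labels are all assumptions made for the example.

```python
from collections import defaultdict

# Share of plays made by position in 1961, keyed by (batter hand, pitcher hand).
# Values are copied from the table above; list index 0 is position 1 (pitcher)
# and index 8 is position 9 (right field).
PLAY_SHARES_1961 = {
    ("L", "L"): [0.10, 0.03, 0.16, 0.22, 0.08, 0.11, 0.09, 0.11, 0.10],
    ("L", "R"): [0.07, 0.02, 0.14, 0.22, 0.08, 0.11, 0.11, 0.14, 0.10],
    ("R", "L"): [0.07, 0.03, 0.07, 0.12, 0.15, 0.19, 0.09, 0.15, 0.14],
    ("R", "R"): [0.08, 0.03, 0.07, 0.13, 0.17, 0.20, 0.09, 0.13, 0.11],
}


def partial_hit_credits(hits):
    """Sum partial hit credits per (player, position).

    `hits` is an iterable of (bat_hand, pit_hand, fielders) tuples, where
    `fielders` maps a position number (1-9) to whoever was playing there
    when the hit was allowed. This input layout is hypothetical.
    """
    credits = defaultdict(float)
    for bat_hand, pit_hand, fielders in hits:
        shares = PLAY_SHARES_1961[(bat_hand, pit_hand)]
        for position, player in fielders.items():
            credits[(player, position)] += shares[position - 1]
    return dict(credits)


# One hit by a left-handed batter off a right-handed pitcher,
# with made-up labels for the four infielders on the field.
example_hit = ("L", "R", {3: "first", 4: "second", 5: "third", 6: "short"})
infield_credits = partial_hit_credits([example_hit])
```

For that single hit, the first baseman is charged 0.14 of a hit, the second baseman 0.22, the third baseman 0.08 and the shortstop 0.11, matching the percentages above.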
We can further adjust these values for a number of factors. For one, we can look at the groundball tendencies of the hitter and pitcher. Using the log5 method, we can figure the odds of a ball in play being a ground ball or an air ball, and adjust the division of out probability among infielders and outfielders accordingly.
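For readers unfamiliar with log5, the sketch below shows the standard form of the calculation for a batter/pitcher matchup measured against a league rate. The function name `log5_rate` and the sample rates are assumptions for illustration, and exactly how the result is then applied to the infield and outfield shares is not spelled out here, so that step is left out.

```python
def log5_rate(batter_rate, pitcher_rate, league_rate):
    """Expected rate of an event (here, a ground ball) for one matchup.

    All three inputs are rates strictly between 0 and 1. This is the
    usual log5 matchup formula; the inputs below are made up.
    """
    event = batter_rate * pitcher_rate / league_rate
    no_event = (1 - batter_rate) * (1 - pitcher_rate) / (1 - league_rate)
    return event / (event + no_event)


# A groundball-leaning hitter and pitcher (0.50 each) in a league
# where 45 percent of balls in play are ground balls.
groundball_odds = log5_rate(0.50, 0.50, 0.45)
airball_odds = 1 - groundball_odds
```

A useful sanity check on the formula: when the pitcher is exactly league average, the result is simply the batter's own rate, and vice versa.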
From here we can calculate a player’s zone rating as:
Plays made/(Plays made + errors fielding ground balls + partial hit credits)
Or more simply, plays made divided by chances. A player’s plus-minus rating is the difference between his zone rating and the league-average zone rating at his position, multiplied by his chances: (zone rating - league average zone rating) * chances.
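The two calculations can be sketched in a few lines. The function names and the sample totals below are hypothetical, chosen only to show the arithmetic.

```python
def chances(plays_made, groundball_errors, hit_credits):
    """Chances = plays made + errors fielding ground balls + partial hit credits."""
    return plays_made + groundball_errors + hit_credits


def zone_rating(plays_made, groundball_errors, hit_credits):
    """Plays made divided by chances (0.0 if the player had no chances)."""
    total = chances(plays_made, groundball_errors, hit_credits)
    return plays_made / total if total else 0.0


def plus_minus(plays_made, groundball_errors, hit_credits, league_zone_rating):
    """Plays above (or below) what a league-average fielder would have made."""
    total = chances(plays_made, groundball_errors, hit_credits)
    rating = zone_rating(plays_made, groundball_errors, hit_credits)
    return (rating - league_zone_rating) * total


# Made-up season: 300 plays made, 10 groundball errors, 90 partial hit credits,
# against a league-average zone rating of .700 at the position.
season_rating = zone_rating(300, 10, 90)
season_plus_minus = plus_minus(300, 10, 90, 0.700)
```

In this made-up season the player has 400 chances and a .750 zone rating, which works out to roughly 20 plays above average. Equivalently, plus-minus is plays made minus the league rate times chances: 300 - .700 * 400.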
Without play-by-play data
This is where it gets trickier. Let’s start with estimating plays made. I first looked at the average player’s plays made in relation to his putouts and assists, and used those figures to come up with these simple formulas to estimate plays made:
- 1B: .85*A + .08*PO
- 2B: .85*A
- SS: .85*A
- 3B: .90*A + .06*PO
That’s it. No cute claim points or any other complex-looking set of figures. (I’ve yet to determine if this is a plus or a minus for the system.)
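Expressed as code, the formulas above amount to a small lookup table of weights. This is a minimal sketch; the names `PLAYS_MADE_WEIGHTS` and `estimate_plays_made` and the sample fielding line are invented for the example.

```python
# (assist weight, putout weight) by position, taken from the formulas above.
# Second basemen and shortstops get no credit for putouts.
PLAYS_MADE_WEIGHTS = {
    "1B": (0.85, 0.08),
    "2B": (0.85, 0.00),
    "SS": (0.85, 0.00),
    "3B": (0.90, 0.06),
}


def estimate_plays_made(position, assists, putouts):
    """Estimate plays made on batted balls from official fielding totals."""
    assist_weight, putout_weight = PLAYS_MADE_WEIGHTS[position]
    return assist_weight * assists + putout_weight * putouts


# A hypothetical third baseman with 300 assists and 100 putouts.
third_base_plays = estimate_plays_made("3B", 300, 100)
```

For that hypothetical third baseman the estimate is .90 * 300 + .06 * 100, or about 276 plays made.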
So that’s plays made. We’ll treat all errors as chances, again for simplicity’s sake. What about hits?
We first have to assign responsibility for hits at the team level, and then work our way down to the individual player level. We start off in much the same way as we did for assigning hits with the play-by-play data: We figure out the percentage of plays made by each position at the league level, and then for every hit allowed by a team, we assign partial credit to each of the team’s fielding units. (In this case, all of the players who spend time at first base are part of that fielding unit, and so on.)
Then we divide those partial hit credits among the various members of that unit. Now, we use games played at that position to figure out how much playing time a player has had at that position, but we intuitively know that a starting player tends to have more innings per game than a bench player or reserve. So a player’s total plate appearances are used to determine how many innings he should be credited for per game.
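The two-step allocation might look something like the sketch below. Note the assumptions: the playing-time figures are taken as given (estimated innings at the position, however they were derived from games and plate appearances), and every name and number here is hypothetical rather than taken from the SZR code.

```python
def unit_hit_credits(team_hits_in_play, position_share):
    """Step 1: charge a share of the team's non-homer hits to one fielding unit.

    `position_share` is the league-wide fraction of plays made at that position.
    """
    return team_hits_in_play * position_share


def split_among_players(unit_credits, estimated_innings):
    """Step 2: divide the unit's credits in proportion to playing time.

    `estimated_innings` maps each player to his estimated innings at the
    position. How those innings are estimated is an input to this sketch,
    not something it computes.
    """
    total_innings = sum(estimated_innings.values())
    if total_innings == 0:
        return {player: 0.0 for player in estimated_innings}
    return {
        player: unit_credits * innings / total_innings
        for player, innings in estimated_innings.items()
    }


# Made-up team: 1,200 hits on balls in play, with shortstops making
# 11 percent of the league's plays; a starter and a backup share the job.
shortstop_unit = unit_hit_credits(1200, 0.11)
shortstop_split = split_among_players(
    shortstop_unit, {"starter": 1200, "backup": 240}
)
```

With these made-up inputs the shortstop unit is charged about 132 hits, of which roughly 110 go to the starter and 22 to the backup.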
Now that we have both an estimate of plays and chances, we can figure zone rating and plus-minus as above. In both cases, we can convert plays to runs by multiplying by the average run value of a hit (absent homers) relative to an out. That value is typically about .7.
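The conversion is a single multiplication. A minimal sketch, assuming the .7 figure quoted above and a made-up plus-minus total:

```python
# Approximate run value of turning a non-homer hit into an out (from the text).
RUNS_PER_PLAY = 0.7


def plays_to_runs(plus_minus_plays, runs_per_play=RUNS_PER_PLAY):
    """Convert plays above or below average into runs saved or cost."""
    return plus_minus_plays * runs_per_play


# A hypothetical fielder 20 plays above average saves about 14 runs;
# one 10 plays below average costs about 7.
runs_saved = plays_to_runs(20)
runs_cost = plays_to_runs(-10)
```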
What’s past is prologue
This is a first, rough pass at this. Notably absent are, well, outfield rankings. Less notably absent (but still absent, and needed) are groundball/flyball and left-handed/right-handed pitching adjustments for the non-play-by-play data, as well as various subtler adjustments to the play-by-play measures. All of this will come in the fullness of time.
But let’s get a snapshot of the system so far. Values below are a combination of the two systems, using the PBP version when it is available and otherwise using the simpler version. First, the infielders with the highest career totals in plays above average:
NAME | POS | YEARS | CH | PLUS_MINUS |
Brooks Robinson | 5 | 1955-1977 | 8933 | 382.8 |
Germany Smith | 6 | 1884-1898 | 8644 | 289.2 |
Ozzie Smith | 6 | 1978-1996 | 11090 | 286.6 |
Mark Belanger | 6 | 1965-1982 | 7529 | 281.1 |
Jack Glasscock | 6 | 1880-1895 | 7870 | 276.6 |
Bid McPhee | 4 | 1882-1899 | 9585 | 251.3 |
Joe Tinker | 6 | 1902-1916 | 7576 | 251.1 |
Travis Jackson | 6 | 1922-1936 | 5980 | 237.2 |
Graig Nettles | 5 | 1968-1988 | 7734 | 221.4 |
Billy Jurges | 6 | 1931-1947 | 6364 | 218.8 |
Jimmy Collins | 5 | 1895-1908 | 5416 | 218.0 |
Hughie Critz | 4 | 1924-1935 | 6529 | 216.2 |
Art Fletcher | 6 | 1909-1922 | 6592 | 203.6 |
Buddy Bell | 5 | 1972-1989 | 7144 | 199.0 |
Dave Bancroft | 6 | 1915-1930 | 8590 | 187.2 |
Marty Marion | 6 | 1940-1952 | 6105 | 184.7 |
Terry Pendleton | 5 | 1984-1998 | 5729 | 183.5 |
Bill Dahlen | 6 | 1891-1911 | 10335 | 182.6 |
Lou Boudreau | 6 | 1939-1952 | 6034 | 176.6 |
Joe Gordon | 4 | 1938-1950 | 5865 | 172.6 |
And for players who made the most (fewest?) plays below average:
NAME | POS | YEARS | CH | PLUS_MINUS |
Eddie Yost | 5 | 1944-1962 | 5839 | -171.7 |
Derek Jeter | 6 | 1995-2008 | 7121 | -157.8 |
Cub Stricker | 4 | 1882-1893 | 5195 | -148.2 |
Ed McKean | 6 | 1887-1899 | 7470 | -146.7 |
Larry Doyle | 4 | 1907-1920 | 6307 | -135.5 |
Heinie Sand | 6 | 1923-1928 | 3543 | -127.2 |
Dean Palmer | 5 | 1989-2003 | 3144 | -123.3 |
Jim Bottomley | 3 | 1922-1937 | 3614 | -122.9 |
Pinky Higgins | 5 | 1930-1946 | 4985 | -121.8 |
Bill Madlock | 5 | 1973-1987 | 4132 | -119.7 |
Red Kress | 6 | 1927-1940 | 3428 | -115.5 |
Jorge Orta | 4 | 1972-1984 | 2357 | -111.2 |
Tommy Dowd | 4 | 1891-1898 | 1443 | -108.5 |
Bob Aspromonte | 5 | 1960-1971 | 3064 | -105.0 |
Fred McGriff | 3 | 1986-2004 | 4533 | -103.8 |
Milt Stock | 5 | 1914-1925 | 3795 | -101.7 |
Mo Vaughn | 3 | 1991-2003 | 2659 | -100.4 |
Hal Chase | 3 | 1905-1919 | 3821 | -99.9 |
Fresco Thompson | 4 | 1925-1931 | 2894 | -98.8 |
Mickey Vernon | 3 | 1939-1959 | 4413 | -96.0 |
I don’t think that’s too crazy for a first pass. (And oh, look, every sabermetrician’s dream: to rank Derek Jeter’s defense and find it wanting! Truly I have arrived.)
Until next time… be seeing you.
References & Resources
Play-by-play information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.
Seasonal totals come from the Baseball Databank.
Major inspirations for the design of the play-by-play system were TotalZone, Simple Fielding Runs and Ultimate Zone Rating.
The non-PBP metric has been inspired by a number of sources, such as Fielding Win Shares, DRA and Range.
Groundball rates for hitters and pitchers were regressed to the mean before use. I used the weighted average method from Tom Tango, figuring R from the method involving random and observed variance.