Who watches the watchers?

by Colin Wyers
May 21, 2009

Nothing will rile a fan up quicker than the notion that his team has been cheated out of a victory, a run, or the outcome of a single plate appearance through the fault of poor officiating. But how much of an effect can poor umpires have? Let’s focus in on one particular aspect of umpiring and look at how the home plate umpire calls his strike zone.

Before beginning, one caveat is that we absolutely cannot argue the umpire’s strike zone based upon what we see on television. Why not? Because of parallax. Because the center field camera used in TV broadcasts is neither in straightaway center nor at ground level, it cannot give an accurate impression of where the ball crosses the plate. It’s the same reason you can’t accurately read the speedometer of a car while in the passenger seat—in order to correctly judge the position of the needle, you need to be looking straight on.

We could, of course, look at ball tracking data, like that provided by PITCHf/x. But that only tells us whether a pitch is called a ball or strike. It doesn’t tell us whether balls and strikes are being called accurately, at least not initially. And we only have PITCHf/x data for the past two years, so it tells us nothing about baseball before 2007. So can we judge how well an umpire judges the strike zone without pitch tracking data? We can certainly try.

Unfortunately, we can’t go out there and perform controlled experiments, repeating the same at-bat over and over again with different umpires. So what we want to try and do is compare batter-pitcher matchups that are similar in every way except for the umpire behind the plate. For instance, we can look at all plate appearances between Keith Foulke and Michael Young in Arlington in 2008, and break down the results by who was umpiring. If we want to evaluate one particular umpire, we compare the results when he was umpiring that specific set of players (and environment) to all other umpires who saw that same set of players. Analyst Tom Tango likes to call this a “with or without you” approach. In this case, we look at how a specific batter-pitcher matchup resolves itself with and without the umpire in question.

Given enough of these matched pairs over enough plate appearances, we can safely assume that the only change between the two sets of data is the umpire involved, and that any difference in walk rate between the two groups is due to the change in umpiring. That gives us a baseline for comparison, though this won’t tell us who is calling the “correct” strike zone unless the average umpire is calling the zone correctly, of course. But it will at least give us a yardstick against which we can measure umpires.
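To make the matched-pairs idea concrete, here is a minimal sketch of a "with or without you" comparison for walk rate. The record layout, the function name, and the choice to weight each matchup by the smaller of its two sample sizes are all assumptions made for illustration; the article does not say how the matchups were pooled.

```python
from collections import defaultdict

def wowy_walk_diff(plate_appearances, umpire):
    """Walk rate with `umpire` minus walk rate without him, over matched groups.

    Each record is a hypothetical tuple:
    (season, batter, pitcher, park, umpire_name, walked).
    """
    # matchup key -> [PA with, BB with, PA without, BB without]
    groups = defaultdict(lambda: [0, 0, 0, 0])
    for season, batter, pitcher, park, ump, walked in plate_appearances:
        counts = groups[(season, batter, pitcher, park)]
        offset = 0 if ump == umpire else 2
        counts[offset] += 1
        counts[offset + 1] += int(walked)

    weight_sum = with_sum = without_sum = 0.0
    for pa_w, bb_w, pa_wo, bb_wo in groups.values():
        if pa_w == 0 or pa_wo == 0:
            continue  # matchup not seen on both sides, so nothing to compare
        weight = min(pa_w, pa_wo)  # assumed weighting, not taken from the article
        with_sum += weight * bb_w / pa_w
        without_sum += weight * bb_wo / pa_wo
        weight_sum += weight

    if weight_sum == 0:
        return None  # the umpire shares no matchups with anyone else
    return (with_sum - without_sum) / weight_sum


# Made-up records, only to show the call shape.
sample = [
    (2008, "Young", "Foulke", "ARL", "Ump A", True),
    (2008, "Young", "Foulke", "ARL", "Ump A", False),
    (2008, "Young", "Foulke", "ARL", "Ump B", False),
    (2008, "Young", "Foulke", "ARL", "Ump C", False),
    (2008, "Other", "Pitcher", "ARL", "Ump A", True),  # no "without" sample
]
print(wowy_walk_diff(sample, "Ump A"))
```

The same function shape would work for strikeouts or hits by swapping the outcome flag.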

(Now, thanks to the fine folks at Retrosheet.org, we have play-by-play data going all the way back to 1953. To keep results relevant to our modern offensive context (and to keep my computer from spewing smoke out of its fan), I limited the query set to 1993 and beyond.)

Looking at all umpires with at least 1,000 plate appearances observed in the study, here’s what we get for the greatest difference between umpires in rates per plate appearance:

	BB_DIFF	SO_DIFF	1B_DIFF	2B_DIFF	3B_DIFF	HR_DIFF
Min.	-0.02	-0.03	-0.02	-0.02	0.00	-0.01
Max.	0.02	0.02	0.03	0.01	0.01	0.01
Range	0.04	0.05	0.05	0.03	0.01	0.02
Per650	28.6	33.2	31.0	18.3	6.1	16.0

So what does this table mean? In the BB_DIFF column, the “Min.” figure means the umpire least likely to call a walk called .02 fewer walks per plate appearance relative to average; the “Max.” figure means the umpire most likely to call a walk called .02 more walks per plate appearance relative to average. “Range” is the difference between min and max, and the Per650 of 28.6 means that the ump with the largest strike zone would call roughly 29 fewer walks than the ump with the smallest strike zone over 650 plate appearances, which is the average number of PAs for an MLB starting player.
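A quick arithmetic check on the Per650 row: multiplying the displayed Range of 0.04 by 650 gives 26, not 28.6, so the Per650 figures were presumably computed from unrounded rates. The sketch below backs out the per-PA range implied by each Per650 value and confirms it rounds to the Range shown. The dictionary and function names are only illustrative.

```python
PA_PER_SEASON = 650

# Values copied from the table above.
per650 = {"BB": 28.6, "SO": 33.2, "1B": 31.0, "2B": 18.3, "3B": 6.1, "HR": 16.0}
displayed_range = {"BB": 0.04, "SO": 0.05, "1B": 0.05, "2B": 0.03, "3B": 0.01, "HR": 0.02}

def implied_rate(per650_value, pa=PA_PER_SEASON):
    """Per-PA rate implied by a per-650-PA count."""
    return per650_value / pa

for stat, count in per650.items():
    rate = implied_rate(count)
    print(f"{stat}: implied range {rate:.4f}, displayed {displayed_range[stat]:.2f}")
```

For walks, the implied range is about 0.044 per plate appearance, which is consistent with the 0.04 in the table after rounding.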

Now, bear in mind that these are the most drastic differences observed between umpires with significant time behind the dish; the difference between any two umpires is likely to be much smaller than that. We can measure that using standard deviation, which will tell us the range of nearly 70 percent of the population, assuming that umpiring talent is normally distributed.

	BB_DIFF	SO_DIFF	1B_DIFF	2B_DIFF	3B_DIFF	HR_DIFF
StdDev	0.01	0.01	0.01	0.00	0.00	0.00
Per650	4.6	6.1	5.2	3.1	1.0	2.7

In other words, over 650 plate appearances, a typical umpire will be within 4.6 walks and 6.1 strikeouts of the average, plus or minus.
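The "nearly 70 percent" figure is the share of a normal distribution that falls within one standard deviation of its mean. A short check, assuming (as the text does) that the differences between umpires are normally distributed:

```python
import math

def share_within(k_sd):
    """Share of a normal distribution within k_sd standard deviations of the mean."""
    return math.erf(k_sd / math.sqrt(2))

print(f"{share_within(1):.1%}")  # prints 68.3%
```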

Now, it’s easy to conceive of how an umpire can affect the walk and strikeout rates of the hitters and pitchers involved, but what about the singles and home runs? It’s important to bear in mind that baseball players are not automatons, but thinking people who learn and adapt to their circumstances. If a hitter knows that a certain umpire calls a wider strike zone, he’ll swing at more outside pitches, which will lead to increased strikeouts, but also lead to weaker contact and fewer hits. Conversely, if a pitcher is facing an umpire with a smaller zone, he’ll be forced to stay closer to the heart of the zone and give hitters a meatier pitch to hit.

So which umpires come closest to calling the average strike zone, and which are furthest away? We can figure it out using the Pythagorean Theorem (the one dealing with triangles, not the one dealing with run differential). Essentially, we figure out the distance of each rate stat from the average, and then treat those distances as the sides of a triangle. (Okay, so an n-dimensional figure in space.) This is essentially the way PECOTA figures its similarity scores. For ease of use, they are placed on a scale of 100 to 0, where 100 is identical and 0 is not similar at all.
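In other words, the distance is a Euclidean one: square each rate difference, add them up, and take the square root. A minimal sketch follows. It assumes the six per-PA differences from the tables above are the inputs and that they are unweighted, and the example numbers are made up rather than taken from any umpire.

```python
import math

def distance_from_average(diffs):
    """Euclidean distance of an umpire's rate differences from the average (all zeros)."""
    return math.sqrt(sum(d * d for d in diffs))

# Hypothetical umpire: BB, SO, 1B, 2B, 3B, HR differences per plate appearance.
example_diffs = [0.003, -0.004, 0.0, 0.0, 0.0, 0.0]
print(distance_from_average(example_diffs))
```

With only two nonzero differences this reduces to the familiar 3-4-5 right triangle, giving a distance of about 0.005.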

Closest to the average zone:

Umpire	PA	Sim
Mike Winters	12459	96
Dale Ford	4060	96
Mike Everitt	8460	95
Larry Vanover	10406	95
Jim Joyce	10980	94
Tim Welke	11608	93
Ted Barrett	10364	93
Chuck Meriwether	12277	93
Joe West	10270	93
Mike Reilly	12292	93

Farthest from the average zone:

Umpire	PA	Sim
Greg Bonin	4749	74
John McSherry	1665	74
Larry McCoy	4056	73
Adrian Johnson	1173	71
Scott Higgins	1003	70
Mike Vanvleet	1381	69
Matt Hollowell	2917	69
Jim Evans	4694	67
Kevin Kelley	1355	66
Vic Voltaggio	1197	66

We should note that the umpires closest to the average zone have far more observed plate appearances than the umpires farthest away; MLB seems to do a pretty good job of weeding out the umpires inconsistent with the others over time.

One more thing for you to take from this article: like the weather and the park, the umpire is something both teams have to deal with. In a particular game, sure, it can make a difference between winning and losing for a team. But over the course of the season, well, it typically evens out.

References & Resources
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.

The formula for converting the distance estimates to more palatable sim scores is:

(1-x*10)*100

This is based on trial and error, and isn’t supposed to represent anything particularly meaningful.
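As a small illustration, the conversion and its inverse can be written as below, where x is the distance from the average zone. The function names are my own. Note that the scores in the tables are rounded, so the inverse only recovers an approximate distance.

```python
def sim_score(distance):
    """Convert a distance x to a sim score using (1 - x*10) * 100."""
    return (1 - distance * 10) * 100

def distance_from_sim(sim):
    """Invert the formula: the distance that produces a given sim score."""
    return (1 - sim / 100) / 10

print(sim_score(0.004))
print(distance_from_sim(66))
```

A sim score of 96 corresponds to a distance of about 0.004, and a score of 66 to about 0.034. Any distance above 0.1 would produce a negative score under this formula.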
