Scouting the Minors Pitch by Pitch: Modeling Infield Defense | The Hardball Times

# Scouting the Minors Pitch by Pitch: Modeling Infield Defense

Can we measure Amed Rosario’s defensive impact? (via slgckgc)

Eli Ben-Porat’s Scouting the Minors Pitch by Pitch Series: Nov. 22, 2016: Swinging Strike %; Dec. 13, 2016: Power

Today, we’ll attempt to build a play-by-play model of defense for minor league players and test it against major league talent. We’ll first build the model step by step, with the assistance of tables, charts and graphs, then explore the results visually and determine whether we have built anything worthwhile. This is neither the first defensive metric system nor the last. The methodology likely borrows concepts from other models, but it is essentially custom built to work with the data available, rather than a rehash of more mature models. We’ll compare our model to UZR to see whether, generally speaking, the two produce similar results despite their different methodologies. The comparison is a simple way to test whether this comparatively crude model is on the right track; where the two disagree, I would err on the side of UZR.

### Overview of The Model

Our model will borrow from a very overrated hockey metric, save %. Essentially, we’re going to establish a zone that each fielder is responsible for and determine the average out% within it, generate a formula that estimates how out% varies with spray angle, and then convert the result into another hockey metric (+/-) that shows how far a player’s out% in his fielding zone exceeds or trails the expected out% for that zone. Here is the methodology in a little greater detail:

1. Determine the median fielding position for 3B, 2B, 1B and SS for all GB outs that traveled at least 50 feet. Split by LHH and RHH.
2. Determine “fielding zones” for which each fielder is responsible. These zones will be based on any ground balls above 50 feet and split by spray angle.
3. Eliminate obvious shifts that may distort the data, for example, a ball hit to the first base side that is fielded by the 3B.
4. Model the distribution of out% based on the absolute difference between the spray angle and the median angle, within the fielding zone.
5. Compute a fielder’s average +/- out% within their fielding zone, with positive indicating an above-average fielder.

The underlying assumption is that if we model a fielder based solely on the balls he controls, we mitigate the effect that a really poor or really strong neighboring fielder might have on that player’s fielding metrics. Additionally, it makes intuitive sense to assign all of the blame or credit to a fielder when the ball is in their zone of control, even if, mathematically, the responsibility is often shared. Let’s begin.
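To make step 5 concrete, here is a minimal sketch of the +/- bookkeeping. It assumes each ground ball in a fielder's zone already carries a modeled out probability; the `GroundBall` record, the player labels and the sample numbers are all hypothetical and are not taken from the data behind this article.

```python
from dataclasses import dataclass


@dataclass
class GroundBall:
    fielder: str         # hypothetical player label
    expected_out: float  # modeled out probability for this ball, 0..1
    was_out: bool        # whether the out was actually recorded


def plus_minus(balls):
    """Return {fielder: (total +/-, average +/- per chance, chances)}."""
    totals = {}
    for ball in balls:
        total, chances = totals.get(ball.fielder, (0.0, 0))
        # Credit is 1 for an out and 0 otherwise, minus what was expected.
        credit = (1.0 if ball.was_out else 0.0) - ball.expected_out
        totals[ball.fielder] = (total + credit, chances + 1)
    return {f: (t, t / n, n) for f, (t, n) in totals.items()}


# Made-up sample: two shortstops, five ground balls in their zones.
sample = [
    GroundBall("SS_A", 0.90, True),
    GroundBall("SS_A", 0.50, True),
    GroundBall("SS_A", 0.30, False),
    GroundBall("SS_B", 0.90, False),
    GroundBall("SS_B", 0.50, True),
]
result = plus_minus(sample)
```

In this toy sample, the total is the "plays above average" figure and the per-chance average is the +/- %; both reappear in the UZR comparisons later on.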

### A Macro View of the Data (Ground Balls)

SLUGGING ON BALLS IN PLAY – Rookie Ball to MLB
Ground Balls (Distance >= 50 Feet)

| Spray Angle | Majors | AAA | AA | A+ | A | A- | R |
|---|---|---|---|---|---|---|---|
| 0 | 0.6820 | 0.7075 | 0.7260 | 0.3062 | 0.2823 | 0.2460 | 0.1680 |
| 3 | 0.8078 | 0.7880 | 0.8251 | 0.5475 | 0.4987 | 0.4669 | 0.3203 |
| 6 | 0.2409 | 0.2688 | 0.2839 | 0.3721 | 0.3408 | 0.3694 | 0.2986 |
| 9 | 0.1291 | 0.1590 | 0.1649 | 0.3302 | 0.3177 | 0.3469 | 0.3411 |
| 12 | 0.1817 | 0.1777 | 0.1921 | 0.2983 | 0.2901 | 0.2920 | 0.3034 |
| 15 | 0.3225 | 0.3128 | 0.2995 | 0.3351 | 0.3296 | 0.3129 | 0.2914 |
| 18 | 0.3726 | 0.3898 | 0.3531 | 0.3209 | 0.3074 | 0.2866 | 0.2530 |
| 21 | 0.3201 | 0.3183 | 0.3118 | 0.2604 | 0.2564 | 0.2488 | 0.1979 |
| 24 | 0.1791 | 0.1578 | 0.1722 | 0.1787 | 0.1826 | 0.1825 | 0.1701 |
| 27 | 0.1085 | 0.0945 | 0.0993 | 0.1357 | 0.1321 | 0.1479 | 0.1459 |
| 30 | 0.0821 | 0.0652 | 0.0710 | 0.1167 | 0.1106 | 0.1184 | 0.1416 |
| 33 | 0.1116 | 0.0910 | 0.0889 | 0.1287 | 0.1325 | 0.1445 | 0.1592 |
| 36 | 0.2641 | 0.2267 | 0.1979 | 0.2011 | 0.2148 | 0.2126 | 0.2407 |
| 39 | 0.5262 | 0.4717 | 0.4352 | 0.3402 | 0.3706 | 0.4173 | 0.3873 |
| 42 | 0.7280 | 0.7236 | 0.6854 | 0.5853 | 0.6292 | 0.6447 | 0.6583 |
| 45 | 0.7337 | 0.7734 | 0.7838 | 0.7484 | 0.7912 | 0.8258 | 0.8475 |
| 48 | 0.7588 | 0.7985 | 0.7995 | 0.7048 | 0.7138 | 0.7385 | 0.7860 |
| 51 | 0.6486 | 0.6901 | 0.6611 | 0.5271 | 0.5387 | 0.5310 | 0.6102 |
| 54 | 0.4842 | 0.5245 | 0.4541 | 0.3931 | 0.3961 | 0.4215 | 0.4161 |
| 57 | 0.2250 | 0.2260 | 0.2071 | 0.2065 | 0.2197 | 0.2432 | 0.2249 |
| 60 | 0.1259 | 0.1223 | 0.1117 | 0.1491 | 0.1409 | 0.1714 | 0.1763 |
| 63 | 0.1078 | 0.0834 | 0.0845 | 0.1196 | 0.1147 | 0.1437 | 0.1397 |
| 66 | 0.0923 | 0.0819 | 0.0774 | 0.1117 | 0.1124 | 0.1185 | 0.1240 |
| 69 | 0.1693 | 0.1637 | 0.1495 | 0.1413 | 0.1374 | 0.1411 | 0.1334 |
| 72 | 0.2425 | 0.2513 | 0.2435 | 0.2052 | 0.1984 | 0.1928 | 0.1759 |
| 75 | 0.2922 | 0.3202 | 0.3315 | 0.2759 | 0.2815 | 0.2779 | 0.2350 |
| 78 | 0.2973 | 0.3364 | 0.3504 | 0.3370 | 0.3498 | 0.3376 | 0.2787 |
| 81 | 0.2279 | 0.2623 | 0.2698 | 0.3449 | 0.3567 | 0.3712 | 0.3243 |
| 84 | 0.1610 | 0.1981 | 0.2004 | 0.2985 | 0.3327 | 0.3254 | 0.3532 |
| 87 | 0.3164 | 0.3351 | 0.3369 | 0.3566 | 0.3612 | 0.3528 | 0.4187 |
| 90 | 0.4721 | 0.4788 | 0.4990 | 0.3535 | 0.3162 | 0.3057 | 0.2587 |

0 = 3rd Base | 90 = 1st Base

We begin with a high-level view of the data in tabular form. The yellow highlight represents peak slugging, the pink represents valleys in slugging, essentially where the fielders are located. This should give us a fairly good representation of how valuable a ground ball is, based on its spray angle. Value peaks down the lines as well as right up the middle. We lack the ability to assess quality of contact beyond measuring how far the ball traveled before it was fielded, which is why the table is filtered to ground balls hit at least 50 feet. The table doesn’t distinguish between lefties and righties, so we’ll break that down in chart form, and again when we use the more granular data that feeds the fielder ratings.
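For readers who want to rebuild a table like this from their own play-by-play data, the sketch below shows one way to do the aggregation. The input tuple layout is hypothetical, and the rule that assigns an angle to the nearest 3-degree row is my assumption, since the binning rule is not spelled out here.

```python
from collections import defaultdict

BUCKET_WIDTH = 3      # degrees, matching the 0, 3, ..., 90 rows of the table
MIN_DISTANCE = 50.0   # feet, the ground-ball filter used throughout


def bucket(angle):
    """Assign a spray angle (0 = 3B line, 90 = 1B line) to the nearest row.

    Nearest-row assignment is an assumption; the table could equally have
    been built by flooring each angle to its 3-degree bin.
    """
    nearest = int(angle / BUCKET_WIDTH + 0.5) * BUCKET_WIDTH
    return max(0, min(90, nearest))


def slugging_by_bucket(balls):
    """balls: iterable of (spray_angle_deg, distance_ft, total_bases)."""
    bases = defaultdict(int)
    count = defaultdict(int)
    for angle, distance, total_bases in balls:
        if distance < MIN_DISTANCE:
            continue  # skip balls fielded short of 50 feet
        b = bucket(angle)
        bases[b] += total_bases
        count[b] += 1
    return {b: bases[b] / count[b] for b in sorted(count)}


# Made-up ground balls: (angle, distance, total bases on the play).
sample = [
    (1.0, 120.0, 2), (0.4, 80.0, 0),   # down the third base line
    (29.5, 95.0, 0), (31.0, 60.0, 1),  # toward the shortstop
    (44.0, 40.0, 1),                   # too short, filtered out
    (46.0, 150.0, 1),                  # up the middle
]
table = slugging_by_bucket(sample)
```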

### How Consistent is the Data From Level to Level?

We see extremely consistent patterns between MLB, AAA and AA as well as between Class A and Rookie ball. Generally speaking, the data quality at AA and above is superior to the data below, which is likely what is going on here; however, despite the imperfect data, it actually can yield information, as we saw in the prior articles. Note that lefties experience a boost in performance down the left field line, but this advantage really only shows up at AA and above. The consistent patterns lend confidence to the data – here’s a snapshot with the distribution curves amalgamated:

These charts clearly demonstrate that there is a consistent effect at AA and above, as well as below that. Given the large sample size we’re dealing with, it’s encouraging to see that the data is extremely consistent from level to level. What the charts also demonstrate is the extreme probability shift that a couple of degrees can make, suggesting that we will need a methodology to correct for minor recording errors in the x,y markings made by the stringers.

#### Step 2: Where are the Fielders Positioned?

We need a robust methodology that will correct for slight systemic recording bias as well as correct for slight differences in what the x,y coordinates translate to in real life. To do this, we’re going to calculate the median spray angle recorded for all ground ball outs, by infield position, by venue. Median is more appropriate since it effectively corrects for shifting, where the 3B will often field a ball at an extreme spray angle, affecting the average significantly, but the median only slightly. In lieu of dumping another huge table with the computations, we’ll look at box-and-whisker plots showing the distribution of median angle by venue.
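A minimal sketch of that grouping follows, along with a toy illustration of why the median holds up better than the mean when a shifted fielder records an out far from his usual spot. The record layout, the venue name and the angles are all invented for illustration; batter handedness is included in the key because the model splits by LHH and RHH.

```python
from collections import defaultdict
from statistics import mean, median


def median_angles(outs):
    """outs: iterable of (venue, position, batter_hand, spray_angle_deg)."""
    groups = defaultdict(list)
    for venue, position, hand, angle in outs:
        groups[(venue, position, hand)].append(angle)
    return {key: median(angles) for key, angles in groups.items()}


# Five ordinary outs by a third baseman, plus one recorded while shifted
# to the right side of the infield (hypothetical numbers).
outs = [("Park A", "3B", "R", a) for a in (10.0, 11.0, 12.0, 13.0, 14.0)]
outs.append(("Park A", "3B", "R", 70.0))

medians = median_angles(outs)
all_angles = [angle for *_, angle in outs]
shifted_mean = mean(all_angles)      # pulled well toward the shifted play
shifted_median = median(all_angles)  # barely moves
```

With these made-up numbers, the single shifted play drags the mean from 12 degrees to almost 22, while the median only moves from 12 to 12.5.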

What we see is a rather large distribution of median angle of the average ground ball out – the smallest range is 4 degrees, for shortstops against right-handed batters. On the surface, 4 degrees does not seem like a great deal of variance. However, we see from the table above that a 3 degree variance (such as 36 to 39) can have an extreme effect on the probability, which stresses the importance of accurately gauging the central location of a fielder in that particular venue.

#### Step 3: Model the Probability Based on Angle Differential

Let’s make an assumption that the median spray angle at which ground balls are fielded indicates the maximum probability that a fielder will make the play, and that the probability declines either linearly or parabolically as the angle differential increases, approaching zero as the ball gets farther away. Linear formula or normal distribution? Pictures are worth a thousand formulae…

That looks like a bunch of bell curves to me. What you’re actually looking at is the probability that a ball will be fielded by the respective position across the spray angle spectrum, with intersection points essentially equidistant between the two fielders. Thus, to predict the probability of a ball being fielded by a SS, we’ll simply measure how many standard deviations away the ball is from the median fielding point for a SS, based on the stadium and handedness of the batter. I didn’t have a direct way to flag plays that had significant shifting. Instead, I set a filter that excluded any plays more than 40 degrees away from the fielder’s median angle (e.g. a 3B fielding a ball where a 1B would usually be). This allowed for standard deviations that look as follows:
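The filter and the "how many standard deviations away" measure can be sketched as below. The 40-degree cutoff comes from the text; the use of a population standard deviation, and the sample angles, are assumptions made for illustration.

```python
from statistics import median, pstdev

SHIFT_CUTOFF = 40.0  # degrees from the median, per the filter described above


def zone_stats(angles):
    """Return (median angle, std. dev. after the shift filter, outs kept)."""
    center = median(angles)
    kept = [a for a in angles if abs(a - center) <= SHIFT_CUTOFF]
    # Population standard deviation is an assumption; a sample standard
    # deviation (statistics.stdev) would be an equally reasonable choice.
    return center, pstdev(kept), len(kept)


def deviations_from_median(angle, center, sd):
    """How many standard deviations a ball is from the fielder's median."""
    return abs(angle - center) / sd


# Hypothetical angles of outs recorded by one shortstop; the 85-degree
# play looks like an obvious shift and is dropped by the filter.
ss_angles = [24.0, 26.0, 28.0, 30.0, 32.0, 85.0]
center, sd, kept = zone_stats(ss_angles)
z = deviations_from_median(35.0, center, sd)
```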

STANDARD DEVIATION OF GB OUTS
Obvious Shifts Filtered Out

| | LHH 1B | LHH 2B | LHH SS | LHH 3B | RHH 1B | RHH 2B | RHH SS | RHH 3B |
|---|---|---|---|---|---|---|---|---|
| Median Angle | 82.5 | 67.0 | 31.7 | 15.8 | 79.8 | 63.1 | 27.8 | 12.4 |
| Std. Dev. of Angle | 5.8 | 6.3 | 9.8 | 7.6 | 6.7 | 7.4 | 5.8 | 5.9 |
| # of Outs Recorded | 44,556 | 66,352 | 28,571 | 12,120 | 12,716 | 37,870 | 81,022 | 69,182 |

Interestingly, it appears that ground balls hit to the pull side show less variability in the location of the outs recorded. My assumption is that ground balls to the pull side are hit harder than those to the opposite field: slower balls allow greater variability, while harder hit balls cluster closer to the mean. Alternatively, batters may simply be more predictable in their spray angle when hitting to the pull side. When a batter is hitting from the left side, the standard deviation is 9.8 to SS versus 5.8 from the right side; we see the opposite effect when a LHH hits a GB to 2B (6.3) vs a RHH (7.4). This might help explain the run values on ground balls hit to the opposite field, which are better than those on ground balls hit to the middle or pulled.

In lieu of assuming a perfect normal distribution and modelling the expected out% based on a normal distribution with the mean = median and the standard deviation as per above, I took the approach of modelling observational data which would allow for specific models for RHH, LHH and each level of organized baseball. For example:

We see two distinct curves that are fairly predictable and easily quantifiable. In this instance, the model estimated that for a RHH in the majors, on a ground ball of at least 50 feet, the probability distribution would be approximately 92% – 0.6%*(Angle – MedianAngle) + 0.35%*(Angle – MedianAngle)^2. This is intuitive since it assumes that the difficulty level scales with the square of the spray angle’s deviation from the median, rather than linearly. The model produced a probability curve for each level (Majors, AAA, AA), handedness (R/L) and fielding zone (3B, SS, 2B).
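As a rough illustration of how a curve of this shape could be evaluated, the sketch below plugs the quoted coefficients into a clamped quadratic. Two things are assumptions of mine and not confirmed details of the model: the squared term is entered as a penalty, so the probability falls away from the median on both sides, and the result is clamped to the 0 to 1 range. The median angle of 30 degrees is a placeholder.

```python
def expected_out(angle, median_angle, base=0.92, linear=0.006, quadratic=0.0035):
    """Quadratic out-probability curve, clamped to [0, 1].

    base, linear and quadratic echo the 92%, 0.6% and 0.35% figures quoted
    above. Treating the squared term as a penalty is an assumption made so
    that the curve declines as the ball moves away from the median.
    """
    d = angle - median_angle
    p = base - linear * d - quadratic * d * d
    return max(0.0, min(1.0, p))


MEDIAN_ANGLE = 30.0  # hypothetical median for one fielder, venue and batter hand

at_median = expected_out(30.0, MEDIAN_ANGLE)       # the peak of the curve
five_right = expected_out(35.0, MEDIAN_ANGLE)      # 5 degrees toward first base
five_left = expected_out(25.0, MEDIAN_ANGLE)       # 5 degrees toward third base
far_away = expected_out(50.0, MEDIAN_ANGLE)        # clamped at zero
```

The signed linear term makes the curve slightly asymmetric, so the same 5-degree miss costs a little more on one side of the fielder than on the other.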

### Testing the Model vs UZR

Let us begin with the shortstop position, where we’ll chart our model versus the time-tested UZR. We’ll look at this in a couple of different ways, first by +/- % (effectively the marginal probability of a play being made) compared to UZR/150 as well as total +/- (number of plays above average) to UZR.

#### Total +/- to UZR (SS in 2016) | R^2 = .46

Players in the top right and bottom left represent agreement between the two models, bottom right and top left representing disagreement. What is encouraging is that the models agree on who the worst shortstops are and who the best shortstops are. Alexei Ramirez, Ketel Marte and Brad Miller all rank very poorly. Lindor, Russell and Simmons all grade very strongly. Interestingly, this model thinks highly of Carlos Correa’s 2016 performance, whereas UZR has him at slightly below average. I’d defer to the far more mature and robust UZR, however, it is encouraging that the models largely agree on who the good defenders are and who the poor defenders are.
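For reference, the R^2 figures quoted in these headings can be reproduced for any two columns of ratings with a few lines of code. The sketch assumes R^2 here means the squared Pearson correlation from a simple linear fit, and the ratings below are invented, not the 2016 values.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up ratings for five shortstops: model total +/- versus UZR.
model_plus_minus = [12.0, 5.0, -3.0, -10.0, 1.0]
uzr = [10.0, 8.0, -1.0, -12.0, -4.0]
agreement = r_squared(model_plus_minus, uzr)
```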

#### Average +/- to UZR/150 (SS in 2016) | R^2 = .42

The models largely agree here as well, with similar outliers such as Jonathan Villar, which this model is more bullish on. Jedd Gyorko is not a good shortstop by any measure.

#### Comparison of Second Basemen | R^2 = 0.11

For 2Bs, we see much less agreement between the models, suggesting to me that the methodology described above may be fundamentally flawed when applied to second basemen. Given that this model is rather crude in comparison to UZR, this was disconcerting, so I wanted to check the internal consistency of the model to see if it was predictive year to year, which would imply that, at the very least, it is measuring the same signal.

#### 2B Year to Year Consistency | R^2 = .19

We see an encouraging year-to-year correlation for the metric, which would suggest that there is a signal being detected, though it may not be a particularly strong one. For now, let’s conclude that the model appears to be more accurate for shortstops than for second basemen.

#### Comparison of Third Basemen | R^2 = 0.54

Fortunately, for 3B, the models are largely in agreement as to 2016 performance, with agreement on the top 3Bs (Turner, Duffy, Beltre) as well as how atrocious Danny Valencia is as a fielder. UZR likes Juan Uribe a lot more than this model does; as always, I’d err on the side of UZR.

### Conclusions and Next Steps

We can say with some level of confidence that the model is picking up a fair amount of signal, though it clearly has a weak spot with respect to accurately measuring second basemen. I was encouraged that the model was able to pick out the consensus top and bottom performers at their positions, which bodes well for using this to project minor league talent, where only the best of the best will ever make it.

In the next article, we’ll take a look at how well this model performs at predicting fielding success in the majors for minor league infielders.

Eli Ben-Porat is a Senior Manager of Reporting & Analytics for Rogers Communications. The views and opinions expressed herein are his own. He builds data visualizations in Tableau, and builds baseball data in Rust. Follow him on Twitter @EliBenPorat, however you may be subjected to (polite) Canadian politics.