# Data: Impossible – The Minor League Strike Zone Part 2

Throughout my professional career, I’ve encountered multiple situations in which it was believed acquiring, cleaning or wrangling a specific data set was impossible. The MLB Gameday minor league data set includes such an example: hand-crafted strike zone data placed by stringers.

Yesterday, we demonstrated these minor league strike zone data contain real, useable information, at least at Double-A and above. This is also a great example of the power of markets in providing information. Any one game, stringer or venue may produce inaccurate results; however, many different inputs from a variety of sources often can come together to produce a whole that is greater than the sum of its parts. Today, we’re going to spin these data around and attempt to learn a little bit more about some of today’s young prospects.

## The Aaron Nola Clones

Aaron Nola is a very interesting pitcher in that he routinely outperforms his swinging-strike rate by garnering a larger-than-average share of called strikes as a percentage of all pitches not swung at. (In other words, the rate of called strikes on all taken pitches.) This is a skill he has demonstrated not only at the major league level, but also at Triple-A and Double-A, which tells us that if we find players with a similar profile, they may similarly outperform their swing-and-miss numbers.
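To make the definition concrete, here is a minimal sketch of the called-strike rate on taken pitches. The outcome labels are hypothetical stand-ins, not the actual Gameday pitch descriptions, and the sample sequence is made up.

```python
from collections import Counter

# Hypothetical outcome labels; the real Gameday feed uses its own descriptions.
TAKEN_OUTCOMES = {"called_strike", "ball"}


def called_strike_rate(outcomes):
    """Called strikes divided by all pitches the batter did not swing at."""
    counts = Counter(outcomes)
    taken = sum(counts[o] for o in TAKEN_OUTCOMES)
    if taken == 0:
        return None  # no taken pitches, so the rate is undefined
    return counts["called_strike"] / taken


# Made-up pitch sequence: 5 taken pitches, 2 of them called strikes.
sample = [
    "ball", "called_strike", "swinging_strike", "foul",
    "ball", "called_strike", "in_play", "ball",
]
print(called_strike_rate(sample))
```

Swings (whiffs, fouls, balls in play) drop out of the denominator entirely, which is what separates this rate from called strikes per total pitch.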

Let’s look at three charts showing called strike percentage vs. distance from center (greater average distance to the right, smaller average distances to the left) for the majors, Triple-A, and Double-A. Note that the distance from center is measured in screen pixels, so we’ve left the raw number out of the x-axis.

**Distance from Center & Called Strike % (MLB) – Min. 500 Pitches**

Let’s make specific note of where Nola is on this chart, far above the average called-strike percentage, behind only Rich Hill, Bartolo Colon and Cliff Lee. He also is fairly close to the average in terms of how far from middle-middle he throws, and he profiles similarly to Kyle Hendricks by this metric. Additionally, we see a somewhat loose R^{2} correlation of 0.14, which confirms what we’d expect: If, on average, you throw closer to the middle, more of the pitches taken will be called strikes.

Obviously, it’s not that simple: if a pitcher is always close to the heart of the plate, more of his would-be called strikes will get swung at. I’m pointing this out now because we don’t see this correlation in Triple-A and Double-A. Being able to get a high called-strike percentage without being too close to the heart or nibbling too much is a nice spot to be in, which, for now, we’ll dub the “Nola Zone.”
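For readers who want to reproduce this kind of chart, here is a rough sketch of the two quantities involved: a pitcher's average distance from the center of the zone, and the R^{2} of a one-variable linear fit (the squared Pearson correlation). The coordinate system, the center point and the sample numbers are all assumptions for illustration; the actual stringer data are in screen pixels.

```python
import math


def mean_distance_from_center(pitches, center=(0.0, 0.0)):
    """Average Euclidean distance of (x, y) pitch locations from a center point."""
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in pitches) / len(pitches)


def r_squared(xs, ys):
    """R^2 of a simple linear fit, i.e. the squared Pearson correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up per-pitcher values: average distance vs. called-strike rate.
avg_distance = [1.0, 2.0, 3.0, 4.0]
called_rate = [0.40, 0.34, 0.36, 0.30]
print(round(r_squared(avg_distance, called_rate), 3))
```

Each pitcher contributes one point (average distance, called-strike rate), so the minimum-pitch filter matters: small samples add noise to both axes.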

**Distance from Center & Called Strike % (AAA) – Min. 500 Pitches, Age <=25**

Let’s again make note of the Nola Zone, which is eerily consistent from Triple-A to the majors, located at a slightly above-average distance and at a significantly above-average called-strikes level. At the Triple-A level we have a minuscule 0.01 R^{2} correlation between distance and called strike percentage, indicating either issues with the quality of the data or far more random variation with lesser hitters.

Our first clone is Tom Eshelman, who, thanks to his surgical command, is able to draw an exceptional number of called strikes, all while operating in the Nola Zone. Brock Stewart, a solid prospect in his own right, appears to lend some weight to Jay Jaffe’s assertion that it’s too early to give up on him. However, he loses his clone status when we consider he had a significantly higher WHIFF% than Nola in Triple-A (25 percent to 18 percent) and also drew a lot more swings (51 percent of pitches to Nola’s 41 percent). Enter Nestor Cortes, he of second-team All-KATOH fame and, less importantly, very Nola-esque swinging-strike, WHIFF and called-strike rates.

Why are we searching for Nola clones? Often, when we evaluate pitchers with elite command, where it is hard for us to quantify *how* or *why* they draw extra called strikes, we tend to discount them as prospects and focus our attention on the easier-to-understand SwStr%. This is all to say that we should be paying a little more attention to Cortes, whose command may play up as a starter if he’s given a shot there, despite his poor small-sample start in the bullpen.

It’s somewhat interesting to see Luiz Gohara, Jordan Montgomery and Jack Flaherty all clustered in the same spot, though I’m not sure if it means anything. The other guy in that zone is Williams Perez, so make of it what you will.

**Distance from Center & Called Strike % (AA) – Min. 500 Pitches, Age <=23**

I don’t want to spend a great deal of time at Double-A, since the average distance metric is less reliable at this level. However, I wanted to point out how consistent the Nola Zone is here as well.

## Projecting Groundball Pitchers

Generating ground balls is a combination of location (lower in the zone is better), movement and deception. Let’s spin our data around the location axis and look for a little bit of insight into projecting groundball pitchers. Here’s a chart showing groundball percentage by vertical location, from High-A to the majors.

**Groundball Percentage by Vertical Location and Level of Play**

It’s rather interesting how similar the Double-A and Triple-A curves are, as well as the weird dip at the extreme low end of the zone, which I don’t have a solid explanation for. Perhaps it suggests there are a certain number of “throw-away” zone markings placed at the very bottom of the zone, irrespective of where they actually were thrown.

Let’s take a quick look at three charts showing the per-level correlation of groundball percentage to vertical location.

**Groundball Percentage by Vertical Location, Double-A | R^{2} = 0.10**

**Groundball Percentage by Vertical Location, Triple-A | R^{2} = 0.18**

**Groundball Percentage by Vertical Location, MLB | R^{2} = 0.13**

It’s a tad strange that we see a stronger correlation in Triple-A than in the majors, which tells me there likely is an element of confirmation bias happening in the Triple-A data. Specifically, when a ground ball is hit, the stringer naturally will assume the pitch was lower in the zone. Let’s keep this in mind, as it is important that the average vertical distance is measuring a distinct skill (keeping the ball down) as opposed to a skill it goes hand-in-hand with (generating ground balls).

**MLB Groundball Percentage by Triple-A Vertical Location | R^{2} = 0.13**

We wouldn’t normally expect the Triple-A-to-majors number to have the same correlation as the major league-to-major league number. However, as noted above, it is likely we are capturing an element of groundball percentage in the Triple-A numbers.

## Projecting Groundball Percentage Pitchers

I wanted to see how much extra information the average vertical distance captured above and beyond simply using groundball percentage, so I built several models using various combinations of groundball percentages from Double-A and Triple-A and average vertical distance from Double-A and Triple-A to see which permutations produced the strongest correlations. What became very apparent were four things.

First, using Double-A vertical location data at all only confused the model, making it less accurate overall. Second, Triple-A groundball percentage by itself was far and away the strongest signal. Third, Double-A groundball percentage was useful. Last, the Triple-A vertical distance number boosted the model a tad and was consistent in how it affected the models.
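The model-comparison exercise described above can be sketched roughly as follows. This is not the author's code: it assumes NumPy, ordinary least squares, hypothetical column names (`aa_gb`, `aaa_gb`, `aa_pz`, `aaa_pz`) and purely synthetic data, and only illustrates enumerating predictor combinations and ranking them by R^{2}.

```python
import itertools

import numpy as np


def fit_r2(X, y):
    """Ordinary least squares with an intercept; returns the in-sample R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data; none of these numbers come from the article.
rng = np.random.default_rng(0)
n = 200
features = {
    "aa_gb": rng.normal(0.45, 0.06, n),   # Double-A groundball rate
    "aaa_gb": rng.normal(0.45, 0.06, n),  # Triple-A groundball rate
    "aa_pz": rng.normal(2.3, 0.2, n),     # Double-A avg. vertical location (ft)
    "aaa_pz": rng.normal(2.3, 0.2, n),    # Triple-A avg. vertical location (ft)
}
mlb_gb = (
    0.1 + 0.3 * features["aa_gb"] + 0.6 * features["aaa_gb"]
    - 0.04 * features["aaa_pz"] + rng.normal(0, 0.03, n)
)

# Fit every non-empty combination of predictors and record its R^2.
results = {}
for k in range(1, len(features) + 1):
    for combo in itertools.combinations(features, k):
        X = np.column_stack([features[name] for name in combo])
        results[combo] = fit_r2(X, mlb_gb)

for combo, r2 in sorted(results.items(), key=lambda kv: -kv[1])[:3]:
    print(combo, round(r2, 3))
```

One caveat on the design: in-sample R^{2} can never fall when a predictor is added, so judging whether a variable helps or hurts would call for adjusted R^{2} or held-out data rather than the raw figure printed here.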

Here are the multiple linear regression equations the models spit out (coefficients rounded to two decimal places). Triple-A pz is the Triple-A average vertical location.

**Model No. 1:** MLB GB% = 0.12 + 0.27 Double-A GB% + 0.60 Triple-A GB% – 0.04 (Triple-A pz) | R^{2} = 0.608

**Model No. 2:** MLB GB% = 0.17 + 0.80 Triple-A GB% – 0.05 (Triple-A pz) | R^{2} = 0.575

**Model No. 3:** MLB GB% = 0.03 + 0.28 Double-A GB% + 0.62 Triple-A GB% | R^{2} = 0.605

I wasn’t blown away by the predictive nature of vertical location at the Triple-A level. However, I was encouraged that it boosted the model ever so slightly (No. 1 vs No. 3) and that it spat out roughly the same -4/-5 percentage points per foot of average vertical location (lower being better). It seems likely that if we’re able to clean the data a little bit more, adjust for biases, and filter out the clearly wrong data, we’ll get a stronger signal.
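As a quick worked example, the three rounded equations can be applied to a single pitcher. The sketch assumes groundball percentages are expressed as fractions (0.50 for 50 percent) and pz is in feet, which is how the "percentage points per foot" reading works out; the pitcher's inputs are hypothetical.

```python
def model_1(aa_gb, aaa_gb, aaa_pz):
    """Model No. 1: Double-A GB%, Triple-A GB% and Triple-A vertical location."""
    return 0.12 + 0.27 * aa_gb + 0.60 * aaa_gb - 0.04 * aaa_pz


def model_2(aaa_gb, aaa_pz):
    """Model No. 2: Triple-A GB% and Triple-A vertical location."""
    return 0.17 + 0.80 * aaa_gb - 0.05 * aaa_pz


def model_3(aa_gb, aaa_gb):
    """Model No. 3: groundball percentages only."""
    return 0.03 + 0.28 * aa_gb + 0.62 * aaa_gb


# Hypothetical pitcher: 50% grounders at both levels, 2.0 ft average height.
aa_gb, aaa_gb, aaa_pz = 0.50, 0.50, 2.0
print(round(model_1(aa_gb, aaa_gb, aaa_pz), 3))
print(round(model_2(aaa_gb, aaa_pz), 3))
print(round(model_3(aa_gb, aaa_gb), 3))

# Effect of working half a foot lower, all else equal (Model No. 1).
print(round(model_1(aa_gb, aaa_gb, aaa_pz - 0.5) - model_1(aa_gb, aaa_gb, aaa_pz), 3))
```

For this made-up pitcher the three models land within a point of each other, and dropping the average location by half a foot is worth about two percentage points in Model No. 1. Because the published coefficients are rounded, the outputs are only approximate.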

## Projecting Home Run Hitters

We’ve saved the most interesting part for last, mixing in a little bit of strike zone data to help us predict home run percentage (the percentage of balls in play that result in a home run) for hitters in the majors. As a bonus, we’ll throw in a little bit of Triple-A flyball distance just for fun and see if that helps our model at all.

Let’s begin with the heart of predicting home run percentage in the majors using Triple-A data, specifically home run percentage at the Triple-A level. We get a very, very strong signal, which looks like this.

**HR% MLB to HR% Triple-A (Min. 1000 pitches at each level) | R^{2} = 0.50**

An R^{2} of 0.50 is a pretty strong signal, especially with a binary attribute such as home run percentage. This, in and of itself, is neither novel nor particularly interesting. What is interesting is baking in *how* these batters are being pitched at Triple-A and seeing how much, if at all, it boosts our signal.

We know more dangerous hitters are pitched to more carefully than batters who don’t have much power. Thus, we should see a strong correlation between average distance and home run percentage, as well as a boost to our signal.

The most-pitched-around batters in Triple-A were Aaron Judge, Joey Gallo, Matt Olson and Matt Chapman, and the least-pitched-around were old Jeff Francoeur, Josh Phegley, Yolmer Sanchez and Angel Sanchez. Overall, the spread from top to bottom was roughly 0.5 feet of average distance from the center of the zone.

**HR% MLB to Avg. Distance from Center Triple-A (Min. 1000 pitches at each level) | R^{2} = 0.07**

The 0.07 R^{2} value in and of itself is pretty mediocre. However, where it gets interesting is when we throw it into the model with home run percentage.

HR% MLB = -8% + .67*HR% Triple-A + 7.8%*(Avg. Distance from Center – Feet) | R^{2} = 0.53

More important than the 0.03 boost in signal is the p-value of 0.000036 for the strike zone variable, which tells us its contribution is statistically significant. Given the average distance from center was roughly 1.3 feet, Judge’s home run percentage projection would have been boosted by roughly two percentage points. Let’s mix in average flyball distance at Triple-A and see how far we can push this model.
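The arithmetic behind a boost of that size can be sketched with the rounded two-variable equation above. Everything is in percentage points; the hitter below is hypothetical (not Judge's actual numbers), and the 1.3-foot baseline is the rough average quoted in the text.

```python
# Rounded coefficients from the two-variable model, in percentage points.
INTERCEPT = -8.0
HR_COEF = 0.67
DIST_COEF = 7.8        # percentage points per foot of average distance
AVG_DISTANCE_FT = 1.3  # rough average distance from center


def project_mlb_hr_pct(aaa_hr_pct, aaa_distance_ft):
    """Projected MLB HR% from Triple-A HR% and average distance from center."""
    return INTERCEPT + HR_COEF * aaa_hr_pct + DIST_COEF * aaa_distance_ft


def distance_boost(aaa_distance_ft, baseline_ft=AVG_DISTANCE_FT):
    """Change in the projection relative to a hitter pitched at the baseline."""
    return DIST_COEF * (aaa_distance_ft - baseline_ft)


# Hypothetical hitter: 6% HR rate at Triple-A, pitched 0.25 ft farther
# from center than average.
hitter_hr_pct, hitter_distance_ft = 6.0, 1.55
print(round(project_mlb_hr_pct(hitter_hr_pct, hitter_distance_ft), 2))
print(round(distance_boost(hitter_distance_ft), 2))
```

Under these rounded coefficients, being pitched a quarter-foot farther from center than average adds just under two percentage points to the projection, the same order of magnitude as the boost described above.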

**HR% MLB to Avg. FB Distance AAA (Min. 1000 pitches at each level) | R^{2} = 0.43**

We see a very strong signal, which we’d expect. However, the real question is how much independent signal does it provide over and above the home run percentage?

HR% MLB = -25% + .43*HR% Triple-A + 7.5%*(Avg. Distance from Center – Feet) + .04%*(Avg. Distance on Fly Balls – Feet) | R^{2} = 0.56

Again, we were able to boost the model by a modest 0.03, with all three predictor variables producing p-values below 0.0001. As in the groundball percentage models, the strike zone metric produced a consistent result. While I’m not sure how much it means on its own, I find it quite interesting that the predictive aspect of distance was consistent across the board, including 6.7 percent per foot when it was the only variable.

## Conclusion

When I began digging into the minor league strike zone data, I was highly skeptical of there being anything real, let alone helpful, in making predictions. As I began exploring the data, it became quite clear it was at least a somewhat rough measure of where pitchers were pitching and batters were being pitched to.

While for certain aspects, such as groundball percentage, the data weren’t too predictive, the boost they gave to the home run percentage projections was especially intriguing. Also fascinating, but not necessarily predictive, was how consistent Aaron Nola’s approach was from Double-A all the way through to the majors.

What I’m most encouraged by with these data is the quality of the signals given a relatively simple approach to cleansing the data. I expect that as I develop more sophisticated methods for adjusting and cleaning the data, the quality of said signals will improve.

Do park factors have a role to play in the HR% to Triple-A FB distance comparison?

Yes. I didn’t discuss it here as it wasn’t the central thesis, but all FB distances are normalized to the average FB distance for that venue for that particular month.

I was also sort of musing about how a correlation such as this behaves when what is essentially a categorical variable is compared to a continuous one.