Toward a Probability Distribution Over Batted-Ball Trajectories
Editor’s Note: This piece was initially given as a presentation at the marvelous 2016 Saberseminar.
If you haven’t noticed, the first tip of the Statcast iceberg has been made available to the public at Baseball Savant. You can download pitch-by-pitch data in .csv format, including pitch metrics like spin rate and spin axis. But this article is about the two types of batted ball data available there: exit velocity and launch angle.
Exit velocity is the speed of the ball off the bat and it corresponds to distance from the origin in the polar coordinate plot above. Launch angle is the vertical angle of the trajectory of the baseball after contact. The above figure is a heatmap showing the joint probability density function of launch angle and exit velocity, across all recorded batted balls this regular season through July. Red and orange represent trajectories that are more common; blue and green represent trajectories that are less common. Actually, the pic above is slightly cropped. Let’s see the full version:
While the sabermetric community has made progress on estimating the run value of a batted ball as a function of its trajectory, I have not found anything written in the public sphere about the probability distribution of trajectory as a random variable. The goal of this article is to predict, for any batter-pitcher matchup, the distribution of possible batted ball trajectories.
The value of this endeavor is twofold. First, such a model could be used to evaluate batters and pitchers on the basis of batted balls, while simultaneously controlling for sample size, park effects and opponent quality. Second, and more importantly, the predicted trajectory distribution could inform fielder positioning.
Because the publicly available Statcast data are still relatively new, many of the results below are of an exploratory nature. This is a first attempt toward a probability distribution over batted ball trajectories, and hopefully more refined work will follow.
The Strategy
As a batter, you can’t hit the ball hard (high exit velocity) without making solid contact. Launch angle is, in some sense, a measure of quality of contact. Angles of 50 or -30 degrees, for example, signal poor contact. The angle says something about what the speed off the bat is likely to be, so the distributions of launch angle and exit velocity must be modelled jointly, not separately.
The strategy is first to predict the distribution of launch angle for the batter-pitcher matchup and second to predict the conditional distribution of exit velocity given the launch angle. The joint distribution, then, is just the product of these two distributions.
Focusing first on launch angle, the figures below show that this variable adheres very closely to a normal distribution, across all batted balls in the dataset. In fact, I have found that distributions of launch angle on a batter-by-batter or pitcher-by-pitcher basis are also consistent with a normal distribution, as are residuals after controlling for batter, pitcher and other factors.
My conclusion is that a normal distribution is appropriate to model the randomness in launch angle. I find this to be a very pleasant surprise because I would have expected the cosine or some other transformation of the angle to follow more closely a normal distribution.
Given that it is a normal distribution we are dealing with, the task is to estimate the mean (center) and variance (spread) given the batter and pitcher involved. Assuming these two parameters can be estimated by a linear combination of variables (an assumption I will validate later), this amounts to generalized least squares with unknown variance.
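In symbols, the setup described above might be written as follows. The notation is introduced here purely for illustration and is an assumption: \theta_i is the launch angle of batted ball i, x_i is its vector of predictor variables, and \beta and \gamma are the coefficient vectors for the mean and the variance, respectively.

```latex
\theta_i \sim \mathcal{N}\left(\mu_i, \sigma_i^2\right),
\qquad \mu_i = x_i^\top \beta,
\qquad \sigma_i^2 = x_i^\top \gamma
```

Estimating \beta is the generalized least squares problem; the complication is that \gamma, and therefore each \sigma_i^2, is unknown and must be estimated from the same data.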
Rather than writing down and optimizing a likelihood function, here I will go with the simpler feasible generalized least squares from econometrics literature, which in this case is a three-step procedure:
- Step 1: Use ordinary least squares to estimate the expected angle for each batted ball. This leads directly to the residual for each ball, which is the difference between the observed and expected angle.
- Step 2: Regress the squared residuals on the same variables as in Step 1, regularizing with a ridge penalty. This gives an estimate of the variance of the angle of each batted ball.
- Step 3: Solve the generalized least squares problem, regressing angle on the same variables as in Steps 1 and 2 and using the estimated variances from Step 2. As in Step 2, apply regularization with a ridge penalty.
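The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the results in this article: the function names `ridge_solve` and `feasible_gls`, the single shared `penalty` value, and the `min_var` floor on the estimated variances are all choices made here for the sake of the example.

```python
import numpy as np


def ridge_solve(X, y, penalty, weights=None):
    """Weighted ridge regression via the normal equations.

    Minimizes sum_i w_i * (y_i - x_i . b)^2 + penalty * ||b||^2.
    For simplicity every coefficient is penalized, including any intercept.
    """
    if weights is None:
        weights = np.ones(len(y))
    Xw = X * weights[:, None]                      # scale each row by its weight
    A = X.T @ Xw + penalty * np.eye(X.shape[1])    # X' W X + penalty * I
    return np.linalg.solve(A, Xw.T @ y)            # solve against X' W y


def feasible_gls(X, y, penalty=1.0, min_var=1e-6):
    """Three-step feasible GLS; returns (mean_coefs, variance_coefs)."""
    # Step 1: ordinary least squares for the mean, then the residuals.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols

    # Step 2: ridge regression of the squared residuals on the same variables.
    gamma = ridge_solve(X, resid ** 2, penalty)
    # Assumption of this sketch: floor the fitted variances so they stay positive.
    var_hat = np.maximum(X @ gamma, min_var)

    # Step 3: ridge-penalized GLS, weighting each ball by 1 / estimated variance.
    beta = ridge_solve(X, y, penalty, weights=1.0 / var_hat)
    return beta, gamma
```

For a new matchup with predictor vector `x`, `x @ beta` would give the predicted mean launch angle and `x @ gamma` the predicted variance. In practice the penalty would need to be tuned, for example by cross-validation, which this sketch does not attempt.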
Specifically, the variables on which I regress are the identity of the batter, the identity of the pitcher, the identity of the ballpark (for park effects), an indicator of whether the batter is on the home team and an indicator of whether the batter has opposite handedness relative to the pitcher.
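One way to turn these variables into regression inputs is to one-hot encode the three identities and append the two indicators. The sketch below assumes each batted ball is a dict with the hypothetical keys `batter`, `pitcher`, `park`, `batter_home` and `opposite_hand`; those names, and the intercept column, are assumptions made for illustration.

```python
def build_design_rows(records):
    """One-hot encode batter, pitcher and park, plus two binary indicators.

    `records` is a list of dicts with the hypothetical keys
    'batter', 'pitcher', 'park', 'batter_home' and 'opposite_hand'.
    Returns (rows, column_names), one row of floats per batted ball.
    """
    columns = ["intercept"]  # assumption: include an overall average term
    for field in ("batter", "pitcher", "park"):
        levels = sorted({r[field] for r in records})
        columns += [f"{field}={level}" for level in levels]
    columns += ["batter_home", "opposite_hand"]
    index = {name: j for j, name in enumerate(columns)}

    rows = []
    for r in records:
        row = [0.0] * len(columns)
        row[index["intercept"]] = 1.0
        for field in ("batter", "pitcher", "park"):
            row[index[f"{field}={r[field]}"]] = 1.0   # switch on this identity
        row[index["batter_home"]] = float(r["batter_home"])
        row[index["opposite_hand"]] = float(r["opposite_hand"])
        rows.append(row)
    return rows, columns
```

Keeping every level of every identity makes the columns linearly dependent, so the coefficients are only pinned down once a ridge penalty (or a minimum-norm least squares solution) is used.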
For a batter-pitcher matchup, the model from Step 2 predicts the variance in launch angle, whose square root is the standard deviation, and the model from Step 3 predicts the mean launch angle. The next section presents the results of fitting the model from Step 2.
Launch Angle Standard Deviation
Carlos Gonzalez of the Rockies has an average launch angle of about 10 degrees. Coincidentally, 10 degrees is my average launch angle when I go to my local batting cage. But that’s because half the time I hit the ball up at 50 degrees and the other half of the time I hit the ball down at -30 degrees. This illustrates the importance of variation in launch angle.
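To put a number on the batting cage example, here is a quick check using a made-up sample in which exactly half of the balls leave the bat at 50 degrees and half at -30 degrees. The sample is invented for illustration only.

```python
from statistics import mean, pstdev

# Made-up sample: half the balls at 50 degrees, half at -30 degrees.
cage_angles = [50.0, -30.0] * 50

cage_mean = mean(cage_angles)   # 10.0 degrees
cage_sd = pstdev(cage_angles)   # 40.0 degrees

print(cage_mean, cage_sd)
```

The mean matches the 10 degrees quoted above, but the standard deviation of 40 degrees is far larger than any value in the table that follows, where the estimates run from roughly 22 to 27 degrees.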
Highest 5 | Angle S.D. (degrees) | Lowest 5 | Angle S.D. (degrees) |
Todd Frazier | 26.7 | Joey Votto | 21.9 |
Maikel Franco | 26.6 | Jon Jay | 22.1 |
Kevin Plawecki | 26.6 | Nick Castellanos | 22.3 |
Steven Wright | 26.4 | Starlin Castro | 22.3 |
Kevin Kiermaier | 26.4 | DJ LeMahieu | 22.3 |
The table above gives estimated launch angle standard deviation for each batter (against an average pitcher) and pitcher (against an average batter). The top five and bottom five are shown. Joey Votto has the lowest variation in launch angle, and his former Reds teammate, Todd Frazier, has the highest.
Steven Wright is highlighted because he is the only pitcher to appear in this table. It makes sense that a knuckleballer would be the top pitcher in terms of standard deviation in batted balls against.
Mean Launch Angle
This section presents the results of fitting the model in Step 3 of feasible generalized least squares, described above. The table below gives the top five and bottom five batters and pitchers by expected launch angle when facing an average pitcher or batter, respectively.
Highest 5 | Mean Angle (degrees) | Lowest 5 | Mean Angle (degrees) |
Ryan Buchter | 23.4 | Christian Yelich | 3.9 |
Zach McAllister | 22.1 | Cameron Maybin | 4.0 |
Koji Uehara | 21.0 | Jeremy Jeffress | 4.2 |
Bryan Holaday | 21.0 | Jeurys Familia | 4.2 |
Nolan Arenado | 20.7 | Marcus Stroman | 4.3 |
All pitchers are highlighted, and we observe that there are more pitchers in this table than in the previous one. Intuitively, this makes sense because pitchers have more control over the trajectories of batted balls against them based on what types of pitches they throw and where they locate them.
A key assumption that I have made in this model is additivity. Under this assumption, for example, a batter whose average launch angle is 5 degrees above average and a pitcher whose average launch angle allowed is 5 degrees above average would be expected to produce a launch angle 10 degrees above average if they faced each other. We can check the validity of this assumption by plotting residuals from the model against predictions. If the assumption is wrong, we would expect an upward or downward trend.
The figure above shows the results of aggregating all batted balls by predicted launch angle and averaging the residuals, with 95 percent confidence intervals for the mean. We see evidence of a slight upward trend between predictions and residuals: balls with high predicted angles tend to come off the bat even higher than predicted, and balls with low predicted angles even lower. This suggests slight super-additivity. But it is a small effect, so I am content with concluding that the additivity assumption is not terribly off-base.
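A residual check of this kind can be sketched as follows. The function name `binned_residuals`, the one-degree default bin width, and the normal-approximation interval are assumptions made here for illustration; the figure may have been built differently.

```python
from statistics import mean, stdev


def binned_residuals(predicted, observed, bin_width=1.0):
    """Average residual (observed - predicted) within each prediction bin.

    Returns a sorted list of (bin_start, mean_residual, ci_low, ci_high),
    using a normal-approximation 95 percent interval for the mean.
    Bins with fewer than two balls are skipped.
    """
    bins = {}
    for p, o in zip(predicted, observed):
        key = (p // bin_width) * bin_width      # left edge of the bin containing p
        bins.setdefault(key, []).append(o - p)

    summary = []
    for start in sorted(bins):
        resid = bins[start]
        if len(resid) < 2:
            continue
        m = mean(resid)
        half_width = 1.96 * stdev(resid) / len(resid) ** 0.5
        summary.append((start, m, m - half_width, m + half_width))
    return summary
```

If additivity held exactly, the mean residuals would hover around zero in every bin, with no trend from the lowest predictions to the highest.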
The additivity assumption is appealing because we know from The Book that fly ball hitters struggle against fly ball pitchers, and ground ball hitters struggle against ground ball pitchers. This is consistent with the hypothesis that facing an opponent who tends to produce the same type of trajectory will lead to even more extreme trajectories.
Mean Exit Velocity
Now that we’ve gotten launch angle out of the way, let’s move on to exit velocity.
The figure below shows for each launch angle the average transformed exit velocity, with 95 percent confidence intervals. The black curve has a cosine shape and fits the data very well between roughly -35 and 45 degrees, accounting for 87 percent of all batted balls. The fit is poor outside of this region, but the standard errors are much higher there, too.
Based on the above, to account for launch angle when modeling exit velocity, we’ll include an additional linear term in our model for the cosine of the difference between the launch angle and 10 degrees. Otherwise, we fit the same ridge regression as we did for angle mean and variance in the previous two sections, only now we use exit velocity as our response variable.
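The extra regression term is simple to compute. In this small sketch, the names `PEAK_ANGLE_DEG` and `cosine_term` are introduced here for illustration; the only ingredient taken from the article is the 10-degree reference angle.

```python
import math

PEAK_ANGLE_DEG = 10.0  # reference angle used in the cosine term


def cosine_term(launch_angle_deg, peak_deg=PEAK_ANGLE_DEG):
    """Cosine of the difference between the launch angle and the peak angle."""
    return math.cos(math.radians(launch_angle_deg - peak_deg))
```

The term equals 1 at a 10 degree launch angle and falls off symmetrically on either side, so launch angles of 50 and -30 degrees, both 40 degrees from the peak, get the same value of about 0.77.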
Highest 5 | Mean Speed (mph) | Lowest 5 | Mean Speed (mph) |
Giancarlo Stanton | 99.0 | Billy Burns | 87.5 |
Mark Trumbo | 98.9 | Billy Hamilton | 88.2 |
Nelson Cruz | 98.5 | Dee Gordon | 88.5 |
Matt Holliday | 98.0 | Jose Iglesias | 88.7 |
Ryan Zimmerman | 97.6 | Jarrod Dyson | 88.9 |
The table above shows some of the results of fitting this model. For each player, the table reports expected exit velocity against average opposition conditional on a 10 degree launch angle. This is a better measure of the “power” tool than average exit velocity, because it controls for the “contact” tool.
Giancarlo Stanton tops the list, which suggests that we must be doing something right. And as a whole, the players appearing on both sides of the table match intuition. No pitchers appear in the table, and this also makes sense because pitchers should exhibit less control over the pure power of the swing than batters do.
Putting It All Together
Now we have, for a given batter-pitcher matchup, estimators for the distribution of the launch angle and the conditional distribution of the exit velocity given the launch angle. The joint distribution, then, is just the product of these two distributions. Below are two extreme examples of estimated trajectory distributions for batter-pitcher matchups.
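The product rule can be sketched directly. One loud caveat: the article models launch angle as normal, but it does not specify the shape of the conditional exit velocity distribution. Treating it as normal with a constant standard deviation, as the code below does, is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def joint_density(angle, speed, angle_mean, angle_sd, speed_mean_given_angle, speed_sd):
    """f(angle, speed) = f(angle) * f(speed | angle).

    `speed_mean_given_angle` is a function mapping launch angle to expected
    exit velocity. Assumption of this sketch (not of the article): the
    conditional exit velocity is normal with constant standard deviation.
    """
    f_angle = normal_pdf(angle, angle_mean, angle_sd)
    f_speed = normal_pdf(speed, speed_mean_given_angle(angle), speed_sd)
    return f_angle * f_speed
```

Evaluating `joint_density` over a grid of angles and speeds gives the kind of surface shown in the heatmaps, up to the distributional assumption noted above.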
Nolan Arenado is a fly ball hitter, and Chris Young is a fly ball pitcher. As expected, the predicted trajectory distribution from their faceoff puts a greater weight on fly balls. The same is true for Christian Yelich batting against Marcus Stroman, but for ground balls.
We didn’t need all of these Statcast data to tell us that ground ball hitters facing ground ball pitchers should hit ground balls. But we’ve quantitatively estimated the likelihood of each trajectory. Since we know how to estimate the expected wOBA for each trajectory, this means we can quantify the expected wOBA for each matchup. More correctly, we quantify the expected wOBAcon, or wOBA on contact.
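Computing an expected value on contact amounts to averaging the value of each trajectory, weighted by how likely that trajectory is. Here is a minimal sketch that approximates the average with a sum over a grid. The function name and its arguments are hypothetical, and the `value` function, which maps a trajectory to its wOBA, has to be supplied from elsewhere.

```python
def expected_value_on_contact(grid, density, value):
    """Approximate E[value] under a trajectory density with a weighted sum.

    `grid` is an iterable of (angle, speed) cells of equal area, `density`
    returns the (possibly unnormalized) density at a cell, and `value`
    returns the run value (for example wOBA) assigned to that trajectory.
    Dividing by the total weight normalizes away the cell area and any
    truncation of the grid.
    """
    total_weight = 0.0
    total_value = 0.0
    for angle, speed in grid:
        w = density(angle, speed)
        total_weight += w
        total_value += w * value(angle, speed)
    return total_value / total_weight
```

A finer grid gives a closer approximation to the underlying integral, at the cost of more evaluations.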
The predicted trajectory distributions above include EwOBAcon (expected wOBAcon). The league average wOBAcon this season is around .370, so the Arenado-Young (.340) and Yelich-Stroman (.296) matchups both lead to below-average wOBAcon. In case you don’t want to be limited to these two examples, I built a Shiny app so that you can explore the results of this model for the batter-pitcher matchup of your choice.
Next Steps
In order for this work to be of any use in fielder positioning, we need to include batted ball direction, or horizontal angle, in the predicted trajectory distributions. Currently, the data from Baseball Savant do not include this, but they do include (x, y) coordinates of batted ball location for most balls. This should be good enough to start. Furthermore, to get a complete picture for player evaluation, non-batted ball events need to be considered. Right now, swing-and-misses are ignored because there is no sensible way to define exit velocity or launch angle for these swings.
One troubling assumption is that the optimal launch angle (in terms of maximizing exit velocity) is about 10 degrees. This is probably true for an average batter whose swing plane is 10 degrees, but a batter whose swing plane is 25 degrees probably has an optimal launch angle closer to 25 degrees.
Note that I have presented you with this model, but I have not yet given you a reason to think it is any good. Model validation is an important next step. Finally, incorporating pitch types and locations into the model could improve the precision.
References & Resources
- Special thanks to Alan Nathan, David Kagan and other Saberseminar attendees who gave suggestions for methodological improvements. Thanks also to Dan Brooks, Chuck Korb, and all of the organizers and volunteers of Saberseminar for putting together a special event, without which this work never would have happened.
- Code and data for producing the above results are available here.
- Shiny app
- Baseball Savant
- Tom Tango, Mitchel Lichtman and Andrew Dolphin, The Book: Playing the percentages in baseball
- Rob Arthur, FiveThirtyEight, “The New Science of Hitting”
- Glenn Healey, The Hardball Times, “The Intrinsic Value of a Batted Ball”
- Glenn Healey, The Hardball Times, “The Reliability of Intrinsic Batted-Ball Statistics”
- David Kagan, The Hardball Times, “The Physics of Hard-Hit Balls”
- Wikipedia, “Generalized Least Squares”
Scott – Congratulations on a well researched and well written article. I think the relatively large number of hit balls that are missing Statcast data may affect the actual numbers and player rankings that you have calculated. I don’t see any mention in the article of what time period you used, but data were missing on about 14.4% of the hit balls in 2015 and about 12.8% in 2016 through 8/3. The missing data are not evenly distributed across launch angles: around 50% of balls classified as popups and 17% of ground balls are missing data, while only 2 to 3% of line drives and fly balls are.
It would be helpful to know whether there are specific vertical angles that Trackman has difficulty in tracking especially for ground balls. Also to know whether the percentages of missed data are roughly the same for all venues.
Thanks Peter. The time period I used is 2016 through 7/31. I agree that missing data affect the results. Before acting on these results, this problem is one that needs to be explored further. Maybe the missing data could be partially imputed using the trajectory classification and batted ball location.
Very interesting to see a parametric take on this – earlier this summer, when I had a lot more spare time, I did a bit of exploratory analysis of the statcast data and had wondered if something like this were possible, but I had been only toying with nonparametric methods and had no real idea how I might go about doing it.
It would be nice to see some cross-validation to test the predictiveness of statcast-trajectory-based stats like these. It seems to me the real question is “over what sample size are these stats more predictive of near-future performance than career averages?”, which is a pretty deep question and will probably require a few more years of statcast data to really chew through.
Scott, you mentioned the use of x, y coordinates of the batted ball – I believe this is coded as hc_x and hc_y when using the statcast search. Do you know the location of home plate and other bases when using their coordinate system? Thanks.