Using Statcast Data to Predict Hits
by Bill Petti
June 14, 2016

Giancarlo Stanton has the hardest-hit ball, but the launch angle prevented it from being a hit. (via Arturo Pardavila III)

Introduction

Ever since Statcast was announced as a concept, I have been waiting to get access to the data so I could begin examining to what extent we can build models with it. We know various teams have been doing similar work with the data available to them for a few years now, but, given how MLBAM essentially made PITCHf/x publicly available, the hope was that the general public would similarly have access to the new data generated by the Statcast system.

A little over two months into the season, there have been quite a few batted balls that include the three most critical pieces of Statcast information: batted ball distance, launch angle, and exit velocity. In fact, we have more than 40,000. That seems like a good amount to build and validate at least an initial model of how those three factors determine whether a ball will fall for a hit or be turned into an out.

Although the data are new, some similar work has already been released by a number of researchers. Alan Nathan has done some amazing work with both HITf/x data and now the Statcast data here at The Hardball Times on what constitutes optimal swing and contact. Jonathan Judge, Nick Wheatley-Schaller, and Sean O'Rourke of Baseball Prospectus have examined to what extent there may be park factors that affect the raw exit velocity readings we see. Rob Arthur has recently published some work at FiveThirtyEight in which he appears to model expected hitting outcomes based on batted ball launch angle and exit velocity. There is no code, nor much in the way of a discussion of the methodology he used, but it appears he took an approach similar to what I've done here; namely, using a random forest algorithm to classify batted balls into different outcomes based on elements of Statcast data. Rob was looking at linear weights, while I am starting simply, with whether a batted ball is an out or a hit.

Data and Modeling Methodology

To build the model, I used data acquired from Baseball Savant and its Statcast Search feature. I created a list of MLBAM IDs for all players who put a ball in play in 2016. I then adopted some code shared by Jonah Pemstein to loop through the list of MLBAM IDs and download the available Statcast data in csv format. The data I pulled included games played through May 28 of this year. That provided me with more than 38,000 records. I then coded each record for whether the batted ball resulted in a hit (1) or an out (0). I also coded which team was fielding and which team's park the ball was hit in.

I considered a number of variables for the model: launch angle, batted ball speed, overall distance of the hit, the location of the batted ball (i.e., the hc_x and hc_y coordinates of the ball on the field), the handedness of the batter, the fielding team, and the park.

I sampled 15 percent of the data and used that to train my models. Both the training and test sets had similar ratios of hits to non-hits (34 percent hits, 65 percent non-hits) and were similar in terms of the distributions of the features (predictor variables). Additionally, I trained and validated the models at different sampling ratios and found that 15 percent appeared to be the lowest ratio required to build the optimal model. This left more out-of-sample data to apply the model to for additional analysis. With random forest models, the model generally fits the training data almost perfectly, so applying the model back to the training data is a futile exercise, and mixing the training data in with the rest of the data would introduce a chunk of overfit observations.
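To make the workflow concrete, here is a minimal sketch of that preparation step, assuming the Baseball Savant export has been saved to a local csv. The file name, the event labels and the column names (events, hit_speed, hit_angle, hit_distance_sc) are illustrative assumptions rather than the exact fields in the export, and this is not the actual code from the GitHub repository referenced below:

```r
# A minimal sketch of the data preparation described above (illustrative only).
library(dplyr)

# Baseball Savant export saved locally; the file name is hypothetical.
statcast <- read.csv("savant_batted_balls_2016.csv", stringsAsFactors = FALSE)

# Code each batted ball as a hit (1) or an out (0).
# The event labels are assumptions; the actual export may use different strings.
hit_events <- c("Single", "Double", "Triple", "Home Run")
batted_balls <- statcast %>%
  filter(!is.na(hit_speed), !is.na(hit_angle), !is.na(hit_distance_sc)) %>%
  mutate(hit = factor(ifelse(events %in% hit_events, 1, 0)))

# Train on a 15 percent sample; hold out the remaining 85 percent for testing.
set.seed(42)
train_idx <- sample(nrow(batted_balls), size = floor(0.15 * nrow(batted_balls)))
train <- batted_balls[train_idx, ]
test  <- batted_balls[-train_idx, ]

# Check that the hit/out ratios are similar in the two sets.
prop.table(table(train$hit))
prop.table(table(test$hit))
```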
Next, continuous variables were transformed using the scale function in R so that each variable had a mean of zero and a standard deviation of one. Transforming continuous variables that are normally on different scales can aid in building a classification or regression model.

Armed with a pre-processed training set, I experimented with a few different versions of the model using the random forest algorithm. Random forests can be used for both classification (hit or out) and regression, but in this case I am focusing on classification.

Model Results & Discussion

I experimented with a number of different models tuned in different ways. Model 1 included all the variables mentioned above. Model 2 excluded park, batter handedness and the fielding team. Model 3 included only batted ball distance, speed off the bat and launch angle. Model 4 included only batted ball distance. Model 5 included launch angle, speed off the bat, park, batter handedness and the fielding team.

The results of the five models are compared below. The data reflect each model's performance when applied to the holdout or test data (data not used to build the model):

COMPARING HIT CLASSIFICATION MODELS USING OUT-OF-SAMPLE DATA

Performance Metrics      Model 1   Model 2   Model 3   Model 4   Model 5
Sensitivity (Recall)      85.80%    83.70%    73.90%    52.30%    68.90%
Precision                 85.80%    90.50%    84.20%    54.90%    66.70%
Specificity               92.20%    95.20%    92.30%    76.20%    81.00%
Overall Accuracy          89.90%    91.10%    85.70%    67.70%    76.70%
F1 Measure                 0.858     0.870     0.787     0.536     0.678

SOURCE: Statcast/BaseballSavant.com
Models built and tuned in R using the randomForest package (randomForest 4.6-12).

I've used five performance metrics to judge the models. First, the sensitivity or recall of the model, which is simply how well the model identifies true positives (True Positives / (True Positives + False Negatives)). Second, the precision of the model, which reflects how accurate the model is when it predicts a positive result (True Positives / (True Positives + False Positives)). Third, specificity, which refers to how well the model classifies negative cases (True Negatives / (True Negatives + False Positives)). Fourth, the overall accuracy of the model, which is simply the percentage of correct classifications it makes ((True Positives + True Negatives) / All Cases). Finally, the F1 score (or, simply, F score), which provides a measure of how well the model balances its precision and recall (2 * (Precision * Recall) / (Precision + Recall)).
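For concreteness, here is a hedged sketch of how a model along the lines of Model 5 might be fit and those metrics computed in R, continuing the illustrative example above. The park, stand (batter handedness) and fielding_team column names are assumptions, not necessarily the variables used in the original code:

```r
# Sketch of the preprocessing, model fit and evaluation, continuing the
# illustrative example above. Column names remain assumptions.
library(randomForest)

# Standardize the continuous predictors (mean zero, sd one) with scale(),
# keeping the training center and spread so test data is transformed the same way.
speed_scaled <- scale(train$hit_speed)
angle_scaled <- scale(train$hit_angle)
standardize  <- function(x, scaled) {
  (x - attr(scaled, "scaled:center")) / attr(scaled, "scaled:scale")
}
train$hit_speed_sc <- as.numeric(speed_scaled)
train$hit_angle_sc <- as.numeric(angle_scaled)
test$hit_speed_sc  <- standardize(test$hit_speed, speed_scaled)
test$hit_angle_sc  <- standardize(test$hit_angle, angle_scaled)

# Categorical predictors need to be factors with matching levels in both sets.
for (col in c("park", "stand", "fielding_team")) {
  lvls <- union(train[[col]], test[[col]])
  train[[col]] <- factor(train[[col]], levels = lvls)
  test[[col]]  <- factor(test[[col]],  levels = lvls)
}

# Roughly Model 5: launch angle, speed off the bat, park, batter handedness
# and the fielding team.
rf_fit <- randomForest(hit ~ hit_speed_sc + hit_angle_sc + park + stand + fielding_team,
                       data = train, ntree = 500)

# Apply the fit to the held-out test data and build a confusion matrix.
pred <- predict(rf_fit, newdata = test)
cm   <- table(predicted = pred, actual = test$hit)

tp <- cm["1", "1"]; fp <- cm["1", "0"]
tn <- cm["0", "0"]; fn <- cm["0", "1"]

sensitivity <- tp / (tp + fn)   # recall
precision   <- tp / (tp + fp)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / sum(cm)
f1          <- 2 * (precision * sensitivity) / (precision + sensitivity)
```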
The first and second models are the best performing when we consider all the metrics together. Model 3 does have strong overall accuracy, but its F1 measure is lower than Model 1's. However, if we think about what we are including in those first three models, it becomes clear they are "cheating" compared to Model 5. Why? Because those models include information about where the ball was hit that is based on the end of the play. The batted ball location provided in the Statcast data tells us where the ball was touched by a fielder. But what we are more interested in is the general direction of the batted ball, especially for ground balls. Think about ground balls that get through the infield for a hit. They appear to travel farther and end up with a different hit location than similarly struck balls that happen to be stopped by an infielder. What would be better is simply the horizontal angle of the batted ball. Using the actual distance measure potentially introduces similar bias into the model.

For this reason, I have used Model 5 as the go-forward model for classifying hits. Model 5 makes full use of launch angle and speed off the bat, as well as controls for batter handedness, the park the ball was hit in and the fielding team.

Each of the models does a better job predicting outs than hits, and Model 5 is no different. We can see the difference by plotting the correct classifications in orange and the incorrect classifications in blue, and comparing actual hits and outs side by side. For outs it is much tougher to make out the blue incorrect classifications, whereas we can see more incorrect classifications when it comes to hits. In either case, the accuracy is extremely high for both hits and outs.

Random forests classify by calculating the probability that an observation belongs to each class (here, hit or out) and then assigning the observation to the class with the highest probability. We can take the raw probabilities from the model when applied to the out-of-sample test set and graph the predicted probability of a hit for different combinations of the new Statcast data, namely distance, speed and angle.

We've seen that a launch angle between 20 and 40 degrees appears to be optimal, and when you mix in batted ball speed over 100 miles per hour you are looking at essentially a guaranteed hit, in many cases a home run. That aligns with the predicted probabilities of the model. Comparing launch angle to distance, we can see that a launch angle between zero and 20 degrees gives hitters a high probability of avoiding an out for any distance over 150 feet. When the launch angle gets too high (e.g., over 20 degrees), batters are generally looking at fly outs between 250 and 360 feet. The model gives balls hit to a distance of less than 100 feet a greater chance of being converted to outs, regardless of speed off the bat. This begins to flip as distance increases, which is related to launch angle. We can also see the diagonal boundary that indicates how distance relates to increasing speed. These graphs align nicely with similar visualizations of batting outcomes based on Statcast or HITf/x batted ball data, so this provides additional confidence that the model is sound.

Another way to examine the fit between the model and actual outcomes is to compare the average predicted probability of a hit, for ranges of batted ball angle and speed, with the actual average probability. I bucketed the test set in ranges of 10 degrees and 10 miles per hour and then compared the average probability of a hit for the actual outcomes with the average generated by the model. The fit is pretty good. The data points that are a significant distance from the regression line (basically, where the actual data and the model disagree) are generally cases where we have seen only 10 or fewer actual batted balls with those characteristics.
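For reference, here is a hedged sketch of how those raw probabilities can be pulled from the fitted forest and bucketed for this comparison, again continuing the illustrative example with the same assumed column names:

```r
# Sketch: extract raw hit probabilities and compare them to actual outcomes
# in 10-degree / 10-mph buckets. Column names are illustrative assumptions.
library(dplyr)

# type = "prob" returns class probabilities instead of the predicted class.
test$hit_prob <- predict(rf_fit, newdata = test, type = "prob")[, "1"]

bucketed <- test %>%
  mutate(angle_bucket = cut(hit_angle, breaks = seq(-90, 90, by = 10)),
         speed_bucket = cut(hit_speed, breaks = seq(0, 130, by = 10))) %>%
  group_by(angle_bucket, speed_bucket) %>%
  summarise(n = n(),
            actual_hit_rate = mean(hit == "1"),
            model_hit_prob  = mean(hit_prob)) %>%
  ungroup()

# Buckets with 10 or fewer batted balls are where model and reality diverge most.
plot(bucketed$model_hit_prob, bucketed$actual_hit_rate,
     xlab = "Average predicted probability of a hit",
     ylab = "Actual hit rate")
abline(0, 1, lty = 2)
```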
However, the model does appear to have some systematic issues with how well it classifies batted balls. Let's plot and compare the location of batted balls that the model correctly predicted versus those where it missed. We see that, as noted above, the model tends to call outs better than hits. This appears related, to some extent, to the distance of the batted ball. The model can infer distance only from launch angle and speed off the bat, since we excluded distance and batted ball location from the model. This is easier to see if we plot the accuracy of the model for different batted ball distances. The model performs quite well on balls hit below 90 feet, but its accuracy declines dramatically between 90 and 200 feet. Accuracy rises again and sees a dip only between 350 and 400 feet. (Yes, I know the buckets below are not uniform, but I chose them based more on practical distances in a baseball setting, given positioning, etc.) My initial guess is this is partly, but not entirely, due to the lack of horizontal angle in the model. Including horizontal angle would help the model pick up where fielders normally are positioned and therefore when balls are more or less likely to be converted to outs, even considering how well they are struck.

Applying the model back to the actual data allows us not only to classify events as hits or outs, but also to derive a probability that they will be hits. Armed with this probability, we can look at batted balls in a slightly different way. For example, in a June 9 game against the Twins, Giancarlo Stanton hit the hardest ball so far recorded by Statcast, a 123.88 mph scorcher. However, it didn't result in a hit, but rather a double play started by the second baseman. While Stanton's 123 mph batted ball was incredibly impressive, he hit it with a launch angle of -4.83 degrees. Taken together with the other variables, the model predicted that batted ball had only a 45 percent chance of being a hit. If we look at similar batted balls, we can see why the model was not more optimistic about that ball being a hit. In fact, if anything the model was more optimistic than the empirical record suggests (largely driven by the extreme exit velocity), as only one ball struck in a similar manner ended up being a hit. My initial guess is that if the model took horizontal angle more explicitly into account, the probability would have been lower.

On May 19, Troy Tulowitzki crushed a ball to center field at 107.11 mph and a launch angle of 23.91 degrees. The ball traveled nearly 400 feet before it was snagged for an out. Tulowitzki's ball had a 99.6 percent chance of being a hit according to the model. As you can see below, almost all balls hit in a similar way ended up as hits.

Summing Up

This is just an introduction to the model. I plan to perform more diagnostics on it, but I feel comfortable with the initial performance. Now, random forests inherently mitigate overfitting (notice I said mitigate, not eliminate), and the application of the model to completely out-of-sample data does provide some comfort. The model was trained on 5,000 observations and tested and validated on another 28,000. So I do not think overfitting is an issue, but I never like to assume anything.

The model also does an inherently better job predicting outs than hits. Again, I think this is in part due to the lack of horizontal angle information. I plan to incorporate that into a future version. My guess is the overall accuracy, as well as the precision and recall of the model, should increase pretty nicely as a result.

I also plan to expand the model to predict not just hits or outs, but also the type of hit (i.e., single, double, etc.). That would allow us to develop expected wOBA on batted balls and compare it to actual performance, as well as begin looking at how predictive Statcast-based expected wOBA would be and how many observations we would need before an equal number of league-average observations would have to be mixed in to accurately estimate true talent.

Finally, I am exploring ways to make the predicted probabilities available for both previous batted balls and new batted balls. It may be something as simple as a table, or possibly even a tool where you can input the characteristics of a batted ball and see its probability of being a hit.
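As a very rough illustration, such a tool could be little more than a thin wrapper around the fitted model. The sketch below builds on the illustrative objects defined earlier (rf_fit, standardize, speed_scaled, angle_scaled); the column names and the example park/team values are assumptions, not the eventual implementation:

```r
# Sketch: a tiny lookup that takes a batted ball's characteristics and returns
# the model's predicted probability of a hit, using the illustrative fit above.
hit_probability <- function(speed_mph, angle_deg, park, stand, fielding_team) {
  new_ball <- data.frame(
    hit_speed_sc  = standardize(speed_mph, speed_scaled),
    hit_angle_sc  = standardize(angle_deg, angle_scaled),
    park          = factor(park, levels = levels(train$park)),
    stand         = factor(stand, levels = levels(train$stand)),
    fielding_team = factor(fielding_team, levels = levels(train$fielding_team))
  )
  predict(rf_fit, newdata = new_ball, type = "prob")[, "1"]
}

# Example: a ball struck at 123.88 mph with a -4.83 degree launch angle,
# roughly the Stanton double-play ball discussed above (park/team values are
# placeholders).
hit_probability(123.88, -4.83, park = "MIA", stand = "R", fielding_team = "MIN")
```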
More on that soon. As always, suggestions welcome.

References & Resources

All code and data for this analysis can be found on GitHub.
CRAN R Project, "randomForest: Breiman and Cutler's Random Forests for Classification and Regression," Version 4.6-12
Many thanks to MLBAM and Baseball Savant for making the Statcast data available via Statcast Search
For a great introduction and summary of classification model evaluation criteria, see Jason Brownlee, Machine Learning Mastery, "Classification Accuracy is Not Enough: More Performance Measures You Can Use"
Alan Nathan, The Hardball Times, "Going Deep on Goin' Deep"
Jonathan Judge, Nick Wheatley-Schaller and Sean O'Rourke, Baseball Prospectus, "Prospectus Feature: The Need for Adjusted Exit Velocity"
Bo Moore, Galvanize, "What Counting Jelly Beans Can Teach Us About Machine Learning"
Rob Arthur, FiveThirtyEight, "Who's Hitting The Ball Harder This Year, And Who's Just Getting Lucky?"