How Many Useful Dimensions of Data Does Trackman Report?

Less is more. Less can also be equally good. (via Michelle Jay)

Trackman data (and prior to a few years ago, PITCHf/x data) has redefined how baseball analysts evaluate players. For each pitch thrown, and each ball put into play, we have a wealth of statistical data at our fingertips. Looking specifically at the pitch data on Baseball Savant, 19 dimensions of measurements are reported for each pitch.

While that is certainly a remarkable number, a keen person would observe that some of the dimensions are related to each other, and that the number of independent dimensions is less than 19. Furthermore, I suspect it is not necessary to use all these dimensions to get good results. This article will explore how many of these dimensions are useful for analysis purposes and what the effect is of using fewer dimensions in an analysis.

A Non-baseball Analogy

Let me pose a non-baseball analogy. If you are buying a car, you might see an advertisement from a dealership that screams “19 free items with the purchase of a new car!” It is easy to get excited at this prospect; everybody loves getting free stuff. Of these 19 items, some are probably quite valuable (such as free oil changes for the lifetime of the car), and some are practically worthless (such as pens with the dealership’s name on them). An intelligent person would read between the lines and interpret this message to say “Three valuable items, seven moderately useful items, and nine worthless items with the purchase of a new car!” An even more intelligent person might assign dollar values to each of the items.

In a similar manner, I will attempt to quantify the value of the dimensions of Trackman data.

Examining the Trackman Data

For this study, I used Trackman data from all the pitches of the 2018 major league regular season, which number slightly over 700,000.

These are the dimensions of the data:

  • Release speed
  • Release location (x, y, and z components)
  • Plate crossing (x and z components)
  • Pitch break (x and z components)
  • Velocity (x, y, and z components)
  • Acceleration (x, y, and z components)
  • Strike zone (top and bottom)
  • Spin rate
  • Release extension
  • Effective speed (adjusted for release point and extension)

It is intuitive to see that some dimensions are related. For example, if we take the magnitude of the three velocity components (which are measured 50 feet in front of home plate), we will have the speed. The speed 50 feet in front of home plate is going to be very close to the release speed, which is already reported as a separate dimension. Having this extra dimension probably won’t make things worse (as long as the measurement is accurate), but it may not help either. Mathematically speaking, the absolute value of the correlation coefficient between the release speed and the velocity y-component is 0.9997, which indicates a very strong correlation between those values.
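
If you want to verify that correlation yourself, a minimal sketch in Python with pandas follows; the file name and column names are assumptions based on Baseball Savant’s CSV export:

```python
import pandas as pd

# Hypothetical Baseball Savant CSV export of 2018 regular-season pitches.
pitches = pd.read_csv("savant_2018.csv")

# "vy0" is the y-component of velocity, measured 50 feet in front of home
# plate. It is negative (the ball travels toward the plate), hence the
# absolute value of the correlation.
r = pitches["release_speed"].corr(pitches["vy0"])
print(f"|r| = {abs(r):.4f}")  # the article reports 0.9997
```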

There are other relationships between the dimensions that are somewhat obvious, and then there may be other, more obscure relationships that are not obvious at all.

Principal Component Analysis Background

The technique we will use for transforming the Trackman data is called Principal Component Analysis. In short, PCA is a technique that will take correlated data and transform it to create new dimensions that are uncorrelated. Furthermore, these new, uncorrelated dimensions are arranged in order of decreasing variance.

Here is a classic example of PCA in two dimensions:

The Cartesian grid defines the usual x and y axes. For this data set, most people would agree the x and y axes are not the best possible axes you could pick. Intuitively, we can see the data are “rotated,” and there probably is a better set of axes out there. Using PCA will give us what we are looking for. The longer arrow line represents the first principal axis, and the shorter arrow line represents the second principal axis. If you imagine those arrow lines as the new x and y axes, the data will appear more “straight” and will be easier to visualize and analyze.
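
This classic example is easy to reproduce. Here is a small sketch that generates a similar “rotated” two-dimensional cloud with synthetic data and recovers the two principal axes with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A synthetic "rotated" cloud: lots of variance along one diagonal
# direction, much less along the perpendicular direction.
theta = np.radians(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
data = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ rotation.T

pca = PCA(n_components=2).fit(data)

# components_ holds the principal axes (the two "arrow lines");
# explained_variance_ gives the variance along each axis.
print(pca.components_)          # first row is, up to sign, about (cos 30°, sin 30°)
print(pca.explained_variance_)  # roughly [9.0, 0.25]
```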

As mentioned earlier, the first principal component will yield the largest possible variance in the data. In some applications, such as when dealing with measurement errors, variance is considered to be bad, and it is desirable to minimize it. However, in this case, variance is considered to be good, and it is desirable to maximize it. If you think about an extreme case, where a principal component has no variance (all the values in that component have the same value), then that component is useless because it doesn’t convey any useful information about the data points.

Applying PCA to the Trackman data

Now let’s apply PCA to the Trackman data. We will be computing the principal components in the same manner as was explained in the above example, except with more dimensions. I used the PCA implementation in Python’s scikit-learn library. Note that since the original measurements are not on the same scale (for example, a typical value for pitch speed in miles per hour is about 90, while a typical value for strike zone in feet is about three), we will normalize each dimension of data to have an average value of zero and a standard deviation of one before applying PCA.
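
In code, the whole pipeline is only a few lines. This is a minimal sketch, assuming a Baseball Savant CSV export; the file name and column list are stand-ins for the 19 dimensions listed above:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pitches = pd.read_csv("savant_2018.csv")  # hypothetical export, as before

# Stand-ins for the 19 dimensions listed above; the exact names depend
# on the Baseball Savant export.
feature_cols = [
    "release_speed",
    "release_pos_x", "release_pos_y", "release_pos_z",
    "plate_x", "plate_z",
    "pfx_x", "pfx_z",
    "vx0", "vy0", "vz0",
    "ax", "ay", "az",
    "sz_top", "sz_bot",
    "release_spin_rate", "release_extension", "effective_speed",
]
X = pitches[feature_cols].dropna()

# Normalize each dimension to mean 0 and standard deviation 1, then
# rotate the data onto its principal axes.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
scores = pca.transform(X_scaled)  # each pitch expressed in the new axes
```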


After adopting the principal components as the new axes for the data, we can determine the percentage of explained variance per component. This is directly related to how “important” each component is.

This exponential, decay-like graph is common in these scenarios. The first component contains a third of the total variance, while the second component contains less than half that amount. The last seven components contain almost no variance (while the graph shows values of zero, the actual values are non-zero but very small). We can conclude there are 12 useful dimensions in this data set.
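
Those percentages come straight off the fitted model. Continuing from the pca object in the sketch above, one might count the useful dimensions like this; the 0.1% cutoff is my own illustrative choice, not a scikit-learn default:

```python
import numpy as np

# Percentage of total variance explained by each principal component.
ratios = pca.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC{i:2d}: {r:.1%}")

# Count the components carrying essentially all the variance; the 0.1%
# threshold is an arbitrary choice for illustration.
print(np.sum(ratios > 0.001), "useful dimensions")
```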

Using the New Data Values in an Analysis

I will now incorporate PCA and repeat the analysis I did in this previous article on predicting swinging strikes. I am curious how many of these components are necessary to get a good result.

The data I used in the previous article were based on PITCHf/x data from several years ago, which lack the effective speed and release extension dimensions but contain a plate velocity dimension, resulting in a total of 18 dimensions. Upon computing the explained variance numbers, I saw that all the relevant variance is contained in the first 11 components. Thus, there is no need to use more than 11 components in doing the swinging-strike prediction. However, I speculate we can get good results with even fewer than 11.

Recall that in the previous article, I built models using logistic regression and random forests and evaluated the fidelity of those models with the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve. This time, I will start by building a model with only the first component, then build successive models by adding one additional component at a time.
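
A sketch of that procedure, continuing from the earlier snippets: scores holds the PCA-transformed pitches, and the swinging-strike label is a hypothetical 0/1 column, with "description" and its values being assumptions about the CSV export:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Align a hypothetical 0/1 swinging-strike label with the rows kept in X.
y = pitches.loc[X.index, "description"].eq("swinging_strike").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    scores, y, test_size=0.25, random_state=0)

for k in range(1, scores.shape[1] + 1):
    for name, model in [
        ("logistic regression", LogisticRegression(max_iter=1000)),
        ("random forests", RandomForestClassifier(n_estimators=100)),
    ]:
        model.fit(X_train[:, :k], y_train)  # first k components only
        probs = model.predict_proba(X_test[:, :k])[:, 1]
        print(f"{k:2d} components, {name}: AUC = "
              f"{roc_auc_score(y_test, probs):.3f}")
```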

For both the logistic regression and random forests models, we see that at six components, barely more than half of the total, the AUC has already plateaued, with only marginal gains after that point. At lower numbers of components, random forests does somewhat worse, while logistic regression does quite well with even a single component. It is nice to be able to judge the outcome of a swing with a single number instead of an array of many numbers. Note that even if you use only the first principal component, that one component contains some information about all of the original dimensions. It’s as if a magical oracle had hand-selected the best attributes of each of the original dimensions and packaged them together into a single component.

The price of using transformed data is that it is more difficult to interpret the results and draw conclusions. With the original data set, it is not too difficult to make a statement such as, “The batter missed the pitch because it was located low and away from the strike zone.” Making such a statement with the transformed data set is not impossible but is certainly not straightforward.
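
One partial remedy is to inspect the loadings, i.e., how strongly each original dimension contributes to each component. A sketch, reusing the fitted pca and the feature_cols list from earlier:

```python
import pandas as pd

# Loadings: the weight of each original dimension in each principal
# component. Large-magnitude entries hint at what a component "means."
loadings = pd.DataFrame(
    pca.components_,
    columns=feature_cols,
    index=[f"PC{i + 1}" for i in range(pca.components_.shape[0])],
)
print(loadings.loc["PC1"].sort_values(key=abs, ascending=False).head())
```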

Now, the truth is that neither 18 nor 19 dimensions is a huge number. With today’s cloud computing power, number crunching on 18 or 19 dimensions is fast, so many people may not be inclined to apply PCA (or dimensionality reduction in general) to this data set. However, it is not uncommon to have data sets with hundreds or even thousands of dimensions. In those cases, it is often worth your while to reduce the number to avoid the curse of dimensionality. A model that uses several dozen dimensions will always be preferred over one that uses several thousand if it can achieve similar accuracy and fidelity.

Conclusion

There’s the old adage “less is more,” and I think it applies quite appropriately when dealing with Trackman or PITCHf/x data. For those who aren’t quite convinced that is the case, I think we can agree that at the very least, “less is equally good.” We showed how we can use PCA to transform the pitch data, determine that the pitch data has 12 useful dimensions, and use this lower-dimensional version of the transformed pitch data to achieve good results at the swinging-strike prediction task. While today’s high-powered computers can crunch the current version of Trackman data efficiently, perhaps one day Trackman will invent an upgraded hardware system that reports hundreds of dimensions per pitch, in which case PCA will be a valuable tool to have in our back pockets.


Roger works as a software engineer by day, writes for The Hardball Times and FanGraphs by night, and has also worked for a Major League club.
Comments
Ken
5 years ago

Interesting article. Thanks for all the work that went into this. What are the six most useful dimensions?

Just A Guy
5 years ago
Reply to  Ken

I am no data scientist so half of this article went over my head, but I think the analysis isn’t looking specifically for which dimensions are important. The dimensions aren’t independent so you can use a different combination of them to get the same data. The question is how many of those values do you need before you start to get little benefit from adding more.

For example, plate crossing location is a function of release location and pitch break. It probably doesn’t matter which two of those values you have as long as you have two of them. There is likely some complicated physics formula that can be used to deduce the third value from the existing two. That reduces the benefit of being provided the third value.

slinger
5 years ago
Reply to  Just A Guy

Actually it does make a difference which of the dimensions are correlated, because the dimensions explain different amounts of the variation – i.e., the % of explained variance graph. Implications are very different if it’s the first 6 vs. the last 6 variables.
Anyway – nice article!

Jetsy Extrano
5 years ago

What’s the first component? I’m going to guess it’s closely aligned with release speed. And that the second component is to do with total movement.

Jetsy Extrano
5 years ago
Reply to  Roger Cheng

Thanks Roger! I guessed speed, but I was all wrong about speed (and drag and friends) dominating everything else. I wonder what those other 0.3+ coefficients are doing, like z acceleration.

Oh, maybe that represents “fast fastball”?