How Many Useful Dimensions of Data Does Trackman Report?

Less is more. Less can also be equally good. (via Michelle Jay)

Trackman data (and prior to a few years ago, PITCHf/x data) has redefined how baseball analysts evaluate players. For each pitch thrown, and each ball put into play, we have a wealth of statistical data at our fingertips. Looking specifically at the pitch data on Baseball Savant, 19 dimensions of measurements are reported for each pitch.

While that is certainly a remarkable number, a keen person would observe that some of the dimensions are related to each other, and that the number of independent dimensions is less than 19. Furthermore, I suspect it is not necessary to use all these dimensions to get good results. This article will explore how many of these dimensions are useful for analysis purposes and what the effect is of using fewer dimensions in an analysis.

A Non-baseball Analogy

Let me pose a non-baseball analogy. If you are buying a car, you might see an advertisement from a dealership that screams “19 free items with the purchase of a new car!” It is easy to get excited at this prospect; everybody loves getting free stuff. Of these 19 items, some are probably quite valuable (such as free oil changes for the lifetime of the car), and some are practically worthless (such as pens with the dealership’s name on them). An intelligent person would read between the lines and interpret this message to say “Three valuable items, seven moderately useful items, and nine worthless items with the purchase of a new car!” An even more intelligent person might assign dollar values to each of the items.

In a similar manner, I will attempt to quantify the value of the dimensions of Trackman data.

Examining the Trackman Data

For this study, I used Trackman data from all the pitches of the 2018 major league regular season, which number slightly over 700,000.

These are the dimensions of the data:

  • Release speed
  • Release location (x, y, and z components)
  • Plate crossing (x and z components)
  • Pitch break (x and z components)
  • Velocity (x, y, and z components)
  • Acceleration (x, y, and z components)
  • Strike zone (top and bottom)
  • Spin rate
  • Release extension
  • Effective speed (adjusted for release point and extension)

It is intuitive to see that some dimensions are related. For example, if we take the magnitude of the three velocity components (which are measured 50 feet in front of home plate), we will have the speed. The speed 50 feet in front of home plate is going to be very close to the release speed, which is already reported as a separate dimension. Having this extra dimension probably won’t make things worse (as long as the measurement is accurate), but it may not help either. Mathematically speaking, the absolute value of the correlation coefficient between the release speed and the velocity y-component is 0.9997, which indicates a very strong correlation between those values.
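
If you want to verify that correlation yourself, a minimal sketch in Python with pandas follows; the file name and column names are assumptions based on Baseball Savant’s CSV export:

```python
import pandas as pd

# Hypothetical Baseball Savant CSV export of 2018 regular-season pitches.
pitches = pd.read_csv("savant_2018.csv")

# "vy0" is the y-component of velocity, measured 50 feet in front of home
# plate. It is negative (the ball travels toward the plate), hence the
# absolute value of the correlation.
r = pitches["release_speed"].corr(pitches["vy0"])
print(f"|r| = {abs(r):.4f}")  # the article reports 0.9997
```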

There are other relationships between the dimensions that are somewhat obvious, and then there may be other, more obscure relationships that are not obvious at all.

Principal Component Analysis Background

The technique we will use for transforming the Trackman data is called Principal Component Analysis. In short, PCA is a technique that will take correlated data and transform it to create new dimensions that are uncorrelated. Furthermore, these new, uncorrelated dimensions are arranged in order of decreasing variance.

Here is a classic example of PCA in two dimensions:

The Cartesian grid defines the usual x and y axes. For this data set, most people would agree the x and y axes are not the best possible axes you could pick. Intuitively, we can see the data are “rotated,” and there probably is a better set of axes out there. Using PCA will give us what we are looking for. The longer arrow line represents the first principal axis, and the shorter arrow line represents the second principal axis. If you imagine those arrow lines as the new x and y axes, the data will appear more “straight” and will be easier to visualize and analyze.
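
This classic example is easy to reproduce. Here is a small sketch that generates a similar “rotated” two-dimensional cloud with synthetic data and recovers the two principal axes with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A synthetic "rotated" cloud: lots of variance along one diagonal
# direction, much less along the perpendicular direction.
theta = np.radians(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
data = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ rotation.T

pca = PCA(n_components=2).fit(data)

# components_ holds the principal axes (the two "arrow lines");
# explained_variance_ gives the variance along each axis.
print(pca.components_)          # first row is, up to sign, about (cos 30°, sin 30°)
print(pca.explained_variance_)  # roughly [9.0, 0.25]
```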

As mentioned earlier, the first principal component will yield the largest possible variance in the data. In some applications, such as when dealing with measurement errors, variance is considered to be bad, and it is desirable to minimize it. However, in this case, variance is considered to be good, and it is desirable to maximize it. If you think about an extreme case, where a principal component has no variance (all the values in that component have the same value), then that component is useless because it doesn’t convey any useful information about the data points.

Applying PCA to the Trackman data

Now let’s apply PCA to the Trackman data. We will be computing the principal components in the same manner as was explained in the above example, except with more dimensions. I used the PCA implementation in Python’s scikit-learn library. Note that since the original measurements are not on the same scale (for example, a typical value for pitch speed in miles per hour is about 90, while a typical value for strike zone in feet is about three), we will normalize each dimension of data to have an average value of zero and a standard deviation of one before applying PCA.
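
In code, the whole pipeline is only a few lines. This is a minimal sketch, assuming a Baseball Savant CSV export; the file name and column list are stand-ins for the 19 dimensions listed above:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pitches = pd.read_csv("savant_2018.csv")  # hypothetical export, as before

# Stand-ins for the 19 dimensions listed above; the exact names depend
# on the Baseball Savant export.
feature_cols = [
    "release_speed",
    "release_pos_x", "release_pos_y", "release_pos_z",
    "plate_x", "plate_z",
    "pfx_x", "pfx_z",
    "vx0", "vy0", "vz0",
    "ax", "ay", "az",
    "sz_top", "sz_bot",
    "release_spin_rate", "release_extension", "effective_speed",
]
X = pitches[feature_cols].dropna()

# Normalize each dimension to mean 0 and standard deviation 1, then
# rotate the data onto its principal axes.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
scores = pca.transform(X_scaled)  # each pitch expressed in the new axes
```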


After adopting the principal components as the new axes for the data, we can determine the percentage of explained variance per component. This is directly related to how “important” each component is.

This exponential, decay-like graph is common in these scenarios. The first component contains a third of the total variance, while the second component contains less than half that amount. The last seven components contain almost no variance (while the graph shows values of zero, the actual values are non-zero but very small). We can conclude there are 12 useful dimensions in this data set.
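
Those percentages come straight off the fitted model. Continuing from the pca object in the sketch above, one might count the useful dimensions like this; the 0.1% cutoff is my own illustrative choice, not a scikit-learn default:

```python
import numpy as np

# Percentage of total variance explained by each principal component.
ratios = pca.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC{i:2d}: {r:.1%}")

# Count the components carrying essentially all the variance; the 0.1%
# threshold is an arbitrary choice for illustration.
print(np.sum(ratios > 0.001), "useful dimensions")
```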

Using the New Data Values in an Analysis

I will now incorporate PCA and repeat the analysis I did in this previous article on predicting swinging strikes. I am curious how many of these components are necessary to get a good result.

The data I used in the previous article were based on PITCHf/x data from several years ago, which lack the effective speed and release extension dimensions but contain a plate velocity dimension, resulting in a total of 18 dimensions. Upon computing the explained variance numbers, I saw that all the relevant variance is contained in the first 11 components. Thus, there is no need to use more than 11 components in doing the swinging-strike prediction. However, I speculate we can get good results with even fewer than 11.

Recall that in the previous article, I built models using logistic regression and random forests and evaluated the fidelity of those models with the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve. This time, I will start by building a model with only the first component, then build successive models by adding one additional component at a time.
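
A sketch of that procedure, continuing from the earlier snippets: scores holds the PCA-transformed pitches, and the swinging-strike label is a hypothetical 0/1 column, with "description" and its values being assumptions about the CSV export:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Align a hypothetical 0/1 swinging-strike label with the rows kept in X.
y = pitches.loc[X.index, "description"].eq("swinging_strike").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    scores, y, test_size=0.25, random_state=0)

for k in range(1, scores.shape[1] + 1):
    for name, model in [
        ("logistic regression", LogisticRegression(max_iter=1000)),
        ("random forests", RandomForestClassifier(n_estimators=100)),
    ]:
        model.fit(X_train[:, :k], y_train)  # first k components only
        probs = model.predict_proba(X_test[:, :k])[:, 1]
        print(f"{k:2d} components, {name}: AUC = "
              f"{roc_auc_score(y_test, probs):.3f}")
```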

For both the logistic regression and random forests models, we see that at six components, barely more than half of the total, the AUC has already plateaued, with only marginal gains after that point. At lower numbers of components, random forests does somewhat worse, while logistic regression does quite well with even a single component. It is nice to be able to judge the outcome of a swing with a single number instead of an array of many numbers. Note that even if you use only the first principal component, that one component contains some information about all of the original dimensions. It’s as if a magical oracle had hand-selected the best attributes of each of the original dimensions and packaged them together into a single component.

The price of using transformed data is that it is more difficult to interpret the results and draw conclusions. With the original data set, it is not too difficult to make a statement such as, “The batter missed the pitch because it was located low and away from the strike zone.” Making such a statement with the transformed data set is not impossible but is certainly not straightforward.
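
One partial remedy is to inspect the loadings, i.e., how strongly each original dimension contributes to each component. A sketch, reusing the fitted pca and the feature_cols list from earlier:

```python
import pandas as pd

# Loadings: the weight of each original dimension in each principal
# component. Large-magnitude entries hint at what a component "means."
loadings = pd.DataFrame(
    pca.components_,
    columns=feature_cols,
    index=[f"PC{i + 1}" for i in range(pca.components_.shape[0])],
)
print(loadings.loc["PC1"].sort_values(key=abs, ascending=False).head())
```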

Now, the truth is that neither 18 nor 19 dimensions is a huge number. With today’s cloud computing power, number crunching on 18 or 19 dimensions is fast, so many people may not be inclined to apply PCA (or dimensionality reduction in general) to this data set. However, it is not uncommon to have data sets with hundreds or even thousands of dimensions. In those cases, it is often worth your while to reduce the number to avoid the curse of dimensionality. A model that uses several dozen dimensions will always be preferred over one that uses several thousand if it can achieve similar accuracy and fidelity.

Conclusion

There’s the old adage “less is more,” and I think it applies quite appropriately when dealing with Trackman or PITCHf/x data. For those who aren’t quite convinced that is the case, I think we can agree that at the very least, “less is equally good.” We showed how we can use PCA to transform the pitch data, determine that the pitch data has 12 useful dimensions, and use this lower-dimensional version of the transformed pitch data to achieve good results at the swinging-strike prediction task. While today’s high-powered computers can crunch the current version of Trackman data efficiently, perhaps one day Trackman will invent an upgraded hardware system that reports hundreds of dimensions per pitch, in which case PCA will be a valuable tool to have in our back pockets.


Roger works as a software engineer by day, writes for The Hardball Times and FanGraphs by night, and has also worked for a Major League club.
Comments
Ken
5 years ago

Interesting article. Thanks for all the work that went into this. What are the six most useful dimensions?

Just A Guy
5 years ago
Reply to  Ken

I am no data scientist so half of this article went over my head, but I think the analysis isn’t looking specifically for which dimensions are important. The dimensions aren’t independent so you can use a different combination of them to get the same data. The question is how many of those values do you need before you start to get little benefit from adding more.

For example, plate crossing location is a function of release location and pitch break. It probably doesn’t matter which two of those values you have as long as you have two of them. There is likely some complicated physics formula that can be used to deduce the third value from the existing two. That reduces the benefit of being provided the third value.

slinger
5 years ago
Reply to  Just A Guy

Actually it does make a difference which of the dimensions are correlated, because the dimensions explain different amounts of the variation – i.e., the % of explained variance graph. Implications are very different if it’s the first 6 vs. the last 6 variables.
Anyway – nice article!

Jetsy Extrano
5 years ago

What’s the first component? I’m going to guess it’s closely aligned with release speed. And that the second component is to do with total movement.

Jetsy Extrano
5 years ago
Reply to  Roger Cheng

Thanks Roger! I guessed speed, but I was all wrong about speed (and drag and friends) dominating everything else. I wonder what those other 0.3+ coefficients are doing, like z acceleration.

Oh, maybe that represents “fast fastball”?