Accounting for the “No Nulls” Solution

by Andrew Perpetua
November 9, 2017

There are certain exit velocity buckets where a fair number of batted balls aren’t being tracked. (via Andrew Perpetua)

In 2015 and 2016, Statcast had a well-publicized problem with missing data, but now in 2017 every batted ball has exit velocity and launch angle information. So that means the problem was solved, right? Wrong. The problem persists. Around 11.7 percent of batted balls on Baseball Savant have their exit velocity and launch angle decided by an algorithm, not a TrackMan measurement. The consequences of this algorithm may color your understanding of Statcast and should inform which techniques you avoid when analyzing the league or its players.

How Much Of The Data Is Missing?

Statcast had a rocky first year and failed to report 20-30 percent of batted balls in 2015. There were many reasons for this, but long story short, it wasn’t necessarily a technical limitation because the data was retroactively filled in after the season ended. Today, having logged three full seasons, Statcast appears to be reporting data for about 88 to 89 percent of the batted balls. Later on in this piece I’ll explain how I am identifying missing data, but for now it is important to point out the types of batted balls that are missing from the dataset.

Stringer Totals

Type	Total	Missing	% Missing
Ground Ball	181,017	29,047	16.0%
Line Drive	101,000	1,860	1.8%
Fly Ball	82,973	2,541	3.1%
Pop Up	27,282	12,328	45.2%
Total	392,272	45,776	11.7%

SOURCE: Statcast
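
The percentages in the table follow directly from the counts. Here is a minimal Python sketch that recomputes them; the dictionary and function names are illustrative, not part of Statcast.

```python
# Totals and missing counts copied from the Stringer Totals table above.
stringer_totals = {
    "Ground Ball": (181_017, 29_047),
    "Line Drive": (101_000, 1_860),
    "Fly Ball": (82_973, 2_541),
    "Pop Up": (27_282, 12_328),
}


def pct_missing(total, missing):
    """Share of batted balls without a TrackMan measurement, in percent."""
    return 100.0 * missing / total


pct_by_type = {name: pct_missing(t, m) for name, (t, m) in stringer_totals.items()}

# Overall rate: sum the counts first, then divide (not an average of the rates).
all_total = sum(t for t, _ in stringer_totals.values())
all_missing = sum(m for _, m in stringer_totals.values())
overall_pct = pct_missing(all_total, all_missing)

for name, pct in pct_by_type.items():
    print(f"{name}: {pct:.1f}%")
print(f"Total: {overall_pct:.1f}%")
```

Summing the counts before dividing matters here: ground balls make up nearly half the sample, so a plain average of the four rates would overstate the overall share.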

As it turns out, TrackMan has trouble detecting balls that move perpendicularly to the face of the radar: that is, straight up, down, left (third base side) or right (first base side). The more the movement is directed away from the radar (i.e., into the field of play), the better it is at detecting and accurately measuring the velocity of the ball.

Luckily, baseball has foul territory, so many of the balls that move left or right can be ignored as foul balls. Unfortunately, balls hit down into the ground and straight up into the air are fairly common. This means that not only does TrackMan have a bias against certain types of batted balls, but these batted balls are generally very weakly hit ground balls and pop-ups.

In addition to missing individual batted balls, nine games appear to be missing all their TrackMan data: July 24 and 25, 2015 in San Francisco; July 3, 2016 in Fort Bragg, N.C.; Sept. 23 through 27, 2016 in Pittsburgh; and Aug. 20, 2017 in Williamsport, Pa.

The missing data in Fort Bragg and Williamsport shouldn’t be a surprise, considering neither of those stadiums has TrackMan installed. Well, I’m not sure Fort Bragg should even be considered a stadium, but that is neither here nor there. Missing five consecutive games in Pittsburgh is a bit odd, but I suppose hardware problems can arise from time to time. It is possible there are missing games that I have overlooked, but it is nice to see that the number appears to be minimized.

All in all, though, TrackMan has covered 7,385 out of 7,394 major league games in the past three years. These nine missing games had 726 out of the 561,666 plate appearances that occurred. There are so few of these missing games that we can effectively ignore them going forward, although it is important to acknowledge they exist, especially in 2016, which had six missing games.

However, you cannot explain all of the missing data using only weakly hit balls and missing games. There are a large number of missing line drives and fly balls as well. There may be some fraction of batted balls that are missed by random chance, or perhaps there is another mechanism for missing batted balls that I don’t quite understand.

Either way, you can generally assume that a great many of the missing data points are weakly hit. So much so that if you were to calculate average exit velocity and launch angle using only the balls measured by TrackMan, you would end up with inflated numbers. Both the exit velocity and launch angle would be too high, since you would be throwing out an enormous number of weakly hit ground balls. Pop-ups, too, but mostly ground balls.

The Toolbox

This is a problem. We want to accurately measure exit velocity and launch angle, but to do so we need to account for these missing batted balls. Before we can address a solution we need to address the tools we have to work with.

When TrackMan data is missing we lose out on:

Exit velocity
Launch angle
Batted ball distance
Batted ball spin

It is important to understand that batted ball distance is lost. If you had batted ball distance, you might be able to reverse engineer an exit velocity using the batted ball type and fielding location data. Alas, we are left without the distance data.

We are left with:

Batted ball type
Batted ball result
Which players fielded the ball
Rough approximation of where the ball was fielded (hc_x, hc_y)

The first three items on this list are called the stringer information. The hc_x and hc_y coordinates are very rough estimates, and are not especially reliable (although they are better than nothing when you’re left with no other choice).

Last year, Jeff Zimmerman developed a method in which he found the average launch angle and exit velocity for batted balls fielded by each position. So, for example, a pop-up to the second baseman versus a fly ball to the right fielder, etc. In this way he used all three aspects of the stringer information to estimate the batted ball quality.

I was working on a system of grouping balls using the hc_x and hc_y coordinates along with batted ball type. Before I had a chance to finish this project, MLB announced it would be filling in the missing data on its own. Since then, MLB has retroactively filled in data for 2015 and 2016 and provided data for the 2017 season. Once MLB implemented its solution, accompanied by Tom Tango’s article, I shifted my focus to other aspects of the game. The missing data problem went out of sight, out of mind. But I believe the missing data is still an issue that needs to be addressed.

The “No Nulls” Solution

As I have already stated, there are two types of missing batted ball data: missing games and missing batted balls. It comes down to an issue of measurement bias versus failure of measurement.

In the cases where the game is recorded, but an individual batted ball does not register, you can assume that the majority of the time (70-85 percent) the ball was poorly hit. Therefore the batted ball likely has a weaker than average launch angle and exit velocity.

When the game is entirely missed, you don’t have any information about the batted ball, so you cannot make any assumptions.

The data you see on Baseball Savant is going through a multi-step process to correct for this missing data. First, it is checking to see whether the game is entirely missed. If the game is missed, it gives an average launch angle and exit velocity for each stringer batted ball type. So, for example, a groundball single will have an exit velocity of 93 mph and a launch angle of -3 degrees, whereas a flyball double will have an exit velocity of 93 mph and a launch angle of 32 degrees.

If the game is recorded, but an individual batted ball is missing, it will assign a second set of values based upon the stringer data. These are, by far, the most common data corrections you see for batted balls with missing data. For example, a ground out has an exit velocity of 83 mph and a -21 degree launch angle, and a pop out is 80 mph with a 69 degree launch angle.

This is a simplistic understanding. Under further analysis, there appear to be multiple launch angle and exit velocity combinations for the various stringer types, even in missing games. So, the exact rules for how these numbers are distributed are a bit complicated and remain unknown to me. Presumably, you could reverse engineer them if you were so inclined.
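
The simplified two-step process described above can be sketched in Python as a pair of lookup tables. This is only an illustration built from the handful of example values quoted in this section. The table names, the function, and its arguments are hypothetical, and the actual rules MLB uses are not public.

```python
# Hypothetical lookup tables built only from the example values quoted above.
# The real No Nulls rules are not public and appear to be more complicated.
MISSING_GAME_DEFAULTS = {
    "Ground Ball Single": (93.0, -3.0),
    "Fly Ball Double": (93.0, 32.0),
}
MISSING_BALL_DEFAULTS = {
    "Ground Out": (83.0, -21.0),
    "Pop Out": (80.0, 69.0),
}


def fill_missing(stringer_tag, exit_velocity, launch_angle, game_tracked):
    """Return (exit_velocity, launch_angle), imputing when either value is absent."""
    if exit_velocity is not None and launch_angle is not None:
        return exit_velocity, launch_angle  # TrackMan measured it; keep as-is
    # Step 1: was the whole game missed? Step 2: pick the matching table.
    table = MISSING_BALL_DEFAULTS if game_tracked else MISSING_GAME_DEFAULTS
    # Tags not covered by this sketch stay null rather than being guessed.
    return table.get(stringer_tag, (None, None))


print(fill_missing("Ground Out", None, None, game_tracked=True))
print(fill_missing("Ground Ball Single", None, None, game_tracked=False))
```

Every ball with the same stringer tag receives exactly the same pair of numbers, which is what later makes the filled-in values stand out in the data.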

Mountains of Balls

Since MLB is using a short list of rules to distribute exit velocity and launch angle information to the batted balls with missing data, we can reverse engineer the rules using the stringer data and frequency information.

Generally speaking, combinations of exit velocity (which has one decimal place) and launch angle (three decimals) are pretty random. There are so many possible combinations of launch angles and exit velocities that you wouldn’t expect many results for each given pair of numbers (for example 84.5 and 23.457). When you include the stringer tags, the number drops even further. In fact, there are only 75 combinations of exit velocity, launch angle, and stringer tag with five or more matches, compared to 346,956 with fewer than five matches, 99.8 percent of which are unique.

I have taken these 75 triplets and named them the most likely candidates for MLB’s “No Nulls” rules. It is possible that a few of these are not actually part of the “No Nulls” set, and it is possible there are a few combinations of “No Nulls” that are so rare that they haven’t yet occurred five times. For example, a pop-up triple.
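
The frequency test described above amounts to counting identical triplets and keeping the ones that repeat. Here is a minimal Python sketch, assuming each batted ball has already been reduced to an (exit velocity, launch angle, stringer tag) tuple. The function name and the tiny sample are made up for illustration and are not real Statcast rows.

```python
from collections import Counter


def candidate_triplets(batted_balls, min_matches=5):
    """Return {(ev, la, tag): count} for triplets seen at least min_matches times."""
    counts = Counter(batted_balls)
    return {triplet: n for triplet, n in counts.items() if n >= min_matches}


# Tiny synthetic sample, for illustration only (not real Statcast rows).
sample = [(83.0, -21.0, "Ground Out")] * 6 + [
    (84.5, 23.457, "Fly Out"),
    (101.2, 12.004, "Line Drive Single"),
    (80.0, 69.0, "Pop Out"),
]

print(candidate_triplets(sample))
```

The threshold of five is the one used in this article. A lower threshold would catch rarer rules but would also let in more coincidental repeats from genuinely measured balls.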

I am fairly confident that nearly all of the batted balls identified in this manner are No Nulls, and at the end of this piece I will include several tables that include all of the candidate groups of balls. For now, look at these two images, which show the distribution of No Nulls in the dataset.

The dark red bars show the batted balls measured by TrackMan, and the light red show the balls added using the No Nulls rules. Look at the enormous concentrations of balls around -24 to -20 degrees and 68 to 72 degrees. These are the majority of your 29,000 missing ground balls and 12,000 missing pop-ups. On the exit velocity chart you can see these balls between 80 and 84 mph.

There is a clear gap in frequencies of batted balls hit between 85 and 90 mph. It seems like there is a good chance the missing batted balls might fill in this gap. With the No Nulls solution, MLB has more than filled in this gap. Realistically, you’d probably expect these batted balls to be spread out more, following a more gradual curve. The exact shape of that curve is unknown, but the balls measured by TrackMan give us a good idea of what it might look like.

The MLB No Nulls solution has created very large spikes in the frequencies of very specific batted balls, but this gets even more messy when you start comparing different seasons. In the GIF below you will see vertical launch angle along the Y axis and exit velocity along the X axis. Each cell represents 2 mph by 2 degrees. The colors represent frequency as a percentile of the largest cell. Green cells are high frequency, and blue cells are low frequency.
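
For readers who want to build a similar grid, here is a rough Python sketch of the binning step. It assumes the input is a list of (exit velocity, launch angle) pairs and reads "percentile of the largest cell" as each cell's count divided by the count in the busiest cell. That reading, and the function name, are assumptions rather than a description of how the GIF was actually produced.

```python
import math
from collections import Counter


def cell_frequencies(balls, ev_width=2.0, la_width=2.0):
    """Bin (exit_velocity, launch_angle) pairs into cells and scale the counts
    so that the busiest cell equals 1.0."""
    counts = Counter(
        (
            math.floor(ev / ev_width) * ev_width,  # lower edge of the mph bin
            math.floor(la / la_width) * la_width,  # lower edge of the degree bin
        )
        for ev, la in balls
    )
    if not counts:
        return {}
    largest = max(counts.values())
    return {cell: n / largest for cell, n in counts.items()}


# Synthetic pairs for illustration only.
demo = [(83.0, -21.0)] * 3 + [(83.5, -20.5), (95.1, 10.2), (95.9, 11.9)]
print(cell_frequencies(demo))
```

Using `math.floor` rather than integer truncation keeps negative launch angles in the correct cell, so -21 degrees lands in the -22 to -20 bin rather than the -20 to -18 bin.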

This GIF shows the No Nulls frequency problem better than anything else I have seen to date. Do you see those cells that remain dark green in each season? Those are the missing batted balls filled in by MLB. Some of these cells mysteriously disappear in 2017. Can you guess why? I’ll give you a hint: I told you the answer up above.

The frequencies of various exit velocity and launch angle combinations are clearly changing over time. The difference between 2015 and 2017 is particularly stark. In 2017 the high exit velocity balls appear to be shifting up in launch angle, and the low launch angle balls appear to be moving down in exit velocity. You can also see growing frequencies of pop-ups. The suppressed number of TrackMan-recorded pop-ups in 2015 may have been a technical limitation. I have no evidence for that, but it could be the case, considering how 2016 and 2017 suddenly have balls measured above 80 degrees. Either way, it is clear that the number of pop-ups is increasing each season.

However, with all of these changes, those sticky No Nulls cells are constant.

If the No Nulls are remaining constant from year to year, and there is a clear league-wide migration of batted balls, wouldn’t that mean the No Nulls will get increasingly less accurate over time? If we assume these changing trends are constant, anyway. Perhaps in 2018 some of this will reverse course, batters will start hitting lower launch angle hard-hit balls, and ground balls will increase in velocity. Maybe. That seems unlikely. It is more likely that players will hit even more balls into the air, in an effort to maximize the value of each plate appearance.

Change Is In The Air

When MLB instituted this No Nulls solution, it did so to correct the average launch angle and exit velocity data, both for the major leagues and for individual players. However, if the average distributions keep changing and the No Nulls aren’t updated to match, these No Nulls may end up overcompensating and hurting the data. For example, in the table below I have put the average exit velocity for all balls hit below -20 degrees for each of the three seasons. Notice how it is dropping with each season.

Below -20 Degrees

Year	Average Exit Velocity (mph)
2015	74.03
2016	73.82
2017	69.16

SOURCE: Baseball Savant
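
A per-season average like the one in this table can be computed with a simple grouped mean. The Python sketch below assumes rows of (season, exit velocity, launch angle) and that any No Nulls rows you want excluded have already been filtered out. The function name and the sample rows are illustrative, not real data.

```python
from collections import defaultdict


def mean_ev_below(balls, max_angle=-20.0):
    """Average exit velocity per season for balls with launch angle below max_angle.

    balls is an iterable of (season, exit_velocity, launch_angle) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # season -> [running total, count]
    for season, ev, la in balls:
        if la < max_angle:
            sums[season][0] += ev
            sums[season][1] += 1
    return {season: total / n for season, (total, n) in sums.items()}


# Synthetic rows for illustration only.
rows = [
    (2015, 80.0, -25.0),
    (2015, 70.0, -30.0),
    (2015, 100.0, 10.0),  # above the cutoff, ignored
    (2017, 60.0, -21.0),
]
print(mean_ev_below(rows))
```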

Discounting Nulls

The difference between 2017 and 2016 is dramatic. Many of the No Nulls ground balls fall into this category — about 25,000 of them. The vast majority of these No Nulls ground balls are listed with an exit velocity of 83 mph. That seems a bit high, in light of the sudden drop in groundball exit velocity in 2017. Perhaps I am wrong. Maybe TrackMan happened to record more of the softly hit ground balls and missed all of the hard hit ones. It’s certainly a possibility.

If the batters are, in fact, producing weaker contact on ground balls and the No Nulls solution hasn’t accounted for this, then the average exit velocity may actually be lower than what you see on Baseball Savant. Perhaps MLB is doing something with the No Nulls to keep up with these trends. I can’t see that it has, and I think the above GIF speaks for itself. These clusters of batted balls haven’t changed, even though the landscape around them has.

What To Do Going Forward

When you are looking at major league average launch angle and exit velocity, you should use the No Nulls solution put forward by MLB. You should understand that these numbers are estimates, and even as estimates they appear to have a minor flaw. The league-average launch angle probably will not be off by much, but exit velocity could be off by as much as 1 mph.

The league averages aren’t a big concern, though. Nor are the player averages. Rather, you must be careful when you examine the league-wide results when bucketing balls based upon their launch angle, exit velocity, or batted ball type, particularly balls hit between -30 and -20 degrees or above 60 degrees. Bucketing these batted balls will subject you to a double whammy: both their frequency and their exit velocity are artificially inflated by the No Nulls solution.

Whenever you are bucketing batted balls, you will want to first remove the No Nulls balls. I have created four tables consisting of the No Nulls categories I have identified. You can use these definitions to remove these balls from your own research, if you deem it necessary.
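
One way to apply those definitions is to treat them as a set of (stringer tag, exit velocity, launch angle) keys and drop any row that matches. The Python sketch below includes only a few rows from the appendix for brevity. The names are illustrative, and a genuinely measured ball that happens to land on the same values would be removed too.

```python
# A few (stringer tag, exit velocity, launch angle) definitions copied from the
# appendix tables; extend the set with the remaining rows before real use.
NO_NULLS = {
    ("Ground Out", 83.0, -21.0),
    ("Ground Out", 82.9, -20.699),
    ("Ground Out", 41.0, -39.0),
    ("Pop Out", 80.0, 69.0),
    ("Fly Out", 89.0, 39.0),
}


def is_no_null(tag, ev, la, definitions=NO_NULLS):
    """True when the row matches a suspected No Nulls definition."""
    # Round to the published precision (one decimal for EV, three for angle).
    return (tag, round(ev, 1), round(la, 3)) in definitions


def drop_no_nulls(balls, definitions=NO_NULLS):
    """Keep only rows that do not match any definition.

    balls is an iterable of (stringer_tag, exit_velocity, launch_angle) tuples.
    """
    return [b for b in balls if not is_no_null(*b, definitions=definitions)]


# Synthetic rows for illustration only.
rows = [
    ("Ground Out", 83.0, -21.0),
    ("Ground Out", 87.3, -14.221),
    ("Pop Out", 80.0, 69.0),
]
print(drop_no_nulls(rows))
```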

The No Nulls solution implemented by MLB could be better, but it is difficult to criticize without knowing the exact rules that are being used to classify each batted ball. Clearly the balls could be smoothed out more, perhaps using fielder location or some other metric. It appears that the rules are being applied across all three seasons evenly, but perhaps they should be tailored to each individual season. But, again, I don’t know how MLB is assigning the balls, so perhaps fielder location and seasonal variations are already included to some extent.

Ultimately, though, there is only one true solution to the problem, and that is TrackMan recording 100 percent of the data. It goes without saying that this is one of the highest priorities. Any attempt we as analysts make to manipulate the data after the fact will leave us wanting for more.

References & Resources

Dan Fox, The Hardball Times, “A Day in the Life of a Stringer”
Jeff Zimmerman, RotoGraphs, “Missing Statcast Data with Hyun Soo Kim”
Tom Tango, Tangotiger Blog, “Statcast Lab: No Nulls in Batted Balls Launch Parameters”

Appendix: No Nulls Definitions

Stringer Ground Balls

Stringer	EV (mph)	Vertical Angle (degrees)	Sample
Ground Ball Double	90	-13	124
Ground Ball Double	90.2	-13	92
Ground Ball Double	93	-1	12
Ground Ball Error	43	-62	57
Ground Ball Error	84	-20	462
Ground Ball Error	86	-11	15
Ground Ball Single	40	-36	988
Ground Ball Single	90	-17	2340
Ground Ball Single	90.3	-17.3	1128
Ground Ball Single	93	-3	149
Ground Ball Triple	94	-12	6
Ground Ball Triple	94.3	-12.1	11
Ground Out	41	-39	3777
Ground Out	82.9	-20.699	6083
Ground Out	83	-21	13296
Ground Out	84	-13	507
Total Ground Balls	76.97	-23.07	29047

SOURCE: Statcast

Stringer Line Drives

Stringer	EV (mph)	Vertical Angle (degrees)	Sample
Line Drive Double	98.8	17.1	96
Line Drive Double	99	17	244
Line Drive Home Run	104	24	76
Line Drive Home Run	104.4	23.699	21
Line Drive Single	41	16	5
Line Drive Single	90	15	564
Line Drive Single	90.4	14.6	135
Line Drive Triple	98.4	18	11
Line Drive Triple	99	18	31
Line Out	37	31	37
Line Out	91	18	512
Line Out	91.1	18.199	128
Total Line Drives	91.76	17.24	1860

SOURCE: Statcast

Stringer Fly Balls

Stringer	EV (mph)	Vertical Angle (degrees)	Sample
Fly Ball Home Run	103	30	160
Fly Ball Home Run	102.8	30.199	57
Fly Ball Triple	97	31	11
Fly Ball Double	95	29	18
Fly Ball Double	93.1	32	32
Fly Ball Double	93	32	48
Fly Out	89.2	39.299	594
Fly Out	89	38	213
Fly Out	89	39	1323
Fly Ball Single	73	34	11
Fly Ball Single	71.4	36	19
Fly Ball Single	71	36	55
Total Fly Balls	89.85	37.79	2541

SOURCE: Statcast

Stringer Pop Ups

Stringer	EV (mph)	Vertical Angle (degrees)	Sample
Pop Out	37	62	306
Pop Out	75	60	89
Pop Out	80	69	11833
Pop Up Double	89	63	12
Pop Up Error	81	65	40
Pop Up Single	86	67	48
Total Pop Ups	78.93	68.73	12328

SOURCE: Statcast

4 Comments

Tangotiger

7 years ago

Excellent analysis and understanding of the issues and limitations in handling untracked data.

Andrew Perpetua

Reply to Tangotiger

Thank you!

channelclemente

It would be very interesting to learn if there was a pitch type bias reflected in the null data results. From the types of hits missing, one might presume they resulted from a pitch that induced poor contact. That would suggest curve, slider, or cutter, hypothetically, would be over represented in the null buckets.

Reply to channelclemente

I don’t think this is at all related to the missing batted balls, but I understand that Trackman has difficulty identifying pitches thrown out of certain arm slots, and with sliders in general. But I have also heard that the issue with sliders has been addressed multiple times and is much better than it used to be.

But perhaps whatever it is about certain arm slots that is difficult to track translates to balls that come off the bat in that same sort of manner? I don’t know.

I think you might be referring to certain pitch types creating certain weak contact, though. Maybe. It might be hard to tell since pitch type data can be missing as well.
