Research Notebook: New Format for Statcast Data Export at Baseball Savant

by Bill Petti
April 28, 2017

The Statcast Search tool has undergone some recent changes. (via Baseball Savant)

Over the past few years, Daren Willman has made some of the pitch-by-pitch data generated by Statcast available on his Baseball Savant site. We have gotten only a taste of the kind of data that Statcast can produce, but even that taste is interesting and useful to work with.

The easiest way to get the data is through the Statcast Search query tool. After running a query–say, for all pitches thrown on April 24, 2017–you have the option of exporting the data as a comma-separated values (csv) file. As of April 24, the query output has changed in some not-so-subtle ways. Daren was kind enough to share the changes with me ahead of time, which allowed me to quickly update the scrape_statcast_savant series of functions in my baseballr package.

I thought it would be helpful to outline what those changes are for people who have been working with the original data exports and plan on working with the new ones.

Overall Changes

The old export included 60 variables; the new file has 75. Some of the additions are brand new variables. In other cases, existing variables are being renamed. Of those being renamed, some will continue to be reported, while others are deprecated and will not show values going forward. Let’s break these out into two separate lists, shall we?

Statcast Data Export Column Changes

Column Names going away	New Column Names
pitch_id	release_speed
start_speed	release_pos_x
x0	release_pos_z
z0	spin_rate_deprecated
spin_rate	break_angle_deprecated
break_angle	break_length_deprecated
break_length	plate_x
px	plate_z
pz	inning_topbot
inning_top_bottom	tfs_deprecated
tfs	tfs_zulu_deprecated
tfs_zulu	pos2_person_id
catcher	launch_speed
hit_speed	launch_angle
hit_angle	pos1_person_id
	pos2_person_id.1
	pos3_person_id
	pos4_person_id
	pos5_person_id
	pos6_person_id
	pos7_person_id
	pos8_person_id
	pos9_person_id
	release_pos_y
	estimated_ba_using_speedangle
	estimated_woba_using_speedangle
	woba_value
	woba_denom
	babip_value
	iso_value

Looking at these two lists I noticed a few things:

start_speed looks to now be release_speed. (Note: start_speed from 2008-2016 was generated by PITCHf/x. release_speed is being generated by Statcast. For more on this, see the discussion at Tom Tango’s site here.)
The break variables are going to be deprecated.
hit_speed will become launch_speed and hit_angle will be launch_angle.
The strike zone coordinates are changing from px, pz to plate_x, plate_z.
The player IDs for each position will be added as separate variables.
They appear to be including variables with values and/or estimates of things like wOBA, BABIP, etc., for batted balls given angle, speed, etc.

To make life easier, I put together a simple crosswalk between the old and new data exports to show which variables are being renamed:

STATCAST DATA EXPORT COLUMN CROSSWALK

OLD COLUMN NAMES	NEW COLUMN NAMES
start_speed	release_speed
x0	release_pos_x
z0	release_pos_z
spin_rate	spin_rate_deprecated
break_angle	break_angle_deprecated
break_length	break_length_deprecated
inning_top_bottom	inning_topbot
tfs	tfs_deprecated
tfs_zulu	tfs_zulu_deprecated
catcher	pos2_person_id
hit_speed	launch_speed
hit_angle	launch_angle
px	plate_x
pz	plate_z
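
The crosswalk above can also be applied programmatically. What follows is a minimal base R sketch, not the baseballr implementation: rename_old_savant_columns is a hypothetical helper name, and it assumes the old export has already been read into a data frame. Columns that are not in the crosswalk are left untouched.

```r
# Old-to-new column names, copied from the crosswalk table above
crosswalk <- c(
  start_speed       = "release_speed",
  x0                = "release_pos_x",
  z0                = "release_pos_z",
  spin_rate         = "spin_rate_deprecated",
  break_angle       = "break_angle_deprecated",
  break_length      = "break_length_deprecated",
  inning_top_bottom = "inning_topbot",
  tfs               = "tfs_deprecated",
  tfs_zulu          = "tfs_zulu_deprecated",
  catcher           = "pos2_person_id",
  hit_speed         = "launch_speed",
  hit_angle         = "launch_angle",
  px                = "plate_x",
  pz                = "plate_z"
)

# Hypothetical helper: rename any old-format columns found in df
rename_old_savant_columns <- function(df) {
  old <- names(df)
  hit <- old %in% names(crosswalk)
  names(df)[hit] <- unname(crosswalk[old[hit]])
  df
}

# Toy data frame with two renamed columns and one unchanged column
old_df <- data.frame(start_speed = 92.1, px = 0.3, balls = 1)
names(rename_old_savant_columns(old_df))
```

Renaming by name rather than by position means the helper does not depend on the column order of the file you happen to have on disk.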

You will notice that for every column name that was “going away” there is a replacement, except for pitch_id. That appears to no longer be available, even as a deprecated column.

New Variables and Values

In terms of the new columns that aren’t simply replacements for some of the old columns, we get some fun new data to play with.

First, we are getting the mlbamid for each player in the field and the position he was playing when the pitch was thrown. Now, we don’t get positioning data in the export (at least, not this year), but knowing who was playing where can be useful in many ways.

Second, the crew at MLBAM appears to be gearing up to release their own measures of estimated Weighted On-base Average (wOBA) and Batting Average (AVG) based on exit velocity and launch angle. The variables that start with estimated_ appear to show the average wOBA or AVG for batted balls with similar launch angles and exit velocities.

One item that is still not being released is horizontal spray angle on batted balls. Tango and the crew have said they will release that data at some point, but we don’t have it in this release.

You should also note that for some of the variables the types of values are a little different. For example, if you look at events and description we now have more machine-friendly values (i.e., codes in lowercase without spaces). Take events: instead of “Grounded Into DP” we now have “grounded_into_double_play”. We also have null values in events where the pitch did not result in the end of the plate appearance. This is cleaner for analysis, but it also might break any old code you have. Also, lining up existing data files with the new ones for these columns will require a little more TLC.
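
If you need old and new files to agree on the events column, one approach is a lookup table with a generic fallback. The sketch below is only an illustration under stated assumptions: the “Grounded Into DP” pair is the only mapping confirmed in this post, the fallback rule (lowercase, underscores in place of spaces) is a guess that may not match every new code, treating empty strings as missing is also an assumption, and normalize_events is a hypothetical name.

```r
# Known old -> new event codes; only this pair is confirmed in the text above
known_events <- c("Grounded Into DP" = "grounded_into_double_play")

# Hypothetical helper: map old-style event labels to new-style codes
normalize_events <- function(events) {
  events <- as.character(events)
  # Assumption: treat empty strings as missing, like the nulls in the new export
  events[!is.na(events) & events == ""] <- NA
  known <- !is.na(events) & events %in% names(known_events)
  # Fallback guess: lowercase and replace runs of whitespace with underscores
  out <- gsub("[[:space:]]+", "_", tolower(events))
  # Confirmed mappings override the fallback
  out[known] <- unname(known_events[events[known]])
  out
}

# Toy input: one confirmed label, one fallback label, one empty, one missing
normalize_events(c("Grounded Into DP", "Home Run", "", NA))
```

Any value that goes through the fallback should be checked against a fresh download before you rely on it.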

Omissions

Here is some R code that you can use to calculate horizontal spray angle yourself, based on where MLBAM stringers plot the spot at which a batted ball was picked up by a fielder (the hc_x and hc_y variables in the export):

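# df is a data frame holding the export; hc_x and hc_y are the stringer-plotted coordinates
spray_angle <- with(df, round(
  # angle in degrees relative to the (125.42, 198.27) offsets, scaled by .75
  (atan(
    (hc_x-125.42)/(198.27-hc_y)
  )*180/pi*.75)
  ,1)
)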

Is this perfect? No, but it is pretty good in the absence of official sensor-based spray angle data. Note that -45 degrees is the left field line and 45 degrees is the right field line. (Note also that this calculation was originally produced by Jeff and Darrell Zimmerman.)
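
To reuse the calculation across files, it can be wrapped in a function. This is a sketch of the same formula rather than an official method: spray_angle_from_hc is a hypothetical name, and the 125.42 and 198.27 offsets and the .75 scaling are taken directly from the code above.

```r
# Hypothetical wrapper around the spray angle formula shown above
spray_angle_from_hc <- function(hc_x, hc_y) {
  # Angle of the fielded location relative to the offsets used above, in degrees
  raw_degrees <- atan((hc_x - 125.42) / (198.27 - hc_y)) * 180 / pi
  # Same .75 scaling and one-decimal rounding as the original code
  round(raw_degrees * 0.75, 1)
}

# Toy coordinates: left of center, straight up the middle, right of center
spray_angle_from_hc(hc_x = c(95.42, 125.42, 155.42), hc_y = c(100, 100, 100))
```

With these toy inputs the first value is negative (toward left field), the second is zero, and the third is positive (toward right field), which matches the sign convention described above.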

Still another thing to note is that, currently, the umpire variable is not populating. This column normally contains the mlbamid for the umpire that was behind the plate during the pitch. Daren has mentioned that this should be fixed and retroactively populated soon.

Merging Your Old Files with the New Files

Finally, if you are looking for a way to easily merge existing data you’ve downloaded from Baseball Savant with the new download format, here is a function in R that can set up your existing data to do that. Basically, it takes the current data, transforms the variables whose names are changing and adds in blank columns for the new variables, and then arranges the columns so that they are in the same order as the new download. Here’s the code (also available as a gist here):

# Requires dplyr for %>% and select()
library(dplyr)

format_old_savant_output <- function(df) {

  # Assumes df holds the 60 columns of the old export, in their original order
  updated_names <- c("pitch_type", "pitch_id", "game_date", "release_speed", "release_pos_x", "release_pos_z", "player_name", "batter", "pitcher", "events", "description", "spin_dir", "spin_rate_deprecated", "break_angle_deprecated", "break_length_deprecated", "zone", "des", "game_type", "stand", "p_throws", "home_team", "away_team", "type", "hit_location", "bb_type", "balls", "strikes", "game_year", "pfx_x", "pfx_z", "plate_x", "plate_z", "on_3b", "on_2b", "on_1b", "outs_when_up", "inning", "inning_topbot", "hc_x", "hc_y", "tfs_deprecated", "tfs_zulu_deprecated", "pos2_person_id", "umpire", "sv_id", "vx0", "vy0", "vz0", "ax", "ay", "az", "sz_top", "sz_bot", "hit_distance_sc", "launch_speed", "launch_angle", "effective_speed", "release_spin_rate", "release_extension", "game_pk")
  
  colnames(df) <- updated_names
    
  # Columns that exist only in the new export. plate_x and plate_z are left out
  # because they are renames of px and pz and would otherwise be overwritten with NA.
  new_cols <- c("pos1_person_id", "pos2_person_id.1", "pos3_person_id", "pos4_person_id", "pos5_person_id", "pos6_person_id", "pos7_person_id", "pos8_person_id", "pos9_person_id", "release_pos_y", "estimated_ba_using_speedangle", "estimated_woba_using_speedangle", "woba_value", "woba_denom", "babip_value", "iso_value")
  
  df[,new_cols] <- NA

  # Drop pitch_id and arrange the columns in the order of the new export
  df <- df %>%
    select(pitch_type, game_date, release_speed, release_pos_x, release_pos_z, player_name, batter, pitcher, events, description, spin_dir, spin_rate_deprecated, break_angle_deprecated, break_length_deprecated, zone, des, game_type, stand, p_throws, home_team, away_team, type, hit_location, bb_type, balls, strikes, game_year, pfx_x, pfx_z, plate_x, plate_z, on_3b, on_2b, on_1b, outs_when_up, inning, inning_topbot, hc_x, hc_y, tfs_deprecated, tfs_zulu_deprecated, pos2_person_id, umpire, sv_id, vx0, vy0, vz0, ax, ay, az, sz_top, sz_bot, hit_distance_sc, launch_speed, launch_angle, effective_speed, release_spin_rate, release_extension, game_pk, pos1_person_id, pos2_person_id.1, pos3_person_id, pos4_person_id, pos5_person_id, pos6_person_id, pos7_person_id, pos8_person_id, pos9_person_id, release_pos_y, estimated_ba_using_speedangle, estimated_woba_using_speedangle, woba_value, woba_denom, babip_value, iso_value)
 
  df
}
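
Once the old data has been reformatted, it can be stacked with a new download. The following is a minimal base R sketch of that step, not part of the gist: stack_savant_exports is a hypothetical helper, and the toy data frames stand in for a reformatted old file and a new export. It checks that the column names line up before binding.

```r
# Hypothetical helper: stack a reformatted old file on top of a new export
stack_savant_exports <- function(old_formatted, new_export) {
  # Stop early if the two files do not share the same columns in the same order
  stopifnot(identical(names(old_formatted), names(new_export)))
  rbind(old_formatted, new_export)
}

# Toy stand-ins: the old file has NA in a column that only the new export fills
old_formatted <- data.frame(release_speed = 91.5, launch_angle = 12, woba_value = NA)
new_export <- data.frame(release_speed = 94.2, launch_angle = 25, woba_value = 0.9)

combined <- stack_savant_exports(old_formatted, new_export)
nrow(combined)
```

The explicit name check is a design choice: it fails loudly when the export format changes again instead of silently misaligning columns.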

From what it sounds like, there will be more changes coming, but Daren has mentioned that, given the way they are setting things up, future changes will be easier to deal with–essentially, just tacking new variables onto the end of the file export. That should make lining up any existing data you may have even easier.

References & Resources

Baseball Savant, Statcast Search
Bill Petti, GitHub, baseballr Package
Tom Tango, Tangotiger Blog, “Pitch velocity: new measurement process, new data points”
Bill Petti, GitHub Gist, format_old_savant_output.R

11 Comments

Dennis Bedard

8 years ago

Wow! I am living in a different universe. I remember circa 1967 when I could evaluate a player based almost solely on the way Topps depicted him with a bat or ball in his hands. Now I need a refresher course in advanced astrophysics to figure this out.

Jeff Zimmerman

Reply to Dennis Bedard

I disagree. I have gone head deep into this data but every time it is to answer a question. Don’t go into the data looking for questions. First, start with a good question and maybe the answer lies in the astrophysical data. Probably not.

Concentrate on creating a good question and then go find the data. If you don’t know where to find the data, ask. Us writers get bored being alone in our mom’s basement.

Rally

Don’t mean to complain because this is great stuff, and I’m grateful that MLB is willing to share so much data. But one thing I noticed is the umpire field is null in the downloads. Last year the umpire ID was there.

Is that an oversight or an intentional removal?

Bill Petti

Reply to Rally

Noted this in the post–it’s being worked on and they hope to have it back soon–should retroactively populate as well.

James

I like to bring these files into a spreadsheet to play around with. With pitch_id and tfs removed I don’t see how one would get the pitch sequencing back into the correct order. sv_id looks similar to tfs but many events are missing a value in that field. Am I missing something?

Reply to James

Any idea if either of these fields will be reinstated?

Don Hessey

sv_id is the first record in time of the pitch recorded by pitchfx or statcast. That would be your unique id for the pitch, combined with the game_pk field you will have a unique pitch id. I don’t think you’ll want to include the records without a sv_id in launch angle or launch speed calculations as they look to be done by a stringer and not generated by the statcast software.

Jeff

Will the pitch values (vx0 vy0 vz0 ax ay az) be populated going forward?

Michael Liu

Great stuff. Just a quick question. Why do you multiply by 0.75 to find the spray angle? Thanks!

Bill

A possibly related question… anyone know what is being measured by hc_x and hc_y in the Baseball Savant data? What are the units?
