Gameday-PITCHf/x changes for 2010

Every year brings changes and improvements to MLBAM’s Gameday application, and many of them have some bearing on PITCHf/x and related analysis. Let me share with you the differences I’ve noticed so far in 2010.

First, Cory Schwartz of MLBAM notified some of us in February that some redundant information was going to be removed from the directory structure.

Folks, just wanted to give you a heads-up that we are deprecating the individual batter and pitcher .xml files published under these directories:

$YEAR/month_$MONTH/day_$DAY/gid_*/pbp/batters/
$YEAR/month_$MONTH/day_$DAY/gid_*/pbp/pitchers/

If you’re using any data in those files you should be able to get it from other files in the gd2 directories, but we no longer need or use these for any of our internal purposes or products. In addition, we are deleting the 2008 and 2009 files from our servers to free up the disc space for other content.

Fortunately, Dan Brooks has been able to adapt his site to the new structure for 2010.

Ross Paul shared that he would be deploying pitcher-specific neural nets for MLBAM’s pitch classification.

The Gameday PITCHf/x data also has a few new fields this year. In the at bat element, there is a new field called “start_tfs”. This is a time stamp in the Eastern Time Zone. It matches the actual time more closely than the sv_id time stamp does, which can be a few minutes off. Cory tells me that this field wasn’t intended for analysis and is used internally by MLBAM. Speculation is that this field may be used for syncing up the Gameday data with other data sources, such as video. Since it’s there in the data, I wouldn’t be surprised if someone finds an analytical use for it, too.

The pitch element has three new fields: “nasty”, “zone”, and “cc”. The zone field appears to correspond to the location of the pitch based on the boxes into which the Gameday app divides the strike zone for its hot/cold zone graphics. The “cc” field is a comment field that appears to my highly-trained eye to be auto-generated, probably also based on the hot/cold zone information that MLBAM tracks. Here are some examples of the sparkling wit and insight produced by the auto-commenter:

A.J. Burnett didn’t read the scouting report; Adrian Beltre loves four-seam fastball in that zone.
A.J. Burnett didn’t read the scouting report; Jacoby Ellsbury loves sinker in that zone.
A.J. Burnett didn’t read the scouting report; Victor Martinez loves four-seam fastball in that zone.
Tim Lincecum has thrown 75 pitches; he holds opposing hitters to a .000 average in the first 75 pitches and .000 after that.
Vicente Padilla didn’t read the scouting report; Jeff Clement loves curveball in that zone.
Vicente Padilla didn’t read the scouting report; Lastings Milledge loves four-seam fastball in that zone.

(Apparently Ted Williams was right.)

The “nasty” field is presumably a crude attempt to calculate how hard a particular pitch was to hit, on a scale of 0-100. My initial cursory look at the data indicates that they are calculating the “nasty” factor mostly based on the location of the pitch, a linear calculation of how close it is to the edges and away from the heart of the zone. For the fastball, MLBAM does not appear to be factoring anything related to the movement or speed of the pitch into the “nasty” factor. For the curveball, they appear to be rating sweeping curveballs as significantly more nasty than 12-to-6 curveballs. Anyway, I’m not sure that any of this matters as more than a curiosity. As a sabermetric community we have much better approaches available for measuring the nastiness of a pitch.
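For anyone who wants to pull the new fields out of the data, here is a minimal sketch in Python. It assumes the at bat element is named `atbat` and the pitch element `pitch`, and that the new fields are plain XML attributes. The sample fragment and its values are invented for illustration (only the field names and the comment text come from the data described above), and `read_new_fields` is a hypothetical helper, not anything MLBAM provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. Attribute names follow the fields described in the
# article; element names and all values are invented for illustration.
SAMPLE = """
<inning>
  <atbat start_tfs="190512">
    <pitch nasty="38" zone="11" cc=""/>
    <pitch nasty="52" zone="5"
           cc="A.J. Burnett didn't read the scouting report; Adrian Beltre loves four-seam fastball in that zone."/>
  </atbat>
</inning>
"""


def read_new_fields(xml_text):
    """Return one dict per pitch holding the 2010 fields, if present."""
    root = ET.fromstring(xml_text)
    rows = []
    for atbat in root.iter("atbat"):
        start_tfs = atbat.get("start_tfs")  # None for pre-2010 data
        for pitch in atbat.iter("pitch"):
            nasty = pitch.get("nasty")
            rows.append({
                "start_tfs": start_tfs,
                # Missing or empty attributes become None rather than raising.
                "nasty": int(nasty) if nasty else None,
                "zone": pitch.get("zone"),
                "cc": pitch.get("cc") or None,
            })
    return rows


if __name__ == "__main__":
    for row in read_new_fields(SAMPLE):
        print(row)
```

Using `.get()` instead of indexing means the same loop should also run on older files that lack these attributes, returning `None` for each missing field.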

A.J. Burnett throws a knuckle curve against the Angels in Game 5 of the 2009 ALCS. (Icon/SMI)

Finally, the MLBAM pitch classifications have introduced a new bucket this year: KC, the knuckle curve. I’m not sure why they did this. I suspect it has something to do with the scouting data they got for their training data, although I haven’t asked Ross about it. For my own classifications, I do not classify the knuckle curve separately from other curveballs. I don’t generally classify pitch types separately based on grip unless the grip differences actually produce substantial differences in spin movement (e.g., two-seam and four-seam fastballs). I don’t classify palmballs, forkballs, circle change-ups, three-finger change-ups, and Vulcan change-ups separately. I do occasionally classify hard curves and slow curves separately when they are two distinct pitch types for the same pitcher, as they are for Roy Oswalt, for example. But the knuckle curve, also called the spike curve, moves just like other curveballs.

A.J. Burnett’s curve is the only pitch that I’ve noticed so far that MLBAM is labeling a knuckle curve, which handily gives me an excuse to include an image of a pitcher’s grip, one of my favorite topics.

Lucas A.
12 years ago

The pitch classifications have been pretty crude so far.  Even more so than usual, it seems.  For example, over the first two games, every David Robertson fastball (which sometimes cuts slightly) that had positive horizontal spin deflection has been misclassified as a curveball.  Personally, I don’t mind going back and working on the classifications myself, but just out of curiosity, do you know if the algorithm was changed at all this year?  In the early going, it’s been pretty rough.

12 years ago

Lucas – Ross is working on custom neural nets for each pitcher. The results look promising, but I have no idea how far along in the process he is. Mike posted a link in the article.

Mike Fast
12 years ago

Lucas, my impression also is that the classifications have taken a step back this year.  The examples that Ross posted at the Book blog on his new classification nets looked good, but there are some pretty bad/obvious mis-classifications going on in the first couple days of 2010.  I’ll leave it at that lest I get another “open letter” directed my way. :)

Detroit Michael
12 years ago

So where can one access the pitch-by-pitch data from 2008 and 2009?  Just in case one wants to look at the raw data.

Mike Fast
12 years ago

Detroit Michael, what Cory meant in his email was that the redundant directories were not only being removed going forward but that they were also being removed from the old data.  The pitch-by-pitch data is still present for 2007-2009 in the inning/inning_?.xml files.

For example:

Lucas A.
12 years ago

Mike, would you be able to expand on the “zone” function?  I’m not sure if I understand what it does.

Mike Fast
12 years ago

Lucas, Gameday Premium has a feature where you can see hot/cold zones for each hitter based on batting average in that zone.  I can’t bring it up from work, but it divides the strike zone into boxes, nine I think, and then there are some zones outside the strike zone.

It’s a very crude thing, and I wouldn’t use it for anything other than red and blue entertainment for your eyes.  It’s similar to the kind of drivel you occasionally see on national baseball telecasts where they show you a hitter’s hot and cold zones.  It’s nowhere near the level of what Dave Allen does with heat maps.

So, I wanted to note that it was there in the data this year, but I certainly don’t want to give the impression that I think it’s useful.  I’m sorry if I implied that.

Mike Fast
12 years ago

To expand on my point about the “nasty” score being useless and wrong as MLBAM is calculating it, here is how the different pitch results stack up according to their average “nasty” score.

“Nasty”  Result
  41.6   In play, out
  41.3   In play, run/no out
  40.7   Swinging strike
  40.4   Called strike
  40.1   Foul
  38.3   Ball

Remember this is on a scale of 0-100, with an average score of 40 and a standard deviation of 16.  It’s the worst kind of junk stat—purporting to measure something really cool but actually containing no useful information.  Fits right in with the rest of the Bloomberg repertoire, I suppose.
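A table like the one above is just a group-by average. Here is a minimal sketch, assuming each pitch has already been reduced to a (result, nasty) pair. The function name and the toy numbers are made up for illustration and are not the real 2010 data.

```python
from collections import defaultdict
from statistics import mean


def average_nasty_by_result(pitches):
    """Average "nasty" score per pitch result, highest average first.

    `pitches` is an iterable of (result, nasty) pairs.
    """
    groups = defaultdict(list)
    for result, nasty in pitches:
        groups[result].append(nasty)
    averages = {result: round(mean(scores), 1) for result, scores in groups.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    toy = [
        ("Ball", 30), ("Ball", 46),
        ("Called strike", 40),
        ("Swinging strike", 55), ("Swinging strike", 27),
    ]
    for result, avg in average_nasty_by_result(toy):
        print(f"{avg:6.1f}   {result}")
```

If the score carried real information, you would expect the averages for swinging strikes and balls in play to separate by far more than a point or two.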

12 years ago


At least attention is being paid to things like classification. Ross hasn’t gotten to all starters yet or any relievers.

Also, the metrics like nasty are made for their entertainment products, not for serious analysis. Since they make the data freely available, folks like you and Dave Allen can go the more rigorous route which is awesome.

I’m more than willing to cut them yards of slack on the latter point alone.

Mike Fast
12 years ago

Josh, I wouldn’t want to be a Premium customer and be told that the numbers they are giving me are meaningless and may as well be coming from a random number generator.

The reason it irks me so much is that Gameday is the face of PITCHf/x for the majority of the baseball world.  If they are communicating the message that PITCHf/x is just a junk toy for entertainment but rife with errors and bad conclusions when it comes to analysis, that is the message I have to fight against when trying to show that the data can be used reliably and with impact in the hands of skilled analysts.

I agree that Sportvision and MLBAM have done the baseball world a ton of good by making the data freely available.

Alan Nathan
12 years ago

Regarding the “nasty” metric, it is further amusing that it is quoted to 3 significant figures.  Or, equivalently, we are rating nastiness on a 0-1000 scale.  I doubt we can classify anything in baseball to that precision.

Cory Schwartz
12 years ago

Mike, upon reading this I was discouraged by your continued willingness to mock and criticize what we do with the Pitch-f/x data, when you aren’t even willing to take the time to ask us about it first. However, given that you’ve already demonstrated your willingness to show off how uninformed and presumptuous you are in this regard, I shouldn’t be surprised that you’re doing it again here.

Your snide and dismissive comments reveal that your true motivation is not to evaluate or inform but simply to criticize, which does not serve anyone in any way. New and informative research based on our data continues to emerge daily, fan interest in that research (and yes, even in our products) continues to grow, and our business continues to thrive despite what you perceive as our mishandling of this data. If you’d like to participate in a constructive conversation at any time, you know where to find me, but I’m not going to take that opportunity here because frankly it doesn’t appear you’re interested in it.

That said, anyone else who is interested in learning more about the new features we are working on should feel free to ask me and I’d be happy to explain, and then people can make their own judgments.


P.S. – Dr. Nathan, we are only reporting the “nasty” values in whole numbers from 1-100. It’s Mike’s display of them to the tenths place that is amusing you.

Mike Fast
12 years ago

Cory, I am sorry that you feel I am being snide and dismissive. 

You guys at MLBAM do a phenomenal job with communicating the game information and experience. 

At the same time, the sabermetric community tears apart every stat that’s ever been made, and that’s what I did here with the Nasty Factor.  If a film critic pans a scene in a movie, does that mean he hates the director or needs to get the director’s input before he publishes his review?

I love the movie.  Just don’t like a couple scenes.  And I wouldn’t be reviewing at all if I didn’t care about the movie in the first place.

The main point of this post was not to be only about the Nasty Factor.  It was to communicate to the analysts that there were new fields in the data again this year.

Dan Brooks
12 years ago

I, for one, learned about the knuckle-curve. It is now added to my site! =)

Now, if there was only an additional flag (KH) for “knuckle-head”, that was reserved for when pitchers threw to first when no one was actually on the bag or outfielders threw the ball into the seats after only 2 outs.

Alan Nathan
12 years ago

Cory…evidently, you are not following this thread at Tango’s blog:
If you were, then you would not be calling me “Dr. Nathan” :)

Alan Nathan
12 years ago

Cory….a more serious comment:

It might actually be helpful for all the baseball analysts out there if we had some knowledge of how you arrive at the “nasty” factor.  …alan