Developing The baseballr Package For R
Introduction
Late in 2015 I wrote a piece here at The Hardball Times that walked through some of my favorite R packages for gathering and analyzing baseball data. As with most things, no single package has everything I need, nor should it. After that article, I started collecting the functions I’ve written and routinely use, and decided to compile them into a formal package that anyone can easily load and use.
I’ve never written an R package before, so this is partly an excuse for me to learn a new skill. That means the development of the package will be slow, and have its fair share of bumps along the way. I thought I would share some initial views of the kind of functions I plan to include.
Data Acquisition
I featured a number of packages in my previous article that focused on grabbing data, whether it was full-season data for individual players and teams, or pitch-by-pitch data. However, right now there isn’t a package that makes it easy to pull real-time data on players during the season. The Lahman package is great, but that database is only updated once a year after the season. As a writer at The Hardball Times I have direct access to our database, but not everyone does. FanGraphs makes it very easy to download leaderboards in CSV format that include dozens of statistics for players updated daily, but there isn’t an easy way to grab that data from within R.
I am creating a series of functions that do just that.
First, there will be functions for downloading some of the most useful reference tables: FanGraphs’ historical park factors and the yearly constants and coefficients for calculating things like wOBA or FIP.
Here’s an example of the park factors function for the 2010 season. All you need to do is include a year when you call the fg_park() function and you get the following output:
> fg_park(2010)
   season    home_team basic single double triple  hr  so UIBB  GB  FB  LD IFFB FIP
 5   2010       Angels    96    100     96     87  96  99   96  99 100 100   96  98
 6   2010      Orioles   103    102    101     89 110  99   99 101 100 100   97 104
 7   2010      Red Sox   105    101    116    100  97  99  100 101 100 103  103  99
 8   2010    White Sox   105     98     99     89 113 102  107  98 102  99  101 105
 9   2010      Indians    97    100    101     83  94 101  100 101  98 103  102  98
10   2010       Tigers   102    102     99    113 101  97   97 101 103  96  104 101
11   2010       Royals   101    103    103    115  93  97  100 103 100 100   92  99
12   2010        Twins   101    102    102    110  94  98  100 101  99 103  101  99
13   2010      Yankees   103    100     97     85 110 100  101 100 100  98  100 104
14   2010    Athletics    97     98     96     98  94  99  100  98 101 101  103  98
15   2010     Mariners    94     99     94     88  92 103  102  98  99  98  104  97
16   2010         Rays    95     98     95    107  95 100   99  97 100  99  110  97
17   2010      Rangers   107    102    104    118 110  99  101 100 101 103   99 104
18   2010    Blue Jays   101     98    103    117 104 102   99 100 100 100   98 101
19   2010 Diamondbacks   106    101    108    121 104  99   98 101  99 100   96 101
20   2010       Braves   100    101     99     99  97 102  102 100  99 103   95  99
21   2010         Cubs   103    101    100     99 102 100  101 100 101  98  100 101
22   2010         Reds   102     99     97    100 111 101  100  99 101  98  102 103
23   2010      Rockies   113    106    108    123 114  96  100 104  99 108   94 106
24   2010      Marlins   101    100    102    102  97 106  105  99  99  99  104  98
25   2010       Astros    99     99    100    100 104 102  101 100 100  98   99 101
26   2010      Dodgers    95     99     96     77  98 101   97  99 100  97  105  98
27   2010      Brewers   100     98    100    104 106 103  102  98 102 100  101 102
28   2010    Nationals   100    101    101     99 100  97   97 102 100 102   99 101
29   2010         Mets    97     99     97    109  93 101  101  99 102  97  108  98
30   2010     Phillies   100     99    100     92 102 101  100  99  99 101  100 100
31   2010      Pirates    97    100    101     93  92  95   97 102 100 102   98  98
32   2010    Cardinals    97    100     97     92  92  99  100 100  99 101  100  98
33   2010       Padres    92     97     93    105  89 103  103 100  98  97   97  97
34   2010       Giants    96     99    100    102  91  99   98 101  96 100   99  97
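As a rough illustration of what you might do with a table like this, the sketch below rebuilds four rows of the 2010 park factors by hand (so it runs without the package) and ranks them by home run factor. It assumes the usual convention that a factor of 100 is league average. The `parks` data frame and the `hr_friendly` column are names made up for this example; they are not part of baseballr.

```r
# Four rows copied by hand from the 2010 park factor output
# (season, team, basic factor, home run factor).
parks <- data.frame(
  season    = 2010,
  home_team = c("Rockies", "White Sox", "Padres", "Giants"),
  basic     = c(113, 105, 92, 96),
  hr        = c(114, 113, 89, 91),
  stringsAsFactors = FALSE
)

# ASSUMPTION: 100 is league average, so anything above 100
# inflates home runs and anything below suppresses them.
parks$hr_friendly <- parks$hr > 100

# Sort from most to least homer-friendly.
parks <- parks[order(-parks$hr), ]
print(parks[, c("home_team", "hr", "hr_friendly")])
```

With the full output of fg_park() in hand, the same two lines of sorting and flagging would apply to all 30 parks.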
Sometimes I like to pull team data such as their schedule and record (which is very helpful for my “team consistency” work). Baseball-Reference is the easiest site to acquire this from, so I created a function that allows you to specify the team and year and get back detailed information about the outcome of each of their games.
Using the team_results_bref() function, here’s what the first 10 games of Houston’s 2015 schedule and results would look like:
> head(team_results_bref("HOU", 2015),10)
   Rk Gm#              Date  Tm H_A Opp Result R RA Inn Record Rank   GB      Win          Loss      Save Time D/N Attendance Streak
 1  1   1     Monday, Apr 6 HOU   H CLE      W 2  0  NA    1-0    1 Tied  Keuchel        Kluber Gregerson 2:30   N      43753      1
 2  2   2  Wednesday, Apr 8 HOU   H CLE      L 0  2  NA    1-1    3  0.5 Carrasco       Feldman     Allen 2:40   N      23078     -1
 3  3   3   Thursday, Apr 9 HOU   H CLE      L 1  5  NA    1-2    4  1.0    Bauer Wojciechowski           3:08   D      22593     -2
 4  4   4    Friday, Apr 10 HOU   A TEX      W 5  1  NA    2-2    2  0.5   McHugh       Holland           2:45   D      48885      1
 5  5   5  Saturday, Apr 11 HOU   A TEX      L 2  6  NA    2-3    3  0.5 Gallardo     Hernandez           3:18   N      36833     -1
 6  6   6    Sunday, Apr 12 HOU   A TEX      W 6  4  14    3-3    1 Tied   Harris       Verrett    Deduno 4:24   D      35276      1
 7  7   7    Monday, Apr 13 HOU   H OAK      L 1  8  NA    3-4    2  0.5   Kazmir       Feldman           2:51   N      19279     -1
 8  8   8   Tuesday, Apr 14 HOU   H OAK      L 0  4  NA    3-5    3  1.5 Graveman       Peacock           2:58   N      18935     -2
 9  9   9 Wednesday, Apr 15 HOU   H OAK      W 6  1  NA    4-5    2  0.5   McHugh      Pomeranz           2:42   N      19777      1
10 10  10    Friday, Apr 17 HOU   H LAA      L 3  6  NA    4-6    4  1.0    Ramos        Qualls    Street 2:57   N      22660     -1
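To give a sense of how game-level results feed into team-level work, here is a minimal sketch that copies the Result, R and RA columns for Houston's first ten games of 2015 into a small data frame and builds a running win total and run differential. The `games` data frame and the `cum_wins` and `cum_run_diff` columns are hypothetical names for this example only, and the code assumes every result is a plain "W" or "L".

```r
# First ten 2015 Astros games, copied by hand from the
# Result, R and RA columns of the Baseball-Reference output.
games <- data.frame(
  Gm     = 1:10,
  Result = c("W", "L", "L", "W", "L", "W", "L", "L", "W", "L"),
  R      = c(2, 0, 1, 5, 2, 6, 1, 0, 6, 3),
  RA     = c(0, 2, 5, 1, 6, 4, 8, 4, 1, 6),
  stringsAsFactors = FALSE
)

# Running win total and running run differential.
games$cum_wins     <- cumsum(games$Result == "W")
games$cum_run_diff <- cumsum(games$R - games$RA)

# Record after ten games, in the same "W-L" style as the Record column.
wins   <- sum(games$Result == "W")
losses <- sum(games$Result == "L")
record <- paste(wins, losses, sep = "-")

print(record)
print(games)
```

The record computed this way can be checked against the Record column in the tenth row of the scraped table.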
Finally, it’s fairly easy to get player performance data for many standard splits, such as by month or by pitcher handedness. But we may want to grab information over a very specific time frame; say, batter performance from August 10, 2015 through the end of the 2015 season. Without access to a game-by-game database this would be impossible, or at least incredibly time-consuming if you wanted to compile it by hand.
The daily_batter_bref() function makes this very simple. All you need to pass to the function is the first and last date you are interested in. The function will then pull batter performance only over this time frame from Baseball-Reference (the first six records are shown below):
> x <- daily_batter_bref("2015-08-10", "2015-10-04")
> head(x)
           Name Age  Level          Team  G  PA  AB  R  H X1B X2B X3B HR RBI BB IBB SO HBP SH SF
1 Shin-Soo Choo  32 MLB-AL         Texas 52 237 191 45 66  46  11   1  8  32 37   1 46   7  1  1
2 Manny Machado  22 MLB-AL     Baltimore 52 234 205 31 52  32   9   0 11  29 26   1 40   2  0  1
3    Adam Eaton  26 MLB-AL       Chicago 50 230 203 31 66  51   9   1  5  29 18   1 45   5  2  2
4  Kole Calhoun  27 MLB-AL   Los Angeles 52 229 213 27 46  31   4   1 10  23 12   0 61   3  0  1
5  Mookie Betts  22 MLB-AL        Boston 48 228 209 40 71  44  17   2  8  29 16   0 31   1  0  2
6    Matt Duffy  24 MLB-NL San Francisco 51 227 212 29 58  46   8   1  3  26 14   0 33   0  0  1
  GDP SB CS    BA   OBP   SLG   OPS
1   1  2  0 0.346 0.466 0.539 1.005
2   5  5  3 0.254 0.342 0.459 0.800
3   1  7  4 0.325 0.390 0.453 0.844
4   2  0  0 0.216 0.266 0.385 0.651
5   1  8  2 0.340 0.386 0.555 0.941
6   8  7  0 0.274 0.317 0.363 0.680
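Because the scrape returns raw counting stats, the rate stats can be rebuilt from them. The sketch below does that for two of the batters, using the standard definitions of BA, OBP, SLG and OPS. The `bat` data frame is typed in by hand for this example, and the column names simply mirror the ones in the scraped data.

```r
# Counting stats for two batters, copied by hand from the scraped data.
bat <- data.frame(
  Name = c("Shin-Soo Choo", "Mookie Betts"),
  AB   = c(191, 209),
  H    = c(66, 71),
  X1B  = c(46, 44),
  X2B  = c(11, 17),
  X3B  = c(1, 2),
  HR   = c(8, 8),
  BB   = c(37, 16),
  HBP  = c(7, 1),
  SF   = c(1, 2),
  stringsAsFactors = FALSE
)

# Standard rate stat definitions.
bat$BA  <- bat$H / bat$AB
bat$OBP <- (bat$H + bat$BB + bat$HBP) /
           (bat$AB + bat$BB + bat$HBP + bat$SF)
total_bases <- bat$X1B + 2 * bat$X2B + 3 * bat$X3B + 4 * bat$HR
bat$SLG <- total_bases / bat$AB
bat$OPS <- bat$OBP + bat$SLG

print(cbind(bat["Name"], round(bat[, c("BA", "OBP", "SLG", "OPS")], 3)))
```

Rounded to three decimals, these should agree with the BA, OBP, SLG and OPS columns Baseball-Reference reports for the same two players.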
Metric Calculation
FanGraphs and Baseball-Reference do the hard work of calculating some of the most commonly used advanced metrics for visitors. However, there are times when you might want to calculate some of these metrics yourself.
Let’s take our last example, where you have data over a very specific time frame. FanGraphs doesn’t produce wOBA or wRC+ for custom time frames, but there is nothing stopping you from calculating statistics like these as long as you have the basic data.
The function below will (eventually) calculate wOBA, wRC, and wRC+ for any player over any timeframe, so long as you feed it the requisite data. For now, the function will only calculate wOBA (hey, I’m working on it).
As an example, let’s say you want to know the wOBA for players from August 10, 2015, through the end of the regular season. It’s a snap as long as you have the data in the right format. We can just feed the woba_plus() function the data we just scraped. Here I am just showing the top-15 players by their wOBA:
> library(dplyr)
> x <- daily_batter_bref("2015-08-10", "2015-10-04")
> df <- woba_plus(x)
> filter(df, PA > 100) %>% .[,c(2,43)]
                 Name  wOBA
 1  Edwin Encarnacion 0.492
 2        David Ortiz 0.470
 3         Joey Votto 0.463
 4       Bryce Harper 0.459
 5        Chris Davis 0.443
 6      Shin-Soo Choo 0.439
 7   Francisco Lindor 0.435
 8 Franklin Gutierrez 0.431
 9      Jose Bautista 0.426
10     Ryan Zimmerman 0.424
11       Corey Seager 0.421
12         Mike Trout 0.415
13     Starlin Castro 0.413
14       A.J. Pollock 0.412
15     Mike Moustakas 0.412
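For readers curious about what a calculation like this involves, here is a minimal sketch of the general wOBA formula: linear weights applied to unintentional walks, hit-by-pitches and each hit type, divided by AB + BB - IBB + SF + HBP. It is not the package's woba_plus() code. The `woba_sketch()` helper and the `weights_2015` vector are hypothetical, and the weights are my approximation of FanGraphs' 2015 constants, so the results may not match the package's output exactly. In practice you would pull the real values from the yearly constants table.

```r
# ASSUMPTION: approximate FanGraphs linear weights for 2015.
# Replace these with the published constants for real work.
weights_2015 <- c(wBB = 0.687, wHBP = 0.718, w1B = 0.881,
                  w2B = 1.256, w3B = 1.594, wHR = 2.065)

# Hypothetical helper (not part of baseballr): wOBA from counting stats.
woba_sketch <- function(df, w) {
  ubb <- df$BB - df$IBB  # unintentional walks
  num <- w[["wBB"]]  * ubb    +
         w[["wHBP"]] * df$HBP +
         w[["w1B"]]  * df$X1B +
         w[["w2B"]]  * df$X2B +
         w[["w3B"]]  * df$X3B +
         w[["wHR"]]  * df$HR
  den <- df$AB + df$BB - df$IBB + df$SF + df$HBP
  round(num / den, 3)
}

# Two batters' counting stats, copied by hand from the scraped data.
batters <- data.frame(
  Name = c("Shin-Soo Choo", "Kole Calhoun"),
  AB   = c(191, 213),
  X1B  = c(46, 31),
  X2B  = c(11, 4),
  X3B  = c(1, 1),
  HR   = c(8, 10),
  BB   = c(37, 12),
  IBB  = c(1, 0),
  HBP  = c(7, 3),
  SF   = c(1, 1),
  stringsAsFactors = FALSE
)

batters$wOBA <- woba_sketch(batters, weights_2015)
print(batters[, c("Name", "wOBA")])
```

Passing the weights in as an argument, rather than hard-coding them, makes it straightforward to swap in a different season's constants.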
I am also planning to include functions that will calculate some of the custom metrics that I have developed and co-developed over the years. Take team consistency, for example. If someone wants to know how consistent each team was in terms of their run scoring and run prevention in 2015 they can easily calculate that with the team_consistency() function:
> team_consistency(2015)
   Team Con_R Con_RA Con_R_Ptile Con_RA_Ptile
 1  ARI  0.37   0.36          22           15
 2  ATL  0.41   0.40          87           67
 3  BAL  0.40   0.38          70           42
 4  BOS  0.39   0.40          52           67
 5  CHC  0.38   0.41          33           88
 6  CHW  0.39   0.40          52           67
 7  CIN  0.41   0.36          87           15
 8  CLE  0.41   0.40          87           67
 9  COL  0.35   0.34           7            3
10  DET  0.39   0.38          52           42
11  HOU  0.39   0.36          52           15
12  KCR  0.37   0.39          22           50
13  LAA  0.40   0.38          70           42
14  LAD  0.37   0.43          22           98
15  MIA  0.41   0.37          87           30
16  MIL  0.40   0.36          70           15
17  MIN  0.38   0.41          33           88
18  NYM  0.41   0.40          87           67
19  NYY  0.41   0.38          87           42
20  OAK  0.38   0.41          33           88
21  PHI  0.39   0.37          52           30
22  PIT  0.39   0.36          52           15
23  SDP  0.42   0.36         100           15
24  SEA  0.35   0.41           7           88
25  SFG  0.39   0.40          52           67
26  STL  0.37   0.43          22           98
27  TBR  0.36   0.40          13           67
28  TEX  0.39   0.40          52           67
29  TOR  0.35   0.37           7           30
30  WSN  0.41   0.40          87           67
You can play with the individual functions, or install the development version of the package using devtools. See here for instructions.
Next Steps
All of the development can be tracked on GitHub, including the development version of the package. My plan is to flesh out additional data acquisition functions, largely through existing application programming interfaces (APIs) or website scraping. Additional metrics will be added, specifically wOBA on contact, wOBA per pitch and Edge% based on PITCHf/x data, and individual player consistency/volatility. I am also toying with some visualization functions, but more on those later.
Feel free to send suggestions or requests along, especially any feedback on the draft versions of the functions (which will be housed here). I can’t promise I will be able to incorporate all of them (or even most of them), but I will certainly do what I can.
You may already be planning on doing this but I’d love if you could include count splits.
So, batter or pitcher stats cut by each ball-strike count?
Yes
Awesome package! I was just thinking on working on similar functions that do exactly this for a piece I am planning to write. You’ve saved me a lot of time! Keep it up.
Great work! Very excited to see this come to fruition!
You are doing the Lord’s work.
I could kiss you, if you were into that sort of thing.
Good timing! I’m starting to learn R for work, and I can actually recognize some of the command lines above! Exciting! Look forward to playing around with this package.
Out of curiosity, did you consider any of the Python/Anaconda stack?
I’m an R guy, so no. But am going to dive deeper into Python this year.
Thanks for nothing Bill! Now I HAVE to get off my butt and finally learn R.
This requires a package? These functions are seriously easy.
I love this! Thank you for doing this – really cool.
This may be tough, but I often have trouble finding batted ball data by pitch type and splits. Also, pitch framing and defensive shift data would be awesome to have.
The timing is impeccable. I focused this weekend on trying to put my programming chops into actually getting into computational baseball analysis, i.e., learning R, finally, looking at ways of acquiring data from various sources, and just seeing what could be done with a little creativity, the data and the tools to question the data.
Thank you for putting this together and for posting this.
I am pretty comfortable with R but otherwise completely computer illiterate, so forgive me if this is a simple question, but I have been trying to scrape the ZiPS/Fans/Steamer projections into R from FanGraphs with no success. Of course getting the data into R once is trivial, but it would be great to be able to scrape the ROS projections daily. Friends have pointed me to tutorials/blog posts on APIs and scraping, but they are 1) all in Python and 2) seem to require knowledge of Java or the specific website in question. I think this might be an interesting add. Regardless, heading to devtools to download now…
I LOVE your tools and really appreciate you. The first time I recall seeing your name was at the Tableau Website seeking a little baseball data viz…That was good stuff, and ALL THESE NEW TOOLS are like Christmas Day for me…
Thank you !!!
I am a fan,
Andrea
It would be great to see the visualization options implemented in ggplot2 and/or Shiny. Great start!