A Short(-ish) Introduction to Using R Packages for Baseball Research

This visual from Jacob deGrom’s May 21 start is just one example of the outputs you can create in R.
Introduction
If you follow me at all, you'll know that I love R, the statistical programming language. There is a bit of a learning curve, but it's pretty minimal compared to many other languages and software programs. Best of all, it's free, and there is a massive network of contributors constantly building new packages that make it easy to apply all sorts of techniques and functions to your data.
Our fearless editor, Paul Swydan, asked if I would write up what R packages I regularly use.
There are some great resources out there for learning R and for learning how to analyze baseball data with it. In fact, a few pretty smart people wrote a fantastic book on the subject, coincidentally titled Analyzing Baseball Data with R. I can’t say enough about this book as a reference, both for baseball analysis and for R. Go and buy it. What follows is in no way a substitute for that book; instead, think of this as a quick reference based on some of the tools that I regularly use (or in some cases, should probably use more).
I would also highly recommend the free, online edX course Sabermetrics 101. The course is run by Andy Andres and features not only an introduction to sabermetric analysis, but also SQL and R. I walked through the first version and have heard that the latest version is even better. There's also the three-part series (parts one, two and three) Brice Russ did at TechGraphs on using R for sports stats.
Note that what follows is really meant for those just getting started with R, or who haven’t yet used R for baseball research, rather than those who are more experienced.
Getting Started
Before we actually analyze anything, we need to make sure we have R set up. This is pretty simple to do (there are about 440,000 search results for installing R and RStudio), so I'll just provide a very high-level view of the process.
First, get yourself over to CRAN (The Comprehensive R Archive Network) and, on the first page, you will see links to download and install R for Linux, Mac or Windows.
Second, after you've installed the latest version of R, I highly recommend grabbing an IDE (Integrated Development Environment), specifically RStudio. An IDE, in case you're not familiar with the concept, is a programming environment packaged as an application, typically consisting of a code editor, a compiler, a debugger, and a graphical user interface (GUI) builder. Pretty snazzy. RStudio is free to install and makes working with R even easier than it already is. Here's a screenshot from my setup:

In the upper left is the console, where you can input commands and view some output. You can view data sets and source code in the bottom-left window. The right side has panes for viewing the objects available in your current environment (like data sets), for viewing and installing packages, for viewing plots you've created, and for searching for help.
Also, you will need to install and load various packages into R from CRAN (and occasionally from beyond CRAN). An easy way to manage this is with the pacman package. First, install and load it:

install.packages("pacman")
library(pacman)
Once you have pacman installed you can use the p_load function to install and load multiple packages at once, simply by typing in the name of the package. For example:
p_load(Lahman, dplyr)
If either of the two packages (Lahman or dplyr) is not already installed on your system, p_load will install it before loading it into your R session, which is pretty convenient.
Baseball-specific Packages
You can’t analyze baseball data without the data. Thankfully, the Lahman package makes it easy to get started.
As the name suggests, the Lahman package allows you to access the incredible Lahman database without having to actually download and install the database itself. The package is essentially just a collection of all the tables from the Lahman database in a set of data frames. Let’s load the package:
p_load(Lahman)
When you load the package, the global environment won't show you anything but, trust me, the data are there. The package documentation has a full accounting of the data frames included, but basically they mirror the separate tables available in the regular Lahman database. The first few rows of the Master table can be viewed using the head function:
head(Master)

   playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
1 aardsda01      1981         12       27          USA         CO     Denver
2 aaronha01      1934          2        5          USA         AL     Mobile
3 aaronto01      1939          8        5          USA         AL     Mobile
4  aasedo01      1954          9        8          USA         CA     Orange
5  abadan01      1972          8       25          USA         FL Palm Beach
6  abadfe01      1985         12       17         D.R.  La Romana  La Romana
  deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
1        NA         NA       NA         <NA>       <NA>      <NA>     David
2        NA         NA       NA         <NA>       <NA>      <NA>      Hank
3      1984          8       16          USA         GA   Atlanta    Tommie
4        NA         NA       NA         <NA>       <NA>      <NA>       Don
5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
6        NA         NA       NA         <NA>       <NA>      <NA>  Fernando
  nameLast        nameGiven weight height bats throws      debut  finalGame
1  Aardsma      David Allan    205     75    R      R 2004-04-06 2013-09-28
2    Aaron      Henry Louis    180     72    R      R 1954-04-13 1976-10-03
3    Aaron       Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
4     Aase   Donald William    190     75    R      R 1977-07-26 1990-10-03
5     Abad    Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
6     Abad Fernando Antonio    220     73    L      L 2010-07-28 2013-09-27
   retroID   bbrefID  deathDate  birthDate
1 aardd001 aardsda01       <NA> 1981-12-27
2 aaroh101 aaronha01       <NA> 1934-02-05
3 aarot101 aaronto01 1984-08-16 1939-08-05
4 aased001  aasedo01       <NA> 1954-09-08
5 abada001  abadan01       <NA> 1972-08-25
6 abadf001  abadfe01       <NA> 1985-12-17
To get the most out of the Lahman package, you will need to perform SQL-like queries on its tables in R. There are multiple ways to do this, a few of which I will explore in the next section on data-manipulation packages like dplyr and sqldf.
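Because these are ordinary data frames, even base R subsetting works before we get to those packages. As a quick taste, here are Babe Ruth's season-by-season home run totals (his Lahman playerID is "ruthba01"):

```r
library(Lahman)

# base R subsetting on the Batting data frame:
# rows where playerID matches Ruth, keeping three columns
ruth <- Batting[Batting$playerID == "ruthba01", c("yearID", "teamID", "HR")]
head(ruth, 3)
```

Once the queries get more involved, though, this bracket syntax gets unwieldy fast, which is where the packages below come in.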
Of course, Lahman doesn’t include play-by-play or PITCHf/x data. Thankfully, there are a few other packages you can use to grab this information.
In terms of PITCHf/x data, the best package I’ve seen is Carson Sievert’s pitchRx. It’s just phenomenal. There is no way to cover all of its features here, so I’ll just introduce it for the moment.
The package allows you to scrape specific data or build and store your own PITCHf/x database. Let’s say you want to view data from Jacob deGrom’s May 21 start against the Cardinals. The scrape function makes this extremely easy:
library(pitchRx)
library(dplyr)

dat <- scrape("2015-05-21", "2015-05-21")

locations <- select(dat$pitch, pitch_type, start_speed, px, pz, des, num, gameday_link)

names <- select(dat$atbat, pitcher, batter, pitcher_name, batter_name, num, gameday_link, event, stand)

data <- names %>%
    filter(pitcher_name == "Jacob DeGrom") %>%
    inner_join(locations, ., by = c("num", "gameday_link"))
Essentially, you create two tables — locations, which pulls from the pitch table, and names, which pulls from the at-bat table — and then join them together, filtering on the pitcher’s name. Here are the first six rows of data:
head(data)

  pitch_type start_speed     px    pz             des num
1         CU        80.5  0.064 2.930   Called Strike   1
2         SL        89.8 -0.839 2.037   Called Strike   1
3         FF        96.9 -1.568 4.558            Ball   1
4         FF        95.5 -0.565 4.135 Swinging Strike   1
5         FF        95.4 -1.381 2.949            Ball   2
6         SL        91.4 -0.596 1.755   Called Strike   2
                    gameday_link pitcher batter pitcher_name    batter_name
1 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
2 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
3 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
4 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
5 gid_2015_05_21_slnmlb_nynmlb_1  594798 572761 Jacob DeGrom Matt Carpenter
6 gid_2015_05_21_slnmlb_nynmlb_1  594798 572761 Jacob DeGrom Matt Carpenter
      event stand
1 Strikeout     L
2 Strikeout     L
3 Strikeout     L
4 Strikeout     L
5    Single     L
6    Single     L
DeGrom threw a gem that day, striking out 11 and walking none over eight innings, and we now have pitch type, speed, location and result data for each of the 104 pitches he threw in that game. Analyzing the data, however, requires the use of some other packages–like dplyr–which we will get into below.
The problem with PITCHf/x data is that the system came online only in 2008, and the data took some time to become both comprehensive and reliable. Long before we had PITCHf/x, however, we had the amazing Retrosheet. The fact that these data are freely available is just tremendous, but what if you don’t want to deal with a bunch of csv files, or build your own database? Well, there is a new package out — creatively titled retrosheet — which looks promising.
I have not used this package much, but I think it’s worth exploring more. For example, you can pull roster data for a given year and look at specific teams with a single line of code. Here’s how to pull the 1969 Mets roster, with the first 10 players shown:
retro <- getRetrosheet("roster", 1969)
retro$NYN

    retroID      Last   First Bat Throw Team Pos
1  ageet101      Agee  Tommie   R     R  NYN   X
2  boswk101   Boswell     Ken   L     R  NYN   X
3  cardd101  Cardwell     Don   R     R  NYN   X
4  chare101   Charles      Ed   R     R  NYN   X
5  clend101 Clendenon    Donn   R     R  NYN   X
6  collk101   Collins   Kevin   L     R  NYN   X
7  dilaj101   DiLauro    Jack   B     L  NYN   X
8  dyerd101      Dyer   Duffy   R     R  NYN   X
9  frisd101  Frisella   Danny   L     R  NYN   X
10 garrw101   Garrett   Wayne   L     R  NYN   X
With just a few more lines of code, you could also pull their schedule for that season:
retro_sch <- getRetrosheet("schedule", 1969)

NYMa <- filter(retro_sch, VisTeam == "NYN")
NYMh <- filter(retro_sch, HmTeam == "NYN")
NYM1969 <- rbind(NYMa, NYMh) %>% arrange(Date)

head(NYM1969)

      Date GameNo Day VisTeam VisLg VisGmNo HmTeam HmLg HmGmNo TimeOfDay
1 19690408      0 Tue     MON    NL       1    NYN   NL      1         d
2 19690409      0 Wed     MON    NL       2    NYN   NL      2         d
3 19690410      0 Thu     MON    NL       3    NYN   NL      3         d
4 19690411      0 Fri     SLN    NL       4    NYN   NL      4         d
5 19690412      0 Sat     SLN    NL       5    NYN   NL      5         d
6 19690413      0 Sun     SLN    NL       6    NYN   NL      6         d
  Postponed Makeup
1        NA     NA
2        NA     NA
3        NA     NA
4        NA     NA
5        NA     NA
6        NA     NA
I encourage you to play around with it, as you can also pull event and game log data as well.
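Game logs, for instance, come back the same way. Here's a sketch using the "game" data type; check ?getRetrosheet after installing for the exact argument names and the columns each type returns:

```r
library(retrosheet)

# game logs for the 1969 season; one row per game played
glog <- getRetrosheet("game", 1969)

# inspect the structure rather than assume column names
str(glog, list.len = 5)
```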
The last baseball-specific package comes from the ambitious openWAR project by Ben Baumer, Shane Jensen and Gregory Matthews. I say it's ambitious because it isn't just a package for gathering data; it aims to implement a more transparent and reproducible version of Wins Above Replacement, as well as provide insight into the uncertainty of our estimates of individual player WAR.
For a full rundown you should read their detailed paper, which can be accessed here in PDF format, as well as this presentation from 2013.
The package relies on parsing data from MLB's Gameday server, much like the pitchRx package above, except that it pulls the results of at-bats instead of every pitch. I have not used openWAR much, but it is on my list to explore in greater detail. That being said, I highly recommend diving into it, as it appears to be a great way to not only grab data but also analyze player performance in a rigorous way.
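Getting started might look something like the sketch below. Note the assumptions: openWAR is not on CRAN, the GitHub repository name shown here is my best guess and worth verifying, and getData is the scraping function named in the project's materials.

```r
# install from GitHub (openWAR is not on CRAN); repo name is an assumption
library(devtools)
install_github("beanumber/openWAR")
library(openWAR)

# scrape the play-by-play data for a single day of games
ds <- getData(start = "2015-05-21", end = "2015-05-21")
head(ds)
```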
As cool as these packages are, one can’t live by baseball-themed packages alone. You need some help manipulating the data, and that’s where we will focus next.
Data-manipulation Packages
You probably noticed some additional packages and functions in the code above that were not part of the baseball-specific packages. These I am characterizing as data-manipulation packages, and they are every bit as important to conducting any kind of analysis in R, baseball or otherwise.
Now, there are tons of packages one could use to manipulate data in R. Here, I’ll outline a few I find most useful on a day-to-day basis.
Connecting to Databases
Let's say you have a database, either on your hard disk or one you connect to remotely, that you want to interact with from within the R environment. There are a few packages you could leverage, but the one I currently use is RMySQL. RMySQL allows you to establish a connection to your database and then perform regular SQL queries on the data to your heart's content.
As an example, assume you have the Lahman database installed on your computer already. Rather than fire up your favorite SQL tool, run a query, export the data, and then import it into R for analysis, you can simply do all of this from within R.
First, you need to provide your connection information and save that as an object–we’ll call it con:
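A minimal sketch of that connection, assuming a local MySQL server with the Lahman database loaded under the name "lahman"; the host, user and password values here are placeholders you would swap for your own:

```r
library(RMySQL)

# open a connection to the local Lahman database
con <- dbConnect(MySQL(),
                 dbname   = "lahman",
                 host     = "localhost",
                 user     = "your_username",
                 password = "your_password")
```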
thirty_thirty <- dbGetQuery(con, "SELECT CONCAT(m.nameLast, ', ', m.nameFirst) as 'Player',
                                  yearID as 'Season', teamID as 'Team', HR, SB
                                  FROM batting b
                                  JOIN Master m ON b.playerID = m.playerID
                                  WHERE HR >= 30 AND SB >= 30
                                  ORDER BY yearID DESC")

# make sure to close your connection and detach the package from your
# environment before using another SQL-like package, like sqldf below
dbDisconnect(con)
detach("package:RMySQL")
We now have a data set with 58 cases, one for every instance in baseball history where a player has amassed at least 30 home runs and 30 stolen bases in a single season. Here are the first six records:
head(thirty_thirty)

            Player Season Team HR SB
1      Braun, Ryan   2012  MIL 41 30
2      Trout, Mike   2012  LAA 30 49
3       Kemp, Matt   2011  LAN 39 40
4     Kinsler, Ian   2011  TEX 32 30
5      Braun, Ryan   2011  MIL 33 33
6 Ellsbury, Jacoby   2011  BOS 32 39
Data Munging
The vast majority of any analysis consists of data acquisition and, more importantly, data munging: essentially, cleaning and manipulating the data into the right form for whatever particular analysis you want to conduct.
Sometimes you already have a data set, or multiple data sets, loaded into R that are not accessible in some sort of database and you need to merge them together. For example, say you had one table of player names along with some type of player IDs, and another table with player statistics but no names. This is a very simplified example, but one we run into all the time; just look at our RMySQL example above, where we needed to join the player names from the Master table to the performance data from the Batting table. If you aren't working in a database, you might pull open both tables in Excel and use the VLOOKUP function to merge the two. But if you have them in R, you can just use the SQL syntax you are used to by leveraging the sqldf package.
The sqldf package is an easy-to-use tool that allows you to query separate data objects in R as if they were tables in a database. For simplicity's sake, assume you have the Lahman Master and Batting tables downloaded as csv files and loaded into R. Recreating the 30-30 club data set above is incredibly easy with sqldf:
thirty_thirty_sqldf <- sqldf("SELECT m.nameLast||', '||m.nameFirst as 'Player',
                             yearID as 'Season', teamID as 'Team', HR, SB
                             FROM Batting b
                             JOIN Master m ON b.playerID = m.playerID
                             WHERE HR >= 30 AND SB >= 30
                             ORDER BY yearID DESC")

head(thirty_thirty_sqldf)

            Player Season Team HR SB
1      Braun, Ryan   2012  MIL 41 30
2      Trout, Mike   2012  LAA 30 49
3      Braun, Ryan   2011  MIL 33 33
4 Ellsbury, Jacoby   2011  BOS 32 39
5       Kemp, Matt   2011  LAN 39 40
6     Kinsler, Ian   2011  TEX 32 30
Presto! Can’t get much easier than that.
The granddaddy of them all, however, is arguably Hadley Wickham's dplyr package. As much as I loved using sqldf, someone once told me that dplyr would eventually become my go-to package for almost any analysis, and they were right. With dplyr you can filter and slice data, select and reorder columns and variables, group and summarize data, and join data sets in much the same way you would using SQL-style queries. It is the Swiss Army knife of R for data junkies.
The most important thing to know about dplyr is how to use the pipe operator, %>%. The pipe operator takes whatever value is on its left and pipes it into the first argument position of the function on its right, or into whatever position you mark with a period (.).
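A trivial illustration with the built-in mtcars data set shows both behaviors:

```r
library(dplyr)

# these two calls are equivalent: the left-hand value becomes
# the first argument of the function on the right
head(mtcars, 2)
mtcars %>% head(2)

# a period pipes the value into a different argument position;
# this also evaluates to head(mtcars, 2)
2 %>% head(mtcars, .)
```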
Let’s return to our 30-30 club data set. First, we can create that data set from scratch with dplyr:
require(dplyr)

thirty_thirty_dplyr <- filter(Lahman::Batting, HR >= 30, SB >= 30) %>%
    left_join(Lahman::Master, by = "playerID") %>%
    arrange(desc(yearID)) %>%
    mutate(Player = paste(nameLast, nameFirst, sep = ", ")) %>%
    select(Player, yearID, teamID, HR, SB)

head(thirty_thirty_dplyr)

            Player yearID teamID HR SB
1      Braun, Ryan   2012    MIL 41 30
2      Trout, Mike   2012    LAA 30 49
3      Braun, Ryan   2011    MIL 33 33
4 Ellsbury, Jacoby   2011    BOS 32 39
5       Kemp, Matt   2011    LAN 39 40
6     Kinsler, Ian   2011    TEX 32 30
We can then look at which players appeared the most times on the list:
count <- thirty_thirty_dplyr %>%
    group_by(Player) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count))

count

               Player Count
1        Bonds, Barry     5
2        Bonds, Bobby     4
3    Soriano, Alfonso     4
4     Johnson, Howard     3
5        Abreu, Bobby     2
6       Bagwell, Jeff     2
7         Braun, Ryan     2
8           Gant, Ron     2
9  Guerrero, Vladimir     2
10       Kinsler, Ian     2
..                ...   ...
Or we can look at which seasons produced the most 30-30 players:
count_season <- thirty_thirty_dplyr %>%
    group_by(yearID) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count))

count_season

   yearID Count
1    1987     4
2    1996     4
3    1997     4
4    2011     4
5    2001     3
6    2007     3
7    1990     2
8    1991     2
9    1995     2
10   1998     2
..    ...   ...
Or the teams with the most 30-30 seasons:
count_team <- thirty_thirty_dplyr %>%
    group_by(teamID) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count))

count_team

   teamID Count
1     NYN     5
2     SFN     5
3     ATL     3
4     CIN     3
5     COL     3
6     LAN     3
7     NYA     3
8     PHI     3
9     TEX     3
10    CHN     2
..    ...   ...
One of my favorite uses for dplyr is for creating year-to-year data sets when I want to compare player performance or create aging curves.
As a quick example, let's say we want to see which players saw the greatest increase in their home run rate (home runs per 600 balls in play) between 2000 and 2010. We can use Lahman and dplyr to pull this together pretty easily. We will limit the data set to players with at least 400 at-bats in both the first and second season:
hr_y2y <- filter(Lahman::Batting, yearID >= 2000, yearID < 2011) %>%
    left_join(Lahman::Master, by = "playerID") %>%
    arrange(desc(yearID)) %>%
    mutate(Player = paste(nameLast, nameFirst, sep = ", ")) %>%
    select(Player, yearID, teamID, HR, AB, SO, SF) %>%
    mutate(HR_rate = round((HR/(AB + SF - SO) * 600), 1)) %>%
    filter(AB >= 400) %>%
    mutate(Season_next = yearID + 1) %>%
    left_join(., ., by = c("Season_next" = "yearID", "Player" = "Player")) %>%
    filter(!is.na(HR.y)) %>%
    mutate(HR_rate_change = (HR_rate.y - HR_rate.x)) %>%
    arrange(desc(HR_rate_change)) %>%
    select(Player, yearID, teamID.x, HR.x, HR_rate.x, Season_next,
           teamID.y, HR.y, HR_rate.y, HR_rate_change)

head(hr_y2y)

           Player yearID teamID.x HR.x HR_rate.x Season_next teamID.y HR.y
1    Bonds, Barry   2000      SFN   49      71.7        2001      SFN   73
2 Beltran, Carlos   2005      NYN   16      19.5        2006      NYN   41
3 Ensberg, Morgan   2004      HOU   10      16.3        2005      HOU   36
4  Gonzalez, Luis   2000      ARI   31      34.1        2001      ARI   57
5      Hall, Bill   2005      MIL   17      25.4        2006      MIL   35
6      Thome, Jim   2000      CLE   37      56.8        2001      CLE   49
  HR_rate.y HR_rate_change
1     113.8           42.1
2      58.9           39.4
3      52.4           36.1
4      64.4           30.3
5      55.4           30.0
6      85.5           28.7
Man, there’s that Barry Bonds character again.
You get the picture. Bottom line: there are a million ways to leverage dplyr, and once you get up to speed on its functions you'll be amazed how much easier it makes your life.
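For instance, a grouped summary of the career home run leaders takes only a few piped lines:

```r
library(Lahman)
library(dplyr)

# sum each player's home runs across all seasons and stints
career_hr <- Batting %>%
    group_by(playerID) %>%
    summarise(HR = sum(HR, na.rm = TRUE)) %>%
    arrange(desc(HR))

head(career_hr, 3)  # Bonds, Aaron and Ruth top the list
```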
Scraping Data
No matter how robust your own database, there are usually more data you'd like access to. Take team records on every date in a given season. The best source I have seen for this is Baseball-Reference's "Standings on Any Date" feature. But what if you want every day from Opening Day until the playoffs? That's a lot of manual work. Here's where R and a package like XML can come in very handy.
First, take a look at the url for any date, say opening day this year:
http://www.baseball-reference.com/games/standings.cgi?year=2015&month=4&day=5&submit=Submit+Date
All we need to do is change the year, month and day entries in the url to jump to another date. XML's readHTMLTable function will scrape all the data tables present at a given url, and then you can select the table you want. Here's what this looks like in R if we want to pull the National League East standings for Aug. 31, 2015:
p_load(XML, dplyr)

dat <- readHTMLTable("http://www.baseball-reference.com/games/standings.cgi?year=2015&month=08&day=31&submit=Submit+Date")

## here are the divisions and corresponding elements in the list
# 2 AL EAST
# 3 AL CENTRAL
# 4 AL WEST
# 5 NL EAST
# 6 NL CENTRAL
# 7 NL WEST

dat[5]

[[1]]
   Tm  W  L  W-L%   GB  RS  RA pythW-L%
1 NYM 73 58  .557   -- 533 478     .550
2 WSN 66 64  .508  6.5 555 525     .525
3 ATL 54 77  .412 19.0 475 615     .384
4 MIA 53 79  .402 20.5 485 548     .444
5 PHI 52 80  .394 21.5 502 666     .373
Now, what really saves time is creating a list of dates and then letting R do the work of pulling all the records for each date for you. Behold:
# create a function for scraping the data given a specific date
date_scrape <- function(y, m, d) {
    url <- paste0("http://www.baseball-reference.com/games/standings.cgi?year=", y,
                  "&month=", m, "&day=", d, "&submit=Submit+Date")
    d <- readHTMLTable(url, stringsAsFactors = FALSE)
    d <- as.data.frame(d[5])
    d
}

# create a complete sequence of dates you want to scrape data for
dates <- as.data.frame(seq(as.Date("2015/04/05"), as.Date("2015/08/31"), by = "days"))
names(dates) <- "dates"

# split the dates so that there are three separate inputs to feed the function
# (colsplit comes from the reshape2 package)
p_load(reshape2)
dates <- colsplit(dates$dates, "-", c("y", "m", "d"))

# use the do() function to iterate the scrape function over all the dates
out <- dates %>%
    group_by(y, m, d) %>%
    do(date_scrape(.$y, .$m, .$d))

# view the first 10 rows
head(out, 10)

Source: local data frame [10 x 11]
Groups: y, m, d

      y m d  Tm W L  W.L.  GB RS RA pythW.L.
1  2015 4 5 MIA 0 0  .000  --  0  0
2  2015 4 5 PHI 0 0  .000  --  0  0
3  2015 4 5 WSN 0 0  .000  --  0  0
4  2015 4 5 ATL 0 0  .000  --  0  0
5  2015 4 5 NYM 0 0  .000  --  0  0
6  2015 4 6 ATL 1 0 1.000  --  2  1     .780
7  2015 4 6 NYM 1 0 1.000  --  3  1     .882
8  2015 4 6 MIA 0 1  .000 1.0  1  2     .220
9  2015 4 6 PHI 0 1  .000 1.0  0  8     .000
10 2015 4 6 WSN 0 1  .000 1.0  1  3     .118
You now have a data set with 745 rows, one each for every team’s record on every date in your sequence.
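A quick sanity check on that row count: five NL East teams times the number of dates in the sequence.

```r
# 149 dates from Apr. 5 through Aug. 31, five teams per date
n_dates <- length(seq(as.Date("2015/04/05"), as.Date("2015/08/31"), by = "days"))
n_dates * 5  # 745, matching nrow(out)
```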
Visualizing the Data
There are entire books dedicated to using R for creating visualizations. Here I'll just touch on my go-to package, which, not surprisingly, is ggplot2; it is widely hailed as the best visualization package for R. Base R does include various plotting tools, but ggplot2 gives you a ton of power over just about every aspect of the visual you want to create. The code does take some getting used to, but once you get the hang of it you can do some amazing stuff.
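Before the full standings plot, a minimal example with the built-in mtcars data shows the layered grammar: map variables to aesthetics, then add geoms and labels one layer at a time.

```r
library(ggplot2)

# one aesthetic mapping, one geom layer, one label layer
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(title = "Weight vs. Fuel Economy",
         x = "Weight (1,000 lbs)", y = "MPG")
```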
For now, let's say we want to visualize all of the National League East standings data we scraped. Here's how you might approach it using our existing data, dplyr and ggplot2:
require(ggplot2)

# pare down the data set and create a single column with the date of the standings
nle_standings_2015 <- ungroup(out) %>%
    mutate(Date = paste0(y, sep = "-", m, sep = "-", d)) %>%
    select(Date, Tm, GB)

# change the data type for the three columns
nle_standings_2015$GB <- as.numeric(nle_standings_2015$GB)
nle_standings_2015$Date <- as.Date(nle_standings_2015$Date)
nle_standings_2015$Tm <- as.factor(nle_standings_2015$Tm)

# make sure when a team is in first it has a 0 for the games back value
nle_standings_2015$GB <- ifelse(is.na(nle_standings_2015$GB), 0, nle_standings_2015$GB)

# set the color scheme for the teams
team_colors <- c("ATL" = "#01487E", "MIA" = "#0482CC", "NYM" = "#F7742C",
                 "PHI" = "#CA1F2C", "WSN" = "#575959")

# plot the data using ggplot2
plot <- ggplot(nle_standings_2015, aes(Date, GB, colour = factor(Tm), group = Tm)) +
    geom_line(size = 1.25, alpha = .75) +
    scale_colour_manual(values = team_colors, name = "Team") +
    scale_y_reverse(breaks = 0:25) +
    scale_x_date() +
    labs(title = "NL East Race through August 2015") +
    geom_text(aes(label = ifelse(Date == "2015-08-31", as.character(GB), '')),
              hjust = -.5, vjust = 0, size = 4, show_guide = FALSE) +
    theme(legend.title = element_text(size = 12)) +
    theme(legend.text = element_text(size = 12)) +
    theme(axis.text = element_text(size = 13, face = "bold"),
          axis.title = element_text(size = 18, color = "grey50", face = "bold"),
          plot.title = element_text(size = 35, face = "bold", vjust = 1))

# view the graphic
plot
And here’s the result:
We could also plot our PITCHf/x data from earlier. The pitchRx package does have some native graphic options, but we can create our own just for practice. Let’s plot the location of each of deGrom’s swinging strikes from that May 21 start, and color code each pitch by velocity:
# subset the data, keeping all rows but only columns 1 through 5 and 13
deGrom <- data[, c(1:5, 13)]

# filter for swinging strikes
deGrom_swing <- filter(deGrom, grepl("Swinging", des))

# plot the pitches, coloring them by velocity
p <- ggplot(deGrom_swing, aes(px, pz, color = start_speed))

# add in customized axis and legend formatting and labels
p <- p + scale_x_continuous(limits = c(-3, 3)) +
    scale_y_continuous(limits = c(0, 5)) +
    annotate("rect", xmin = -1, xmax = 1, ymin = 1.5, ymax = 3.5,
             color = "black", alpha = 0) +
    labs(title = "Jacob deGrom: Swinging Strikes, 5/21/2015") +
    xlab("Horizontal Location (ft.): Catcher's View") +
    ylab("Vertical Location (ft.)") +
    labs(color = "Velocity (mph)")

# format the points
p <- p + geom_point(size = 10, alpha = .65)

# finish formatting
p <- p + theme(axis.title = element_text(size = 15, color = "black", face = "bold")) +
    theme(plot.title = element_text(size = 30, face = "bold", vjust = 1)) +
    theme(axis.text = element_text(size = 13, face = "bold", color = "black")) +
    theme(legend.title = element_text(size = 12)) +
    theme(legend.text = element_text(size = 12))

# view the plot
p
And the result:

We can see that the velocity of the pitches that generated swinging strikes is directly related to how high in the zone they were thrown. This makes sense when we see which pitch types were thrown to which locations:

Those low swinging strikes were generated off of curveballs, and the higher strikes were four-seam fastballs.
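That pitch-type view is the same ggplot2 call with the colour aesthetic swapped from velocity to pitch type; here is a sketch reusing the deGrom_swing data frame from above:

```r
# same plot as before, but coloring by pitch type instead of velocity
p2 <- ggplot(deGrom_swing, aes(px, pz, color = pitch_type)) +
    geom_point(size = 10, alpha = .65) +
    scale_x_continuous(limits = c(-3, 3)) +
    scale_y_continuous(limits = c(0, 5)) +
    annotate("rect", xmin = -1, xmax = 1, ymin = 1.5, ymax = 3.5,
             color = "black", alpha = 0) +
    labs(title = "Jacob deGrom: Swinging Strikes, 5/21/2015",
         color = "Pitch Type") +
    xlab("Horizontal Location (ft.): Catcher's View") +
    ylab("Vertical Location (ft.)")

# view the plot
p2
```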
Wrapping Up
I hope this is helpful, especially to those who are new to using R and thinking about how to effectively conduct baseball research using the language. You can find all the code, images, and the openWAR 2015 data file at my GitHub repository for this post. I also have a number of public repositories that include R code for other baseball-related projects, so feel free to have a look around.
There is a lot more I could have covered, specifically inferential statistics, modeling and machine learning. If it’s useful, I might cover those packages and techniques in a follow-up post. Let me know in the comments. And feel free to suggest other packages I may have missed or should consider diving into further, as well as any code improvements.
References & Resources
- Bill Petti's GitHub repository
- Max Marchi and Jim Albert, Analyzing Baseball Data with R
- The Comprehensive R Archive Network (CRAN)
- RStudio
- Carson Sievert, pitchRx package
- Richard Scriven, retrosheet package
- Ben Baumer, Shane Jensen and Gregory Matthews, openWAR package
- Hadley Wickham, ggplot2 package
- CRAN, "Introduction to dplyr"
- CRAN, Lahman package documentation (PDF)
- CRAN, RMySQL package documentation (PDF)
- CRAN, sqldf package documentation (PDF)
- Dan Kopf, Priceonomics, “Hadley Wickham, the Man Who Revolutionized R”
- Atmajitsinh Gohil, R Data Visualization Cookbook
- Winston Chang, R Graphics Cookbook
- Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (Use R!)
- Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics
Excellent starter article for R newbies. And very appropriate mention of the Analyzing Baseball Data with R book; in fact, in addition to the packages, there are some other nicely helpful resources here and in the references. And I hadn’t heard about that edX course either. Thank you!
Can’t recommend Andy’s course enough–you won’t be disappointed.
Hadley Wickham recently added a package, Rvest, for web scraping. Used it to collect and munge some box scores from Baseball Reference and it worked great!
I’ve used rvest sparsely at this point, just because I am so used to XML, but it’s on my list to dive into as it appears to have some definite advantages.
This is fantastic! I am definitely going to get Analyzing Baseball Data with R. I have taken the edX Sabr101x twice and am looking forward to Sabr201x. Thanks for posting this.
Sure thing, glad it’s helpful.
This is one of the best articles I’ve ever read. Thank you very much, Bill!
My pleasure.
Excellent, thank you!! I did sign up for the edX course, and although I did not finish it on time, I am working my way through it as I have time. I agree that it is great and the instruction is excellent.
Thanks again for this tremendous information.
Andrea
Nice article, have really enjoyed it and can’t stress how equally important SABR101x and Analyzing baseball data with R have helped me. Just one question: I’m trying to install the openWAR package but am failing miserably. What version of R are you running?
Never mind, found out that most of the packages were incompatible with Windows so I ended up downloading Linux. Now I get to do more cool graphics and analysis. Thanks a lot for this intro!
I installed R on an older Mac book, but I can’t do it on a newer laptop with Yosemite system. I get an error message. I tried to transfer the file from the one Mac to the other, and it worked, but the file will not open on the other Mac. There is some incompatibility with the Yosemite system.
Boy, do I rue the day I installed that system. Several applications I used a lot no longer work on that system.
Posting a while after the article, but got a question: For the PITCHf/x example, you’re selecting particular columns, but I know there’s more data about vertical and horizontal movement, spin-rate, and release point. Is there a resource that says what each of those columns are and what they represent? Thanks