Research Notebook: Acquiring NCAA Baseball Stats with R

It's true, you too can pull college stats into R! (via Minda Haas Kuhlmann)

It’s true, you too can pull college stats into R! (via Minda Haas Kuhlmann)

In the latest release of my baseballr package I’ve included a function that makes it easy for a user to pull data from the NCAA’s website for baseball teams across the three major divisions (I, II, III).

The function, ncaa_scrape, requires the user to pass the function three parameters:

  • school_id: Numerical code used by the NCAA for each school
  • year: A four-digit year
  • type: Whether to pull data for batters or pitchers

For the latter two, the inputs are easy. The issue for most will be knowing what the school_id is that the NCAA website uses for the school they are interested in. To help, I decided to include a massive lookup table in the package so a user could easily identify the necessary school_id.

I thought it would be helpful to walk through how I built that file and then show how to use the ncaa_scrape function to start acquiring actual statistics.

Building a lookup table for all NCAA schools

Let’s load the following R packages:

require(baseballr)
require(dplyr)
require(rvest)
require(xml2)

 
The first thing we need to do is build a function that allows us to obtain the numerical codes that the NCAA uses for each of the conferences across their three divisions.

If you look at the url structure used at the site, there is a parameter–conf_id–that allows you to view conferences and the teams associated with each conference individually:

http://stats.ncaa.org/team/inst_team_list?academic_year=2016&conf_id=-1&division=1&sport_code=MBA

When set to -1 the site will show all schools in a given division. The only list I could find that details each conference is a javascript drop down that becomes available when you hover over “ALL TEAMS.”

screen-shot-2016-11-21-at-4.12.48-pm

Let’s write a function to grab the urls and url names for all conferences from the site:

conference_table <- function(year, div) {
  url <- paste0("http://stats.ncaa.org/team/inst_team_list?academic_year=", year, "&conf_id=-1&division=", div, "&sport_code=MBA")
  read <- read_html(url)
  links <- html_nodes(read, "a") %>%
    html_attr("href")
  link_names <- html_nodes(read, "a") %>%
    html_text()
  table <- as.data.frame(cbind(link_names, links))
  names(table) <- c("conference", "links")
  table$conference <- as.character(table$conference)
  links_conferences <- table %>%
    filter(grepl("changeConference", links))
  conference_ids <- sub("\\).*", "", sub(".*\\(", "", links_conferences$links))
  conference_ids <- as.data.frame(conference_ids)
  names(conference_ids) <- "conference_id"
  table <- cbind(links_conferences, conference_ids)
  table <- table %>%
    mutate(year = year, division = div, conference_id = as.numeric(as.character(conference_id))) %>%
    select(year, division, everything(), -links) %>% .[-1,]
  table
}

 
The function needs the user to pass two values: a four-digit year and the division they are interested in (either 1, 2, or 3).

The function takes those inputs and constructs the url based on the inputs. After the html is read and parsed we can extract the names of the conferences and then the links attached to each conference (link_names and links, respectively, in the function).

Now, this is pulling all the links available on the page, so the next step is to combine these two sets of data into a single table and then filter out irrelevant links. Links that include “changeConference” will have the numerical conference codes that we are looking for. After a little string manipulation we will be left with a table that looks like this (for division 1 in 2016):

screen-shot-2016-11-21-at-4.40.56-pm

Now, in some cases the composition of conferences can change from prior years. In order to account for this we can loop over each year and division combination to ensure that we know which schools were members of which conference each year.

A Hardball Times Update
Goodbye for now.

To do that we construct a table with all year/division combinations that we are interested in:

year <- c(2013, 2014, 2015, 2016)
division <- c(1,2,3)
div_yr <- expand.grid(year, division)

 

Then, we group the div_yr table by year and division (Var1 and Var2 respectively) and use the do function to loop over all the possible combinations and combine the results into a single table:

conference_code_lu <- div_yr %>% 
  group_by(Var1, Var2) %>%
  do(conference_table(.$Var1, .$Var2))

 

The result will look like this:

screen-shot-2016-11-21-at-4.47.38-pm

Next, we need another function that will scrape the teams and the team urls within each conference. You’ll notice it looks very similar to the conference_table function:

teams <- function(year, conference, div) {
    url <- paste0("http://stats.ncaa.org/team/inst_team_list?academic_year=", year, "&conf_id=", conference, "&division=", div, "&sport_code=MBA")
    read <- read_html(url)
    links <- html_nodes(read, "a") %>%
      html_attr("href")
    link_names <- html_nodes(read, "a") %>%
      html_text()
    table <- as.data.frame(cbind(link_names, links))
    table$links <- as.character(table$links)
    table$link_names <- as.character(table$link_names)
    table <- table %>%
      filter(grepl("team", links)) %>%
      filter(!grepl("inst_team", links)) %>%
      filter(!grepl("schedule", links))
    table$links <- gsub("/team/", "", table$links)
    table$links <- sub("/.*", "", table$links)
    table$year <- year
    table$division <- div
    table$conference_id <- conference
    names(table) <- c("school", "school_id", "year", "division", "conference_id")
    table
}

 

The major difference is this function also takes the conference_id we just acquired and customizes the url so that we are only scraping the available data for that conference. That allows us to ensure that teams are associated with the right conferences in a given year.

Armed with this function, we can now loop over our conference_code_lu table and apply the new function to all the possible values:

master_ncaa_team_lu <- conference_code_lu %>% 
  group_by(year, division, conference, conference_id) %>% 
  do(teams(.$year, .$conference_id, .$division)) %>%
  ungroup() %>%
  select(school, conference, everything()) %>%
  mutate(school_id = as.numeric(school_id)) %>%
  arrange(school)

 

The result is a tidy lookup table that includes each school, their conference, both their school and conference IDs, as well as the year and the division:

screen-shot-2016-11-21-at-4.53.14-pm

Acquiring NCAA Team Statistics

Now that we have our lookup table, acquiring actual NCAA baseball statistics is a piece of cake.

Let’s say we want all batter statistics for Vanderbilt from the 2015 season. To acquire the data you simply feed the ncaa_scrape function the relevant information:

ncaa_scrape(736, 2015, "batting") 

 

Want statistics for Vanderbilt’s pitchers? No problem, just change the type parameter to “pitching”:

> ncaa_scrape(736, 2015, "batting") %>%
+     select(year:SlgPct)
   year     school   conference division Jersey            Player Yr Pos GP GS    BA OBPct
1  2015 Vanderbilt Southeastern        1     43      Wiel, Zander Jr INF 72 71 0.316 0.406
2  2015 Vanderbilt Southeastern        1     20   Reynolds, Bryan So  OF 72 71 0.318 0.388
3  2015 Vanderbilt Southeastern        1     10      Toffey, Will Fr INF 71 67 0.294 0.380
4  2015 Vanderbilt Southeastern        1      8    Wiseman, Rhett Jr  OF 71 71 0.317 0.415
5  2015 Vanderbilt Southeastern        1      7   Swanson, Dansby Jr INF 71 71 0.335 0.423
6  2015 Vanderbilt Southeastern        1      1       Coleman, Ro So INF 66 53 0.295 0.402
7  2015 Vanderbilt Southeastern        1      2   Campbell, Tyler Jr INF 63 61 0.229 0.297
8  2015 Vanderbilt Southeastern        1      3    Kendall, Jeren Fr  OF 60 56 0.281 0.394
9  2015 Vanderbilt Southeastern        1     25     Ellison, Karl So   C 45 43 0.215 0.291
10 2015 Vanderbilt Southeastern        1     18    Rodgers, Nolan So  OF 44 17 0.247 0.396
11 2015 Vanderbilt Southeastern        1      5      Delay, Jason So   C 37 29 0.283 0.373
12 2015 Vanderbilt Southeastern        1     44      Wright, Kyle Fr   P 32  6    NA    NA
13 2015 Vanderbilt Southeastern        1     22   Pfeifer, Philip Jr   P 29 10    NA    NA
14 2015 Vanderbilt Southeastern        1     16      Murfee, Penn Fr INF 29 16 0.256 0.330
15 2015 Vanderbilt Southeastern        1     15    Fulmer, Carson Jr   P 26 25    NA    NA
16 2015 Vanderbilt Southeastern        1     35       Bowden, Ben So   P 26  0    NA    NA
17 2015 Vanderbilt Southeastern        1     24 Sheffield, Jordan Fr   P 22  6    NA    NA
18 2015 Vanderbilt Southeastern        1     21 Kilichowski, John So   P 21 14    NA    NA
19 2015 Vanderbilt Southeastern        1     13   Buehler, Walker Jr   P 18 18    NA    NA
20 2015 Vanderbilt Southeastern        1     55      Green, Tyler Fr INF 17  3 0.238 0.304
21 2015 Vanderbilt Southeastern        1     11      Sabino, Liam Fr INF 17  3 0.240 0.231
22 2015 Vanderbilt Southeastern        1     39       Smith, Kyle Jr  OF 16  0 0.062 0.211
23 2015 Vanderbilt Southeastern        1     40    Snider, Collin Fr   P 16  0    NA    NA
24 2015 Vanderbilt Southeastern        1     31     Johnson, Ryan Fr   P 15  5    NA    NA
25 2015 Vanderbilt Southeastern        1     45   Ferguson, Tyler Jr   P 15  3    NA    NA
26 2015 Vanderbilt Southeastern        1     36   McCarty, Aubrey Fr INF 10  0 0.167 0.167
27 2015 Vanderbilt Southeastern        1      6     Parker, Drake Fr  OF  8  0 0.200 0.200
28 2015 Vanderbilt Southeastern        1     28  Ruppenthal, Matt Fr   P  8  0    NA    NA
29 2015 Vanderbilt Southeastern        1     19       Mundy, Joey Fr INF  7  0 0.333 0.429
30 2015 Vanderbilt Southeastern        1     33    Chari, Tristan Fr   C  6  0 0.500 0.600
31 2015 Vanderbilt Southeastern        1     23     Abraham, Joey Fr   P  6  0    NA    NA
32 2015 Vanderbilt Southeastern        1      -            Totals  -   -  -  - 0.290 0.379
33 2015 Vanderbilt Southeastern        1      -   Opponent Totals  -   -  -  - 0.216 0.314
   SlgPct
1   0.571
2   0.462
3   0.420
4   0.566
5   0.623
6   0.394
7   0.286
8   0.530
9   0.282
10  0.288
11  0.394
12     NA
13     NA
14  0.385
15     NA
16     NA
17     NA
18     NA
19     NA
20  0.238
21  0.320
22  0.125
23     NA
24     NA
25     NA
26  0.167
27  0.200
28     NA
29  0.333
30  0.500
31     NA
32  0.455
33  0.300

 

Acquiring the pitching data is just as easy:

> ncaa_scrape(736, 2015, "pitching") %>%
+     select(year:ERA)
   year     school   conference division Jersey             Player Yr Pos GP App GS  ERA
1  2015 Vanderbilt Southeastern        1     43       Wiel, Zander Jr INF 72  NA NA   NA
2  2015 Vanderbilt Southeastern        1     20    Reynolds, Bryan So  OF 72  NA NA   NA
3  2015 Vanderbilt Southeastern        1     10       Toffey, Will Fr INF 71  NA NA   NA
4  2015 Vanderbilt Southeastern        1      8     Wiseman, Rhett Jr  OF 71  NA NA   NA
5  2015 Vanderbilt Southeastern        1      7    Swanson, Dansby Jr INF 71  NA NA   NA
6  2015 Vanderbilt Southeastern        1      1        Coleman, Ro So INF 66  NA NA   NA
7  2015 Vanderbilt Southeastern        1      2    Campbell, Tyler Jr INF 63  NA NA   NA
8  2015 Vanderbilt Southeastern        1      3     Kendall, Jeren Fr  OF 60  NA NA   NA
9  2015 Vanderbilt Southeastern        1     25      Ellison, Karl So   C 45  NA NA   NA
10 2015 Vanderbilt Southeastern        1     18     Rodgers, Nolan So  OF 44  NA NA   NA
11 2015 Vanderbilt Southeastern        1      5       Delay, Jason So   C 37  NA NA   NA
12 2015 Vanderbilt Southeastern        1     44       Wright, Kyle Fr   P 32  29  3 1.23
13 2015 Vanderbilt Southeastern        1     22    Pfeifer, Philip Jr   P 29  27  9 3.55
14 2015 Vanderbilt Southeastern        1     16       Murfee, Penn Fr INF 29  NA NA   NA
15 2015 Vanderbilt Southeastern        1     15     Fulmer, Carson Jr   P 26  19 19 1.83
16 2015 Vanderbilt Southeastern        1     35        Bowden, Ben So   P 26  26 NA 2.89
17 2015 Vanderbilt Southeastern        1     24  Sheffield, Jordan Fr   P 22  22  6 2.85
18 2015 Vanderbilt Southeastern        1     21  Kilichowski, John So   P 21  17 10 2.83
19 2015 Vanderbilt Southeastern        1     13    Buehler, Walker Jr   P 18  16 16 2.95
20 2015 Vanderbilt Southeastern        1     55       Green, Tyler Fr INF 17  NA NA   NA
21 2015 Vanderbilt Southeastern        1     11       Sabino, Liam Fr INF 17  NA NA   NA
22 2015 Vanderbilt Southeastern        1     39        Smith, Kyle Jr  OF 16  NA NA   NA
23 2015 Vanderbilt Southeastern        1     40     Snider, Collin Fr   P 16  16 NA 2.41
24 2015 Vanderbilt Southeastern        1     31      Johnson, Ryan Fr   P 15  15  5 3.02
25 2015 Vanderbilt Southeastern        1     45    Ferguson, Tyler Jr   P 15  15  3 6.30
26 2015 Vanderbilt Southeastern        1     36    McCarty, Aubrey Fr INF 10   5 NA 3.00
27 2015 Vanderbilt Southeastern        1      6      Parker, Drake Fr  OF  8  NA NA   NA
28 2015 Vanderbilt Southeastern        1     28   Ruppenthal, Matt Fr   P  8   8 NA 2.70
29 2015 Vanderbilt Southeastern        1     19        Mundy, Joey Fr INF  7  NA NA   NA
30 2015 Vanderbilt Southeastern        1     33     Chari, Tristan Fr   C  6  NA NA   NA
31 2015 Vanderbilt Southeastern        1     23      Abraham, Joey Fr   P  6   6 NA 7.94
32 2015 Vanderbilt Southeastern        1     32      Stone, Hayden So   P  3   3  1 9.64
33 2015 Vanderbilt Southeastern        1     34 Spagnuolo, Brendan Fr   P  1   1 NA 9.00
34 2015 Vanderbilt Southeastern        1      -             Totals  -   -  -  NA NA 2.84
35 2015 Vanderbilt Southeastern        1      -    Opponent Totals  -   -  -  NA NA 5.75

 

You can also acquire data from the same school across multiple years. You need to use both the ncaa_scrape function as well as the dplyr package.

You first create a table with each year you want and the school_id for the school. This can be built with the expand.grid function. Then, that table is grouped by the two relevant variables and fed into the do function. This will take the ncaa_scrape function and apply it to all relevant combinations of year and school_id found in the table you just made with expand.grid:

year <- c(2014, 2015, 2016)
school_id <- 736
division <- 1
v_table <- expand.grid(school_id, year)
v_table %>%
  group_by(Var1, Var2) %>%
  do(ncaa_scrape(.$Var1, .$Var2, "pitching"))
|=================================================================|100% ~0 s remaining     Source: local data frame [100 x 43]
Groups: Var1, Var2 [3]

    Var1  Var2  year     school   conference division Jersey          Player    Yr   Pos
   <dbl> <dbl> <dbl>      <chr>        <chr>    <dbl>  <chr>           <chr> <chr> <chr>
1    736  2014  2014 Vanderbilt Southeastern        1      3    Conde, Vince    Jr   INF
2    736  2014  2014 Vanderbilt Southeastern        1     20 Reynolds, Bryan    Fr    OF
3    736  2014  2014 Vanderbilt Southeastern        1      7 Swanson, Dansby    So   INF
4    736  2014  2014 Vanderbilt Southeastern        1     10   Norwood, John    Jr    OF
5    736  2014  2014 Vanderbilt Southeastern        1     43    Wiel, Zander    So   INF
6    736  2014  2014 Vanderbilt Southeastern        1      8  Wiseman, Rhett    So    OF
7    736  2014  2014 Vanderbilt Southeastern        1      9  Turner, Xavier    So   INF
8    736  2014  2014 Vanderbilt Southeastern        1     18  Rodgers, Nolan    Fr   INF
9    736  2014  2014 Vanderbilt Southeastern        1      1     Coleman, Ro    Fr   INF
10   736  2014  2014 Vanderbilt Southeastern        1      5    Delay, Jason    Fr     C
# ... with 90 more rows, and 33 more variables: GP <chr>, App <int>, GS <int>, ERA <dbl>,
#   IP <dbl>, H <int>, R <int>, ER <int>, BB <int>, SO <int>, SHO <int>, BF <chr>,
#   `P-OAB` <chr>, `2B-A` <int>, `3B-A` <int>, Bk <int>, `HR-A` <int>, WP <int>, HB <int>,
#   IBB <int>, `Inh Run` <int>, `Inh Run Score` <int>, SHA <int>, SFA <int>,
#   Pitches <chr>, GO <int>, FO <int>, W <int>, L <int>, SV <int>, KL <int>, teamid <dbl>,
#   conference_id <dbl>

 

The end result is a dataframe with all three years of data combined together.

You could also include a series of school_ids in that table that you feed into the do(ncaa_scrape) function if you wanted to pull multiple schools at once. The example below will provide data for both Florida and Vanderbilt from 2014-2016:

year <- c(2014, 2015, 2016)
school_id <- c(235, 736)
v_table <- expand.grid(school_id, year)
v_table %>%
  group_by(Var1, Var2) %>%
  do(ncaa_scrape(.$Var1, .$Var2, "pitching"))
|=================================================================|100% ~0 s remaining     Source: local data frame [197 x 43]
Groups: Var1, Var2 [6]

    Var1  Var2  year  school   conference division Jersey          Player    Yr   Pos
   <dbl> <dbl> <dbl>   <chr>        <chr>    <dbl>  <chr>           <chr> <chr> <chr>
1    235  2014  2014 Florida Southeastern        1     12  Martin, Richie    So   INF
2    235  2014  2014 Florida Southeastern        1     17  Gushue, Taylor    Jr     C
3    235  2014  2014 Florida Southeastern        1      2  Turgeon, Casey    Jr   INF
4    235  2014  2014 Florida Southeastern        1     23     Reed, Buddy    Fr    OF
5    235  2014  2014 Florida Southeastern        1     20   Alonso, Peter    Fr   INF
6    235  2014  2014 Florida Southeastern        1     33 Mattson, Braden    Jr     C
7    235  2014  2014 Florida Southeastern        1      5    Powers, Zack    Jr   INF
8    235  2014  2014 Florida Southeastern        1     66    Larson, Ryan    Fr    OF
9    235  2014  2014 Florida Southeastern        1     16  Shafer, Justin    Jr    OF
10   235  2014  2014 Florida Southeastern        1     11    Tobias, Josh    Jr   INF
# ... with 187 more rows, and 33 more variables: GP <chr>, App <int>, GS <int>, ERA <dbl>,
#   IP <dbl>, H <int>, R <int>, ER <int>, BB <int>, SO <int>, SHO <int>, BF <chr>,
#   `P-OAB` <chr>, `2B-A` <int>, `3B-A` <int>, Bk <int>, `HR-A` <int>, WP <int>, HB <int>,
#   IBB <int>, `Inh Run` <int>, `Inh Run Score` <int>, SHA <int>, SFA <int>,
#   Pitches <chr>, GO <int>, FO <int>, W <int>, L <int>, SV <int>, KL <int>, teamid <dbl>,
#   conference_id <dbl>

 

References & Resources


Bill leads Predictive Modeling and Data Science consulting at Gallup. In his free time, he writes for The Hardball Times, speaks about baseball research and analytics, has consulted for a Major League Baseball team, and has appeared on MLB Network's Clubhouse Confidential as well as several MLB-produced documentaries. He is also the creator of the baseballr package for the R programming language. Along with Jeff Zimmerman, he won the 2013 SABR Analytics Research Award for Contemporary Analysis. Follow him on Twitter @BillPetti.
3 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Graham
7 years ago

How would you scrape all the teams at once?

Graham
7 years ago
Reply to  Graham

I often get errors when doing more than 2 teams at once

eZ
7 years ago

Quick comment on R: the dplyr function slice() allows you to select rows of a data frame similar to select() with columns. For instance, slice(-1) removes the first row.