NCAA Men’s Volleyball web scrape for boxscore links using R

This blog post will be different than my previous posts. (Mainly because of the commentary…) I’ll walk through web scraping https://stats.ncaa.org/ for box score links.

Box score links

Starting at https://stats.ncaa.org/scoreboards, I click on the drop-down and select men’s volleyball to get an idea of where the data is.
The first link that appears is http://stats.ncaa.org/contests/scoreboards?utf8=%E2%9C%93&sport_code=MVB&academic_year=&division=&game_date=&commit=Submit. I need to find the pattern in the URLs so I can build a web scrape that covers every day of the season and picks up each box score. If I go to the first game day of the season, Jan 2, the URL looks like this:

http://stats.ncaa.org/season_divisions/17020/scoreboards?utf8=%E2%9C%93&season_division_id=&game_date=01%2F02%2F2020&conference_id=0&tournament_id=&commit=Submit

FYI: the number after ‘season_divisions’ in the URL, /17020/, is actually a sport ID for the year. That ID changes every year!
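Since the season ID and the date are the only parts of the URL that change, the pattern can be captured in a small helper. This is only a minimal sketch: `scoreboard_url` is a hypothetical name of my own, and it assumes the rest of the query string stays exactly as it appears in the Jan 2 URL used in this post.

```r
# Hypothetical helper: build a scoreboard URL from a season ID and a date.
# Assumes the query string stays the same apart from game_date.
scoreboard_url <- function(season_id, game_date) {
  # '%%2F' yields a literal '%2F', the URL-encoded '/'
  encoded_date <- format(as.Date(game_date), '%m%%2F%d%%2F%Y')
  paste0('https://stats.ncaa.org/season_divisions/', season_id,
         '/scoreboards?utf8=%E2%9C%93&season_division_id=&game_date=',
         encoded_date,
         '&conference_id=0&tournament_id=&commit=Submit')
}

scoreboard_url(17020, '2020-01-02')
```

For another season, only the first argument should need to change, provided the site keeps the same URL layout.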

On to the web scrape.

library(rvest)
library(tidyverse)
page <- 'http://stats.ncaa.org/season_divisions/17020/scoreboards?utf8=%E2%9C%93&season_division_id=&game_date=01%2F02%2F2020&conference_id=0&tournament_id=&commit=Submit'
page <- read_html(page)

I load ‘rvest’ for the web scrape and ‘tidyverse’ to wrangle the HTML. I name the URL ‘page’, then overwrite it with the parsed HTML that read_html() returns.

Now that my webpage data is named page, I can browse through the different objects I can scrape!
We want the ‘Box Score’ URL from each page. If you inspect the box score links, each one has a class named ‘skipMask’. Luckily, rvest makes it easy to scrape exactly that!

page %>% 
    html_nodes('.skipMask')
But this brings back a lot of stuff that isn’t really necessary for our task, so we need to break it down a bit more.
page %>% 
    html_nodes('.skipMask') %>% 
    html_attr("href") 
This still is not returning what we need to get the box scores.
page %>% 
    html_nodes('.skipMask') %>% 
    html_attr("href") %>%
    as_tibble() 
Better. Now we can work with this as a tibble.
page %>% 
    html_nodes('.skipMask') %>%
    html_attr("href") %>%
    as_tibble() %>%
    filter(grepl('box_score', value)) %>%
    mutate(value = paste0('https://stats.ncaa.org', value))
This will give exactly what is needed!

After putting the hrefs into a tibble, I filter the ‘value’ column for the text ‘box_score’. Then I mutate, pasting ‘https://stats.ncaa.org’ in front of each href. Boom! All the box scores from one day in a tibble (which you can coerce to a data frame).
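To see what each step of that pipeline does without hitting the site, here is a small sketch that runs on a made-up HTML snippet. The link paths are invented for illustration and may not match the real NCAA markup; only the ‘skipMask’ class and the ‘box_score’ text come from the steps above.

```r
library(rvest)

# Made-up HTML standing in for a scoreboard page (paths are illustrative only)
html <- read_html('
  <html><body>
    <a class="skipMask" href="/teams/111">Team A</a>
    <a class="skipMask" href="/contests/222/box_score">Box Score</a>
    <a class="other" href="/contests/333/box_score">Not skipMask</a>
  </body></html>')

# Every href on elements with class "skipMask"
hrefs <- html %>%
  html_nodes('.skipMask') %>%
  html_attr('href')

# Keep the box score links and prepend the site root
box_links <- paste0('https://stats.ncaa.org', hrefs[grepl('box_score', hrefs)])
box_links
```

The team link is dropped by the ‘box_score’ filter, and the third link never makes it past the class selector.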

Now I want to get each box score URL for every single day of the season… Looping through every day starting Jan 2, 2020.

ncaa <- tibble(
    dates = seq(as.Date('01/02/2020', format = '%m/%d/%Y'), 
                as.Date('05/15/2020', format = '%m/%d/%Y'), by = "day")) %>%
    # format the date as MM%2FDD%2FYYYY so it matches the game_date in the URL above
    mutate(page = paste0(
        'https://stats.ncaa.org/season_divisions/17020/scoreboards?utf8=%E2%9C%93&season_division_id=&game_date=',
        format(dates, '%m%%2F%d%%2F%Y'),
        '&conference_id=0&tournament_id=&commit=Submit'))
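One thing worth checking before scraping is how the dates end up looking inside the URL, since the Jan 2 URL in this post carries game_date as 01%2F02%2F2020. The sketch below is a sanity check in base R on the date sequence alone. It makes no request, and whether the site also accepts other date formats is not something verified here.

```r
dates <- seq(as.Date('01/02/2020', format = '%m/%d/%Y'),
             as.Date('05/15/2020', format = '%m/%d/%Y'), by = 'day')

length(dates)  # number of scoreboard pages that will be requested

# A Date pasted into a string is rendered as YYYY-MM-DD
iso_style <- paste0('game_date=', dates[1])

# format() can produce the MM%2FDD%2FYYYY form ('%%' is a literal '%')
encoded_style <- paste0('game_date=', format(dates[1], '%m%%2F%d%%2F%Y'))

iso_style
encoded_style
```

Printing one or two of the generated URLs and opening them in a browser is a cheap way to confirm the pages load before looping over the whole season.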

Take the web scrape code and put it in a function:

bx_score <- function(page) {
  pages <- read_html(page)
  pages %>% 
    html_nodes('.skipMask') %>%
    html_attr("href") %>%
    as_tibble() %>%
    filter(grepl('box_score', value)) %>%
    mutate(value = paste0('https://stats.ncaa.org', value))
}

Now use map and bind_rows to wrap the ncaa pages and the function together and return the box score links!

ncaa_box_score_links <- bind_rows(map(ncaa$page, bx_score))

Excellent! You have all the box score links!
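With over a hundred pages to request, a single failed request would stop the whole map call. Here is a hedged sketch of one way to guard against that: wrap the scraper with purrr's possibly() so a failing page returns an empty tibble, and pause between requests. `scrape_all` and the one-second pause are my own assumptions rather than part of the workflow above, and `fake_scraper` is a stand-in so the example runs without touching the site.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical wrapper: apply scrape_fn to each URL, skipping pages that error
scrape_all <- function(urls, scrape_fn, pause = 1) {
  safe_fn <- possibly(scrape_fn, otherwise = tibble(value = character()))
  map(urls, function(u) {
    Sys.sleep(pause)  # short pause between requests
    safe_fn(u)
  }) %>%
    bind_rows()
}

# Stand-in scraper for demonstration; fails on one input like a bad request would
fake_scraper <- function(u) {
  if (u == 'bad') stop('request failed')
  tibble(value = paste0(u, '/box_score'))
}

links <- scrape_all(c('a', 'bad', 'b'), fake_scraper, pause = 0)
links
```

With the real function, the call would look like scrape_all(ncaa$page, bx_score).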
