NCAA Men’s Volleyball web scrape for boxscore links using R

This blog post will be different than my previous posts. (Mainly because of the commentary…) I’ll walk through web scraping for box score links.

Box score links

Starting at I click on the drop down and select men’s volleyball to get an idea of where the data is.
The first link which appears is I need to find the pattern of the URLs so I can build my web scrape for every box score on every day of the season. If I go to the first game day of the season, Jan 2, the URL looks like this:

FYI: the number after ‘season_divisions’ in the URL, /17020/, is actually a sport ID for the year. That ID changes every year!

On to the web scrape.

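library(rvest)      # read_html(), html_nodes(), html_attr()
library(tidyverse)  # pipes, tibbles, filter(), mutate(), map()

page <- ''
page <- read_html(page)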

Loading up ‘rvest’ for the web scrape and ‘tidyverse’ to data wrangle this html code. I name the url ‘page’.

Now that my webpage data is named page, I can browse through the different objects I can scrape!
We want the ‘Box Score’ URL from each page. If you inspect the box score links, each one has a class named ‘skipMask’. Luckily, rvest makes it easy to scrape exactly that!

page %>% 
But this brings back a lot of stuff that isn’t really necessary for our task, so we need to break it down a bit more.
page %>% 
    html_nodes('.skipMask')
This still is not returning what we need to get the box scores.
page %>% 
    html_nodes('.skipMask') %>% 
    html_attr("href")
Better. Now we can work with this as if it were a tibble:
page %>% 
    html_nodes('.skipMask') %>%
    html_attr("href") %>%
    as_tibble() %>%
    filter(grepl('box_score', value)) %>%
    mutate(value = paste0('', value))
This will give exactly what is needed!

After putting the ‘hrefs’ into a tibble, I filter the ‘value’ column for the ‘box_score’ text. Then I mutate, pasting the site’s base URL in front of each href. Boom! All the box score links from one day in a tibble (which you can coerce to a data frame).
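If you want to see the select, filter, and paste steps work without hitting the real site, here is a minimal sketch on a made-up page. The HTML, the hrefs, and the `base_url` value are all placeholders I invented for illustration; only the ‘skipMask’ class and the ‘box_score’ text come from the walkthrough above.

```r
library(rvest)
library(dplyr)
library(tibble)

# A tiny made-up page standing in for one day's scoreboard (not real markup)
toy_page <- read_html('
  <html><body>
    <a class="skipMask" href="/teams/1">Team A</a>
    <a class="skipMask" href="/contests/100/box_score">Box Score</a>
    <a class="skipMask" href="/contests/101/box_score">Box Score</a>
    <a href="/other">No class on this one</a>
  </body></html>')

base_url <- "https://example.com"  # hypothetical base URL

# Pull the href of every element with class "skipMask"
hrefs <- toy_page %>%
  html_nodes(".skipMask") %>%
  html_attr("href")

# Keep only the box score links and prepend the base URL
links <- tibble(value = hrefs) %>%
  filter(grepl("box_score", value)) %>%
  mutate(value = paste0(base_url, value))

print(links)
```

The team link is dropped by the filter, and the link without the class never gets selected in the first place.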

Now I want to get each box score URL for every single day of the season… Looping through every day starting Jan 2, 2020.

ncaa <- tibble(
    dates = seq(as.Date('01/02/2020', format = '%m/%d/%Y'),
                as.Date('05/15/2020', format = '%m/%d/%Y'), by = "day")) %>%
    mutate(page = paste0('', dates, '&conference_id=0&tournament_id=&commit=Submit'))
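One thing worth knowing about this step: paste0() turns a Date into text like 2020-01-02, not 01/02/2020. I have not checked which form the site expects, so the sketch below just shows both side by side. The `base_url` value is a placeholder I made up, and the date range is shortened to four days.

```r
library(tibble)
library(dplyr)

base_url <- "https://example.com/scoreboard?game_date="  # hypothetical URL prefix

schedule <- tibble(
  dates = seq(as.Date("2020-01-02"), as.Date("2020-01-05"), by = "day")
) %>%
  mutate(
    # paste0() converts the Date with as.character(), giving YYYY-MM-DD
    iso_url = paste0(base_url, dates),
    # format() lets you pick the layout explicitly, here MM/DD/YYYY
    us_url  = paste0(base_url, format(dates, "%m/%d/%Y"))
  )

print(schedule)
```

If the scrape comes back empty for days you know had games, the date format in the URL is one of the first things I would check.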

Take the web scrape code and put it in a function:

bx_score <- function(page) {
  pages <- read_html(page)
  pages %>% 
    html_nodes('.skipMask') %>%
    html_attr("href") %>%
    as_tibble() %>%
    filter(grepl('box_score', value)) %>%
    mutate(value = paste0('', value))
}

Now use map and bind_rows to wrap the ncaa dates and the function together and return the box score links!

ncaa_box_score_links <- bind_rows(map(ncaa$page, bx_score))
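A loop over a whole season makes a lot of requests, and a single failed page will stop map() with an error. One option (not something the code above does) is to wrap the function with purrr's possibly() so a bad day returns an empty tibble instead. The sketch below uses a stand-in function, `fake_bx_score`, and made-up page names so it runs without touching the network.

```r
library(purrr)
library(dplyr)
library(tibble)

# Stand-in for bx_score(): pretends one "page" cannot be read
fake_bx_score <- function(page) {
  if (page == "bad-day") stop("could not read page")
  tibble(value = paste0(page, "/box_score"))
}

# possibly() returns `otherwise` whenever the wrapped function errors
safe_bx_score <- possibly(fake_bx_score,
                          otherwise = tibble(value = character()))

pages <- c("day-1", "bad-day", "day-2")  # made-up inputs

all_links <- bind_rows(map(pages, safe_bx_score))

print(all_links)
```

With the real function you would wrap bx_score the same way. Adding a short Sys.sleep() inside the function is also a common courtesy to the server.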

Excellent! You have all the box score links!
