Webscraping for Research
Author: Brad Luff
What is Web Scraping?
Web scraping is the process of automatically collecting data from web pages without visiting them in a browser. It is useful for gathering information from many sites very quickly. Because the scraper looks for the location of the information within the webpage, rather than for a fixed value, it can also scrape pages that change daily and pick up the updated information. We will focus on scraping without any manipulation of the webpages that we visit. Manipulating a webpage while scraping is possible, but it cannot be done exclusively in R, which is why we will ignore it in this lesson.
What is HTML?
HTML stands for HyperText Markup Language. All HTML pages are built from the same kind of nested tags. As a broad generalization, a page has a header, a body, and a footer. This isn't always the case, though, and the layout is ultimately up to the developer.
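To make the idea of nested tags concrete, here is a minimal sketch that parses a tiny, made-up page held in a string and pulls values out by their position in the tag tree. The page content, the class name "stat", and the variable names are all assumptions for illustration; they are not taken from any real site.

```r
library(rvest)

# A tiny, made-up page stored as a string (hypothetical content)
page_source <- '
<html>
  <head><title>Example Report</title></head>
  <body>
    <header><span class="stat">45</span></header>
    <main><div class="snow">2</div></main>
    <footer>Updated daily</footer>
  </body>
</html>'

# read_html() accepts literal HTML as well as a URL or a file path
page <- read_html(page_source)

# Each value is located by where it sits in the tag tree (an XPath)
page_title  <- html_text(html_node(page, xpath = "//head/title"))
stat_value  <- html_text(html_node(page, xpath = '//header/span[@class="stat"]'))
footer_text <- html_text(html_node(page, xpath = "//body/footer"))

cat(page_title, stat_value, footer_text, sep = "\n")
```

If the number inside the span changes, the same XPath still finds it, which is what makes re-scraping a frequently updated page possible.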
Let's view the webpage we will use as an example:
https://www.stevenspass.com/site/mountain/reports/snow-and-weather-report/@@snow-and-weather-report
Try rvest
rvest is an R package that downloads a webpage and parses it as HTML for further manipulation.
# Load the library, installing it first if it is missing
if (!require(rvest)) {
  install.packages("rvest")
  library(rvest)
}

## Loading required package: rvest

## Warning: package 'rvest' was built under R version 3.4.4

## Loading required package: xml2

# Get the webpage
stevens <- read_html('https://www.stevenspass.com/site/mountain/reports/snow-and-weather-report/@@snow-and-weather-report')

# Get the current header information
temperature <- html_nodes(stevens, xpath = '//div/header/div/div/div/div/a/span/span[@class="header-stats-value"]/text()')
# Select only the temperature
temperature <- as.character(temperature[1])
# Strip the temperature down to the numeric digits
temperature <- gsub(" ", "", temperature)
temperature <- gsub("\n", "", temperature)
temperature <- gsub("°", "", temperature)

# Get the amount of snow that fell in the last 24 hours
snow24 <- as.character(html_nodes(stevens, xpath = '//div/div/div/div/div/main/div/div[3]/div[2]/div/div/div[1]/text()'))
# Strip the inches symbol (double prime, U+2033) from the snowfall value;
# the Unicode escape avoids depending on the script file's encoding
snow24 <- gsub("\u2033", "", snow24)

# Print the report we just scraped
cat("The temperature at Stevens Pass is", temperature, "F and in the last 24 hours there has been", snow24, "inches of snow!")

## The temperature at Stevens Pass is 45° F and in the last 24 hours there has been 2<U+2033> inches of snow!
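The clean-up steps above remove one unwanted character at a time. As an alternative, a single regular expression can drop everything that is not part of a number. The sketch below uses only base R; the helper name clean_number and the sample strings are hypothetical stand-ins shaped like the raw scraped values, not actual output from the site.

```r
# Hypothetical helper: keep only digits, a decimal point, and a minus sign,
# then convert what is left to a number
clean_number <- function(x) {
  as.numeric(gsub("[^0-9.-]", "", x))
}

# Made-up strings shaped like the raw scraped text
raw_temperature <- "\n        45\u00b0\n      "  # whitespace plus a degree sign
raw_snow <- "2\u2033"                            # value plus a double-prime (inches) sign

temperature <- clean_number(raw_temperature)
snow24 <- clean_number(raw_snow)

cat("Temperature:", temperature, "F; new snow:", snow24, "inches\n")
```

This approach returns numeric values rather than character strings, so they can be used in calculations directly. It assumes each string holds exactly one number; a string containing several numbers would have its digits run together.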