6 Collecting unstructured data from the Web

Assuming that we installed rvest in the previous chapter, we set up our environment by loading the following packages:

library(tidyverse)
library(rvest)

Web pages are written in several languages that web browsers can read and understand. When scraping web pages, we deal with their code. This code is typically written in three languages: Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript.

  • HTML code defines the structure and the content of a web page.
  • CSS code customizes the style and look of a page.
  • JavaScript makes a page dynamic and interactive.

In this Section we’ll focus on how to use R to scrape the static parts of a web page that are written in HTML and CSS.

At the end of this Section you can find a brief optional introduction on how to scrape dynamic web pages.

Unlike R, HTML is not a programming language. Instead, it is called a markup language — it describes the content and structure of a web page. HTML is organized using tags, which are surrounded by <> symbols. Different tags perform different functions. Together, many tags form the content of a web page.

When we scrape a web page, we download its HTML, CSS, and JavaScript code. Hence, in order to extract any useful information from a web page, we need to know its HTML structure and target the specific HTML elements and CSS selectors that we care about.
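To make the idea of tags concrete, note that read_html() also accepts a literal HTML string, which lets us experiment without visiting a real web page. The snippet below is a minimal sketch; the class names quote-box, text, and author are made up for illustration:

```r
library(rvest)

# A tiny hand-written page: tags define the structure, and
# class attributes let us select elements with CSS selectors
page = read_html('
  <div class="quote-box">
    <p class="text">To be or not to be.</p>
    <p class="author">William Shakespeare</p>
  </div>')

page %>% html_nodes(".text") %>% html_text()
## [1] "To be or not to be."
```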

6.1 Scraping random quotes

Let’s start with something very simple: scraping some quotes from http://quotes.toscrape.com/

First, we need to download the contents of the page that we are interested in. The function read_html() from the package rvest does exactly that: it reads a web page from a given URL.

s = read_html("http://quotes.toscrape.com/")
s
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <div class="container">\n        <div class="row header-box"> ...

Instead of manually exploring HTML tags and trying to identify how to parse valuable information, we can use the Chrome extension SelectorGadget.


In our example, by clicking on the quote text of the http://quotes.toscrape.com/ web page, we can identify that the quotes can be selected with the CSS selector “.text”.

Once we have the relevant selector, we can use the function html_nodes() from the package rvest to extract all the information stored within the nodes that we selected. (Note that the function html_nodes() works particularly well with the SelectorGadget extension.)

s %>% html_nodes(".text")
## {xml_nodeset (10)}
##  [1] <span class="text" itemprop="text">“The world as we have created it is a ...
##  [2] <span class="text" itemprop="text">“It is our choices, Harry, that show  ...
##  [3] <span class="text" itemprop="text">“There are only two ways to live your ...
##  [4] <span class="text" itemprop="text">“The person, be it gentleman or lady, ...
##  [5] <span class="text" itemprop="text">“Imperfection is beauty, madness is g ...
##  [6] <span class="text" itemprop="text">“Try not to become a man of success.  ...
##  [7] <span class="text" itemprop="text">“It is better to be hated for what yo ...
##  [8] <span class="text" itemprop="text">“I have not failed. I've just found 1 ...
##  [9] <span class="text" itemprop="text">“A woman is like a tea bag; you never ...
## [10] <span class="text" itemprop="text">“A day without sunshine is like, you  ...

This is nice: we now have all the quotes that we were interested in. However, we only care about the actual quote text, not the accompanying HTML tags. Thankfully, the rvest package offers the function html_text(), which extracts the text out of the HTML tags:

s1 = s %>% html_nodes(".text") %>% html_text()
s1
##  [1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"                
##  [2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"                                              
##  [3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
##  [4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"                           
##  [5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"                    
##  [6] "“Try not to become a man of success. Rather become a man of value.”"                                                                
##  [7] "“It is better to be hated for what you are than to be loved for what you are not.”"                                                 
##  [8] "“I have not failed. I've just found 10,000 ways that won't work.”"                                                                  
##  [9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"                                              
## [10] "“A day without sunshine is like, you know, night.”"

The result of this process is not a tibble, but instead a character vector:

class(s1)
## [1] "character"

We can transform a vector into a single-column tibble with the function as_tibble_col:

s1 = s %>% html_nodes(".text") %>%  html_text() %>% as_tibble_col("quote")
s1
## # A tibble: 10 × 1
##    quote                                                                        
##    <chr>                                                                        
##  1 “The world as we have created it is a process of our thinking. It cannot be …
##  2 “It is our choices, Harry, that show what we truly are, far more than our ab…
##  3 “There are only two ways to live your life. One is as though nothing is a mi…
##  4 “The person, be it gentleman or lady, who has not pleasure in a good novel, …
##  5 “Imperfection is beauty, madness is genius and it's better to be absolutely …
##  6 “Try not to become a man of success. Rather become a man of value.”          
##  7 “It is better to be hated for what you are than to be loved for what you are…
##  8 “I have not failed. I've just found 10,000 ways that won't work.”            
##  9 “A woman is like a tea bag; you never know how strong it is until it's in ho…
## 10 “A day without sunshine is like, you know, night.”

Inside the function as_tibble_col, we can provide the column name that we want our single-column tibble to have.

The function as_tibble_col is a member of the larger family of functions as_tibble. Run ?as_tibble in the console for more information.
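As a quick self-contained illustration of as_tibble_col, independent of web scraping, we can turn any character vector into a named one-column tibble (the vector fruits here is made up):

```r
library(tibble)

# A plain character vector becomes a one-column tibble named "fruit"
fruits = c("apple", "banana", "cherry")
as_tibble_col(fruits, column_name = "fruit")
## # A tibble: 3 × 1
##   fruit 
##   <chr> 
## 1 apple 
## 2 banana
## 3 cherry
```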

6.2 Binding columns

Let’s assume that besides the quote, we also care about the quote’s author. By using the SelectorGadget extension we identify that the selector “.author” encapsulates the author information:

s2 = s %>% html_nodes(".author") %>% html_text() %>% as_tibble_col("author")
s2
## # A tibble: 10 × 1
##    author           
##    <chr>            
##  1 Albert Einstein  
##  2 J.K. Rowling     
##  3 Albert Einstein  
##  4 Jane Austen      
##  5 Marilyn Monroe   
##  6 Albert Einstein  
##  7 André Gide       
##  8 Thomas A. Edison 
##  9 Eleanor Roosevelt
## 10 Steve Martin

Now we have two tibbles, s1 and s2, but we want to combine them into one tibble with the columns quote and author. The function bind_cols allows us to place the two tibbles side by side:

t = bind_cols(s1,s2)
t
## # A tibble: 10 × 2
##    quote                                                          author        
##    <chr>                                                          <chr>         
##  1 “The world as we have created it is a process of our thinking… Albert Einste…
##  2 “It is our choices, Harry, that show what we truly are, far m… J.K. Rowling  
##  3 “There are only two ways to live your life. One is as though … Albert Einste…
##  4 “The person, be it gentleman or lady, who has not pleasure in… Jane Austen   
##  5 “Imperfection is beauty, madness is genius and it's better to… Marilyn Monroe
##  6 “Try not to become a man of success. Rather become a man of v… Albert Einste…
##  7 “It is better to be hated for what you are than to be loved f… André Gide    
##  8 “I have not failed. I've just found 10,000 ways that won't wo… Thomas A. Edi…
##  9 “A woman is like a tea bag; you never know how strong it is u… Eleanor Roose…
## 10 “A day without sunshine is like, you know, night.”             Steve Martin

6.3 Scraping Yahoo! finance comments and reactions

Next, we will try to do something a little more substantial. Assume that we want to create a unique dataset about a set of securities that we care about. Perhaps we can find some unique information in the everyday comments and reactions of people who post on the Yahoo! finance board. For instance, assume that we care about the AAPL stock:

j = read_html("https://finance.yahoo.com/quote/AAPL/community?p=AAPL")
j
## {html_document}
## <html id="atomic" class="NoJs desktop" lang="en-US">
## [1] <head prefix="og: http://ogp.me/ns#">\n<meta http-equiv="Content-Type" co ...
## [2] <body>\n<div id="app"><div class="" data-reactroot="" data-reactid="1" da ...

By using the SelectorGadget Chrome extension, we identify the selector “.Pend\(8px\)”, which encapsulates users’ comments/responses:

j1 = j %>% html_nodes(".Pend\\(8px\\)") %>% html_text()
head(j1)
## [1] "from 125-147 took a long time-- new patterns emerging down in the mornings and closing green-- i prefer this to the green in the morning and closing red!  wednesday is all good news..but that doesn't mean much to the market in the short term"                                                                                                                                      
## [2] "MacBook Pro, iPhone 13 Pro and Apple Watch Series 7 are all home run products.  I know this because I’m buying all 3.  I’m not an ‘upgrade every year’ guy.  I wait for mature, refined, step up products and the aforementioned 3 check all the boxes.The best part is I will pay for these with my Apple dividends, my little ‘reward’ to myself for staying the course for a decade!"
## [3] "apple should of bought Tesla in 2018. now tesla is up 62 , 71 high intraday,  and Elon musk will continue to be the wealthiest person on the planet for the rest of his life.  no one will catch him.  Tim cook's biggest mistake."                                                                                                                                                     
## [4] "Man those Tesla investors are really pushing that stock to irrational exuberance….stock is going to need 10 years of eps catch up to give it a normal multiple."                                                                                                                                                                                                                        
## [5] "So the media is going to throw out all the doom and gloom buzzwords(crisis, chaos, panic, trouble etc…) they can this week to try and induce their needed selloff. They are going to point the mic at anyone, anyone! Who will talk about market panic and gloom. The worst two months(Aug-Sept) are gone and the best two(Nov-Dec)are upon us."                                        
## [6] "Very surprised with the action today!"

Let’s now add an additional column to the tibble that identifies the symbol of the stock:

d = j %>% html_nodes(".Pend\\(8px\\)") %>% 
  html_text() %>% as_tibble_col("comment") %>% 
  mutate(stockCode = "AAPL")
head(d)
## # A tibble: 6 × 2
##   comment                                                              stockCode
##   <chr>                                                                <chr>    
## 1 from 125-147 took a long time-- new patterns emerging down in the m… AAPL     
## 2 MacBook Pro, iPhone 13 Pro and Apple Watch Series 7 are all home ru… AAPL     
## 3 apple should of bought Tesla in 2018. now tesla is up 62 , 71 high … AAPL     
## 4 Man those Tesla investors are really pushing that stock to irration… AAPL     
## 5 So the media is going to throw out all the doom and gloom buzzwords… AAPL     
## 6 Very surprised with the action today!                                AAPL

6.4 Repetitive operations with purrr::map

Often we are not interested in only a single stock; instead, we want to collect information on multiple stocks. One way to do this would be to manually update the URL for each stock that we care about. However, this is neither efficient nor sustainable, especially when we are dealing with hundreds of stocks.

Luckily, the tidyverse package purrr and its map() family of functions allow us to perform such repetitive operations efficiently. Specifically, the map() function transforms the input object by applying a given function to each element of the input. For instance, we can apply the function nchar to each comment we scraped by calling map():

d %>% select(comment) %>% map(nchar)
## $comment
##  [1] 241 375 226 159 335  37 127  30  40  89 109  67 295  16  57 467 440 403 145
## [20]  38

Function nchar(x) calculates the number of characters of x.


Function map() returns an object of type list.

However, we often want a different type of result. For instance, we might want to compute the average number of characters per comment. The function map_dbl allows us to get this value directly as a double, instead of as a list:

d %>% select(comment) %>% map(nchar) %>% map_dbl(mean)
## comment 
##   184.8
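The map() and map_dbl() pattern works on any list or vector, not just scraped text. A small sketch with made-up data shows the difference between the two return types:

```r
library(purrr)

comments = list("short", "a much longer comment", "medium one")

# map() always returns a list of results
map(comments, nchar)

# map_dbl() returns a plain double vector instead
map_dbl(comments, nchar)
## [1]  5 21 10
```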

6.5 Custom R functions

Now back to our original problem. Based on what we have seen so far, it would be nice to use the map function and apply the same process repeatedly to different stocks, extracting their comments and reactions into one unified tibble. But how can we do this? What function goes to Yahoo! finance and extracts the information we want for an arbitrary set of stocks?

This is a rhetorical question: there isn’t a function that does that. But luckily, we can create our own unique function to do so:

getYahooFinanceComments = function(stockSymbol){
  j = read_html(paste("https://finance.yahoo.com/quote/",stockSymbol,"/community?p=", stockSymbol, sep=""))
  j = j %>% html_nodes(".Pend\\(8px\\)") %>% 
    html_text() %>% as_tibble_col("comment") %>% 
    mutate(stockCode = stockSymbol)
  return(j)
} 

There are some things to point out here:

  • The keyword function(){} identifies that this will be a custom R function.
  • The name getYahooFinanceComments is arbitrary. You can give your function any name you like.
  • Inside the parentheses of function(), a function can have any number of parameters.
  • Inside the brackets {} we identify the functionality of the function. If a function has parameters, these parameters are accessed and used inside the brackets {}.
  • The special keyword return() returns the result of the function. This result can be a tibble, a vector, a number, a string, or any other R object we would like it to be.
  • To define the function you will have to run its code. If you update the code of the function, you will need to re-run it in order for the updates to take effect.
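
The bullet points above can be illustrated with a minimal custom function that uses only base R (the name countWords and its purpose are made up for illustration):

```r
# A custom function with one parameter that returns the number of words in a string
countWords = function(text){
  words = strsplit(text, " ")[[1]]  # split the string on spaces
  return(length(words))             # the word count is the result
}

countWords("scraping the web with rvest")
## [1] 5
```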

Inside the getYahooFinanceComments custom function we use for the first time in this book the function paste(). This function concatenates vectors after converting them to characters. Run paste("one","apple") to see the result. Then re-run it with the option sep="_" as follows: paste("one","apple", sep="_").
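
Running those calls produces the following (paste0() is base R shorthand for paste(..., sep = "")):

```r
paste("one", "apple")
## [1] "one apple"

paste("one", "apple", sep = "_")
## [1] "one_apple"

# paste0() concatenates with no separator at all
paste0("one", "apple")
## [1] "oneapple"
```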

Function getYahooFinanceComments consolidates the steps we discussed above and returns a final tibble with two columns for a given stock symbol. For clarity, this is what the paste() function does inside read_html:

stockSymbol = "ZM"
paste("https://finance.yahoo.com/quote/",stockSymbol,"/community?p=", stockSymbol, sep="")
## [1] "https://finance.yahoo.com/quote/ZM/community?p=ZM"

Once a function is defined in our environment, we can call it, similarly to how we call any other R function such as filter, summarize, mean, and so on. For instance, we can call the new function getYahooFinanceComments on the TSLA stock:

resultTibble = getYahooFinanceComments("TSLA")
resultTibble
## # A tibble: 20 × 2
##    comment                                                             stockCode
##    <chr>                                                               <chr>    
##  1 Elon Musk's chess game and current silence speak volumes. In my vi… TSLA     
##  2 remember guys Tesla has the technology that made other car compani… TSLA     
##  3 I love this company... Too bad I sold out when it had the big drop… TSLA     
##  4 I think Tesla will hit around $900 next week, hopefully. I don't o… TSLA     
##  5 Successful people don't become that way overnight. What most  peop… TSLA     
##  6 Tesla the money machine! Bought another load of shares last week, … TSLA     
##  7 Michael Burry says he’s no longer betting against Tesla and that h… TSLA     
##  8 I'm in stocks since 25 years and I never saw so much price manipul… TSLA     
##  9 Maybe $900 by earnings. Very possible. $1000 next. Amazing!         TSLA     
## 10 Market will realize soon that Tesla is undervalued and will double… TSLA     
## 11 This will be north of $855-860 after monday.                        TSLA     
## 12 You can slam this stock all you want but it is not a good short ca… TSLA     
## 13 Covering my SHORT position now ! I know when to give up .... I wan… TSLA     
## 14 Michael burry shorted the stock at 600 and crowed about how smart … TSLA     
## 15 Hey Tesla bulls. Do you drive a Tesla? Do you have a plan to buy o… TSLA     
## 16 I have no idea why this stock has done so well lately because it’s… TSLA     
## 17 More than 800b$ valued for 235k cars every quarter ! Growing deliv… TSLA     
## 18 Gigafactory Berlin will make the Model Y the best selling car in t… TSLA     
## 19 Successful people don't become that way overnight. What most  peop… TSLA     
## 20 See ya all on the other side of 1000                                TSLA

6.6 Combining custom functions with map

Now, we can use the function map along with our custom function getYahooFinanceComments to scrape multiple stocks with one simple command, without manually adjusting the code. Specifically, we will use the function map_dfr that will return the result as a single tibble (data frame):

d = c("AAPL",'TSLA','ZM') %>% map_dfr(getYahooFinanceComments)
d %>% head
## # A tibble: 6 × 2
##   comment                                                              stockCode
##   <chr>                                                                <chr>    
## 1 from 125-147 took a long time-- new patterns emerging down in the m… AAPL     
## 2 MacBook Pro, iPhone 13 Pro and Apple Watch Series 7 are all home ru… AAPL     
## 3 apple should of bought Tesla in 2018. now tesla is up 62 , 71 high … AAPL     
## 4 Man those Tesla investors are really pushing that stock to irration… AAPL     
## 5 So the media is going to throw out all the doom and gloom buzzwords… AAPL     
## 6 Very surprised with the action today!                                AAPL
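The same pattern works with any function that returns a tibble. Here is a self-contained sketch that needs no network access; the helper describeSymbol is a made-up stand-in for getYahooFinanceComments:

```r
library(purrr)
library(tibble)

# A toy stand-in: returns a one-row tibble for a given stock symbol
describeSymbol = function(stockSymbol){
  tibble(stockCode = stockSymbol, nChars = nchar(stockSymbol))
}

# map_dfr() calls the function on each symbol and row-binds the results
c("AAPL", "TSLA", "ZM") %>% map_dfr(describeSymbol)
```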

6.7 Optional and Advanced: Scraping dynamic pages with R Selenium

So far we have talked about scraping static web pages. However, many of the pages that we care about are dynamic (interactive). In those cases, we need more advanced techniques to scrape the web.

The package RSelenium automates a web browser’s actions and facilitates the scraping of dynamic web pages. First, we need to install and load the package:

install.packages("RSelenium") 
library(RSelenium)

For RSelenium to work, we need to use a browser driver (code that mimics a browser). Hence, we will install a Firefox webdriver (this is different from your Firefox browser).

Download it from here (make sure you also have Firefox installed on your machine): https://github.com/mozilla/geckodriver/releases/tag/v0.27.0

Store the driver in your Rmd folder.

If you do not have Java installed on your machine, you will also need to download a recent JDK from here: https://www.oracle.com/java/technologies/javase-jdk15-downloads.html

Once you complete the setup, you can run the following, and you should observe a Firefox browser opening up:

driver <- rsDriver(browser = "firefox", port = 4449L)
remDr <- driver[["client"]]

Next, we can use the command navigate to visit a URL. Let us visit the Yahoo! finance main web page (by running this code, you should be able to see the page in your Firefox browser):

remDr$navigate("https://finance.yahoo.com")

Assume that our goal is to be able to search for anything on Yahoo! finance. We first need to identify the tags of the search box.

Similar to the SelectorGadget extension, you can install the Selenium IDE extension on Chrome. This extension allows you to record your actions and extract the necessary tags.

The function findElement allows us to locate the element that we are looking for. Here, the next line identifies the Yahoo! finance search box on the main page:

search_box <- remDr$findElement(using = 'id', value = 'yfin-usr-qry')

Once we have identified the search box, we can send it any text we want to search for with the function sendKeysToElement(). For instance, the ETF VTI:

search_box$sendKeysToElement(list("VTI"))

With the previous command, you should be able to see the letters VTI appearing in the search box. Mind-blowing, right? :)

Now we will identify the search button, and then click on it with the function clickElement:

clickSearch <- remDr$findElement(using = 'css', value = "#header-desktop-search-button > .Cur\\(p\\)")
clickSearch$clickElement()
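
Putting the pieces together, here is a sketch of a complete session. It assumes the geckodriver setup above, opens a real browser window, and will not run in a plain headless R session. Note that sendKeysToElement() can also press Enter directly via its key = "enter" argument, and that we should always shut the browser and the driver server down when finished:

```r
library(RSelenium)

# Start the driver and grab the browser client
driver <- rsDriver(browser = "firefox", port = 4449L)
remDr  <- driver[["client"]]

remDr$navigate("https://finance.yahoo.com")

# Type a query and press Enter in one step
search_box <- remDr$findElement(using = "id", value = "yfin-usr-qry")
search_box$sendKeysToElement(list("VTI", key = "enter"))

# Clean up: close the browser window and stop the driver server
remDr$close()
driver$server$stop()
```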
