8 Regular expressions

In this short chapter we introduce regular expressions. Regular expressions allow us to match and extract useful patterns from unstructured text. The regular expression functions that we will use come from the package stringr, which is included in tidyverse:

library(tidyverse)

8.1 Atoms

An atom specifies what text is to be matched and where it is to be found. There are four types of atoms:

a single character
a dot
a class
an anchor

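To test for patterns, we will use the function str_detect. Its first argument is the string to examine and its second argument is the pattern; it returns TRUE if the pattern is found anywhere in the string and FALSE otherwise.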

8.1.1 A single character

The following tests whether the single characters “L” and “S” are found in “HELLO”:

str_detect("HELLO","L")

## [1] TRUE

str_detect("HELLO","S")

## [1] FALSE

8.1.2 The dot

The dot matches any single character. Matching is case sensitive, which is why the third example below returns FALSE:

str_detect("HELLO",".")

## [1] TRUE

str_detect("HELLO","H.")

## [1] TRUE

str_detect("HELLO","h.")

## [1] FALSE

str_detect("HELLO",".L.O")

## [1] TRUE
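Because the dot matches any character, a pattern meant to find a literal period needs the dot escaped. The following is a minimal sketch of two ways to do that, assuming only that stringr is installed; the test strings are made up for illustration.

```r
library(stringr)

# An unescaped dot matches any character, so this is TRUE even without a period
str_detect("HELLO", ".")          # TRUE

# An escaped dot (a double backslash inside an R string) matches a literal period only
str_detect("HELLO", "\\.")        # FALSE
str_detect("HELLO.", "\\.")       # TRUE

# fixed() treats the whole pattern as literal text instead of a regular expression
str_detect("HELLO.", fixed("."))  # TRUE
```

Escaping is needed for any character that has a special meaning in regular expressions, not only for the dot.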

8.1.3 Class

A class matches any one of the single characters listed inside it, so str_detect returns TRUE if at least one of those characters is found in the examined string. We define a class by using brackets. For instance, [ABL] matches A, B, or L:

str_detect("HELLO","[ABL]")

## [1] TRUE

str_detect("HELLO A or HELLO B","[AB]")

## [1] TRUE

Ranges are a common way to define a class. For instance, [0-9] matches any digit and [a-z] matches any lowercase letter:

str_detect("HELLO A or HELLO B","[0-9]")

## [1] FALSE

str_detect("HELLO A  HELLO B","[a-z]")

## [1] FALSE

str_detect("HELLO A  HELLO B","[a-zA-Z]")

## [1] TRUE

str_detect("HELLO A  HELLO B.",".")

## [1] TRUE
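Classes become more useful when applied to several strings at once, since str_detect is vectorized over its first argument. The sketch below uses a made-up vector x (not part of the chapter's data) and also calls str_extract, introduced later in this chapter, to show which character a class actually matched.

```r
library(stringr)

# Hypothetical example strings
x = c("HELLO", "hello", "h3llo", "12345")

# Does each string contain at least one uppercase letter?
str_detect(x, "[A-Z]")   # TRUE FALSE FALSE FALSE

# Does each string contain at least one digit?
str_detect(x, "[0-9]")   # FALSE FALSE TRUE TRUE

# First character in each string that is a lowercase a, e, or o (NA if none)
str_extract(x, "[aeo]")  # NA "e" "o" NA
```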

8.1.4 Anchor

An anchor atom specifies the position in the string where a match must occur. A caret (^) identifies the beginning of the line, while a dollar sign ($) identifies the end of the line.

str_detect("HELLO A  HELLO B.","^H")

## [1] TRUE

str_detect("HELLO A  HELLO B.","^H$")

## [1] FALSE

str_detect("HELLO A  HELLO B.",".$")

## [1] TRUE

str_detect("HELLO A or HELLO B","[^0-9]")

## [1] TRUE

Note that a caret inside brackets is not an anchor: it represents a logical NOT. Hence, [^0-9] matches any character that is NOT a digit.
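The two roles of the caret are easy to confuse, so the sketch below puts them side by side. The vector x is a made-up example, not part of the chapter's data.

```r
library(stringr)

# Hypothetical example strings
x = c("2024 report", "report 2024", "2024")

# ^ outside brackets anchors the match to the start of the string
str_detect(x, "^[0-9]")                  # TRUE FALSE TRUE

# $ anchors the match to the end of the string
str_detect(x, "[0-9]$")                  # FALSE TRUE TRUE

# Both anchors together: the whole string must be exactly four digits
str_detect(x, "^[0-9][0-9][0-9][0-9]$")  # FALSE FALSE TRUE

# ^ inside brackets negates the class: is there any non-digit character?
str_detect(x, "[^0-9]")                  # TRUE TRUE FALSE
```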

8.2 Operators

Operators in regular expressions combine atoms. An operator can be:

a sequence of atoms
alternation of atoms
repetition of atoms
grouping of atoms

8.2.1 Sequence

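A sequence of atoms matches only when the atoms appear consecutively, in the given order. In the last example below, the digits 6 and 4 are separated by a space, so [0-9][0-9] does not match: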

str_detect("HELLO A  HELLO B.","HELLO")

## [1] TRUE

str_detect("HELLO A  HELLO B.","[A-Z][A-Z]")

## [1] TRUE

str_detect("HELLO A 6 4 HELLO B.","[0-9][0-9]")

## [1] FALSE

8.2.2 Alternation

Alternation is represented by a vertical bar (|) and is similar to logical OR:

str_detect("HELLO A  HELLO B.","HELLO|A")

## [1] TRUE

str_detect("HELLO A  HELLO B.","[A-Z]|[0-9]")

## [1] TRUE

str_detect("HELLO A8  HELLO B.","[A-Za-z][0-9]|[0-9]")

## [1] TRUE
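One detail worth illustrating is how far an alternation reaches: the vertical bar separates everything to its left from everything to its right. The sketch below contrasts this with a class, which offers a choice at a single position only. The vector x is made up for illustration.

```r
library(stringr)

# Hypothetical example strings
x = c("cat", "car", "ca", "t")

# The bar splits the whole pattern: this reads as "ca" OR "t"
str_detect(x, "ca|t")    # TRUE TRUE TRUE TRUE

# A class restricts the choice to one position: "ca" followed by t or r
str_detect(x, "ca[tr]")  # TRUE TRUE FALSE FALSE
```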

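8.2.3 Repetition

The general repetition operator is written with curly brackets: {m,n} matches the atom or grouped expression that precedes it from m to n times. Three shorthand operators cover the most common cases: * matches zero or more times, + matches one or more times, and ? matches zero or one time. Because * and ? accept zero occurrences, a pattern such as K? matches even when the string contains no K: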

str_detect("HELLO A  HELLO B.","L{3,4}")

## [1] FALSE

str_detect("",".*")

## [1] TRUE

str_detect("HELLO A","H+")

## [1] TRUE

str_detect("HELLO A","K?")

## [1] TRUE

str_detect("HELLO A","L?")

## [1] TRUE
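Since str_detect only reports whether a match exists, it can help to look at the text that each repetition operator actually captures. The sketch below does this with str_extract, introduced later in this chapter, on the same string used above.

```r
library(stringr)

x = "HELLO A"

# Exactly two consecutive L characters
str_extract(x, "L{2}")     # "LL"

# One or two L characters followed by O
str_extract(x, "L{1,2}O")  # "LLO"

# One or more L characters
str_extract(x, "L+")       # "LL"

# "HE" followed by zero or more L characters
str_extract(x, "HEL*")     # "HELL"

# An optional K between H and E; the K is absent, so only "HE" is matched
str_extract(x, "HK?E")     # "HE"
```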

8.2.4 Grouping

Grouping is represented by parentheses and identifies the expression that a subsequent operator will be applied to:

str_detect("HELLO A","(LO)?")

## [1] TRUE

str_detect("HELLO A","(LOL)+")

## [1] FALSE

str_detect("HELLO A","(LOL)+|(LO)*")

## [1] TRUE

str_detect("HELLO A","(LOL)+|([0-9]|A){2,5}")

## [1] FALSE
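The effect of grouping is easiest to see by comparing the same repetition operator with and without parentheses. The sketch below uses a made-up string and str_extract, introduced later in this chapter, to show the matched text.

```r
library(stringr)

# Hypothetical example string
x = "LOLOLO"

# Without grouping, + applies only to the preceding atom O
str_extract(x, "LO+")      # "LO"

# With grouping, + applies to the whole sequence LO
str_extract(x, "(LO)+")    # "LOLOLO"

# Exactly two repetitions of the group
str_extract(x, "(LO){2}")  # "LOLO"
```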

8.3 Text manipulation: an example on news titles

Load the dataset newsTitles.csv into a tibble:

d = read_csv("../data/newsTitles.csv")
d

## # A tibble: 20,000 × 2
##    category title                                                               
##    <chr>    <chr>                                                               
##  1 Sports   BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS                    
##  2 Sports   Khorkina #39;s Final Act Centers on Bitterness                      
##  3 Sports   JOHNSON AND JONES PROGRESS                                          
##  4 Sports   World Cup notebook: Redden skates but probably won #39;t play Satur…
##  5 Business Costco Is Accused of Sex Bias                                       
##  6 Sports   Olympics: Greek Sprinters Finally Get Chance to Put Case            
##  7 Business China #39;s appetite boosts BHP                                     
##  8 Sports   Moving Beyond Naming Names                                          
##  9 Business FTC Seeks to Delay Arch-Triton Merger                               
## 10 Sports   Tennessee Titans Team Report                                        
## # … with 19,990 more rows

8.3.1 `stringr::str_to_lower`

The first thing we will do is convert all titles to lower case. Because pattern matching is case sensitive, working with lower-case text means that a single pattern such as "boston" finds "Boston", "BOSTON", and "boston" alike. To do so, we will use the function str_to_lower():

d = d %>% mutate(title = str_to_lower(title))
d

## # A tibble: 20,000 × 2
##    category title                                                               
##    <chr>    <chr>                                                               
##  1 Sports   baseball: red-hot sox clip the angels #39; wings                    
##  2 Sports   khorkina #39;s final act centers on bitterness                      
##  3 Sports   johnson and jones progress                                          
##  4 Sports   world cup notebook: redden skates but probably won #39;t play satur…
##  5 Business costco is accused of sex bias                                       
##  6 Sports   olympics: greek sprinters finally get chance to put case            
##  7 Business china #39;s appetite boosts bhp                                     
##  8 Sports   moving beyond naming names                                          
##  9 Business ftc seeks to delay arch-triton merger                               
## 10 Sports   tennessee titans team report                                        
## # … with 19,990 more rows
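To see why lowering the case matters for matching, the sketch below runs the same pattern before and after the conversion on two sample titles (the first is shortened from the output above). It also shows an alternative that leaves the text unchanged: wrapping the pattern in stringr's regex() with ignore_case = TRUE. That alternative is not what this chapter uses; it is included only for comparison.

```r
library(stringr)

# Two sample titles (the first is shortened for illustration)
titles = c("BASEBALL: RED-HOT SOX", "Costco Is Accused of Sex Bias")

# Convert to lower case
str_to_lower(titles)
# "baseball: red-hot sox" "costco is accused of sex bias"

# Matching is case sensitive, so the lower-case pattern misses the original text
str_detect(titles, "sox")                             # FALSE FALSE

# After lowering the case, the pattern is found in the first title
str_detect(str_to_lower(titles), "sox")               # TRUE FALSE

# Alternative: keep the text as is and ignore case in the pattern
str_detect(titles, regex("sox", ignore_case = TRUE))  # TRUE FALSE
```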

8.3.2 NYC and Boston news subsetting

Now we can use the function str_detect along with some simple regular expressions to manipulate the text in this dataset. Let’s assume that we only care about news stories that mention New York or Boston:

d %>% filter(str_detect(title, "boston|nyc|new york"))

## # A tibble: 98 × 2
##    category title                                            
##    <chr>    <chr>                                            
##  1 Sports   mlb: philadelphia 9, new york mets 5             
##  2 Sports   2-run single by bellhorn lifts boston            
##  3 Business adv: the new york times home delivery            
##  4 Business stocks creep higher in new york                  
##  5 Business bofa pledges to move unit to boston              
##  6 Business boston scientific stent gets extension           
##  7 Sports   new york averts sweep by twins                   
##  8 Sports   boston eclipse yankees                           
##  9 Business boston scientific's ireland plant cleared        
## 10 Sports   federer aims to put out new york #39;s bush fires
## # … with 88 more rows

Let’s create a new logical column that stores whether or not the news story mentions New York:

d1 = d %>% filter(str_detect(title, "boston|nyc|new york")) %>% 
  mutate(is_nyc = str_detect(title, "nyc|new york"))
d1

## # A tibble: 98 × 3
##    category title                                             is_nyc
##    <chr>    <chr>                                             <lgl> 
##  1 Sports   mlb: philadelphia 9, new york mets 5              TRUE  
##  2 Sports   2-run single by bellhorn lifts boston             FALSE 
##  3 Business adv: the new york times home delivery             TRUE  
##  4 Business stocks creep higher in new york                   TRUE  
##  5 Business bofa pledges to move unit to boston               FALSE 
##  6 Business boston scientific stent gets extension            FALSE 
##  7 Sports   new york averts sweep by twins                    TRUE  
##  8 Sports   boston eclipse yankees                            FALSE 
##  9 Business boston scientific's ireland plant cleared         FALSE 
## 10 Sports   federer aims to put out new york #39;s bush fires TRUE  
## # … with 88 more rows
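For readers who do not have newsTitles.csv at hand, the same filter-then-flag steps can be tried on a small tibble built by hand. In the sketch below, the names news and cities are hypothetical, and the fourth title is invented so that the "nyc" alternative is exercised; the other titles are taken from the output above.

```r
library(dplyr)
library(stringr)

# A small stand-in for the news data (the "nyc subway" title is invented)
news = tibble(
  category = c("Sports", "Business", "Sports", "Business"),
  title = c("new york averts sweep by twins",
            "bofa pledges to move unit to boston",
            "tennessee titans team report",
            "nyc subway fares to rise")
)

# Keep titles that mention either city, then flag the New York ones
cities = news %>%
  filter(str_detect(title, "boston|nyc|new york")) %>%
  mutate(is_nyc = str_detect(title, "nyc|new york"))
cities
```

The third title mentions neither city and is dropped by filter, so cities has three rows with is_nyc equal to TRUE, FALSE, and TRUE.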

8.3.3 `stringr::str_replace_all`

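If you look at the titles, you will see many occurrences of #39;. The number 39 is the ASCII code for an apostrophe, and #39; is most likely what remains of the HTML character reference &#39;. We don’t want #39; in our text, so we will replace it with an apostrophe. To do so, we will use the function str_replace_all. Note that the pattern below includes the space that precedes #39;, so that "new york #39;s" becomes "new york's":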

d1 = d1 %>% mutate(title = str_replace_all(title, " #39;", "'")) 
d1

## # A tibble: 98 × 3
##    category title                                         is_nyc
##    <chr>    <chr>                                         <lgl> 
##  1 Sports   mlb: philadelphia 9, new york mets 5          TRUE  
##  2 Sports   2-run single by bellhorn lifts boston         FALSE 
##  3 Business adv: the new york times home delivery         TRUE  
##  4 Business stocks creep higher in new york               TRUE  
##  5 Business bofa pledges to move unit to boston           FALSE 
##  6 Business boston scientific stent gets extension        FALSE 
##  7 Sports   new york averts sweep by twins                TRUE  
##  8 Sports   boston eclipse yankees                        FALSE 
##  9 Business boston scientific's ireland plant cleared     FALSE 
## 10 Sports   federer aims to put out new york's bush fires TRUE  
## # … with 88 more rows
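The "all" in str_replace_all matters when a title contains the pattern more than once. The sketch below compares it with str_replace, which changes only the first match, on a made-up string.

```r
library(stringr)

# Hypothetical title with two occurrences of the pattern
x = "federer #39;s coach says he won #39;t play"

# str_replace changes only the first match
str_replace(x, " #39;", "'")
# "federer's coach says he won #39;t play"

# str_replace_all changes every match
str_replace_all(x, " #39;", "'")
# "federer's coach says he won't play"
```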

8.3.4 `stringr::str_extract`

Finally, we can use the function str_extract to extract numbers that appear in text. The following creates a new column that identifies the first number that appears in a title (or NA if no number appears):

d1 %>% mutate(numbers = str_extract(title,"[0-9]+"))

## # A tibble: 98 × 4
##    category title                                         is_nyc numbers
##    <chr>    <chr>                                         <lgl>  <chr>  
##  1 Sports   mlb: philadelphia 9, new york mets 5          TRUE   9      
##  2 Sports   2-run single by bellhorn lifts boston         FALSE  2      
##  3 Business adv: the new york times home delivery         TRUE   <NA>   
##  4 Business stocks creep higher in new york               TRUE   <NA>   
##  5 Business bofa pledges to move unit to boston           FALSE  <NA>   
##  6 Business boston scientific stent gets extension        FALSE  <NA>   
##  7 Sports   new york averts sweep by twins                TRUE   <NA>   
##  8 Sports   boston eclipse yankees                        FALSE  <NA>   
##  9 Business boston scientific's ireland plant cleared     FALSE  <NA>   
## 10 Sports   federer aims to put out new york's bush fires TRUE   <NA>   
## # … with 88 more rows
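Two follow-up points are worth a quick sketch. First, str_extract returns text, so the numbers column above has type character; as.integer converts it when numeric values are needed. Second, str_extract stops at the first match; stringr's str_extract_all returns every match, as a list with one element per input string. The vector x below reuses two titles from the output above.

```r
library(stringr)

x = c("mlb: philadelphia 9, new york mets 5", "boston eclipse yankees")

# First number in each title, as text (NA when there is none)
str_extract(x, "[0-9]+")              # "9" NA

# Convert the extracted text to integers
as.integer(str_extract(x, "[0-9]+"))  # 9 NA

# All numbers in each title: a list with one character vector per title
str_extract_all(x, "[0-9]+")
# [[1]] "9" "5"
# [[2]] character(0)
```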

For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu