8 Regular expressions
In this short chapter we introduce regular expressions, which allow us to match and extract useful patterns from unstructured text.
The regular expression functions that we will use come from the package stringr, which is included in tidyverse:
library(tidyverse)
8.1 Atoms
An atom specifies what text is to be matched and where it is to be found. There are four types of atoms:
- a single character
- a dot
- a class
- an anchor
To match patterns, we will use the function str_detect, which returns TRUE if the pattern is found anywhere in the string and FALSE otherwise.
8.1.1 A single character
The following tests whether the single characters “L” and “S” are found in “HELLO”:
str_detect("HELLO","L")
## [1] TRUE
str_detect("HELLO","S")
## [1] FALSE
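Because str_detect is vectorised over its first argument, the same test can be applied to many strings at once. The following is a minimal sketch; the vector words is made up for illustration and is not part of the chapter's data:

```r
library(stringr)

# Hypothetical input vector, used only for this illustration
words = c("HELLO", "SALSA", "MINT")

# One TRUE/FALSE per element: does the element contain an "L"?
has_l = str_detect(words, "L")
has_l
## [1]  TRUE  TRUE FALSE

# Matching is case sensitive: a lowercase "l" is not found in "HELLO"
lower_l = str_detect("HELLO", "l")
lower_l
## [1] FALSE
```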
8.1.2 The dot
The dot matches any single character:
str_detect("HELLO",".")
## [1] TRUE
str_detect("HELLO","H.")
## [1] TRUE
str_detect("HELLO","h.")
## [1] FALSE
str_detect("HELLO",".L.O")
## [1] TRUE
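Since the dot matches any character, it cannot be used as-is to look for a literal period. A short sketch of two common ways around this, escaping the dot or wrapping the pattern in stringr's fixed() (the variable names are only illustrative):

```r
library(stringr)

# An unescaped dot matches any character, so this is TRUE even without a period
any_char = str_detect("HELLO", ".")

# "\\." (a backslash-escaped dot) matches only a literal period
literal_no = str_detect("HELLO", "\\.")
literal_yes = str_detect("HELLO.", "\\.")

# fixed() treats the pattern as plain text rather than a regular expression
fixed_yes = str_detect("HELLO.", fixed("."))

c(any_char, literal_no, literal_yes, fixed_yes)
## [1]  TRUE FALSE  TRUE  TRUE
```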
8.1.3 Class
A class matches if any one of the single characters inside the class is found in the examined string.
We define a class by using brackets. For instance, [ABL] matches A, B, or L:
str_detect("HELLO","[ABL]")
## [1] TRUE
str_detect("HELLO A or HELLO B","[AB]")
## [1] TRUE
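Ranges are a common way to define a class. For instance, [0-9] matches any digit, [a-z] any lowercase letter, and [a-zA-Z] any letter: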
str_detect("HELLO A or HELLO B","[0-9]")
## [1] FALSE
str_detect("HELLO A HELLO B","[a-z]")
## [1] FALSE
str_detect("HELLO A HELLO B","[a-zA-Z]")
## [1] TRUE
str_detect("HELLO A HELLO B.",".")
## [1] TRUE
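Some frequently used classes also have shorthand forms. The sketch below compares an explicit range with the \\d shorthand and the POSIX-style [[:digit:]] class; the example string is made up for illustration:

```r
library(stringr)

# Hypothetical example string containing one digit
x = "HELLO A or HELLO 7"

digit_range = str_detect(x, "[0-9]")        # explicit range
digit_short = str_detect(x, "\\d")          # shorthand for a digit
digit_posix = str_detect(x, "[[:digit:]]")  # POSIX-style class
space_short = str_detect(x, "\\s")          # any whitespace character

c(digit_range, digit_short, digit_posix, space_short)
## [1] TRUE TRUE TRUE TRUE
```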
8.1.4 Anchor
An anchor atom specifies the position in the string where a match must occur. A caret (^) identifies the beginning of the line, while a dollar sign ($) identifies the end of the line.
str_detect("HELLO A HELLO B.","^H")
## [1] TRUE
str_detect("HELLO A HELLO B.","^H$")
## [1] FALSE
str_detect("HELLO A HELLO B.",".$")
## [1] TRUE
str_detect("HELLO A or HELLO B","[^0-9]")
## [1] TRUE
Note that a caret as the first character inside brackets represents a logical NOT. Hence, [^0-9] means NOT a digit.
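Combining both anchors forces the pattern to cover the entire string. A minimal sketch, using made-up strings, that also contrasts the two meanings of the caret:

```r
library(stringr)

# Hypothetical strings for illustration
x = c("HELLO", "HELLO A", "SAY HELLO")

starts = str_detect(x, "^HELLO")   # begins with HELLO:       TRUE  TRUE FALSE
ends   = str_detect(x, "HELLO$")   # ends with HELLO:         TRUE FALSE  TRUE
whole  = str_detect(x, "^HELLO$")  # is exactly HELLO:        TRUE FALSE FALSE

# The first caret is an anchor, the one inside the brackets is a NOT:
# the whole string must consist of non-digit characters only
no_digits = str_detect(c("HELLO", "HELLO 4"), "^[^0-9]+$")
no_digits
## [1]  TRUE FALSE
```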
8.2 Operators
Operators in regular expressions combine atoms. An operator can be:
- a sequence of atoms
- alternation of atoms
- repetition of atoms
- grouping of atoms.
8.2.1 Sequence
A sequence of atoms matches when the atoms appear one right after the other in the string:
str_detect("HELLO A HELLO B.","HELLO")
## [1] TRUE
str_detect("HELLO A HELLO B.","[A-Z][A-Z]")
## [1] TRUE
str_detect("HELLO A 6 4 HELLO B.","[0-9][0-9]")
## [1] FALSE
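The last example returns FALSE because the two digits are separated by a space, so they are never adjacent. A short sketch of how the result changes when either the string or the pattern accounts for that (the variable names are only illustrative):

```r
library(stringr)

# Digits separated by a space: the sequence [0-9][0-9] is not found
spaced = str_detect("HELLO A 6 4 HELLO B.", "[0-9][0-9]")

# Adjacent digits: the same pattern now matches "64"
adjacent = str_detect("HELLO A 64 HELLO B.", "[0-9][0-9]")

# Alternatively, make the space part of the sequence
with_space = str_detect("HELLO A 6 4 HELLO B.", "[0-9] [0-9]")

c(spaced, adjacent, with_space)
## [1] FALSE  TRUE  TRUE
```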
8.2.2 Alternation
Alternation is similar to logical OR:
str_detect("HELLO A HELLO B.","HELLO|A")
## [1] TRUE
str_detect("HELLO A HELLO B.","[A-Z]|[0-9]")
## [1] TRUE
str_detect("HELLO A8 HELLO B.","[A-Za-z][0-9]|[0-9]")
## [1] TRUE
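8.2.3 Repetition
The repetition operator is represented by curly brackets. For instance, {m,n} matches the atom or expression that appears before the brackets from m to n times. Three shorthands cover the most common cases: * (zero or more times), + (one or more times), and ? (zero or one time):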
str_detect("HELLO A HELLO B.","L{3,4}")
## [1] FALSE
str_detect("",".*")
## [1] TRUE
str_detect("HELLO A","H+")
## [1] TRUE
str_detect("HELLO A","K?")
## [1] TRUE
str_detect("HELLO A","L?")
## [1] TRUE
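Patterns such as K? return TRUE because zero occurrences count as a match. One way to see this is to look at the matched text with str_extract (a stringr function that returns the first match); this is only an illustrative sketch:

```r
library(stringr)

x = "HELLO A"

range_match = str_extract(x, "L{1,2}")  # one or two L's: "LL"
plus_match  = str_extract(x, "L+")      # one or more L's: "LL"
star_match  = str_extract(x, "L*")      # zero or more: empty match at the start, ""
opt_match   = str_extract(x, "K?")      # zero or one K: also an empty match, ""

# An empty match still counts as a match, hence TRUE
str_detect(x, "K?")
## [1] TRUE
```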
8.2.4 Grouping
Grouping is represented by parentheses and identifies the expression that a subsequent operator will be applied to:
str_detect("HELLO A","(LO)?")
## [1] TRUE
str_detect("HELLO A","(LOL)+")
## [1] FALSE
str_detect("HELLO A","(LOL)+|(LO)*")
## [1] TRUE
str_detect("HELLO A","(LOL)+|([0-9]|A){2,5}")
## [1] FALSE
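Grouping also controls how far an alternation reaches. Without parentheses, | splits the entire pattern in two; with them, it only applies inside the group. A minimal sketch on made-up strings:

```r
library(stringr)

# Hypothetical strings for illustration
x = c("HELLO", "HELLA", "A")

# Either "HELLO" or "A" anywhere in the string
ungrouped = str_detect(x, "HELLO|A")
ungrouped
## [1] TRUE TRUE TRUE

# "HELL" followed by either "O" or "A"
grouped = str_detect(x, "HELL(O|A)")
grouped
## [1]  TRUE  TRUE FALSE
```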
8.3 Text manipulation: an example on news titles
Load the dataset newsTitles.csv into a tibble:
d = read_csv("../data/newsTitles.csv")
d
## # A tibble: 20,000 × 2
## category title
## <chr> <chr>
## 1 Sports BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS
## 2 Sports Khorkina #39;s Final Act Centers on Bitterness
## 3 Sports JOHNSON AND JONES PROGRESS
## 4 Sports World Cup notebook: Redden skates but probably won #39;t play Satur…
## 5 Business Costco Is Accused of Sex Bias
## 6 Sports Olympics: Greek Sprinters Finally Get Chance to Put Case
## 7 Business China #39;s appetite boosts BHP
## 8 Sports Moving Beyond Naming Names
## 9 Business FTC Seeks to Delay Arch-Triton Merger
## 10 Sports Tennessee Titans Team Report
## # … with 19,990 more rows
8.3.1 stringr::str_to_lower
The first thing we will do is convert all titles to lower case. Converting to lower case streamlines the process of identifying patterns in text, because a single lowercase pattern then matches every capitalization of a word. To do so, we will use the function str_to_lower():
d = d %>% mutate(title = str_to_lower(title))
d
## # A tibble: 20,000 × 2
## category title
## <chr> <chr>
## 1 Sports baseball: red-hot sox clip the angels #39; wings
## 2 Sports khorkina #39;s final act centers on bitterness
## 3 Sports johnson and jones progress
## 4 Sports world cup notebook: redden skates but probably won #39;t play satur…
## 5 Business costco is accused of sex bias
## 6 Sports olympics: greek sprinters finally get chance to put case
## 7 Business china #39;s appetite boosts bhp
## 8 Sports moving beyond naming names
## 9 Business ftc seeks to delay arch-triton merger
## 10 Sports tennessee titans team report
## # … with 19,990 more rows
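When the original capitalization needs to be preserved, an alternative is to leave the text untouched and make the match itself case insensitive with stringr's regex() helper. The sketch below compares the two approaches on two titles taken from the listing above (shortened for illustration):

```r
library(stringr)

# Two titles from the dataset preview, shortened for illustration
titles = c("BASEBALL: RED-HOT SOX CLIP THE ANGELS",
           "Costco Is Accused of Sex Bias")

# Approach used in this chapter: lower-case the text, then match
lowered = str_detect(str_to_lower(titles), "sox")

# Alternative: keep the text as is and ignore case while matching
ignored = str_detect(titles, regex("sox", ignore_case = TRUE))

identical(lowered, ignored)
## [1] TRUE
```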
8.3.2 NYC and Boston news subsetting
Now we can use the function str_detect along with some simple regular expressions to manipulate the text in this dataset.
Let’s assume that we only care about news stories that are about New York or Boston:
d %>% filter(str_detect(title, "boston|nyc|new york"))
## # A tibble: 98 × 2
## category title
## <chr> <chr>
## 1 Sports mlb: philadelphia 9, new york mets 5
## 2 Sports 2-run single by bellhorn lifts boston
## 3 Business adv: the new york times home delivery
## 4 Business stocks creep higher in new york
## 5 Business bofa pledges to move unit to boston
## 6 Business boston scientific stent gets extension
## 7 Sports new york averts sweep by twins
## 8 Sports boston eclipse yankees
## 9 Business boston scientific's ireland plant cleared
## 10 Sports federer aims to put out new york #39;s bush fires
## # … with 88 more rows
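Keep in mind that these patterns match substrings, so a pattern such as boston would also be found inside a longer word. If that is a concern, word boundaries (\\b) restrict the match to whole words. A small sketch on two made-up titles:

```r
library(stringr)

# Hypothetical titles, made up for illustration
titles = c("boston eclipse yankees", "bostonian of the year named")

# Substring match: also finds "boston" inside "bostonian"
loose = str_detect(titles, "boston")
loose
## [1] TRUE TRUE

# \\b marks a word boundary, so only the whole word "boston" matches
strict = str_detect(titles, "\\bboston\\b")
strict
## [1]  TRUE FALSE
```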
Let’s create a new binary column that stores whether or not the news story is about New York:
d1 = d %>% filter(str_detect(title, "boston|nyc|new york")) %>%
  mutate(is_nyc = str_detect(title, "nyc|new york"))
d1
## # A tibble: 98 × 3
## category title is_nyc
## <chr> <chr> <lgl>
## 1 Sports mlb: philadelphia 9, new york mets 5 TRUE
## 2 Sports 2-run single by bellhorn lifts boston FALSE
## 3 Business adv: the new york times home delivery TRUE
## 4 Business stocks creep higher in new york TRUE
## 5 Business bofa pledges to move unit to boston FALSE
## 6 Business boston scientific stent gets extension FALSE
## 7 Sports new york averts sweep by twins TRUE
## 8 Sports boston eclipse yankees FALSE
## 9 Business boston scientific's ireland plant cleared FALSE
## 10 Sports federer aims to put out new york #39;s bush fires TRUE
## # … with 88 more rows
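The same filter-then-flag steps can be tried without the CSV file. The sketch below rebuilds them on a four-row tibble called toy (a hypothetical stand-in assembled from titles shown in this chapter) and then counts stories per category and city flag:

```r
library(tidyverse)

# Hypothetical stand-in for the dataset, built from titles shown above
toy = tibble(
  category = c("Sports", "Sports", "Business", "Business"),
  title = c("mlb: philadelphia 9, new york mets 5",
            "2-run single by bellhorn lifts boston",
            "stocks creep higher in new york",
            "costco is accused of sex bias")
)

# Keep the New York and Boston stories, then flag the New York ones
toy1 = toy %>%
  filter(str_detect(title, "boston|nyc|new york")) %>%
  mutate(is_nyc = str_detect(title, "nyc|new york"))

# Number of stories per category and flag
nyc_counts = toy1 %>% count(category, is_nyc)
nyc_counts
```

The Costco title is dropped by the filter, so toy1 has three rows, two of which are flagged as New York stories.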
8.3.3 stringr::str_replace_all
If you look at the titles, you will see many occurrences of #39;. This is most likely a remnant of the HTML character entity &#39;, where 39 is the ASCII code of the apostrophe. We don’t want #39; in our text, so we will replace it (together with the space that precedes it) with an apostrophe. To do so, we will use the function str_replace_all:
d1 = d1 %>% mutate(title = str_replace_all(title, " #39;", "'"))
d1
## # A tibble: 98 × 3
## category title is_nyc
## <chr> <chr> <lgl>
## 1 Sports mlb: philadelphia 9, new york mets 5 TRUE
## 2 Sports 2-run single by bellhorn lifts boston FALSE
## 3 Business adv: the new york times home delivery TRUE
## 4 Business stocks creep higher in new york TRUE
## 5 Business bofa pledges to move unit to boston FALSE
## 6 Business boston scientific stent gets extension FALSE
## 7 Sports new york averts sweep by twins TRUE
## 8 Sports boston eclipse yankees FALSE
## 9 Business boston scientific's ireland plant cleared FALSE
## 10 Sports federer aims to put out new york's bush fires TRUE
## # … with 88 more rows
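The _all suffix matters: stringr's str_replace replaces only the first match, whereas str_replace_all replaces every match. A minimal sketch on made-up strings; the pattern " ?#39;" in the last line is an assumed variation that also covers cases without a leading space:

```r
library(stringr)

# Hypothetical title with two occurrences of the artifact
x = "won #39;t play in new york #39;s open"

# Only the first occurrence is replaced
first_only = str_replace(x, " #39;", "'")
first_only
## [1] "won't play in new york #39;s open"

# Every occurrence is replaced
all_of_them = str_replace_all(x, " #39;", "'")
all_of_them
## [1] "won't play in new york's open"

# " ?" makes the leading space optional
flexible = str_replace_all("it#39;s new york #39;s", " ?#39;", "'")
flexible
## [1] "it's new york's"
```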
8.3.4 stringr::str_extract
Finally, we can use the function str_extract to extract numbers that appear in text. The following creates a new column that stores the first number that appears in a title (or NA if no number appears). Note that the extracted values are stored as text (<chr>), not as numbers:
d1 %>% mutate(numbers = str_extract(title,"[0-9]+"))
## # A tibble: 98 × 4
## category title is_nyc numbers
## <chr> <chr> <lgl> <chr>
## 1 Sports mlb: philadelphia 9, new york mets 5 TRUE 9
## 2 Sports 2-run single by bellhorn lifts boston FALSE 2
## 3 Business adv: the new york times home delivery TRUE <NA>
## 4 Business stocks creep higher in new york TRUE <NA>
## 5 Business bofa pledges to move unit to boston FALSE <NA>
## 6 Business boston scientific stent gets extension FALSE <NA>
## 7 Sports new york averts sweep by twins TRUE <NA>
## 8 Sports boston eclipse yankees FALSE <NA>
## 9 Business boston scientific's ireland plant cleared FALSE <NA>
## 10 Sports federer aims to put out new york's bush fires TRUE <NA>
## # … with 88 more rows
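To capture every number in a title rather than only the first, stringr also provides str_extract_all, which returns a list with one character vector per input string. A short sketch using the first title from the table above:

```r
library(stringr)

title = "mlb: philadelphia 9, new york mets 5"

# First number only, returned as text
first_number = str_extract(title, "[0-9]+")
first_number
## [1] "9"

# All numbers: take the first (and only) element of the returned list
all_numbers = str_extract_all(title, "[0-9]+")[[1]]

# Convert the extracted text to integers before doing arithmetic
scores = as.integer(all_numbers)
scores
## [1] 9 5
```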
For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu