4 Data formats
In this chapter we will explore datasets stored in different data formats, such as csv, json, and xls.
4.1 Loading CSV files into tibbles
A CSV file is a file with comma-separated values.
Function read_csv of the package readr allows us to load CSV files into tibbles. The package readr is included in the set of tidyverse packages. Recall (see Section 1.7) that in order to use functions from a package we first need to load the package by running library():
library(tidyverse)Now we will use the function read_csv() to load the data file yelp.csv which is stored in our data directory (see Section 2.1.2):
t = read_csv("../data/yelp.csv")
t## # A tibble: 2,529 × 25
## avg_stars median_stars std_stars avg_cool median_cool std_cool avg_useful
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 5 0 7 7 4.06 8.4
## 2 4.2 4 0.837 0.8 1 0.837 0.8
## 3 4.6 5 0.548 3.6 3 1.34 6.2
## 4 4 4 1.22 0.2 0 0.447 1.2
## 5 4 4 0.707 0.6 0 0.894 1.2
## 6 3.2 3 0.837 0.8 1 0.837 2.8
## 7 3.2 3 0.837 0.6 0 0.894 0.6
## 8 3 3 1 0.4 0 0.548 0.4
## 9 4.8 5 0.447 0.4 0 0.548 1
## 10 4.4 5 0.894 0.4 0 0.894 0.4
## # … with 2,519 more rows, and 18 more variables: median_useful <dbl>,
## # std_useful <dbl>, avg_funny <dbl>, median_funny <dbl>, std_funny <dbl>,
## # Take.out <lgl>, Caters <lgl>, Takes.Reservations <lgl>, Delivery <lgl>,
## # Has.TV <lgl>, Outdoor.Seating <lgl>, Alcohol <chr>, Waiter.Service <lgl>,
## # Accepts.Credit.Cards <lgl>, Good.for.Kids <lgl>, Good.For.Groups <lgl>,
## # Price.Range <dbl>, top10percent <dbl>
4.2 Loading JSON files into tibbles
JSON format uses two structures:
objectsandarrays. An object is an un-ordered set ofname-valuepairs (AKAkey-valuepairs). We can define a JSON object by starting with a curly bracket{, typing thename, followed by a colon:and then thevalue. We can separate multiple name-value pairs with acomma. A JSONarrayis an ordered sequence of values. We can define anarrayby opening a left bracket[, and then separate the values by commas. Note that avaluecan be a string, a number, a logical value, or an object, which essentially allows for having nestedname-valuepair objects.
To load JSON files into tibbles, we will use the function fromJSON from the package jsonlite.
If the package jsonlite is not already installed in your machine, you might first need to install it:
install.packages("jsonlite")Then, we can load it into our working environment:
library(jsonlite)We can now use the function fromJSON to load the json file into a dataframe. The as_tibble function customizes the dataframe into a tibble.
For instance, we can read the senators.json file as follows (taken from here: https://www.govtrack.us/api/v2/role?current=true&role_type=senator)
d = fromJSON("../data/senators.json")Tip: When loading an object that you do not know how it looks and what fields it stores, you can call the function names() as follows:
names(d)## [1] "meta" "objects"
We can customize the result in to a tibble by accessing the objects field of object d as follows:
t =as_tibble(d$objects)
t %>% head ## # A tibble: 6 × 22
## caucus congress_numbers current description district enddate extra$address
## <chr> <list> <lgl> <chr> <lgl> <chr> <chr>
## 1 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 239 Dirksen Se…
## 2 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 135 Hart Senat…
## 3 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 437 Russell Se…
## 4 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 522 Hart Senat…
## 5 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 154 Russell Se…
## 6 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 322 Hart Senat…
## # … with 15 more variables: leadership_title <chr>, party <chr>,
## # person <df[,17]>, phone <chr>, role_type <chr>, role_type_label <chr>,
## # senator_class <chr>, senator_class_label <chr>, senator_rank <chr>,
## # senator_rank_label <chr>, startdate <chr>, state <chr>, title <chr>,
## # title_long <chr>, website <chr>
Recall that for most practical purposes, a dataframe is the same object as a tible. We customize a dataframe into a tibble with the
as_tibblefunction for consistency.
And if we want more information, we can access the column person:
t$person %>% head## bioguideid birthday cspanid firstname gender gender_label lastname
## 1 C000880 1951-05-20 26440 Michael male Male Crapo
## 2 G000386 1933-09-17 1167 Charles male Male Grassley
## 3 L000174 1940-03-31 1552 Patrick male Male Leahy
## 4 M001153 1957-05-22 1004138 Lisa female Female Murkowski
## 5 M001111 1950-10-11 25277 Patty female Female Murray
## 6 S000148 1950-11-23 5929 Charles male Male Schumer
## link middlename
## 1 https://www.govtrack.us/congress/members/michael_crapo/300030 D.
## 2 https://www.govtrack.us/congress/members/charles_grassley/300048 E.
## 3 https://www.govtrack.us/congress/members/patrick_leahy/300065 J.
## 4 https://www.govtrack.us/congress/members/lisa_murkowski/300075 A.
## 5 https://www.govtrack.us/congress/members/patty_murray/300076
## 6 https://www.govtrack.us/congress/members/charles_schumer/300087 E.
## name namemod nickname osid pvsid
## 1 Sen. Michael “Mike” Crapo [R-ID] Mike N00006267 26830
## 2 Sen. Charles “Chuck” Grassley [R-IA] Chuck N00001758 53293
## 3 Sen. Patrick Leahy [D-VT] N00009918 53353
## 4 Sen. Lisa Murkowski [R-AK] N00026050 15841
## 5 Sen. Patty Murray [D-WA] N00007876 53358
## 6 Sen. Charles “Chuck” Schumer [D-NY] Chuck N00001093 26976
## sortname twitterid youtubeid
## 1 Crapo, Michael “Mike” (Sen.) [R-ID] MikeCrapo senatorcrapo
## 2 Grassley, Charles “Chuck” (Sen.) [R-IA] ChuckGrassley senchuckgrassley
## 3 Leahy, Patrick (Sen.) [D-VT] SenatorLeahy SenatorPatrickLeahy
## 4 Murkowski, Lisa (Sen.) [R-AK] LisaMurkowski senatormurkowski
## 5 Murray, Patty (Sen.) [D-WA] PattyMurray SenatorPattyMurray
## 6 Schumer, Charles “Chuck” (Sen.) [D-NY] SenSchumer SenatorSchumer
4.3 Excel files: readxl
The readxl library (part of tidyverse) allows us to load excel files into tibbles:
library(readxl)To see an example, you can use the retail.xlsx file:
d = read_excel("../data/retail.xlsx")
d %>% head## # A tibble: 6 × 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 536365 85123A WHITE HANGING HEAR… 6 2010-12-01 08:26:00 2.55
## 2 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39
## 3 536365 84406B CREAM CUPID HEARTS… 8 2010-12-01 08:26:00 2.75
## 4 536365 84029G KNITTED UNION FLAG… 6 2010-12-01 08:26:00 3.39
## 5 536365 84029E RED WOOLLY HOTTIE … 6 2010-12-01 08:26:00 3.39
## 6 536365 22752 SET 7 BABUSHKA NES… 2 2010-12-01 08:26:00 7.65
## # … with 2 more variables: CustomerID <dbl>, Country <chr>
We can apply the tidy functions we learned to manipulate the tibble as usual:
d %>% select(StockCode,InvoiceDate,UnitPrice) %>%
filter(UnitPrice > 10 & InvoiceDate > '2011-01-01') %>% head## # A tibble: 6 × 3
## StockCode InvoiceDate UnitPrice
## <chr> <dttm> <dbl>
## 1 22846 2011-01-04 10:37:00 17.0
## 2 22847 2011-01-04 10:37:00 17.0
## 3 84352 2011-01-04 10:48:00 17.0
## 4 22423 2011-01-04 10:48:00 12.8
## 5 85066 2011-01-04 11:30:00 12.8
## 6 21628 2011-01-04 11:34:00 15.0
For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu