4 Data formats
In this chapter we will explore datasets stored in different data formats, such as csv
, json
, and xls
.
4.1 Loading CSV files into tibbles
A CSV file is a file with comma-separated values.
Function read_csv
of the package readr
allows us to load CSV files into tibbles. The package readr
is included in the set of tidyverse
packages. Recall (see Section 1.7) that in order to use functions from a package we first need to load the package by running library()
:
library(tidyverse)
Now we will use the function read_csv()
to load the data file yelp.csv
which is stored in our data directory (see Section 2.1.2):
= read_csv("../data/yelp.csv")
t t
## # A tibble: 2,529 × 25
## avg_stars median_stars std_stars avg_cool median_cool std_cool avg_useful
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 5 0 7 7 4.06 8.4
## 2 4.2 4 0.837 0.8 1 0.837 0.8
## 3 4.6 5 0.548 3.6 3 1.34 6.2
## 4 4 4 1.22 0.2 0 0.447 1.2
## 5 4 4 0.707 0.6 0 0.894 1.2
## 6 3.2 3 0.837 0.8 1 0.837 2.8
## 7 3.2 3 0.837 0.6 0 0.894 0.6
## 8 3 3 1 0.4 0 0.548 0.4
## 9 4.8 5 0.447 0.4 0 0.548 1
## 10 4.4 5 0.894 0.4 0 0.894 0.4
## # … with 2,519 more rows, and 18 more variables: median_useful <dbl>,
## # std_useful <dbl>, avg_funny <dbl>, median_funny <dbl>, std_funny <dbl>,
## # Take.out <lgl>, Caters <lgl>, Takes.Reservations <lgl>, Delivery <lgl>,
## # Has.TV <lgl>, Outdoor.Seating <lgl>, Alcohol <chr>, Waiter.Service <lgl>,
## # Accepts.Credit.Cards <lgl>, Good.for.Kids <lgl>, Good.For.Groups <lgl>,
## # Price.Range <dbl>, top10percent <dbl>
4.2 Loading JSON files into tibbles
JSON format uses two structures:
objects
andarrays
. An object is an un-ordered set ofname-value
pairs (AKAkey-value
pairs). We can define a JSON object by starting with a curly bracket{
, typing thename
, followed by a colon:
and then thevalue
. We can separate multiple name-value pairs with acomma
. A JSONarray
is an ordered sequence of values. We can define anarray
by opening a left bracket[
, and then separate the values by commas. Note that avalue
can be a string, a number, a logical value, or an object, which essentially allows for having nestedname-value
pair objects.
To load JSON files into tibbles, we will use the function fromJSON
from the package jsonlite
.
If the package jsonlite
is not already installed in your machine, you might first need to install it:
install.packages("jsonlite")
Then, we can load it into our working environment:
library(jsonlite)
We can now use the function fromJSON
to load the json file into a dataframe. The as_tibble
function customizes the dataframe into a tibble.
For instance, we can read the senators.json
file as follows (taken from here: https://www.govtrack.us/api/v2/role?current=true&role_type=senator)
= fromJSON("../data/senators.json") d
Tip: When loading an object that you do not know how it looks and what fields it stores, you can call the function names()
as follows:
names(d)
## [1] "meta" "objects"
We can customize the result in to a tibble by accessing the objects
field of object d
as follows:
=as_tibble(d$objects)
t %>% head t
## # A tibble: 6 × 22
## caucus congress_numbers current description district enddate extra$address
## <chr> <list> <lgl> <chr> <lgl> <chr> <chr>
## 1 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 239 Dirksen Se…
## 2 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 135 Hart Senat…
## 3 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 437 Russell Se…
## 4 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 522 Hart Senat…
## 5 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 154 Russell Se…
## 6 <NA> <int [3]> TRUE Senior Senat… NA 2023-0… 322 Hart Senat…
## # … with 15 more variables: leadership_title <chr>, party <chr>,
## # person <df[,17]>, phone <chr>, role_type <chr>, role_type_label <chr>,
## # senator_class <chr>, senator_class_label <chr>, senator_rank <chr>,
## # senator_rank_label <chr>, startdate <chr>, state <chr>, title <chr>,
## # title_long <chr>, website <chr>
Recall that for most practical purposes, a dataframe is the same object as a tible. We customize a dataframe into a tibble with the
as_tibble
function for consistency.
And if we want more information, we can access the column person
:
$person %>% head t
## bioguideid birthday cspanid firstname gender gender_label lastname
## 1 C000880 1951-05-20 26440 Michael male Male Crapo
## 2 G000386 1933-09-17 1167 Charles male Male Grassley
## 3 L000174 1940-03-31 1552 Patrick male Male Leahy
## 4 M001153 1957-05-22 1004138 Lisa female Female Murkowski
## 5 M001111 1950-10-11 25277 Patty female Female Murray
## 6 S000148 1950-11-23 5929 Charles male Male Schumer
## link middlename
## 1 https://www.govtrack.us/congress/members/michael_crapo/300030 D.
## 2 https://www.govtrack.us/congress/members/charles_grassley/300048 E.
## 3 https://www.govtrack.us/congress/members/patrick_leahy/300065 J.
## 4 https://www.govtrack.us/congress/members/lisa_murkowski/300075 A.
## 5 https://www.govtrack.us/congress/members/patty_murray/300076
## 6 https://www.govtrack.us/congress/members/charles_schumer/300087 E.
## name namemod nickname osid pvsid
## 1 Sen. Michael “Mike” Crapo [R-ID] Mike N00006267 26830
## 2 Sen. Charles “Chuck” Grassley [R-IA] Chuck N00001758 53293
## 3 Sen. Patrick Leahy [D-VT] N00009918 53353
## 4 Sen. Lisa Murkowski [R-AK] N00026050 15841
## 5 Sen. Patty Murray [D-WA] N00007876 53358
## 6 Sen. Charles “Chuck” Schumer [D-NY] Chuck N00001093 26976
## sortname twitterid youtubeid
## 1 Crapo, Michael “Mike” (Sen.) [R-ID] MikeCrapo senatorcrapo
## 2 Grassley, Charles “Chuck” (Sen.) [R-IA] ChuckGrassley senchuckgrassley
## 3 Leahy, Patrick (Sen.) [D-VT] SenatorLeahy SenatorPatrickLeahy
## 4 Murkowski, Lisa (Sen.) [R-AK] LisaMurkowski senatormurkowski
## 5 Murray, Patty (Sen.) [D-WA] PattyMurray SenatorPattyMurray
## 6 Schumer, Charles “Chuck” (Sen.) [D-NY] SenSchumer SenatorSchumer
4.3 Excel files: readxl
The readxl
library (part of tidyverse
) allows us to load excel files into tibbles:
library(readxl)
To see an example, you can use the retail.xlsx
file:
= read_excel("../data/retail.xlsx")
d %>% head d
## # A tibble: 6 × 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 536365 85123A WHITE HANGING HEAR… 6 2010-12-01 08:26:00 2.55
## 2 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39
## 3 536365 84406B CREAM CUPID HEARTS… 8 2010-12-01 08:26:00 2.75
## 4 536365 84029G KNITTED UNION FLAG… 6 2010-12-01 08:26:00 3.39
## 5 536365 84029E RED WOOLLY HOTTIE … 6 2010-12-01 08:26:00 3.39
## 6 536365 22752 SET 7 BABUSHKA NES… 2 2010-12-01 08:26:00 7.65
## # … with 2 more variables: CustomerID <dbl>, Country <chr>
We can apply the tidy functions we learned to manipulate the tibble as usual:
%>% select(StockCode,InvoiceDate,UnitPrice) %>%
d filter(UnitPrice > 10 & InvoiceDate > '2011-01-01') %>% head
## # A tibble: 6 × 3
## StockCode InvoiceDate UnitPrice
## <chr> <dttm> <dbl>
## 1 22846 2011-01-04 10:37:00 17.0
## 2 22847 2011-01-04 10:37:00 17.0
## 3 84352 2011-01-04 10:48:00 17.0
## 4 22423 2011-01-04 10:48:00 12.8
## 5 85066 2011-01-04 11:30:00 12.8
## 6 21628 2011-01-04 11:34:00 15.0
For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu