4 Data formats

In this chapter we will explore datasets stored in different data formats, such as csv, json, and xls.

4.1 Loading CSV files into tibbles

A CSV file is a file with comma-separated values.

Function read_csv of the package readr allows us to load CSV files into tibbles. The package readr is included in the set of tidyverse packages. Recall (see Section 1.7) that in order to use functions from a package we first need to load the package by running library():

library(tidyverse)

Now we will use the function read_csv() to load the data file yelp.csv which is stored in our data directory (see Section 2.1.2):

t = read_csv("../data/yelp.csv")
t
## # A tibble: 2,529 × 25
##    avg_stars median_stars std_stars avg_cool median_cool std_cool avg_useful
##        <dbl>        <dbl>     <dbl>    <dbl>       <dbl>    <dbl>      <dbl>
##  1       5              5     0          7             7    4.06         8.4
##  2       4.2            4     0.837      0.8           1    0.837        0.8
##  3       4.6            5     0.548      3.6           3    1.34         6.2
##  4       4              4     1.22       0.2           0    0.447        1.2
##  5       4              4     0.707      0.6           0    0.894        1.2
##  6       3.2            3     0.837      0.8           1    0.837        2.8
##  7       3.2            3     0.837      0.6           0    0.894        0.6
##  8       3              3     1          0.4           0    0.548        0.4
##  9       4.8            5     0.447      0.4           0    0.548        1  
## 10       4.4            5     0.894      0.4           0    0.894        0.4
## # … with 2,519 more rows, and 18 more variables: median_useful <dbl>,
## #   std_useful <dbl>, avg_funny <dbl>, median_funny <dbl>, std_funny <dbl>,
## #   Take.out <lgl>, Caters <lgl>, Takes.Reservations <lgl>, Delivery <lgl>,
## #   Has.TV <lgl>, Outdoor.Seating <lgl>, Alcohol <chr>, Waiter.Service <lgl>,
## #   Accepts.Credit.Cards <lgl>, Good.for.Kids <lgl>, Good.For.Groups <lgl>,
## #   Price.Range <dbl>, top10percent <dbl>

4.2 Loading JSON files into tibbles

JSON format uses two structures: objects and arrays. An object is an un-ordered set of name-value pairs (AKA key-value pairs). We can define a JSON object by starting with a curly bracket {, typing the name, followed by a colon : and then the value. We can separate multiple name-value pairs with a comma. A JSON array is an ordered sequence of values. We can define an array by opening a left bracket [, and then separate the values by commas. Note that a value can be a string, a number, a logical value, or an object, which essentially allows for having nested name-value pair objects.

To load JSON files into tibbles, we will use the function fromJSON from the package jsonlite.

If the package jsonlite is not already installed in your machine, you might first need to install it:

install.packages("jsonlite")

Then, we can load it into our working environment:

library(jsonlite)

We can now use the function fromJSON to load the json file into a dataframe. The as_tibble function customizes the dataframe into a tibble.

For instance, we can read the senators.json file as follows (taken from here: https://www.govtrack.us/api/v2/role?current=true&role_type=senator)

d = fromJSON("../data/senators.json")

Tip: When loading an object that you do not know how it looks and what fields it stores, you can call the function names() as follows:

names(d)
## [1] "meta"    "objects"

We can customize the result in to a tibble by accessing the objects field of object d as follows:

t =as_tibble(d$objects)
t %>% head 
## # A tibble: 6 × 22
##   caucus congress_numbers current description   district enddate extra$address  
##   <chr>  <list>           <lgl>   <chr>         <lgl>    <chr>   <chr>          
## 1 <NA>   <int [3]>        TRUE    Senior Senat… NA       2023-0… 239 Dirksen Se…
## 2 <NA>   <int [3]>        TRUE    Senior Senat… NA       2023-0… 135 Hart Senat…
## 3 <NA>   <int [3]>        TRUE    Senior Senat… NA       2023-0… 437 Russell Se…
## 4 <NA>   <int [3]>        TRUE    Senior Senat… NA       2023-0… 522 Hart Senat…
## 5 <NA>   <int [3]>        TRUE    Senior Senat… NA       2023-0… 154 Russell Se…
## 6 <NA>   <int [3]>        TRUE    Senior Senat… NA       2023-0… 322 Hart Senat…
## # … with 15 more variables: leadership_title <chr>, party <chr>,
## #   person <df[,17]>, phone <chr>, role_type <chr>, role_type_label <chr>,
## #   senator_class <chr>, senator_class_label <chr>, senator_rank <chr>,
## #   senator_rank_label <chr>, startdate <chr>, state <chr>, title <chr>,
## #   title_long <chr>, website <chr>

Recall that for most practical purposes, a dataframe is the same object as a tible. We customize a dataframe into a tibble with the as_tibble function for consistency.

And if we want more information, we can access the column person:

t$person %>% head
##   bioguideid   birthday cspanid firstname gender gender_label  lastname
## 1    C000880 1951-05-20   26440   Michael   male         Male     Crapo
## 2    G000386 1933-09-17    1167   Charles   male         Male  Grassley
## 3    L000174 1940-03-31    1552   Patrick   male         Male     Leahy
## 4    M001153 1957-05-22 1004138      Lisa female       Female Murkowski
## 5    M001111 1950-10-11   25277     Patty female       Female    Murray
## 6    S000148 1950-11-23    5929   Charles   male         Male   Schumer
##                                                               link middlename
## 1    https://www.govtrack.us/congress/members/michael_crapo/300030         D.
## 2 https://www.govtrack.us/congress/members/charles_grassley/300048         E.
## 3    https://www.govtrack.us/congress/members/patrick_leahy/300065         J.
## 4   https://www.govtrack.us/congress/members/lisa_murkowski/300075         A.
## 5     https://www.govtrack.us/congress/members/patty_murray/300076           
## 6  https://www.govtrack.us/congress/members/charles_schumer/300087         E.
##                                   name namemod nickname      osid pvsid
## 1     Sen. Michael “Mike” Crapo [R-ID]             Mike N00006267 26830
## 2 Sen. Charles “Chuck” Grassley [R-IA]            Chuck N00001758 53293
## 3            Sen. Patrick Leahy [D-VT]                  N00009918 53353
## 4           Sen. Lisa Murkowski [R-AK]                  N00026050 15841
## 5             Sen. Patty Murray [D-WA]                  N00007876 53358
## 6  Sen. Charles “Chuck” Schumer [D-NY]            Chuck N00001093 26976
##                                  sortname     twitterid           youtubeid
## 1     Crapo, Michael “Mike” (Sen.) [R-ID]     MikeCrapo        senatorcrapo
## 2 Grassley, Charles “Chuck” (Sen.) [R-IA] ChuckGrassley    senchuckgrassley
## 3            Leahy, Patrick (Sen.) [D-VT]  SenatorLeahy SenatorPatrickLeahy
## 4           Murkowski, Lisa (Sen.) [R-AK] LisaMurkowski    senatormurkowski
## 5             Murray, Patty (Sen.) [D-WA]   PattyMurray  SenatorPattyMurray
## 6  Schumer, Charles “Chuck” (Sen.) [D-NY]    SenSchumer      SenatorSchumer

4.3 Excel files: readxl

The readxl library (part of tidyverse) allows us to load excel files into tibbles:

library(readxl)

To see an example, you can use the retail.xlsx file:

d = read_excel("../data/retail.xlsx")
d %>%  head
## # A tibble: 6 × 8
##   InvoiceNo StockCode Description         Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>                  <dbl> <dttm>                  <dbl>
## 1 536365    85123A    WHITE HANGING HEAR…        6 2010-12-01 08:26:00      2.55
## 2 536365    71053     WHITE METAL LANTERN        6 2010-12-01 08:26:00      3.39
## 3 536365    84406B    CREAM CUPID HEARTS…        8 2010-12-01 08:26:00      2.75
## 4 536365    84029G    KNITTED UNION FLAG…        6 2010-12-01 08:26:00      3.39
## 5 536365    84029E    RED WOOLLY HOTTIE …        6 2010-12-01 08:26:00      3.39
## 6 536365    22752     SET 7 BABUSHKA NES…        2 2010-12-01 08:26:00      7.65
## # … with 2 more variables: CustomerID <dbl>, Country <chr>

We can apply the tidy functions we learned to manipulate the tibble as usual:

d %>% select(StockCode,InvoiceDate,UnitPrice)  %>% 
        filter(UnitPrice > 10 & InvoiceDate > '2011-01-01') %>%  head
## # A tibble: 6 × 3
##   StockCode InvoiceDate         UnitPrice
##   <chr>     <dttm>                  <dbl>
## 1 22846     2011-01-04 10:37:00      17.0
## 2 22847     2011-01-04 10:37:00      17.0
## 3 84352     2011-01-04 10:48:00      17.0
## 4 22423     2011-01-04 10:48:00      12.8
## 5 85066     2011-01-04 11:30:00      12.8
## 6 21628     2011-01-04 11:34:00      15.0

For comments, suggestions, errors, and typos, please email me at: