Data import with readr

Prerequisites

Load flat files in R with the readr package, which is part of the core tidyverse.

library(tidyverse)

Getting started

Most of readr’s functions are concerned with turning flat files into data frames:

Function File format read
read_csv() comma delimited files
read_csv2() semicolon separated files (common in countries where , is used as the decimal place)
read_tsv() tab delimited files
read_delim() files with any delimiter
read_fwf() fixed width files
read_table() files where columns are separated by white space
read_log() Apache style log files

These functions all have similar syntax.

heights <- read_csv("data/heights.csv")
## Rows: 1192 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can also supply an inline csv file.

read_csv("a,b,c
1,2,3
4,5,6")
## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.

read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)
## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

read_csv("1,2,3\n4,5,6", col_names = FALSE)
## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Alternatively you can pass col_names a character vector which will be used as the column names:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## lgl (1): c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 3
##       a     b c    
##   <dbl> <dbl> <lgl>
## 1     1     2 NA

Compared to base R

  • They are typically much faster (~10x).

Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread().

  • They produce tibbles, they don’t convert character vectors to factors.

Parsing a vector

parse_*() functions take a character vector and return a more specialised vector like a logical, integer, or date:

str(parse_logical(c("TRUE", "FALSE", "NA")))
##  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
##  int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
##  Date[1:2], format: "2010-01-01" "1979-10-14"

parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing:

parse_integer(c("1", "231", ".", "456"), na = ".")
## [1]   1 231  NA 456

If parsing fails, you’ll get a warning:

x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning: 2 parsing failures.
## row col               expected actual
##   3  -- no trailing characters abc   
##   4  -- no trailing characters 123.45

And the failures will be missing in the output:

x
## [1] 123 345  NA  NA
## attr(,"problems")
## # A tibble: 2 × 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA no trailing characters abc   
## 2     4    NA no trailing characters 123.45

If there are many parsing failures, you’ll need to use problems() to get the complete set.

problems(x)
## # A tibble: 2 × 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA no trailing characters abc   
## 2     4    NA no trailing characters 123.45

There are eight particularly important parsers:

Functions Job
parse_logical() parse logicals
parse_integer(), parse_double() , parse_double() parse numbers
parse_character() parse characters
parse_datetime(), parse_date(), and parse_time() parse various date & time specifications

Numbers

It seems like it should be straightforward to parse a number, but three problems make it tricky:

  1. People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

  2. Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.

  3. Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”.

When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument:

parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
  • parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20

The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”:

# Used in America
parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789

Factors

R uses factors to represent categorical variables that have a known set of possible values. Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present:

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row col           expected   actual
##   3  -- value in level set bananana
## [1] apple  banana <NA>  
## attr(,"problems")
## # A tibble: 1 × 4
##     row   col expected           actual  
##   <int> <int> <chr>              <chr>   
## 1     3    NA value in level set bananana
## Levels: apple banana

Dates, date-times, and times

You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:

  • parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

    parse_datetime("2010-10-01T2010")
    ## [1] "2010-10-01 20:10:00 UTC"
    # If time is omitted, it will be set to midnight
    parse_datetime("20101010")
    ## [1] "2010-10-10 UTC"
  • parse_date() expects a four digit year, a - or /, the month, a - or /, then the day:

    parse_date("2010-10-01")
    ## [1] "2010-10-01"
  • parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:

    library(hms)
    parse_time("01:10 am")
    ## 01:10:00
    parse_time("20:10:01")
    ## 20:10:01

    Base R doesn’t have a great built in class for time data, so we use the one provided in the hms package.

If these defaults don’t work for your data you can supply your own date-time format, built up of the following pieces:

Year

%Y (4 digits).

%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

Month

%m (2 digits).

%b (abbreviated name, like “Jan”).

%B (full name, “January”).

Day

%d (2 digits).

%e (optional leading space).

Time

%H 0-23 hour.

%I 0-12, must be used with %p.

%p AM/PM indicator.

%M minutes.

%S integer seconds.

%OS real seconds.

%Z Time zone (as name, e.g. America/Chicago). Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We’ll come back to this [time zones].

%z (as offset from UTC, e.g. +0800).

Non-digits

%. skips one non-digit character.

%* skips any number of non-digits.

The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions. For example:

parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"

If you’re using %b or %B with non-English month names, you’ll need to set the lang argument to locale(). See the list of built-in languages in date_names_langs(), or if your language is not already included, create your own with date_names().

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"

Writing to a file

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back incorrectly by:

  • Always encoding strings in UTF-8.

  • Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

If you want to export a csv file to Excel, use write_excel_csv().

The most important arguments are x (the data frame to save), and path (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.

challenge <- read_csv("data/heights.csv")
write_csv(challenge, "challenge.csv")

Note that the type information is lost when you save to csv:

challenge <- read_csv("data/heights.csv")
## Rows: 1192 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
## Rows: 1192 Columns: 6
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,192 × 6
##     earn height sex       ed   age race    
##    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
##  1 50000   74.4 male      16    45 white   
##  2 60000   65.5 female    16    58 white   
##  3 30000   63.6 female    16    29 white   
##  4 50000   63.1 female    16    91 other   
##  5 51000   63.4 female    17    39 white   
##  6  9000   64.4 female    15    26 white   
##  7 29000   61.7 female    12    49 white   
##  8 32000   72.7 male      17    46 white   
##  9  2000   72.0 male      15    21 hispanic
## 10 27000   72.2 male      12    26 white   
## # … with 1,182 more rows

There are two alternatives:

  1. write_rds() and read_rds() are uniform wrappers around the base unctions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS:

    write_rds(challenge, "challenge.rds")
    read_rds("challenge.rds")
    ## # A tibble: 1,192 × 6
    ##     earn height sex       ed   age race    
    ##    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
    ##  1 50000   74.4 male      16    45 white   
    ##  2 60000   65.5 female    16    58 white   
    ##  3 30000   63.6 female    16    29 white   
    ##  4 50000   63.1 female    16    91 other   
    ##  5 51000   63.4 female    17    39 white   
    ##  6  9000   64.4 female    15    26 white   
    ##  7 29000   61.7 female    12    49 white   
    ##  8 32000   72.7 male      17    46 white   
    ##  9  2000   72.0 male      15    21 hispanic
    ## 10 27000   72.2 male      12    26 white   
    ## # … with 1,182 more rows
  2. The feather package implements a fast binary file format that can be shared across programming languages:

    library(feather)
    write_feather(challenge, "challenge.feather")
    read_feather("challenge.feather")
    #> # A tibble: 2,000 x 2
    #>       x      y
    #>   <dbl> <date>
    #> 1   404   <NA>
    #> 2  4172   <NA>
    #> 3  3004   <NA>
    #> 4   787   <NA>
    #> 5    37   <NA>
    #> 6  2332   <NA>
    #> # ... with 1,994 more rows

Feather tends to be faster than RDS.


Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:

  • haven reads SPSS, Stata, and SAS files.

  • readxl reads excel files (both .xls and .xlsx).

  • DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

For hierarchical data: use jsonlite (by Jeroen Ooms) for json, and xml2 for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/examples.html.

For other file types, try the R data import/export manual and the rio package.

A work by Matteo Cereda and Fabio Iannelli