This post explores where to expect efficiency gains when using the new dtplyr to import and manipulate large flat files.
# Install packages if you need to
install.packages(c("data.table", "dtplyr", "tidyverse", "microbenchmark"))
library(data.table)
library(dtplyr)
library(tidyverse)
library(microbenchmark)
Problem
This week, we got an exciting announcement from Hadley Wickham: a big dtplyr release!
When dealing with large flat files, I have often resorted to data.table’s fread function, which is a very fast alternative to readr’s read_csv. Unfortunately, I’m not too comfortable with data.table syntax for data munging, so I have a few ugly pipelines lying around where I mash the output of fread into some tibble-ish format that accepts dplyr verbs. In this setting, dtplyr feels like the idyllic solution, but, being a lesser mortal than Hadley, I’ve had trouble connecting all the dots.
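For concreteness, here’s a minimal sketch of the kind of pipeline I mean (the file, object, and column names are just placeholders): read quickly with fread, then immediately convert the whole table to a tibble so that dplyr verbs apply.
# Old habit: fast read with fread, then convert the ENTIRE table to a tibble
# before doing any dplyr work (file and column names below are placeholders)
dat_old <- fread("some-large-file.csv") %>%
  as_tibble()
dat_old %>%
  group_by(state) %>%
  count(sort = TRUE)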
Specific questions:
- Does dtplyr let me avoid fread altogether? (Spoiler: Not really; that’s not dtplyr’s purpose.)
- If not, does the main dtplyr function lazy_dt still give me efficiency gains when I’ve loaded something from fread? (Spoiler: Absolutely, that is the point. See the quick illustration after this list.)
- Does lazy_dt help when I’ve loaded something fully into memory via readr? (Spoiler: No.)
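To make the last two questions concrete, it helps to see what lazy_dt() actually does: it records your dplyr verbs and translates them into a single data.table call, which only executes when you collect the result (e.g. with as_tibble()). Here’s a quick, self-contained sketch on a built-in dataset, just to show the translation; it’s not part of the benchmarks below.
# lazy_dt() computes nothing yet; it just accumulates a data.table translation
mtcars %>%
  lazy_dt() %>%
  group_by(cyl) %>%
  count(sort = TRUE) %>%
  show_query()  # prints the data.table code that as_tibble() would trigger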
Example Data
To illustrate, we’ll use a modest 150MB csv dataset provided by the Gun Violence Archive and available on Kaggle, which records over 260k gun violence incidents in the US between 2013 and 2018. Note that we don’t repost the data here, in accordance with its use agreement; if you’d like to reproduce the results below, please download the csv via the above link and stuff it into your working directory.
All we’ll do below is load the data, group by state, and print a sorted count of incidents. For each strategy, we’ll keep track of compute time.
Benchmarks
Using read_csv
Here’s the traditional strategy: use read_csv to load the data, and do the usual group_by() %>% count() work.
microbenchmark(
read_csv("gun-violence-data_01-2013_03-2018.csv", progress = FALSE) %>%
group_by(state) %>%
count(sort = TRUE) %>%
print(),
times = 1,
unit = "s"
)$time/1e9
## # A tibble: 51 x 2
## # Groups: state [51]
## state n
## <chr> <int>
## 1 Illinois 17556
## 2 California 16306
## 3 Florida 15029
## 4 Texas 13577
## 5 Ohio 10244
## 6 New York 9712
## 7 Pennsylvania 8929
## 8 Georgia 8925
## 9 North Carolina 8739
## 10 Louisiana 8103
## # … with 41 more rows
## [1] 3.373726
Using fread
Here’s the same result, but loading with fread. The cool part here, brought to us by dtplyr, is that we never have to convert the large data.table into a tibble to use the group_by() %>% count() verbs; we simply wrap it in lazy_dt() and call as_tibble() only on the much smaller results table for printing.
microbenchmark(
fread("gun-violence-data_01-2013_03-2018.csv") %>%
lazy_dt() %>%
group_by(state) %>%
count(sort = TRUE) %>%
as_tibble() %>%
print(),
times = 1,
unit = "s"
)$time/1e9
## # A tibble: 51 x 2
## state n
## <chr> <int>
## 1 Illinois 17556
## 2 California 16306
## 3 Florida 15029
## 4 Texas 13577
## 5 Ohio 10244
## 6 New York 9712
## 7 Pennsylvania 8929
## 8 Georgia 8925
## 9 North Carolina 8739
## 10 Louisiana 8103
## # … with 41 more rows
## [1] 1.161118
This method is nearly three times as fast!
What about objects already in memory?
Maybe the above performance gain isn’t that surprising: a lot of the above boost is likely due to speed improvements from fread, which we already knew about. Does lazy_dt() still save us time when the data are already in memory?
Here, we load with read_csv and store the result as the tibble dat_readr. Then, we do the group_by() %>% count() 100 times.
dat_readr <- read_csv("gun-violence-data_01-2013_03-2018.csv", progress = FALSE)
microbenchmark(
dat_readr %>%
group_by(state) %>%
count(sort = TRUE),
times = 100
)
## Unit: milliseconds
## expr min lq
## dat_readr %>% group_by(state) %>% count(sort = TRUE) 9.454512 10.20473
## mean median uq max neval
## 11.15166 10.52119 11.14279 53.51589 100
Here, we use the same dat_readr object, but cast it to lazy_dt before doing the group_by() %>% count() 100 times.
microbenchmark(
dat_readr %>%
lazy_dt() %>%
group_by(state) %>%
count(sort = TRUE),
times = 100
)
## Unit: milliseconds
## expr min
## dat_readr %>% lazy_dt() %>% group_by(state) %>% count(sort = TRUE) 10.64056
## lq mean median uq max neval
## 11.99145 49.00027 14.5085 83.51196 243.6867 100
This second approach is actually slower, which totally made sense to me once I saw the answer: why would taking extra steps to lazily evaluate something already in memory be faster? Doh!
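My hunch (an assumption on my part, not something I profiled carefully) is that much of the overhead is the tibble-to-data.table conversion that lazy_dt() has to perform before it can translate any verbs. A rough way to check on your own machine:
# Rough check: time the tibble -> data.table conversion by itself
# (lazy_dt() converts a plain data frame to a data.table before translating verbs)
microbenchmark(
  as.data.table(dat_readr),
  times = 10
)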
For completeness, here’s the same example where we store an object dat_dt using fread.
dat_dt <- fread("gun-violence-data_01-2013_03-2018.csv")
microbenchmark(
dat_dt %>%
lazy_dt() %>%
group_by(state) %>%
count(sort = TRUE) %>%
as_tibble(),
times = 100
)
## Unit: milliseconds
## expr
## dat_dt %>% lazy_dt() %>% group_by(state) %>% count(sort = TRUE) %>% as_tibble()
## min lq mean median uq max neval
## 3.85039 4.322646 5.870348 4.492541 4.892027 63.86446 100
That’s over three times faster than wrapping the tibble in lazy_dt(), and more than twice as fast as plain dplyr on the tibble. Word.