This post demonstrates how you can use two packages from the tidyverse – purrr and readr – to read a list of CSV files into a single data frame in R. We’ll also use a newer package called fs for file system operations.
# Install packages if you need to
install.packages(c("tidyverse", "fs"))
library(tidyverse) # attaches purrr and readr
library(fs)
Problem
Suppose we have a directory of CSV files that contain data that are regularly shaped, meaning that columns in each data set are all the same, or at least the columns that overlap are the same.
We would like to read in all of the CSV files inside the directory and build one large dataset by merging each of the smaller data sets together.
Example Data
To make the example more concrete, we’ll use a dataset provided by the Government of Ireland showing the yearly number of e-referrals per hospital department, where the data from each year reside in individual CSV files.
We’ve collected the data for you, which you can download as a zipfile here, or you can download the original CSV files from data.gov.ie.
Once we’ve extracted the zip file or downloaded the CSV files a single folder, we store the location of the unzipped folder in data_dir
.
data_dir <- "ie-general-referrals-by-hospital"
We can then list the CSV files using fs::dir_ls()
.
fs::dir_ls(data_dir)
## ie-general-referrals-by-hospital/README.txt
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2015.csv
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2016.csv
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2017.csv
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2018.csv
Notice that there is an additional README.txt
file that we don’t want to import, so we limit our directory listing to just the CSV files, i.e. the files that end with .csv
.
csv_files <- fs::dir_ls(data_dir, regexp = "\\.csv$")
csv_files
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2015.csv
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2016.csv
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2017.csv
## ie-general-referrals-by-hospital/general-referrals-by-hospital-department-2018.csv
Import all the Files
Start with one file
Each of the CSV files can be individual read in using readr::read_csv()
.
Let’s read one file as an example:
readr::read_csv(csv_files[1])
## # A tibble: 837 x 6
## Month_Year Hospital_Name Hospital_ID Hospital_Department ReferralType
## <chr> <chr> <int> <chr> <chr>
## 1 Aug-15 AMNCH 1049 Paediatric ENT General Ref…
## 2 Aug-15 AMNCH 1049 Paediatric Gastroe… General Ref…
## 3 Aug-15 AMNCH 1049 Paediatric General… General Ref…
## 4 Aug-15 Bantry General… 704 Gastroenterology General Ref…
## 5 Aug-15 Bantry General… 704 General Medicine General Ref…
## 6 Aug-15 Bantry General… 704 General Surgery General Ref…
## 7 Aug-15 Bantry General… 704 Medicine for the E… General Ref…
## 8 Aug-15 Bantry General… 704 Outreach Dermatolo… General Ref…
## 9 Aug-15 Bantry General… 704 Outreach Orthopaed… General Ref…
## 10 Aug-15 Bantry General… 704 Outreach Surgical General Ref…
## # ... with 827 more rows, and 1 more variable: TotalReferrals <int>
Scale up to all files
To read all of the files in the directory, we map read_csv()
onto the list of files, using purrr::map()
.
But knowing that each list element will be a tibble
(or data.frame
) and that each data frame has the same columns, we can use purrr’s typed functions to return a single data frame containing each of the imported CSV files using purrr::map_dfr()
.
The additional _dfr()
tells purrr to return a data frame (df
) by row-binding each element together (r
).
(It’s the same as calling map() %>% bind_rows()
.)
csv_files %>%
map_dfr(read_csv)
## # A tibble: 12,278 x 6
## Month_Year Hospital_Name Hospital_ID Hospital_Department ReferralType
## <chr> <chr> <int> <chr> <chr>
## 1 Aug-15 AMNCH 1049 Paediatric ENT General Ref…
## 2 Aug-15 AMNCH 1049 Paediatric Gastroe… General Ref…
## 3 Aug-15 AMNCH 1049 Paediatric General… General Ref…
## 4 Aug-15 Bantry General… 704 Gastroenterology General Ref…
## 5 Aug-15 Bantry General… 704 General Medicine General Ref…
## 6 Aug-15 Bantry General… 704 General Surgery General Ref…
## 7 Aug-15 Bantry General… 704 Medicine for the E… General Ref…
## 8 Aug-15 Bantry General… 704 Outreach Dermatolo… General Ref…
## 9 Aug-15 Bantry General… 704 Outreach Orthopaed… General Ref…
## 10 Aug-15 Bantry General… 704 Outreach Surgical General Ref…
## # ... with 12,268 more rows, and 1 more variable: TotalReferrals <int>
Modify settings for all files
Notice that the Month_Year
column was imported as a character
instead of a date-time
.
We can modify the arguments of read_csv()
inside the call to map_dfr()
, which sets the arguments for each CSV import.
csv_files %>%
map_dfr(read_csv, col_types = cols("Month_Year" = col_date(format = "%b-%y")))
## # A tibble: 12,278 x 6
## Month_Year Hospital_Name Hospital_ID Hospital_Department ReferralType
## <date> <chr> <int> <chr> <chr>
## 1 2015-08-01 AMNCH 1049 Paediatric ENT General Ref…
## 2 2015-08-01 AMNCH 1049 Paediatric Gastroe… General Ref…
## 3 2015-08-01 AMNCH 1049 Paediatric General… General Ref…
## 4 2015-08-01 Bantry General… 704 Gastroenterology General Ref…
## 5 2015-08-01 Bantry General… 704 General Medicine General Ref…
## 6 2015-08-01 Bantry General… 704 General Surgery General Ref…
## 7 2015-08-01 Bantry General… 704 Medicine for the E… General Ref…
## 8 2015-08-01 Bantry General… 704 Outreach Dermatolo… General Ref…
## 9 2015-08-01 Bantry General… 704 Outreach Orthopaed… General Ref…
## 10 2015-08-01 Bantry General… 704 Outreach Surgical General Ref…
## # ... with 12,268 more rows, and 1 more variable: TotalReferrals <int>
Warning messages:
1: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
2: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
Fix date parsing after importing
Setting read_csv
options for all of the files almost works.
Unfortunately, the format of the Month_Year
column changed in 2016 from Jan-15
to Jan-2016
, so the Month_Year
column is NA
for the years 2016 and 2017.
To fix this, the easiest solution is to import as a character and use lubridate to parse the date-times, as readr’s col_date()
format only allows for one format.
library(lubridate)
csv_files %>%
map_dfr(read_csv) %>%
mutate(Month_Year = myd(Month_Year, truncated = 1))
## # A tibble: 12,278 x 6
## Month_Year Hospital_Name Hospital_ID Hospital_Department ReferralType
## <date> <chr> <int> <chr> <chr>
## 1 2015-08-01 AMNCH 1049 Paediatric ENT General Ref…
## 2 2015-08-01 AMNCH 1049 Paediatric Gastroe… General Ref…
## 3 2015-08-01 AMNCH 1049 Paediatric General… General Ref…
## 4 2015-08-01 Bantry General… 704 Gastroenterology General Ref…
## 5 2015-08-01 Bantry General… 704 General Medicine General Ref…
## 6 2015-08-01 Bantry General… 704 General Surgery General Ref…
## 7 2015-08-01 Bantry General… 704 Medicine for the E… General Ref…
## 8 2015-08-01 Bantry General… 704 Outreach Dermatolo… General Ref…
## 9 2015-08-01 Bantry General… 704 Outreach Orthopaed… General Ref…
## 10 2015-08-01 Bantry General… 704 Outreach Surgical General Ref…
## # ... with 12,268 more rows, and 1 more variable: TotalReferrals <int>
Add a source indicator
Finally, we may sometimes wish to include an indicator variable that tracks the source of each row, for example to keep track of the file from which the data was collected.
To do this, we set the .id
parameter of map_dfr()
to the name of the indicator variable, and the source file name will be included in the final data set.
csv_files %>%
map_dfr(read_csv, .id = "source") %>%
mutate(Month_Year = myd(Month_Year, truncated = 1))
## # A tibble: 12,278 x 7
## source Month_Year Hospital_Name Hospital_ID Hospital_Departm…
## <chr> <date> <chr> <int> <chr>
## 1 ie-general-ref… 2015-08-01 AMNCH 1049 Paediatric ENT
## 2 ie-general-ref… 2015-08-01 AMNCH 1049 Paediatric Gastr…
## 3 ie-general-ref… 2015-08-01 AMNCH 1049 Paediatric Gener…
## 4 ie-general-ref… 2015-08-01 Bantry Genera… 704 Gastroenterology
## 5 ie-general-ref… 2015-08-01 Bantry Genera… 704 General Medicine
## 6 ie-general-ref… 2015-08-01 Bantry Genera… 704 General Surgery
## 7 ie-general-ref… 2015-08-01 Bantry Genera… 704 Medicine for the…
## 8 ie-general-ref… 2015-08-01 Bantry Genera… 704 Outreach Dermato…
## 9 ie-general-ref… 2015-08-01 Bantry Genera… 704 Outreach Orthopa…
## 10 ie-general-ref… 2015-08-01 Bantry Genera… 704 Outreach Surgical
## # ... with 12,268 more rows, and 2 more variables: ReferralType <chr>,
## # TotalReferrals <int>
Summary
This post provides one small example of how you can combine fs, purrr, and readr to find data files in a directory and read them all together into a single data frame in R, ready for analysis or processing with tools from the tidyverse.
Here’s the code we used, all in one place.
data_dir %>%
dir_ls(regexp = "\\.csv$") %>%
map_dfr(read_csv, .id = "source") %>%
mutate(Month_Year = myd(Month_Year, truncated = 1))
We use dir_ls()
to list the files in our data directory, data_dir
, and then combine map_dfr()
and read_csv()
to read in the CSV files we found.
We use the .id
argument of map_dfr()
to track the source of each row in the final data frame, which we then begin to process using dplyr
.
Keep in mind, this same pattern can easily be used with other data types, such as SAS or Excel files, with just a few modifications – like replacing the regexp
and using haven::read_sas()
or readxl::read_xlsx()
in place of read_csv
.
In future posts, we’ll also look at other ways we merge a folder of data files besides “stapling” them together row-wise with map_dfr()
.