Suppose your data set has a column that contains errors in a small number of values. Additionally, you know that the location of these errors can be determined based on another column, such as an ID. Finally, you’d like to make it extremely clear in your code which values were changed and to what so that there is no confusion among your collaborators (or your future self).
For example, consider the following completely made up data containing a few issues in the sequence
column.
In short, these imaginary data capture patients’ hospital visits in which they are diagnosed with cancer.
Some patients are diagnosed with multiple cancers and the sequence
variable records in what order cancers were diagnosed.
Problems with the sequence
values can occur from errors at the time of manual data entry or through historical changes in coding standards for this variable.
Note that, while the data entries are fictitious, the problem is based on the real experiences of our group and others who use cancer registry systems.
example_data
## # A tibble: 12 x 4
## id name cancerSite sequence
## <dbl> <chr> <chr> <chr>
## 1 3839 Bernie O'Reilly Prostate 0
## 2 3839 Bernie O'Reilly Prostate 60
## 3 5643 Brionna Kuphal-Fadel Breast 99
## 4 1506 Samual Fay Prostate 1
## 5 1506 Samual Fay Bone 2
## 6 5757 Gabriela Kassulke Breast 0
## 7 5757 Gabriela Kassulke Cervical 60
## 8 6702 Ty Lynch Lung 99
## 9 2702 Abigale Senger-Schimmel Breast 1
## 10 2702 Abigale Senger-Schimmel Ovarian 2
## 11 2702 Abigale Senger-Schimmel Brain 1
## 12 3622 Regis Stracke-Bartell <NA> 0
There are three issues in the sequence
column that need to be rectified:
Two values are
99
when they should be1
(rows 3 and 8).One of the
sequence
values for Abigale Senger-Schimmel (row 11) was miscoded at the time of data entry and should be3
.Two pairs of values came from an old coding system when the values
0
and60
were used instead of1
and2
, respectively (rows 1, 2, 6, and 7).
Note that in this context, because we are changing values from the source data, we want it to be especially clear to any readers of our code (collaborators or ourselves when reporting results) which changes were made.
To do this, we’ll build on the dplyr function recode()
to create a version called recode_if()
that only applies the recoding when a secondary condition is met.
Standard recode()
The first issue can be fixed with a standard use of recode()
from dplyr.
example_data <- example_data %>%
mutate(sequence = recode(sequence, "99" = "1"))
example_data
## # A tibble: 12 x 4
## id name cancerSite sequence
## <dbl> <chr> <chr> <chr>
## 1 3839 Bernie O'Reilly Prostate 0
## 2 3839 Bernie O'Reilly Prostate 60
## 3 5643 Brionna Kuphal-Fadel Breast 1
## 4 1506 Samual Fay Prostate 1
## 5 1506 Samual Fay Bone 2
## 6 5757 Gabriela Kassulke Breast 0
## 7 5757 Gabriela Kassulke Cervical 60
## 8 6702 Ty Lynch Lung 1
## 9 2702 Abigale Senger-Schimmel Breast 1
## 10 2702 Abigale Senger-Schimmel Ovarian 2
## 11 2702 Abigale Senger-Schimmel Brain 1
## 12 3622 Regis Stracke-Bartell <NA> 0
recode()
is a vectorized version of switch()
.
In the above example, recode()
works by taking any value of sequence
equal to 99
and recoding it as 1
.
recode_if()
For the second two issues, we need to condition the value updating on a second column.
By this, we mean that we can’t use the value of sequence
directly to choose which value to update – for example, we need to change the value of sequence
when it equals 1
, but only for id == 2702 & cancerSite == "Brain"
.
To do this we introduce a simple function called recode_if()
that provides a wrapper around if_else()
and recode()
.
recode_if <- function(x, condition, ...) {
if_else(condition, recode(x, ...), x)
}
Then we apply this function to change the value of sequence
to 3
for the person with id == 2702 & cancerSite == "Brain"
.
example_data <- example_data %>%
mutate(sequence = recode_if(sequence, id == 2702 & cancerSite == "Brain", "1" = "3"))
example_data
## # A tibble: 12 x 4
## id name cancerSite sequence
## <dbl> <chr> <chr> <chr>
## 1 3839 Bernie O'Reilly Prostate 0
## 2 3839 Bernie O'Reilly Prostate 60
## 3 5643 Brionna Kuphal-Fadel Breast 1
## 4 1506 Samual Fay Prostate 1
## 5 1506 Samual Fay Bone 2
## 6 5757 Gabriela Kassulke Breast 0
## 7 5757 Gabriela Kassulke Cervical 60
## 8 6702 Ty Lynch Lung 1
## 9 2702 Abigale Senger-Schimmel Breast 1
## 10 2702 Abigale Senger-Schimmel Ovarian 2
## 11 2702 Abigale Senger-Schimmel Brain 3
## 12 3622 Regis Stracke-Bartell <NA> 0
And finally, we correct the historical uses of 0
and 60
in the sequence
variable using recode_if()
.
example_data <- example_data %>%
mutate(sequence = recode_if(sequence, !is.na(cancerSite), "0" = "1", "60" = "2"))
example_data
## # A tibble: 12 x 4
## id name cancerSite sequence
## <dbl> <chr> <chr> <chr>
## 1 3839 Bernie O'Reilly Prostate 1
## 2 3839 Bernie O'Reilly Prostate 2
## 3 5643 Brionna Kuphal-Fadel Breast 1
## 4 1506 Samual Fay Prostate 1
## 5 1506 Samual Fay Bone 2
## 6 5757 Gabriela Kassulke Breast 1
## 7 5757 Gabriela Kassulke Cervical 2
## 8 6702 Ty Lynch Lung 1
## 9 2702 Abigale Senger-Schimmel Breast 1
## 10 2702 Abigale Senger-Schimmel Ovarian 2
## 11 2702 Abigale Senger-Schimmel Brain 3
## 12 3622 Regis Stracke-Bartell <NA> 0
Comparison
The example here was contrived, but it’s not uncommon to need to change the value in a single row/column pair in a data frame during cleaning.
recode()
and recode_if()
are two methods that are useful, but there are others.
In our opinion, the explict mapping of old values to new values in recode()
and recode_if()
makes the code clearer and easier to understand from a distance.
Here’s the full method using recode()
and recode_if()
.
example_data_orig %>%
mutate(
sequence = recode(sequence, "99" = "1"),
sequence = recode_if(sequence, id == 2702 & cancerSite == "Brain", "1" = "3"),
sequence = recode_if(sequence, !is.na(cancerSite), "0" = "1", "60" = "2")
)
Another option is to use if_else()
directly
example_data_orig %>%
mutate(
sequence = if_else(sequence == "99", "1", sequence),
sequence = if_else(id == 2702 & cancerSite == "Brain", "3", sequence),
sequence = if_else(!is.na(cancerSite) & sequence == "0", "1", sequence),
sequence = if_else(!is.na(cancerSite) & sequence == "60", "2", sequence)
)
## # A tibble: 12 x 4
## id name cancerSite sequence
## <dbl> <chr> <chr> <chr>
## 1 3839 Bernie O'Reilly Prostate 1
## 2 3839 Bernie O'Reilly Prostate 2
## 3 5643 Brionna Kuphal-Fadel Breast 1
## 4 1506 Samual Fay Prostate 1
## 5 1506 Samual Fay Bone 2
## 6 5757 Gabriela Kassulke Breast 1
## 7 5757 Gabriela Kassulke Cervical 2
## 8 6702 Ty Lynch Lung 1
## 9 2702 Abigale Senger-Schimmel Breast 1
## 10 2702 Abigale Senger-Schimmel Ovarian 2
## 11 2702 Abigale Senger-Schimmel Brain 3
## 12 3622 Regis Stracke-Bartell <NA> 0
but the collection of if_else()
statements is difficult to read overall, and even moreso if the third issue is coded as a series of nested if_else()
statements.
example_data_orig %>%
mutate(
sequence = if_else(sequence == "99", "1", sequence),
sequence = if_else(id == 2702 & cancerSite == "Brain", "3", sequence),
sequence = if_else(!is.na(cancerSite),
if_else(sequence == "0", "1",
if_else(sequence == "60", "2",
sequence),
sequence),
sequence)
)
A third option is to use case_when()
, as in
example_data_orig %>%
mutate(
sequence = case_when(
sequence == "99" ~ "1",
id == 2702 & cancerSite == "Brain" ~ "3",
!is.na(cancerSite) & sequence == "0" ~ "1",
!is.na(cancerSite) & sequence == "60" ~ "2",
TRUE ~ sequence
)
)
## # A tibble: 12 x 4
## id name cancerSite sequence
## <dbl> <chr> <chr> <chr>
## 1 3839 Bernie O'Reilly Prostate 1
## 2 3839 Bernie O'Reilly Prostate 2
## 3 5643 Brionna Kuphal-Fadel Breast 1
## 4 1506 Samual Fay Prostate 1
## 5 1506 Samual Fay Bone 2
## 6 5757 Gabriela Kassulke Breast 1
## 7 5757 Gabriela Kassulke Cervical 2
## 8 6702 Ty Lynch Lung 1
## 9 2702 Abigale Senger-Schimmel Breast 1
## 10 2702 Abigale Senger-Schimmel Ovarian 2
## 11 2702 Abigale Senger-Schimmel Brain 3
## 12 3622 Regis Stracke-Bartell <NA> 0
but the repeated !is.na(cancerSite)
filter makes the recoding overly verbose and potentially less clear.