1. Why do I care?
In R, we often use the mutate() function to create a new column. Things get a little more complex when the new column is based on several existing columns, because we have to work out the logic first.
One common solution is the if_else() function. For example:
library(tidyverse)
(tips <- read_csv("tips.csv"))

# A tibble: 244 × 7
   total_bill   tip sex    smoker day   time    size
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl>
 1      17.0   1.01 Female No     Sun   Dinner     2
 2      10.3   1.66 Male   No     Sun   Dinner     3
 3      21.0   3.5  Male   No     Sun   Dinner     3
 4      23.7   3.31 Male   No     Sun   Dinner     2
 5      24.6   3.61 Female No     Sun   Dinner     4
 6      25.3   4.71 Male   No     Sun   Dinner     4
 7       8.77  2    Male   No     Sun   Dinner     2
 8      26.9   3.12 Male   No     Sun   Dinner     4
 9      15.0   1.96 Male   No     Sun   Dinner     2
10      14.8   3.23 Male   No     Sun   Dinner     2
# … with 234 more rows
When the logic is simple, if_else() is convenient.
# a common workflow is...
tips %>%
  mutate(tips_type = if_else(tip >= total_bill * 0.20, "well paid", "under paid"))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tips_type
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 under paid
 3      21.0   3.5  Male   No     Sun   Dinner     3 under paid
 4      23.7   3.31 Male   No     Sun   Dinner     2 under paid
 5      24.6   3.61 Female No     Sun   Dinner     4 under paid
 6      25.3   4.71 Male   No     Sun   Dinner     4 under paid
 7       8.77  2    Male   No     Sun   Dinner     2 well paid
 8      26.9   3.12 Male   No     Sun   Dinner     4 under paid
 9      15.0   1.96 Male   No     Sun   Dinner     2 under paid
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid
# … with 234 more rows
The disadvantage of this approach is that as the logic grows, the code becomes chaotic: many layers of if_else() end up nested inside one another, like a sandwich.
Needless to say, once more than two columns are involved, it is easy to confuse yourself.
For example:
# more layers of if_else() are difficult to write and to read:
tips %>%
  mutate(tips_type = if_else(tip >= total_bill * 0.2, "well paid",
                     if_else(tip >= total_bill * 0.15, "fare paid",
                     if_else(tip >= total_bill * 0.1, "acceptable", "under paid"))))

# many layers of if_else() overlapping, like a sandwich
# needless to say, if we are using more columns than just two:
# tips %>%
#   mutate(tips_type = if_else((tip >= total_bill * 0.2) & (day %in% c("Sat", "Sun") & time == "Dinner"), "well paid", ...))
In other words, we need a separate place to lay out our business logic as a function, instead of building a huge if_else() sandwich. Luckily, R supports functional programming well, and the purrr package (part of the tidyverse) provides a convenient tool called map().
2. Before map()
First, base R has functions called apply(), lapply() and sapply(), which serve the same purpose as map(). We won't cover them in this article; I don't use them because, like many base R functions, they lack consistency.
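As one small illustration of that inconsistency (a sketch only, using made-up toy vectors rather than the tips data): sapply() picks its return type from the data it receives, while the typed purrr functions return the type their name promises. The helper double_it() is a hypothetical name introduced just for this demonstration.

```r
library(purrr)

# Hypothetical toy function for the demonstration
double_it <- function(v) v * 2

# With data, sapply() simplifies the result to a numeric vector...
sapply_full  <- sapply(c(1, 2, 3), double_it)
# ...but with empty input it hands back an empty list instead.
sapply_empty <- sapply(numeric(0), double_it)

# map_dbl() returns a double vector in both cases.
map_full  <- map_dbl(c(1, 2, 3), double_it)
map_empty <- map_dbl(numeric(0), double_it)

class(sapply_full)   # numeric
class(sapply_empty)  # list
class(map_full)      # numeric
class(map_empty)     # numeric
```

Inside mutate() this predictability is useful, because a column has to be a vector of one known type.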
Second, map() is no magic: conceptually it is just a wrapper around a for-loop. A for-loop is a good choice for running the same process over many inputs, which is exactly what we need. However, a for-loop is hard to write inside mutate(), and remember that our problem is creating a new column. So we use the wrapper instead: map().
Of course, you can still use a for-loop, but that will break your data pipeline. For example:
# you can still use a for-loop, but that breaks your data pipeline:
tips %>%
  filter(time == "Dinner") %>%
  mutate(tips_type = "oh wait a minute, I first have to write a for-loop to find the result")

# (joke) mY cOoL foR-lO0p
# the loop must run over the filtered rows, so the pipeline has to stop here first
dinner_tips <- tips %>% filter(time == "Dinner")
tips_type_result <- vector("character", nrow(dinner_tips))
for (i in seq_along(dinner_tips$tip)) {
  if (dinner_tips$tip[[i]] >= dinner_tips$total_bill[[i]] * 0.2) {
    tips_type_result[[i]] <- "well paid"
  } else {
    tips_type_result[[i]] <- "under paid"
  }
}

# (joke) lEt's gO bAck to mUtaTe
dinner_tips %>%
  mutate(tips_type = tips_type_result)
  # %>% summarise()...

# you will not want to do things like this. That's why we need map().
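To make the "wrapper around a for-loop" idea concrete, here is a minimal sketch. my_map2_chr() and label_pair() are hypothetical helpers written only for this comparison, and the input vectors are toy values; the real map2_chr() is implemented differently inside purrr, but for a simple case like this it should give the same result.

```r
library(purrr)

# Hypothetical helper: walk two vectors in parallel with a for-loop
# and collect one character result per pair.
my_map2_chr <- function(x, y, f) {
  out <- vector("character", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]])
  }
  out
}

# Toy labelling function (an assumption for this sketch)
label_pair <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) "well paid" else "under paid"
}

tip        <- c(1, 4, 2.5)
total_bill <- c(10, 10, 10)

loop_result  <- my_map2_chr(tip, total_bill, label_pair)
purrr_result <- map2_chr(tip, total_bill, label_pair)

identical(loop_result, purrr_result)
```

The difference is that the map2_chr() call is a single expression, so it can sit inside mutate() without interrupting the pipe.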
Lastly, when I say map(), I actually mean a family of functions, such as map_chr(), whose output is a character vector, map_dbl(), whose output is a double (floating-point) vector, and so on. You can look them up with ?map in the R console.
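A quick sketch of a few family members side by side may help. The input vector and the anonymous functions are toy examples chosen for illustration; only the map functions themselves come from purrr.

```r
library(purrr)

x <- c(1.5, 2.5, 3.5)

# map() always returns a list
as_list <- map(x, function(v) v * 2)

# the suffixed variants return an atomic vector of the named type
as_dbl <- map_dbl(x, function(v) v * 2)                 # double
as_chr <- map_chr(x, function(v) paste0("value ", v))   # character
as_lgl <- map_lgl(x, function(v) v > 2)                 # logical
as_int <- map_int(x, function(v) length(v))             # integer

str(as_list)
as_dbl
as_chr
as_lgl
as_int
```

For a new column in mutate(), one of the suffixed variants is usually the better fit, since it yields a plain vector rather than a list column.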
3. A workflow of using map()
As discussed above, one of the biggest advantages of map() is that it lets us calmly write our logic as a separate function.
After that, we can use this function with map() inside mutate(), which keeps our data pipeline flowing to the next step.
Using map() improves the readability and robustness of our code. If the business logic changes, we change the standalone function instead of the mutate() clause.
Note that I actually use map2_chr(), which means we have two inputs and the output is a character vector. It calls the function once per row, which matters because a plain if () tests a single value, not a whole column. You can use map_dbl() or map2_dbl() if your result is a double.
tip_type_judge <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) {
    return("well paid")
  } else if (tip >= total_bill * 0.15) {
    return("fare paid")
  } else if (tip >= total_bill * 0.1) {
    return("acceptable")
  } else {
    return("under paid")
  }
}

tips %>%
  mutate(tip_type = map2_chr(tip, total_bill, tip_type_judge))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tip_type
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 fare paid
 3      21.0   3.5  Male   No     Sun   Dinner     3 fare paid
 4      23.7   3.31 Male   No     Sun   Dinner     2 acceptable
 5      24.6   3.61 Female No     Sun   Dinner     4 acceptable
 6      25.3   4.71 Male   No     Sun   Dinner     4 fare paid
 7       8.77  2    Male   No     Sun   Dinner     2 well paid
 8      26.9   3.12 Male   No     Sun   Dinner     4 acceptable
 9      15.0   1.96 Male   No     Sun   Dinner     2 acceptable
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid
# … with 234 more rows
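Because the logic lives in an ordinary function, it can also be checked on its own before it goes anywhere near mutate(). The sketch below repeats the article's function so that it runs without tips.csv; the small_tips tibble is a hand-typed stand-in (an assumption) that borrows four of the rows printed above.

```r
library(dplyr)
library(purrr)

# Same logic as the article's function, repeated so the block is self-contained
tip_type_judge <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) {
    "well paid"
  } else if (tip >= total_bill * 0.15) {
    "fare paid"
  } else if (tip >= total_bill * 0.1) {
    "acceptable"
  } else {
    "under paid"
  }
}

# 1. Check single values first, one branch at a time
stopifnot(tip_type_judge(2.5, 10) == "well paid")
stopifnot(tip_type_judge(1.6, 10) == "fare paid")
stopifnot(tip_type_judge(1.2, 10) == "acceptable")
stopifnot(tip_type_judge(0.5, 10) == "under paid")

# 2. Then apply it row by row on a tiny stand-in table
small_tips <- tibble(
  total_bill = c(17.0, 10.3, 23.7, 8.77),
  tip        = c(1.01, 1.66, 3.31, 2)
)

labelled <- small_tips %>%
  mutate(tip_type = map2_chr(tip, total_bill, tip_type_judge))

labelled
```

If a threshold changes later, only tip_type_judge() and its checks need to be touched; the mutate() line stays the same.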
If you have more than two columns as input, you can use pmap_*(). It takes a list as input, and the list can hold as many columns as we need. Note that I use pmap_chr() below; you can use pmap_dbl() if your result is a double.
# if you have more than 2 columns as input, use pmap_*() instead
tip_type_judge_v2 <- function(tip, total_bill, day) {
  if ((tip >= total_bill * 0.2) && (day %in% c("Sun", "Sat"))) {
    # weekend: 20% or more counts as well paid
    return("well paid")
  } else if ((tip >= total_bill * 0.15) && !(day %in% c("Sun", "Sat"))) {
    # weekday: 15% or more counts as well paid
    return("well paid")
  } else if (tip >= total_bill * 0.1) {
    return("acceptable")
  } else {
    return("under paid")
  }
}

tips %>%
  mutate(tip_type = pmap_chr(list(tip, total_bill, day), tip_type_judge_v2))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tip_type
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 acceptable
 3      21.0   3.5  Male   No     Sun   Dinner     3 acceptable
 4      23.7   3.31 Male   No     Sun   Dinner     2 acceptable
 5      24.6   3.61 Female No     Sun   Dinner     4 acceptable
 6      25.3   4.71 Male   No     Sun   Dinner     4 acceptable
 7       8.77  2    Male   No     Sun   Dinner     2 well paid
 8      26.9   3.12 Male   No     Sun   Dinner     4 acceptable
 9      15.0   1.96 Male   No     Sun   Dinner     2 acceptable
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid
# … with 234 more rows
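One detail worth knowing: when the list given to pmap_*() is unnamed, its elements are matched to the function's arguments by position, so their order matters. If the list elements are named, purrr matches them to the arguments by name instead. A minimal sketch, assuming a hypothetical function weekend_label() and a hand-typed small_tips tibble (neither comes from the tips data):

```r
library(dplyr)
library(purrr)

# Hypothetical function; note that its argument order (day first)
# differs from the order used in the list below.
weekend_label <- function(day, tip, total_bill) {
  threshold <- if (day %in% c("Sat", "Sun")) 0.2 else 0.15
  if (tip >= total_bill * threshold) "well paid" else "under paid"
}

# Toy data for the demonstration
small_tips <- tibble(
  total_bill = c(10, 10, 10),
  tip        = c(1.8, 1.8, 2.5),
  day        = c("Sun", "Thur", "Sat")
)

labelled <- small_tips %>%
  mutate(tip_type = pmap_chr(
    # named elements are passed to the arguments of the same name
    list(tip = tip, total_bill = total_bill, day = day),
    weekend_label
  ))

labelled
```

Naming the list costs a few extra characters, but it protects the result if someone later reorders the function's arguments.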