How to use map() instead of an if_else() sandwich?

2023-10-09 12:34:52

1. Why do I care?

In R, when we want to create a new column, we often use the mutate() function. A slightly more complex situation is when the new column is based on several existing columns, so we have to work out our logic first.

One common solution is the if_else() function. For example:

library(tidyverse)
(tips <- read_csv("tips.csv"))

# A tibble: 244 × 7
   total_bill   tip sex    smoker day   time    size
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl>
 1      17.0   1.01 Female No     Sun   Dinner     2
 2      10.3   1.66 Male   No     Sun   Dinner     3
 3      21.0   3.5  Male   No     Sun   Dinner     3
 4      23.7   3.31 Male   No     Sun   Dinner     2
 5      24.6   3.61 Female No     Sun   Dinner     4
 6      25.3   4.71 Male   No     Sun   Dinner     4
 7       8.77  2    Male   No     Sun   Dinner     2
 8      26.9   3.12 Male   No     Sun   Dinner     4
 9      15.0   1.96 Male   No     Sun   Dinner     2
10      14.8   3.23 Male   No     Sun   Dinner     2
# … with 234 more rows

When the logic is simple, if_else() is convenient.

# a common workflow is...
tips %>%
  mutate(tips_type = if_else(tip >= total_bill * 0.20, "well paid", "under paid"))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tips_type 
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>     
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 under paid
 3      21.0   3.5  Male   No     Sun   Dinner     3 under paid
 4      23.7   3.31 Male   No     Sun   Dinner     2 under paid
 5      24.6   3.61 Female No     Sun   Dinner     4 under paid
 6      25.3   4.71 Male   No     Sun   Dinner     4 under paid
 7       8.77  2    Male   No     Sun   Dinner     2 well paid 
 8      26.9   3.12 Male   No     Sun   Dinner     4 under paid
 9      15.0   1.96 Male   No     Sun   Dinner     2 under paid
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid 
# … with 234 more rows

But a disadvantage of this method is that as the logic grows, the code becomes chaotic. There will be many layers of if_else() nested inside each other, like a sandwich.

Needless to say, if you are going to use more than two columns, you will confuse yourself easily.

For example:

# more layers of if_else() are difficult to write and to read:
tips %>%
  mutate(tips_type = if_else(tip >= total_bill * 0.2, "well paid", 
                     if_else(tip >= total_bill * 0.15, "fair paid",
                     if_else(tip >= total_bill * 0.1, "acceptable", "under paid")))) # many layers of if_else() nested like a sandwich

# needless to say, it gets worse if we use more than two columns:
#tips %>%
#  mutate(tips_type = if_else((tip >= total_bill * 0.2) & (day %in% c("Sat", "Sun") & time == "Dinner"), "well paid", ...))

That is, we need a separate place to arrange our business logic and prepare our function, instead of building a huge if_else() sandwich. Luckily, R supports functional programming, and the purrr package (part of the tidyverse) provides a convenient tool called map().

2. Before map()

First, base R has functions called apply(), lapply() and sapply(). They serve the same purpose as map(), but we won't talk about them in this article. I don't use them because, like many other base R functions, they lack consistency.

Second, map() is not magic; it is only a wrapper around a for-loop. A for-loop is a good choice for running multiple inputs through the same process, which is exactly what we need. However, a for-loop is hard to write inside a mutate() call, and don't forget that we are trying to create a new column. So we use the wrapper around the for-loop instead: map().

Of course, you can still use a for-loop, but that will break your data flow pipeline. For example:
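To make the "wrapper around a for-loop" idea concrete, here is a minimal sketch. It assumes purrr is installed; the vector tip_rate and the helper label_rate() are made up for illustration and are not part of the tips data. Both versions call the same function once per element and should give the same result.

```r
library(purrr)

# made-up tip rates (tip / total_bill), not taken from tips.csv
tip_rate <- c(0.06, 0.16, 0.23)

# hypothetical helper that judges ONE rate at a time
label_rate <- function(rate) {
  if (rate >= 0.2) "well paid" else "under paid"
}

# hand-written for-loop: pre-allocate the output, then fill it element by element
loop_result <- vector("character", length(tip_rate))
for (i in seq_along(tip_rate)) {
  loop_result[[i]] <- label_rate(tip_rate[[i]])
}

# map_chr() runs the same iteration for us and returns a character vector
map_result <- map_chr(tip_rate, label_rate)

identical(loop_result, map_result)
```

The map_chr() call is a single expression, which is why it fits inside mutate() while the loop does not.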

# you can still use a for-loop, but it breaks your data flow pipeline:
tips %>%
  filter(time == "Dinner") %>%
  mutate(tips_type = "oh wait a minute, I first have to write a for-loop to find the result")

# (joke) mY cOoL foR-lO0p
# the loop must run on the same filtered rows, otherwise the lengths will not match in mutate()
dinner_tips <- tips %>% filter(time == "Dinner")
tips_type_result <- vector("character", nrow(dinner_tips))
for (i in seq_along(dinner_tips$tip)) {
  if (dinner_tips$tip[[i]] >= dinner_tips$total_bill[[i]] * 0.2) {
    tips_type_result[[i]] <- "well paid"
  } else {
    tips_type_result[[i]] <- "under paid"
  }
}

# (joke) lEt's gO bAck to mUtaTe
tips %>%
  filter(time == "Dinner") %>%
  mutate(tips_type = tips_type_result) %>%
  summarise()...

# you will not want to do things like this. That's why we need map().

Lastly, when I say map(), I actually mean a family of functions, such as map_chr(), whose output is a character vector, map_dbl(), whose output is a double (floating-point) vector, and so on. You can look them up with ?map at the R console.
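As a quick illustration of the family, the sketch below applies different variants to the same made-up vector x (an assumption for this example, not data from the article). The suffix of each function fixes the type of the result, assuming purrr is installed.

```r
library(purrr)

# made-up input vector for illustration
x <- c(1, 4, 9)

as_list <- map(x, sqrt)                                  # plain map() always returns a list
as_dbl  <- map_dbl(x, sqrt)                              # double vector
as_chr  <- map_chr(x, function(v) paste0("value ", v))   # character vector
as_lgl  <- map_lgl(x, function(v) v > 3)                 # logical vector

str(as_list)
as_dbl
as_chr
as_lgl
```

If the function returns a value of the wrong type for the chosen variant, the typed variants raise an error instead of silently converting, which is part of the consistency mentioned earlier.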

3. A workflow of using map()

As discussed above, one of the biggest advantages map() gives us is a calm place to write our logic and function separately.

After that, we can use this function with map() inside mutate(), which also keeps our data flow pipeline going to the next step.

Using map() increases the readability and robustness of our code. If our business logic changes, we can change the independent function instead of changing the mutate() clause.

Note that I actually use map2_chr(). It means we have two variables as input, and the output is a character string. You can use map_dbl() or map2_dbl() if your result is a double floating-point number.

tip_type_judge <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) {
    return("well paid")
  } else if (tip >= total_bill * 0.15) {
    return("fair paid")
  } else if (tip >= total_bill * 0.1) {
    return("acceptable")
  } else {
    return("under paid")
  }
}

tips %>%
  mutate(tip_type = map2_chr(tip, total_bill, tip_type_judge))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tip_type  
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>     
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 fair paid 
 3      21.0   3.5  Male   No     Sun   Dinner     3 fair paid 
 4      23.7   3.31 Male   No     Sun   Dinner     2 acceptable
 5      24.6   3.61 Female No     Sun   Dinner     4 acceptable
 6      25.3   4.71 Male   No     Sun   Dinner     4 fair paid 
 7       8.77  2    Male   No     Sun   Dinner     2 well paid 
 8      26.9   3.12 Male   No     Sun   Dinner     4 acceptable
 9      15.0   1.96 Male   No     Sun   Dinner     2 acceptable
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid 
# … with 234 more rows
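One practical benefit of keeping the logic in a plain function is that it can be checked on single values before it goes anywhere near mutate(). The sketch below uses a hypothetical stand-in, judge_sketch(), with made-up inputs and simplified labels; it is not the article's function, only an illustration of the habit. It assumes purrr is installed.

```r
library(purrr)

# hypothetical stand-in for a business-logic function (labels chosen for this sketch)
judge_sketch <- function(tip, total_bill) {
  rate <- tip / total_bill
  if (rate >= 0.2) {
    "well paid"
  } else if (rate >= 0.1) {
    "acceptable"
  } else {
    "under paid"
  }
}

# check the logic on single, hand-picked values first
stopifnot(
  judge_sketch(3.0, 10) == "well paid",
  judge_sketch(1.2, 10) == "acceptable",
  judge_sketch(0.5, 10) == "under paid"
)

# only then apply it element by element to whole vectors
sketch_result <- map2_chr(c(3.0, 1.2, 0.5), c(10, 10, 10), judge_sketch)
sketch_result
```

When the thresholds change, only the function and its checks need to be touched; the map2_chr() call stays the same.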

If you have more than 2 columns as input, you can use pmap_*(). It takes a list as input, and the list can hold as many columns as we need. Note that I use pmap_chr() below. You can use pmap_dbl() if your result is a double floating-point number.

# if you have more than 2 columns as input, use pmap_*() instead
tip_type_judge_v2 <- function(tip, total_bill, day) {
  # weekend: the tip must reach 20% to count as well paid
  if ((tip >= total_bill * 0.2) && (day %in% c("Sun", "Sat"))) {
    return("well paid")
  # other days: 15% is enough
  } else if ((tip >= total_bill * 0.15) && !(day %in% c("Sun", "Sat"))) {
    return("well paid")
  } else if (tip >= total_bill * 0.1) {
    return("acceptable")
  } else {
    return("under paid")
  }
}

tips %>%
  mutate(tip_type = pmap_chr(list(tip, total_bill, day), tip_type_judge_v2))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tip_type  
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>     
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 acceptable
 3      21.0   3.5  Male   No     Sun   Dinner     3 acceptable
 4      23.7   3.31 Male   No     Sun   Dinner     2 acceptable
 5      24.6   3.61 Female No     Sun   Dinner     4 acceptable
 6      25.3   4.71 Male   No     Sun   Dinner     4 acceptable
 7       8.77  2    Male   No     Sun   Dinner     2 well paid 
 8      26.9   3.12 Male   No     Sun   Dinner     4 acceptable
 9      15.0   1.96 Male   No     Sun   Dinner     2 acceptable
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid 
# … with 234 more rows
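In the call above, the unnamed list is matched to the function's arguments by position, so the order inside list() has to follow the order of the function's parameters. If the list is named, pmap_*() matches by name instead, which is harder to get wrong as the number of columns grows. Here is a minimal sketch with made-up vectors and a hypothetical helper weekend_bonus() (neither is from the article), assuming purrr is installed.

```r
library(purrr)

# hypothetical helper for illustration only
weekend_bonus <- function(tip, total_bill, day) {
  rate <- tip / total_bill
  if (day %in% c("Sat", "Sun") && rate >= 0.2) "well paid" else "other"
}

# made-up example values
tip        <- c(2.00, 1.01, 3.00)
total_bill <- c(8.77, 16.99, 10.00)
day        <- c("Sun", "Sun", "Thur")

# named list: elements are matched to arguments by name,
# so the order inside list() no longer matters
named_result <- pmap_chr(
  list(day = day, total_bill = total_bill, tip = tip),
  weekend_bonus
)
named_result
```

Inside mutate(), the same idea would look like pmap_chr(list(tip = tip, total_bill = total_bill, day = day), ...), with the names on the left matching the function's parameter names.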
