magrittr pipe in R functions

Are there cases where using the magrittr pipe inside R functions is not recommended, in terms of (1) speed and (2) the ability to debug effectively?


1 answer


There are advantages and disadvantages to using a pipe inside a function. The biggest advantage is that it's easier to see what's going on inside the function as you read the code. The biggest downsides are that error messages become harder to interpret, and that the pipe breaks some of R's evaluation rules.

Here's an example. Let's say we want to do a meaningless transformation of the mtcars dataset. This is how we can do it with pipes ...

library(tidyverse)
tidy_function <- function() {
  mtcars %>%
    group_by(cyl) %>%
    summarise(disp = sum(disp)) %>%
    mutate(disp = (disp ^ 5) / 10000000000)
}

      

You can clearly see what's going on at each step, even if it doesn't do anything useful. Now let's look at the same code written using the Dagwood sandwich approach, with each call nested inside the next ...

base_function <- function() {
  mutate(summarise(group_by(mtcars, cyl), disp = sum(disp)), disp = (disp^5) / 10000000000)
}

      

Much harder to read, although it gives us the same result ...

all.equal(tidy_function(), base_function())
# [1] TRUE

      

The most common way to avoid both the pipe and the Dagwood sandwich is to save the result of each step to an intermediate variable ...



intermediate_function <- function() {
  x <- mtcars
  x <- group_by(x, cyl)
  x <- summarise(x, disp = sum(disp))
  mutate(x, disp = (disp^5) / 10000000000)
}

      

More readable than the last function, and R will give you slightly more detailed information if there is an error. Plus it obeys the traditional evaluation rules. Again, it gives the same results as the other two functions ...

all.equal(tidy_function(), intermediate_function())
# [1] TRUE
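To illustrate the debugging point, here is a minimal sketch of how intermediate variables give you places to inspect or check each step. The function name checked_function, the data argument, and the particular check are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

# Hypothetical variant of intermediate_function() with a sanity check
checked_function <- function(data = mtcars) {
  grouped <- group_by(data, cyl)
  totals <- summarise(grouped, disp = sum(disp))

  # Each named intermediate result can be checked (or examined with
  # browser() or print()) before the next step runs; if this fails,
  # the error names the exact condition that was violated.
  stopifnot(is.numeric(totals$disp))

  mutate(totals, disp = (disp^5) / 10000000000)
}

checked_function()
```

With a pipe, the same check would have to be wrapped in a function and inserted as an extra step in the chain.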

      

You specifically asked about speed, so let's compare these three functions by running each one 1000 times ...

library(microbenchmark)
timing <-
  microbenchmark(tidy_function(),
                 intermediate_function(),
                 base_function(),
                 times = 1000L)
timing
# Unit: milliseconds
#                     expr      min       lq     mean   median       uq       max neval cld
#          tidy_function() 3.809009 4.403243 5.531429 4.800918 5.860111  23.37589  1000   a
#  intermediate_function() 3.560666 4.106216 5.154006 4.519938 5.538834  21.43292  1000   a
#          base_function() 3.610992 4.136850 5.519869 4.583573 5.696737 203.66175  1000   a

      

Even in this trivial example, the pipe is a tiny bit slower than the other two options (compare the medians), although the difference is small.
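To see how much of the difference comes from the pipe itself rather than from the dplyr calls, one option is to time a deliberately cheap operation with and without the pipe. This is only a sketch: the input x and the labels piped and direct are made up for illustration, and the actual timings depend on your machine and on the magrittr version installed.

```r
library(magrittr)
library(microbenchmark)

x <- 1:10

# sum() is so cheap that the measured difference is mostly pipe overhead
pipe_timing <- microbenchmark(
  piped  = x %>% sum(),
  direct = sum(x),
  times  = 1000L
)
pipe_timing
```

If the overhead per call is tiny compared with the work your function does, the pipe is unlikely to matter for speed.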

Conclusion

Feel free to use a pipe in your functions if this is the most convenient way to write code. If you start to run into problems or need your code to be as fast as humanly possible, then switch to a different paradigm.
