Using the magrittr pipe in R functions
There are advantages and disadvantages to using a pipe inside a function. The biggest advantage is that it's easier to see what's going on inside the function as you read the code. The biggest downsides are that error messages become harder to interpret, and the pipe breaks some of R's usual evaluation rules.
Here's an example. Let's say we want to apply a pointless transformation to the mtcars dataset. This is how we can do it with pipes ...
library(tidyverse)
# Piped version: each step reads from top to bottom
tidy_function <- function() {
  mtcars %>%
    group_by(cyl) %>%
    summarise(disp = sum(disp)) %>%
    mutate(disp = (disp^5) / 10000000000)
}
You can clearly see what's going on at each step, even if it doesn't do anything useful. Now let's take a look at the same code using the Dagwood sandwich approach ...
base_function <- function() {
  mutate(summarise(group_by(mtcars, cyl), disp = sum(disp)), disp = (disp^5) / 10000000000)
}
Much harder to read, although it gives us the same result ...
all.equal(tidy_function(), base_function())
# [1] TRUE
The most common way to avoid using either a pipe or a Dagwood sandwich is to save the result of each step to an intermediate variable ...
intermediate_function <- function() {
  x <- mtcars
  x <- group_by(x, cyl)
  x <- summarise(x, disp = sum(disp))
  mutate(x, disp = (disp^5) / 10000000000)
}
More readable than the last function, and R will give you a little more detail if there is a bug. Plus it obeys the traditional evaluation rules. Again, it gives the same results as the other two functions ...
all.equal(tidy_function(), intermediate_function())
# [1] TRUE
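To get a feel for why errors can be harder to read inside a pipe, here is a minimal sketch that compares the call stack of a function called directly with the same function called through %>%. The helper name show_stack is made up for this illustration, and the exact number of extra frames depends on the magrittr version you have installed, so no output is shown.

```r
library(magrittr)

# Hypothetical helper (not part of any package): returns the call stack
# as seen from inside the function.
show_stack <- function(x) {
  sys.calls()
}

# Called directly: the stack ends with show_stack(1).
direct_stack <- show_stack(1)

# Called through the pipe: the pipe's own call sits on the stack as well.
piped_stack <- 1 %>% show_stack()

# Compare how many frames a traceback would have to show in each case.
length(direct_stack)
length(piped_stack)
```

The piped call carries at least the frame of %>% itself, and that extra layer is what you have to read past in a traceback when something fails in the middle of a chain.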
You asked specifically about speed, so let's compare these three functions by running each one 1000 times ...
library(microbenchmark)
timing <- microbenchmark(
  tidy_function(),
  intermediate_function(),
  base_function(),
  times = 1000L
)
timing
# Unit: milliseconds
#                     expr      min       lq     mean   median       uq       max neval cld
#          tidy_function() 3.809009 4.403243 5.531429 4.800918 5.860111  23.37589  1000   a
#  intermediate_function() 3.560666 4.106216 5.154006 4.519938 5.538834  21.43292  1000   a
#          base_function() 3.610992 4.136850 5.519869 4.583573 5.696737 203.66175  1000   a
Even in this trivial example, the pipe is slightly slower than the other two options (a median of about 4.8 ms versus roughly 4.5 to 4.6 ms).
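To put that difference in proportion, here is a small sketch that expresses each median relative to the fastest one. The numbers are simply copied by hand from the benchmark output above, so they only describe that one run on that one machine.

```r
# Median timings in milliseconds, copied from the microbenchmark output above.
medians <- c(
  tidy_function         = 4.800918,
  intermediate_function = 4.519938,
  base_function         = 4.583573
)

# Express each median relative to the fastest one (1 = fastest).
relative <- round(medians / min(medians), 3)
relative
#         tidy_function intermediate_function         base_function
#                 1.062                 1.000                 1.014
```

So in this run the piped version's median is about 6% above the fastest option, which is a difference you would only care about in very hot code.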
Conclusion
Feel free to use a pipe in your functions if this is the most convenient way to write code. If you start to run into problems or need your code to be as fast as humanly possible, then switch to a different paradigm.