Combine select and mutate

Quite often I find myself manually combining the select() and mutate() functions in dplyr. This is usually because I am tidying up a data frame, want to create new columns based on the old columns, and only want to keep the new columns.

For example, if I had height and width data but wanted to use it to calculate and store the area, I would use:

library(dplyr)
df <- data.frame(height = 1:3, width = 10:12)

df %>% 
  mutate(area = height * width) %>% 
  select(area)

      

When many variables are created in the mutate step, it can be difficult to make sure that they are all listed in the select step. Is there a more elegant way to keep only the variables defined in the mutate step?

The workaround I have used is the following:

df %>%
  mutate(id = row_number()) %>%
  group_by(id) %>%
  summarise(area = height * width) %>%
  ungroup() %>%
  select(-id)

      

This works, but it is rather verbose, and using summarise() carries a performance cost:

library(microbenchmark)

microbenchmark(

  df %>% 
    mutate(area = height * width) %>% 
    select(area),

  df %>%
    mutate(id = row_number()) %>%
    group_by(id) %>%
    summarise(area = height * width) %>%
    ungroup() %>%
    select(-id)
)

      

Output:

      min       lq     mean   median       uq      max neval cld
  868.822  954.053 1258.328 1147.050 1363.251 4369.544   100  a 
 1897.396 1958.754 2319.545 2247.022 2549.124 4025.050   100   b

      

I'm thinking there is another workaround in which the original column names are compared with the new column names and only the difference (the complement) is kept, but maybe there is a better way?
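The name-comparison idea could be sketched roughly as follows. This is only an illustration: keep_new is a hypothetical helper name, not part of dplyr, and it assumes a dplyr version that provides all_of(). It also has an obvious limitation: a column that mutate() overwrites under its existing name is not detected as new and gets dropped.

```r
library(dplyr)

df <- data.frame(height = 1:3, width = 10:12)

# Hypothetical helper (not a dplyr function): run mutate(), then keep
# only the columns whose names were absent from the input data frame.
keep_new <- function(.data, ...) {
  out <- mutate(.data, ...)
  # Columns present after mutate() but not before it
  new_cols <- setdiff(names(out), names(.data))
  select(out, all_of(new_cols))
}

res <- keep_new(df,
                area = height * width,
                perimeter = 2 * (height + width))
names(res)
```

With the example data this leaves just the area and perimeter columns, without having to repeat their names in a separate select step.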

I feel like I am missing something really obvious in the dplyr documentation, so I apologize if this is trivial!

1 answer


Just create your own function that combines the two steps:

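# Mutate, then keep only the columns named in the mutate() arguments.
mutate_only <- function(.data, ...) {
    # Names of the arguments passed via `...`, i.e. the new column names
    new_names <- names(match.call(expand.dots = FALSE)$...)
    .data %>% mutate(...) %>% select(one_of(new_names))
}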

      



It takes some work to make this work properly with standard evaluation. Unfortunately, the dplyr API is currently evolving in this regard, so I don't know what the recommendation for this will be in a few weeks. I'll therefore just link to the relevant documentation.
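As a side note, dplyr itself exports a verb, transmute(), that creates new columns and drops the ones it was not asked for, which appears to cover the use case in the question directly. The snippet below is a minimal sketch using the question's example data; it assumes an installed dplyr that still exports transmute(). Recent dplyr releases also offer mutate(..., .keep = "none") for the same purpose, so which spelling is preferred may depend on the version in use.

```r
library(dplyr)

df <- data.frame(height = 1:3, width = 10:12)

# transmute() computes the new column and keeps only that column
res_transmute <- df %>%
  transmute(area = height * width)

# The two-step version from the question, for comparison
res_two_step <- df %>%
  mutate(area = height * width) %>%
  select(area)

# Both approaches should produce the same single-column result
identical(res_transmute, res_two_step)
```

Because the selection is implied by the arguments themselves, there is no second list of column names to keep in sync when many variables are created.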
