Best way to automate variable creation in R using dplyr

Question


df <- as.data.frame(cbind(c(1:10), c(15, 70, 29, 64, 57, 29, 10, 80,81, 71)))

   V1 V2
1   1 15
2   2 70
3   3 29
4   4 64
5   5 57
6   6 29
7   7 10
8   8 80
9   9 81
10 10 71

cuts <- c(5, 10, 90, 95)

I would like to create logical variables for all (in this case four) cut values x (such as P5, P10, P90 and P95) that indicate whether V2 <= x. The direct way of adding the variables "manually" doesn't scale:

library(dplyr)
library(magrittr)  # provides the %<>% compound assignment pipe

df %<>% 
    mutate( P5 = V2 <=  5) %>% 
    mutate(P10 = V2 <= 10) %>% 
    mutate(P90 = V2 <= 90) %>% 
    mutate(P95 = V2 <= 95)

   V1 V2    P5   P10  P90  P95
1   1 15 FALSE FALSE TRUE TRUE
2   2 70 FALSE FALSE TRUE TRUE
3   3 29 FALSE FALSE TRUE TRUE
4   4 64 FALSE FALSE TRUE TRUE
5   5 57 FALSE FALSE TRUE TRUE
6   6 29 FALSE FALSE TRUE TRUE
7   7 10 FALSE  TRUE TRUE TRUE
8   8 80 FALSE FALSE TRUE TRUE
9   9 81 FALSE FALSE TRUE TRUE
10 10 71 FALSE FALSE TRUE TRUE

Obviously, to keep the data in a "tidy" format, this should be followed by a final gather(year, islegal, c(3:6)).

The alternative I've tried is

do.call(rbind, lapply(cuts, function(x) { 
                df %>% mutate(year = x, islegal = V2 <= x) 
        })) %>% spread(year, islegal)

   V1 V2     5    10   90   95
1   1 15 FALSE FALSE TRUE TRUE
2   2 70 FALSE FALSE TRUE TRUE
3   3 29 FALSE FALSE TRUE TRUE
4   4 64 FALSE FALSE TRUE TRUE
5   5 57 FALSE FALSE TRUE TRUE
6   6 29 FALSE FALSE TRUE TRUE
7   7 10 FALSE  TRUE TRUE TRUE
8   8 80 FALSE FALSE TRUE TRUE
9   9 81 FALSE FALSE TRUE TRUE
10 10 71 FALSE FALSE TRUE TRUE

Obviously I would drop the final spread() to keep the data in a "tidy" format.

The question is: are there more idiomatic or more general ways of using {dplyr} than the second approach to automate variable creation (such as quantile-like cuts as here, or dummies or similar) that don't require explicitly typing out the cuts, as the first approach does?

+3

r dplyr

TemplateRex Apr 21 15 at 20:34


3 answers

If you want to work with dplyr "programmatically", you should look at the standard-evaluation alternatives to the regular versions of the functions. See the non-standard evaluation vignette (vignette("nse", "dplyr")).

Basically, in addition to the mutate function, there is a mutate_ function that allows you to specify a list of transformations. In your case, you can build that list with something like this:

cuts <- c(5,10,90,95)
mymutate <- setNames(lapply(cuts , function(x) 
     lazyeval::interp(~V2<=x, x=x)), paste0("P", cuts ))

Then you can apply the transformations with

df %>% mutate_(.dots=mymutate )

#    V1 V2    P5   P10  P90  P95
# 1   1 15 FALSE FALSE TRUE TRUE
# 2   2 70 FALSE FALSE TRUE TRUE
# 3   3 29 FALSE FALSE TRUE TRUE
# 4   4 64 FALSE FALSE TRUE TRUE
# 5   5 57 FALSE FALSE TRUE TRUE
# 6   6 29 FALSE FALSE TRUE TRUE
# 7   7 10 FALSE  TRUE TRUE TRUE
# 8   8 80 FALSE FALSE TRUE TRUE
# 9   9 81 FALSE FALSE TRUE TRUE
# 10 10 71 FALSE FALSE TRUE TRUE
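
Since dplyr 0.7.0 the underscored verbs such as mutate_ are deprecated in favour of tidy evaluation. A minimal sketch of the same idea, assuming a recent dplyr together with rlang (the names cut_exprs and res are only illustrative, not part of the answer above):

```r
library(dplyr)
library(rlang)

df <- data.frame(V1 = 1:10,
                 V2 = c(15, 70, 29, 64, 57, 29, 10, 80, 81, 71))
cuts <- c(5, 10, 90, 95)

# Build one unevaluated expression per cut (V2 <= 5, V2 <= 10, ...),
# named P5, P10, ... so that the names become the new column names.
cut_exprs <- setNames(
  lapply(cuts, function(x) expr(V2 <= !!x)),
  paste0("P", cuts)
)

# !!! splices the named list into mutate() as if each entry were typed by hand.
res <- df %>% mutate(!!!cut_exprs)
```

The list is built the same way as mymutate; only the quoting (expr with !!) and the splicing (!!!) differ from the lazyeval version.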

+4

MrFlick Apr 21 15 at 20:42


If you plan on converting your data to tidy format eventually, you can simply start with it:

library(dplyr)
df <- as.data.frame(cbind(c(1:10), c(15, 70, 29, 64, 57, 29, 10, 80,81, 71)))
cuts <- tibble(P=c(5, 10, 90, 95))  # tibble() replaces the deprecated data_frame()

p_df <- df %>% tidyr::crossing(cuts) %>%
  mutate(flag=V2<=P)
p_df

#   V1 V2  P  flag
#1   1 15  5 FALSE
#2   1 15 10 FALSE
#3   1 15 90  TRUE
#4   1 15 95  TRUE
#5   2 70  5 FALSE
#...

If the original wide format is really what you want, you can tidyr::spread the result:

p_df %>% 
  tidyr::spread(P, flag, sep="")
#   V1 V2    P5   P10  P90  P95
#1   1 15 FALSE FALSE TRUE TRUE
#2   2 70 FALSE FALSE TRUE TRUE
#3   3 29 FALSE FALSE TRUE TRUE
#4   4 64 FALSE FALSE TRUE TRUE
#5   5 57 FALSE FALSE TRUE TRUE
#6   6 29 FALSE FALSE TRUE TRUE
#7   7 10 FALSE  TRUE TRUE TRUE
#8   8 80 FALSE FALSE TRUE TRUE
#9   9 81 FALSE FALSE TRUE TRUE
#10 10 71 FALSE FALSE TRUE TRUE
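
In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A rough sketch of the same long-then-wide route, assuming a recent tidyr; it uses base merge() with by = NULL for the Cartesian product instead of crossing, and the names long and wide are only illustrative:

```r
library(dplyr)
library(tidyr)

df <- data.frame(V1 = 1:10,
                 V2 = c(15, 70, 29, 64, 57, 29, 10, 80, 81, 71))
cuts <- c(5, 10, 90, 95)

# Long ("tidy") form: one row per (observation, cut) pair.
# merge() with by = NULL returns the Cartesian product of its inputs.
long <- merge(df, data.frame(P = cuts), by = NULL) %>%
  mutate(flag = V2 <= P)

# Wide form: one logical column per cut, named P5, P10, P90, P95.
wide <- long %>%
  pivot_wider(names_from = P, values_from = flag, names_prefix = "P") %>%
  arrange(V1)
```

Here names_prefix = "P" plays the role that sep = "" plays in the spread call.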

0

LmW. June 29. 17 at 21:35


Hong ooi · Accepted Answer · 2015-04-21T20:55:54+0000

Of course, you don't need dplyr for something this simple.

names(cuts) <- paste0("p", cuts)
data.frame(df, lapply(cuts, function(x) df$V2 <= x))

   V1 V2    p5   p10  p90  p95
1   1 15 FALSE FALSE TRUE TRUE
2   2 70 FALSE FALSE TRUE TRUE
3   3 29 FALSE FALSE TRUE TRUE
4   4 64 FALSE FALSE TRUE TRUE
5   5 57 FALSE FALSE TRUE TRUE
6   6 29 FALSE FALSE TRUE TRUE
7   7 10 FALSE  TRUE TRUE TRUE
8   8 80 FALSE FALSE TRUE TRUE
9   9 81 FALSE FALSE TRUE TRUE
10 10 71 FALSE FALSE TRUE TRUE
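
Another base R variant of the same idea, offered only as a sketch: outer() compares every V2 value with every cut in a single call and returns a logical matrix (the names flags and res are illustrative):

```r
df <- data.frame(V1 = 1:10,
                 V2 = c(15, 70, 29, 64, 57, 29, 10, 80, 81, 71))
cuts <- c(5, 10, 90, 95)

# outer() applies "<=" to every (V2 value, cut) pair,
# giving a 10 x 4 logical matrix with one column per cut.
flags <- outer(df$V2, cuts, "<=")
colnames(flags) <- paste0("p", cuts)

# cbind() on a data frame expands the matrix into one column per cut.
res <- cbind(df, flags)
```

Whether this reads better than the lapply version is a matter of taste; both avoid typing out the cuts by hand.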
