Best way to automate variable creation in R using dplyr
df <- as.data.frame(cbind(c(1:10), c(15, 70, 29, 64, 57, 29, 10, 80,81, 71)))
V1 V2
1 1 15
2 2 70
3 3 29
4 4 64
5 5 57
6 6 29
7 7 10
8 8 80
9 9 81
10 10 71
cuts <- c(5, 10, 90, 95)
I would like to create a logical variable for each cut value x (four in this case), named P5, P10, P90 and P95, that indicates whether V2 <= x. Adding the variables "manually" is straightforward, but it does not scale:
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe
df %<>%
  mutate(P5  = V2 <= 5) %>%
  mutate(P10 = V2 <= 10) %>%
  mutate(P90 = V2 <= 90) %>%
  mutate(P95 = V2 <= 95)
V1 V2 P5 P10 P90 P95
1 1 15 FALSE FALSE TRUE TRUE
2 2 70 FALSE FALSE TRUE TRUE
3 3 29 FALSE FALSE TRUE TRUE
4 4 64 FALSE FALSE TRUE TRUE
5 5 57 FALSE FALSE TRUE TRUE
6 6 29 FALSE FALSE TRUE TRUE
7 7 10 FALSE TRUE TRUE TRUE
8 8 80 FALSE FALSE TRUE TRUE
9 9 81 FALSE FALSE TRUE TRUE
10 10 71 FALSE FALSE TRUE TRUE
Obviously, to store the data in a "tidy" format, this should be followed by gather(year, islegal, c(3:6)).
The alternative I've tried is
do.call(rbind, lapply(cuts, function(x) {
df %>% mutate(year = x, islegal = V2 <= x)
})) %>% spread(year, islegal)
V1 V2 5 10 90 95
1 1 15 FALSE FALSE TRUE TRUE
2 2 70 FALSE FALSE TRUE TRUE
3 3 29 FALSE FALSE TRUE TRUE
4 4 64 FALSE FALSE TRUE TRUE
5 5 57 FALSE FALSE TRUE TRUE
6 6 29 FALSE FALSE TRUE TRUE
7 7 10 FALSE TRUE TRUE TRUE
8 8 80 FALSE FALSE TRUE TRUE
9 9 81 FALSE FALSE TRUE TRUE
10 10 71 FALSE FALSE TRUE TRUE
Obviously, I would drop the final spread() to keep the data in a "tidy" format.
The question is: are there better or more general ways of using {dplyr} than the second approach to automate the creation of variables (such as the quantile-like cuts here, dummies, or similar) that don't require typing out the contents of cuts explicitly, as the first approach does?
Of course, you don't need dplyr for something this simple.
names(cuts) <- paste0("p", cuts)
data.frame(df, lapply(cuts, function(x) df$V2 <= x))
V1 V2 p5 p10 p90 p95
1 1 15 FALSE FALSE TRUE TRUE
2 2 70 FALSE FALSE TRUE TRUE
3 3 29 FALSE FALSE TRUE TRUE
4 4 64 FALSE FALSE TRUE TRUE
5 5 57 FALSE FALSE TRUE TRUE
6 6 29 FALSE FALSE TRUE TRUE
7 7 10 FALSE TRUE TRUE TRUE
8 8 80 FALSE FALSE TRUE TRUE
9 9 81 FALSE FALSE TRUE TRUE
10 10 71 FALSE FALSE TRUE TRUE
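The same base R idea can be wrapped in a small function so that the column name and prefix are not hard-coded. This is only a minimal sketch: add_cut_flags and its arguments are hypothetical names chosen for illustration, not part of any package.

```r
# Hypothetical helper: add one logical column per cut value.
add_cut_flags <- function(data, column, cuts, prefix = "p") {
  # One logical vector per cut: TRUE where the column value is <= the cut
  flags <- lapply(cuts, function(x) data[[column]] <= x)
  names(flags) <- paste0(prefix, cuts)
  # data.frame() expands the named list into one column per element
  data.frame(data, flags)
}

df <- data.frame(V1 = 1:10, V2 = c(15, 70, 29, 64, 57, 29, 10, 80, 81, 71))
res <- add_cut_flags(df, "V2", c(5, 10, 90, 95))
names(res)
```

Because data.frame() checks column names by default, a prefix that starts with a letter keeps the generated names syntactically valid.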
If you want to work "programmatically" with dplyr, you should look at the standard-evaluation alternatives to the regular versions of its functions. See the non-standard evaluation vignette (vignette("nse", "dplyr")).
Basically, in addition to the mutate function, there is a mutate_ function that lets you specify a list of transformations. In your case, you can build that list with something like this:
cuts <- c(5,10,90,95)
mymutate <- setNames(lapply(cuts , function(x)
lazyeval::interp(~V2<=x, x=x)), paste0("P", cuts ))
Then you can apply the transformations with
df %>% mutate_(.dots=mymutate )
# V1 V2 P5 P10 P90 P95
# 1 1 15 FALSE FALSE TRUE TRUE
# 2 2 70 FALSE FALSE TRUE TRUE
# 3 3 29 FALSE FALSE TRUE TRUE
# 4 4 64 FALSE FALSE TRUE TRUE
# 5 5 57 FALSE FALSE TRUE TRUE
# 6 6 29 FALSE FALSE TRUE TRUE
# 7 7 10 FALSE TRUE TRUE TRUE
# 8 8 80 FALSE FALSE TRUE TRUE
# 9 9 81 FALSE FALSE TRUE TRUE
# 10 10 71 FALSE FALSE TRUE TRUE
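Note that mutate_ and the other underscore verbs have been deprecated since dplyr 0.7.0 in favour of tidy evaluation. A rough equivalent of the approach above, assuming a recent dplyr and the rlang package are installed, might look like the following sketch:

```r
library(dplyr)
library(rlang)

df <- data.frame(V1 = 1:10, V2 = c(15, 70, 29, 64, 57, 29, 10, 80, 81, 71))
cuts <- c(5, 10, 90, 95)

# Build one unevaluated expression per cut (V2 <= 5, V2 <= 10, ...)
# and name the list P5, P10, ... so mutate() uses those as column names.
mymutate <- setNames(
  lapply(cuts, function(x) expr(V2 <= !!x)),
  paste0("P", cuts)
)

# Splice the named list of expressions into a single mutate() call
res <- df %>% mutate(!!!mymutate)
res
```

Here expr() with !! plays the role that lazyeval::interp() played above, and !!! replaces the .dots argument.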
If you plan on converting your data to a tidy format eventually anyway, you can simply start with it:
library(dplyr)
df <- as.data.frame(cbind(c(1:10), c(15, 70, 29, 64, 57, 29, 10, 80,81, 71)))
cuts <- data_frame(P=c(5, 10, 90, 95))
p_df <- df %>% tidyr::crossing(cuts) %>%
mutate(flag=V2<=P)
p_df
# V1 V2 P flag
#1 1 15 5 FALSE
#2 1 15 10 FALSE
#3 1 15 90 TRUE
#4 1 15 95 TRUE
#5 2 70 5 FALSE
#...
If the original wide format is really what you want, you can tidyr::spread the result:
p_df %>%
tidyr::spread(P, flag, sep="")
# V1 V2 P5 P10 P90 P95
#1 1 15 FALSE FALSE TRUE TRUE
#2 2 70 FALSE FALSE TRUE TRUE
#3 3 29 FALSE FALSE TRUE TRUE
#4 4 64 FALSE FALSE TRUE TRUE
#5 5 57 FALSE FALSE TRUE TRUE
#6 6 29 FALSE FALSE TRUE TRUE
#7 7 10 FALSE TRUE TRUE TRUE
#8 8 80 FALSE FALSE TRUE TRUE
#9 9 81 FALSE FALSE TRUE TRUE
#10 10 71 FALSE FALSE TRUE TRUE
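In current releases, data_frame() is deprecated in favour of tibble(), and spread() has been superseded by pivot_wider() since tidyr 1.0.0. Assuming recent versions of dplyr and tidyr, the same two steps could be sketched as:

```r
library(dplyr)
library(tidyr)

df <- data.frame(V1 = 1:10, V2 = c(15, 70, 29, 64, 57, 29, 10, 80, 81, 71))
cuts <- tibble(P = c(5, 10, 90, 95))

# Long ("tidy") form: one row per combination of observation and cut
p_df <- df %>%
  crossing(cuts) %>%
  mutate(flag = V2 <= P)

# Wide form: one logical column per cut, named P5, P10, P90, P95
wide <- p_df %>%
  pivot_wider(names_from = P, values_from = flag, names_prefix = "P")
wide
```

The names_prefix argument does the job that sep = "" did for spread(), namely gluing "P" onto each cut value to form the column name.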