Passing a variable for a column name?

Question

For example, suppose you had a function that used some of the dplyr functions, but you couldn't expect the datasets passed to that function to have the same column names.

For a simplified example of what I mean, let's say you have a data frame arizona.trees:

arizona.trees
group arizona.redwoods   arizona.oaks 
A     23                 11        
A     24                 12  
B     9                  8 
B     10                 7
C     88                 22

and another, very similar data frame california.trees:

california.trees
group    california.redwoods california.oaks 
A        25                  50        
A        11                  33  
B        90                  5 
B        77                  3
C        90                  35

Now suppose you want to implement a function that returns the average per group (A, B, ... Z) for a given type of tree, and that works for both of these data frames.

foo <- function(dataset, group1, group2, tree.type) { 
     column.name <- colnames(dataset[2])
     result <- filter(dataset, group %in% c(group1, group2)) %>%
               select(group, contains(tree.type)) %>%
               group_by(group) %>%
               summarize("mean" = mean(column.name))
     return(result)
}

The desired output for the call foo(california.trees, A, B, redwoods) would be:

result
       mean
A       18
B       83.5

For some reason, an implementation like foo() just doesn't work. This is probably due to some kind of error with indexing the data frame: it seems like the function is trying to get the average of the character string column.name, instead of fetching the column and passing the column to mean(). I don't know how to avoid this. There is also the problem that the modified data frame is passed implicitly through the pipe operator and so cannot be referenced directly, which may be what causes the issue.

Why is this? Is there some alternative implementation?

variables r dplyr

user3450277 Apr 26 '17 at 18:25

1 answer

akrun · Accepted Answer · 2017-04-26T18:30:57+0000

We can use a quosure-based solution from the devel version of dplyr (soon to be released as 0.6.0):

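foo <- function(dataset, group1, group2, tree.type){
        # capture the bare group arguments and convert them to strings
        group1 <- quo_name(enquo(group1))
        group2 <- quo_name(enquo(group2))
        # build a quosure from the name of the second column
        colN <- rlang::parse_quosure(names(dataset)[2])
        tree.type <- quo_name(enquo(tree.type))
        dataset %>%
                filter(group %in% c(group1, group2)) %>%
                select(group, contains(tree.type)) %>%
                group_by(group) %>%
                # unquote the column quosure so mean() sees the column itself
                summarise(mean = mean(UQ(colN)))
}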


foo(california.trees, A, B, redwoods)
# A tibble: 2 × 2
#  group  mean
#  <chr> <dbl>
#1     A  18.0
#2     B  83.5

foo(arizona.trees, A, B, redwoods)
# A tibble: 2 × 2
#   group  mean
#  <chr> <dbl>
#1     A  23.5
#2     B   9.5

enquo takes the input arguments and converts them to quosures; quo_name then converts each quosure to a string for use with %in%. The second column name is converted from a string to a quosure using parse_quosure, and it is then unquoted (with UQ or !!) so that it is evaluated within summarise.

NOTE: This is based on the OP's function, which selects the second column.

The above solution relies on position-based column selection (as per the OP's code), so it may not work for other columns. Instead, we can match on tree.type and take the mean of the columns that match:

foo1 <- function(dataset, group1, group2, tree.type){
        # capture the bare arguments and convert them to strings
        group1 <- quo_name(enquo(group1))
        group2 <- quo_name(enquo(group2))
        tree.type <- quo_name(enquo(tree.type))
        dataset %>%
                filter(group %in% c(group1, group2)) %>%
                select(group, contains(tree.type)) %>%
                group_by(group) %>%
                # average every column whose name contains tree.type
                summarise_at(vars(contains(tree.type)), funs(mean = mean(.)))
}

The function can be tested on different columns in the two datasets:

foo1(arizona.trees, A, B, oaks)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  11.5
#2     B   7.5

foo1(arizona.trees, A, B, redwood)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  23.5
#2     B   9.5

foo1(california.trees, A, B, redwood)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  18.0
#2     B  83.5

foo1(california.trees, A, B, oaks)
# A tibble: 2 × 2
#  group  mean
#  <chr> <dbl>
#1     A  41.5
#2     B   4.0
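If the caller can pass plain strings rather than bare names, a variant without quosures is also possible. The sketch below is an alternative technique, not the answer's method: foo2 and its keep argument are hypothetical, and it assumes a dplyr release that supports the .data pronoun. The column is found by matching tree.type against the column names, as foo1 does with contains().

```r
library(dplyr)

california.trees <- data.frame(
  group = c("A", "A", "B", "B", "C"),
  california.redwoods = c(25L, 11L, 90L, 77L, 90L),
  california.oaks = c(50L, 33L, 5L, 3L, 35L)
)

# Hypothetical string-based variant: `keep` is a character vector of groups,
# `tree.type` a substring of the wanted column name
foo2 <- function(dataset, keep, tree.type) {
  # find the column whose name contains tree.type (literal match)
  col <- grep(tree.type, names(dataset), fixed = TRUE, value = TRUE)
  stopifnot(length(col) == 1)  # this sketch expects exactly one match
  dataset %>%
    filter(group %in% keep) %>%
    group_by(group) %>%
    # .data[[col]] looks the column up by its name held in `col`
    summarise(mean = mean(.data[[col]]))
}

res <- foo2(california.trees, c("A", "B"), "redwoods")
```

The trade-off is that the call site needs quoted arguments, foo2(california.trees, c("A", "B"), "redwoods"), in exchange for a function body with no quoting or unquoting.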

data

arizona.trees <- structure(list(group = c("A", "A", "B", "B", "C"), 
arizona.redwoods = c(23L, 
24L, 9L, 10L, 88L), arizona.oaks = c(11L, 12L, 8L, 7L, 22L)),
.Names = c("group", 
"arizona.redwoods", "arizona.oaks"), class = "data.frame",
 row.names = c(NA, -5L))

california.trees <- structure(list(group = c("A", "A", "B", "B", "C"), 
 california.redwoods = c(25L, 
11L, 90L, 77L, 90L), california.oaks = c(50L, 33L, 5L, 3L, 35L
)), .Names = c("group", "california.redwoods", "california.oaks"
), class = "data.frame", row.names = c(NA, -5L))
