Check if a pair of columns is in a row of a dataframe
I would like to know if there is an efficient way to check whether a given pair (or a tuple of more than two) of column values appears together in some row of a dataframe.
For example, suppose I have the following dataframe:
df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1))
names(df)=c('col1','col2','col3')
  col1 col2 col3
1    a    e    1
2    b    f    0
3    c    g    0
4    d    h    1
and I want to check whether this table contains each pair of (col1, col2) values from a list like: (a, b), (a, c), (a, e), (c, a), (c, g), (a, f)
to which it should output:
FALSE FALSE TRUE FALSE TRUE FALSE
Edit: Added a new pair (a, f) to avoid confusion
I thought about doing this by concatenating the columns of each row into a single string and then comparing with %in%, but that is pretty inefficient. I also thought about looping with a dplyr filter, but that also takes quite a long time when the table is huge, and it needs format conversions (i.e. writing multiple lines).
Is there an efficient way to accomplish this in R?
This looks like a case for one of the apply family of functions, such as apply or lapply. If you define pairs.list as a list, you can use lapply:
df = data.frame(c("a","b","c","d"), c("e","f","g","h"), c(1,0,0,1))
names(df) = c('col1','col2','col3')
pairs.list = list(c("a", "b"), c("a", "c"), c("a", "e"), c("c", "a"), c("c", "g"))
lapply(pairs.list, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] TRUE
The order within each pair matters, so a reversed pair such as (e, a) does not match:
new.pairs = list(c("a", "b"), c("a", "c"), c("e", "a"), c("c", "a"), c("c", "g"))
lapply(new.pairs, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] FALSE
[[4]]
[1] FALSE
[[5]]
[1] TRUE
With this method, if you want to find out which row of df matches, you can drop the call to any() and get a list of logical vectors, where each vector has one element per row of df.
I think it should be relatively efficient because it relies on logical comparisons rather than string manipulation, but I'm not an expert on benchmarking performance in R, so I don't know for sure.
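As a rough sketch of the two variations mentioned above (not benchmarked), vapply can return the result as a plain logical vector instead of a list, and which() can replace any() to report the matching row numbers. The name match.rows is made up for this illustration, and the pair (a, f) from the question's edit is included:

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("e", "f", "g", "h"),
                 col3 = c(1, 0, 0, 1),
                 stringsAsFactors = FALSE)
pairs.list <- list(c("a", "b"), c("a", "c"), c("a", "e"),
                   c("c", "a"), c("c", "g"), c("a", "f"))

# One logical value per pair, returned as a plain vector rather than a list.
found <- vapply(pairs.list,
                function(x) any(df$col1 == x[[1]] & df$col2 == x[[2]]),
                logical(1))

# Without any(): the row numbers of df where each pair occurs
# (an empty integer vector when the pair is absent).
match.rows <- lapply(pairs.list,
                     function(x) which(df$col1 == x[[1]] & df$col2 == x[[2]]))

print(found)  # expected: FALSE FALSE TRUE FALSE TRUE FALSE
```

Here match.rows[[3]] is 1 and match.rows[[5]] is 3, because (a, e) sits in row 1 and (c, g) in row 3.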
If you only need to check whether the combinations of column values are in the table or not, you can use unique to reduce the number of comparisons:
df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1), stringsAsFactors=FALSE)
names(df)=c('col1','col2','col3')
df$to_check = paste(df$col1, df$col2, sep=',')
cols <- c("a,b", "a,c", "a,e", "c,a", "c,g")
cols %in% unique(df$to_check)
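One caveat with pasting is that the separator can become ambiguous if the values themselves contain a comma. A possible alternative, shown here only as a sketch, keeps the columns separate and uses merge from base R. The lookup table pairs, its id column, and key.cols are names invented for this example; adding more names to key.cols would extend the check to tuples of more than two columns:

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("e", "f", "g", "h"),
                 col3 = c(1, 0, 0, 1),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table: one row per pair to check.
pairs <- data.frame(col1 = c("a", "a", "a", "c", "c", "a"),
                    col2 = c("b", "c", "e", "a", "g", "f"),
                    stringsAsFactors = FALSE)

key.cols <- c("col1", "col2")

# Tag each pair so it can be identified after the join.
pairs$id <- seq_len(nrow(pairs))

# Inner join against the distinct key combinations present in df.
hits <- merge(pairs, unique(df[key.cols]), by = key.cols)

# A pair is present if its id survived the join.
found <- pairs$id %in% hits$id
print(found)  # expected: FALSE FALSE TRUE FALSE TRUE FALSE
```

Whether this is faster than the paste approach on a large table would need to be measured.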