Check if a pair of columns is in a row of a dataframe

I would like to know if there is an efficient way to check whether a given pair (or a tuple of more than two) of column values occurs together in a row of a data frame.

For example, suppose I have the following dataframe:

df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1))
names(df)=c('col1','col2','col3')

  col1 col2 col3
1    a    e    1
2    b    f    0
3    c    g    0
4    d    h    1

      

and I want to check whether this table contains each pair in a list of (col1, col2) pairs such as: (a, b), (a, c), (a, e), (c, a), (c, g), (a, f)

to which it should output:

FALSE FALSE TRUE FALSE TRUE FALSE

      

Edit: Added a new pair (a, f) to avoid confusion

I thought about concatenating the columns of each row into a single string and then comparing with %in%, but that is pretty inefficient. I also thought about looping with a dplyr filter, but that also takes quite a long time when the table is huge, and it needs format conversions (i.e. writing multiple lines).

Is there an efficient way to accomplish this in R?



2 answers


This looks like a case for one of the apply family of functions, such as apply or lapply. If you define pairs.list as a list, you can use lapply:

df = data.frame(c("a","b","c","d"), c("e","f","g","h"), c(1,0,0,1))
names(df) = c('col1','col2','col3')
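# Each element is one (col1, col2) pair to look for
pairs.list = list(c("a", "b"), c("a", "c"), c("a", "e"), c("c", "a"), c("c", "g"))
# TRUE if any row has col1 equal to the first value and col2 equal to the second
lapply(pairs.list, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})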

[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] TRUE

[[4]]
[1] FALSE

[[5]]
[1] TRUE

# Order matters: the third pair is now reversed ("e", "a"), so it no longer matches
new.pairs = list(c("a", "b"), c("a", "c"), c("e", "a"), c("c", "a"), c("c", "g"))

lapply(new.pairs, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})

[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] FALSE

[[4]]
[1] FALSE

[[5]]
[1] TRUE
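
The question asks for a plain logical vector rather than a list. One possible variation, sketched here under the assumption that every pair has exactly two elements, is to swap lapply for vapply, which returns an atomic vector. The name pair.found is made up for this illustration, and the pair (a, f) from the question's edit is included.

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("e", "f", "g", "h"),
                 col3 = c(1, 0, 0, 1),
                 stringsAsFactors = FALSE)

pairs.list <- list(c("a", "b"), c("a", "c"), c("a", "e"),
                   c("c", "a"), c("c", "g"), c("a", "f"))

# vapply() checks that each result is a single logical value
# and returns an ordinary logical vector instead of a list
pair.found <- vapply(pairs.list,
                     function(x) any(df$col1 == x[[1]] & df$col2 == x[[2]]),
                     logical(1))
print(pair.found)
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
```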

      



With this method, if you want to find out which row of df matches, you can drop the any() call and get a list of logical vectors, where each vector has one element per row of df.
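
As a rough sketch of that idea (the names row.matches and row.index are made up for this illustration), the per-row logical vectors can be turned into row numbers with which():

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("e", "f", "g", "h"),
                 col3 = c(1, 0, 0, 1),
                 stringsAsFactors = FALSE)

pairs.list <- list(c("a", "b"), c("a", "e"), c("c", "g"))

# Without any(): one logical vector per pair, one element per row of df
row.matches <- lapply(pairs.list,
                      function(x) df$col1 == x[[1]] & df$col2 == x[[2]])

# which() converts each logical vector into the matching row numbers;
# a pair that is not in the table gives an empty integer vector
row.index <- lapply(row.matches, which)
print(row.index)
```

Here the first pair matches no row, the second matches row 1, and the third matches row 3.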

I think it should be relatively efficient because it relies on logical comparisons rather than string manipulation, but I'm not an expert on benchmarking performance in R, so I don't know for sure.



If you only need to check whether the combinations of column values are in the table, you can use unique to reduce the number of comparisons:



df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1), stringsAsFactors=FALSE)
names(df)=c('col1','col2','col3')

df$to_check = paste(df$col1, df$col2, sep=',')
cols <- c("a,b", "a,c", "a,e", "c,a", "c,g")

cols %in% unique(df$to_check)
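
The same key idea might be extended to tuples of more than two columns, which the question also mentions. The sketch below is only one way to do it: make_key is a hypothetical helper, not an existing function, and it assumes the separator (a carriage return here) never appears in the data. If a value could contain the separator, two different tuples could produce the same key and give a false match.

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("e", "f", "g", "h"),
                 col3 = c(1, 0, 0, 1),
                 stringsAsFactors = FALSE)

# Hypothetical helper: paste the columns of d row by row into one key per row.
# Assumes `sep` does not occur in any value.
make_key <- function(d, sep = "\r") {
  do.call(paste, c(unname(as.list(d)), sep = sep))
}

key.cols <- c("col1", "col2", "col3")
table.keys <- unique(make_key(df[key.cols]))

# Candidate tuples, stored as a data frame with the same columns
candidates <- data.frame(col1 = c("a", "a", "c"),
                         col2 = c("e", "f", "g"),
                         col3 = c(1, 0, 0),
                         stringsAsFactors = FALSE)

found <- make_key(candidates[key.cols]) %in% table.keys
print(found)
# [1]  TRUE FALSE  TRUE
```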

      
