Column of unique values between two other columns

sample data:

col1    col2
<NA>    cc
a       a
ab      a
z       a

      

I want to add a column unique with these values: any characters that are not shared between col1 and col2 in the same row.

col1    col2    unique
<NA>    cc      cc
a       a   
ab      a       b
z       a       za 

      

I tried to use setdiff, but without success. For replication purposes:

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"col1 col2
NA    cc
a     a
ab    a
z     a")

      

Like this:

df$unique <- paste0(setdiff(df$col1, df$col2), setdiff(df$col2, df$col1))

      

But it returns

Error in `$<-.data.frame`(`*tmp*`, "unique", value = c("<NA>cc", "abcc" : 
  replacement has 2 rows, data has 3

      

From the error, it looks like setdiff computes the difference between the two columns as whole vectors, not between the elements in each row.
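
A minimal sketch of why this happens, using the first three rows of the sample data (the names col1, col2, d1, d2 and combined are only for illustration): setdiff treats each column as a set of whole strings, so the length of its result depends on how many strings differ, not on the number of rows.

```r
col1 <- c(NA, "a", "ab")
col2 <- c("cc", "a", "a")

# setdiff() compares the columns as sets of whole strings
d1 <- setdiff(col1, col2)   # NA and "ab" do not occur in col2
d2 <- setdiff(col2, col1)   # "cc" does not occur in col1

# paste0() recycles the shorter vector: 2 values for 3 rows
combined <- paste0(d1, d2)
length(combined)

# a per-row comparison needs the strings split into characters first
setdiff(strsplit("ab", "")[[1]], strsplit("a", "")[[1]])
```

So any working approach probably has to split each string into characters and then compare row by row.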

Edit: Added z and a to the sample data as the last row.



3 answers


You can do this using setdiff and Reduce in base R:



cols <- c(1, 2)
# for each row: drop NAs, split into characters, take the successive set difference
df$unique <- apply(df[cols], 1, function(x)
  paste0(Reduce(setdiff, strsplit(na.omit(x), split = "")), collapse = ""))

#   col1 col2 unique
# 1 <NA>   cc     cc
# 2    a    a       
# 3   ab    a      b

      



Here is a lengthier method with apply:

apply(df, 1, function(i) {
  i <- i[!is.na(i)]                      # remove NAs
  if (length(i) == 1) {
    i                                    # return singletons untouched
  } else {
    i <- unlist(strsplit(i, split = "")) # split into single characters
    # drop every character that occurs more than once
    i <- i[!(duplicated(i) | duplicated(i, fromLast = TRUE))]
    paste(i, collapse = "")              # collapse the remaining characters
  }
})
# [1] "cc" ""   "b"

      



Note that for c("cc", "a", "c") this will return "a", because after splitting, every c from "cc" and "c" is marked as a duplicate.
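
That case can be checked with a short sketch (chars, dup and kept are hypothetical names used only here):

```r
# characters of c("cc", "a", "c") after splitting
chars <- unlist(strsplit(c("cc", "a", "c"), split = ""))
chars

# flag every occurrence of a repeated character, not only the later ones
dup <- duplicated(chars) | duplicated(chars, fromLast = TRUE)

# only the characters that occur exactly once survive
kept <- paste(chars[!dup], collapse = "")
kept
```

Combining duplicated with its fromLast = TRUE counterpart is what removes the first occurrence as well; duplicated alone would leave one c behind.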



We need to split the strings into characters first:

df$unique <- mapply(function(x, y){
    u <- setdiff(union(x, y), intersect(x, y))
    paste0(u[!is.na(u)], collapse = '')
}, strsplit(df$col1, ''), strsplit(df$col2, ''))

# >df
#   col1 col2 unique
# 1 <NA>   cc      c
# 2    a    a       
# 3   ab    a      b
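
If repeated characters should be kept (cc rather than c in the first row), one possible variation is sketched below. It is not part of the answer above: the helper name unshared is hypothetical, and it assumes that NA should count as an empty string and that leftovers from col1 come before those from col2.

```r
df <- data.frame(col1 = c(NA, "a", "ab", "z"),
                 col2 = c("cc", "a", "a", "a"),
                 stringsAsFactors = FALSE)

# hypothetical helper: characters of each string that do not occur in the other
unshared <- function(x, y) {
  # assumption: NA is treated as a string with no characters
  x <- if (is.na(x)) character(0) else strsplit(x, "")[[1]]
  y <- if (is.na(y)) character(0) else strsplit(y, "")[[1]]
  # %in% filters without de-duplicating, so repeated characters survive
  paste0(c(x[!x %in% y], y[!y %in% x]), collapse = "")
}

df$unique <- mapply(unshared, df$col1, df$col2, USE.NAMES = FALSE)
df$unique
```

On the four-row sample data this should give cc, an empty string, b and za, which matches the output requested in the question.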

      
