How to match strings in different combinations in R
I have a data frame df with words separated by +, but I don't want the order to matter when parsing. For example, I have
df <- as.data.frame(
c(("Yellow + Blue + Green"),
("Blue + Yellow + Green"),
("Green + Yellow + Blue")))
These currently count as three unique answers, but I want them to be considered the same. I have tried brute-force methods such as ifelse, but they are not suitable for large datasets.
Is there a way to reorder the terms so that they match, or something like a reverse of the combn function that recognizes that they are the same combination in a different order?
Thanks!
#DATA
df <- data.frame(cols =
c(("Yellow + Blue + Green"),
("Blue + Yellow + Green"),
("Green + Yellow + Blue"),
("Green + Yellow + Red")), stringsAsFactors = FALSE)
#Split, sort, and then paste together
df$group = sapply(df$cols, function(a)
paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))
df
#                   cols               group
#1 Yellow + Blue + Green Blue, Green, Yellow
#2 Blue + Yellow + Green Blue, Green, Yellow
#3 Green + Yellow + Blue Blue, Green, Yellow
#4  Green + Yellow + Red  Green, Red, Yellow
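The split, sort, and paste step can also be wrapped in a small helper so it is easy to reuse. This is only a sketch and not part of the original answer: the name normalize_combo is hypothetical, and it assumes the terms are separated by a literal + with optional whitespace around it.

```r
# Hypothetical helper: build an order-independent key for each string.
# Assumption: terms are separated by a literal "+" with optional whitespace.
normalize_combo <- function(x, collapse = ", ") {
  parts <- strsplit(x, "+", fixed = TRUE)   # split on the literal "+"
  vapply(parts,
         function(p) paste(sort(trimws(p)), collapse = collapse),
         character(1))
}

cols <- c("Yellow + Blue + Green", "Blue+Yellow + Green", "Green + Yellow + Red")
keys <- normalize_combo(cols)
keys
```

Using trimws() after splitting on the bare + means inconsistent spacing (as in the second element) does not produce a different key.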
#Or you can convert to factors too (and back to numeric, if you like)
df$group2 = as.numeric(as.factor(sapply(df$cols, function(a)
paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))))
df
#                   cols               group group2
#1 Yellow + Blue + Green Blue, Green, Yellow      1
#2 Blue + Yellow + Green Blue, Green, Yellow      1
#3 Green + Yellow + Blue Blue, Green, Yellow      1
#4  Green + Yellow + Red  Green, Red, Yellow      2
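One detail worth knowing about the factor approach: factor levels are sorted, so the numeric IDs follow the alphabetical order of the keys rather than the order in which the combinations first appear. If first-appearance numbering is preferred, match() against unique() is one possible alternative. The vector keys below is made-up illustration data in the same format as the group column.

```r
# Illustration data (assumed): normalized keys such as those in df$group.
keys <- c("Green, Red, Yellow", "Blue, Green, Yellow", "Blue, Green, Yellow")

# IDs from a factor follow the sorted order of the levels.
id_factor <- as.numeric(as.factor(keys))

# IDs from match() follow the order of first appearance.
id_first <- match(keys, unique(keys))

# Number of rows per combination.
counts <- table(keys)

id_factor
id_first
counts
```

Either ID works for grouping; they differ only in which combination gets which number.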
I would like to add my take on this, since it is not clear what output format you want.
I am using the packages stringr and iterators, together with the df created by d.b in the previous answer.
library(stringr)    # str_extract_all(), boundary()
library(iterators)  # iter()
search <- c("Yellow", "Green", "Blue")
L <- str_extract_all(df$cols, boundary("word"))
sapply(iter(L), function(x) all(search %in% x))
[1] TRUE TRUE TRUE FALSE
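Note that all(search %in% x) only checks that every search term is present, so a row containing extra terms would also return TRUE. If the set of terms has to match exactly, setequal() may be closer to what is wanted. The following is a base R sketch without extra packages; the sample vector cols (including the four-term row) is made up for illustration, and the + separator is assumed.

```r
# Illustration data (assumed); the last row has an extra term.
cols <- c("Yellow + Blue + Green", "Blue + Yellow + Green",
          "Green + Yellow + Red", "Yellow + Blue + Green + Red")
search <- c("Yellow", "Green", "Blue")

# Split on the literal "+" and strip surrounding whitespace.
terms <- lapply(strsplit(cols, "+", fixed = TRUE), trimws)

# TRUE when every search term is present (extra terms allowed).
contains_all <- vapply(terms, function(x) all(search %in% x), logical(1))

# TRUE only when the terms are exactly the search set, in any order.
exact_match <- vapply(terms, function(x) setequal(x, search), logical(1))

contains_all  # TRUE TRUE FALSE TRUE
exact_match   # TRUE TRUE FALSE FALSE
```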