Extract words that differ between two sentences
I have a very large dataframe with two columns, sentence1 and sentence2. I am trying to create a new column with the words that differ between the two sentences, for example:
sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))
My data frame has the following structure:
ID sentence1 sentence2
1 This is sentence one This is the sentence four
2 This is sentence two This is the sentence five
3 This is sentence three This is the sentence six
And my expected output:
ID sentence1 sentence2 Expected_Result
1 This is ... This is ... one the four
2 This is ... This is ... two the five
3 This is ... This is ... three the six
In R, I tried to split sentences and get items that differ between lists, for example:
df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)
But this approach doesn't work: setdiff() compares the two list columns as wholes instead of operating row by row.
In Python, I tried to apply NLTK by trying to get tokens first and then extract the difference between the two lists, like this:
from nltk.tokenize import word_tokenize
df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))
And at this point, I can't find a function that gives me the result I want.
I hope you can help me. Thanks in advance.
Here's the R solution.
I've created a function exclusiveWords that finds the words unique to each of two sentences and returns a "sentence" of those words. I wrapped it in Vectorize() so that it works on all rows of the data.frame at once.
df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = F)
exclusiveWords <- function(x, y){
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  u <- union(x, y)
  u <- union(setdiff(u, x), setdiff(u, y))
  return(paste0(u, collapse = " "))
}
exclusiveWords <- Vectorize(exclusiveWords)
df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
# sentence1 sentence2 result
# 1 This is sentence one This is the sentence four the four one
# 2 This is sentence two This is the sentence five the five two
# 3 This is sentence three This is the sentence six the six three
Essentially the same as @SymbolixAU's answer, written as an apply function. It uses the split_Sentence1 and split_Sentence2 list columns created in the question:
df$Dif <- apply(df, 1, function(r) {
  paste(setdiff(union(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
                intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))),
        collapse = " ")
})
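The union-minus-intersection used above is exactly the symmetric difference computed in the Python answer below. A quick Python sketch (with example sentences from the question) confirming the identity:

```python
# Token sets for the first row of the example data.
x = set("This is sentence one".split())
y = set("This is the sentence four".split())

# Union minus intersection equals the symmetric difference (^).
diff = (x | y) - (x & y)
assert diff == x ^ y

print(" ".join(sorted(diff)))  # → four one the
```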
In Python, you can treat the words of each sentence as a set and compute the set-theoretic exclusive or (symmetric difference): the set of words that are in one sentence but not the other:
df.apply(lambda x:
         set(word_tokenize(x['sentence1']))
         ^ set(word_tokenize(x['sentence2'])), axis=1)
The result is a Series of sets.
#0 {one, the, four}
#1 {the, two, five}
#2 {the, three, six}
#dtype: object
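If you want the result as a space-separated string, as in the expected output, you can join each set back into a sentence. A minimal sketch: it uses plain str.split() in place of NLTK's word_tokenize (equivalent here since the example sentences contain no punctuation), and sorts the words for a deterministic order, since sets are unordered:

```python
import pandas as pd

df = pd.DataFrame({
    "sentence1": ["This is sentence one", "This is sentence two",
                  "This is sentence three"],
    "sentence2": ["This is the sentence four", "This is the sentence five",
                  "This is the sentence six"],
})

# Symmetric difference of the two token sets, joined into a single string.
# str.split() stands in for word_tokenize; with punctuation in the data,
# the NLTK tokenizer would be the safer choice.
df["Expected_Result"] = df.apply(
    lambda x: " ".join(sorted(set(x["sentence1"].split())
                              ^ set(x["sentence2"].split()))),
    axis=1,
)
```

Note that the word order within each result is alphabetical, not the order of appearance in the original sentences.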