Count common words in two lines

Question

I have two lines:

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"

I want to get the number of common words between these lines.

The answer should be 3, because the common words are "Roy", "travels", and "Africa".

This is what I tried:

stra <- as.data.frame(t(read.table(textConnection(a), sep = " ")))
strb <- as.data.frame(t(read.table(textConnection(b), sep = " ")))

Take unique words to avoid double counting:

stra_unique <- as.data.frame(unique(stra$V1))
strb_unique <- as.data.frame(unique(strb$V1))
colnames(stra_unique) <- c("V1")
colnames(strb_unique) <- c("V1")

common_words <- length(merge(stra_unique, strb_unique, by = "V1")$V1)

I need this for two datasets with over 2000 and 1200 rows respectively, so the count has to be computed 2000 x 1200 times. Is there a quick way to do this without using loops?

string r data-analysis text-mining

Jaimik Jain 19 Sep '14 at 9:22

3 answers

Alex reynolds · Answer 1 · 2014-09-19T09:30:47+0000

You can use strsplit and intersect from the base package:

> a <- "Roy lives in Japan and travels to Africa"
> b <- "Roy travels Africa with this wife"
> a_split <- unlist(strsplit(a, split=" "))
> b_split <- unlist(strsplit(b, split=" "))
> length(intersect(a_split, b_split))
[1] 3
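
The question also asks for every pairing between two sets of lines. One possible way to extend the strsplit/intersect idea, as a minimal sketch: split each line once, then let outer() visit every pair. The vectors lines_a and lines_b and their contents are made up for illustration, and Vectorize() still loops internally, so this hides the loop rather than removing it.

```r
# Hypothetical example data: two character vectors of lines
lines_a <- c("Roy lives in Japan and travels to Africa",
             "Bob travels to Japan")
lines_b <- c("Roy travels Africa with this wife",
             "Japan is far from Africa",
             "nothing shared here")

# Split each line once and keep only the unique words per line
words_a <- lapply(strsplit(lines_a, " ", fixed = TRUE), unique)
words_b <- lapply(strsplit(lines_b, " ", fixed = TRUE), unique)

# counts[i, j] = number of words shared by lines_a[i] and lines_b[j]
count_pair <- function(i, j) length(intersect(words_a[[i]], words_b[[j]]))
counts <- outer(seq_along(words_a), seq_along(words_b), Vectorize(count_pair))

counts  # counts[1, 1] is 3 for the two lines from the question
```

The result has one row per element of lines_a and one column per element of lines_b.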

akrun · Answer 2 · 2014-09-19T09:25:44+0000

Perhaps using intersect and str_extract_all from stringr. For multiple strings, you can put them in a list or a vector:

 library(stringr)
 vec1 <- c(a,b)
 Reduce(`intersect`,str_extract_all(vec1, "\\w+"))
 #[1] "Roy"     "travels" "Africa"

For faster options, consider stringi:

 library(stringi)
 Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))
 #[1] "Roy"     "travels" "Africa"

To count:

 length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")))
 #[1] 3

Or using base R

  Reduce(`intersect`,regmatches(vec1,gregexpr("\\w+", vec1)))
  #[1] "Roy"     "travels" "Africa"
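
The choice of tokenizer matters once punctuation appears. As a small illustration (the strings x and y are invented here), splitting on spaces keeps a trailing period attached to the word, while the "\\w+" pattern drops it:

```r
x <- "Bob travels with Roy."
y <- "Roy travels Africa"

# Splitting on single spaces leaves "Roy." with its period,
# so only "travels" is found in both strings
by_space <- intersect(unlist(strsplit(x, " ", fixed = TRUE)),
                      unlist(strsplit(y, " ", fixed = TRUE)))

# Extracting runs of word characters ignores the period,
# so "Roy" is matched as well
vec <- c(x, y)
by_regex <- Reduce(intersect, regmatches(vec, gregexpr("\\w+", vec)))

length(by_space)  # 1
length(by_regex)  # 2
```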

Tyler rinker · Answer 3 · 2016-01-29T13:09:06+0000

This approach is generalized to n vectors:

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c <- "Bob also travels Africa for trips but lives in the US unlike Roy."

library(stringi);library(qdapTools)
X <- stri_extract_all_words(list(a, b, c))
X <- mtabulate(X) > 0
Y <- colSums(X) == nrow(X); names(Y)[Y]

[1] "Africa"  "Roy"     "travels"
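
The presence/absence matrix idea can also be pointed at the original pairwise problem. A rough base R sketch, assuming two hypothetical character vectors lines_a and lines_b: build one 0/1 word-by-line matrix per set over a shared vocabulary, and a single matrix product then gives the shared-word count for every pair. The helper names tokenize and incidence are invented for this example.

```r
# Hypothetical inputs, reusing the three sentences from this answer
lines_a <- c("Roy lives in Japan and travels to Africa",
             "Roy travels Africa with this wife")
lines_b <- c("Bob also travels Africa for trips but lives in the US unlike Roy.")

# Unique words per line, extracted with the "\\w+" pattern
tokenize <- function(x) lapply(regmatches(x, gregexpr("\\w+", x)), unique)
words_a <- tokenize(lines_a)
words_b <- tokenize(lines_b)

# Shared vocabulary across both sets of lines
vocab <- unique(unlist(c(words_a, words_b)))

# 0/1 matrix with one row per vocabulary word and one column per line
incidence <- function(words) {
  hits <- vapply(words, function(w) vocab %in% w, logical(length(vocab)))
  matrix(as.numeric(hits), nrow = length(vocab))
}

# t(A) %*% B: entry [i, j] counts words present in both lines_a[i] and lines_b[j]
counts <- crossprod(incidence(words_a), incidence(words_b))
counts
```

Here counts has one row per element of lines_a and one column per element of lines_b; the dense matrices grow with the vocabulary size, so this is only a starting point for large inputs.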
