Count common words in two lines
I have two lines:
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
I want to get the number of common words between these lines.
The answer should be 3.
- "Roy"
- "travels"
- "Africa"
This is what I tried:
stra <- as.data.frame(t(read.table(textConnection(a), sep = " ")))
strb <- as.data.frame(t(read.table(textConnection(b), sep = " ")))
Taking unique values to avoid double-counting:
stra_unique <- as.data.frame(unique(stra$V1))
strb_unique <- as.data.frame(unique(strb$V1))
colnames(stra_unique) <- c("V1")
colnames(strb_unique) <- c("V1")
common_words <- length(merge(stra_unique, strb_unique, by = "V1")$V1)
I need this for two datasets with over 2000 and 1200 rows respectively, so the count has to be computed for 2000 × 1200 pairs in total. Is there a quick way to do this without using loops?
Perhaps use intersect with str_extract_all. For multiple strings, you can put them in a list or a vector:
library(stringr)
vec1 <- c(a, b)
Reduce(intersect, str_extract_all(vec1, "\\w+"))
#[1] "Roy" "travels" "Africa"
For better performance, consider stringi:
library(stringi)
Reduce(intersect, stri_extract_all_regex(vec1, "\\w+"))
#[1] "Roy" "travels" "Africa"
To count:
length(Reduce(intersect, stri_extract_all_regex(vec1, "\\w+")))
#[1] 3
Or using base R:
Reduce(intersect, regmatches(vec1, gregexpr("\\w+", vec1)))
#[1] "Roy" "travels" "Africa"
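For the full 2000 × 1200 comparison, one option is to tokenise every row once up front and then count intersections pairwise with outer(). A base-R sketch, assuming the rows live in two character vectors `rows_a` and `rows_b` (hypothetical names, shown here with toy data):

```r
# Hypothetical inputs: each element is one row of a dataset
rows_a <- c("Roy lives in Japan and travels to Africa",
            "Alice codes in R")
rows_b <- c("Roy travels Africa with this wife",
            "Alice travels to Japan")

# Tokenise each row once, up front
words_a <- regmatches(rows_a, gregexpr("\\w+", rows_a))
words_b <- regmatches(rows_b, gregexpr("\\w+", rows_b))

# Pairwise counts of common words:
# a length(rows_a) x length(rows_b) matrix
common <- outer(seq_along(words_a), seq_along(words_b),
                Vectorize(function(i, j)
                  length(intersect(words_a[[i]], words_b[[j]]))))
common[1, 1]
#[1] 3
```

Note that outer() with Vectorize() still iterates internally (via mapply), so this avoids explicit loops in your code rather than the underlying work; the win comes from tokenising each row only once instead of 2000 × 1200 times.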
This approach generalizes to n vectors:
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c <- "Bob also travels Africa for trips but lives in the US unlike Roy."
library(stringi); library(qdapTools)
X <- stri_extract_all_words(list(a, b, c))
X <- mtabulate(X) > 0
Y <- colSums(X) == nrow(X); names(Y)[Y]
#[1] "Africa" "Roy"     "travels"
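Since the question ultimately asks for a count, note that with any of these approaches the number of shared words is just the length of the final intersection. A base-R sketch over the same three strings:

```r
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
c <- "Bob also travels Africa for trips but lives in the US unlike Roy."

# Tokenise all strings, intersect across them, then count
words <- regmatches(c(a, b, c), gregexpr("\\w+", c(a, b, c)))
length(Reduce(intersect, words))
#[1] 3
```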