R text mining - intersection of text fields

I was wondering if there is a quick way to find a directional intersection between two lines of text. For example:

 t1 <- "I have achieved my goals over the past 20 years and look forward for my next chalanges"
 t2 <- " have achieved goals and look my chalanges some other words bla bla"

      

t1 isContainedIn t2 should return 7, because 7 of the words that appear in t1 also appear in t2. Also, t1 and t2 are two columns in a data frame, so I need to apply this function across the entire data frame and bind the result as a new column to the original data frame. This is what the data frame "data.selected" looks like:

        keywords                                         title
1  Samsung UN48H6350 48" Samsung UN48H6350 48" Full 1080p Smart HDTV 120Hz with Wi-Fi +$50 Visa Gift Card
2  Samsung UN48H6350 48"     Samsung UN48H6350 48" Full HD Smart LED TV -Bundle- (See Below for Contents)
3  Samsung UN48H6350 48"      Samsung UN48H6350 48" Class Full HD Smart LED TV -BUNDLE- See below Details
4  Samsung UN48H6350 48"     Samsung UN48H6350 48" Full HD Smart LED TV With BD-H5100 Blu-ray Disc Player
5  Samsung UN48H6350 48"                 Samsung UN48H6350 48" Smart 1080p Clear Motion Rate 240 LED HDTV
6  Samsung UN48H6350 48"            Samsung UN48H6350 - 48-Inch Full HD 1080p Smart HDTV 120Hz with Wi-Fi
7  Samsung UN48H6350 48"               Samsung 6350 Series UN48H6350 48" 1080p HD LED LCD Internet TV NEW
8  Samsung UN48H6350 48"  Samsung Un48h6350af 75" 1080p Led-lcd Tv - 16:9 - Hdtv 1080p - (un75h6350afxza)
9  Samsung UN48H6350 48"                         Samsung UN48H6350 - 48" HD 1080p Smart HDTV 120Hz Bundle
10 Samsung UN48H6350 48"   Samsung UN48H6350 - 48-Inch Full HD 1080p Smart HDTV 120Hz with Wi-Fi, (R#416)

      



2 answers


I guess another similar way is to just use a simple match

string <- strsplit(c(t1, t2), "\\s+") # similar to @Richard
length(na.omit(match(string[[2]], string[[1]])))
## [1] 7

      



Or maybe, lapply

length(unlist(lapply(string[[2]], intersect, string[[1]])))
## [1] 7
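The question also asks for the count to be computed for every row of a data frame. A minimal sketch of one way to do that, assuming the columns are named keywords and title as in the printed "data.selected"; the helper name count_contained and the new column name n_common are hypothetical, and only three of the rows shown in the question are reproduced here:

```r
# Hypothetical helper: count the words of `a` that also occur in `b`.
# trimws() drops leading/trailing spaces so no empty token is produced.
count_contained <- function(a, b) {
  wa <- strsplit(trimws(a), "\\s+")[[1]]
  wb <- strsplit(trimws(b), "\\s+")[[1]]
  sum(wa %in% wb)
}

# Rows 1, 6 and 8 of the data frame shown in the question
data.selected <- data.frame(
  keywords = rep('Samsung UN48H6350 48"', 3),
  title = c(
    'Samsung UN48H6350 48" Full 1080p Smart HDTV 120Hz with Wi-Fi +$50 Visa Gift Card',
    'Samsung UN48H6350 - 48-Inch Full HD 1080p Smart HDTV 120Hz with Wi-Fi',
    'Samsung Un48h6350af 75" 1080p Led-lcd Tv - 16:9 - Hdtv 1080p - (un75h6350afxza)'
  ),
  stringsAsFactors = FALSE
)

# Apply the helper pairwise over the two columns and bind the result
data.selected$n_common <- mapply(count_contained,
                                 data.selected$keywords,
                                 data.selected$title,
                                 USE.NAMES = FALSE)
data.selected$n_common
## [1] 3 2 1
```

The comparison is case-sensitive and splits on whitespace only, so "Un48h6350af" and "48-Inch" do not match "UN48H6350" and 48". Lower-casing both sides with tolower() before splitting would be one way to relax that, if it suits the data.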

      



I don't quite understand what you mean by "directional". The length of the intersection does not change unless you change the data. This might be what you are looking for:

length(Reduce(intersect, strsplit(c(t1, t2), "\\s+")))
# [1] 7

      



If you switch c(t1, t2) to c(t2, t1), you will see a difference in the output of Reduce. But as I said, the length will still be the same; only the order of the elements differs.
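One case where direction may matter: intersect() works on sets, so repeated words are collapsed, whereas a count based on %in% (or match) counts every occurrence on the left-hand side. A short sketch using t1 and t2 from the question; the variable names w1 and w2 are just for illustration:

```r
t1 <- "I have achieved my goals over the past 20 years and look forward for my next chalanges"
t2 <- " have achieved goals and look my chalanges some other words bla bla"

# Split on whitespace; trimws() avoids an empty token from the leading space in t2
w1 <- strsplit(trimws(t1), "\\s+")[[1]]
w2 <- strsplit(trimws(t2), "\\s+")[[1]]

# Set intersection: symmetric, duplicates removed
length(intersect(w1, w2))
## [1] 7
length(intersect(w2, w1))
## [1] 7

# Occurrence count: depends on which side is being checked
sum(w1 %in% w2)   # words of t1 found in t2 ("my" occurs twice in t1)
## [1] 8
sum(w2 %in% w1)   # words of t2 found in t1
## [1] 7
```

If repeated words should count only once, wrapping the left-hand side in unique() makes both directions agree with the intersect() result.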
