Detecting similar paragraphs in two documents
I am trying to find similar paragraphs in two documents. Each document contains many paragraphs of several lines of text. The text in matching paragraphs differs slightly: words can be inserted, deleted, or changed. For example:
Doc1.Para
This is one line of text
Doc2.Para
This is one text text
Here you can see that a word is missing ('of') and another has changed ('line' became 'text'). So the paragraphs are not identical, but similar. The similarity is not based on semantics or meaning; it is based purely on the words.
The paragraphs are also not in the same order. For example:
Doc 1
Para 1
Para 2
Para 3
Para 4
Doc 2
Para 3
Para 4
Para 1.1
Para 2
Para 1.2
Here the order is not the same, and pairs can be split: Doc1.Para1 is divided into two paragraphs, Doc2.Para1.1 + Doc2.Para1.2.
I need to determine which paragraph in Doc1 corresponds to which paragraph in Doc2. I am looking for an open source tool or some kind of algorithm.
I have had success in the past using word embeddings to represent a paragraph. Word embeddings, such as those produced by Google's word2vec, model words in a high-dimensional vector space. This allows computing the semantic similarity between two words, for example as the cosine between their vectors. You can download pre-trained embeddings directly from the word2vec site or from related projects such as Polyglot.
To estimate the similarity between paragraphs, one simple solution is to compute a paragraph embedding as the weighted sum of the embeddings of all words in that paragraph. Since some words are more informative than others, you can weight each word's embedding by its tf-idf score, for example. You can then compute the similarity between two paragraphs as the cosine between their embeddings.
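A minimal sketch of this idea in Python (the toy vectors and idf table in the usage example below are made up for illustration; real embeddings would come from word2vec or a similar model):

```python
import math
from collections import Counter

def paragraph_embedding(words, vectors, idf):
    """Tf-idf-weighted sum of word vectors; out-of-vocabulary words are skipped."""
    tf = Counter(words)
    dim = len(next(iter(vectors.values())))
    emb = [0.0] * dim
    for word, count in tf.items():
        if word in vectors:
            weight = count * idf.get(word, 1.0)  # tf * idf, default idf 1.0
            for i, component in enumerate(vectors[word]):
                emb[i] += weight * component
    return emb

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Usage, with made-up 2-dimensional toy vectors:

```python
vectors = {"this": [0.1, 0.9], "is": [0.2, 0.8], "one": [0.5, 0.5],
           "line": [0.9, 0.1], "text": [0.8, 0.2], "of": [0.3, 0.3]}
idf = {"line": 2.0, "text": 2.0}
e1 = paragraph_embedding("this is one line of text".split(), vectors, idf)
e2 = paragraph_embedding("this is one text text".split(), vectors, idf)
score = cosine(e1, e2)
```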
A simple way to compare two paragraphs is to compare the texts directly using the Levenshtein distance algorithm. It measures the edit distance between two texts, and you can use whatever threshold suits you best.
For example, all pairs with more than 90% similarity could be considered the same.
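A minimal sketch in Python (the function names and the normalization into a percentage-style score are my own):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)),
    # keeping only the previous row of the DP table.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    # Normalized similarity in [0, 1]; 1.0 means identical texts.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

With a 90% threshold, a pair would be accepted when `similarity(p1, p2) >= 0.9`.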
There have been many shared tasks in the NLP community on textual similarity / entailment (STS 2015, 2014, 2013, RTE 2010, ...). This is the most recent one:
http://alt.qcri.org/semeval2015/task2/index.php?id=semantic-textual-similarity-for-english
Some of them release the submitted systems or baselines, which I believe you could also reuse for your task.
- You can also use Apache Spark 2.0, whose ML library includes many robust algorithms that can be applied to document similarity.
- Sample programs can be downloaded from this site.
You write that you are worried about time complexity: one of the common ways to avoid a slow all-pairs comparison of paragraphs is shingling.
In short, you
- create a set of sketches/fingerprints for every paragraph (run each paragraph through a set of hash functions),
- build a map from sketches to the paragraphs that have that sketch,
- invert this into a map from pairs of paragraphs to the number of shared sketches,
- filter with a suitable threshold, and
- go from pairs to groups by clustering with union-find.
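The steps above can be sketched in Python as a toy min-hash-style version of shingling (the word-shingle size, the number of hash functions, and the use of Python's built-in `hash` are my own simplifications; `hash` is salted per process, which is fine as long as all sketches are built in one run):

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    # All k-word shingles of a paragraph.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def sketch(text, k=3, num_hashes=16):
    # Min-hash sketch: for each seeded hash function, keep the
    # minimum hash value over all shingles of the paragraph.
    shingle_set = shingles(text, k)
    if not shingle_set:
        return ()
    return tuple(min(hash((seed, s)) for s in shingle_set)
                 for seed in range(num_hashes))

def similar_pairs(paragraphs, threshold=4, k=3, num_hashes=16):
    # Map each sketch entry to the paragraphs that produced it ...
    buckets = defaultdict(set)
    for idx, para in enumerate(paragraphs):
        for pos, h in enumerate(sketch(para, k, num_hashes)):
            buckets[(pos, h)].add(idx)
    # ... invert into shared-sketch counts per paragraph pair ...
    shared = defaultdict(int)
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1
    # ... and keep the pairs above the threshold.
    return [pair for pair, count in shared.items() if count >= threshold]

def group(n, pairs):
    # Union-find clustering of n paragraphs from the filtered pairs.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for i in range(n):
        clusters[find(i)].add(i)
    return list(clusters.values())
```

The key property is that candidate pairs are only generated for paragraphs that land in the same bucket, so unrelated paragraphs are never compared directly.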
See http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html for details. I have a simple OCaml implementation that might be helpful, but there are probably libraries in a bunch of other languages if you look around.