Detecting similar paragraphs in two documents

I am trying to find similar paragraphs in two documents. Each document contains many paragraphs of several lines of text. The text of matching paragraphs differs slightly: words can be inserted, deleted, or changed. For example:

Doc1.Para

This is one line of text

Doc2.Para

This is one text text

Here you can see that one word is missing ('of') and another has changed ('line' became 'text'). Hence the paragraphs are not identical, but analogous. The similarity is not based on semantics or meaning; it is based purely on the words.

The paragraphs are not in the same order either. For example:

Doc 1

Para 1
Para 2
Para 3
Para 4

Doc 2

Para 3
Para 4
Para 1.1
Para 2
Para 1.2

Here you can see that the order is not the same. Also, paragraphs can be split: Doc1.Para1 is divided into two paragraphs, Doc2.Para1.1 + Doc2.Para1.2.

I need to determine which paragraph in Doc1 corresponds to which in Doc2. I am looking for an open source tool or some kind of algorithm.



6 answers


I have had success in the past using word embeddings to represent a paragraph. Word embeddings, such as those produced by Google's word2vec, model words as vectors in a high-dimensional space. They thus allow computing the semantic similarity between two words, e.g. as the cosine between their vectors. You can download pre-trained embeddings directly from the word2vec site or from related projects such as Polyglot.



To model the similarity between paragraphs, one simple solution is to compute a paragraph embedding as the weighted sum of the embeddings of all words in that paragraph. Since some words are more informative than others, you can weight each word's embedding by its tf-idf score, for example. Then you can compute the similarity between two paragraphs as the cosine between their embeddings.
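A rough sketch of this approach. The 2-dimensional toy vectors and the idf weights below are made up purely for illustration; in practice you would load real pre-trained word2vec vectors:

```python
import math

# Toy word vectors standing in for real word2vec embeddings
# (assumption: real vectors would have hundreds of dimensions).
VECTORS = {
    "this": [0.1, 0.3], "is": [0.2, 0.1], "one": [0.4, 0.2],
    "line": [0.9, 0.1], "of": [0.1, 0.1], "text": [0.8, 0.3],
}

def paragraph_embedding(words, idf):
    """Weighted average of word vectors, weighted by (toy) idf scores."""
    dim = len(next(iter(VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        if w in VECTORS:
            weight = idf.get(w, 1.0)
            total = [t + weight * v for t, v in zip(total, VECTORS[w])]
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Made-up idf weights: rare, informative words get larger weights.
idf = {"this": 0.2, "is": 0.2, "of": 0.3, "one": 0.5, "line": 1.5, "text": 1.2}
p1 = "this is one line of text".split()
p2 = "this is one text text".split()
sim = cosine(paragraph_embedding(p1, idf), paragraph_embedding(p2, idf))
```

The two paragraphs from the question then compare as a single cosine value; you would keep pairs whose similarity exceeds some threshold.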



A simple way to compare two paragraphs purely by their text is the Levenshtein distance algorithm. It measures the edit distance between two texts, and you can apply whatever threshold suits you best.



For example, two paragraphs with more than 90% similarity could be considered the same.
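A minimal sketch in Python. The `similarity` helper (a name chosen here, not a standard API) normalises the edit distance by the longer string's length, one common convention:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalise the distance into a 0..1 similarity ratio."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

s = similarity("This is one line of text", "This is one text text")
```

With a 90% threshold you would then accept a pair whenever `similarity(p1, p2) > 0.9`.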



There have been many shared tasks in the NLP community on textual similarity / entailment (STS 2015, 2014, 2013, RTE 2010, ...). This is the most recent one:

http://alt.qcri.org/semeval2015/task2/index.php?id=semantic-textual-similarity-for-english

Some of them release the submitted systems or baselines for public use, which I believe you could reuse for your task. Have a look at this:

http://ixa2.si.ehu.es/stswiki/index.php/Main_Page



  • You can also use Apache Spark 2.0, whose rich ML library includes many robust built-in algorithms relevant to document similarity.

  • Sample programs can be downloaded from this site .



You write that you are worried about time complexity; one common way to avoid a slow pairwise comparison of all paragraphs is shingling.

In short, you:

  • create a sketch/fingerprint set for each paragraph (run each paragraph through a set of hash functions),
  • build a map from sketches to the paragraphs that contain that sketch,
  • invert this into a map from pairs of paragraphs to the number of sketches they share,
  • filter by a suitable threshold, and
  • go from pairs to groups by clustering with union-find.

See http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html for details. I have a simple OCaml implementation that might be helpful, but there are probably libraries in a bunch of other languages if you prefer one of those.
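A minimal Python sketch of the first few steps, using k-word shingles as the sketch set (the final union-find clustering step and the hashing/MinHash compression are omitted for brevity; the paragraph names and texts are invented):

```python
from collections import defaultdict
from itertools import combinations

def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles: the paragraph's 'sketch'."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similar_pairs(paragraphs: dict, threshold: int = 2) -> set:
    """Map shingles -> paragraphs, then count shared shingles per pair."""
    by_shingle = defaultdict(set)
    for name, text in paragraphs.items():
        for s in shingles(text):
            by_shingle[s].add(name)
    shared = defaultdict(int)
    for names in by_shingle.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    # Keep only pairs that share at least `threshold` shingles.
    return {pair for pair, n in shared.items() if n >= threshold}

paras = {
    "doc1.p1": "the quick brown fox jumps over the lazy dog",
    "doc2.p3": "the quick brown fox leaps over the lazy dog",
    "doc2.p7": "completely different sentence about something else",
}
pairs = similar_pairs(paras)
```

Because candidate pairs are generated only through shared shingles, paragraphs with nothing in common are never compared at all, which is what avoids the quadratic all-pairs scan.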



Your question is essentially about plagiarism (near-duplicate) detection in natural language processing. One option is a fingerprinting algorithm. Since the order of the paragraphs does not matter, I suggest the following paper:

Winnowing: local algorithms for document fingerprinting
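A rough sketch of the winnowing idea from that paper: hash all character k-grams of the normalised text, then keep the minimum hash in each sliding window of w consecutive hashes. The parameter values k=5, w=4 are illustrative, not recommendations:

```python
def winnow(text: str, k: int = 5, w: int = 4) -> set:
    """Winnowing fingerprint: hash every k-gram, then keep the minimum
    hash in each sliding window of w consecutive hashes."""
    s = "".join(text.lower().split())  # strip whitespace, normalise case
    hashes = [hash(s[i:i + k]) for i in range(len(s) - k + 1)]
    if len(hashes) <= w:
        return set(hashes)
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        fingerprints.add(min(hashes[i:i + w]))
    return fingerprints

fp1 = winnow("This is one line of text")
fp2 = winnow("This is one text text")
# Jaccard overlap of the two fingerprint sets as a similarity score.
overlap = len(fp1 & fp2) / max(len(fp1 | fp2), 1)
```

The fingerprints are independent of paragraph order, so you can compare any paragraph of Doc1 against any paragraph of Doc2 by set overlap. (Note that Python's built-in `hash` is randomised across runs; for persistent fingerprints you would use a stable hash instead.)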







