A modern method for large-scale near-duplicate document detection?
As far as I understand, the consensus in the NLP community is that the most effective method for finding near-duplicates in large-scale document collections (over 1 billion documents) is the one described here:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
which can be summarized as follows:
a) shingling the documents, b) min-hashing to obtain signatures of the shingle sets, c) locality-sensitive hashing to avoid computing pairwise similarity for all signatures, focusing instead only on pairs that land in the same buckets.
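The three steps can be sketched in plain Python. This is only an illustration of the pipeline from the MMDS chapter, not a production implementation; the function names, the use of `blake2b` as the hash family, and the 20-band / 5-row split are my own choices:

```python
import hashlib
import itertools
from collections import defaultdict

def shingles(text, k=5):
    """Step (a): split a document into overlapping k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    """Step (b): min-hashing -- for each of num_hashes seeded hash
    functions, keep the minimum hash value over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(seed.to_bytes(4, "big") + s.encode(),
                                digest_size=8).digest(), "big")
            for s in shingle_set))
    return sig

def lsh_candidates(signatures, bands=20, rows=5):
    """Step (c): LSH banding -- split each signature into `bands` bands of
    `rows` values; documents sharing any whole band fall into the same
    bucket and become a candidate pair (bands * rows = signature length)."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, c in itertools.combinations(sorted(ids), 2):
            pairs.add((a, c))
    return pairs

# Tiny demo: doc_b is doc_a plus one trailing character (near-duplicate).
doc_a = "the quick brown fox jumps over the lazy dog and runs off into the woods"
doc_b = doc_a + "."
doc_c = "completely different content about distributed joins and reduce tasks"
```

In a MapReduce/Spark setting, the bucket key `(band index, band values)` becomes the shuffle key, so only documents within a bucket are ever compared.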
I am ready to implement this algorithm in MapReduce or Spark, but since I am new to the field (I have only been reading about large-scale near-duplicate detection for two weeks) and the above chapter was published quite a few years ago, I am wondering whether this algorithm has known limitations, and whether there are better approaches that offer a more attractive trade-off between performance and complexity.
Thanks in advance!
Regarding step b), there have been some recent developments that greatly accelerate the computation of MinHash signatures:
- Optimal Densification for Fast and Accurate Minwise Hashing, 2017, https://arxiv.org/abs/1703.04664
- Fast Similarity Sketching, 2017, https://arxiv.org/abs/1704.04370
- SuperMinHash - A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, 2017, https://arxiv.org/abs/1706.05698
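To illustrate why these papers help: classic MinHash hashes every shingle once per signature position, costing O(|set| * num_hashes). One-permutation schemes, which the densification line of work builds on, hash each element only once and split the hash range into bins; empty bins are then filled in ("densified"). A simplified sketch of that idea, assuming a single `blake2b` hash and a naive rotation-based densification (the papers above give variants with better variance):

```python
import hashlib

def oph_signature(element_set, num_bins=32):
    """One-permutation hashing: hash every element ONCE, route it to a bin
    by the hash value, keep the minimum per bin. Cost is O(|set|) rather
    than O(|set| * num_hashes) for classic MinHash."""
    assert element_set, "need a non-empty set"
    bins = [None] * num_bins
    for s in element_set:
        h = int.from_bytes(
            hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
        b, v = h % num_bins, h // num_bins
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    # Naive densification: fill each empty bin from the next non-empty
    # bin to the right (circularly), so sparse sets still yield a full
    # signature instead of undefined positions.
    for i in range(num_bins):
        if bins[i] is None:
            j = (i + 1) % num_bins
            while bins[j] is None:
                j = (j + 1) % num_bins
            bins[i] = bins[j]
    return bins

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The estimate's variance shrinks as `num_bins` grows, at the cost of a longer signature; the densification scheme is what determines how well sparse sets behave.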