A modern method for large-scale near-duplicate document detection?

As far as I understand, the consensus in the NLP community is that the most effective method for finding near-duplicates in large-scale document collections (over 1 billion documents) is the one described here:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

which can be summarized as follows:

a) shingling the documents, b) minhashing to obtain signatures of the shingle sets, c) locality-sensitive hashing (LSH), so that instead of computing pairwise similarities for all signatures, only the pairs that land in the same bucket are compared.
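To make sure I understand the pipeline, here is my rough, non-distributed sketch of the three steps in plain Python (the hash function, shingle length and band/row counts are arbitrary choices of mine, not values from the chapter):

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=9):
    """a) Shingling: the set of all k-character substrings of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _h(shingle, seed):
    """One cheap 64-bit hash per (seed, shingle); stands in for a random permutation."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingle_set, num_hashes=100):
    """b) MinHashing: one minimum per hash function; two documents agree at a given
    position with probability equal to the Jaccard similarity of their shingle sets."""
    return [min(_h(s, seed) for s in shingle_set) for seed in range(num_hashes)]

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """c) LSH: split each signature into `bands` bands of `rows` values; any two
    documents sharing a bucket in at least one band become a candidate pair."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {"d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumped over the lazy dog",
        "d3": "completely unrelated text about something else"}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
print(lsh_candidate_pairs(sigs))   # expect d1/d2 to show up as a candidate pair
```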

I am ready to implement this algorithm in MapReduce or Spark, but since I am new to the field (I have only been reading about large-scale near-duplicate detection for a couple of weeks) and the chapter above was written quite a few years ago, I am wondering: are the limitations of this algorithm well understood, and are there better approaches that offer a more attractive trade-off between performance and complexity?
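For reference, my current plan is to start from Spark's built-in MinHash-based LSH rather than hand-rolling everything. A minimal sketch of what I have in mind (the column names, the shingle representation and the 0.6 Jaccard-distance threshold are placeholders, not tuned values):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import HashingTF, MinHashLSH

spark = SparkSession.builder.appName("near-dup-detection").getOrCreate()

# Toy input: each document already reduced to a list of shingles.
docs = spark.createDataFrame(
    [(0, ["the quick", "quick brown", "brown fox"]),
     (1, ["the quick", "quick brown", "brown dog"]),
     (2, ["something", "entirely", "different"])],
    ["id", "shingles"])

# Shingle sets as sparse vectors; MinHashLSH treats non-zero entries as set members.
tf = HashingTF(inputCol="shingles", outputCol="features", numFeatures=1 << 18)
feats = tf.transform(docs)

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=20)
model = mh.fit(feats)

# Approximate self-join: candidate pairs with Jaccard distance below 0.6.
pairs = (model.approxSimilarityJoin(feats, feats, 0.6, distCol="jaccard_dist")
              .filter("datasetA.id < datasetB.id")
              .select(col("datasetA.id").alias("id_a"),
                      col("datasetB.id").alias("id_b"),
                      "jaccard_dist"))
pairs.show()
```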

Thanks in advance!

machine-learning nlp




1 answer


Regarding step b), there have been some recent developments that greatly speed up the computation of the MinHash signatures:



  • Optimal Densification for Fast and Accurate Minwise Hashing, 2017, https://arxiv.org/abs/1703.04664
  • Fast Similarity Sketching, 2017, https://arxiv.org/abs/1704.04370
  • SuperMinHash – A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, 2017, https://arxiv.org/abs/1706.05698
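Very roughly, the idea these papers build on is to replace the k independent hash functions of classic MinHash (cost about O(k · |set|) per document) with a single hash pass that drops each element into one of k bins, after which empty bins are "densified" from their neighbours. A crude sketch of that one-permutation + densification idea, not the bias-corrected schemes the papers themselves propose:

```python
import hashlib

def _h64(token):
    """Single 64-bit hash; the only hashing done per element."""
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")

def one_perm_signature(tokens, k=128):
    """One-permutation MinHash: one hash per element, k bins, then fill empty
    bins by borrowing from the next non-empty bin (densification by rotation)."""
    bins = [None] * k
    for t in tokens:
        h = _h64(t)
        b = h % k                      # the bin this element lands in
        if bins[b] is None or h < bins[b]:
            bins[b] = h
    if all(v is None for v in bins):   # empty input set: nothing to densify
        raise ValueError("empty token set")
    sig = []
    for i in range(k):
        j = i
        while bins[j] is None:         # walk right (circularly) to a filled bin
            j = (j + 1) % k
        sig.append(bins[j])
    return sig

# Signatures are compared position-wise, exactly like classic MinHash signatures.
a = one_perm_signature({"the quick", "quick brown", "brown fox", "fox jumps"})
b = one_perm_signature({"the quick", "quick brown", "brown fox", "fox leaps"})
print(sum(x == y for x, y in zip(a, b)) / len(a))   # rough Jaccard estimate
```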








