A modern method for large-scale near-duplicate document detection?
As far as I understand, the consensus in the NLP community is that the most effective method for finding near-duplicates in large-scale document collections (over 1 billion documents) is the one described here:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
which can be summarized as follows:
a) shingling the documents, b) min-hashing to obtain signatures of the shingle sets, c) locality-sensitive hashing to avoid computing pairwise similarity for all signatures, focusing instead only on pairs that land in the same buckets.
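The three steps can be sketched in plain Python. This is only an illustration of the pipeline from the MMDS chapter, not a production implementation; the function names, the use of `blake2b` as the hash family, and the 20-band / 5-row split are my own choices:

```python
import hashlib
import itertools
from collections import defaultdict

def shingles(text, k=5):
    """Step (a): split a document into overlapping k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    """Step (b): min-hashing -- for each of num_hashes seeded hash
    functions, keep the minimum hash value over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(seed.to_bytes(4, "big") + s.encode(),
                                digest_size=8).digest(), "big")
            for s in shingle_set))
    return sig

def lsh_candidates(signatures, bands=20, rows=5):
    """Step (c): LSH banding -- split each signature into `bands` bands of
    `rows` values; documents sharing any whole band fall into the same
    bucket and become a candidate pair (bands * rows = signature length)."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, c in itertools.combinations(sorted(ids), 2):
            pairs.add((a, c))
    return pairs

# Tiny demo: doc_b is doc_a plus one trailing character (near-duplicate).
doc_a = "the quick brown fox jumps over the lazy dog and runs off into the woods"
doc_b = doc_a + "."
doc_c = "completely different content about distributed joins and reduce tasks"
```

In a MapReduce/Spark setting, the bucket key `(band index, band values)` becomes the shuffle key, so only documents within a bucket are ever compared.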
I am ready to implement this algorithm in MapReduce or Spark, but since I am new to the field (I have only been reading about large-scale near-duplicate detection for two weeks) and the above chapter was published quite a few years ago, I am wondering whether this algorithm has known limitations, and whether there are better approaches that offer a more attractive trade-off between performance and complexity.
Thanks in advance!
Regarding step b), there have been some recent developments that greatly accelerate the computation of MinHash signatures:
- Optimal Densification for Fast and Accurate Minwise Hashing, 2017, https://arxiv.org/abs/1703.04664
- Fast Similarity Sketching, 2017, https://arxiv.org/abs/1704.04370
- SuperMinHash - A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, 2017, https://arxiv.org/abs/1706.05698
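To illustrate why these papers help: classic MinHash hashes every shingle once per signature position, costing O(|set| * num_hashes). One-permutation schemes, which the densification line of work builds on, hash each element only once and split the hash range into bins; empty bins are then filled in ("densified"). A simplified sketch of that idea, assuming a single `blake2b` hash and a naive rotation-based densification (the papers above give variants with better variance):

```python
import hashlib

def oph_signature(element_set, num_bins=32):
    """One-permutation hashing: hash every element ONCE, route it to a bin
    by the hash value, keep the minimum per bin. Cost is O(|set|) rather
    than O(|set| * num_hashes) for classic MinHash."""
    assert element_set, "need a non-empty set"
    bins = [None] * num_bins
    for s in element_set:
        h = int.from_bytes(
            hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
        b, v = h % num_bins, h // num_bins
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    # Naive densification: fill each empty bin from the next non-empty
    # bin to the right (circularly), so sparse sets still yield a full
    # signature instead of undefined positions.
    for i in range(num_bins):
        if bins[i] is None:
            j = (i + 1) % num_bins
            while bins[j] is None:
                j = (j + 1) % num_bins
            bins[i] = bins[j]
    return bins

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The estimate's variance shrinks as `num_bins` grows, at the cost of a longer signature; the densification scheme is what determines how well sparse sets behave.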