A modern method for large-scale near-duplicate detection of documents?

As far as I understand, the scientific consensus in NLP is that the most effective method of finding near-duplicates in large-scale scientific document collections (over 1 billion documents) is the one found here:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

which can be summarized as follows:

a) shingling the documents into sets of k-grams, b) minhashing the shingle sets to obtain compact signatures, and c) locality-sensitive hashing to avoid computing pairwise similarities between all signatures, considering only the pairs that fall into the same bucket.
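For concreteness, here is a minimal sketch of that pipeline in plain Python. All parameters are toy placeholders, not recommendations: 5-character shingles, 100 hash functions simulated by salting MD5, and 20 bands of 5 rows.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=5):
    """a) Shingling: the set of all k-character substrings of the document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    """b) Minhashing: for each 'hash function' (a salted MD5 here),
    keep the minimum hash value over the document's shingles."""
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for i in range(num_hashes)
    ]

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """c) LSH: split each signature into bands; documents whose band
    values collide share a bucket and become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            buckets[(band, tuple(sig[band * rows:(band + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over the lazy dog",
    "d3": "an entirely different sentence about other things",
}
sigs = {d: minhash_signature(shingles(t)) for d, t in docs.items()}
print(lsh_candidate_pairs(sigs))  # d1 and d2 should land in a shared bucket
```

With 20 bands of 5 rows, the banding analysis in the chapter above puts the candidate threshold at roughly (1/20)^(1/5) ≈ 0.55 Jaccard similarity.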

I am ready to implement this algorithm in MapReduce or Spark, but since I am new to the field (I have only been reading about large-scale near-duplicate detection for two weeks) and the chapter above was published quite a few years ago, I am wondering whether this algorithm has known limitations, and whether there are better approaches that offer a more attractive trade-off between performance and complexity.
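In case it helps to make the question concrete, this is roughly what I had in mind for the Spark route, using the MinHashLSH estimator that ships with Spark MLlib. The 6-dimensional shingle vocabulary, the column names, and the 0.6 distance threshold are toy placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("near-dup-sketch").getOrCreate()

# One row per document: an id and a sparse binary vector marking which
# shingles (out of a tiny vocabulary of 6) the document contains.
df = spark.createDataFrame([
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [0, 1, 3], [1.0, 1.0, 1.0])),
    (2, Vectors.sparse(6, [4, 5], [1.0, 1.0])),
], ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)

# Self-join: candidate pairs whose Jaccard *distance* is below 0.6.
pairs = model.approxSimilarityJoin(df, df, 0.6, distCol="jaccard_dist")
pairs.filter("datasetA.id < datasetB.id").show()
```

Note that approxSimilarityJoin thresholds on Jaccard distance (1 minus Jaccard similarity), so 0.6 keeps pairs with similarity above 0.4.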

Thanks in advance!


1 answer


Regarding the second step, (b), there have been some recent developments that greatly accelerate the computation of the signatures.
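One example of this line of work is one-permutation hashing: instead of applying k independent hash functions to every shingle, each shingle is hashed once and the hash range is partitioned into k bins, keeping the minimum per bin. A rough sketch of the idea (32-bit hashes assumed; empty bins are left as None here, whereas real implementations fill them with a densification step):

```python
import hashlib

def one_permutation_signature(shingle_set, k=100, hash_bits=32):
    """Hash each shingle once, partition the hash range into k bins,
    and keep the minimum hash seen in each bin. Empty bins stay None;
    real implementations fill them via a densification step."""
    bin_size = (1 << hash_bits) // k
    sig = [None] * k
    for s in shingle_set:
        h = int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << hash_bits)
        b = min(h // bin_size, k - 1)
        if sig[b] is None or h < sig[b]:
            sig[b] = h
    return sig
```

This turns the k passes of classical minhashing into a single pass over each document's shingles.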


