Finding the most similar documents (nearest neighbors) from a set of documents

I have 80,000 documents that cover a very large number of topics. For each article, I want to provide links recommending other articles (something like the 5 most similar) that are related to the one the user is currently reading. I'm not really interested in classifying documents, just in similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation / similarity?) to the other documents in the set.

I am currently using NLTK to process the document content and get n-grams, but from there I am not sure what approach I should take to calculate the similarity between documents.
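For reference, roughly the kind of preprocessing I mean, as a minimal sketch (the whitespace tokenization and the n-gram size here are just placeholders):

```python
from nltk.util import ngrams

def document_ngrams(text, n=2):
    """Tokenize a document and return its n-grams."""
    tokens = [t.lower() for t in text.split() if t.isalnum()]
    return list(ngrams(tokens, n))

print(document_ngrams("The quick brown fox jumps over the lazy dog"))
```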

I have read about using tf-idf and cosine similarity; however, because of the large number of topics I expect a very large number of unique tokens, so multiplying two very long vectors may be a bad approach. Also, 80,000 documents would require a lot of multiplications between vectors. (Admittedly, this would only need to be done once, so it is still an option.)

Is there a better way to get the distance between documents without building a huge vector of n-grams? Spearman correlation? Or would a more low-tech approach be more appropriate, for example taking the top n-grams of each document and finding other documents with the same n-grams among their top k? I just feel like I am probably solving the problem in the most brute-force way possible if I need to multiply vectors of maybe 10,000 elements together about 3.2 billion times (the sum of the arithmetic series 79,999 + 79,998 + ... + 1).

Any advice for approaches or what to read next is greatly appreciated.





2 answers


So you basically want to return the K nearest neighbors (K=5) for a given document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-Learn has good text importing and normalization routines (tf-idf), and from there it is pretty easy to implement KNN.

The heuristic is basically just to build normalized word-count vectors from all the words in each document and then compare the distance between the vectors. I would definitely try a few different distance metrics, e.g. Euclidean vs. Manhattan vs. cosine similarity. The vectors aren't really that long, they just sit in a high-dimensional space, so you can deal with the large number of unique words you mention simply by reducing the dimensionality with PCA or your favorite algorithm.



It is probably just as easy to do in another package, but the scikit-learn documentation is top notch and makes it easy to learn quickly and thoroughly.
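A minimal sketch of that pipeline with scikit-learn (the toy corpus, the number of SVD components, and the neighbor count are placeholder values; scale them up for a real 80,000-document collection):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

documents = [
    "the cat sat on the mat and purred",
    "a cat and a dog sat on the sofa",
    "stock markets fell sharply after the announcement",
    "investors worried as markets dropped again",
    "the recipe calls for flour eggs and butter",
    "bake the cake with flour sugar and eggs",
]

# Normalized tf-idf vectors for the whole corpus (a sparse matrix).
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)

# Optional dimensionality reduction. TruncatedSVD works directly on sparse
# tf-idf matrices; plain PCA would require densifying very long vectors.
X_reduced = TruncatedSVD(n_components=3).fit_transform(X)  # a few hundred in practice

# For each document, find its nearest neighbors by cosine distance.
# On the real corpus use n_neighbors=6: the closest match is the document
# itself, which leaves 5 recommendations.
knn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X_reduced)
distances, indices = knn.kneighbors(X_reduced)

for i, neighbors in enumerate(indices):
    print(i, "->", [j for j in neighbors if j != i])
```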





You should learn about hashing mechanisms that can be used to calculate the similarity between documents.

Typical hash functions are designed to minimize collisions and map near-duplicates to very different hash keys. With cryptographic hash functions, if the input changes by a single bit, the hash key changes completely.

The purpose of similarity hashing is the opposite: to build a hash function that preserves similarity. Hash methods for near-duplicate detection map very similar documents to very similar hash keys, or even to the same key, and the bit distance between keys is a measure of similarity.



After the hash keys have been computed, they can be sorted to speed up finding near-duplicates from O(n²) to O(n log n). A similarity threshold can then be determined and tuned by analyzing accuracy on training data.

SimHash, MinHash, and locality-sensitive hashing are three implementations of such hash methods. You can google them for more information; there are many scientific papers on the topic.
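For illustration, a minimal pure-Python SimHash sketch (the 64-bit width, MD5 token hashing, and the toy documents are placeholder assumptions, not a production implementation):

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a `bits`-wide SimHash fingerprint for a sequence of tokens."""
    counts = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints (small = similar)."""
    return bin(a ^ b).count("1")

docs = {
    "cat_1": "the cat sat on the mat and purred quietly".split(),
    "cat_2": "the cat sat on a mat and purred softly".split(),
    "finance": "stock markets fell sharply after the surprise announcement".split(),
}
keys = {name: simhash(tokens) for name, tokens in docs.items()}

# Near-duplicates should end up with a much smaller bit distance than
# unrelated documents.
print(hamming_distance(keys["cat_1"], keys["cat_2"]))    # typically small
print(hamming_distance(keys["cat_1"], keys["finance"]))  # typically much larger

# Sorting the keys lets near-duplicate candidates be found by scanning
# neighboring entries instead of comparing all O(n^2) pairs.
for name, key in sorted(keys.items(), key=lambda kv: kv[1]):
    print(format(key, "016x"), name)
```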









