Tools to identify near-duplicate documents
I am working on an NLP project, and identifying near-duplicate documents is part of it. Can anyone with experience in this area suggest tools (implementations, like Weka) for detecting near-duplicates?
The project aims to compile a statistical report on crime by analysing news articles from several local English-language newspapers. Crime-related articles are classified first. Duplicate articles must then be identified and merged. The data set may contain about 1000 crime-related articles in which near-duplicates have to be detected.
Here I define near-duplicates as articles that describe the same criminal incident. Sometimes different news articles report the same incident. Also, the same newspaper may publish articles about the same story on different days.
The time taken for detection is not an issue, since this is not online processing. Accuracy is very important here.
Thanks in advance.
While the concept of duplicate content (DC) is fairly simple, the concept of near-duplicate content (NDC) can be problematic.
For example, do you consider documents about the same event (e.g. news articles from different sources) to be NDC? Or do you consider documents that show the same syntactic patterns (like weather forecasts) to be NDC?
Given your purpose, I think you are more interested in the former definition of NDC, but it should be stated more precisely.
As a first experiment, you can try OnIOn ( https://code.google.com/p/onion/ ), a tool designed to detect DC / NDC. However, given the size of your corpus (which is small), you might want to implement your own NDC removal system based on your own definition of NDC. For that, I would suggest reading the original work of Broder et al. ( http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf ) to give you some ideas.
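To make the idea concrete, here is a minimal Python sketch of shingle-based resemblance, the kind of measure usually associated with Broder et al.: each document becomes a set of contiguous word k-grams ("shingles"), and two documents are compared by the Jaccard overlap of those sets. The function names, the shingle size k=3 and the 0.5 threshold are my own assumptions for illustration, not values from the paper or from OnIOn. Note that this measures lexical overlap only, so it will catch reprinted or lightly edited articles but not necessarily two independently written reports of the same incident.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Return index pairs (i, j) whose resemblance is at least threshold.

    Compares all pairs, which is fine for about 1000 documents
    (roughly 500,000 comparisons) but does not scale to large corpora.
    """
    sets = [shingles(d, k) for d in docs]  # shingle each document once
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        union = sets[i] | sets[j]
        if union and len(sets[i] & sets[j]) / len(union) >= threshold:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    articles = [
        "The man was arrested on Monday night",
        "The man was arrested on Monday evening",
        "Rain expected tomorrow across the region",
    ]
    print(near_duplicate_pairs(articles))
```

Since accuracy matters more than speed in your case, an exhaustive pairwise comparison like this is probably acceptable; the threshold would have to be tuned on a hand-labelled sample of your articles.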