How can I tell if two web contents are similar?

Question

Given two HTML sources, I want to first extract the main content from each using something like this. Are there any better libraries? I'm specifically looking for Python or JavaScript.

Once I have the two extracted contents, I want to return a score between 0 and 1 indicating how similar they are. For example, news articles on the same topic from CNN and BBC should get a high similarity score, and so should webpages describing the same product on Amazon.com and Walmart.com. How can I do this? Are there existing libraries I can use? I am mainly looking for a combination of automatic summarization, keyword extraction, named entity recognition, and sentiment analysis.

python machine-learning nlp semantic-analysis text-mining

pathikrit 05 Apr 12 at 20:09

1 answer

Yavar · Accepted Answer · 2012-04-05T20:36:44+0000

There are many things in your question. For each one I will try to point you to a library, or else suggest algorithms that can solve your problem (you can google them and find many Python implementations).

Point 1. To extract the main content from HTML (http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html) and for other NLP-related tasks, check out NLTK. It's written in Python. You can also check out a library called BeautifulSoup; it's amazing (http://www.crummy.com/software/BeautifulSoup/).
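As a rough illustration of the extraction step, here is a minimal sketch that uses only Python's standard `html.parser` module as a dependency-free stand-in for what BeautifulSoup does more robustly. The class name `TextExtractor`, the helper `extract_text`, and the set of skipped tags are assumptions made for this example; real pages usually need smarter boilerplate removal than dropping a few tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping tags that rarely hold main content."""

    # Assumption for this sketch: these tags are treated as non-content.
    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With BeautifulSoup installed, the rough equivalent would be to parse the page, remove the unwanted tags, and call `get_text()` on the result.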

Point 2. When you say:

Once I have two retrieved contents, I want to return a score between 0 and 1, indicating how similar they are ....

To do this, I suggest clustering your document set with an unsupervised technique. Since your problem is a case of distance-based clustering, it should be straightforward to group similar documents and then score each one by its similarity to the centroid of its cluster. Try either K-Means or Adaptive Resonance Theory; with the latter, you do not need to fix the number of clusters in advance. Or, as Larsman points out in his comments, you can just use TF-IDF (http://www.miislita.com/term-vector/term-vector-3.html).
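To make the TF-IDF suggestion concrete, here is a minimal sketch in plain Python that weights terms by TF-IDF and compares two texts with cosine similarity, which yields a score between 0 and 1 for non-negative weights. The function names, the simple regex tokenizer, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for this example, not something the answer prescribes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch) so that terms shared
    # by every document still get a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b):
    """Score two texts: 0 means no shared terms, 1 means same term profile."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine_similarity(vec_a, vec_b)
```

Computing IDF over only the two documents being compared is a simplification. The weights become more informative when document frequencies come from a larger corpus of pages, and stop-word removal or stemming (both available in NLTK) would likely improve the scores further.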

Point 3. When you say:

I am mainly looking for a combination of automatic summarization, keyword extraction, named entity recognition, and sentiment analysis

For automatic summarization, use non-negative matrix factorization.

For keyword extraction, use NLTK.

For named entity recognition, use NLTK.

For sentiment analysis, use NLTK.
