Naive Bayes classifier using python
I am using scikit-learn to compute the Tf-idf weights of documents and then a Naive Bayes classifier to classify the text. But the Tf-idf weights of almost all words in the documents come out negative, except for a few. As far as I know, negative values mean unimportant terms. Do I need to pass all the Tf-idf values to the Bayes classifier? If I only need to pass a few of them, how can I do that? Also, how much better or worse is a Naive Bayes classifier compared to a linear SVC? Is there a better way to find tags in text than using Tf-idf?
Thanks.
You have a lot of questions, but I'll try to help.
As far as I remember, TF-IDF should not be negative. TF is the term frequency (how often the term appears in a particular document), and IDF is the inverse document frequency (the number of documents in the corpus divided by the number of documents that contain the term). The IDF part is then usually log-weighted, and a 1 is often added to the denominator to avoid division by zero. Hence the only time you get a negative tf * idf is if the term appears in every document of the corpus (which is not very useful for searching, as you mentioned, since it adds no information). I would double-check your algorithm.
given term t, document d, corpus c:
tfidf = term freq * log(document count / (document frequency + 1))
tfidf = [# of t in d] * log([#d in c] / ([#d with t in c] + 1))
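As a sanity check, the formula above can be computed directly in plain Python; the tiny corpus here is made up purely for illustration:

```python
from math import log10

def tfidf(term, doc, corpus):
    """tf-idf per the formula above: tf * log(N / (df + 1)).
    `doc` is a list of tokens, `corpus` a list of such token lists."""
    tf = doc.count(term)                        # of t in d
    df = sum(1 for d in corpus if term in d)    # of d with t in c
    return tf * log10(len(corpus) / (df + 1))

corpus = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
# "cat" is in 2 of 3 docs: log10(3 / 3) = 0, so its weight is 0
# "sat" is in 1 of 3 docs: log10(3 / 2) > 0, so its weight is positive
print(tfidf("cat", corpus[0], corpus), tfidf("sat", corpus[0], corpus))
```

You only get a negative weight when the term occurs in every document, so that `df + 1` exceeds the corpus size, which matches the point above.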
In machine learning, Naive Bayes and SVMs are both good tools; their quality will vary with the application, and I've done projects where their accuracy was comparable. Naive Bayes is generally easy enough to implement by hand, so I'd try that first before reaching for the SVM libraries.
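To make the comparison concrete, here is a minimal sketch that fits both classifiers on the same tf-idf features with scikit-learn; the tiny document set and labels are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = ["free money now", "meeting at noon",
        "win cash prizes", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# fit_transform produces non-negative tf-idf weights
X = TfidfVectorizer().fit_transform(docs)

# both classifiers accept the same sparse feature matrix
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```

On a real task you would of course evaluate both on held-out data (e.g. with `cross_val_score`) rather than on the training documents.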
I may be missing something, and I'm not entirely sure I know exactly what you are looking for; tell me and I'll gladly update my answer.
This bug has been fixed upstream. Beware that the text vectorizer API has also changed, to make tokenization easier to customize.
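For reference, with the current scikit-learn API the tf-idf weights come out non-negative; a minimal check, assuming a recent scikit-learn release:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is the current entry point; older releases exposed a
# different vectorizer interface, which is the API change mentioned above.
vec = TfidfVectorizer()
X = vec.fit_transform(["the cat sat", "the dog ran", "the cat ran"])
# scikit-learn's smoothed idf is log(...) + 1, so no weight can go negative
print(X.min())
```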
I am also interested in this topic. When I use Bayes classification (maybe this Russian article on the Bayes algorithm can help you: http://habrahabr.ru/blogs/python/120194/ ) I only use the top 20 words of each document. I have tried many values, and in my experiments the top 20 gave the best score. I also changed the normal tf-idf to this:
from math import log10

def f(word):
    idf = log10(0.5 / word.df)
    if idf < 0:
        idf = 0
    return word.tf * idf
In this case, "bad words" get a weight of 0.
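Picking the top 20 words with this weighting might look like the sketch below; the `Word` tuple and the sample values are hypothetical stand-ins for the objects used above, with `df` taken as the fraction of documents containing the word:

```python
from math import log10
from collections import namedtuple

# Hypothetical stand-in for the `word` objects in the answer: tf is the
# term frequency in one document, df the fraction of documents with it.
Word = namedtuple("Word", ["text", "tf", "df"])

def f(word):
    # words in more than half the documents get idf < 0, clipped to 0
    idf = log10(0.5 / word.df)
    return word.tf * idf if idf > 0 else 0.0

words = [Word("the", 10, 0.9), Word("python", 4, 0.1), Word("bayes", 2, 0.05)]
top = sorted(words, key=f, reverse=True)[:20]  # keep only the top-20 words
print([w.text for w in top])
```

With only three sample words all of them survive the cut; on a real document the `[:20]` slice is what discards the rest.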