Naive Bayes classifier using Python

I am using scikit-learn to compute the tf-idf weights of a document, and then a Naive Bayes classifier to classify the text. But the tf-idf weights of all words in the documents are negative, except for a few. As far as I know, negative values mean unimportant terms. Do I need to pass all the tf-idf values to the Bayes classifier? If we only need to pass a few of them, how can we do it? Also, how much better or worse is a Naive Bayes classifier compared to a linear SVC? Is there a better way to find tags in text other than using tf-idf?

Thanks.

+3




3 answers


You have a lot of questions, but I'll try to help.

As far as I remember, tf-idf should not be negative. TF is the term frequency (how often the term appears in a particular document) and IDF is the inverse document frequency (the number of documents in the corpus divided by the number of documents that include the term), which is then usually log-weighted. We often add 1 to the denominator to avoid division by zero. Hence, the only time you get a negative tf * idf is if the term appears in every document of the corpus (which, as you mentioned, is not very useful for search, since it adds no information). I would double-check your algorithm.

given term t, document d, corpus c:

tfidf = term_freq * log(document_count / (document_frequency + 1))
tfidf = [# of t in d] * log([# of d in c] / ([# of d with t in c] + 1))
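A minimal sketch of that formula in plain Python, assuming each document is a list of tokens (the helper name tfidf is just illustrative):

from math import log

def tfidf(t, d, c):
    """tf-idf of term t in document d over corpus c (documents as token lists)."""
    tf = d.count(t)                       # [# of t in d]
    df = sum(1 for doc in c if t in doc)  # [# of d with t in c]
    return tf * log(len(c) / (df + 1))    # the +1 avoids division by zero

Note that with this exact formula, a term that appears in every document gets log(N / (N + 1)) < 0, which is the one negative case described above.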

In machine learning, Naive Bayes and SVMs are both good tools; their quality will vary with the application, and I've done projects where their accuracy was comparable. Naive Bayes is generally easy enough to implement by hand, and I usually take that route first before reaching for SVM libraries.
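If you want to compare the two on your own data, here is a rough sketch with scikit-learn (the toy texts and labels are placeholders for your documents and tags):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Toy data; replace with your own documents and tags.
texts = [
    "the movie was great",
    "terrible film, a waste of time",
    "loved the acting and the plot",
    "awful script and bad acting",
]
labels = ["pos", "neg", "pos", "neg"]

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVC", LinearSVC())]:
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=2)
    print(name, scores.mean())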

I may be missing something, and I'm not entirely sure exactly what you're looking for, so I'll gladly update my answer.

+6




This bug has been fixed upstream. Be aware that the text vectorizer API has also changed, to make it easier to customize tokenization.
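For illustration, in recent scikit-learn versions you can plug a custom tokenizer into the vectorizer like this (my_tokenizer is a hypothetical example, not necessarily the API change referred to above):

from sklearn.feature_extraction.text import TfidfVectorizer

def my_tokenizer(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# TfidfVectorizer accepts a callable via the tokenizer parameter.
vectorizer = TfidfVectorizer(tokenizer=my_tokenizer)
X = vectorizer.fit_transform(["Some example text", "another example document"])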



+6




I am also interested in this topic. When I use Bayes classification (maybe this Russian article on the Bayes algorithm can help you: http://habrahabr.ru/blogs/python/120194/) I only use the top 20 words of each document. I tried many values, and in my experiments the top 20 gave the best score. I also changed the normal tf-idf to this:

from math import log10

def f(word):
    # word.df: document frequency (fraction of corpus documents containing the word)
    idf = log10(0.5 / word.df)
    # Clip negative idf to zero so very common words get no weight.
    if idf < 0:
        idf = 0
    return word.tf * idf  # word.tf: term frequency within the document

In this case, the "bad words" get a weight of 0.
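To keep only the top 20 words per document as described above, one hypothetical helper (assuming the word objects and the f defined in the snippet above):

def top_words(words, n=20):
    # Keep only the n highest-weighted words of a document,
    # scored with the clipped tf-idf f() defined above.
    return sorted(words, key=f, reverse=True)[:n]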

+2

