Can weighting be used to improve the classification of underrepresented classes in the corpus?

I am currently using TF-IDF features to classify websites by their content. Unfortunately my training data is imbalanced: about 70% of the pre-tagged websites are news sites, and the remaining categories (tech, art, entertainment, etc.) are a small minority.
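For reference, my setup looks roughly like the sketch below (scikit-learn is assumed; the documents and labels are placeholders, and the real corpus is much larger):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy placeholder corpus; the real data is ~70% "news".
documents = [
    "breaking news about the election",
    "new gadget review and benchmarks",
    "gallery opening features local painters",
    "breaking news about the economy",
]
labels = ["news", "tech", "art", "news"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# GaussianNB does not accept sparse input, so the TF-IDF matrix
# is densified before fitting.
clf = GaussianNB()
clf.fit(X.toarray(), labels)
```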

My questions are:

  • Can TF-IDF (or the classifier on top of it) be configured to weight the labels differently, so that it behaves as if the data were balanced? Or should I take a different approach here? I am currently running a Gaussian Naive Bayes classifier on the TF-IDF vectors; would something else work better in this particular case?

  • Is it possible for the classifier to give me a list of possible labels when the probability of the top label is below a certain threshold? For example, if the vectors are close enough that one class is only slightly (<1-2%) more likely than another, can it report both candidates? A rough sketch of what I mean follows below.
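
To make the second question concrete, this is roughly what I have in mind (a hypothetical helper; it assumes the scikit-learn classifier from the sketch above, which exposes predict_proba and classes_):

```python
def candidate_labels(clf, X_dense, margin=0.02):
    """Return, for each sample, every label whose posterior probability
    is within `margin` of the most probable label."""
    probs = clf.predict_proba(X_dense)  # shape (n_samples, n_classes)
    results = []
    for row in probs:
        best = row.max()
        close = [clf.classes_[i] for i, p in enumerate(row) if best - p <= margin]
        results.append(close)
    return results

# Example: candidate_labels(clf, X.toarray(), margin=0.02) might yield
# [["news"], ["news", "tech"], ...] when two posteriors differ by <2%.
```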
