Can tfidf be weighted to improve classification of sparsely represented data in the corpus?
I am currently applying tfidf before classifying multiple websites based on their content. Unfortunately, my training data is unbalanced: about 70% of the pre-tagged websites are news sites, and the rest (tech, art, entertainment, etc.) make up a small minority.
My questions are:
- Can tfidf be configured to weight different labels differently, so that it behaves as if the classes were balanced? Should I take a different approach in this case? I am currently using a Gaussian Naive Bayes classifier after the tfidf step; would something else work better in this particular case?
- Is it possible for tfidf to give me a list of possible labels when the probability of a given label is below a certain threshold? For example, if the vector is close enough to two classes that one is only slightly (<1-2%) more likely than the other, can it print both?
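To make the two questions concrete, here is a minimal sketch in plain Python. It rests on the observation that tfidf only produces feature weights and never sees the labels, so both the class balancing and the near-tie reporting would most likely have to sit on the classifier side. The function names `balanced_class_weights` and `candidate_labels`, the label counts, and the probabilities are all hypothetical and chosen only for illustration; the weighting formula is the common inverse-frequency heuristic (the same one scikit-learn uses for `class_weight="balanced"`), not something taken from the setup described above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def candidate_labels(probabilities, margin=0.02):
    """Return every label whose probability lies within `margin`
    of the most likely one, ordered from most to least likely."""
    ranked = sorted(probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    top_probability = ranked[0][1]
    return [label for label, p in ranked if top_probability - p <= margin]


# Hypothetical label distribution mirroring the 70% news share.
labels = ["news"] * 70 + ["tech"] * 15 + ["art"] * 10 + ["entertainment"] * 5
weights = balanced_class_weights(labels)

# Hypothetical classifier output for one site where two classes nearly tie.
probabilities = {"news": 0.41, "tech": 0.40, "art": 0.12, "entertainment": 0.07}
close_calls = candidate_labels(probabilities, margin=0.02)

print(weights)
print(close_calls)
```

With these made-up numbers the rare classes receive larger weights than news, and the near-tie check reports both news and tech instead of only the top label. Weights of this kind could be passed to a classifier as per-sample weights or used to adjust class priors, while the margin check would run on whatever per-class probabilities the classifier exposes.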