Training set for sentiment analysis

I am using Python NLTK for sentiment analysis on a data set of about 200,000 reviews. To use the Naive Bayes classifier, I need a labeled training set. Since my data is not labeled, I manually labeled about 100 reviews, both positive and negative. But I don't think this is the right way to do it. I have heard that I should use 20% of the data as a training set to train the classifier and then apply it to the other 80% of the data.

Is there a better way to create a training set for a Naive Bayes classifier? Thanks for your help, and please let me know if the question is not clear.



1 answer


We have had great success using only about 100-200 training samples (depending on the specific classification) to classify hundreds of thousands of paragraphs with a fairly high degree of accuracy.

We manually filtered the randomly selected samples to make sure they weren't too similar to each other, so that they represented different ways of expressing the same concept. We used RapidMiner for classification, not NLTK, but I expect the algorithms to be quite similar.
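We did this filtering by hand, but a rough automated first pass might look like the sketch below. It is only an illustration: the helper names (`jaccard`, `pick_diverse_sample`), the word-overlap measure, and the 0.5 threshold are all assumptions, not something we actually used.

```python
import random


def jaccard(a, b):
    """Word-set overlap between two texts: 0.0 (disjoint) to 1.0 (identical sets)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def pick_diverse_sample(reviews, k, max_similarity=0.5, seed=0):
    """Randomly pick up to k reviews, skipping any that overlap too much
    with a review that was already picked (hypothetical helper)."""
    rng = random.Random(seed)
    candidates = list(reviews)
    rng.shuffle(candidates)
    chosen = []
    for review in candidates:
        if all(jaccard(review, kept) <= max_similarity for kept in chosen):
            chosen.append(review)
        if len(chosen) == k:
            break
    return chosen


# Made-up reviews, for illustration only.
reviews = ["good phone", "good phone", "bad battery life", "terrible screen"]
sample = pick_diverse_sample(reviews, 3)
print(sample)
```

The reviews that survive this pass would still need to be read and labeled by a person.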



Train your classifier on your 100 labeled reviews and then run it on a set of 100 random reviews that are not included in the training set. Check the accuracy, and add more labeled reviews to the training set if the accuracy is not where you want it.
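With NLTK, that train-then-check loop could be sketched roughly as follows. The reviews and the `bag_of_words` feature function are made up for illustration; in practice you would put in your own labeled reviews and probably a better feature extractor.

```python
import nltk


def bag_of_words(text):
    """Turn a review into the feature dict that NLTK classifiers expect
    (hypothetical, deliberately simple feature extractor)."""
    return {word: True for word in text.lower().split()}


# Hand-labeled training reviews (made up for illustration).
train_reviews = [
    ("great product works perfectly", "pos"),
    ("love it excellent quality", "pos"),
    ("terrible waste of money", "neg"),
    ("broke after one day awful", "neg"),
]

# Separately labeled reviews that are NOT in the training set.
held_out_reviews = [
    ("excellent product love it", "pos"),
    ("awful quality terrible", "neg"),
]

train_set = [(bag_of_words(text), label) for text, label in train_reviews]
held_out_set = [(bag_of_words(text), label) for text, label in held_out_reviews]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, held_out_set)
print("held-out accuracy:", accuracy)

# If the accuracy is too low, label more reviews, add them to
# train_reviews, and train again.
```

Once the held-out accuracy is acceptable, `classifier.classify(bag_of_words(text))` can be applied to the remaining unlabeled reviews.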
