How to use a pickled classifier with CountVectorizer.fit_transform() to label data
I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and precision values for a binary classification task.
During training, I reduced the number of features using scikit-learn's CountVectorizer
cv:
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
and then used the methods fit_transform()
and transform()
to obtain the transformed training and test sets:
transformed_feat_train = numpy.zeros((0,0,))
transformed_feat_test = numpy.zeros((0,0,))
transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
transformed_feat_test = cv.transform(testingTextFeat).toarray()
This all worked well for training and testing the classifier. However, I'm not sure how to use fit_transform()
and transform()
with the pickled version of the trained classifier to predict the labels of unseen, unlabeled data.
I extract the features of the unlabeled data in exactly the same way as when training / testing the classifier:
## load the pickled classifier for labeling
pickledClassifier = joblib.load(pickledClassifierFile)
## transform data
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
cv.fit_transform(NOT_SURE)
transformed_feat_unlabeled = numpy.zeros((0,0,))
transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()
## predict label on unseen, unlabeled data
l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)
Error message:
Traceback (most recent call last):
File "../clf.py", line 615, in <module>
if __name__=="__main__": main()
File "../clf.py", line 579, in main
cv.fit_transform(pickledClassifierFile)
File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
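You must use the same vectorizer instance to transform the training data, the test data, and any new data. A freshly constructed CountVectorizer has no vocabulary, so it cannot reproduce the feature columns the classifier was trained on. You can do this by creating a pipeline of vectorizer + classifier, training the pipeline on the training set, and pickling the entire pipeline. Later, load the pickled pipeline and call predict on it.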
See this related question: Bringing a Classifier into Production.
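A minimal sketch of the pipeline approach described above. The toy texts, the labels, the LogisticRegression classifier, and the file path are all assumptions made for illustration; substitute your own data and classifier. Note that fit_transform() is never called on the unlabeled data: the loaded pipeline applies the vocabulary learned during training.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for trainingTextFeat and its labels (hypothetical data).
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "please review the project agenda",
]
train_labels = [1, 1, 0, 0]

# The vectorizer and the classifier are fitted together as one object.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),  # assumed classifier; use your own
])
pipeline.fit(train_texts, train_labels)

# Persist the whole pipeline so the learned vocabulary travels with the model.
model_path = os.path.join(tempfile.mkdtemp(), "text_clf.joblib")  # hypothetical path
joblib.dump(pipeline, model_path)

# Later (for example in another script): load and predict on raw text.
loaded = joblib.load(model_path)
unlabeled_texts = ["buy cheap pills", "agenda for the review meeting"]
predicted = loaded.predict(unlabeled_texts)
print(predicted)
```

If you prefer to keep the classifier pickled on its own, the same idea applies: also dump the fitted cv object and, after loading it, call only cv.transform() on the new text.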