How to use a pickled classifier with CountVectorizer.fit_transform() to label data

I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and precision values for a binary classification task.

During training, I reduced the number of features using scikit-learn's CountVectorizer (cv):

    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000) 

      

and then used the fit_transform() and transform() methods to get the transformed training and test sets:

    transformed_feat_train = numpy.zeros((0,0,))
    transformed_feat_test = numpy.zeros((0,0,))

    transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
    transformed_feat_test = cv.transform(testingTextFeat).toarray()

      

This all worked great for training and testing the classifier. However, I'm not sure how to use fit_transform() and transform() with the pickled version of the trained classifier to predict the labels of unseen, unlabeled data.

I extract the features of the unlabeled data in exactly the same way as when training / testing the classifier:

    ## load the pickled classifier for labeling
    pickledClassifier = joblib.load(pickledClassifierFile)

    ## transform data
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
    cv.fit_transform(NOT_SURE)

    transformed_Feat_unlabeled = numpy.zeros((0,0,))
    transformed_Feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()

    ## predict label on unseen, unlabeled data
    l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)

      

Error message:

    Traceback (most recent call last):
      File "../clf.py", line 615, in <module>
        if __name__=="__main__": main()
      File "../clf.py", line 579, in main
        cv.fit_transform(pickledClassifierFile)
      File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
        raise ValueError("empty vocabulary; perhaps the documents only"
    ValueError: empty vocabulary; perhaps the documents only contain stop words

      



1 answer


You must use the same vectorizer instance to transform both the training data and any data you label later, because the fitted vectorizer holds the vocabulary the classifier was trained on. You can do this by creating a pipeline with the vectorizer + classifier, training the pipeline on the training set, and pickling the entire pipeline. Later, load the pickled pipeline and call predict on it.
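A minimal sketch of that approach, in case it helps. The toy texts, the labels, the file path, and the choice of LogisticRegression are my own placeholders (the question doesn't say which classifier was used); only the CountVectorizer settings are taken from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for trainingTextFeat and its labels.
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status and agenda",
]
train_labels = [1, 1, 0, 0]

# Vectorizer and classifier live in one object, so the fitted
# vocabulary is saved together with the model.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),  # assumption: swap in your own classifier
])
pipeline.fit(train_texts, train_labels)

# Pickle the whole pipeline (the path is a placeholder).
pipeline_file = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, pipeline_file)

# Later, possibly in another script: load it and predict on raw text.
# No fit_transform() call is needed here; the pipeline only calls
# transform() on the already fitted vectorizer before predicting.
loaded_pipeline = joblib.load(pipeline_file)
unlabeled_texts = ["buy cheap pills", "agenda for the project meeting"]
predicted_labels = loaded_pipeline.predict(unlabeled_texts)
```

Note that the pipeline takes the raw documents as input, so there is no separate vectorizing step at prediction time. If you'd rather keep the classifier pickle you already have, the alternative is to also pickle the fitted cv object from training and call only cv.transform() on the new data.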



See this related question: Bringing a Classifier into Production.
