How to use a pickled classifier with CountVectorizer.fit_transform() to label data
I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and precision scores for a binary classification task.
During training, I reduced the number of features using scikit-learn's CountVectorizer, cv:
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000) 
The methods fit_transform() and transform() were then used to get the transformed training and test sets:
    transformed_feat_train = numpy.zeros((0,0,))
    transformed_feat_test = numpy.zeros((0,0,))
    transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
    transformed_feat_test = cv.transform(testingTextFeat).toarray()
This all worked great for training and testing the classifier. However, I'm not sure how to use fit_transform() and transform() with the pickled version of the trained classifier to predict the labels of unseen, unlabeled data.
I extract the features of the unlabeled data in exactly the same way as when training/testing the classifier:
    ## load the pickled classifier for labeling
    pickledClassifier = joblib.load(pickledClassifierFile)
    ## transform data
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)
    cv.fit_transform(NOT_SURE)
    transformed_feat_unlabeled = numpy.zeros((0,0,))
    transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()
    ## predict label on unseen, unlabeled data
    l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)
Error message:
    Traceback (most recent call last):
      File "../clf.py", line 615, in <module>
        if __name__=="__main__": main()
      File "../clf.py", line 579, in main
        cv.fit_transform(pickledClassifierFile)
      File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
        raise ValueError("empty vocabulary; perhaps the documents only"
    ValueError: empty vocabulary; perhaps the documents only contain stop words
You must use the same vectorizer instance to transform the training and test data, because the fitted vectorizer holds the vocabulary the classifier was trained on. You can do this by creating a pipeline with a vectorizer + classifier, training the pipeline on the training set, and pickling the entire pipeline. Later, load the pickled pipeline and call predict on it.
See this related question: Bringing a Classifier into Production.
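
A minimal sketch of the pipeline approach described in this answer might look like the following. The toy texts and labels, the LogisticRegression classifier, and the file name are assumptions made for illustration; the question does not say which classifier was used, and any scikit-learn estimator that accepts sparse input could take its place.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical; stands in for trainingTextFeat and its labels).
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status and agenda",
]
train_labels = [1, 1, 0, 0]

# Vectorizer and classifier live in one object, so the vocabulary learned
# during fit() is stored together with the model.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),  # assumed classifier, not named in the question
])
pipeline.fit(train_texts, train_labels)

# Persist the whole fitted pipeline (hypothetical file name in a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), "text_clf_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Later, possibly in another process: load the pipeline and predict on raw text.
# No fit_transform() call is needed here; predict() applies transform() internally
# using the vocabulary learned at training time.
loaded_pipeline = joblib.load(model_path)
unlabeled_texts = ["buy cheap pills", "agenda for the project meeting"]
predicted_labels = loaded_pipeline.predict(unlabeled_texts)
print(predicted_labels)
```

With this layout the labeling script never constructs a new CountVectorizer, so there is nothing to refit on the unlabeled data. The pipeline also passes the sparse count matrix straight to the classifier, which avoids the .toarray() conversion as long as the chosen classifier supports sparse input.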