SGDClassifier with HashingVectorizer and TfidfTransformer
I would like to understand whether an online SGDClassifier (trained with partial_fit) can be used together with HashingVectorizer and TfidfTransformer. Simply chaining them in a Pipeline will not work, because TfidfTransformer is stateful and would disrupt the online learning process. This post says that tf-idf cannot be used online, but a comment on this post suggests that it might somehow be possible: "In particular, if you are using stateful transformers as a TfidfTransformer, you will need to do multiple passes in your data". Is this possible without loading the entire training set into memory? If so, how? If not, is there an alternative way to combine HashingVectorizer with tf-idf on large datasets?
Is this possible without loading the entire training set into memory?
No. TfidfTransformer needs the entire matrix X in memory. You would need to roll your own tf-idf estimator, use it to compute the per-term document frequencies in one pass over the data, then make a second pass to produce the tf-idf features and fit a classifier on them.
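The two passes described above could look roughly like the following. This is only a sketch under some assumptions, not a drop-in solution: iter_batches is a hypothetical stand-in for whatever streams your data from disk (it has to be re-iterable, since the data is read twice), N_FEATURES and the toy texts are made up, and the idf formula is the smoothed variant that TfidfTransformer uses by default.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import normalize

N_FEATURES = 2 ** 18  # assumed size of the hash space


def iter_batches():
    """Hypothetical data source yielding (texts, labels) mini-batches.

    A real version would stream from disk; toy data keeps the sketch runnable.
    """
    yield ["the cat sat on the mat", "dogs chase the cat"], [0, 0]
    yield ["stocks fell on monday", "the market rallied on friday"], [1, 1]


# norm=None and alternate_sign=False keep raw, non-negative term counts,
# so idf weighting and normalization can be applied by hand afterwards.
hasher = HashingVectorizer(n_features=N_FEATURES, norm=None, alternate_sign=False)

# Pass 1: accumulate per-feature document frequencies, one batch at a time.
df = np.zeros(N_FEATURES, dtype=np.int64)
n_docs = 0
for texts, _ in iter_batches():
    X = hasher.transform(texts)
    # Each stored entry of the CSR matrix is one (document, feature) pair.
    df += np.bincount(X.indices, minlength=N_FEATURES)
    n_docs += X.shape[0]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
idf_diag = sp.diags(idf, format="csr")


def to_tfidf(texts):
    """Hash the texts, scale each column by its idf, then L2-normalize rows."""
    X = hasher.transform(texts) @ idf_diag
    return normalize(X, norm="l2")


# Pass 2: re-read the data and train the classifier incrementally.
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs all classes up front
for texts, labels in iter_batches():
    clf.partial_fit(to_tfidf(texts), labels, classes=classes)
```

Only the df vector (one integer per hashed feature) and a single batch are held in memory at any time. At prediction time, new texts would go through the same to_tfidf function before being passed to clf.predict.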