Online version of scikit-learn TfidfVectorizer

I want to use scikit-learn's HashingVectorizer because it is a great fit for online learning problems (new tokens in the text are guaranteed to map to a bucket). Unfortunately, the implementation included in scikit-learn does not seem to support tf-idf features. Is passing the vectorizer output through a TfidfTransformer the only way to make online updates work with tf-idf features, or is there a more elegant solution?
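For context, here is a minimal sketch of the setup I mean (the corpus, the choice of n_features, and alternate_sign=False are only illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Stateless hashing: any token, new or old, maps into one of 2**20 buckets.
hasher = HashingVectorizer(n_features=2**20, alternate_sign=False)

# Illustrative mini-batch; in an online setting documents arrive over time.
batch = ["the cat sat", "the dog barked", "a cat and a dog"]

counts = hasher.transform(batch)        # no fit needed, works on any stream
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)  # but fit() sees only this batch, so
                                        # the IDF ignores everything seen
                                        # earlier: the online-update problem.
```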

+1




2 answers


You cannot use tf-idf in an online fashion as such, because the IDF of all past features changes with each new document. That would mean revisiting and retraining on all previous documents, which would no longer be online.



There may be some approximations, but you will have to implement them yourself.
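For example, one crude approximation (entirely a sketch of my own, not anything that exists in scikit-learn; the class name is hypothetical and the smoothing only loosely follows TfidfTransformer's convention) is to keep running document-frequency counts per hash bucket and weight each incoming batch with the IDF seen so far. Old documents are never re-weighted, which is exactly why it is only an approximation:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

class ApproxOnlineTfidf:
    """Hypothetical sketch: running document frequencies per hash bucket;
    each new batch is weighted with the IDF accumulated so far."""

    def __init__(self, n_features=2**20):
        self.hasher = HashingVectorizer(n_features=n_features,
                                        alternate_sign=False)
        self.df = np.zeros(n_features)   # document frequency per bucket
        self.n_docs = 0

    def partial_fit_transform(self, docs):
        counts = self.hasher.transform(docs)
        # Update document frequencies with this batch.
        self.df += (counts > 0).sum(axis=0).A.ravel()
        self.n_docs += len(docs)
        # Smoothed IDF based on everything seen so far.
        idf = np.log((1 + self.n_docs) / (1 + self.df)) + 1
        return counts.multiply(idf).tocsr()
```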

+3




You can do TF-IDF "online", contrary to what was said in the accepted answer.

In fact, every search engine (like Lucene) does.

What doesn't work is assuming you keep all the TF-IDF vectors in memory.

Search engines such as Lucene naturally avoid keeping all the data in memory. Instead, they load one column at a time (which, due to sparsity, is not a lot). The IDF arises trivially from the length of an inverted list.
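As a toy illustration of that last point (a hand-rolled index, nothing like Lucene's actual internals):

```python
import math
from collections import defaultdict

# Toy inverted index: term -> list of ids of documents containing it.
index = defaultdict(list)
docs = ["the cat sat", "the dog barked", "a cat and a dog"]
for doc_id, text in enumerate(docs):
    for term in set(text.split()):
        index[term].append(doc_id)

n_docs = len(docs)
# The IDF of a term is read straight off the posting-list length.
idf = {term: math.log(n_docs / len(postings))
       for term, postings in index.items()}
print(idf["cat"])   # log(3/2): "cat" appears in 2 of the 3 documents
```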



The point is that you do not convert your data to TF-IDF up front and then compute the standard cosine similarity.

Instead, you apply the current IDF weights when computing similarities, using a weighted cosine similarity (often modified with additional weighting, term boosting, penalties, etc.).

This approach works with essentially any algorithm that allows attribute weighting at evaluation time. Many algorithms do, but unfortunately very few implementations are flexible enough; most expect you to multiply the weights into your data matrix before training.
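A minimal sketch of what that could look like (illustrative numbers, no boosting or penalties): the stored vectors stay raw term frequencies, and whatever IDF weights are current get applied only at comparison time.

```python
import numpy as np

def weighted_cosine(tf_a, tf_b, idf):
    """Cosine similarity between raw term-frequency vectors, with the
    *current* IDF weights applied at comparison time rather than baked
    into the stored vectors."""
    a = tf_a * idf
    b = tf_b * idf
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

# Stored vectors stay raw counts; only idf changes as documents arrive.
tf_a = np.array([2.0, 0.0, 1.0])
tf_b = np.array([1.0, 1.0, 0.0])
idf = np.array([0.4, 1.1, 0.7])  # recomputed from up-to-date frequencies
print(weighted_cosine(tf_a, tf_b, idf))
```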

+2








