Spark TF-IDF: how to get words back from the hash

I am following this example from the Spark documentation for calculating TF-IDF over a set of documents. Spark uses the hashing trick for these calculations, so at the end you get a vector containing the hashed words and their corresponding weights, but... how can I get the words back from the hash?

Do I really need to hash all the words myself and store them in a map so that I can iterate over it later looking for my keywords? Is there no more efficient way built into Spark?

Thank you in advance


3 answers


Hashing a String in HashingTF yields a non-negative integer in the range 0 (inclusive) to numFeatures (exclusive, default 2^20), computed via org.apache.spark.util.Utils.nonNegativeMod(int, int).



The original string is lost; there is no way to convert the resulting integer back to the input string.
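As a sketch of what that indexing looks like (this assumes the older hashCode-based hashing; ml.feature.HashingTF in Spark 2.0+ defaults to MurmurHash3, but the nonNegativeMod step is the same):

```java
// Sketch of how HashingTF maps a term to a feature index.
// hashCode stands in for the actual hash function here; newer Spark
// versions use MurmurHash3, but the modular reduction is identical.
public class HashingSketch {
    // Mirrors org.apache.spark.util.Utils.nonNegativeMod(int, int):
    // Java's % can return a negative value, so shift it into [0, mod).
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    static int termIndex(String term, int numFeatures) {
        return nonNegativeMod(term.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        int numFeatures = 1 << 20; // Spark's default of 2^20
        System.out.println(termIndex("spark", numFeatures));
    }
}
```

The index is a many-to-one reduction of the hash, which is why it cannot be inverted.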



You would need to build a dictionary that maps every token in your dataset to its hash value. But since the hashing trick is used, there may be hash collisions, so the mapping is not perfectly reversible.
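A minimal sketch of such a reverse map, reusing the same nonNegativeMod reduction Spark applies (hashCode stands in for the real hash function, and the token list is hypothetical):

```java
import java.util.*;

// Build an index -> tokens map by hashing every token the same way
// HashingTF would. Because of collisions, one index can map to
// several tokens, hence the Set values.
public class ReverseIndex {
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    static Map<Integer, Set<String>> buildReverseMap(List<String> tokens, int numFeatures) {
        Map<Integer, Set<String>> reverse = new HashMap<>();
        for (String token : tokens) {
            int index = nonNegativeMod(token.hashCode(), numFeatures);
            reverse.computeIfAbsent(index, k -> new HashSet<>()).add(token);
        }
        return reverse;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> reverse =
            buildReverseMap(Arrays.asList("spark", "hash", "tfidf"), 1 << 20);
        reverse.forEach((index, words) -> System.out.println(index + " -> " + words));
    }
}
```

With the default 2^20 features and a realistic vocabulary, most indexes will hold a single token, but the Set guards against the collision case.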





If you use CountVectorizer instead of HashingTF (the TF-IDF pipeline is essentially a HashingTF transform followed by IDF), it is probably a better fit for your needs, because you can recover the indexed vocabulary:

String[] vocabulary = countVectorizerModel.vocabulary();

so you can look terms up by index.

For example, given a resulting SparseVector like (11,[0,1,3],[1.0,...]), where [0,1,3] are the indexes of the vocabulary terms occurring in the corresponding text, you can get the terms back with:

vocabulary[index]

If you need to do this in the context of LDA topic terms, the solution is the same.
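Put together, the lookup is just indexing into the vocabulary array with the sparse vector's active indexes (the vocabulary and indexes below are illustrative stand-ins for `countVectorizerModel.vocabulary()` and the SparseVector's indices):

```java
// Recover terms from the active indexes of a CountVectorizer output.
public class VocabLookup {
    static String[] termsFor(String[] vocabulary, int[] indices) {
        String[] terms = new String[indices.length];
        for (int i = 0; i < indices.length; i++) {
            terms[i] = vocabulary[indices[i]];
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] vocabulary = {"spark", "hash", "word", "count"};
        int[] indices = {0, 1, 3}; // active positions from a SparseVector
        System.out.println(String.join(", ", termsFor(vocabulary, indices)));
        // prints: spark, hash, count
    }
}
```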


