Spark TF-IDF: how to get words back from the hash

I am following this example from the Spark documentation for calculating TF-IDF over a set of documents. Spark uses the hashing trick for these calculations, so at the end you get a vector containing the hashed words and their corresponding weights, but... how can I get the words back from the hash?

Do I really need to hash all the words myself and store them in a map so that I can iterate over it later looking for my keywords? Is there no more efficient way built into Spark?

Thank you in advance


3 answers


Hashing a String in HashingTF yields a non-negative integer in the range 0 (inclusive) to numFeatures (exclusive, default 2^20), computed via org.apache.spark.util.Utils.nonNegativeMod(int, int).



The original string is lost; there is no way to convert the resulting integer back to the input string.
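As a sketch of what that indexing looks like (this assumes the older hashCode-based hashing; ml.feature.HashingTF in Spark 2.0+ defaults to MurmurHash3, but the nonNegativeMod step is the same):

```java
// Sketch of how HashingTF maps a term to a feature index.
// hashCode stands in for the actual hash function here; newer Spark
// versions use MurmurHash3, but the modular reduction is identical.
public class HashingSketch {
    // Mirrors org.apache.spark.util.Utils.nonNegativeMod(int, int):
    // Java's % can return a negative value, so shift it into [0, mod).
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    static int termIndex(String term, int numFeatures) {
        return nonNegativeMod(term.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        int numFeatures = 1 << 20; // Spark's default of 2^20
        System.out.println(termIndex("spark", numFeatures));
    }
}
```

The index is a many-to-one reduction of the hash, which is why it cannot be inverted.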



You would need to build a dictionary that maps every token in your dataset to its hash value. But since the hashing trick is used, there may be hash collisions, so the mapping is not perfectly reversible.
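A minimal sketch of such a reverse map, reusing the same nonNegativeMod reduction Spark applies (hashCode stands in for the real hash function, and the token list is hypothetical):

```java
import java.util.*;

// Build an index -> tokens map by hashing every token the same way
// HashingTF would. Because of collisions, one index can map to
// several tokens, hence the Set values.
public class ReverseIndex {
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    static Map<Integer, Set<String>> buildReverseMap(List<String> tokens, int numFeatures) {
        Map<Integer, Set<String>> reverse = new HashMap<>();
        for (String token : tokens) {
            int index = nonNegativeMod(token.hashCode(), numFeatures);
            reverse.computeIfAbsent(index, k -> new HashSet<>()).add(token);
        }
        return reverse;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> reverse =
            buildReverseMap(Arrays.asList("spark", "hash", "tfidf"), 1 << 20);
        reverse.forEach((index, words) -> System.out.println(index + " -> " + words));
    }
}
```

With the default 2^20 features and a realistic vocabulary, most indexes will hold a single token, but the Set guards against the collision case.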





If you use CountVectorizer instead of HashingTF (the TF-IDF pipeline is essentially a HashingTF transform followed by IDF), it is probably a better fit for your needs, because you can recover the indexed vocabulary:

String[] vocabulary = countVectorizerModel.vocabulary();

so you can look terms up by index.

For example, given a resulting SparseVector like (11,[0,1,3],[1.0,...]), where [0,1,3] are the indexes of the vocabulary terms occurring in the corresponding text, you can get the terms back with:

vocabulary[index]

If you need to do this in the context of LDA topic terms, the solution is the same.
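Put together, the lookup is just indexing into the vocabulary array with the sparse vector's active indexes (the vocabulary and indexes below are illustrative stand-ins for `countVectorizerModel.vocabulary()` and the SparseVector's indices):

```java
// Recover terms from the active indexes of a CountVectorizer output.
public class VocabLookup {
    static String[] termsFor(String[] vocabulary, int[] indices) {
        String[] terms = new String[indices.length];
        for (int i = 0; i < indices.length; i++) {
            terms[i] = vocabulary[indices[i]];
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] vocabulary = {"spark", "hash", "word", "count"};
        int[] indices = {0, 1, 3}; // active positions from a SparseVector
        System.out.println(String.join(", ", termsFor(vocabulary, indices)));
        // prints: spark, hash, count
    }
}
```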


