Deploying Word2Vec Models Effectively in Production Services

This is a rather long shot, but I hope someone else has been in a similar situation, as I'm looking for advice on how to effectively bring a set of large word2vec models into a production environment.

We have a number of trained w2v models with 300-dimensional vectors. Because of the underlying data (a huge corpus of POS-tagged words, with specialized vocabularies of up to 1 million words), these models have grown quite large, and we are currently exploring effective ways to expose them to our users without paying too high an infrastructure cost.

Besides controlling the size of the vocabulary, reducing the dimensionality of the word vectors would obviously be an option. Does anyone know of a post about this, especially how it affects the quality of the model and how best to measure that?

Another option is to pre-calculate the top X most similar words for each word in the vocabulary and provide a lookup table. Given the size of the model, this is currently very inefficient as well. Are there any heuristics that can reduce the number of required distance calculations from n x (n-1) to something lower?

Many thanks!



1 answer


There are pre-indexing techniques for similarity search in high-dimensional spaces that can speed up nearest-neighbor discovery, but usually at the cost of absolute accuracy. (They also need more memory for the index.)

An example is the ANNOY library. The gensim project includes a demo notebook demonstrating its use with Word2Vec.
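
A minimal sketch of that approach, assuming gensim's AnnoyIndexer wrapper (module path shown is the gensim 4.x location; the model path and query word are hypothetical):

```python
# Approximate nearest neighbors with gensim + Annoy.
from gensim.models import Word2Vec
from gensim.similarities.annoy import AnnoyIndexer

model = Word2Vec.load("w2v.model")  # hypothetical path to a trained model

# Build an Annoy index over the word vectors;
# more trees -> better recall, larger index, longer build time.
indexer = AnnoyIndexer(model, num_trees=100)

# Approximate top-10 neighbors; omitting indexer falls back to exact search.
neighbors = model.wv.most_similar("example", topn=10, indexer=indexer)
print(neighbors)
```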

I once did some experiments using only 16-bit (rather than 32-bit) floats in the Word2Vec model. It saved memory while idle, and the nearest-neighbor top-N results were almost unchanged. But, perhaps because values were still being up-converted to 32-bit floats during the one-by-one calculations, the speed of operations actually got worse. (This suggests that each distance calculation may cause a temporary memory expansion that offsets any at-rest savings.) So it's not a quick fix, but further exploration here - perhaps finding/implementing the right float16 routines for the array operations - could mean 50% model-size savings with equivalent or even better speed.
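
A rough sketch of that experiment, assuming a gensim 4.x KeyedVectors (the "vectors" attribute and the "w2v.kv" path are assumptions):

```python
# Measure the at-rest memory effect of float16 vector storage.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load("w2v.kv")     # hypothetical path
vecs32 = kv.vectors                  # float32, shape (vocab_size, 300)
vecs16 = vecs32.astype(np.float16)   # roughly halves the at-rest memory

print(f"float32: {vecs32.nbytes / 1e6:.1f} MB, "
      f"float16: {vecs16.nbytes / 1e6:.1f} MB")

# Caveat from above: BLAS has no float16 kernels, so bulk similarity math
# either runs through slower non-BLAS loops or up-casts to float32
# temporaries, which can erase the savings during queries.
query = vecs16[0]
sims = vecs16 @ query                # works, but may be slower than float32
print(np.argsort(-sims)[:10])        # indices of the 10 nearest vectors
```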



For many applications, dropping the least-frequent words doesn't really hurt - or even, when done before training, can improve the quality of the remaining vectors. Since many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can either discard the tail of the array to save memory, or limit most_similar() searches to the first N entries to speed up calculations. Both options are sketched below.
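
A sketch of both options under gensim 4.x assumptions (restrict_vocab is a real most_similar() parameter; the paths and the cut-off N are illustrative):

```python
# Exploit gensim's most-frequent-first ordering of the vector array.
from gensim.models import KeyedVectors

kv = KeyedVectors.load("w2v.kv")  # hypothetical path

# Option 1: keep the full array, but only search the 100k most frequent words.
fast_neighbors = kv.most_similar("example", topn=10, restrict_vocab=100_000)

# Option 2: build a smaller KeyedVectors containing only the first
# (most frequent) N entries, then save and serve that to cut RAM for good.
N = 100_000
small = KeyedVectors(vector_size=kv.vector_size)
small.add_vectors(kv.index_to_key[:N], kv.vectors[:N])
small.save("w2v_small.kv")
```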

Once you have minimized the size of the vocabulary, you want to be sure the full set of vectors is in RAM, and that no swapping is triggered during the (typical) full-sweep distance calculations. If you need multiple processes to serve answers from the same set of vectors, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding up gensim Word2Vec loading times.
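
A small sketch of the memory-mapping idea, assuming the vectors were saved with gensim's native .save() so the arrays live in separate .npy files (mmap='r' is a real parameter of KeyedVectors.load(); the path is hypothetical):

```python
# Share one read-only copy of the vectors across worker processes.
from gensim.models import KeyedVectors

# Each web-server worker runs this; the OS page cache backs all workers
# with the same physical pages instead of per-process copies.
kv = KeyedVectors.load("w2v.kv", mmap="r")

# Optional: touch every page once up front, so the first real requests
# don't pay page-fault latency mid-query.
kv.vectors.sum()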

Finally, while precomputing top-N neighbors for a large vocabulary is time-consuming and memory-intensive, if your access pattern is such that some tokens are queried far more often than others, a cache of the N most-recently (or M most-frequently) requested top-N lists can significantly improve perceived performance - leaving only the less-frequently-requested neighbor lists to require full distance calculations against every other token.
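
A minimal sketch of that caching idea using the standard-library lru_cache (the loaded KeyedVectors, path, and cache size are assumptions):

```python
# Memoize top-N lookups so hot tokens skip the full-sweep calculation.
from functools import lru_cache
from gensim.models import KeyedVectors

kv = KeyedVectors.load("w2v.kv", mmap="r")  # hypothetical path

@lru_cache(maxsize=50_000)          # keep the 50k most recently asked tokens
def cached_top_n(token: str, topn: int = 10):
    # tuple() so the cached result is immutable and hashable
    return tuple(kv.most_similar(token, topn=topn))

print(cached_top_n("example"))      # first call pays the full sweep
print(cached_top_n("example"))      # repeat calls are dictionary lookups
```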
