Doc2vec MemoryError

I'm using the doc2vec model from the gensim framework to represent a corpus of 15,500,000 short documents (up to 300 words):

gensim.models.Doc2Vec(sentences, size=400, window=10, min_count=1, workers=8)


After training, there are over 18,000,000 vectors representing words and documents.

I want to find the most similar elements (words or documents) for a given element:

similarities = model.most_similar('uid_10693076')


but I get a MemoryError when the similarities are calculated:

Traceback (most recent call last):
  File "article/test_vectors.py", line 31, in <module>
    similarities = model.most_similar(item)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 639, in most_similar
    self.init_sims()
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 827, in init_sims
    self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)


I have an Ubuntu machine with 60 GB of RAM and 70 GB of swap. I checked the memory allocation (in htop) and noticed that memory was not fully used. I also set the maximum address space that can be locked in memory to unlimited in Python:

resource.getrlimit(resource.RLIMIT_MEMLOCK)
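
For completeness, here is roughly what I did with the resource module (the setrlimit call below is a sketch; as far as I understand, RLIMIT_MEMLOCK only governs mlock()'d pages, not ordinary NumPy allocations):

import resource

# Current soft/hard limits for locked memory
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(soft, hard)

# Raise the soft limit as far as the hard limit allows; going beyond the
# hard limit (e.g. to RLIM_INFINITY) requires root privileges.
resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))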


Can anyone explain the reason for this MemoryError? In my opinion, the available memory should be sufficient for these calculations. Could there be some memory limit imposed by Python or the OS?

Thanks in advance!


1 answer


18M vectors * 400 dimensions * 4 bytes/float = 28.8 GB for the model's syn0 array (the trained vectors).

The syn1 array (hidden-layer weights) will also be 28.8 GB, even though syn1 doesn't really need entries for the doc vectors, which are never target predictions during training.

The vocabulary structures (the vocab dict and index2word table) are likely to add another GB or more. So that's all of your 60 GB of RAM.

The syn0norm array used for similarity calculations would require a further 28.8 GB, for a total usage of around 90 GB. It is during the creation of syn0norm that you are getting the error. And even if creating syn0norm succeeded, paging that deep into virtual memory would likely ruin performance.
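
To make the arithmetic explicit (decimal GB, float32 vectors):

vectors = 18000000          # word + document vectors
dims = 400                  # vector dimensionality (size=400)
bytes_per_float = 4         # float32

per_array_gb = vectors * dims * bytes_per_float / 1e9
print(per_array_gb)         # ~28.8 GB each for syn0, syn1, syn0norm
print(3 * per_array_gb)     # ~86.4 GB before vocabulary overhead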



Some steps that might help:

  • Use a min_count of at least 2: words that appear only once are unlikely to contribute much, but they likely use a lot of memory. (Since words are only a tiny portion of your syn0, though, this will only save a little.)

  • After training, but before running init_sims(), discard the syn1 array. You won't be able to train further, but the existing word/doc vectors remain usable.

  • After training, but before calling most_similar(), call init_sims() yourself with replace=True to discard the unnormalized syn0 and replace it with syn0norm. Again, you won't be able to train further, but you will save the syn0 memory. (See the sketch after this list.)
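
A minimal sketch of the last two steps, assuming the gensim 0.x attribute names from your traceback (models trained with hierarchical softmax store the hidden layer in syn1; with negative sampling it is syn1neg):

# After training is complete:

# Drop the hidden-layer weights; no further training is possible,
# but the learned word/doc vectors in syn0 remain usable.
model.syn1 = None           # or model.syn1neg = None with negative sampling

# Normalize in place: replace=True overwrites syn0 with the unit-length
# syn0norm instead of keeping both ~28.8 GB arrays in memory.
model.init_sims(replace=True)

similarities = model.most_similar('uid_10693076')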

The ongoing work to separate doc vectors from word vectors, which will appear in gensim versions after 0.11.1, should also eventually provide some relief. (It will shrink syn1 to include only word entries, and allow the doc vectors to come from a file-backed, memmap'd array.)
