Continue training Doc2Vec model

The official Gensim tutorial explicitly states that it is possible to continue training a (loaded) model. I know that, according to the documentation, it is not possible to continue training a model loaded from the word2vec format. But even when you build the model from scratch and then call the train method, the labels of the new LabeledSentence instances passed to train are not accessible afterwards.

>>> from gensim.models.doc2vec import Doc2Vec, LabeledSentence
>>> sentences = [LabeledSentence(['first', 'sentence'], ['SENT_0']), LabeledSentence(['second', 'sentence'], ['SENT_1'])]
>>> model = Doc2Vec(sentences, min_count=1)
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
>>> sentence = LabeledSentence(['third', 'sentence'], ['SENT_2'])
>>> model.train([sentence])
>>> # At this point I would expect the key 'SENT_2' to be present in the vocabulary, but it isn't
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])

      

Can I continue training a Doc2Vec model in Gensim with new sentences? If so, how can this be achieved?


2 answers


As far as I understand, this is not possible for new labels. Training can only be continued when the new data uses the same labels as the old data. In that case the weights of the already learned vocabulary are trained further (re-adjusted), but no new vocabulary can be learned.
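To make the distinction concrete, here is a toy sketch in plain Python. It is not Gensim code: the FrozenVocabModel class, its train method and the update rule are all made up for illustration. The only thing it assumes is the behaviour described above, namely that the vocabulary is fixed when the model is built and that later training re-adjusts existing vectors without creating new ones.

```python
import random


class FrozenVocabModel:
    """Toy stand-in (hypothetical, not Gensim): vectors exist only for keys seen at build time."""

    def __init__(self, documents, size=4, seed=0):
        rng = random.Random(seed)
        self.size = size
        # The vocabulary (words and labels) is collected once, here.
        keys = {key for words, labels in documents for key in words + labels}
        self.vectors = {
            key: [rng.uniform(-0.5, 0.5) for _ in range(size)]
            for key in sorted(keys)
        }

    def train(self, documents, lr=0.1):
        """Move each known label vector towards the mean of its known word vectors.

        Unknown words and labels get no vector; they are only reported back.
        """
        skipped = set()
        for words, labels in documents:
            skipped.update(k for k in words + labels if k not in self.vectors)
            known_words = [w for w in words if w in self.vectors]
            if not known_words:
                continue
            mean = [
                sum(self.vectors[w][i] for w in known_words) / len(known_words)
                for i in range(self.size)
            ]
            for label in labels:
                if label not in self.vectors:
                    continue  # new label: nothing to update
                vec = self.vectors[label]
                for i in range(self.size):
                    vec[i] += lr * (mean[i] - vec[i])
        return skipped


docs = [(['first', 'sentence'], ['SENT_0']), (['second', 'sentence'], ['SENT_1'])]
model = FrozenVocabModel(docs)
before = list(model.vectors['SENT_0'])

# Continue training: one document with an old label, one with a new label.
skipped = model.train([
    (['first', 'sentence'], ['SENT_0']),
    (['third', 'sentence'], ['SENT_2']),
])

print(sorted(skipped))                     # ['SENT_2', 'third']
print('SENT_2' in model.vectors)           # False
print(model.vectors['SENT_0'] != before)   # the old label's vector was re-adjusted
```

The vector for SENT_0 changes because its label was already known, while SENT_2 and third are skipped, which mirrors what the question observed with model.vocab.keys().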

There is a similar question about adding new labels / words / sentences during training: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ

Alternatively, you can follow this discussion: https://groups.google.com/forum/#!topic/gensim/UZDkfKwe9VI

Update: if you want to add new words to an already trained model, have a look at online word2vec here: http://rutumulkar.com/blog/2015/word2vec/


According to the Gensim documentation, online / incremental training is not supported for doc2vec.

See https://github.com/RaRe-Technologies/gensim/issues/1019

I was still able to add new documents to an existing doc2vec model (although this sometimes fails with a segmentation fault), but most queries on a newly added document fail, so this approach seems to be useless.
