Continue training Doc2Vec model
The official Gensim tutorial explicitly states that it is possible to continue training a (loaded) model. I know that, according to the documentation, it is not possible to continue training a model that was loaded from the word2vec format. But even when you build a model from scratch and then call the train method, the labels newly created for the LabeledSentence instances passed to train are not accessible.
>>> from gensim.models.doc2vec import Doc2Vec, LabeledSentence
>>> sentences = [LabeledSentence(['first', 'sentence'], ['SENT_0']), LabeledSentence(['second', 'sentence'], ['SENT_1'])]
>>> model = Doc2Vec(sentences, min_count=1)
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
>>> sentence = LabeledSentence(['third', 'sentence'], ['SENT_2'])
>>> model.train([sentence])
>>> # At this point I would expect the key 'SENT_2' to be present in the vocabulary, but it isn't
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
Is there a way to continue training a Doc2Vec model in Gensim with new sentences? If so, how can this be achieved?
My understanding is that this is not possible for new labels. Training can be continued only when the new data has the same labels as the old data. In that case the weights of the already learned vocabulary are retrained or readjusted, but no new vocabulary can be learned.
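To make the point above concrete, here is a small toy sketch in plain Python. It is not gensim's implementation: the class FixedVocabModel and its simple averaging update are hypothetical, made up only for illustration. It assumes a model that allocates one vector per word or label when the vocabulary is built, so a later call to train can adjust existing vectors but has no row to update for an unseen label.

```python
import random


class FixedVocabModel:
    """Toy stand-in (NOT gensim): vocabulary is frozen when the model is built."""

    def __init__(self, documents, size=4, seed=0):
        rng = random.Random(seed)
        # Collect every word and label once; no keys are added after this point.
        keys = {key for words, labels in documents for key in words + labels}
        self.vectors = {
            key: [rng.uniform(-0.5, 0.5) for _ in range(size)]
            for key in sorted(keys)
        }

    def train(self, documents, lr=0.1):
        """Move each known label vector toward the mean of its known word vectors."""
        for words, labels in documents:
            known = [self.vectors[w] for w in words if w in self.vectors]
            for label in labels:
                if label not in self.vectors or not known:
                    continue  # unseen label (or no known words): nothing to update
                vec = self.vectors[label]
                for i in range(len(vec)):
                    mean_i = sum(w[i] for w in known) / len(known)
                    vec[i] += lr * (mean_i - vec[i])


corpus = [(["first", "sentence"], ["SENT_0"]),
          (["second", "sentence"], ["SENT_1"])]
model = FixedVocabModel(corpus)
before = list(model.vectors["SENT_0"])  # copy, to compare after training

# Continue training: one old label (SENT_0) and one new label (SENT_2).
model.train([(["first", "sentence"], ["SENT_0"]),
             (["third", "sentence"], ["SENT_2"])])

print(sorted(model.vectors))              # keys known to the model
print(model.vectors["SENT_0"] != before)  # was the old label's vector readjusted?
print("SENT_2" in model.vectors)          # was the new label learned?
```

In this toy, the vector for SENT_0 is readjusted while SENT_2 and the word 'third' never enter the vocabulary, which mirrors the behaviour shown in the question.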
A similar question about adding new labels, words, or sentences during training was asked here: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ
Alternatively, you can follow this discussion: https://groups.google.com/forum/#!topic/gensim/UZDkfKwe9VI
Update. If you want to add new words to an already trained model, have a look at online word2vec here: http://rutumulkar.com/blog/2015/word2vec/
According to the gensim documentation, online (incremental) learning is not supported for doc2vec.
See https://github.com/RaRe-Technologies/gensim/issues/1019
I can still add new documents to an existing doc2vec model (although some attempts crash with a segmentation fault), but most queries on a newly added document fail, so this approach seems to be useless.