Command line parameters in word2vec

I want to use word2vec to create my own word vectors from the current English Wikipedia, but I cannot find an explanation of the command line parameters for this program. In the demo script you can find the following (text8 is the old 2006 Wikipedia corpus):

make
if [ ! -e text8 ]; then
  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
  gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin

What is the meaning of these command line parameters?

-cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

And what are the most appropriate values when I have about 20GB of text content (a .txt file)? I read that for large corpora a vector size of 300 or 500 would be better.


1 answer


You can check main() of word2vec.c, where an explanation of each option is printed:

printf("WORD VECTOR estimation toolkit v 0.1c\n\n");
printf("Options:\n");
printf("Parameters for training:\n");
printf("\t-train <file>\n");
printf("\t\tUse text data from <file> to train the model\n");...`



Regarding the most appropriate values, I'm sorry I don't know the answer, but you can find some hints in the Performance paragraph of the original site (Word2Vec - Google Code); a sketch for your 20GB case follows the list. It says:

 - architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
 - the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
 - sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
 - dimensionality of the word vectors: usually more is better, but not always
 - context (window) size: for skip-gram usually around 10, for CBOW around 5 
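
Based on those hints, a possible starting point for your ~20GB corpus might look like the sketch below. This is only a rough, untested suggestion derived from the list above, not a definitive recommendation; wiki.en.txt is a placeholder for your corpus file, and it is worth comparing -size 300 against 500 on a sample of your data first:

# Hypothetical settings for a large (~20GB) plain-text corpus:
# skip-gram (better for infrequent words), 300-dimensional vectors,
# a window around 10 as suggested for skip-gram, negative sampling,
# and stronger sub-sampling (1e-5) as suggested for large data sets.
./word2vec -train wiki.en.txt -output wiki-vectors.bin -cbow 0 -size 300 \
  -window 10 -negative 10 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 5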