Command-line parameters in word2vec
I want to use word2vec to create my own word vectors from the current English Wikipedia, but I cannot find an explanation of the command-line parameters for this program. In the demo script you can find the following:
(text8 is the first 100 MB of a cleaned 2006 English Wikipedia dump)
make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin
What is the meaning of the command-line parameters -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15?
And what are the most appropriate values when I have about 20 GB of text content (a .txt file)? I read that for large corpora a vector size of 300 or 500 would be better.
You can check main() of word2vec.c, where an explanation of each option is printed as help text:
printf("WORD VECTOR estimation toolkit v 0.1c\n\n");
printf("Options:\n");
printf("Parameters for training:\n");
printf("\t-train <file>\n");
printf("\t\tUse text data from <file> to train the model\n");...`
Regarding the most appropriate values, I'm sorry, I don't know the answer, but you can find some hints in the Performance paragraph of the original site (Word2Vec - Google Code); a sketch applying them follows the list below. It says:
- architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
- the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
- sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
- dimensionality of the word vectors: usually more is better, but not always
- context (window) size: for skip-gram usually around 10, for CBOW around 5
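Putting those hints together, a plausible starting point for a ~20 GB corpus could be skip-gram with negative sampling, 300-dimensional vectors, a window around 10 and stronger sub-sampling. The invocation below is only a sketch: the corpus file name wiki.txt is a placeholder, and the values are illustrative guesses based on the list above, not official recommendations:

# Sketch only: skip-gram (-cbow 0) with negative sampling, 300-d vectors.
# wiki.txt stands for your own 20 GB corpus file; tune for your hardware.
time ./word2vec -train wiki.txt -output wiki-vectors.bin \
    -cbow 0 -size 300 -window 10 -negative 10 -hs 0 \
    -sample 1e-5 -threads 20 -binary 1 -iter 5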