How do I implement Word2Vec in Java?
I installed word2Vec using this tutorial on an Ubuntu laptop. Do I need to install DL4J to implement word2Vec vectors in Java? I'm comfortable with Eclipse and not sure if I need all the other prerequisites that DL4J wants to install.
Ideally, I would just keep the Java code I already wrote (in Eclipse) and change a few lines so that my word searches retrieve the word2vec vector instead of going through my current search process.
Also, I have looked into using GloVe, but I don't have MATLAB, and I got an error while installing because of this. Can GloVe be used without MATLAB? If so, the same question as above comes up: I have no idea how to implement it in Java.
What is stopping you from saving the word2vec output (from the C program) in text format, then reading that file with a bit of Java code and loading the vectors into a hash map keyed by the word string?
Some code snippets:
// Class to store a hash map of word vectors, keyed by the word
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

public class WordVecs {
    HashMap<String, WordVec> wordvecmap;
    ....
    // Reads the text-format file, one "word v1 v2 ... vN" line per word
    void loadFromTextFile() {
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}
// Class for a single word vector
public class WordVec implements Comparable<WordVec> {
    // Parses one line of the form "word v1 v2 ... vN"
    public WordVec(String line) {
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++)
            vec[i - 1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}
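One thing to watch for: the text file written by the C word2vec tool starts with a header line holding the vocabulary size and the vector dimension, and a line-by-line parser would read that line as a word. Below is a minimal, self-contained sketch of a loader that skips it. The class name TextVectorLoader, the method load, and the sample words are hypothetical, and the header check (a first line with exactly two tokens) is an assumption that would misfire on one-dimensional vectors.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of word2vec or DL4J.
public class TextVectorLoader {

    // Parses word2vec-style text: one "word v1 v2 ... vD" entry per line.
    // Assumption: a first line with exactly two tokens is the
    // "vocabSize dimension" header and is skipped.
    public static Map<String, float[]> load(Reader source) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            boolean firstLine = true;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                boolean isHeader = firstLine && tokens.length == 2;
                firstLine = false;
                // Skip the header and any blank or malformed line.
                if (isHeader || tokens.length < 2) {
                    continue;
                }
                float[] vec = new float[tokens.length - 1];
                for (int i = 1; i < tokens.length; i++) {
                    vec[i - 1] = Float.parseFloat(tokens[i]);
                }
                vectors.put(tokens[0], vec);
            }
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data in the same layout as the text output.
        String sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.1 0.25 0.3\n";
        Map<String, float[]> vectors = load(new StringReader(sample));
        System.out.println(vectors.keySet());
        System.out.println(vectors.get("king").length);
    }
}
```

For a real file you would pass a FileReader instead of the StringReader. The word2vec tool writes text rather than binary output when run with -binary 0.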
If you want the nearest neighbors of a given word, you can precompute the N nearest neighbors of each word and store that list with its WordVec object.
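As a rough sketch of the nearest-neighbor idea, here is a brute-force version that ranks words by cosine similarity. Cosine similarity is my assumption about the measure (it is the usual one for word vectors), and the class NearestNeighbors, its methods, and the toy vectors are all hypothetical. It works on a plain Map<String, float[]> so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of word2vec or DL4J.
public class NearestNeighbors {

    // Cosine similarity of two vectors of equal length.
    // Assumption: neither vector is all zeros (that would give NaN).
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to n words closest to the query, most similar first.
    public static List<String> nearest(Map<String, float[]> vectors, String query, int n) {
        float[] q = vectors.get(query);
        if (q == null) {
            return new ArrayList<>();
        }
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(query);
        // Sort by descending similarity (negated key gives descending order).
        candidates.sort(Comparator.comparingDouble((String w) -> -cosine(q, vectors.get(w))));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    // Precomputes the neighbor list for every word, so lookups are cheap later.
    public static Map<String, List<String>> precompute(Map<String, float[]> vectors, int n) {
        Map<String, List<String>> neighbors = new HashMap<>();
        for (String word : vectors.keySet()) {
            neighbors.put(word, nearest(vectors, word, n));
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Made-up two-dimensional vectors, for illustration only.
        Map<String, float[]> vectors = new HashMap<>();
        vectors.put("cat", new float[] {1.0f, 0.0f});
        vectors.put("dog", new float[] {0.9f, 0.1f});
        vectors.put("car", new float[] {0.0f, 1.0f});
        System.out.println(precompute(vectors, 1).get("cat"));
    }
}
```

The precompute step compares every word with every other word, so on a large vocabulary it is slow and is better done once offline, with the results saved.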
DL4J author here. Our word2vec implementation is for people who need custom pipelines. I don't blame you for taking the simple route.
It is meant for when you want to build something with the vectors, not just experiment. The C word2vec format is pretty simple.
Here is the parsing logic in Java if you want it: https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-scaleout/deeplearning4j-nladding/match4j-nlp/jrcava/maindeader/maindeemblp/jrcavadeader/maindeande/src / WordVectorSerializer.java # L113
Hope this helps a little.