How do I implement Word2Vec in Java?

I installed word2vec using this tutorial on an Ubuntu laptop. Do I need to install DL4J to use word2vec vectors in Java? I'm comfortable with Eclipse, and I'm not sure whether I need all the other prerequisites that DL4J wants to install.

Ideally, I would just take the Java code I have already written (in Eclipse) and change a few lines so that my word searches retrieve the word2vec vector instead of using my current search process.


I have also looked into using GloVe, but I don't have MatLab, and I got an error while installing because of that. Can GloVe be used without MatLab? If so, the same question as above comes up: I have no idea how to use it from Java.



2 answers


What is stopping you from saving the word2vec output (from the C program) in text format, then reading that file with a piece of Java code and loading the vectors into a hash map keyed by the word string?

Some code snippets:



import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

// Class to store a hashmap of wordvecs, keyed by the word string
public class WordVecs {

    HashMap<String, WordVec> wordvecmap;
    ....
    void loadFromTextFile() {
        // prop is declared in the elided part; presumably a java.util.Properties
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;

            // Each line holds a word followed by its vector components
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}

// Class for each wordvec (the fields word, vec and norm are declared in the elided part)
public class WordVec implements Comparable<WordVec> {
    public WordVec(String line) {
        // First token is the word; the remaining tokens are the vector components
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length-1];
        for (int i = 1; i < tokens.length; i++)
            vec[i-1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}

      

If you want the nearest neighbors of a given word, you can precompute the N nearest neighbors of each word and store that list with the corresponding WordVec object.
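
One way to compute those neighbors is a brute-force cosine similarity scan over the loaded map. The following is only a sketch: the class NearestNeighbors and its methods cosine and nearest are hypothetical names, not part of the WordVecs code above, and the map is assumed to hold plain float[] vectors keyed by word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the answer's WordVecs class.
class NearestNeighbors {

    // Cosine similarity between two vectors of the same length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0; // all-zero vector: treat as not similar
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns up to n words most similar to the query, excluding the query itself.
    static List<String> nearest(Map<String, float[]> vecs, String query, int n) {
        float[] q = vecs.get(query);
        if (q == null) return new ArrayList<>(); // word not in the vocabulary

        // Score every other word against the query vector.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, float[]> e : vecs.entrySet()) {
            if (!e.getKey().equals(query)) {
                score.put(e.getKey(), cosine(q, e.getValue()));
            }
        }

        // Sort words by descending similarity and keep the first n.
        List<String> words = new ArrayList<>(score.keySet());
        words.sort((x, y) -> Double.compare(score.get(y), score.get(x)));
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }
}
```

Each query scans the whole vocabulary, so with a large model it can pay off to run this once per word and store the top N, as suggested above.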



DL4J author here. Our word2vec implementation is meant for people who need custom pipelines. I don't blame you for taking the easy route.

Our word2vec implementation is for when you want to build something with the vectors, not just experiment. The C word2vec format is pretty simple.
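
As a rough illustration of how simple the text variant is: assuming the vectors were saved by the C tool in text mode (-binary 0), the file starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word. The sketch below reads that layout. Word2VecTextReader and read are hypothetical names, and this is not the DL4J code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader for illustration; assumes the text layout described above.
class Word2VecTextReader {

    // Reads "<vocabSize> <dim>" followed by "<word> <v1> ... <vdim>" lines.
    static Map<String, float[]> read(Reader in) {
        Map<String, float[]> vecs = new HashMap<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String header = br.readLine();
            if (header == null) return vecs; // empty input

            String[] h = header.trim().split("\\s+");
            if (h.length < 2) {
                throw new IllegalArgumentException("Expected '<vocabSize> <dim>' header");
            }
            int dim = Integer.parseInt(h[1]);

            String line;
            while ((line = br.readLine()) != null) {
                String[] t = line.trim().split("\\s+");
                if (t.length != dim + 1) continue; // skip blank or malformed lines
                float[] v = new float[dim];
                for (int i = 0; i < dim; i++) {
                    v[i] = Float.parseFloat(t[i + 1]);
                }
                vecs.put(t[0], v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return vecs;
    }

    public static void main(String[] args) {
        // Tiny in-memory sample; pass a FileReader for a real vectors file.
        String sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n";
        Map<String, float[]> vecs = read(new StringReader(sample));
        System.out.println("Loaded " + vecs.size() + " vectors");
    }
}
```

Skipping the header explicitly keeps it from being stored as if it were a word. The binary format needs a different reader and is not covered by this sketch.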



The parsing logic in Java is here if you want it: https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-scaleout/deeplearning4j-nladding/match4j-nlp/jrcava/maindeader/maindeemblp/jrcavadeader/maindeande/src / WordVectorSerializer.java # L113

Hope this helps a little
