Does H2O provide any pre-trained word vectors for use with h2o word2vec?

H2O recently added word2vec to its API. It's great that you can easily train your own word vectors on a corpus you provide yourself.

However, training good word vectors calls for big data and big machines of the kind that software vendors like Google or H2O.ai have, while many H2O end users may not have such resources due to network-bandwidth and computing-power limitations.

Word embeddings can be thought of as a type of unsupervised learning. So there is great value in a data science pipeline in using pre-trained word vectors that were built on a very large corpus and then reused as infrastructure in specific applications. Using pre-trained general-purpose word vectors can be viewed as a form of transfer learning. Reusing pre-trained vectors is analogous to deep learning for computer vision, where the lowest layers learn to detect edges in photographs and higher layers detect particular kinds of objects built up from the edges below them.

For example, Google provides some pre-trained word vectors with their word2vec package. With unsupervised learning, the more examples the better. Also, it is often impractical for an individual data scientist to download a gigantic corpus of text just to train their own word vectors, and there is no compelling reason for every user to reinvent the wheel by training word vectors themselves on the same general-purpose corpora, like Wikipedia.

Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old foundation of many natural-language data applications, is being made obsolete by the use of word embeddings instead.

Three questions:

1 - Does H2O currently provide any generally accepted pre-trained word embeddings (word vectors), for example trained on text found on legal or other public (government) websites, Wikipedia, Twitter, Craigslist, or other free or open sources of human-written text?

2 - Is there a community site where H2O users can share their trained word2vec vectors, built on more specialized corpora such as medicine and law?

3 - Can H2O import Google's pre-trained word vectors from their word2vec package?



1 answer


Thank you for your questions.

You are absolutely right, there are many situations where you don't need a custom model and a pre-trained model will work well. My expectation is that people will mostly build their own models for smaller problems in their specific domain and use pre-trained models to complement the custom model.

You can import third-party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many of the available GloVe models.

To do this, import the model into a frame (like any other dataset):

w2v.frame <- h2o.importFile("pretrained.glove.txt")

And then convert it to a regular H2O word2vec model:

w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)

Please note that you need to specify the size of the embeddings.
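Putting the two steps together, a minimal end-to-end sketch might look like this. The file name glove.6B.100d.txt and the query word are only placeholders for illustration; vec_size must match the dimensionality of the vectors in the file (100 here), and depending on how your GloVe file is formatted you may need to adjust the import options (e.g. the column separator).

library(h2o)
h2o.init()

# Import the pre-trained GloVe vectors as a regular H2O frame.
# GloVe text files have one word per row: the word, then its vector components.
w2v.frame <- h2o.importFile("glove.6B.100d.txt")   # placeholder path

# Wrap the frame in an H2O word2vec model; vec_size must equal the
# dimensionality of the imported vectors (100 for glove.6B.100d).
w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)

# Sanity check: look up words close to an example query word.
print(h2o.findSynonyms(w2v.model, "university", count = 5))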

As far as I know, H2O has no plans to offer a model exchange / model marketplace for word2vec models. You can use the models available online: https://github.com/3Top/word2vec-api

We currently do not support importing the Google binary word-embedding format; however, support is on our roadmap, as it makes a lot of sense for our users.
