TensorFlow tf.constant_initializer is very slow

I am trying to use pretrained 100-dimensional word2vec embeddings to train an LSTM:

import codecs
import time

import tensorflow as tf


@staticmethod
def load_embeddings(pre_trained_embeddings_path, word_embed_size):
    embd = []
    start_time = time.time()
    cnt = 4
    with codecs.open(pre_trained_embeddings_path, mode="r", encoding='utf-8') as f:
        for line in f:
            values = line.strip().split(' ')
            embd.append(values[1:])  # first token is the word, the rest is the vector
            cnt += 1
            if cnt % 100000 == 0:
                print("word-vectors loaded: %d" % cnt)

    embedding, vocab_size, embed_dim = embd, len(embd), len(embd[0])

    load_end_time = time.time()
    print("word vectors loaded, starting initialisation, cnt: %d, time taken: %d secs" % (vocab_size, load_end_time - start_time))

    # bake the whole embedding matrix into the graph via a constant initializer
    embedding_init = tf.constant_initializer(embedding, dtype=tf.float16)
    src_word_embedding = tf.get_variable(shape=[vocab_size, embed_dim], initializer=embedding_init, trainable=False, name='word_embedding', dtype=tf.float16)

    print("word-vectors loaded and initialised, cnt: %d, time taken: %d secs" % (vocab_size, time.time() - load_end_time))

    return src_word_embedding

      

The output when running this method looks like this:

word vectors loaded, starting initialisation, cnt: 2419080, time taken: 74 secs
word-vectors loaded and initialised, cnt: 2419080, time taken: 1647 secs

      

System information: TensorFlow 1.1.0, tcmalloc, Python 3.6, Ubuntu 14.04

Half an hour to initialize seems very slow, or is this normal behavior? Any idea what the problem might be, or whether there is one at all?

UPDATE: Using @sirfz's method of feeding the embeddings through a placeholder, they loaded really fast: Initialization Done in 85 secs.





2 answers


Loading large constants into the graph is not only slow, it also leaks a lot of memory. I had a similar problem which I reported a while ago, and the best solution for me was this:



# placeholder for loading your saved embeddings
embedding_init = tf.placeholder(tf.float16, shape=[vocab_size, embed_dim])
src_word_embedding = tf.get_variable(initializer=embedding_init, trainable=False, name='word_embedding', dtype=tf.float16)

# run initialization with the value of embeddings placeholder
session.run(tf.global_variables_initializer(), feed_dict={embedding_init: embedding})
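
For completeness, here is one way the question's load_embeddings could be wired to this pattern. This is only a rough, untested sketch: the session argument and the explicit numpy conversion are my own additions, not part of the original method.

import codecs

import numpy as np
import tensorflow as tf


def load_embeddings_via_placeholder(pre_trained_embeddings_path, session):
    # parse the word2vec text file: first token is the word, the rest is the vector
    embd = []
    with codecs.open(pre_trained_embeddings_path, mode="r", encoding='utf-8') as f:
        for line in f:
            values = line.strip().split(' ')
            embd.append(values[1:])
    embedding = np.asarray(embd, dtype=np.float16)
    vocab_size, embed_dim = embedding.shape

    # the placeholder keeps the embedding matrix out of the GraphDef entirely
    embedding_init = tf.placeholder(tf.float16, shape=[vocab_size, embed_dim])
    src_word_embedding = tf.get_variable(initializer=embedding_init, trainable=False, name='word_embedding')

    # feed the actual values only when the variables are initialised
    session.run(tf.global_variables_initializer(), feed_dict={embedding_init: embedding})
    return src_word_embedding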

      





I don't know if this is the intended behavior, but I can explain why it might slow down with a small example:

import tensorflow as tf

x = [[0, 1], [2, 3]]
a = tf.constant(x, name='a')
b = tf.Variable(x, name='b')
c = a + b

with tf.Session() as sess:
    # write the graph to 'logs' so it can be inspected in TensorBoard
    writer = tf.summary.FileWriter('logs', sess.graph)
    writer.close()

      

When you initialize a constant, the value of that constant is added to the graph itself. If you open the graph in TensorBoard, you can see this by clicking on the node a and inspecting its stored value.



[Screenshot: TensorBoard graph view of node a, showing its full value stored as an attribute of the constant op.]

In my case it was a 2x2 matrix, but in your case it is roughly a 2.4M x 100 matrix, which is huge. In my opinion, that is the reason for the slow execution.
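
One way to see this without TensorBoard (a small check of my own, assuming TF 1.x) is to look at how the constant's value is stored on its op and how it inflates the serialized GraphDef:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    a = tf.constant([[0.0] * 100] * 1000, name='a')  # a 1000 x 100 constant

# the full value lives in the 'value' attribute of the constant op ...
print(a.op.get_attr('value').tensor_shape)
# ... so the serialized graph grows with the size of the data
print(len(graph.as_graph_def().SerializeToString()))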

Try to create it as a variable instead and load the values into it afterwards.
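
A minimal sketch of that idea (my own example, assuming TF 1.x): create the variable with a cheap initializer so the graph definition stays small, then push the real values in through a placeholder-backed assign.

import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 2419080, 100  # sizes from the question

# cheap initializer: no embedding data is stored in the graph definition
word_embedding = tf.get_variable("word_embedding", shape=[vocab_size, embed_dim],
                                 dtype=tf.float16, trainable=False,
                                 initializer=tf.zeros_initializer())

embedding_ph = tf.placeholder(tf.float16, shape=[vocab_size, embed_dim])
load_op = word_embedding.assign(embedding_ph)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # stand-in for the real matrix parsed from the word2vec file
    embedding = np.zeros([vocab_size, embed_dim], dtype=np.float16)
    sess.run(load_op, feed_dict={embedding_ph: embedding})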









