When should I initialize state in LSTM code?

This is the Udacity LSTM code for sentiment classification.

Here is a link to the full code: udacity / sentiment-rnn

I wonder why they initialize the cell state right under the for loop over epochs.

I think the cell state should be reset to zero whenever the input sequence changes, so the reset should sit inside the mini-batch loop instead.

## Part of the sentiment-rnn code
# Getting an initial state of all zeros
initial_state = cell.zero_state(batch_size, tf.float32)

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)    ###### I think this line...

        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            ###### ...should be here instead
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
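
To make the change concrete, here is the variant I have in mind, as a sketch reusing the same names as the snippet above:

for e in range(epochs):
    for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
        # proposed change: reset to the zero state for every mini-batch,
        # since each batch contains new, independent input sequences
        state = sess.run(initial_state)
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 0.5,
                initial_state: state}
        loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)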


Can anyone explain why?

Thanks!

## 1 answer


  1. Zero initialization is good practice when its impact is small

By default, the zero state is used to initialize the RNN state. This often works well, especially for sequence-to-sequence problems such as language modeling, where the proportion of outputs that are significantly affected by the initial state is small.
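
For reference, a minimal standalone sketch of what that zero default evaluates to (the sizes here are made up for illustration, not the Udacity values):

import tensorflow as tf  # TF 1.x, as in the question

batch_size, lstm_size = 4, 8                      # hypothetical sizes
cell = tf.nn.rnn_cell.BasicLSTMCell(lstm_size)
initial_state = cell.zero_state(batch_size, tf.float32)

with tf.Session() as sess:
    state = sess.run(initial_state)               # LSTMStateTuple(c, h)
    print(state.c.shape, state.h.shape)           # (4, 8) (4, 8), all zeros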

  2. Zero initialization in each batch can lead to overfitting


Initializing to zero for each batch has the following consequence: losses at the early steps of each sequence (i.e., immediately after a state reset) will be larger than at later steps, since there is less history to condition on, so their contribution to the gradient during training will be relatively higher. But if every reset goes to the zero state, the model can (and will) learn to compensate for exactly that. As the ratio of state resets to total observations grows, the model parameters become more and more tuned to this zero state, which can hurt performance at later time steps.
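
A rough back-of-the-envelope illustration of that ratio (the step and batch counts are assumptions, not taken from the Udacity notebook):

# how often training sees a freshly reset (zero) state under each scheme
seq_len   = 200                         # assumed steps processed per batch
n_batches = 100                         # assumed batches per epoch
total_steps = seq_len * n_batches

resets_once_per_epoch = 1               # the question's code
resets_every_batch    = n_batches       # the proposed change
print(resets_once_per_epoch / total_steps)   # 5e-05: the zero state is rare
print(resets_every_batch / total_steps)      # 0.005: 100x more exposure to it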

  3. Do we have other options?

One simple solution is to make the initial state noisy (to reduce the loss for the first time step). Check here for details and other ideas.
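
As a sketch of that idea applied to the question's code (the noise scale of 0.1, and the assumption that cell is a stack of LSTM cells as in the notebook, are mine, not from the linked article):

import numpy as np
import tensorflow as tf  # TF 1.x

# instead of feeding exact zeros at each reset, perturb the state with small
# Gaussian noise so the model cannot latch onto one fixed starting vector
state = sess.run(initial_state)   # nested tuple of LSTMStateTuple(c, h) arrays
noisy_state = tuple(
    tf.nn.rnn_cell.LSTMStateTuple(
        s.c + np.random.normal(0.0, 0.1, size=s.c.shape).astype(np.float32),
        s.h + np.random.normal(0.0, 0.1, size=s.h.shape).astype(np.float32))
    for s in state)
feed = {inputs_: x,
        labels_: y[:, None],
        keep_prob: 0.5,
        initial_state: noisy_state}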
