How can an LSTM pay attention to variable-length input?

The LSTM attention mechanism is a feed-forward softmax network that receives the hidden state of each encoder time step and the current decoder state.

These two facts seem to contradict each other and I can't wrap my head around it: 1) the number of inputs to a feed-forward network has to be fixed, and 2) the number of encoder hidden states is variable (it depends on the number of time steps during encoding).

What am I missing? Also, would training be the same as for a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in advance.



2 answers


I asked myself the same thing today and found this question. I've never used the attention mechanism myself, but from this article it seems to be a little more than just a straightforward softmax. For each output y_i of the decoder network, a context vector c_i is computed as a weighted sum of the encoder hidden states h_1, ..., h_T:

c_i = α_{i1} h_1 + ... + α_{iT} h_T

The number of time steps T can be different for each sample, because the coefficients α_{ij} are not a fixed-size vector. They are computed as softmax(e_{i1}, ..., e_{iT}), where each e_{ij} is the output of a neural network whose inputs are the encoder hidden state h_j and the previous decoder hidden state s_{i-1}:



e_{ij} = f(s_{i-1}, h_j)

Thus, before computing y_i this neural network has to be evaluated T times, producing the T weights α_{i1}, ..., α_{iT}. This TensorFlow implementation may also be helpful.
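To make that concrete, here is a minimal NumPy sketch of a single decoder step (the toy scoring network f, the sizes, and the name attention_step are my own illustration, not from the paper). The scoring network always sees exactly one (s_{i-1}, h_j) pair, so its input size is fixed, while the softmax and the weighted sum simply run over however many time steps the sample has:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(decoder_state, encoder_states, f):
    """Compute one context vector c_i for a single decoder step.

    decoder_state:  s_{i-1}, shape [d_dec]
    encoder_states: h_1..h_T, shape [T, d_enc]; T may differ per sample
    f:              small scoring network applied once per time step
    """
    # e_ij = f(s_{i-1}, h_j), evaluated T times with the *same* weights
    scores = np.array([f(decoder_state, h_j) for h_j in encoder_states])  # [T]
    alphas = softmax(scores)                                              # [T], sums to 1
    # c_i = sum_j alpha_ij * h_j  -> fixed size d_enc, independent of T
    return (alphas[:, None] * encoder_states).sum(axis=0)

# toy scoring network f: its input is always a single (s_{i-1}, h_j) pair,
# so T never appears in its architecture
rng = np.random.default_rng(0)
W_s, W_h = rng.standard_normal((8, 16)), rng.standard_normal((8, 32))
v = rng.standard_normal(8)
f = lambda s, h: v @ np.tanh(W_s @ s + W_h @ h)

for T in (5, 9, 13):                   # variable-length inputs
    h = rng.standard_normal((T, 32))   # encoder hidden states h_1..h_T
    s = rng.standard_normal(16)        # previous decoder state s_{i-1}
    c = attention_step(s, h, f)
    print(T, c.shape)                  # always (32,)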



import tensorflow as tf
from tensorflow.contrib import layers

L2_REG = 1e-4  # L2 regularization strength (value assumed; not in the original snippet)


def attention(inputs, size, scope):
    # inputs: [batch, time, size] -- the sequence of LSTM outputs
    with tf.variable_scope(scope or 'attention'):
        # trainable context vector that each time step is scored against
        attention_context_vector = tf.get_variable(name='attention_context_vector',
                                                   shape=[size],
                                                   regularizer=layers.l2_regularizer(scale=L2_REG),
                                                   dtype=tf.float32)
        # project every time step, then score it against the context vector
        input_projection = layers.fully_connected(inputs, size,
                                                  activation_fn=tf.tanh,
                                                  weights_regularizer=layers.l2_regularizer(scale=L2_REG))
        vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector),
                                    axis=2, keep_dims=True)
        # softmax over the time dimension: one weight per time step, however many there are
        attention_weights = tf.nn.softmax(vector_attn, dim=1)
        # weighted sum over time gives a fixed-size output regardless of sequence length
        weighted_projection = tf.multiply(inputs, attention_weights)
        outputs = tf.reduce_sum(weighted_projection, axis=1)

        return outputs

      



Hope this piece of code helps you understand how attention works. I use this function in my document classification tasks, which is an LSTM attention model, different from your encoder/decoder model.
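A hypothetical usage sketch of how such a function could sit on top of an LSTM for classification (TF 1.x style; the placeholder names, sizes and num_classes below are made-up assumptions, not actual code from my project):

import tensorflow as tf

num_classes = 5                                                 # assumed for illustration
embedded_docs = tf.placeholder(tf.float32, [None, None, 128])   # [batch, time, embedding]
lengths = tf.placeholder(tf.int32, [None])                      # true length of each document

cell = tf.nn.rnn_cell.LSTMCell(64)
lstm_outputs, _ = tf.nn.dynamic_rnn(cell, embedded_docs,
                                    sequence_length=lengths, dtype=tf.float32)
# lstm_outputs: [batch, time, 64]; the time dimension can differ from batch to batch
doc_vector = attention(lstm_outputs, size=64, scope='doc_attention')  # [batch, 64]
logits = tf.layers.dense(doc_vector, num_classes)
# note: padded time steps are not masked here; in practice you would mask them
# before the softmax so they receive zero attention weight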







