Initializing decoder state in sequence-to-sequence models

I am writing my first neural machine translator in TensorFlow, using an attentional encoder-decoder architecture. My encoder and decoder are both LSTM stacks with residual connections, but the encoder has an initial bidirectional layer while the decoder does not.

The code I've seen follows the common practice of initializing the decoder cells' states with the final states of the encoder cells. However, this is only a clean solution when the encoder and decoder architectures are identical, as they are in many seq2seq tutorials. In many other systems, such as this Google model, the encoder and decoder architectures differ.
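To make the mismatch concrete, here is a minimal NumPy sketch (shapes and layer counts are illustrative assumptions, not taken from any particular model): with matching stacks the encoder's final per-layer states can be handed to the decoder one-to-one, but a bidirectional first encoder layer produces a differently shaped state, so the naive copy no longer fits.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, batch, units = 3, 2, 128

# Pretend these are the final (h, c) states of a 3-layer
# unidirectional LSTM encoder, one pair per layer.
encoder_final = [(rng.standard_normal((batch, units)),
                  rng.standard_normal((batch, units)))
                 for _ in range(num_layers)]

# Identical decoder stack: the common practice is a one-to-one copy.
decoder_initial = encoder_final

# A bidirectional first encoder layer concatenates the forward and
# backward states, giving (batch, 2 * units) -- the copy no longer
# matches the decoder's (batch, units) state, which is the problem
# the question is about.
bi_layer_state = rng.standard_normal((batch, 2 * units))
assert bi_layer_state.shape != decoder_initial[0][0].shape
```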

What alternative strategies are used to initialize the decoder state in these circumstances?

I've seen cases where the encoder's last hidden state is passed through a trained weight matrix to produce the initial state for every decoder layer. I've also seen more inventive ideas, such as the one presented here, but I would like to develop an intuition for why people choose particular strategies.
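The "trained bridge" idea above can be sketched as follows. This is a NumPy illustration rather than TensorFlow code, and every size, the per-layer bridge design, and the tanh squashing are assumptions on my part: the encoder's final hidden state is projected through one learned affine map per decoder state tensor, yielding an initial (h, c) pair for each decoder layer regardless of the shape mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)

enc_units = 512      # encoder final state size (e.g. concatenated bi-LSTM directions)
dec_units = 256      # decoder LSTM hidden size
num_dec_layers = 4   # decoder stack depth
batch = 2

# Stand-in for the encoder's final hidden state.
enc_final_h = rng.standard_normal((batch, enc_units))

def make_bridge(in_dim, out_dim):
    """One trainable affine map (here randomly initialized) per state tensor."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.01
    b = np.zeros(out_dim)
    return W, b

# One (h, c) bridge per decoder layer; tanh keeps the produced
# states in the bounded range an LSTM hidden state normally occupies.
init_states = []
for _ in range(num_dec_layers):
    Wh, bh = make_bridge(enc_units, dec_units)
    Wc, bc = make_bridge(enc_units, dec_units)
    h0 = np.tanh(enc_final_h @ Wh + bh)
    c0 = np.tanh(enc_final_h @ Wc + bc)
    init_states.append((h0, c0))

print(len(init_states), init_states[0][0].shape)  # 4 (2, 256)
```

In a real model the bridge weights would be trained jointly with the rest of the network; whether each layer gets its own bridge or all layers share one is itself a design choice.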
