What's the difference between Luong attention and Bahdanau attention?
They are explained very well in the PyTorch seq2seq tutorial.
The main difference is in how the similarity between the current decoder state and the encoder outputs is scored.
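As a rough illustration (this is not the tutorial's code, and the tensor sizes are made up), Luong's simplest "dot" score is just a dot product, while Bahdanau's score is a small additive feed-forward network:

```python
import torch

H, S = 8, 5                             # hidden size and source length (made-up values)
decoder_state = torch.randn(H)          # the decoder hidden state used as the query
encoder_outputs = torch.randn(S, H)     # one encoder output per source position

# Luong "dot" score: similarity is a plain dot product.
luong_scores = encoder_outputs @ decoder_state                     # shape (S,)

# Bahdanau (additive) score: project both vectors, add, tanh, then a learned vector v.
W_a, U_a, v_a = torch.randn(H, H), torch.randn(H, H), torch.randn(H)
bahdanau_scores = torch.tanh(decoder_state @ W_a + encoder_outputs @ U_a) @ v_a  # (S,)

# Either way, a softmax over the source positions gives the attention weights.
weights = torch.softmax(luong_scores, dim=0)
context = weights @ encoder_outputs                                # context vector, (H,)
```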
I went through the paper Effective Approaches to Attention-based Neural Machine Translation. In section 3.1, the authors describe the differences between the two attention mechanisms as follows:
- I noticed that the hidden states of the top layer are used in both the encoder and the decoder, whereas Bahdanau's attention uses the concatenation of the forward and backward source hidden states from the bidirectional encoder.
- In Luong's attention, the decoder hidden state at time t is used: the attention scores are computed from it, a context vector is obtained, and that context vector is concatenated with the decoder hidden state at time t before making the prediction.
- In Bahdanau's attention, it is the decoder hidden state at time t-1 that is used. The alignment scores and the context vector are computed as above, but the context is combined with the hidden state at time t-1, and this combined vector goes into the GRU before the softmax (see the sketch after this list).
- Luong defines several alignment score functions (dot, general, concat); Bahdanau only uses the concat (additive) score.
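To make the ordering difference from the last two points concrete, here is a minimal, simplified sketch of one decoder step in each style. The names and shapes are my own, not the papers' exact formulations: dot-product attention is used in both branches for brevity, and Bahdanau's output layer is reduced to a plain linear projection.

```python
import torch
import torch.nn as nn

H, S, V = 8, 5, 100                          # hidden size, source length, vocab size (made up)
encoder_outputs = torch.randn(S, H)
luong_gru = nn.GRUCell(H, H)                 # Luong: the RNN only sees the input embedding
bahdanau_gru = nn.GRUCell(2 * H, H)          # Bahdanau: the RNN sees [input; context]
luong_out = nn.Linear(2 * H, V)              # Luong predicts from [h_t; context]
bahdanau_out = nn.Linear(H, V)               # Bahdanau (simplified) predicts from h_t

def attend(query, enc_out):
    # Dot-product scores for brevity; Bahdanau really uses the additive (concat) score.
    weights = torch.softmax(enc_out @ query.squeeze(0), dim=0)        # (S,)
    return (weights @ enc_out).unsqueeze(0)                           # context vector, (1, H)

def luong_step(x_t, h_prev):
    h_t = luong_gru(x_t, h_prev)                          # 1. advance the RNN first
    context = attend(h_t, encoder_outputs)                # 2. attend with h_t
    logits = luong_out(torch.cat([h_t, context], dim=1))  # 3. predict from [h_t; c_t]
    return logits, h_t

def bahdanau_step(x_t, h_prev):
    context = attend(h_prev, encoder_outputs)             # 1. attend with h_{t-1}
    h_t = bahdanau_gru(torch.cat([x_t, context], dim=1), h_prev)  # 2. [x_t; c_t] enters the GRU
    logits = bahdanau_out(h_t)                            # 3. predict from the new state
    return logits, h_t

x_t, h_prev = torch.randn(1, H), torch.randn(1, H)        # dummy input embedding and prior state
luong_logits, _ = luong_step(x_t, h_prev)
bahdanau_logits, _ = bahdanau_step(x_t, h_prev)
```

The only thing this sketch is meant to show is where the context vector enters the step: after the GRU update in Luong, but as part of the GRU input in Bahdanau.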
Professor Chris Manning explains the two methods in a Stanford NLP lecture: https://youtu.be/IxQtK2SjWWM?t=2996