When and where should we use these LSTM Keras models?

I understand how RNNs, LSTMs, neural networks, and activation functions work, but among the different LSTM models available I don't know which one I should use, for what data, and when. I created these five models as a sample of the LSTM variations I've seen, but I don't know which kind of dataset each one is best suited for. Most of my questions are about the second/third lines of these models. Are Model 1 and Model 4 the same? Why is

model1.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False))

different from

model4.add(Embedding(X_train.shape[1], 128, input_length=max_len))

? I would really appreciate it if someone could explain these five patterns in plain English.

from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, TimeDistributed

# X_train and max_len are assumed to be defined elsewhere

#model1
model1 = Sequential()
model1.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model1.summary())

#model2
model2 = Sequential()
model2.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=True))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model2.summary())


#model3
model3 = Sequential()
model3.add(TimeDistributed(Dense(X_train.shape[1]), input_shape=(X_train.shape[1],1)))
model3.add(LSTM(10, return_sequences=False))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model3.summary())


#model4
model4 = Sequential()
model4.add(Embedding(X_train.shape[1], 128, input_length=max_len))
model4.add(LSTM(10))
model4.add(Dense(1, activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model4.summary())

#model5
model5 = Sequential()
model5.add(Embedding(X_train.shape[1], 128, input_length=max_len))
model5.add(Bidirectional(LSTM(10)))
model5.add(Dense(1, activation='sigmoid'))
model5.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model5.summary())

      

1 answer


So:



  • The first network is the best fit for classification: it processes the entire sequence, and only after all the input steps have been fed in does it make a decision. There are other variants of this architecture (using, for example, GlobalAveragePooling1D or GlobalMaxPooling1D) that are quite similar from a conceptual point of view; see the first sketch after this list.

  • The second network is very similar to the first in terms of design. What distinguishes them is that in the first approach two consecutive calls to fit and predict are completely independent, whereas here the initial state for the second call is the final state left over from the first. This enables many interesting scenarios, e.g. analyzing sequences of varying length, or a decision-making process in which you can effectively pause inference/training, influence the network or its input, and come back to it with the updated state (see the stateful sketch after this list).

  • The third is best when you do not want to use a recurrent network at every stage of your computation. Especially when your network is large, recurrent layers are quite costly in terms of parameter count (adding a recurrent connection usually at least doubles the number of parameters), so you can apply a static (non-recurrent) network as a preprocessing stage and then feed its results to the recurrent part. This makes training easier.

  • The fourth model is a special case of the third. Here you have a sequence of tokens that would otherwise be one-hot encoded, and the Embedding layer transforms the integer token ids into dense vectors instead, which makes the process much less memory-intensive; see the shape comparison after this list.

  • A bidirectional network gives you the advantage that at each step it knows not only the preceding part of the sequence but also the steps that follow. This is computationally more expensive, and you lose the possibility of feeding the data in as a stream, since you need the complete sequence before the analysis.
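
As an illustration of the pooling variant mentioned in the first point, here is a minimal sketch (reusing the max_len from your code and assuming the same binary-classification setup): the LSTM returns its full output sequence and GlobalAveragePooling1D collapses it across time before the classifier.

from keras.models import Sequential
from keras.layers import LSTM, Dense, GlobalAveragePooling1D

# Variant of model 1: keep every timestep's output, then average them
pooled = Sequential()
pooled.add(LSTM(10, input_shape=(max_len, 1), return_sequences=True))  # (batch, max_len, 10)
pooled.add(GlobalAveragePooling1D())                                   # (batch, 10)
pooled.add(Dense(1, activation='sigmoid'))
pooled.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])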
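
For the second point, here is a rough sketch of how stateful=True is typically exercised with your model2. The arrays long_seq and targets are hypothetical placeholders: one long sequence cut into single-timestep chunks to match model 2's batch_input_shape=(1, 1, 1), and 10 epochs is an arbitrary choice.

# Hypothetical data:
#   long_seq has shape (n_steps, 1, 1, 1)  -> one (batch=1, timestep=1, feature=1) chunk per step
#   targets  has shape (n_steps, 1, 1)     -> one label per chunk
for epoch in range(10):
    for step_x, step_y in zip(long_seq, targets):
        # the LSTM state left over from the previous call is reused (stateful=True)
        model2.train_on_batch(step_x, step_y)
    # the sequence is over: clear the accumulated state before the next pass
    model2.reset_states()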
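
Finally, on "Are Model 1 and Model 4 the same?": they are not, and the clearest way to see it is what each expects as input. A sketch of the expected input shapes, reusing model1, model4, max_len and X_train from your code (vocab_size is my own placeholder name, and the batch size of 32 is arbitrary):

import numpy as np

# Model 1: real-valued sequences, shape (samples, max_len, 1)
x_model1 = np.random.random((32, max_len, 1)).astype('float32')

# Model 4: integer token ids in [0, vocab_size), shape (samples, max_len);
# the Embedding layer turns each id into a dense 128-dim vector internally
vocab_size = X_train.shape[1]
x_model4 = np.random.randint(0, vocab_size, size=(32, max_len))

print(model1.predict(x_model1).shape)  # (32, 1)
print(model4.predict(x_model4).shape)  # (32, 1)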
