LSTM learning path
I am new to neural networks and, instead of copying an example, I am doing my own "Hello World" with an LSTM. I chose a simple piece of logic:
The input has 3 time steps. The first value is 1 or 0, the other two are random numbers. The expected output is the same as the value in the first time slot. The data looks like this:
_X0=[1,5,9] _Y0=[1] _X1=[0,5,9] _Y1=[0] ... 200 more records like this.
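For reference, data of this shape can be generated with a few lines of numpy; in my case the actual data is loaded from a CSV by the helper shown below, so this sketch only illustrates the format:

import numpy as np

def make_data(n_samples=200, timesteps=3, seed=0):
    # The label equals the value at the first time step (0 or 1); the rest is random filler.
    rng = np.random.RandomState(seed)
    labels = rng.randint(0, 2, size=n_samples)
    noise = rng.randint(1, 10, size=(n_samples, timesteps - 1))
    x = np.concatenate([labels[:, None], noise], axis=1)
    # LSTM layers expect input of shape (samples, timesteps, features), so add a feature axis.
    return x[:, :, None].astype('float32'), labels

x_demo, y_demo = make_data()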
This simple(?) logic can be trained to 100% accuracy. I ran a lot of tests, and the most effective model I found was 3 LSTM layers, each with 15 hidden units. It reached 100% accuracy after 22 epochs.
However, I noticed something that is hard for me to understand: during the first 12 epochs the model makes no progress at all as measured by accuracy (it stays at 0.5), and only marginal progress as measured by categorical cross-entropy (it goes from 0.69 to 0.65). Then, from epoch 12 to epoch 22, it trains very quickly to an accuracy of 1.0. The question is: why does learning happen like this? Why do the first 12 epochs make no progress, and why are epochs 12-22 so much more efficient?
Here is my entire code:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, LSTM
from keras.models import Model
import helper
from keras.utils.np_utils import to_categorical
x_,y_ = helper.rnn_csv_toXY("LSTM_hello.csv",3,"target")
y_binary = to_categorical(y_)
model = Sequential()
model.add(LSTM(15, input_shape=(3,1),return_sequences=True))
model.add(LSTM(15,return_sequences=True))
model.add(LSTM(15, return_sequences=False))
model.add(Dense(2, activation='softmax', kernel_initializer='RandomUniform'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)
It is difficult to give a concrete answer to this, as it depends on many factors. One of the main factors that comes into play when training neural networks is the learning rate of your chosen optimizer.
In your code, you don't set a specific learning rate. The default learning rate of Adam in Keras 2.0.3 is 0.001. Adam uses a dynamic learning rate lr_t based on the initial learning rate (0.001) and the current time step, defined as

lr_t = lr * (sqrt(1. - beta_2**t) / (1. - beta_1**t))

The values beta_2 and beta_1 are usually left at their defaults of 0.999 and 0.9, respectively. If you plot this learning rate against the time step t, you get a picture something like this:
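You can reproduce that curve yourself with a few lines (my own sketch, using the default values above; the step range is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

lr, beta_1, beta_2 = 0.001, 0.9, 0.999        # Keras 2.0.3 Adam defaults
t = np.arange(1, 2001)                        # update steps (batches), not epochs
lr_t = lr * np.sqrt(1. - beta_2**t) / (1. - beta_1**t)

plt.plot(t, lr_t)
plt.xlabel('update step t')
plt.ylabel('effective learning rate lr_t')
plt.show()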
Perhaps this is just a sweet spot for updating your weights in a way that finds a local (possibly global) minimum. A learning rate that is too high often makes no difference to the error: it just "skips" over the regions that would reduce your error, while a lower learning rate takes smaller steps in the error landscape and lets you find regions where the error is lower.
I suggest you use an optimizer that makes fewer assumptions, such as stochastic gradient descent (SGD), and test this hypothesis with a lower learning rate.
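For example, a minimal way to try this on your model (keeping everything else the same; the learning rate of 0.0005 is just an illustrative value to tune):

from keras.optimizers import SGD

# Plain SGD with a fixed, explicitly chosen learning rate (no adaptivity, no momentum)
sgd = SGD(lr=0.0005)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)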