LSTM learning path

I am new to neural networks and I am building my own "Hello World" with an LSTM instead of copying an existing example. I chose a simple logic task like this:

The input consists of 3 time steps. The first is 1 or 0, the other 2 are random numbers. The expected output is the same as the first input time step. The data looks like this:

_X0=[1,5,9] _Y0=[1] _X1=[0,5,9] _Y1=[0] ... 200 more records like this. 
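For reference, data like this could be generated in-memory with something like the following sketch (hypothetical; my actual data comes from the helper.rnn_csv_toXY call in the code below, and the names here are only illustrative):

import numpy as np

n = 200
X = np.random.uniform(1, 10, size=(n, 3, 1))     # 3 time steps, 1 feature each
X[:, 0, 0] = np.random.randint(0, 2, size=n)     # first time step is 0 or 1
y = X[:, 0, 0].astype(int)                       # target equals the first time step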


This simple(?) logic can be learned with 100% accuracy. I ran a lot of tests, and the most effective model I found was 3 LSTM layers, each with 15 hidden units. That gave 100% accuracy after 22 epochs.

However, I noticed something that is difficult for me to understand: in the first 12 epochs the model makes no progress at all as measured by accuracy (it stays at 0.5), and only minor progress as measured by categorical crossentropy (it goes from 0.69 to 0.65). Then from epoch 12 to epoch 22 it trains very quickly to an accuracy of 1.0. Why does learning happen like this? Why do the first 12 epochs make no progress, and why are epochs 12-22 so much more efficient?

Here is my entire code:

from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.utils.np_utils import to_categorical
import helper

# Load the records: x_ has shape (n_samples, 3, 1), y_ holds the 0/1 targets
x_, y_ = helper.rnn_csv_toXY("LSTM_hello.csv", 3, "target")
y_binary = to_categorical(y_)  # one-hot encode targets to shape (n_samples, 2)

# Three stacked LSTM layers with 15 units each
model = Sequential()
model.add(LSTM(15, input_shape=(3, 1), return_sequences=True))
model.add(LSTM(15, return_sequences=True))
model.add(LSTM(15, return_sequences=False))
model.add(Dense(2, activation='softmax', kernel_initializer='RandomUniform'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)



1 answer


It is difficult to give a concrete answer to this, as it depends on many factors. One of the main factors that comes into play when training neural networks is the learning rate of your chosen optimizer.

In your code, you don't set a specific learning rate. Adam's default learning rate in Keras 2.0.3 is 0.001. Adam uses a dynamic learning rate lr_t based on the initial learning rate (0.001) and the current time step t, defined as

lr_t = lr * (sqrt(1. - beta_2**t) / (1. - beta_1**t))

The values beta_2 and beta_1 are usually left at their defaults of 0.999 and 0.9, respectively. If you plot this learning rate, you get a picture something like this:



[Plot: Adam's dynamic learning rate over epochs 1-22]
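A short sketch (my own, simplifying by treating each plotted step as one parameter update) that reproduces this curve:

import numpy as np
import matplotlib.pyplot as plt

lr, beta_1, beta_2 = 0.001, 0.9, 0.999   # Keras 2.0.3 Adam defaults
t = np.arange(1, 23)                     # Adam counts t per update, not per epoch
lr_t = lr * np.sqrt(1. - beta_2**t) / (1. - beta_1**t)

plt.plot(t, lr_t)
plt.xlabel("time step t")
plt.ylabel("lr_t")
plt.show()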

Perhaps this is simply the point where the learning rate becomes just right for the weight updates to find a local (possibly even global) minimum. A learning rate that is too high often makes no difference at all, because it just "skips" over regions that would reduce your error, while lower learning rates take smaller steps in the error landscape and allow you to find areas where the error is lower.

I suggest testing this hypothesis by using an optimizer that makes fewer assumptions, such as stochastic gradient descent (SGD), with a lower learning rate. For example (a minimal sketch reusing the model from the question and the Keras 2.0.3-era API; the 0.0005 value is just an illustrative choice):
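from keras.optimizers import SGD

# Plain SGD with a fixed, lower learning rate instead of Adam's dynamic one
model.compile(optimizer=SGD(lr=0.0005),
              loss='categorical_crossentropy',
              metrics=['acc'])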
