Why does the recognition rate drop after several epochs of online training?

I am using TensorFlow for pattern recognition on the MNIST dataset. In each training epoch, I select 10,000 random images and do online training with a batch size of 1. The recognition rate increases during the first few epochs, but after several epochs it drops significantly. (In the first 20 epochs, the recognition rate rises to ~94%. Subsequently it goes 90% -> 50% -> 40% -> 30% -> 20%.) What is the reason for this?

In addition, with a batch size of 1, performance is worse than with a batch size of 100 (maximum recognition rate of 94% versus 96%). I've looked at several references, but there seem to be conflicting results about whether small or large batch sizes give better performance. What is happening in this case?

Edit: I've also added the recognition rate for the training set and the test set. [Plot: recognition rate versus epoch]

I have attached a copy of the code below. Thanks for the help!

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)

#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

#model of the neural network: three fully connected hidden layers with ReLU
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1]), name='l1_w'),
                      'biases':  tf.Variable(tf.random_normal([n_nodes_hl1]), name='l1_b')}

    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]), name='l2_w'),
                      'biases':  tf.Variable(tf.random_normal([n_nodes_hl2]), name='l2_b')}

    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]), name='l3_w'),
                      'biases':  tf.Variable(tf.random_normal([n_nodes_hl3]), name='l3_b')}

    output_layer   = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]), name='lo_w'),
                      'biases':  tf.Variable(tf.random_normal([n_classes]), name='lo_b')}

    l1 = tf.nn.relu(tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases']))
    l2 = tf.nn.relu(tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases']))
    l3 = tf.nn.relu(tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases']))
    # raw logits; softmax is applied inside the loss
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

#train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 100

    # build the evaluation ops once, outside the loop, so the graph does not
    # grow by two new nodes on every epoch
    correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)

      

2 answers


ACCURACY DROP

You're overfitting. This is when the model learns spurious features that are artifacts of the training data, at the expense of the important features. One of the primary experimental results for any application is determining the optimal number of training iterations.

For example, perhaps 80% of the 7s in your training data happen to have a slight extra slant to the right at the base of the stem, where 4s and 1s do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is by that extra slant, despite any other features. As a result, some 1s and 4s now get classified as 7s.



BATCH SIZE

Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: it lets the first few input images influence the early kernel or perceptron weights far too much. This is a minor case of overfitting: one item having an undue influence on the model. However, it's significant enough to alter your best results by 2%.

You need to balance the batch size against the other hyperparameters to find the model's "sweet spot": optimal performance, followed by the shortest training time. In my experience, it's been best to increase the batch size until the time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) lost very little accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that made the best use of available RAM. A hypothetical sweep is sketched below.
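
For instance, a sweep over a few batch sizes could look like the following sketch. Here train_and_evaluate is an assumed refactoring of the question's train_neural_network that accepts a batch size and returns the final test accuracy; the code in the question does not provide this.

for bs in [1, 10, 100, 500]:
    tf.reset_default_graph()                    # fresh graph for each run
    acc = train_and_evaluate(batch_size=bs)     # hypothetical helper
    print('batch size %4d -> test accuracy %.4f' % (bs, acc))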



There are several possibilities, and you'd have to do some experimentation to find out which it is.

Overfitting

Prune explained it well. I'll add that the easiest way to avoid overfitting is to simply hold out 10-15% of the training set and evaluate the recognition rate on this held-out validation set every few epochs. If you plot the change in recognition rate on both the training and validation sets, you will eventually reach a point on the graph where the training error keeps decreasing but the validation error starts to grow. Stop training at that point; that's where overfitting starts in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
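
A minimal sketch of that early-stopping loop, reusing the names from the question's code (sess, optimizer, accuracy, x, y) and the fact that read_data_sets already holds out mnist.validation (5,000 images by default); the patience of 3 epochs is an arbitrary illustrative choice:

best_val, patience, bad_epochs = 0.0, 3, 0
for epoch in range(hm_epochs):
    for batch in range(10000):
        epoch_x, epoch_y = mnist.train.next_batch(batch_size)
        sess.run(optimizer, feed_dict={x: epoch_x, y: epoch_y})
    # evaluate on the held-out validation set, never used for training
    val_acc = accuracy.eval({x: mnist.validation.images, y: mnist.validation.labels})
    if val_acc > best_val:
        best_val, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation accuracy stopped improving
            break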

This seemed more likely before you added that the training error isn't decreasing either, but it's still possible that the model is overfitting a fairly homogeneous portion of your training set at the expense of the outliers, or something like that. Try randomizing the order of your training set after each epoch; if the model is fitting one section of the set at the expense of the others, this might help (see the sketch below).
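
An explicit per-epoch shuffle might look like this sketch, which draws batches by slicing fixed arrays (note that next_batch, used in the question's code, already reshuffles on its own):

import numpy as np

images, labels = mnist.train.images, mnist.train.labels
for epoch in range(hm_epochs):
    order = np.random.permutation(len(images))   # new random order each epoch
    for i in range(0, len(images), batch_size):
        idx = order[i:i + batch_size]
        sess.run(optimizer, feed_dict={x: images[idx], y: labels[idx]})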

Addendum: the massive, instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.



Numerical instability

If you get a particularly bad input at a point where the activation function has a large gradient, you can end up with a gigantic weight update that wrecks everything learned so far. For this reason, it's common to impose a hard limit on the gradient magnitude. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
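
If you want to rule this out anyway, a minimal sketch of hard-limiting the gradient magnitude in TF1 looks like this; the clip norm of 5.0 is an arbitrary illustrative value, not a recommendation:

optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(cost)
# clip each gradient's norm before applying the update
clipped = [(tf.clip_by_norm(g, 5.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)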

Saturated neurons

Some activation functions have regions with very small gradients, so if you end up with weights such that the function's input is almost always in such a region, you get a tiny gradient and therefore can't learn effectively. Sigmoids and tanh are particularly prone to this, since they have flat regions on both sides of the function. ReLUs don't have a flat region at the high end, but do at the low end. Try replacing your activation functions with Softplus; it's similar to ReLU, but with a continuous, nonzero gradient.
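
Swapping this into the model from the question is a one-line change per layer:

# softplus(x) = log(1 + e^x): ReLU-like, but its gradient is nonzero everywhere
l1 = tf.nn.softplus(tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases']))
l2 = tf.nn.softplus(tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases']))
l3 = tf.nn.softplus(tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases']))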
