Calculate values for subsets of tensor values that are connected to the same neuron in an optimizer

I am writing an optimizer in TensorFlow using python.

How can I calculate values for subsets of tensor values that belong to the incoming connections of the same neuron?


For example, take a stochastic gradient descent optimizer with a momentum term. A momentum value is calculated for every single connection. Now I want to calculate the momentum for one connection by taking the mean of the momentum values of all connections that are connected to the same neuron.

Connection example

In this figure, you can see two connections that both enter neuron 3 as incoming connections. Both connections must be taken into account to update the weight of either one. Normally, an update for connection (1, 3) would only include gradient (1, 3) and momentum (1, 3). To update connection (1, 3), I want to use the mean of momentum (1, 3) and momentum (2, 3).

Let's look at a simple fully connected neural network with one input neuron, two hidden layers, two neurons per hidden layer, and one output neuron:

Neural network example

In the normal calculation of the momentum (called "accumulation" in the code) for updating the weight of the connection between neuron 2 and neuron 5, only the previous momentum of that same connection is taken into account.

We can see the normal "accumulation" update computation from the python implementation below:

accumulation = self.get_slot(var, "a")
accumulation_update = grad + (mu_t * accumulation)

      

For the connection between neuron 2 and neuron 5, the accumulation looks like this:

accumulationUpdate_{2,5} = grad_{2,5} + (\mu * accumulation_{2,5})

This is the part that needs to change. The new momentum calculation should take the mean over all connections that enter the same neuron as the connection whose weight update is being calculated. In the example network, the "accumulation" value for connection (2, 5) is the mean of the "accumulation" values of connections (2, 5) and (3, 5). These are all the incoming connections of neuron 5.

The Accumulation update changes as follows:

accumulation = self.get_slot(var, "a")
accumulation_means = # Code to calculate all mean values for all neurons
accumulation_update = grad + (mu_t * accumulation_means) # Use the means for the accumulation_update

      

The accumulation update for connection (2, 5) is now calculated as follows:

accumulation_mean = (accumulation(2, 5) + accumulation(3, 5)) / 2
accumulation_update(2, 5) = grad(2, 5) + (mu_t * accumulation_mean)
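
The per-connection rule above can be sketched in plain Python, without TensorFlow. This is only an illustration of the arithmetic, not optimizer code: the connection keys, the accumulation values, and the gradients are made-up numbers for the example network, and mu stands for the momentum term.

```python
from collections import defaultdict

# Hypothetical accumulation ("a" slot) values and gradients, keyed by
# (source neuron, target neuron). All numbers are made up for illustration.
accumulation = {(2, 4): 0.2, (3, 4): 0.6, (2, 5): 0.4, (3, 5): 0.8}
grad = {(2, 4): 0.01, (3, 4): 0.02, (2, 5): 0.03, (3, 5): 0.04}
mu = 0.9  # momentum term

# Collect the accumulation values of all incoming connections per target neuron.
incoming = defaultdict(list)
for (src, dst), value in accumulation.items():
    incoming[dst].append(value)

# One mean per target neuron, shared by all of its incoming connections.
accumulation_mean = {dst: sum(vals) / len(vals) for dst, vals in incoming.items()}

# New accumulation for every connection, using the mean of its target neuron.
accumulation_update = {
    conn: grad[conn] + mu * accumulation_mean[conn[1]] for conn in accumulation
}
```

Connections (2, 5) and (3, 5) end up sharing the same mean because they have the same target neuron, while their gradients still differ.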

      

This calculation is done the same for each connection:

calculation for all connections

Here's a python implementation of stochastic gradient descent with momentum:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, momentum_term=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = momentum_term

        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        mu_t = math_ops.cast(self._mu_t, var.dtype.base_dtype)
        accumulation = self.get_slot(var, "a")

        accumulation_update = grad + (mu_t * accumulation)
        accumulation_t = state_ops.assign(accumulation, accumulation_update, use_locking=self._use_locking)

        var_update = lr_t * accumulation_t
        var_t = state_ops.assign_sub(var, var_update, use_locking=self._use_locking)

        return control_flow_ops.group(*[var_t, accumulation_t])

    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = ops.convert_to_tensor(self._mu, name="momentum_term")

      

The neural network I'm testing with (MNIST): https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py

How can I implement the described mean of the "accumulation" values in the existing MWE code?


As a side note:

The MWE is not my real scenario. It is just a minimal working example to explain and work on the problem I am trying to solve.

I am writing the optimizer in Python because I could not build TensorFlow on Windows and therefore could not compile the C++ files. I have spent a lot of time trying to build on Windows and cannot afford to spend more on it. An optimizer in Python is enough for me, as I am only prototyping at the moment.

I am new to TensorFlow and Python. I can't find anything about this topic in the documentation, so a link to a source would be great. Also, I have not yet fully understood the internal structure of tensors, and the error messages I get when experimenting are not clear to me. Please keep this in mind when explaining something.


1 answer


Let's take neurons 2, 3, 4 and 5 as an example for calculating the new momentum. We ignore the biases and only consider the weights:

enter image description here

We use W for the weight matrix, G for the corresponding gradients of W, M for the corresponding momentum matrix, and \tilde{\bm{M}} for the mean matrix.

enter image description here

So the new momentum is updated as follows:

enter image description here
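
Read off the line accumulation_update = grad + (mu_t * accumulation_mean) in the code of this answer, the update presumably has the following form. The superscript "new" and the symbol \eta for the learning rate are notation assumed here; W, G and \tilde{\bm{M}} are as defined above.

```latex
\bm{M}^{\text{new}} = \bm{G} + \mu \, \tilde{\bm{M}}
\qquad
\bm{W}^{\text{new}} = \bm{W} - \eta \, \bm{M}^{\text{new}}
```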

I modified the code in your SGDmomentum class and ran it on the MNIST example without errors, as I think you have already done.



def _apply_dense(self, grad, var):
    lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
    mu_t = math_ops.cast(self._mu_t, var.dtype.base_dtype)
    accumulation = self.get_slot(var, "a")

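    # Number of dimensions of the slot: 2 for dense weight matrices, 1 for biases.
    param_dims = len(accumulation.get_shape().as_list())
    if param_dims == 2:  # fc layer weights
        # math_ops is already imported by the question's module (tf is not).
        # keep_dims is the TF 1.x spelling; later releases renamed it keepdims.
        # The mean is taken along axis 1 and kept as a column for broadcasting.
        accumulation_mean = math_ops.reduce_mean(accumulation, axis=1, keep_dims=True)
    elif param_dims == 1:  # biases
        accumulation_mean = accumulation
    else:  # cnn? or others
        # TODO: improvement
        accumulation_mean = accumulation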

    accumulation_update = grad + (mu_t * accumulation_mean)  # broadcasting is supported by tf.add()
    accumulation_t = state_ops.assign(accumulation, accumulation_update, use_locking=self._use_locking)

    var_update = lr_t * accumulation_t
    var_t = state_ops.assign_sub(var, var_update, use_locking=self._use_locking)

    return control_flow_ops.group(*[var_t, accumulation_t])
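
Which axis corresponds to "all incoming connections of a neuron" depends on how the weight matrix is laid out. The following NumPy sketch (NumPy is used only for illustration, and the values are made up) assumes the weights are stored as [input, output], which is the layout used when a layer is computed as tf.matmul(x, W), as the linked MNIST example appears to do.

```python
import numpy as np

# Hypothetical 2x2 accumulation matrix for the layer between neurons (2, 3)
# and (4, 5), stored as [input, output]: row = source, column = target.
#                         to 4  to 5
accumulation = np.array([[0.2, 0.4],   # from neuron 2
                         [0.6, 0.8]])  # from neuron 3

# axis=0 collapses the rows: one mean per column, i.e. per target neuron.
mean_per_target = accumulation.mean(axis=0, keepdims=True)  # shape (1, 2)

# axis=1 collapses the columns: one mean per row, i.e. per source neuron.
mean_per_source = accumulation.mean(axis=1, keepdims=True)  # shape (2, 1)

grad = np.full((2, 2), 0.01)  # made-up gradients
mu = 0.9                      # momentum term

# Either mean broadcasts back to the full (2, 2) shape in the update.
update_per_target = grad + mu * mean_per_target
update_per_source = grad + mu * mean_per_source
```

Under this layout assumption, the mean over the incoming connections of neuron 5 is the column mean (axis 0), so it may be worth checking that the axis passed to reduce_mean matches the layout of your own variables.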

      

For training

with tf.name_scope('train'):
    train_step = SGDmomentum(FLAGS.learning_rate, 0.9).minimize(cross_entropy)
    # train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
    #     cross_entropy)

      

At the moment, this algorithm converges more slowly than traditional SGD with momentum on MNIST.

As for additional reading, I don't know whether the Stanford CS231n material on gradient descent and SGD with momentum can help you. You probably already know it.

If you are still confused about the matrix structure of the gradient tensors, try to get used to it: the update for a matrix is written almost exactly like the update for a single scalar.

What I've done here is just convert the calculation of each accumulationUpdate_* in your question to matrix form.
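
That the matrix form and the per-connection form give the same numbers can be checked with a small NumPy sketch. It uses the plain momentum formula from the question (without the mean), and all values are made up for illustration.

```python
import numpy as np

mu = 0.9  # momentum term
grad = np.array([[0.01, 0.03], [0.02, 0.04]])      # made-up gradients
accumulation = np.array([[0.2, 0.4], [0.6, 0.8]])  # made-up momentum values

# Matrix form: one expression updates every connection at once.
matrix_update = grad + mu * accumulation

# Scalar form: the same formula applied connection by connection.
scalar_update = np.empty_like(accumulation)
for i in range(accumulation.shape[0]):
    for j in range(accumulation.shape[1]):
        scalar_update[i, j] = grad[i, j] + mu * accumulation[i, j]

# True when both forms agree element by element.
same = np.allclose(matrix_update, scalar_update)
```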
