What does compute_gradients return in tensorflow

import tensorflow as tf

mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
gradients, variables = zip(*optimizer.compute_gradients(mean_sqr))
opt = optimizer.apply_gradients(list(zip(gradients, variables)))

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

for j in range(TRAINING_EPOCHS):
    sess.run(opt, feed_dict={x: batch_xs, y_: batch_xs})


I don't understand what compute_gradients returns. Does it return the sum of dy/dx over the x values supplied in batch_xs, which apply_gradients then uses to update the variables, e.g.:
theta <- theta - LEARNING_RATE * 1/m * gradients?

Or does it already return the average of the gradients over the batch, i.e. sum(dy/dx) * 1/m, where m is the batch_size?





1 answer


compute_gradients(a, b) returns d[sum a] / db. So in your case it returns d mean_sqr / d theta, where theta is the set of all variables. There is no "dx" in this equation: you do not compute gradients with respect to the inputs. So what happens to the batch dimension? You remove it yourself in the mean_sqr definition:

mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))


Thus (assuming y is 1D for simplicity):

d[ mean_sqr ] / d theta = d[ 1/M SUM_i=1^M (pred(x_i) - y_i)^2 ] / d theta
                        = 1/M SUM_i=1^M d[ (pred(x_i) - y_i)^2 ] / d theta


So you control whether the gradient is summed over the batch, averaged, or something else: if you defined mean_sqr using reduce_sum instead of reduce_mean, the gradients would be the sum over the batch, and so on.
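To make this concrete, here is a minimal sketch (the variable W, its shape, and the batch size M are made up for illustration): compute_gradients returns a list of (gradient, variable) pairs whose shapes match the variables, with no batch dimension left, and replacing reduce_mean with reduce_sum simply rescales the gradients by M.

import numpy as np
import tensorflow as tf

M = 4                                              # hypothetical batch size
x  = tf.placeholder(tf.float32, [None, 3])
y_ = tf.placeholder(tf.float32, [None, 1])
W  = tf.Variable(tf.ones([3, 1]))                  # the only trainable variable here
y  = tf.matmul(x, W)                               # pred(x)

loss_mean = tf.reduce_mean(tf.pow(y_ - y, 2))
loss_sum  = tf.reduce_sum(tf.pow(y_ - y, 2))

opt = tf.train.GradientDescentOptimizer(0.1)
grads_mean = opt.compute_gradients(loss_mean)      # list of (gradient, variable) pairs
grads_sum  = opt.compute_gradients(loss_sum)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.ones((M, 3), np.float32), y_: np.zeros((M, 1), np.float32)}
    (g_mean, v), = grads_mean                      # one pair, because W is the only variable
    (g_sum, _),  = grads_sum
    g_mean_val, g_sum_val = sess.run([g_mean, g_sum], feed)
    print(g_mean_val.shape)                        # (3, 1) -- same shape as W, no batch dimension
    print(np.allclose(g_sum_val, M * g_mean_val))  # True -- the reduce_sum gradient is M times the reduce_mean gradient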



On the other hand, apply_gradients just "applies the gradients"; the exact update rule depends on the optimizer. For GradientDescentOptimizer it will be

theta <- theta - learning_rate * gradients(theta)
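For example, a tiny numeric check of this rule (the scalar variable is made up for illustration):

import tensorflow as tf

theta = tf.Variable(5.0)
loss = tf.square(theta)                        # d loss / d theta = 2 * theta
opt = tf.train.GradientDescentOptimizer(0.1)
train_step = opt.apply_gradients(opt.compute_gradients(loss))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step)
    print(sess.run(theta))                     # 5.0 - 0.1 * (2 * 5.0) = 4.0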


For Adam, which you are using, the update rule is of course more complicated.

Note, however, that tf.gradients is more like "backprop" than a true gradient in the mathematical sense: it follows the dependencies in the computation graph and does not recognize dependencies in the "opposite" direction.
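For example (a small sketch), asking for a gradient "against the arrows" of the graph just gives None:

import tensorflow as tf

a = tf.placeholder(tf.float32)
b = 3.0 * a                          # b depends on a

print(tf.gradients(b, a))            # [<tf.Tensor ...>] -- db/da exists
print(tf.gradients(a, b))            # [None] -- a does not depend on b, so no gradient flows "backwards"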
