What does compute_gradients return in tensorflow

import tensorflow as tf

mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
gradients, variables = zip(*optimizer.compute_gradients(mean_sqr))
opt = optimizer.apply_gradients(list(zip(gradients, variables)))

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

for j in range(TRAINING_EPOCHS):
    sess.run(opt, feed_dict={x: batch_xs, y_: batch_xs})


I don't understand what compute_gradients returns. Does it return the sum of dy/dx over the x values supplied in batch_xs, which apply_gradients then uses to update the variables, e.g.:
theta <- theta - LEARNING_RATE * 1/m * gradients?

Or does it already return the average of the gradients over the batch, i.e. sum(dy/dx) * 1/m, where m is the batch_size?





1 answer


compute_gradients(a, b) returns d[sum a] / db. So in your case it returns d mean_sqr / d theta, where theta is the set of all variables. There is no "dx" in this equation: you do not compute gradients with respect to the inputs. So what happens to the batch dimension? You remove it yourself in the mean_sqr definition:

mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))


Thus (assuming y is 1D for simplicity):

d[ mean_sqr ] / d theta = d[ 1/M SUM_i=1^M (pred(x_i) - y_i)^2 ] / d theta
                        = 1/M SUM_i=1^M d[ (pred(x_i) - y_i)^2 ] / d theta


So you control whether the gradient is summed over the batch, averaged, or something else: if you defined mean_sqr using reduce_sum instead of reduce_mean, the gradients would be the sum over the batch, and so on.
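To make this concrete, here is a minimal sketch (the variable W, its shape, and the batch size M are made up for illustration): compute_gradients returns a list of (gradient, variable) pairs whose shapes match the variables, with no batch dimension left, and replacing reduce_mean with reduce_sum simply rescales the gradients by M.

import numpy as np
import tensorflow as tf

M = 4                                              # hypothetical batch size
x  = tf.placeholder(tf.float32, [None, 3])
y_ = tf.placeholder(tf.float32, [None, 1])
W  = tf.Variable(tf.ones([3, 1]))                  # the only trainable variable here
y  = tf.matmul(x, W)                               # pred(x)

loss_mean = tf.reduce_mean(tf.pow(y_ - y, 2))
loss_sum  = tf.reduce_sum(tf.pow(y_ - y, 2))

opt = tf.train.GradientDescentOptimizer(0.1)
grads_mean = opt.compute_gradients(loss_mean)      # list of (gradient, variable) pairs
grads_sum  = opt.compute_gradients(loss_sum)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.ones((M, 3), np.float32), y_: np.zeros((M, 1), np.float32)}
    (g_mean, v), = grads_mean                      # one pair, because W is the only variable
    (g_sum, _),  = grads_sum
    g_mean_val, g_sum_val = sess.run([g_mean, g_sum], feed)
    print(g_mean_val.shape)                        # (3, 1) -- same shape as W, no batch dimension
    print(np.allclose(g_sum_val, M * g_mean_val))  # True -- the reduce_sum gradient is M times the reduce_mean gradient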



On the other hand, apply_gradients just "applies the gradients"; the exact update rule depends on the optimizer. For GradientDescentOptimizer it will be

theta <- theta - learning_rate * gradients(theta)
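For example, a tiny numeric check of this rule (the scalar variable is made up for illustration):

import tensorflow as tf

theta = tf.Variable(5.0)
loss = tf.square(theta)                        # d loss / d theta = 2 * theta
opt = tf.train.GradientDescentOptimizer(0.1)
train_step = opt.apply_gradients(opt.compute_gradients(loss))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step)
    print(sess.run(theta))                     # 5.0 - 0.1 * (2 * 5.0) = 4.0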


For Adam, which you are using, the update rule is of course more complicated.

Note, however, that tf.gradients is more like "backprop" than a true gradient in the mathematical sense: it follows the dependencies in the computation graph and does not recognize dependencies in the "opposite" direction.
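For example (a small sketch), asking for a gradient "against the arrows" of the graph just gives None:

import tensorflow as tf

a = tf.placeholder(tf.float32)
b = 3.0 * a                          # b depends on a

print(tf.gradients(b, a))            # [<tf.Tensor ...>] -- db/da exists
print(tf.gradients(a, b))            # [None] -- a does not depend on b, so no gradient flows "backwards"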
