Where in the TensorFlow gradients code is the sum over the elements of y?
I am trying to hack tf.gradients in TensorFlow so that, for a rank-2 tensor y of shape (M, N) and a rank-2 tensor x of shape (Q, P), it returns a rank-4 gradient tensor of shape (M, N, Q, P), as one would expect. As pointed out in several questions on this site, what you actually get is a tensor of shape (Q, P), which is the gradient of the sum of the elements of y. What I can't figure out from looking at the TensorFlow code is where this sum over the elements of y is done. Does it happen at the beginning or at the end? Can anyone help me identify the lines of code where it happens?
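For example, the behavior I am describing can be reproduced with something like this (shapes are made up; TF1-style graph mode assumed, i.e. tf.compat.v1 in TF2):

```python
import numpy as np
import tensorflow as tf  # TF1-style graph mode assumed (tf.compat.v1 in TF2)

# Made-up shapes just to show the behavior: y is (M, N) = (3, 5), x is (Q, P) = (3, 4).
x = tf.constant(np.random.randn(3, 4), dtype=tf.float32)
w = tf.constant(np.random.randn(4, 5), dtype=tf.float32)
y = tf.matmul(x, w)

g1 = tf.gradients(y, x)[0]                  # shape (3, 4), i.e. the shape of x, not (3, 5, 3, 4)
g2 = tf.gradients(tf.reduce_sum(y), x)[0]   # gradient of the *sum* of the elements of y

with tf.Session() as sess:
    a, b = sess.run([g1, g2])
    print(a.shape)            # (3, 4)
    print(np.allclose(a, b))  # True -- tf.gradients(y, x) equals tf.gradients(sum(y), x)
```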
I answered this here, but I guess it's not very useful because you can't use that knowledge to differentiate with respect to a non-scalar y. The scalar-y assumption is central to the design of the reverse AD algorithm, and there is no single place you can change to support non-scalar y's. Since this confusion keeps coming up, let me elaborate a bit more on why it is non-trivial:
First of all, how does reverse AD work: suppose we have a function f that is the composition of component functions f_i. Each component function takes a vector of length n and outputs a vector of length n.
Its derivative can be expressed as a sequence of matrix multiplications: under differentiation, the composition of functions becomes the matrix product of the Jacobians of the corresponding component functions. The full expression is written out below.
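As a sketch, using hypothetical names f_1, ..., f_k for the component functions and J_{f_i} for their n x n Jacobians:

```latex
% The composed function (component names f_1, ..., f_k assumed for illustration):
f(x) = f_k(f_{k-1}(\cdots f_1(x) \cdots))

% Chain rule: the derivative is the product of the component Jacobians,
% each an n x n matrix evaluated at the output of the previous stage:
\frac{\partial f}{\partial x} = J_{f_k} \, J_{f_{k-1}} \cdots J_{f_1},
\qquad J_{f_i} \in \mathbb{R}^{n \times n}
```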
Note that this involves matrix/matrix products, which turns out to be too expensive for neural networks. For instance, AlexNet has 8k activations at its convnet -> fc transition layer, and doing matrix multiplications where each matrix is 8k x 8k would take too long. The trick that makes it efficient is to assume that the last function in the chain produces a scalar. Then its Jacobian is a vector, and the whole thing can be rewritten in terms of vector-matrix multiplications instead of matrix-matrix multiplications.
This product can be computed efficiently by multiplying from left to right, so every step is a vector times n x n matrix multiplication instead of an n x n matrix times n x n matrix multiplication.
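A sketch of this ordering (assuming, as above, that the final component produces a scalar so its Jacobian is a 1 x n row vector v):

```latex
% If f_k produces a scalar, its Jacobian J_{f_k} is a 1 x n row vector v,
% and grouping the product from the left keeps every intermediate a row vector:
v \, J_{f_{k-1}} \cdots J_{f_1}
  = \bigl( \cdots \bigl( (v \, J_{f_{k-1}}) \, J_{f_{k-2}} \bigr) \cdots \bigr) J_{f_1}
% Each step is a (1 x n)(n x n) vector-matrix product, O(n^2),
% instead of an (n x n)(n x n) matrix-matrix product, O(n^3).
```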
You can make it even more efficient by never forming those n x n derivative matrices in the first place, and instead associating each component function with an operator that implicitly performs the vector-Jacobian product. That is what TensorFlow does with tf.RegisterGradient: each op gets a "grad" function that plays the role of this operator for its component function.
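Here is a minimal sketch of what such a grad function looks like in TF1-style code. The registration name "MySquare" and the use of gradient_override_map are illustrative assumptions; TensorFlow's built-in ops register their gradients directly under their own op names.

```python
import tensorflow as tf  # TF1-style graph mode assumed (tf.compat.v1 in TF2)

# A "grad" function receives the op and the incoming gradient and returns the
# vector-Jacobian product for each input -- the n x n Jacobian is never built.
@tf.RegisterGradient("MySquare")          # "MySquare" is a made-up registration name
def _my_square_grad(op, grad):
    x = op.inputs[0]
    # Jacobian of elementwise y = x**2 is diag(2*x); the implicit
    # vector-Jacobian product grad @ diag(2*x) is simply grad * 2 * x.
    return grad * 2.0 * x

g = tf.Graph()
with g.as_default():
    x = tf.constant([1.0, 2.0, 3.0])
    # Route Square's gradient through the function registered above:
    with g.gradient_override_map({"Square": "MySquare"}):
        y = tf.square(x)
    dy_dx = tf.gradients(tf.reduce_sum(y), x)[0]   # -> [2., 4., 6.]
```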
Now, this works for vector-valued functions; what if your functions are matrix-valued? This is a typical situation in neural networks: in a layer that multiplies by a matrix, the quantity you are multiplying by is the unknown, and it is a matrix. In that case the last derivative is of rank 2 and the remaining derivatives are of rank 3.
Now, to apply the chain rule you have to deal with extra notation, because the "x" in the chain rule now stands for matrix multiplication generalized to rank-3 tensors.
Note, however, that we never need to do this multiplication explicitly, since we are using the grad operator. So in practice this operator takes rank-2 values and produces rank-2 values.
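To make that concrete, here is a sketch of what the grad operator for a matrix-multiplication layer computes (plain NumPy, illustrative shapes; it mirrors the standard MatMul gradient rather than quoting TensorFlow's implementation):

```python
import numpy as np

def matmul_grad(x, w, grad_y):
    """Backprop through y = x @ w without ever forming a rank-3/rank-4 Jacobian.

    x:      (batch, n_in)   layer input
    w:      (n_in, n_out)   weight matrix (the "unknown" we differentiate w.r.t.)
    grad_y: (batch, n_out)  incoming rank-2 gradient of the scalar loss w.r.t. y

    Returns rank-2 gradients w.r.t. x and w.
    """
    grad_x = grad_y @ w.T   # (batch, n_in)
    grad_w = x.T @ grad_y   # (n_in, n_out)
    return grad_x, grad_w

# Everything flowing in and out is rank-2, even though the full Jacobian
# dy/dw would be a rank-4 object of shape (batch, n_out, n_in, n_out).
x = np.random.randn(8, 16)
w = np.random.randn(16, 4)
grad_y = np.ones((8, 4))           # e.g. the gradient of sum(y)
gx, gw = matmul_grad(x, w, grad_y)
print(gx.shape, gw.shape)          # (8, 16) (16, 4)
```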
So in all these cases there is an assumption that the end goal is a scalar, which is what allows fully connected layers to be differentiated by passing matrices around.
If you want to extend this to support non-scalar y, you would need to modify the reverse AD algorithm to propagate more information. For instance, for fully connected feed-forward networks you would propagate rank-3 tensors instead of matrices.
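Short of modifying the algorithm, if you just need the (M, N, Q, P) gradient from the question, one workaround within the existing scalar-y machinery is to differentiate each element of y separately and stack the results. A sketch, assuming TF1-style graph mode and statically known shapes (the helper name full_jacobian is mine):

```python
import tensorflow as tf  # TF1-style graph mode assumed (tf.compat.v1 in TF2)

def full_jacobian(y, x):
    """Rank-4 gradient of a (M, N) tensor y w.r.t. a (Q, P) tensor x.

    Works around the scalar-y design by differentiating each element of y
    separately -- this builds M*N gradient subgraphs, so it is slow for large y.
    """
    m, n = y.shape.as_list()
    rows = []
    for i in range(m):
        cols = []
        for j in range(n):
            g = tf.gradients(y[i, j], x)[0]           # shape (Q, P), or None if independent
            cols.append(tf.zeros_like(x) if g is None else g)
        rows.append(tf.stack(cols))                   # shape (N, Q, P)
    return tf.stack(rows)                             # shape (M, N, Q, P)
```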