How backpropagation works in tensorflow
In tensorflow, it seems like the whole backpropagation algorithm is done in a single run of the optimizer with a specific cost function that is the output of some MLP or CNN.
I don't quite understand how tensorflow knows from cost that this is indeed the output of a certain NN? The cost function can be defined for any model. How do I "tell" that a particular cost function is derived from NN?
Question
How do I "tell" that a particular cost function is derived from NN?
(short) Answer
You do this simply by configuring an optimizer to minimize (or maximize) a tensor. For example, if I have a loss function such as
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
where y0 is the ground truth (or desired output) and y_out is the computed output, then I could minimize the loss by defining my training function like this
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
This tells TensorFlow that when train is computed, it should apply gradient descent to loss in order to minimize it. Since loss is computed from y0 and y_out, gradient descent also affects those (if they are trainable variables), and so on.
The variables y0, y_out, loss and train are not standard Python variables; they are descriptions of a computation graph. TensorFlow uses the information in that graph to unroll it while applying gradient descent.
Exactly how it does this is beyond the scope of this answer. Here and here are two good starting points for more information on the specifics.
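To make the idea concrete without TensorFlow, here is a minimal pure-Python sketch of what a minimize step amounts to: compute the gradient of the loss with respect to a trainable value, then move that value against the gradient. The names loss_fn, grad_fn and gradient_descent_step, the toy loss (w - 3)^2 and the learning rate 0.1 are all assumptions made for illustration. The gradient is written by hand here, whereas TensorFlow derives it from the graph.

```python
# Toy stand-in for optimizer.minimize(loss) with one trainable scalar w.
# Everything here is illustrative; none of it is TensorFlow API.

def loss_fn(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def grad_fn(w):
    # Hand-written derivative of loss_fn: d/dw (w - 3)^2 = 2 (w - 3).
    return 2.0 * (w - 3.0)

def gradient_descent_step(w, learning_rate):
    # Move w a small step against the gradient.
    return w - learning_rate * grad_fn(w)

w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, learning_rate=0.1)

print(w, loss_fn(w))  # w ends up very close to 3, the loss very close to 0
```

Running the train node once plays the role of one call to gradient_descent_step, except that it updates every trainable variable the loss depends on.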
Sample code
Let's walk through a code sample. First, the code.
### imports
import tensorflow as tf
### constant data
x = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]
### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant( x , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b1 = tf.Variable( tf.random_uniform( [3] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )
# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
b2 = tf.Variable( tf.random_uniform( [1] , minval=0.1 , maxval=0.9 , dtype=tf.float32 ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    for step in range(500) :
        sess.run(train)

    # fetch the trained values while the session is still open
    results = sess.run([m1,b1,m2,b2,y_out,loss])
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label,result in zip(labels,results) :
        print("")
        print(label)
        print(result)

print("")
Let's go through it, but in reverse order, starting with
sess.run(train)
This tells TensorFlow to look up the graph node defined by train and compute it. train is defined as
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
To compute this, TensorFlow must calculate the automatic differentiation for loss, which means walking the graph. loss is defined as
loss = tf.reduce_sum( tf.square( y0 - y_out ) )
What TensorFlow really does is apply automatic differentiation to unwrap first tf.reduce_sum, then tf.square, then y0 - y_out, which in turn requires walking the graph for both y0 and y_out.
y0 = tf.constant( y_ , dtype=tf.float32 )
y0 is a constant and will not be updated.
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )
y_out is processed in the same way as loss: first tf.sigmoid is processed, and so on.
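The reverse walk described above can be sketched by hand in plain Python for the last layer only. This is an illustration under stated assumptions, not what TensorFlow executes internally: the list a stands in for the pre-activation tf.matmul( h1,m2 ) + b2, its values are made up, and the helper names forward and backward are hypothetical. The hand-derived gradient is compared against a finite-difference estimate.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(a, y0):
    # Same chain as in the answer: sigmoid -> subtract -> square -> sum.
    y_out = [sigmoid(v) for v in a]
    diff = [t - o for t, o in zip(y0, y_out)]
    loss = sum(d * d for d in diff)
    return y_out, diff, loss

def backward(a, y0):
    # Walk the same chain in reverse, multiplying local derivatives.
    y_out, diff, _ = forward(a, y0)
    d_sq = [1.0 for _ in diff]                           # reduce_sum: each term gets 1
    d_diff = [g * 2.0 * d for g, d in zip(d_sq, diff)]   # square: 2 * d
    d_y_out = [-g for g in d_diff]                       # y0 - y_out: -1 w.r.t. y_out
    # y0 is a constant, so no gradient is propagated to it.
    return [g * o * (1.0 - o) for g, o in zip(d_y_out, y_out)]  # sigmoid: s * (1 - s)

a = [0.3, -1.2, 0.8, 2.0]   # made-up pre-activations
y0 = [0.0, 0.0, 1.0, 1.0]   # targets, shaped like y_ in the answer
analytic = backward(a, y0)

# Finite-difference check of d(loss)/d(a[i]).
eps = 1e-6
numeric = []
for i in range(len(a)):
    up = list(a)
    up[i] += eps
    down = list(a)
    down[i] -= eps
    numeric.append((forward(up, y0)[2] - forward(down, y0)[2]) / (2 * eps))

print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places, which is a quick sanity check that the reverse walk applies the local derivatives in the right order.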
In general, each operation (e.g. tf.sigmoid, tf.square) defines not only the forward operation (apply sigmoid or square) but also the information needed for automatic differentiation. This differs from standard Python math such as
x = 7 + 9
The above equation encodes nothing except how to update x, whereas
z = y0 - y_out
encodes the graph of subtracting y_out from y0, and stores both the forward operation and enough information to do automatic differentiation on z.
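The difference between a plain Python value and a graph node can be illustrated with a very small reverse-mode sketch. The Node class and backward function below are hypothetical teaching helpers, not TensorFlow classes, and the input values 1.0 and 0.25 are made up. Each node keeps its forward value together with the local derivative for each of its inputs.

```python
class Node:
    """A forward value plus what is needed to differentiate through it."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent_node, local_derivative)
        self.grad = 0.0

    def __sub__(self, other):
        # Forward: a - b. Local derivatives: +1 for a, -1 for b.
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def square(self):
        # Forward: a^2. Local derivative: 2 * a.
        return Node(self.value ** 2, ((self, 2.0 * self.value),))

def backward(output):
    # Order nodes so that every node is handled after all nodes that use it.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule

x = 7 + 9                 # plain Python: only the result 16 is kept

y0 = Node(1.0)
y_out = Node(0.25)
z = y0 - y_out            # records the subtraction and its derivatives
loss = z.square()
backward(loss)

print(x)                        # 16
print(z.value, loss.value)      # 0.75 0.5625
print(y0.grad, y_out.grad)      # 1.5 -1.5
```

Here x is just the integer 16, while z remembers that it came from a subtraction, which is what lets backward push gradients to y0 and y_out.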
Backpropagation was described by Rumelhart, Hinton et al. in a paper published in Nature in 1986.
As stated in Section 6.5, Back-Propagation and Other Differentiation Algorithms, of the Deep Learning book, there are two approaches to back-propagating gradients through computational graphs: symbol-to-number differentiation and symbol-to-symbol derivatives. The one more relevant to TensorFlow, as stated in the paper A Tour of TensorFlow, is the latter, which can be illustrated with this diagram:
Source: Section II, Part D of A Tour of TensorFlow
On the left side of Figure 7 above, w represents the weights (or variables) in Tensorflow, and x and y are two intermediate operations (or nodes, w, x, y and z are all operations) to get the scalar loss z.
For each node in the graph, TensorFlow adds one more node for its gradient, as can be seen in diagram (b) on the right: dz / dy, dy / dx, dx / dw. (If we print the variable names in a checkpoint, we can see some additional variables for such nodes; they are eliminated if we freeze the model into a protocol buffer file for deployment.)
During the backpropagation traversal, at each node we multiply its gradient by that of the previous node, and finally obtain a symbolic handle to the overall target derivative dz / dw = dz / dy * dy / dx * dx / dw, which is exactly an application of the chain rule. Once the gradient has been worked out, w can be updated using a learning rate.
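As a worked instance of that product, here is a small pure-Python sketch. The concrete functions x = 3w, y = x^2 and z = sin(y), the starting value w = 0.5 and the learning rate 0.01 are assumptions chosen only to make the chain rule checkable; the figure itself does not specify them.

```python
import math

def z_of(w):
    # Forward pass through the chain w -> x -> y -> z.
    x = 3.0 * w
    y = x ** 2
    return math.sin(y)

w = 0.5
x = 3.0 * w
y = x ** 2

# Local derivatives, one per edge in the chain.
dx_dw = 3.0
dy_dx = 2.0 * x
dz_dy = math.cos(y)

# Chain rule: dz/dw = dz/dy * dy/dx * dx/dw.
dz_dw = dz_dy * dy_dx * dx_dw

# Finite-difference estimate for comparison.
eps = 1e-6
numeric = (z_of(w + eps) - z_of(w - eps)) / (2 * eps)

# One gradient descent update of w.
learning_rate = 0.01
w_new = w - learning_rate * dz_dw

print(dz_dw, numeric)
print(z_of(w), z_of(w_new))   # z should be smaller after the update
```

The product of the three local derivatives should match the finite-difference estimate closely, and the single update step should lower z for this choice of functions and learning rate.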
For more details, please read this article: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems