How backpropagation works in TensorFlow

In TensorFlow, it seems like the whole backpropagation algorithm is performed by a single run of the optimizer on a specific cost function that is computed from the output of some MLP or CNN.

I don't quite understand how TensorFlow knows from the cost alone that it is indeed the output of a certain NN. A cost function can be defined for any model. How do I "tell" TensorFlow that a particular cost function is derived from a NN?

+10




2 answers


Question

How do I "tell" TensorFlow that a particular cost function is derived from a NN?

(short) Answer

This is done simply by telling the optimizer which tensor to minimize (or maximize). For example, if I have a loss function such as

loss = tf.reduce_sum( tf.square( y0 - y_out ) )


where y0 is the ground truth (or desired result) and y_out is the computed output, then I could minimize the loss by defining my training step like this

train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)


This tells TensorFlow that when train is computed, gradient descent should be applied to loss to minimize it. loss is computed from y0 and y_out, so gradient descent also reaches into those (if they are trainable variables), and so on down the graph.

The variables y0, y_out, loss and train are not standard Python variables but descriptions of a computation graph. TensorFlow uses the information in that graph to unroll it when applying gradient descent.
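
A quick way to see this (a sketch assuming the TF1.x API above; the exact names depend on how the graph was built): printing these objects shows graph nodes, not numbers. A numeric value only appears once a tf.Session actually runs the node.

print(loss)    # e.g. Tensor("Sum:0", shape=(), dtype=float32) -- no numeric value yet
print(train)   # a graph Operation (named something like "GradientDescent"), not a value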

Exactly how this happens internally is beyond the scope of this answer. Here and here are two good starting points for more information on the specifics.
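
That said, a rough outline of what minimize() wraps is useful: it is the combination of the optimizer's two public steps, compute_gradients() and apply_gradients() (TF1.x API; shown here only as a sketch, not TensorFlow's internal code).

opt = tf.train.GradientDescentOptimizer(1.0)

# 1. Add gradient nodes for loss with respect to every trainable variable
#    (compute_gradients defaults to var_list=tf.trainable_variables()).
grads_and_vars = opt.compute_gradients(loss)

# 2. Add the update ops: var <- var - learning_rate * gradient.
train = opt.apply_gradients(grads_and_vars)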

Sample code

Let's step through the sample code. The code first.

### imports
import tensorflow as tf

### constant data
x  = [[0.,0.],[1.,1.],[1.,0.],[0.,1.]]
y_ = [[0.],[0.],[1.],[1.]]

### induction
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output

# Layer 0 = the x2 inputs
x0 = tf.constant( x  , dtype=tf.float32 )
y0 = tf.constant( y_ , dtype=tf.float32 )

# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable( tf.random_uniform( [2,3] , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
b1 = tf.Variable( tf.random_uniform( [3]   , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
h1 = tf.sigmoid( tf.matmul( x0,m1 ) + b1 )

# Layer 2 = the 3x1 sigmoid output
m2 = tf.Variable( tf.random_uniform( [3,1] , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
b2 = tf.Variable( tf.random_uniform( [1]   , minval=0.1 , maxval=0.9 , dtype=tf.float32  ))
y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )


### loss
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum( tf.square( y0 - y_out ) )

# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)


### training
# run 500 times using all the X and Y
# print out the loss and any other interesting info
with tf.Session() as sess:
  sess.run( tf.global_variables_initializer() )
  for step in range(500) :
    sess.run(train)

  results = sess.run([m1,b1,m2,b2,y_out,loss])
  labels  = "m1,b1,m2,b2,y_out,loss".split(",")
  for label,result in zip(labels,results) :
    print("")
    print(label)
    print(result)

print("")
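
Purely as an illustration of what that train node expands to, here is a hand-rolled equivalent built with tf.gradients (a sketch of the same update rule with learning rate 1.0, not the exact graph the optimizer generates):

# Build the gradient nodes and the update ops ourselves.
params  = [m1, b1, m2, b2]
grads   = tf.gradients(loss, params)                              # symbolic d(loss)/d(param)
updates = [p.assign_sub(1.0 * g) for p, g in zip(params, grads)]  # var <- var - 1.0 * grad
train_by_hand = tf.group(*updates)                                # run all updates together

Running train_by_hand inside the session performs the same kind of gradient-descent step as train.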

Now let's walk through it, but in reverse order, starting with

sess.run(train)




This tells TensorFlow to find the graph node defined by train and compute it. train is defined as

train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)


To compute this tensor, TensorFlow needs the automatic differentiation of loss, which means walking the graph. loss is defined as

loss = tf.reduce_sum( tf.square( y0 - y_out ) )


To compute that tensor, automatic differentiation unwraps first tf.reduce_sum, then tf.square, then y0 - y_out, which in turn leads to traversing the graph for both y0 and y_out.

y0 = tf.constant( y_ , dtype=tf.float32 )


y0 is constant and will not be updated.
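
A small check makes this visible (TF1.x; the names depend on creation order): y0 comes from a Const op and never appears in the trainable variable list, so gradient descent has nothing to update there.

print(y0.op.type)                                  # 'Const'
print([v.name for v in tf.trainable_variables()])  # only the weights and biases, e.g. ['Variable:0', 'Variable_1:0', ...]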

y_out = tf.sigmoid( tf.matmul( h1,m2 ) + b2 )


y_out is processed the same way loss was: first tf.sigmoid is unwrapped, and so on down through tf.matmul, m2, b2 and h1.
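
You can watch this traversal structure directly: every tensor records the op that produced it and that op's inputs, which is exactly what TensorFlow walks backwards (a sketch, TF1.x; the printed names will vary).

print(y_out.op.type)                       # 'Sigmoid'
print([t.name for t in y_out.op.inputs])   # the add of tf.matmul(h1, m2) and b2, e.g. ['add_1:0']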

In general, each operation (e.g. tf.sigmoid, tf.square) defines not only the forward operation (apply a sigmoid or a square), but also the information needed for automatic differentiation. This is different from standard Python math such as

x = 7 + 9


The above line does not encode anything except the updated value of x, whereas

z = y0 - y_out


encodes the graph operation of subtracting y_out from y0, and z preserves both the forward operation and enough information to differentiate it automatically.
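
One way to see that z really carries both pieces of information (a sketch, TF1.x): its op records the forward subtraction, and tf.gradients can use the gradient registered for that op to add a derivative node to the graph.

z = y0 - y_out
print(z.op.type)                      # 'Sub' -- the forward operation
dz_dyout = tf.gradients(z, y_out)[0]  # a new graph node: d(sum(z))/d(y_out), which is -1 everywhere here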

+22




Backpropagation was introduced by Rumelhart, Hinton et al. and published in Nature in 1986.

As indicated in Section 6.5, "Back-Propagation and Other Differentiation Algorithms", of the Deep Learning book, there are two approaches to backpropagating gradients through a computational graph: symbol-to-number differentiation and symbol-to-symbol derivatives. The one relevant to TensorFlow is the latter, as pointed out in the paper "A Tour of TensorFlow", and it can be illustrated with this diagram:

[Figure 7: a computation graph w -> x -> y -> z (left) and the same graph with symbolic gradient nodes dz/dy, dy/dx, dx/dw added (right)]

Source: Section II, Part D of "A Tour of TensorFlow"



On the left side of Figure 7 above, w represents the weights (or variables) in TensorFlow, and x and y are two intermediate operations (or nodes; w, x, y and z are all nodes) on the way to the scalar loss z.

For each node in the graph, TensorFlow adds a gradient node (if we print the tensor names at a breakpoint we can see extra nodes for these gradients, and they are stripped out if we freeze the model into a protocol buffer file for deployment), as shown in diagram (b) on the right: dz/dy, dy/dx, dx/dw.

During backpropagation, at each node we multiply its local gradient by the gradient flowing in from the node above it, and in the end we obtain a symbolic handle to the overall target derivative dz/dw = dz/dy * dy/dx * dx/dw, which is exactly the chain rule. Once the gradient is available, the weights can be updated using a learning rate.
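
A minimal sketch of that symbol-to-symbol idea in TF1.x, with a toy chain w -> x -> y -> z standing in for Figure 7 (the square and sine ops are just placeholders chosen for illustration): tf.gradients adds the dz/dy, dy/dx, dx/dw nodes to the graph and returns a symbolic handle to dz/dw; the number only appears when a session runs it, and the weight update is then a single assign at a chosen learning rate.

import tensorflow as tf

w = tf.Variable(2.0)                # weight
x = tf.square(w)                    # x = w^2     -> dx/dw = 2w
y = tf.sin(x)                       # y = sin(x)  -> dy/dx = cos(x)
z = tf.reduce_sum(y)                # scalar loss -> dz/dy = 1

dz_dw  = tf.gradients(z, w)[0]      # symbolic dz/dw = dz/dy * dy/dx * dx/dw
update = w.assign_sub(0.1 * dz_dw)  # one gradient-descent step, learning rate 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dz_dw))          # evaluates cos(w^2) * 2w at w = 2.0
    sess.run(update)                # w moves against the gradient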

For more details, please read this article: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

0








