XOR neural network backprop

I am trying to implement a basic XOR NN with 1 hidden layer in Python. I don't fully understand the details of the backprop algorithm, so I got stuck computing delta2 and updating the weights ... help?

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

theta1 = np.matrix(np.random.rand(3,3))
theta2 = np.matrix(np.random.rand(3,1))

def fit(x, y, theta1, theta2, learn_rate=.001):
    #forward pass
    layer1 = np.matrix(x, dtype='f')
    layer1 = np.c_[np.ones(1), layer1]
    layer2 = vec_sigmoid(layer1*theta1)
    layer3 = sigmoid(layer2*theta2)

    #backprop
    delta3 = y - layer3
    delta2 = (theta2*delta3) * np.multiply(layer2, 1 - layer2) #??

    #update weights
    theta2 += learn_rate * delta3 #??
    theta1 += learn_rate * delta2 #??

def train(X, Y):
    for _ in range(10000):
        for i in range(4):
            x = X[i]
            y = Y[i]
            fit(x, y, theta1, theta2)


X = [(0,0), (1,0), (0,1), (1,1)]
Y = [0, 1, 1, 0]    
train(X, Y)

      

1 answer


So first, here is the corrected code to get you up and running.

#! /usr/bin/python

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vec_sigmoid = np.vectorize(sigmoid)

# Binesh - just cleaning this up so you can easily change the number of hidden units.
# Also, initializing with a heuristic from Yoshua Bengio.
# In many places you were using matrix multiplication and elementwise multiplication
# interchangeably... You can't do that. (So I explicitly changed everything to
# dot products and multiplies to make it clear.)
input_sz = 2
hidden_sz = 3
output_sz = 1
theta1 = np.matrix(0.5 * np.sqrt(6.0 / (input_sz+hidden_sz)) * (np.random.rand(1+input_sz,hidden_sz)-0.5))
theta2 = np.matrix(0.5 * np.sqrt(6.0 / (hidden_sz+output_sz)) * (np.random.rand(1+hidden_sz,output_sz)-0.5))

def fit(x, y, theta1, theta2, learn_rate=.1):
    #forward pass
    layer1 = np.matrix(x, dtype='f')
    layer1 = np.c_[np.ones(1), layer1]
    # Binesh - for layer2 we need to add a bias term.
    layer2 = np.c_[np.ones(1), vec_sigmoid(layer1.dot(theta1))]
    layer3 = sigmoid(layer2.dot(theta2))

    #backprop
    delta3 = y - layer3
    # Binesh - In reality, this is the _negative_ derivative of the cross entropy function
    # wrt the _input_ to the final sigmoid function.

    delta2 = np.multiply(delta3.dot(theta2.T), np.multiply(layer2, (1-layer2)))
    # Binesh - We actually don't use the delta for the bias term. (What would be the
    # point? It has no inputs.) Hence the line below.
    delta2 = delta2[:,1:]

    # But the deltas are just derivatives wrt the inputs to the sigmoids.
    # We don't add those to the thetas directly. We have to multiply them by
    # the preceding layer to get theta2d and theta1d:
    theta2d = np.dot(layer2.T, delta3)
    theta1d = np.dot(layer1.T, delta2)

    #update weights
    # Binesh - here you had delta3 and delta2... Those are not the
    # derivatives wrt the thetas, they are the derivatives wrt
    # the inputs to the sigmoids (as I mention above).
    theta2 += learn_rate * theta2d #??
    theta1 += learn_rate * theta1d #??

def train(X, Y):
    for _ in range(10000):
        for i in range(4):
            x = X[i]
            y = Y[i]
            fit(x, y, theta1, theta2)


# Binesh - Here's a little test function to see that it actually works
def test(X):
    for i in range(4):
        layer1 = np.matrix(X[i],dtype='f')
        layer1 = np.c_[np.ones(1), layer1]
        layer2 = np.c_[np.ones(1), vec_sigmoid(layer1.dot(theta1))]
        layer3 = sigmoid(layer2.dot(theta2))
        print "%d xor %d = %.7f" % (layer1[0,1], layer1[0,2], layer3[0,0])

X = [(0,0), (1,0), (0,1), (1,1)]
Y = [0, 1, 1, 0]    
train(X, Y)

# Binesh - Alright, let's see!
test(X)

      

And now a little explanation. Sorry for the rough drawing; it was just easier to take a snapshot than to draw something in GIMP.

[Figure: visual of WBC's XOR neural network]
(source: binesh at cablemodem.hex21.com)

So. First, we have an error function. We'll call it CE (for cross entropy). I'll try to use your variable names where possible, although I'm going to write L1, L2 and L3 instead of layer1, layer2 and layer3. (Sigh, I don't know how to write LaTeX here; it seems to work on Stats Stack Exchange, which is odd.)

CE = -(Y log(L3) + (1-Y) log(1-L3))

      

We need to take the derivative of this with respect to L3, so that we can see how to move L3 to decrease this value.

dCE/dL3 = -((Y/L3) - (1-Y)/(1-L3))
        = -((Y(1-L3) - (1-Y)L3) / (L3(1-L3)))
        = -(((Y-Y*L3) - (L3-Y*L3)) / (L3(1-L3)))
        = -((Y-Y*L3 + Y*L3 - L3) / (L3(1-L3)))
        = -((Y-L3) / (L3(1-L3)))
        = ((L3-Y) / (L3(1-L3)))
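
If you want to double-check that last line, here is a throwaway finite-difference check (arbitrary example values of my own, not part of the corrected code above):

import numpy as np

def CE(y, l3):
    # CE = -(y*log(l3) + (1-y)*log(1-l3)) for a single scalar prediction
    return -(y * np.log(l3) + (1 - y) * np.log(1 - l3))

y, l3, eps = 1.0, 0.7, 1e-6
numeric  = (CE(y, l3 + eps) - CE(y, l3 - eps)) / (2 * eps)
analytic = (l3 - y) / (l3 * (1 - l3))
print(numeric, analytic)   # both about -1.42857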

      

Great, but really, we can't just change L3 as we see fit. L3 is a function of Z3 (see my picture).

L3      = sigmoid(Z3)
dL3/dZ3 = L3(1-L3)

      

I won't derive it here (the standard sigmoid derivative), but it really isn't that hard to prove.
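
If you'd rather just see it hold numerically, here's a tiny sanity check of my own (arbitrary z value):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.3, 1e-6
numeric  = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)   # both about 0.24446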

In any case, that is the derivative of L3 with respect to Z3, and what we want is the derivative of CE with respect to Z3.

dCE/dZ3 = (dCE/dL3) * (dL3/dZ3)
        = ((L3-Y)/(L3(1-L3))) * (L3(1-L3)) # Hey, look at that. The denominator cancels out, and
        = (L3-Y) # This is why in my comments I was saying what you are computing is the _negative_ derivative.

      

We call the Z derivatives "deltas". So, in your code, this matches delta3.
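
Again, just to convince yourself, here is a quick throwaway numeric check of dCE/dZ3 (my own snippet; note that your delta3 = y - layer3 is the negative of this quantity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def CE_of_z3(y, z3):
    # cross entropy as a function of the input to the final sigmoid
    l3 = sigmoid(z3)
    return -(y * np.log(l3) + (1 - y) * np.log(1 - l3))

y, z3, eps = 0.0, 0.4, 1e-6
numeric  = (CE_of_z3(y, z3 + eps) - CE_of_z3(y, z3 - eps)) / (2 * eps)
analytic = sigmoid(z3) - y   # i.e. L3 - Y
print(numeric, analytic)     # both about 0.59869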

Great, but we can't just change Z3 as we like either. We need to see how it depends on L2, which means taking the derivative with respect to L2.

But this is more difficult.

Z3 = theta2(0) + theta2(1) * L2(1) + theta2(2) * L2(2) + theta2(3) * L2(3)

      

So, we need to take the partial derivatives with respect to L2(1), L2(2) and L2(3):

dZ3/dL2(1) = theta2(1)
dZ3/dL2(2) = theta2(2)
dZ3/dL2(3) = theta2(3)

      



Note that for the bias term we would likewise have

dZ3/dBias  = theta2(0)

      

but the bias input never changes, it is always 1, so we can safely ignore its delta. Our layer 2 still includes that bias entry, though, so we'll keep it around for now.

But, again, we want the derivative with respect to Z2(0), Z2(1) and Z2(2). (It looks like I labeled this part badly in the drawing, unfortunately; look at the diagram and I think it will be clearer.)

dL2(1)/dZ2(0) = L2(1) * (1-L2(1))
dL2(2)/dZ2(1) = L2(2) * (1-L2(2))
dL2(3)/dZ2(2) = L2(3) * (1-L2(3))

      

So what is dCE/dZ2(0..2) now?

dCE/dZ2(0) = dCE/dZ3 * dZ3/dL2(1) * dL2(1)/dZ2(0)
           = (L3-Y)  * theta2(1)  * L2(1) * (1-L2(1))

dCE/dZ2(1) = dCE/dZ3 * dZ3/dL2(2) * dL2(2)/dZ2(1)
           = (L3-Y)  * theta2(2)  * L2(2) * (1-L2(2))

dCE/dZ2(2) = dCE/dZ3 * dZ3/dL2(3) * dL2(3)/dZ2(2)
           = (L3-Y)  * theta2(3)  * L2(3) * (1-L2(3))

      

But we can actually express all of that as (delta3 * transpose(theta2)), elementwise-multiplied by (L2 * (1-L2)) (where L2 is the vector of layer-2 activations).

This is our delta2. I delete its first entry because, as I mentioned above, that entry corresponds to the bias delta (which I denote L2(0) in my diagram).
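
In numpy terms that is exactly the delta2 line in the corrected fit() above. Here is a small standalone check; the layer2, theta2 and delta3 values below are just invented numbers to show that the vectorized form and the three hand-written products agree:

import numpy as np

layer2 = np.matrix([[1.0, 0.6, 0.3, 0.8]])         # bias + 3 hidden activations (made up)
theta2 = np.matrix([[0.1], [0.4], [-0.2], [0.5]])  # 4x1 weights, row 0 is the bias weight
delta3 = np.matrix([[0.25]])                       # L3 - Y, also made up

# vectorized, as in the corrected code
delta2 = np.multiply(delta3.dot(theta2.T), np.multiply(layer2, 1 - layer2))
delta2 = delta2[:, 1:]                             # drop the bias column

# the same thing written out per hidden unit, straight from the chain rule above
by_hand = [float(delta3[0, 0] * theta2[k + 1, 0] * layer2[0, k + 1] * (1 - layer2[0, k + 1]))
           for k in range(3)]

print(delta2)    # e.g. [[ 0.024  -0.0105  0.02 ]]
print(by_hand)   # the same three numbers, up to float rounding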

So. We now have our Z derivatives, but what we can actually change are the thetas.

Z3 = theta2(0) + theta2(1) * L2(1) + theta2(2) * L2(2) + theta2(3) * L2(3)
dZ3/dtheta2(0) = 1
dZ3/dtheta2(1) = L2(1)
dZ3/dtheta2(2) = L2(2)
dZ3/dtheta2(3) = L2(3)

      

Once again, though, we want dCE/dtheta2(0..3), so this becomes:

dCE/dtheta2(0) = dCE/dZ3 * dZ3/dtheta2(0)
               = (L3-Y) * 1
dCE/dtheta2(1) = dCE/dZ3 * dZ3/dtheta2(1)
               = (L3-Y) * L2(1)
dCE/dtheta2(2) = dCE/dZ3 * dZ3/dtheta2(2)
               = (L3-Y) * L2(2)
dCE/dtheta2(3) = dCE/dZ3 * dZ3/dtheta2(3)
               = (L3-Y) * L2(3)

      

And that is just np.dot(layer2.T, delta3), which is what I have in theta2d.
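
For instance, with the same made-up layer2 and delta3 numbers as in the earlier snippet (purely illustrative):

import numpy as np

layer2 = np.matrix([[1.0, 0.6, 0.3, 0.8]])   # bias + hidden activations (made up)
delta3 = np.matrix([[0.25]])                 # L3 - Y (made up)

theta2d = np.dot(layer2.T, delta3)           # one gradient entry per theta2 weight
print(theta2d.T)                             # i.e. (L3-Y)*1, (L3-Y)*L2(1), (L3-Y)*L2(2), (L3-Y)*L2(3)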

And, similarly:

Z2(0) = theta1(0,0) + theta1(1,0) * L1(1) + theta1(2,0) * L1(2)
dZ2(0)/dtheta1(0,0) = 1
dZ2(0)/dtheta1(1,0) = L1(1)
dZ2(0)/dtheta1(2,0) = L1(2)

Z2(1) = theta1(0,1) + theta1(1,1) * L1(1) + theta1(2,1) * L1(2)
dZ2(1)/dtheta1(0,1) = 1
dZ2(1)/dtheta1(1,1) = L1(1)
dZ2(1)/dtheta1(2,1) = L1(2)

Z2(2) = theta1(0,2) + theta1(1,2) * L1(1) + theta1(2,2) * L1(2)
dZ2(2)/dtheta1(0,2) = 1
dZ2(2)/dtheta1(1,2) = L1(1)
dZ2(2)/dtheta1(2,2) = L1(2)

      

And we have to multiply each of those three groups by dCE/dZ2(0), dCE/dZ2(1) and dCE/dZ2(2) respectively. But if you think about it, that is just np.dot(layer1.T, delta2), which is what I have in theta1d.
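
If you want one overall sanity check of both theta2d and theta1d, here is a standalone gradient check I'd run (helper names forward and CE are mine, random weights, a single training example; note it uses delta3 = layer3 - y, the positive derivative, so it compares directly against dCE/dtheta):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, theta1, theta2):
    # same forward pass as the corrected fit(), returning all three layers
    layer1 = np.c_[np.ones(1), np.matrix(x, dtype='f')]
    layer2 = np.c_[np.ones(1), sigmoid(layer1.dot(theta1))]
    layer3 = sigmoid(layer2.dot(theta2))
    return layer1, layer2, layer3

def CE(y, l3):
    l3 = l3[0, 0]                       # l3 arrives as a 1x1 matrix
    return -(y * np.log(l3) + (1 - y) * np.log(1 - l3))

np.random.seed(0)
theta1 = np.matrix(np.random.rand(3, 3) - 0.5)
theta2 = np.matrix(np.random.rand(4, 1) - 0.5)
x, y = (1, 0), 1

# analytic gradients, exactly as in the corrected fit() (but with the sign flipped on delta3)
layer1, layer2, layer3 = forward(x, theta1, theta2)
delta3 = layer3 - y
delta2 = np.multiply(delta3.dot(theta2.T), np.multiply(layer2, 1 - layer2))[:, 1:]
theta2d = np.dot(layer2.T, delta3)
theta1d = np.dot(layer1.T, delta2)

# numeric gradient for one arbitrary entry of each theta
eps = 1e-5
t2p = theta2.copy(); t2p[1, 0] += eps
t2m = theta2.copy(); t2m[1, 0] -= eps
num2 = (CE(y, forward(x, theta1, t2p)[2]) - CE(y, forward(x, theta1, t2m)[2])) / (2 * eps)
print(num2, theta2d[1, 0])   # these should agree to several decimals

t1p = theta1.copy(); t1p[1, 0] += eps
t1m = theta1.copy(); t1m[1, 0] -= eps
num1 = (CE(y, forward(x, t1p, theta2)[2]) - CE(y, forward(x, t1m, theta2)[2])) / (2 * eps)
print(num1, theta1d[1, 0])   # these should agree to several decimals

If the printed pairs match to several decimal places, theta1d and theta2d really are the gradients of CE with respect to the weights.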

Now, because you computed Y - L3 in your code, you add to theta1 and theta2. Here's the reasoning: what we have just calculated is the derivative of CE with respect to the weights, so pushing a weight in the direction of its derivative would increase CE. We really want to reduce CE, so normally we subtract. But since your code computes the negative derivative, adding is correct.
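
As a tiny toy illustration of that sign business (made-up numbers, one scalar weight):

learn_rate = 0.1
grad = 0.25                                           # pretend dCE/dtheta for one weight
theta = 0.8

theta_after_descent  = theta - learn_rate * grad      # the usual form: subtract the gradient
theta_after_negdelta = theta + learn_rate * (-grad)   # what your code effectively does
print(theta_after_descent, theta_after_negdelta)      # both 0.775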

Does this make sense?
