
Understanding Backpropagation

So recently I read this tweet and my honest response was: "Of course I can't". But I'm willing to try and learn. So here's a blog post about learning how to implement the backpropagation algorithm with NumPy. To be fair, I'm not an ML person; I'm just studying, and this is a nice exercise.

Given the title, I want to learn how to implement the backpropagation algorithm in NumPy (the original tweet also says "with no internet", but I'll need it). First of all, to brush up my knowledge of the backpropagation algorithm, I watched this video from Artem Kirsanov and this one from 3Blue1Brown. Then, to get a better grip on the coding part, I watched Karpathy's autograd video lecture.

I think it's time to code, because if I can't code it, I didn't get it.

Let's start with the basics. My goal here is to be extremely clear and try to explain (to myself) the backprop algorithm. First things first, let's import NumPy, the Python library we'll use for the linear algebra.

import numpy as np

Let's define an activation function, which we will need later. Activation functions are the non-linear part of a NN. I chose the sigmoid arbitrarily (I could have chosen a ReLU; for learning purposes it makes no difference).

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Next, I want a simple function that calculates the derivative of the sigmoid (had to double-check my math but I think it's right). One thing to keep in mind: as written, it expects the value the sigmoid has already produced, so if a = sigmoid(z), then passing a returns a * (1 - a), which is exactly the slope of the sigmoid at z:

def sigmoid_derivative(x):
    return x * (1 - x)
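
As a quick sanity check (my own addition, not part of the network code), we can compare this against a numerical estimate of the slope; the two values should agree as long as we remember to pass in the sigmoid's output:

a = sigmoid(0.5)
numerical = (sigmoid(0.5 + 1e-6) - sigmoid(0.5 - 1e-6)) / 2e-6
print(sigmoid_derivative(a), numerical)  # both roughly 0.2350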

Then, I'll make a class called SimpleNeuralNetwork; it will encapsulate all the functions needed to have a functioning NN. I want a network with 3 input neurons, 4 neurons in a single hidden layer, and 1 output neuron. I will show an example of this network learning the XOR function (remember that NNs are function approximators; you can basically teach them any function you can think of).

For the first function in our class let's define the constructor. Here it is:

    def __init__(self, input_size, hidden_size, output_size):
        np.random.seed(42)
        self.w1 = np.random.randn(input_size, hidden_size)
        self.w2 = np.random.randn(hidden_size, output_size)

The init method sets up the network architecture with the specified input, hidden, and output layer sizes. I set the random seed for reproducibility so you can get the same results as me (I hope). Then the weights are initialized randomly. The connections between the input layer and the hidden layer are represented by the weight matrix self.w1, and the connections between the hidden layer and the output layer are represented by the weight matrix self.w2. (For simplicity, this network has no bias terms.)
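
Just to make the shapes concrete, once the class is fully assembled (the complete code is at the end), the (3, 4, 1) sizes used later give:

nn = SimpleNeuralNetwork(3, 4, 1)
print(nn.w1.shape)  # (3, 4): input -> hidden
print(nn.w2.shape)  # (4, 1): hidden -> output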

Now, we move on to the first important step, which is called the forward pass:

    def forward(self, X):
        self.z1 = np.dot(X, self.w1)
        self.a1 = sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.w2)
        self.a2 = sigmoid(self.z2)
        return self.a2

The first multiplication is done between X (the input matrix, which we will define later) and w1, the aforementioned weight matrix. This computes the weighted sum z1, and then the activation function is applied to get a1. The same thing is then done with the a1 matrix: we multiply it by w2 and apply the sigmoid again to get the output.
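
To see the dimensions at work, here is a small check (again assuming the XOR data defined later and the fully assembled class):

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])  # shape (4, 3)
nn = SimpleNeuralNetwork(3, 4, 1)
out = nn.forward(X)
print(nn.z1.shape, nn.a1.shape, out.shape)  # (4, 4) (4, 4) (4, 1)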

We arguably got to the toughest point: the backpropagation function. Here's the core of our problem and why I decided to write this.
Backpropagation is how a network learns. The example in this blog is how this network can approximate the XOR function. Here's the code of the backward function. I'll try to explain each operation.

    def backward(self, X, y, output):
        self.output_error = y - output
        self.output_delta = self.output_error * sigmoid_derivative(output)
        
        self.z1_error = np.dot(self.output_delta, self.w2.T)
        self.z1_delta = self.z1_error * sigmoid_derivative(self.a1)
        
        self.w1 += np.dot(X.T, self.z1_delta)
        self.w2 += np.dot(self.a1.T, self.output_delta)

The main idea is to take the error at the output layer, use it to calculate the error with respect to the weights, and then update them. The first operation computes the output error: a simple subtraction showing how far the prediction (output) is from the true value (y). The second is the crucial operation of backpropagation. Earlier we defined the sigmoid_derivative function, a simple function that calculates the slope of the sigmoid at the output value. Here we combine two kinds of information: how much we missed (the output error) and how much the output activation function (the sigmoid) was changing at the point of the output.
The intuition behind this is the following:

This output_delta is then used to calculate how much to adjust the weights between the hidden layer and the output layer, and also to calculate the error for the hidden layer (which is then used to adjust the weights between the input and hidden layers). The key idea is that we're not just using the raw error, but we're considering how much the activation function can change at that point, which helps us make more appropriate adjustments to our weights.
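
Here is a tiny worked example of those two factors being combined (the numbers are made up, just to illustrate):

y_true, out = 1.0, 0.8
output_error = y_true - out              # roughly 0.2: how far off we are
slope = sigmoid_derivative(out)          # roughly 0.16: how fast the sigmoid changes there
output_delta = output_error * slope      # roughly 0.032
print(output_delta)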

Let's break down the remaining parts, starting with the error for the hidden layer (z1_error). This step propagates the error from the output layer back to the hidden layer. We multiply the output_delta by the transpose of w2 (the weights between the hidden and output layers). Just to be clear, the transpose of a matrix is the operator that flips a matrix over its diagonal; we use it because it allows us to reverse the direction of the computation, effectively propagating the error backwards through the network. For example, consider the matrix w2:

w2 (4x1):
[[w_1],
 [w_2],
 [w_3],
 [w_4]]

The transpose of this particular matrix is:

w2.T (1x4):
[[w_1, w_2, w_3, w_4]]

This reversal is crucial for backpropagation, as it enables us to distribute the error signal back through the network. The transposed weights maintain the same connections but in the opposite direction, preserving the relative importance of each connection.

This gives us an estimate of how much each hidden neuron contributed to the output error.
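
In terms of shapes, with the XOR example this looks something like the following (the values here are placeholders, only the dimensions matter):

output_delta = np.full((4, 1), 0.032)   # one delta per training example
w2 = np.random.randn(4, 1)
z1_error = np.dot(output_delta, w2.T)
print(z1_error.shape)                    # (4, 4): an error estimate per hidden neuron, per example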

Then we calculate the delta for the hidden layer (z1_delta).
Similar to the output delta, this combines the error at the hidden layer with the derivative of the activation function. self.a1 represents the activations of the hidden layer. This tells us how to adjust the hidden layer's activations to reduce the error.

The last two operations are the weight updates. First, we update the weights between the input and hidden layers (w1):

self.w1 += np.dot(X.T, self.z1_delta)

X.T is the transpose of the input, same intuition as before. We multiply this by z1_delta to get the gradient for w1. This updates w1 to reduce the error. (Note: in a full implementation, you'd typically multiply this by a learning rate.) The learning rate is

a tunable parameter that determines the step size at each iteration while moving toward a minimum of a loss function.

The intuition is that we're adjusting each weight in proportion to both the input value it's connected to (X.T) and the error it caused (z1_delta). Larger inputs or larger errors will cause larger weight updates.

Second, we update the weights between the hidden and output layers (w2):

self.w2 += np.dot(self.a1.T, self.output_delta)

self.a1.T is the transpose of the hidden layer activations. We multiply this by output_delta to get the gradient for w2. This updates w2 to reduce the error. (Again, typically you'd include a learning rate.) Similar to the w1 update, we're adjusting each weight based on the activation of the hidden neuron it's coming from (a1.T) and the error it caused in the output (output_delta).
This process of updating weights based on the calculated error is what allows the neural network to learn from its mistakes and gradually improve its predictions over many iterations of training. And that's what backpropagation is about.
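
For reference, here is how I would sketch the same backward function with the learning rate actually applied, which is what the notes above hint at. This is a variation of mine, not the code used for the results below:

    def backward(self, X, y, output, learning_rate=0.1):
        output_error = y - output
        output_delta = output_error * sigmoid_derivative(output)

        z1_error = np.dot(output_delta, self.w2.T)
        z1_delta = z1_error * sigmoid_derivative(self.a1)

        # Scale each update by the learning rate to control the step size.
        self.w1 += learning_rate * np.dot(X.T, z1_delta)
        self.w2 += learning_rate * np.dot(self.a1.T, output_delta)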

Then, if we want to test our NN, we have to write a simple main block.

if __name__ == "__main__":
    # Input data
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.array([[0], [1], [1], [0]])

    # Create and train the network
    nn = SimpleNeuralNetwork(3, 4, 1)
    nn.train(X, y, epochs=10000, learning_rate=0.1)

    # Test the network
    test_data = np.array([[1, 1, 0]])
    prediction = nn.predict(test_data)
    print(f"Prediction for [1, 1, 0]: {prediction[0][0]}")

    # Print final weights
    print("\nFinal weights:")
    print("Input to hidden layer (W1):")
    print(nn.w1)
    print("\nHidden to output layer (W2):")
    print(nn.w2)

First, we prepare the data. In this example we create the input data (X) and the corresponding target outputs (y). X is a 4x3 matrix, representing 4 training examples, each with 3 input features. y is a 4x1 matrix, representing the desired output for each training example. This data represents the XOR problem: the output is 1 if exactly one of the first two inputs is 1, otherwise it's 0. The third input is always 1; since the network has no explicit bias terms, this constant column effectively plays the role of a bias.
Then we create an instance of our neural network by using the SimpleNeuralNetwork constructor. The arguments (3, 4, 1) specify 3 input neurons, 4 hidden neurons, and 1 output neuron.

Afterwards, we train the network on our data using the train method. It will run for 10,000 epochs (complete passes through the training data). A learning rate of 0.1 is also passed in (an arbitrary choice for this problem), although, as noted earlier, this simplified implementation never actually applies it to the weight updates.
Finally, we create a new, unseen input [1, 1, 0] to test the network and use the trained network to make a prediction on it. We print the prediction, and the output should be close to 0 for correct XOR behavior. The actual output is 0.0015886416361329066, which is pretty consistent with what we expected.
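
If you want to check all four training examples at once, something along these lines should work after training:

predictions = nn.predict(X)
print(np.round(predictions, 3))  # should be close to [[0], [1], [1], [0]]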

Here's the complete final code:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        np.random.seed(42)
        self.w1 = np.random.randn(input_size, hidden_size)
        self.w2 = np.random.randn(hidden_size, output_size)
        
    def forward(self, X):
        self.z1 = np.dot(X, self.w1)
        self.a1 = sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.w2)
        self.a2 = sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, output):
        self.output_error = y - output
        self.output_delta = self.output_error * sigmoid_derivative(output)
        
        self.z1_error = np.dot(self.output_delta, self.w2.T)
        self.z1_delta = self.z1_error * sigmoid_derivative(self.a1)
        
        self.w1 += np.dot(X.T, self.z1_delta)
        self.w2 += np.dot(self.a1.T, self.output_delta)
    
    def train(self, X, y, epochs, learning_rate):
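        # Note: learning_rate is accepted here but, in this simplified version,
        # it is never actually applied to the weight updates in backward().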
        for _ in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            
    def predict(self, X):
        return self.forward(X)

# Example usage
if __name__ == "__main__":
    # Input data
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.array([[0], [1], [1], [0]])

    # Create and train the network
    nn = SimpleNeuralNetwork(3, 4, 1)
    nn.train(X, y, epochs=10000, learning_rate=0.1)

    # Test the network
    test_data = np.array([[1, 1, 0]])
    prediction = nn.predict(test_data)
    print(f"Prediction for [1, 1, 0]: {prediction[0][0]}")

    # Print final weights
    print("\nFinal weights:")
    print("Input to hidden layer (W1):")
    print(nn.w1)
    print("\nHidden to output layer (W2):")
    print(nn.w2)

Disclaimer: I’m sure I didn’t explain this perfectly. I’m going through a learning curve, and sharing this helps me stay accountable and motivated to improve. It’s a great way to learn because exposing your mistakes is a valuable part of the process.
Thank you for reading this far, I really appreciate it. If you notice any inaccuracies or areas for improvement, please feel free to reach out, and I’ll make corrections as soon as possible.