Neural networks and deep learning were always a black box to me. You put in some data and it gives you back a prediction... magic! Even when using TensorFlow or PyTorch it wasn't crystal clear what was going on, so I decided to create one from scratch: no TensorFlow, no PyTorch, just NumPy, Python, and some calculus with linear algebra. Neural networks are a type of supervised learning algorithm where we have a lot of data, usually labeled $(x_i, y_i)$. If we train the model on enough pairs, say $(x_1, y_1)$ through $(x_{10000}, y_{10000})$, the model should learn, and if we give it a value $x_{10001}$ (which it has never seen before), it should be able to accurately tell us the output is $y_{10001}$.
I chose a couple of datasets to train my neural network on. You can find the Google Colab pages below:
The Iris dataset consists of 150 samples. Each sample has 4 features and a label identifying which flower those features belong to.
Features include:
We can summarize the training in about 5 steps here, though I will go into more detail later.
To simplify, I'm just going to look at the first and last neuron of layer 0 (I'll consider the input layer as layer 0) and the first and last neuron of layer 1 (the first hidden layer).
Weights $w_{ijk}$ (in layer $i$, connecting neuron $j$ of the previous layer to neuron $k$)
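As a concrete sketch of this notation (assuming the 4-5-5-3 layer sizes used throughout this post, which match the 60 weights and 13 biases counted later), the weights can be stored as one NumPy matrix per layer, indexed as `W[j, k]`:

```python
import numpy as np

# Hypothetical placeholder shapes for a 4 -> 5 -> 5 -> 3 network.
# W[j, k] connects neuron j of the previous layer to neuron k of this layer.
W1, b1 = np.zeros((4, 5)), np.zeros(5)   # layer 1: 4 inputs -> 5 neurons
W2, b2 = np.zeros((5, 5)), np.zeros(5)   # layer 2: 5 -> 5
W3, b3 = np.zeros((5, 3)), np.zeros(3)   # layer 3: 5 -> 3 output classes
```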
The activation function introduces non-linearity into our neural network. Without activation functions, our neural network would just be a stack of linear combinations, and that would be useless. If a connection is not strong enough, it simply does not "fire" the neuron. What ReLU does is pass the value $z_{ik}$ on only if it exceeds 0. If it is below 0, the connection is simply not strong enough and the output $a_{ik}$ will be 0.
In simple terms:

$$a_{ik} = \begin{cases} z_{ik} & \text{if } z_{ik} > 0 \\ 0 & \text{otherwise} \end{cases}$$

We can also write the ReLU function as $f(x) = \max(0, x)$.
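A minimal NumPy version of ReLU (and its derivative, which we will need later for backpropagation):

```python
import numpy as np

def relu(z):
    # Passes z through unchanged where z > 0, outputs 0 elsewhere
    return np.maximum(0, z)

def relu_derivative(z):
    # Slope is 1 where z > 0, and 0 where z <= 0
    return (z > 0).astype(float)

print(relu(np.array([-1.5, 0.0, 2.3])))  # [0.  0.  2.3]
```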
Softmax is another activation function; it lets us interpret the outputs as percentages. On their own, the logits $z_{31}, z_{32}, z_{33}$ do not represent percentages. The softmax function addresses this by exponentiating each logit, summing these exponentials, and then dividing each exponentiated logit by this total sum. This converts the logits into a set of values that are interpretable as the model's confidence in each possible outcome, with all the values summing to 1 (i.e. 100%). In other words, softmax lets us read off the confidence level of the machine learning model in its predictions, presenting the output as percentages.
$$\text{softmax}(z_{ik}) = \frac{e^{z_{ik}}}{\sum_{c=1}^{N} e^{z_{ic}}}$$
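Here is a minimal NumPy sketch of softmax. Subtracting the maximum logit before exponentiating is a common trick to avoid overflow, and it does not change the result:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the softmax output is unchanged
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))          # [0.659 0.242 0.099]
print(softmax(logits).sum())    # 1.0 -> i.e. 100%
```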
So how do we evaluate how close the model is to the true output? We use a loss function called categorical cross-entropy loss. The formula calculates the loss by summing the negative logarithms of the predicted probabilities (\( \hat{y}_c \)) for each category, weighted by the actual labels (\( y_c \)), across all M categories.
$$L = -\sum_{c=1}^{M} y_c \log(\hat{y}_c)$$
In the examples we can see what happens when our model is 96% sure it is Iris Virginica and it truly is Iris Virginica: the loss is only $-\ln(0.96) \approx 0.041$. When the model is 68% sure it is Iris Versicolor but it is actually Iris Virginica, only the probability assigned to the true class enters the loss; here the model gave about 25% to Iris Virginica, so the loss increases to $-\ln(0.25) \approx 1.386$.
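We can verify both numbers with a few lines of NumPy (the second case assumes the model assigned about 25% to the true class, since $e^{-1.386} \approx 0.25$):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # Only the probability of the true class contributes to the sum,
    # since y_true is one-hot encoded
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1])  # Iris Virginica

print(cross_entropy(y_true, np.array([0.01, 0.03, 0.96])))  # ~0.041
print(cross_entropy(y_true, np.array([0.07, 0.68, 0.25])))  # ~1.386
```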
So how do we get our model to do better? We aim to minimize the loss. To find the minimum of a function, we can solve for where the derivative is zero. It's true that when the derivative of a function is zero we are at a minimum or maximum, but it is often highly impractical or even impossible to solve for that directly. Instead, we apply gradient descent: find the current instantaneous slope and update our parameter based on it.
Let's say we want to see how much we need to shift our current weight $w$ in order to reduce the value output by the loss function, which here is $L = (w + 5)^2$. If we differentiate the loss function with respect to $w$, we can solve for the instantaneous slope of the loss function at any point along the axis. We update our $w$ parameter and move closer to the optimal value that minimizes the loss by subtracting the learning rate $\alpha$ (0.3 here) times the instantaneous slope at that point.
Initial w = -7, L = 4
After 1st iteration: w = -5.8, L = 0.64
After 2nd iteration: w = -5.32, L = 0.1024
After 3rd iteration: w = -5.128, L = 0.016384
After 4th iteration: w = -5.0512, L = 0.00262144
After 5th iteration: w = -5.02048, L = 0.0004194304
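The whole loop fits in a few lines of Python and reproduces the numbers above:

```python
w, alpha = -7.0, 0.3              # initial weight and learning rate

for i in range(5):
    grad = 2 * (w + 5)            # dL/dw for L = (w + 5)^2
    w = w - alpha * grad          # gradient descent update
    print(f"After iteration {i + 1}: w = {w}, L = {(w + 5) ** 2}")
```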
Here we have an example where gradient descent may not lead us to the global minimum of the loss function. The loss function is given here by $L = w^4 + w^3 - 3w^2 + 5.5$. If we differentiate the loss function with respect to that particular weight, we get $4w^3 + 3w^2 - 6w$. If our initial weight is initialized to $w = 1$, we will inch towards the local minimum of the loss function at $(w = 0.91, L = 4.46)$, while a more desirable global minimum exists at $(w = -1.66, L = 0.25)$. Since we have no idea where the global minimum might be, this problem is difficult to solve: where we initialize our weight largely determines which local/global minimum we fall upon. One way to combat this is to train the neural network multiple times with different randomized weight initializations and keep the run that achieves the lowest loss. The learning rate $\alpha$ is 0.1 here. A high learning rate lets us travel across the axis faster, but we may overshoot the targeted minimum. On the other hand, a low learning rate ensures we will not overshoot, but we may take too long to converge to the minimum.
Initial w = 1, L = 4.5
After 1st iteration: w = 0.9, L = 4.4551
After 2nd iteration: w = 0.9054, L = 4.454942
After 3rd iteration: w = 0.905835, L = 4.454941
After 4th iteration: w = 0.905866, L = 4.454941
After 5th iteration: w = 0.905869, L = 4.454941
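Here is a minimal sketch of the random-restart idea: run gradient descent from several random initializations and keep the run that ends with the lowest loss (the restart count, search range, learning rate, and step count below are all illustrative choices):

```python
import numpy as np

def L(w):  return w**4 + w**3 - 3 * w**2 + 5.5
def dL(w): return 4 * w**3 + 3 * w**2 - 6 * w

rng = np.random.default_rng(0)
best_w, best_loss = None, float("inf")

for _ in range(10):                 # 10 random restarts
    w = rng.uniform(-3, 3)          # random initial weight
    for _ in range(500):            # plain gradient descent
        w -= 0.01 * dL(w)
    if L(w) < best_loss:            # keep the best run so far
        best_w, best_loss = w, L(w)

print(best_w, best_loss)            # should land near (-1.66, 0.25)
```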
As shown previously, we have no idea where the optimal weights or biases may lie such that our neural network delivers optimal performance with minimal loss. Therefore, the weights are initialized randomly using a method called He initialization, introduced by Kaiming He and colleagues, and the biases are initialized to zero.
Kaiming He initialization works by picking each weight from a normal distribution with the standard deviation scaled by $\sqrt{\frac{2}{\text{fan-in}}}$, where fan-in is the number of inputs feeding into the layer. So if we were to initialize weights for layer 1, fan-in would be 4; for layer 2, fan-in would be 5, and so on. In a normal distribution with standard deviation 1 and mean 0, we have about a 68% chance of getting a value in (-1, 1), a 95% chance in (-2, 2), and a 99.7% chance in (-3, 3). If we scale by $\sqrt{\frac{2}{\text{fan-in}}}$, it becomes a 68% chance in $(-\sqrt{\frac{2}{\text{fan-in}}}, \sqrt{\frac{2}{\text{fan-in}}})$, a 95% chance in $(-2\sqrt{\frac{2}{\text{fan-in}}}, 2\sqrt{\frac{2}{\text{fan-in}}})$, and a 99.7% chance in $(-3\sqrt{\frac{2}{\text{fan-in}}}, 3\sqrt{\frac{2}{\text{fan-in}}})$. In general, we can integrate the normal distribution's density between two values $a$ and $b$ to find the probability of drawing a value between $a$ and $b$. Keep in mind that the probability of getting any exact value is 0 because there are infinitely many possible values (e.g. 0.1, 0.11, 0.111, ... all lie between 0.1 and 0.12).
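A minimal NumPy sketch of He initialization for this network (layer sizes 4, 5, 5, 3 as described above):

```python
import numpy as np

def he_init(fan_in, fan_out):
    # Weights ~ Normal(0, sqrt(2 / fan_in)); biases start at zero
    W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
    b = np.zeros(fan_out)
    return W, b

W1, b1 = he_init(4, 5)   # layer 1: fan-in 4
W2, b2 = he_init(5, 5)   # layer 2: fan-in 5
W3, b3 = he_init(5, 3)   # output layer: fan-in 5
```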
From what we saw above in Minimizing loss: Gradient Descent, we can begin to do the same thing for each and every weight and bias in our neural network. If we scan through the network, we can see that we have 73 parameters in total to update and fine-tune: 60 weights and 13 biases. Remember what we were doing in Minimizing loss: Gradient Descent? Find the derivative of the loss function with respect to $w$ to discover how much $w$ affects the loss, then change the value of $w$ to minimize our loss.
We are going to do the exact same thing through our network. This time it will be a little trickier, though: no longer are we differentiating with respect to one variable, but with respect to many. This is why we need partial derivatives. We will calculate the partial derivative of the loss function with respect to every single one of these 60 weights and 13 biases so that each may find its own optimal direction to shift towards.
Time to implement the chain rule. We always start from the end of the network and work our way to the beginning, because the loss function's output is closest to the weights and biases of layer 3. To find how much each weight and bias from layer 3 affects the loss function, we retrace our steps and ask: what is the most recent thing that affects the loss function? That would be the output \( \hat{y}_k \). What's the most recent thing that affects \( \hat{y}_k \)? That would be $z_{3k}$, also known as the logits. What's the most recent thing that affects $z_{3k}$? That would be $w_{3jk}$ and $b_{3k}$. Great! This is what we are interested in updating.
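For the output layer there is a well-known simplification (standard for softmax combined with cross-entropy, stated here without derivation): the chain-rule steps through softmax collapse so that \( \frac{\partial L}{\partial z_{3k}} = \hat{y}_k - y_k \). A small sketch, where `a2` is a hypothetical vector of layer-2 activations:

```python
import numpy as np

y_hat = np.array([0.07, 0.68, 0.25])  # softmax output of the model
y     = np.array([0.0, 0.0, 1.0])     # one-hot true label
a2    = np.random.rand(5)             # hypothetical layer-2 activations

# Standard softmax + cross-entropy result: dL/dz_{3k} = y_hat_k - y_k
dL_dz3 = y_hat - y
dL_dW3 = np.outer(a2, dL_dz3)         # dL/dw_{3jk} = a_{2j} * dL/dz_{3k}
dL_db3 = dL_dz3                       # since dz_{3k}/db_{3k} = 1
```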
Here we analyze how we might write the code for the backpropagation algorithm. In the example, we are trying to solve for \( \frac{\partial L}{\partial w_{1jk}} \) and \( \frac{\partial L}{\partial b_{1k}} \). Notice the introduction of two variables $k$ and $k'$. This is because each $a_{ik}$ affects every $z_{(i+1)k'}$; for example, $a_{11}$ appears in $z_{21}, z_{22}, \ldots, z_{25}$. Therefore we have to introduce another variable $k'$, where $k'$ is not necessarily equal to $k$. Since the calculations for \( \frac{\partial L}{\partial w_{1jk}} \) and \( \frac{\partial L}{\partial b_{1k}} \) are almost the same, I will just focus on \( \frac{\partial L}{\partial w_{1jk}} \).
$$\frac{\partial L}{\partial w_{1jk}} = \sum_{k'} \frac{\partial L}{\partial z_{2k'}} \cdot \frac{\partial z_{2k'}}{\partial a_{1k}} \cdot \frac{\partial a_{1k}}{\partial z_{1k}} \cdot \frac{\partial z_{1k}}{\partial w_{1jk}}$$
The term \( \frac{\partial L}{\partial z_{2k'}} \) was already calculated in layer 2. The term \( \frac{\partial z_{2k'}}{\partial a_{1k}} \) is simply the weights associated with layer 2. This is the value returned from the previous function call, so we can pass it in as the parameter \( \frac{\partial L}{\partial a_{1k}} \), with the variable named dL_da. All that is left is to calculate \( \frac{\partial a_{1k}}{\partial z_{1k}} \), which is the derivative of ReLU. (Note that if we are coming from the output layer, we omit the multiplication with ReLU'; I wrote it such that the initial incoming dL_da will automatically be the already-calculated \( \frac{\partial L}{\partial z_{3k}} \).) We also calculate \( \frac{\partial z_{1k}}{\partial w_{1jk}} \), which is $a_{0j}$, in this case the initial inputs $(x_1, x_2, x_3, x_4)$. In this example we have no more weights or biases to update, but in the general case where we keep going backwards, we can simply return prev_dL_da = \( \frac{\partial L}{\partial a_{(i-1)k}} \) so that the next time we call the backward function, we will have our dL_da ready to go.
In general we can come up with the following formulas:
$$\frac{\partial L}{\partial w_{ijk}} = \frac{\partial L}{\partial a_{ik}} \cdot \frac{\partial a_{ik}}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{ijk}}$$
$$\frac{\partial L}{\partial b_{ik}} = \frac{\partial L}{\partial a_{ik}} \cdot \frac{\partial a_{ik}}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial b_{ik}}$$
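Putting this together, here is a minimal sketch of a backward function with the shape described above (the dL_da / prev_dL_da names follow the post; everything else is an illustrative assumption, not my exact implementation):

```python
import numpy as np

def backward(dL_da, z, a_prev, W, b, lr=0.01, is_output_layer=False):
    """One backward step for a single layer (a sketch).

    dL_da  : incoming gradient; for the output layer this is already dL/dz
    z      : pre-activations of this layer
    a_prev : activations of the previous layer
    W, b   : this layer's weights, shape (len(a_prev), len(z)), and biases
    """
    if is_output_layer:
        dL_dz = dL_da                  # dL/dz_3 arrives precomputed, skip ReLU'
    else:
        dL_dz = dL_da * (z > 0)        # multiply by ReLU'(z)

    dL_dW = np.outer(a_prev, dL_dz)    # dz_ik/dw_ijk = a_(i-1)j
    dL_db = dL_dz                      # dz_ik/db_ik = 1

    prev_dL_da = W @ dL_dz             # dL/da_(i-1)k, summing over k'

    W -= lr * dL_dW                    # gradient descent updates (in place)
    b -= lr * dL_db
    return prev_dL_da
```

Note that prev_dL_da is computed before the weights are updated, so the gradient flowing to the previous layer uses the weights that actually produced this forward pass.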
Once we have shifted the weights and biases for a defined number of epochs (how many times we pass each and every data point in (X-train, y-train) through the model), we should end up with some pretty good-looking weights and biases.
The data from X-test would be sent through the model, and for every sample in X-test there will be an output \( \hat{y} \). We then compare \( \hat{y} \) with the true answer found in y-test. The number correct divided by the total number of tests is the accuracy.
In most categorical models, the output will look something like [0.013, 0.022, 0.96]. To evaluate this, we simply select the largest value in the list, since that is the category the model most confidently predicts, and compare it with the true answer. If the true answer is [0, 0, 1], it counts as correct; if it is [0, 1, 0] or [1, 0, 0], it is incorrect.
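A minimal sketch of this evaluation using np.argmax (assuming a hypothetical predict(x) that returns the model's softmax output for one sample):

```python
import numpy as np

def accuracy(X_test, y_test, predict):
    # predict(x) is assumed to return softmax probabilities, e.g. [0.013, 0.022, 0.96]
    correct = 0
    for x, y_true in zip(X_test, y_test):
        y_hat = predict(x)
        if np.argmax(y_hat) == np.argmax(y_true):  # largest value vs. one-hot label
            correct += 1
    return correct / len(X_test)
```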
Next I am going to run through an example of what it looks like when we try to make a prediction with some pretty good weights and biases.
Let's say our model is trained and ready to go!
Our weights and biases are pretty well optimized, as shown in the picture.
Let's do a mini evaluation with one data point (usually we evaluate on many more data points to see how well the model really performs, but this is just an example). Below is a data pair from (X-test, y-test).