Neural networks and deep learning were always a black box to me. You put in some data and it gives you back a prediction... magic! Even when using TensorFlow or PyTorch it wasn't crystal clear what was going on, so I decided to create one from scratch: no TensorFlow, no PyTorch, just NumPy, Python, and some calculus with linear algebra. Neural networks are a type of supervised learning algorithm where we have a lot of data, usually labeled $(x_i, y_i)$. If we train the model on enough pairs, say $(x_1, y_1)$ through $(x_{10000}, y_{10000})$, the model should learn, and if we give it a value $x_{10001}$ (which it has never seen before), it should be able to accurately tell us the output is $y_{10001}$.
I chose a couple of datasets to train my neural network on. You can find the Google Colab pages below:
The Iris dataset consists of 150 samples. Each sample has 4 features and a label identifying which flower those features belong to.
Features include:
We can summarize the training in about 5 steps here, though I will go into more detail later.
To simplify, I'm just going to look at the first and last neuron of layer 0 (I'll consider the input layer as layer 0) and the first and last neuron of layer 1 (the first hidden layer).
Weights $w_{ijk}$ (in layer $i$, connecting neuron $j$ of the previous layer to neuron $k$)
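As a concrete sketch of this notation (assuming the 4-5-5-3 layer sizes used throughout this post, which match the 60 weights and 13 biases counted later), the weights can be stored as one NumPy matrix per layer, indexed as `W[j, k]`:

```python
import numpy as np

# Hypothetical placeholder shapes for a 4 -> 5 -> 5 -> 3 network.
# W[j, k] connects neuron j of the previous layer to neuron k of this layer.
W1, b1 = np.zeros((4, 5)), np.zeros(5)   # layer 1: 4 inputs -> 5 neurons
W2, b2 = np.zeros((5, 5)), np.zeros(5)   # layer 2: 5 -> 5
W3, b3 = np.zeros((5, 3)), np.zeros(3)   # layer 3: 5 -> 3 output classes
```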
The activation function introduces non-linearity into our neural network. Without activation functions, our neural network would just be a stack of linear combinations, and that would be useless. If a connection is not strong enough, it simply does not "fire" the neuron. What ReLU does is pass the value $z_{ik}$ on only if it exceeds 0. If it is below 0, the connection is simply not strong enough and the output $a_{ik}$ will be 0.
In simple terms:

$$a_{ik} = \begin{cases} z_{ik} & \text{if } z_{ik} > 0 \\ 0 & \text{otherwise} \end{cases}$$

We can also write the ReLU function as $f(x) = \max(0, x)$.
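A minimal NumPy version of ReLU (and its derivative, which we will need later for backpropagation):

```python
import numpy as np

def relu(z):
    # Passes z through unchanged where z > 0, outputs 0 elsewhere
    return np.maximum(0, z)

def relu_derivative(z):
    # Slope is 1 where z > 0, and 0 where z <= 0
    return (z > 0).astype(float)

print(relu(np.array([-1.5, 0.0, 2.3])))  # [0.  0.  2.3]
```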
Softmax is another activation function; it lets us interpret the outputs as percentages. On their own, the logits $z_{31}, z_{32}, z_{33}$ do not represent percentages. The softmax function addresses this by exponentiating each logit, summing these exponentials, and then dividing each exponentiated logit by this total sum. This converts the logits into a set of values that are interpretable as the model's confidence in each possible outcome, with all the values summing to 1 (i.e. 100%). In other words, softmax lets us read off the confidence level of the machine learning model in its predictions, presenting the output as percentages.
$$\text{softmax}(z_{ik}) = \frac{e^{z_{ik}}}{\sum_{c=1}^{N} e^{z_{ic}}}$$
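Here is a minimal NumPy sketch of softmax. Subtracting the maximum logit before exponentiating is a common trick to avoid overflow, and it does not change the result:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the softmax output is unchanged
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))          # [0.659 0.242 0.099]
print(softmax(logits).sum())    # 1.0 -> i.e. 100%
```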
So how do we evaluate how close the model is to the true output? We use a loss function called categorical cross-entropy loss. The formula calculates the loss by summing the negative logarithms of the predicted probabilities (\( \hat{y}_c \)) for each category, weighted by the actual labels (\( y_c \)), across all M categories.
$$L = -\sum_{c=1}^{M} y_c \log(\hat{y}_c)$$
In the examples we can see what happens when our model is 96% sure it is Iris Virginica and it truly is Iris Virginica: the loss is only $-\ln(0.96) \approx 0.041$. When the model is 68% sure it is Iris Versicolor but it is actually Iris Virginica, only the probability assigned to the true class enters the loss; here the model gave about 25% to Iris Virginica, so the loss increases to $-\ln(0.25) \approx 1.386$.
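We can verify both numbers with a few lines of NumPy (the second case assumes the model assigned about 25% to the true class, since $e^{-1.386} \approx 0.25$):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # Only the probability of the true class contributes to the sum,
    # since y_true is one-hot encoded
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1])  # Iris Virginica

print(cross_entropy(y_true, np.array([0.01, 0.03, 0.96])))  # ~0.041
print(cross_entropy(y_true, np.array([0.07, 0.68, 0.25])))  # ~1.386
```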
So how do we get our model to do better? We aim to minimize the loss. To find the minimum of a function, we can solve for where the derivative is zero. It's true that when the derivative of a function is zero we are at a minimum or maximum, but it is often highly impractical or even impossible to solve for that directly. Instead, we apply gradient descent: find the current instantaneous slope and update our parameter based on it.
Let's say we want to see how much we need to shift our current weight $w$ in order to reduce the value output by the loss function, which here is $L = (w + 5)^2$. If we differentiate the loss function with respect to $w$, we can solve for the instantaneous slope of the loss function at any point along the axis. We update our $w$ parameter and move closer to the optimal value that minimizes the loss by subtracting the learning rate $\alpha$ (0.3 here) times the instantaneous slope at that point.
Initial w = -7, L = 4
After 1st iteration: w = -5.8, L = 0.64
After 2nd iteration: w = -5.32, L = 0.1024
After 3rd iteration: w = -5.128, L = 0.016384
After 4th iteration: w = -5.0512, L = 0.00262144
After 5th iteration: w = -5.02048, L = 0.0004194304
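The whole loop fits in a few lines of Python and reproduces the numbers above:

```python
w, alpha = -7.0, 0.3              # initial weight and learning rate

for i in range(5):
    grad = 2 * (w + 5)            # dL/dw for L = (w + 5)^2
    w = w - alpha * grad          # gradient descent update
    print(f"After iteration {i + 1}: w = {w}, L = {(w + 5) ** 2}")
```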
Here we have an example where gradient descent may not lead us to the global minimum of the loss function. The loss function is given here by $L = w^4 + w^3 - 3w^2 + 5.5$. If we differentiate the loss function with respect to that particular weight, we get $4w^3 + 3w^2 - 6w$. If our initial weight is initialized to $w = 1$, we will inch towards the local minimum of the loss function at $(w = 0.91, L = 4.46)$, while a more desirable global minimum exists at $(w = -1.66, L = 0.25)$. Since we have no idea where the global minimum might be, this problem is difficult to solve: where we initialize our weight largely determines which local/global minimum we fall upon. One way to combat this is to train the neural network multiple times with different randomized weight initializations and keep the run that achieves the lowest loss. The learning rate $\alpha$ is 0.1 here. A high learning rate lets us travel across the axis faster, but we may overshoot the targeted minimum. On the other hand, a low learning rate ensures we will not overshoot, but we may take too long to converge to the minimum.
Initial w = 1, L = 4.5
After 1st iteration: w = 0.9, L = 4.4551
After 2nd iteration: w = 0.9054, L = 4.454942
After 3rd iteration: w = 0.905835, L = 4.454941
After 4th iteration: w = 0.905866, L = 4.454941
After 5th iteration: w = 0.905869, L = 4.454941
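Here is a minimal sketch of the random-restart idea: run gradient descent from several random initializations and keep the run that ends with the lowest loss (the restart count, search range, learning rate, and step count below are all illustrative choices):

```python
import numpy as np

def L(w):  return w**4 + w**3 - 3 * w**2 + 5.5
def dL(w): return 4 * w**3 + 3 * w**2 - 6 * w

rng = np.random.default_rng(0)
best_w, best_loss = None, float("inf")

for _ in range(10):                 # 10 random restarts
    w = rng.uniform(-3, 3)          # random initial weight
    for _ in range(500):            # plain gradient descent
        w -= 0.01 * dL(w)
    if L(w) < best_loss:            # keep the best run so far
        best_w, best_loss = w, L(w)

print(best_w, best_loss)            # should land near (-1.66, 0.25)
```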
As shown previously, we have no idea where the optimal weights or biases may lie such that our neural network delivers optimal performance with minimal loss. Therefore, the weights are initialized randomly using a method called He initialization, introduced by Kaiming He and colleagues, and the biases are initialized to zero.
Kaiming He initialization works by picking each weight from a normal distribution with the standard deviation scaled by $\sqrt{\frac{2}{\text{fan-in}}}$, where fan-in is the number of inputs feeding into the layer. So if we were to initialize weights for layer 1, fan-in would be 4; for layer 2, fan-in would be 5, and so on. In a normal distribution with standard deviation 1 and mean 0, we have about a 68% chance of getting a value in (-1, 1), a 95% chance in (-2, 2), and a 99.7% chance in (-3, 3). If we scale by $\sqrt{\frac{2}{\text{fan-in}}}$, it becomes a 68% chance in $(-\sqrt{\frac{2}{\text{fan-in}}}, \sqrt{\frac{2}{\text{fan-in}}})$, a 95% chance in $(-2\sqrt{\frac{2}{\text{fan-in}}}, 2\sqrt{\frac{2}{\text{fan-in}}})$, and a 99.7% chance in $(-3\sqrt{\frac{2}{\text{fan-in}}}, 3\sqrt{\frac{2}{\text{fan-in}}})$. In general, we can integrate the normal distribution's density between two values $a$ and $b$ to find the probability of drawing a value between $a$ and $b$. Keep in mind that the probability of getting any exact value is 0 because there are infinitely many possible values (e.g. 0.1, 0.11, 0.111, ... all lie between 0.1 and 0.12).
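A minimal NumPy sketch of He initialization for this network (layer sizes 4, 5, 5, 3 as described above):

```python
import numpy as np

def he_init(fan_in, fan_out):
    # Weights ~ Normal(0, sqrt(2 / fan_in)); biases start at zero
    W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
    b = np.zeros(fan_out)
    return W, b

W1, b1 = he_init(4, 5)   # layer 1: fan-in 4
W2, b2 = he_init(5, 5)   # layer 2: fan-in 5
W3, b3 = he_init(5, 3)   # output layer: fan-in 5
```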
From what we saw above in Minimizing loss: Gradient Descent, we can begin to do the same thing for each and every weight and bias in our neural network. If we scan through the network, we can see that we have 73 parameters in total to update and fine-tune: 60 weights and 13 biases. Remember what we were doing in Minimizing loss: Gradient Descent? Find the derivative of the loss function with respect to $w$ to discover how much $w$ affects the loss, then change the value of $w$ to minimize our loss.
We are going to do the exact same thing through our network. This time it will be a little trickier, though: no longer are we differentiating with respect to one variable, but with respect to many. This is why we need partial derivatives. We will calculate the partial derivative of the loss function with respect to every single one of these 60 weights and 13 biases so that each may find its own optimal direction to shift towards.
Time to implement the chain rule. We always start from the end of the network and work our way to the beginning, because the loss function's output is closest to the weights and biases of layer 3. To find how much each weight and bias from layer 3 affects the loss function, we retrace our steps and ask: what is the most recent thing that affects the loss function? That would be the output \( \hat{y}_k \). What's the most recent thing that affects \( \hat{y}_k \)? That would be $z_{3k}$, also known as the logits. What's the most recent thing that affects $z_{3k}$? That would be $w_{3jk}$ and $b_{3k}$. Great! This is what we are interested in updating.
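For the output layer there is a well-known simplification (standard for softmax combined with cross-entropy, stated here without derivation): the chain-rule steps through softmax collapse so that \( \frac{\partial L}{\partial z_{3k}} = \hat{y}_k - y_k \). A small sketch, where `a2` is a hypothetical vector of layer-2 activations:

```python
import numpy as np

y_hat = np.array([0.07, 0.68, 0.25])  # softmax output of the model
y     = np.array([0.0, 0.0, 1.0])     # one-hot true label
a2    = np.random.rand(5)             # hypothetical layer-2 activations

# Standard softmax + cross-entropy result: dL/dz_{3k} = y_hat_k - y_k
dL_dz3 = y_hat - y
dL_dW3 = np.outer(a2, dL_dz3)         # dL/dw_{3jk} = a_{2j} * dL/dz_{3k}
dL_db3 = dL_dz3                       # since dz_{3k}/db_{3k} = 1
```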
Here we analyze how we might write the code for the backpropagation algorithm. In the example, we are trying to solve for \( \frac{\partial L}{\partial w_{1jk}} \) and \( \frac{\partial L}{\partial b_{1k}} \). Notice the introduction of two variables $k$ and $k'$. This is because each $a_{ik}$ affects every $z_{(i+1)k'}$; for example, $a_{11}$ appears in $z_{21}, z_{22}, \ldots, z_{25}$. Therefore we have to introduce another variable $k'$, where $k'$ is not necessarily equal to $k$. Since the calculations for \( \frac{\partial L}{\partial w_{1jk}} \) and \( \frac{\partial L}{\partial b_{1k}} \) are almost the same, I will just focus on \( \frac{\partial L}{\partial w_{1jk}} \).
$$\frac{\partial L}{\partial w_{1jk}} = \sum_{k'} \frac{\partial L}{\partial z_{2k'}} \cdot \frac{\partial z_{2k'}}{\partial a_{1k}} \cdot \frac{\partial a_{1k}}{\partial z_{1k}} \cdot \frac{\partial z_{1k}}{\partial w_{1jk}}$$
The term \( \frac{\partial L}{\partial z_{2k'}} \) was already calculated in layer 2. The term \( \frac{\partial z_{2k'}}{\partial a_{1k}} \) is simply the weights associated with layer 2. This is the value returned from the previous function call, so we can pass it in as the parameter \( \frac{\partial L}{\partial a_{1k}} \), with the variable named dL_da. All that is left is to calculate \( \frac{\partial a_{1k}}{\partial z_{1k}} \), which is the derivative of ReLU. (Note that if we are coming from the output layer, we omit the multiplication with ReLU'; I wrote it such that the initial incoming dL_da will automatically be the already-calculated \( \frac{\partial L}{\partial z_{3k}} \).) We also calculate \( \frac{\partial z_{1k}}{\partial w_{1jk}} \), which is $a_{0j}$, in this case the initial inputs $(x_1, x_2, x_3, x_4)$. In this example we have no more weights or biases to update, but in the general case where we keep going backwards, we can simply return prev_dL_da = \( \frac{\partial L}{\partial a_{(i-1)k}} \) so that the next time we call the backward function, we will have our dL_da ready to go.
In general we can come up with the following formulas:
$$\frac{\partial L}{\partial w_{ijk}} = \frac{\partial L}{\partial a_{ik}} \cdot \frac{\partial a_{ik}}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{ijk}}$$
$$\frac{\partial L}{\partial b_{ik}} = \frac{\partial L}{\partial a_{ik}} \cdot \frac{\partial a_{ik}}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial b_{ik}}$$
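Putting this together, here is a minimal sketch of a backward function with the shape described above (the dL_da / prev_dL_da names follow the post; everything else is an illustrative assumption, not my exact implementation):

```python
import numpy as np

def backward(dL_da, z, a_prev, W, b, lr=0.01, is_output_layer=False):
    """One backward step for a single layer (a sketch).

    dL_da  : incoming gradient; for the output layer this is already dL/dz
    z      : pre-activations of this layer
    a_prev : activations of the previous layer
    W, b   : this layer's weights, shape (len(a_prev), len(z)), and biases
    """
    if is_output_layer:
        dL_dz = dL_da                  # dL/dz_3 arrives precomputed, skip ReLU'
    else:
        dL_dz = dL_da * (z > 0)        # multiply by ReLU'(z)

    dL_dW = np.outer(a_prev, dL_dz)    # dz_ik/dw_ijk = a_(i-1)j
    dL_db = dL_dz                      # dz_ik/db_ik = 1

    prev_dL_da = W @ dL_dz             # dL/da_(i-1)k, summing over k'

    W -= lr * dL_dW                    # gradient descent updates (in place)
    b -= lr * dL_db
    return prev_dL_da
```

Note that prev_dL_da is computed before the weights are updated, so the gradient flowing to the previous layer uses the weights that actually produced this forward pass.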
Once we have shifted the weights and biases for a defined number of epochs (how many times we pass each and every data point in (X-train, y-train) through the model), we should end up with some pretty good-looking weights and biases.
The data from X-test would be sent through the model, and for every sample in X-test there will be an output \( \hat{y} \). We then compare \( \hat{y} \) with the true answer found in y-test. The number correct divided by the total number of tests is the accuracy.
In most categorical models, the output will look something like [0.013, 0.022, 0.96]. To evaluate this, we simply select the largest value in the list, since that is the category the model most confidently predicts, and compare it with the true answer. If the true answer is [0, 0, 1], it counts as correct; if it is [0, 1, 0] or [1, 0, 0], it is incorrect.
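A minimal sketch of this evaluation using np.argmax (assuming a hypothetical predict(x) that returns the model's softmax output for one sample):

```python
import numpy as np

def accuracy(X_test, y_test, predict):
    # predict(x) is assumed to return softmax probabilities, e.g. [0.013, 0.022, 0.96]
    correct = 0
    for x, y_true in zip(X_test, y_test):
        y_hat = predict(x)
        if np.argmax(y_hat) == np.argmax(y_true):  # largest value vs. one-hot label
            correct += 1
    return correct / len(X_test)
```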
Next I am going to run through an example of what it looks like when we try to make a prediction with some pretty good weights and biases.
Let's say our model is trained and ready to go!
Our weights and biases are pretty well optimized, as shown in the picture.
Let's do a mini evaluation with one data point (usually we evaluate on many more data points to see how well the model really performs, but this is just an example). Below is a data pair from (X-test, y-test).