Neural Network from Scratch - Updated

August 31, 2025 · Machine Learning · 8 minutes

In my last blog, I set up a basic 2-2-1 neural network by hand to solve the XOR table. I used mean squared error (MSE) for the network’s loss function:

$$ \mathcal{L} = \tfrac{1}{2}(\hat{y} - y)^2 $$

As a reminder, the loss function is how we calculate how far from the correct answer a network was during training.

However, as I discovered later, MSE is actually not well suited to this particular problem. MSE is better suited to regression than to a simple classification like this, and paired with a sigmoid output it can run into a vanishing gradients problem very quickly: it doesn’t provide a strong enough gradient in response to “confidently wrong” answers, so the network’s ability to learn slows or even stalls.

With binary cross-entropy loss (BCE) on the other hand, if the network is very wrong, the gradient will be very large in response. This allows the network to be trained much more reliably and quickly.
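
To make this concrete, here’s a tiny illustration (mine, not from the notebook) of the output-layer error term for a confidently wrong prediction under each loss, assuming a sigmoid output neuron. The BCE expression uses the $\hat y - y$ result derived in the next section.

```python
# Output-layer error term dL/dz for a confidently wrong prediction,
# assuming a sigmoid output neuron. Values are illustrative.
y = 1.0        # true label
y_hat = 0.01   # the network is confidently wrong

mse_error_term = (y_hat - y) * y_hat * (1 - y_hat)  # (y_hat - y) * sigmoid'(z)
bce_error_term = y_hat - y                          # sigmoid derivative cancels (derived below)

print(f"MSE: {mse_error_term:+.4f}")  # about -0.0098 -- almost no learning signal
print(f"BCE: {bce_error_term:+.4f}")  # -0.9900 -- a strong push in the right direction
```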

I wanted to revisit the last post’s implementation, at a minimum updating the network to use the new loss function, which has implications for how we determine the error term for each layer.

Updated Error Term

Let’s start with the loss function. Binary cross-entropy loss is defined as:

$$ \mathcal{L}(y, \hat{y}) = - \big[ y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \big] $$

This is important to call out because we’re about to factor it into our updated error term, and also because, as we learned in the last post, we’ll want to calculate the loss explicitly so we can track convergence during training.
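
As a rough sketch of what that explicit calculation might look like in NumPy (the function name and the epsilon clamp are my own choices here, not necessarily what the notebook uses):

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy, element-wise over y and y_hat.

    The eps clamp keeps log() away from 0 when the sigmoid output saturates;
    it's a common guard, not something specific to this network.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

For example, `bce_loss(1.0, 0.9)` comes out to roughly 0.105, while `bce_loss(1.0, 0.01)` is about 4.6, reflecting how hard BCE punishes confident mistakes.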

As a reminder, the error term is the derivative of the loss with respect to the pre-activation $z$. You can pretty easily find the answer via other online resources:

$$ \frac{\partial \mathcal{L}}{\partial z} = \hat y - y $$

However, I still want to see how the sausage is made, especially since I don’t live and breathe calculus every day. This did end up being a little more complicated than the last post, so I still wanted to go step-by-step even if only for future me’s benefit. Feel free to expand the section below if you’re curious.

Details for obtaining error term

By now we know we’ll need the chain rule to expand this out a little bit.

$$ \frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} $$

Let’s look at the first factor, $\frac{\partial \mathcal{L}}{\partial \hat y}$, first. As a reminder, here’s our loss function (BCE):

$$ \mathcal{L}(y,\hat y)=-\Big[y\ln(\hat y)+(1-y)\ln(1-\hat y)\Big] $$

We can use the linearity of derivatives to break this down into terms:

$$ \frac{\partial}{\partial \hat y}\big[-(A+B)\big] = -\frac{\partial}{\partial \hat y}(A+B) = -\left(\frac{\partial A}{\partial \hat y} + \frac{\partial B}{\partial \hat y}\right), $$

where $A = y\ln\hat y$ and $B = (1-y)\ln(1-\hat y)$. Distributing the minus sign into each term, we can differentiate $-y\ln\hat y$ and $-(1-y)\ln(1-\hat y)$ individually, and then add the results at the end.

For the first term:

$$ \begin{aligned} \frac{\partial}{\partial \hat y} \Big(-y \ln \hat y\Big) &= -y \cdot \frac{\partial}{\partial \hat y} \ln \hat y \\ &= -y \cdot \frac{1}{\hat y} \\ &= \frac{-y \cdot 1}{\hat y} \\ &= -\frac{y}{\hat y} \end{aligned} $$

At a high level, the steps are:

  1. Pull out the constant $-y$
  2. The derivative of $\ln \hat y$ with respect to $\hat y$ is $\frac{1}{\hat y}$.
  3. Simplify

The second term is a little more complicated, but uses some of the same steps. The main complication is that we’re passing $1-\hat y$ into the natural log, not just $\hat y$, so we need an extra chain rule step:

$$ \begin{aligned} \frac{\partial}{\partial \hat y} \Big(-(1-y)\ln(1-\hat y)\Big) &= -(1-y) \cdot \frac{\partial}{\partial \hat y} \ln(1-\hat y) \\ &= -(1-y) \cdot \left( \frac{1}{1-\hat y} \cdot \frac{\partial}{\partial \hat y}(1-\hat y) \right) \\ &= -(1-y) \cdot \left( \frac{1}{1-\hat y} \cdot (-1) \right) \\ &= -(1-y) \cdot \left( -\frac{1}{1-\hat y} \right) \\ &= (1-y) \cdot \left( \frac{1}{1-\hat y} \right) \\ &= \frac{1-y}{1-\hat y} \end{aligned} $$

Step-by-step:

  1. Pull out $-(1-y)$ as the constant factor.
  2. Apply the chain rule to $\ln(1-\hat y)$.
  3. Differentiate $1-\hat y$ to get $-1$ (remember, the derivative of a constant with respect to a variable is 0, and the derivative of the variable with respect to itself is 1).
  4. Simplify the double negative and multiply through by the constant factor to get the final fraction.

Now we can re-join our two separated terms to get the derivative of the loss w.r.t. $\hat y$, and simplify:

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial \hat y} &= -\frac{y}{\hat y} + \frac{1-y}{1-\hat y} \\ &= \frac{-y(1-\hat y)}{\hat y(1-\hat y)} + \frac{\hat y(1-y)}{\hat y(1-\hat y)} \\ &= \frac{-y(1-\hat y) + \hat y(1-y)}{\hat y(1-\hat y)} \\ &= \frac{-y + y\hat y + \hat y - y\hat y}{\hat y(1-\hat y)} \\ &= \frac{\hat y - y}{\hat y(1-\hat y)} \\ \end{aligned} $$

Steps:

  1. Rewrite both fractions over the common denominator $\hat y(1-\hat y)$
  2. Combine into a single fraction
  3. Expand the numerator
  4. Cancel the $y\hat y$ terms

So our loss derivative w.r.t. $\hat y$ is:

$$ \displaystyle \frac{\partial \mathcal{L}}{\partial \hat y} = \frac{\hat y - y}{\hat y(1-\hat y)} $$

Finally, we can go all the way back to our initial chain rule and substitute in the two factors:

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial z} &= \frac{\partial \mathcal{L}}{\partial \hat y}\cdot\frac{\partial \hat y}{\partial z} \\ &= \frac{\hat y - y}{\hat y(1-\hat y)}\cdot \hat y(1-\hat y) \\ &= \boxed{\hat y - y} \end{aligned} $$
  1. Substitute the first factor with the loss derivative w.r.t. $\hat y$ we found above, and the second factor with the well-known sigmoid derivative $\frac{\partial \hat y}{\partial z} = \hat y(1-\hat y)$.
  2. The division by and multiplication by $\hat y(1-\hat y)$ cancel each other out, so we’re just left with $\hat y - y$.
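
If you want to double-check the algebra without grinding through it by hand, here’s a quick SymPy sanity check (mine, not part of the notebook):

```python
import sympy as sp

y, z = sp.symbols("y z")
y_hat = 1 / (1 + sp.exp(-z))                                # sigmoid output
loss = -(y * sp.log(y_hat) + (1 - y) * sp.log(1 - y_hat))   # BCE

dL_dz = sp.diff(loss, z)
print(sp.simplify(dL_dz - (y_hat - y)))  # prints 0, confirming dL/dz == y_hat - y
```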

I won’t always break things down in this much detail, but given I’m still getting my feet wet with deep learning, and anything to do with calculus was a long time ago, it’s a good exercise for me right now.

Changes to Implementation

Below is a notebook containing the updated implementation. Try it out! But before we get there, here is a summary of the changes from the previous version.

In my research I encountered a lot of ideas for improvement but settled on these. I believe all of them meaningfully contributed to an improvement in convergence reliability, but tbh they still depend on each other in complex ways, so it’s possible these can be whittled down to achieve the same or even greater benefit. Nevertheless, here they are:

  1. Changing to BCE instead of MSE. In addition to being the right tool for the job, doing this allowed me to make other subsequent changes to further improve things.
  2. Smaller initial weights/biases prevent the sigmoid from saturating at 0 or 1. In the previous post I had convinced myself I needed to increase them to “escape” activations away from the flat part of the sigmoid, but that was unreliable because it depends on the random initialization aligning just right. This is probably one of the main reasons I had to retry (often many times); the retry count is much lower now. A higher learning rate (see the next bullet) seems to be a better way to get away from bad initial weights.
  3. Increase the learning rate from 0.1 to 0.3 to escape poor initial weights. I was hesitant to do this because I thought I was seeing oscillation, but now that I know more about MSE’s weakness in this context, I’m not sure that was the cause. And I can definitely see better convergence reliability at this increased value.
  4. Save the gradient descent step for the end of the epoch, applying the average of the gradients from all of the training examples (sketched after this list). Why? This smooths things out, preventing over-reaction (in part due to the raised learning rate) and actually helping the network converge faster.
  5. Update the convergence threshold (which allows a training run to exit early to save some time) to be a little more realistic.
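
To give a feel for how changes 1–5 fit together, here’s a simplified sketch of the training loop. This is not the notebook’s actual code; the initialization range, epoch count, and convergence threshold are placeholder values of my own.

```python
import numpy as np

rng = np.random.default_rng()
lr = 0.3                                     # change 3: higher learning rate

# change 2: smaller initial weights keep the sigmoids away from their flat regions
# (the +/-0.5 range is illustrative, not necessarily the notebook's value)
W1 = rng.uniform(-0.5, 0.5, size=(2, 2))     # input -> hidden
b1 = np.zeros(2)
W2 = rng.uniform(-0.5, 0.5, size=2)          # hidden -> output
b2 = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
Y = np.array([0, 1, 1, 0], dtype=float)                       # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(20_000):
    # change 4: accumulate gradients over all four examples, apply once per epoch
    gW1, gb1, gW2, gb2 = np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), 0.0
    total_loss = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(W1 @ x + b1)             # hidden activations
        y_hat = sigmoid(W2 @ h + b2)         # output activation

        delta_out = y_hat - y                # change 1: BCE + sigmoid error term
        delta_hid = delta_out * W2 * h * (1 - h)

        gW1 += np.outer(delta_hid, x)
        gb1 += delta_hid
        gW2 += delta_out * h
        gb2 += delta_out
        total_loss += -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    n = len(X)
    W1 -= lr * gW1 / n
    b1 -= lr * gb1 / n
    W2 -= lr * gW2 / n
    b2 -= lr * gb2 / n

    # change 5: exit early once the average BCE loss is low enough
    if total_loss / n < 0.05:                # threshold is illustrative
        break
```

The key structural difference from the last post is that the weight update happens once per epoch, after looping over the whole truth table, rather than after every individual example.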

Here’s the updated notebook:

View this interactive marimo notebook below (best on desktop), or in a separate tab.


I am definitely observing an improvement in convergence reliability. Where the previous approach would often require 20 or more retries (though sometimes very few), the new approach very rarely requires more than 1 or 2. This is just an estimate from running both by hand many times, but that’s what I’m observing.

Can retries be avoided?

We’re able to reduce the number of retries by increasing convergence reliability, but can we eliminate the need for them entirely and guarantee convergence every time? In principle yes, but from here we’d have to either:

  • Put many more constraints on how the weights are initialized, though at this scale I may as well just hand-pick the weights. That’s not the point of the exercise, however; I’m trying to generalize a little bit.
  • Add more hidden neurons to help mitigate the high likelihood of randomly picking symmetric weights.

Follow-Ups

While this does result in an improvement, I did not try a different activation function or add more neurons, either of which might make this network much more reliable. However, I’m finding the time spent on this is hitting diminishing returns: a 2-2-1 network is great for learning the general concepts, but optimizing it further isn’t that worthwhile, at least for me. I’ll be moving on to larger networks and more real-world problems in future blog posts.