Backpropagation

Tags: machine-learning, optimization, calculus

In a previous post, we defined the loss function as a measure of how wrong a model's predictions are, and minimized it with gradient descent. But we never addressed how to efficiently compute the gradient of the loss with respect to every parameter in a large neural network. That's what backpropagation does.

The "gradient" in gradient descent is a vector of partial derivatives of the loss with respect to each parameter. A partial derivative is just a ratio: how much $y$ changes when we change $x$ by a small amount. If $\frac{\partial y}{\partial x} = c$, then changing $x$ by $\epsilon$ causes $y$ to change by approximately $c\epsilon$, and this approximation is more accurate for smaller $\epsilon$.

What if we want $\frac{\partial z}{\partial x}$ but we only know the intermediate partial derivatives $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$? The chain rule says we multiply them: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$. If changing $x$ by $\epsilon$ causes $y$ to change by $\frac{\partial y}{\partial x}\,\epsilon$, and changing $y$ by $\delta$ causes $z$ to change by $\frac{\partial z}{\partial y}\,\delta$, then changing $x$ by $\epsilon$ causes $z$ to change by $\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}\,\epsilon$. This can be seen by letting $\delta = \frac{\partial y}{\partial x}\,\epsilon$.

What if the dependency isn't a single chain? If $x$ affects $z$ through two intermediate variables $u$ and $v$, we sum the contributions: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial x}$.
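To make this concrete, here is a small numeric check of the multi-path chain rule. The particular functions ($u = x^2$, $v = \sin x$, $z = uv$) and the point $x = 1.3$ are arbitrary choices for illustration:

```python
import math

# z depends on x through two paths: u = x^2 and v = sin(x), with z = u * v.
def z_of(x):
    return (x ** 2) * math.sin(x)

x = 1.3
u, v = x ** 2, math.sin(x)
dz_du, dz_dv = v, u               # partials of z = u * v
du_dx, dv_dx = 2 * x, math.cos(x)

# Sum the contribution of each path.
analytic = dz_du * du_dx + dz_dv * dv_dx

# Compare against a finite-difference estimate: nudge x and measure z.
eps = 1e-6
numeric = (z_of(x + eps) - z_of(x)) / eps

print(analytic, numeric)  # the two estimates agree to several decimal places
```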

A neural network is a long chain of differentiable operations, where the loss depends on each parameter through many intermediate layers. Backpropagation is just the systematic application of the chain rule, working backwards from the loss through each operation to compute every partial derivative. The most fundamental operation in a neural network is matrix multiplication, so let's start there.

The Gradient of a Matrix Multiplication

Suppose $C = AB$, where $A$ is $m \times n$, $B$ is $n \times p$, and $C$ is $m \times p$:

$$
\begin{pmatrix} c_{11} & \cdots & c_{1p} \\ \vdots & & \vdots \\ c_{m1} & \cdots & c_{mp} \end{pmatrix}
=
\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} b_{11} & \cdots & b_{1p} \\ \vdots & & \vdots \\ b_{n1} & \cdots & b_{np} \end{pmatrix}
$$

Each element of $C$ is a dot product of a row of $A$ with a column of $B$:

$$
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
$$

From these equations we can read off partial derivatives directly. For example, $c_{11} = a_{11}b_{11} + a_{12}b_{21} + \cdots + a_{1n}b_{n1}$, so $\frac{\partial c_{11}}{\partial a_{11}} = b_{11}$ and $\frac{\partial c_{11}}{\partial b_{21}} = a_{12}$. Each $c_{ij}$ depends only on row $i$ of $A$ and column $j$ of $B$, so most partial derivatives are zero: $\frac{\partial c_{11}}{\partial a_{21}} = 0$ since $a_{21}$ doesn't appear in $c_{11}$.
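These claims are easy to spot-check numerically. The sketch below (matrix sizes and random values are arbitrary) nudges single entries of $A$ and confirms that $\partial c_{11}/\partial a_{11} = b_{11}$ while $\partial c_{11}/\partial a_{21} = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))   # m = 2, n = 3
B = rng.normal(size=(3, 4))   # n = 3, p = 4

def c11(M):
    """First entry of the product M @ B."""
    return (M @ B)[0, 0]

eps = 1e-6

# Nudge a11: the finite-difference ratio should be b11.
A_pert = A.copy()
A_pert[0, 0] += eps
d_a11 = (c11(A_pert) - c11(A)) / eps

# Nudge a21: c11 doesn't involve row 2 of A, so the ratio should be 0.
A_pert = A.copy()
A_pert[1, 0] += eps
d_a21 = (c11(A_pert) - c11(A)) / eps

print(d_a11, B[0, 0], d_a21)
```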

Now suppose $C$ feeds into some downstream computation that produces a scalar loss $L$. We're given $\frac{\partial L}{\partial c_{ij}}$ for every element of $C$, and want to find $\frac{\partial L}{\partial A}$ and $\frac{\partial L}{\partial B}$.

Consider $a_{11}$. It appears in $c_{11}, c_{12}, \ldots, c_{1p}$, which is the entire first row of $C$. By the chain rule, we sum the effect through each of these:

$$
\frac{\partial L}{\partial a_{11}} = \sum_{j=1}^{p} \frac{\partial L}{\partial c_{1j}} \frac{\partial c_{1j}}{\partial a_{11}}
$$

Substituting the partial derivatives we computed above, $\frac{\partial c_{1j}}{\partial a_{11}} = b_{1j}$:

$$
\frac{\partial L}{\partial a_{11}} = \sum_{j=1}^{p} \frac{\partial L}{\partial c_{1j}}\, b_{1j}
$$

Writing out $\frac{\partial L}{\partial C}$ as a matrix:

$$
\frac{\partial L}{\partial C} =
\begin{pmatrix}
\frac{\partial L}{\partial c_{11}} & \cdots & \frac{\partial L}{\partial c_{1p}} \\
\vdots & & \vdots \\
\frac{\partial L}{\partial c_{m1}} & \cdots & \frac{\partial L}{\partial c_{mp}}
\end{pmatrix}
$$

we see that $\frac{\partial L}{\partial a_{11}}$ is the dot product of row $1$ of $\frac{\partial L}{\partial C}$ with $(b_{11}, \ldots, b_{1p})$, which is row $1$ of $B$. In other words, it's the $(1,1)$ entry of $\frac{\partial L}{\partial C}\, B^{\top}$.

The same pattern holds for every entry: $\frac{\partial L}{\partial a_{ik}}$ equals the dot product of row $i$ of $\frac{\partial L}{\partial C}$ with row $k$ of $B$. So the gradient of $A$ is itself a matrix multiplication:

$$
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}\, B^{\top}
$$

Similarly, $b_{11}$ appears in $c_{11}, c_{21}, \ldots, c_{m1}$, which is the entire first column of $C$:

$$
\frac{\partial L}{\partial b_{11}} = \sum_{i=1}^{m} \frac{\partial L}{\partial c_{i1}} \frac{\partial c_{i1}}{\partial b_{11}} = \sum_{i=1}^{m} a_{i1}\, \frac{\partial L}{\partial c_{i1}}
$$

which is the $(1,1)$ entry of $A^{\top}\, \frac{\partial L}{\partial C}$. In general:

$$
\frac{\partial L}{\partial B} = A^{\top}\, \frac{\partial L}{\partial C}
$$

The dimensions confirm this: $\frac{\partial L}{\partial A}$ should be $m \times n$ like $A$, and $\frac{\partial L}{\partial C}\, B^{\top}$ is $(m \times p)(p \times n) = m \times n$. Likewise $\frac{\partial L}{\partial B}$ should be $n \times p$ like $B$, and $A^{\top}\, \frac{\partial L}{\partial C}$ is $(n \times m)(m \times p) = n \times p$.

To compute these gradients, the only information we needed from upstream was $\frac{\partial L}{\partial C}$: how much the loss changes per unit change in each element of $C$. We didn't need to know anything about what happens after $C$.
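Both formulas can be verified with finite differences. The downstream computation below, $L = \sum_{ij} c_{ij}^2$, is an arbitrary stand-in chosen only because its gradient $\frac{\partial L}{\partial C} = 2C$ is easy to write down:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

def loss(A, B):
    # Arbitrary downstream scalar: sum of squared entries of C = A @ B.
    return float(np.sum((A @ B) ** 2))

C = A @ B
dL_dC = 2 * C                 # gradient of the chosen loss w.r.t. C

dL_dA = dL_dC @ B.T           # dL/dA = dL/dC B^T
dL_dB = A.T @ dL_dC           # dL/dB = A^T dL/dC

# Finite-difference check on one entry of each gradient.
eps = 1e-6
A_pert = A.copy(); A_pert[1, 2] += eps
fd_A = (loss(A_pert, B) - loss(A, B)) / eps

B_pert = B.copy(); B_pert[0, 3] += eps
fd_B = (loss(A, B_pert) - loss(A, B)) / eps

print(dL_dA[1, 2], fd_A)
print(dL_dB[0, 3], fd_B)
```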

A Small Neural Network

Now let's apply this to a neural network with a vector of inputs $x$, a hidden layer of ReLU neurons ($\mathrm{ReLU}(z) = \max(0, z)$, applied elementwise), and a single output $y$:

$$
h = \mathrm{ReLU}(W_1 x), \qquad y = W_2 h
$$

The loss for a target $t$ is the squared error $L = (y - t)^2$.

The forward pass computes each layer in sequence:

$$
z = W_1 x, \qquad h = \mathrm{ReLU}(z), \qquad y = W_2 h, \qquad L = (y - t)^2
$$

Now we work backwards from the loss, applying the chain rule through each operation:

$$
\frac{\partial L}{\partial y} = 2(y - t)
$$

Since $y = W_2 h$, the gradient flows to both $W_2$ and $h$:

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial y}\, h^{\top}, \qquad \frac{\partial L}{\partial h} = W_2^{\top}\, \frac{\partial L}{\partial y}
$$

Through the ReLU, whose derivative is $1$ where $z_i > 0$ and $0$ where $z_i < 0$:

$$
\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial h_i} \cdot \mathbf{1}[z_i > 0]
$$

Finally, $z = W_1 x$ is a matrix multiplication, so applying the formula from the previous section:

$$
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z}\, x^{\top}
$$

Any row of $\frac{\partial L}{\partial W_1}$ belonging to a neuron with $z_i < 0$ is entirely zero: the ReLU zeroed that neuron's output in the forward pass, and now zeros out its gradient in the backward pass.

The parameters of this network are $W_1$ and $W_2$, and we computed their gradients:

$$
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z}\, x^{\top}, \qquad \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial y}\, h^{\top}
$$
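The sketch below runs the whole forward and backward pass with assumed example values: two inputs, three hidden neurons, and specific entries for $x$, $t$, $W_1$, and $W_2$ that are our own choices, purely for illustration. With these values the second neuron's pre-activation is negative, so the second row of $\frac{\partial L}{\partial W_1}$ comes out all zeros:

```python
import numpy as np

# Assumed example values (our choices, not canonical).
x = np.array([[1.0], [2.0]])              # 2 inputs, as a column vector
t = 0.5                                   # target
W1 = np.array([[ 0.5, -0.2],
               [-1.0, -0.5],
               [ 0.3,  0.8]])             # 3 hidden neurons x 2 inputs
W2 = np.array([[1.0, -0.7, 0.2]])         # 1 output x 3 hidden neurons

# Forward pass.
z = W1 @ x                    # pre-activations; z[1] is negative here
h = np.maximum(z, 0.0)        # ReLU
y = W2 @ h                    # output, shape (1, 1)
L = float((y - t) ** 2)

# Backward pass: chain rule through each operation in reverse.
dL_dy = 2 * (y - t)
dL_dW2 = dL_dy @ h.T          # y = W2 h  =>  dL/dW2 = dL/dy h^T
dL_dh = W2.T @ dL_dy          #           =>  dL/dh  = W2^T dL/dy
dL_dz = dL_dh * (z > 0)       # ReLU gates the gradient
dL_dW1 = dL_dz @ x.T          # z = W1 x  =>  dL/dW1 = dL/dz x^T

print(dL_dW1)                 # note the all-zero second row
print(dL_dW2)
```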

Each gradient entry tells us how the loss changes when we adjust that parameter. To verify them all at once, we can increase every parameter by a small $\epsilon$. The predicted change in loss is $\epsilon$ times the sum of all gradient entries:

$$
\Delta L \approx \epsilon \sum_{\theta} \frac{\partial L}{\partial \theta}
$$

where $\theta$ ranges over every entry of $W_1$ and $W_2$.

Rerunning the forward pass with $W_1$ and $W_2$ perturbed this way, the actual change in $L$ closely matches the predicted change.
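This check can be scripted directly, again with assumed example values (our own, not canonical): add $\epsilon$ to every entry of $W_1$ and $W_2$, rerun the forward pass, and compare the actual change in loss against $\epsilon$ times the sum of the gradient entries:

```python
import numpy as np

# Assumed example values for the two-layer ReLU network (our choices).
x = np.array([[1.0], [2.0]]); t = 0.5
W1 = np.array([[0.5, -0.2], [-1.0, -0.5], [0.3, 0.8]])
W2 = np.array([[1.0, -0.7, 0.2]])

def loss(W1, W2):
    h = np.maximum(W1 @ x, 0.0)
    return float((W2 @ h - t) ** 2)

# Backward pass using the formulas derived above.
z = W1 @ x
h = np.maximum(z, 0.0)
dL_dy = 2 * ((W2 @ h) - t)
dL_dW2 = dL_dy @ h.T
dL_dW1 = ((W2.T @ dL_dy) * (z > 0)) @ x.T

# Perturb every parameter by eps (small enough that no ReLU sign flips)
# and compare the actual change in loss against the prediction.
eps = 1e-4
predicted = eps * (np.sum(dL_dW1) + np.sum(dL_dW2))
actual = loss(W1 + eps, W2 + eps) - loss(W1, W2)
print(predicted, actual)
```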

In practice, we use these gradients to improve the model through gradient descent: each parameter is updated by subtracting $\eta$ times its gradient, where $\eta$ is a small learning rate:

$$
W_1 \leftarrow W_1 - \eta\, \frac{\partial L}{\partial W_1}, \qquad W_2 \leftarrow W_2 - \eta\, \frac{\partial L}{\partial W_2}
$$

The predicted change in loss sums each parameter's contribution:

$$
\Delta L \approx \sum_{\theta} \frac{\partial L}{\partial \theta}\, \Delta\theta = -\eta \sum_{\theta} \left( \frac{\partial L}{\partial \theta} \right)^{2}
$$

which is never positive: a small enough step against the gradient always decreases the loss.

Rerunning the forward pass with the updated parameters, the actual decrease in $L$ closely matches this prediction.
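One full descent step can be checked the same way, with assumed example values and an assumed learning rate $\eta = 0.01$ (both our own choices): the loss drops, and the drop is close to $-\eta \sum_{\theta} (\partial L/\partial \theta)^2$, with the mismatch coming from the second-order terms the linear approximation ignores:

```python
import numpy as np

# Assumed example values and learning rate (our choices).
x = np.array([[1.0], [2.0]]); t = 0.5
W1 = np.array([[0.5, -0.2], [-1.0, -0.5], [0.3, 0.8]])
W2 = np.array([[1.0, -0.7, 0.2]])
eta = 0.01

def loss(W1, W2):
    h = np.maximum(W1 @ x, 0.0)
    return float((W2 @ h - t) ** 2)

# Backward pass using the formulas derived above.
z = W1 @ x
h = np.maximum(z, 0.0)
dL_dy = 2 * ((W2 @ h) - t)
dL_dW2 = dL_dy @ h.T
dL_dW1 = ((W2.T @ dL_dy) * (z > 0)) @ x.T

# One gradient-descent step.
L = loss(W1, W2)
L_new = loss(W1 - eta * dL_dW1, W2 - eta * dL_dW2)

# First-order prediction for the change in loss.
predicted = -eta * (np.sum(dL_dW1 ** 2) + np.sum(dL_dW2 ** 2))
print(L_new - L, predicted)   # both negative and close in magnitude
```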


That's all backpropagation is: a forward pass to compute the loss, then a backward pass applying the chain rule through each operation, tracking how much the loss changes with respect to each variable. Each operation only needs its local inputs and the gradient flowing back from downstream, so it's a computationally efficient way to determine how to nudge the parameters of a model to decrease the loss.