week 04 / 12

Backprop from scratch

Gradients come from the chain rule on a computational graph — no magic.

works through arena 0.4 · 0.4 Backprop

The chain rule for programmers

A derivative is a local multiplier: if b changes a little, how much does c change? If a changes b, and b changes c, then the effect of a on c is the product of those local multipliers: dc/da = (dc/db) * (db/da).

For y = log(x * x), the computation has two steps: square, then log. Backprop walks those steps backward, asking each operation how the gradient arriving at its output should be routed to its inputs.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.log(x * x)
y.backward()
print(x.grad)
# tensor(0.6667) because d log(x^2) / dx = 2 / x
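
One way to sanity-check the 2 / x claim without autograd is to multiply the two local derivatives by hand and compare against a finite-difference estimate. The sketch below is only an illustration; the helper names `local_multipliers` and `finite_difference` are made up for this example and are not part of PyTorch or the notebook.

```python
import math

def f(x):
    # Same two-step computation as above: square, then log.
    return math.log(x * x)

def local_multipliers(x):
    # Local derivative of each step, evaluated at the forward values.
    b = x * x
    db_dx = 2 * x   # d(x * x) / dx
    dy_db = 1 / b   # d(log b) / db
    return dy_db, db_dx

def finite_difference(fn, x, eps=1e-6):
    # Central difference: nudge x both ways and measure the change in the output.
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x = 3.0
dy_db, db_dx = local_multipliers(x)
chain_rule_grad = dy_db * db_dx            # (1/9) * 6 = 2/3
numeric_grad = finite_difference(f, x)     # should land very close to 2/3

print(chain_rule_grad, numeric_grad)
```

Both numbers should come out near 0.6667, which is the value PyTorch reported.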

A graph, not a spell

Every operation on a tensor that requires gradients creates a node. The node stores enough information to answer a backward question later. Autograd is the system that records this graph and traverses it in reverse topological order.

For coders, the shape is familiar: build a dependency graph, sort it topologically, run callbacks in reverse. The math is local to each operation.

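def log_back(grad_out, x):
    # d log(x) / dx = 1 / x, scaled by the gradient arriving at the output
    return grad_out / x

def multiply_back(grad_out, x, y):
    # d(x * y) / dx = y and d(x * y) / dy = x; one gradient per input
    return grad_out * y, grad_out * x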

Each backward function receives the gradient of the output and returns gradients for the inputs.
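
To see how such functions plug into the graph traversal described above, here is a minimal scalar-only sketch. Everything in it (`Node`, `mul`, `log`, `topological_order`, `toy_backprop`) is a hypothetical name chosen for illustration; it is not the notebook's API or PyTorch's internals, and it skips tensors entirely.

```python
import math

class Node:
    """A value in the graph, plus how to route gradients to its parents."""
    def __init__(self, value, parents=(), backward_fn=None):
        self.value = value
        self.parents = parents          # input Nodes this value was computed from
        self.backward_fn = backward_fn  # maps grad_out to one gradient per parent
        self.grad = 0.0

def log_back(grad_out, x):
    return grad_out / x

def multiply_back(grad_out, x, y):
    return grad_out * y, grad_out * x

def mul(a, b):
    return Node(a.value * b.value, (a, b),
                lambda g: multiply_back(g, a.value, b.value))

def log(a):
    return Node(math.log(a.value), (a,),
                lambda g: (log_back(g, a.value),))

def topological_order(root):
    # Depth-first search: a node is appended only after all of its parents.
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)
    visit(root)
    return order

def toy_backprop(root):
    root.grad = 1.0  # d(root) / d(root)
    for node in reversed(topological_order(root)):
        if node.backward_fn is None:
            continue  # leaf node: no inputs to route gradients to
        for parent, g in zip(node.parents, node.backward_fn(node.grad)):
            parent.grad += g  # add, in case this parent is reached more than once

x = Node(3.0)
y = log(mul(x, x))
toy_backprop(y)
print(x.grad)  # close to 2 / 3
```

Visiting nodes in reverse topological order means a node's gradient is complete before it is passed further back, which is why the sort matters.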

The two gotchas: reuse and broadcasting

If a tensor is used twice, gradients accumulate. The graph has two paths back to the same value, so the answer is a sum, not an overwrite.
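
The running example y = log(x * x) already contains this case, since x feeds both inputs of the multiply. As a hand-worked illustration (plain floats, with the variable names invented here), compare overwriting with summing:

```python
def log_back(grad_out, x):
    return grad_out / x

def multiply_back(grad_out, x, y):
    return grad_out * y, grad_out * x

x = 3.0
squared = x * x

# Gradient of log(squared) with respect to squared: 1 / 9.
grad_squared = log_back(1.0, squared)

# x is both inputs of the multiply, so it receives two gradients.
grad_left, grad_right = multiply_back(grad_squared, x, x)

overwritten = grad_right              # keeps only the last path: about 1/3
accumulated = grad_left + grad_right  # sums both paths: about 2/3, matching 2 / x

print(overwritten, accumulated)
```

Only the accumulated value agrees with the 2 / x result from the first example.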

Broadcasting needs the opposite move on the way back. If the forward pass stretched a tensor from (3,) to (2, 3), each original element fed several outputs, so the backward pass must sum over the stretched axis to get back to (3,).

import torch

x = torch.ones(2, 3, requires_grad=True)
b = torch.arange(3.0, requires_grad=True)
y = (x + b).sum()
y.backward()
print(b.grad)
# tensor([2., 2., 2.]) because b was used once per row
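
The summing step can be written as a small helper. The version below is a sketch in NumPy rather than PyTorch, and the name `unbroadcast` and its signature are assumptions made for this example; it handles the two things broadcasting can do, prepending axes and stretching size-1 axes.

```python
import numpy as np

def unbroadcast(grad, original_shape):
    """Sum `grad` down to `original_shape`, undoing NumPy-style broadcasting."""
    # Axes that broadcasting prepended on the left are summed away entirely.
    extra_dims = grad.ndim - len(original_shape)
    if extra_dims > 0:
        grad = grad.sum(axis=tuple(range(extra_dims)))
    # Axes that had size 1 and were stretched are summed but kept as size 1.
    stretched = tuple(
        i for i, size in enumerate(original_shape)
        if size == 1 and grad.shape[i] != 1
    )
    if stretched:
        grad = grad.sum(axis=stretched, keepdims=True)
    return grad

grad_out = np.ones((2, 3))  # gradient of (x + b).sum() with respect to x + b

grad_b = unbroadcast(grad_out, (3,))       # b had shape (3,)
grad_col = unbroadcast(grad_out, (2, 1))   # a hypothetical (2, 1) input

print(grad_b)          # [2. 2. 2.]
print(grad_col.shape)  # (2, 1)
```

The first call reproduces the b.grad values from the PyTorch example: each element of b contributed to two rows, so it collects two units of gradient.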

From autograd to modules

The notebook finishes by rebuilding Parameter, Module, and Linear, then training an MNIST classifier on the tiny engine you wrote. That closes the loop from week 2: the abstractions were never magic, just carefully packaged graph bookkeeping.

Pair-session guide

Core work is ARENA 0.4 sections 1 and 2: primitive backward functions and the backprop engine. Stretch work is rebuilding the neural-network library and training MNIST. Have one person draw the graph while the other writes code for the backward pass; switch often.

What you should see

Your backprop() should agree with PyTorch .backward() on the same small graphs. In the stretch section, MNIST should train using your own mini-autograd engine. The win condition is being able to say where each gradient came from, not memorising derivative notation.

this week's pair session

core

1–2 — backward functions and the backprop engine

stretch

Full nn rebuild + MNIST