week 04 / 12
Backprop from scratch
Gradients come from the chain rule on a computational graph — no magic.
works through ARENA 0.4 · 0.4 Backprop
The chain rule for programmers
A derivative is a local multiplier: if a changes a little, how much does b change? If a feeds into b, and b feeds into c, then the effect of a on c is the product of those local multipliers: dc/da = (dc/db) * (db/da).
For y = log(x * x), the computation has two steps: square, then log. Backprop walks those steps backward, asking each operation how upstream change should be routed to its inputs.
import torch
x = torch.tensor(3.0, requires_grad=True)
y = torch.log(x * x)
y.backward()
print(x.grad)
# tensor(0.6667) because d log(x^2) / dx = 2 / x
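The same number can be recovered by hand, which may make the "local multiplier" idea more concrete. The sketch below uses only the standard library; the intermediate names (s, dy_ds, dy_dx) are illustrative and not part of the original example.

```python
import math

x = 3.0

# Forward pass, one operation at a time.
s = x * x        # square
y = math.log(s)  # log

# Backward pass, visiting the same operations in reverse order.
dy_dy = 1.0                # gradient of the output with respect to itself
dy_ds = dy_dy * (1.0 / s)  # log: the local multiplier is 1 / input
dy_dx = dy_ds * (2.0 * x)  # square: the local multiplier is 2 * x

print(dy_dx)  # close to 0.6667, matching 2 / x
```

Each backward line multiplies the gradient arriving from the output side by one local derivative, which is the chain rule applied one step at a time.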
A graph, not a spell
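Every operation on a tensor that requires gradients adds a node to the graph. The node stores enough information to answer a backward question later: given the gradient of my output, what are the gradients of my inputs? Autograd is the system that records this graph during the forward pass and traverses it in reverse topological order during the backward pass.
For coders, the shape is familiar: build a dependency graph, sort it topologically, then run callbacks in reverse order. The math is local to each operation.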
def log_back(grad_out, x):
    # d log(x) / dx = 1 / x
    return grad_out / x

def multiply_back(grad_out, x, y):
    # d(x * y) / dx = y and d(x * y) / dy = x
    return grad_out * y, grad_out * x
Each backward function receives the gradient of the output and returns gradients for the inputs.
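To see how such functions plug into the "sort, then run callbacks in reverse" loop, here is a minimal scalar sketch. The Node class and the multiply, log, topo_order, and backprop helpers are hypothetical names chosen for illustration; the engine in the ARENA notebook works on arrays and is structured differently.

```python
import math

def log_back(grad_out, x):
    return grad_out / x

def multiply_back(grad_out, x, y):
    return grad_out * y, grad_out * x

class Node:
    """Hypothetical scalar graph node: a value, its parents, a backward function."""
    def __init__(self, value, parents=(), back_fn=None):
        self.value = value
        self.parents = parents
        self.back_fn = back_fn
        self.grad = 0.0

def multiply(a, b):
    return Node(a.value * b.value, (a, b), multiply_back)

def log(a):
    return Node(math.log(a.value), (a,), log_back)

def topo_order(root):
    """Depth-first search that lists every node after all of its parents."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)
    visit(root)
    return order

def backprop(root):
    root.grad = 1.0
    # Reversed topological order: a node is processed only after
    # everything that consumed it has contributed to its gradient.
    for node in reversed(topo_order(root)):
        if node.back_fn is None:
            continue  # leaf node, nothing to propagate
        grads = node.back_fn(node.grad, *(p.value for p in node.parents))
        if not isinstance(grads, tuple):
            grads = (grads,)  # single-input functions return a bare value
        for parent, g in zip(node.parents, grads):
            parent.grad += g  # accumulate, never overwrite

x = Node(3.0)
y = log(multiply(x, x))
backprop(y)
print(x.grad)  # close to 0.6667, matching 2 / x
```

Note that x appears twice as a parent of the multiply node, so its gradient is the sum of two contributions of 1/3 each.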
The two gotchas: reuse and broadcasting
If a tensor is used twice, gradients accumulate. The graph has two paths back to the same value, so the answer is a sum, not an overwrite.
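A quick way to convince yourself of the sum rule is to compare it with a finite-difference estimate. The sketch below is plain Python rather than PyTorch, and the function f and the variable names are made up for the illustration.

```python
def f(x):
    # x reaches the output along two paths: through x * x and through + x
    return x * x + x

x = 3.0

grad_via_square = 2.0 * x  # contribution from the x * x path
grad_via_add = 1.0         # contribution from the + x path
grad_sum = grad_via_square + grad_via_add  # accumulate both paths

# Central finite difference as an independent check.
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(grad_sum)  # 7.0
print(abs(numeric - grad_sum) < 1e-4)
```

Overwriting instead of summing would report either 6.0 or 1.0 here, depending on which path was processed last.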
Broadcasting needs the opposite move on the way back. If the forward pass stretched a tensor from (3,) to (2, 3), the backward pass must sum over the stretched axis to get back to (3,).
import torch
x = torch.ones(2, 3, requires_grad=True)
b = torch.arange(3.0, requires_grad=True)
y = (x + b).sum()
y.backward()
print(b.grad)
# tensor([2., 2., 2.]) because b was used once per row
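What autograd does for b here can be imitated by hand. The sketch below uses nested lists instead of tensors, and unbroadcast_rows is a hypothetical helper that only handles this one case, a (cols,) input stretched to (rows, cols); a general version would have to handle every broadcast axis.

```python
def unbroadcast_rows(grad_out):
    """Sum a (rows, cols) gradient over the row axis, giving shape (cols,)."""
    return [sum(column) for column in zip(*grad_out)]

# Gradient of y = (x + b).sum() with respect to (x + b): all ones, shape (2, 3).
grad_y = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]

grad_b = unbroadcast_rows(grad_y)
print(grad_b)  # [2.0, 2.0, 2.0]
```

This is the reuse rule again: broadcasting b across two rows uses it twice, so the two row gradients are summed.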
From autograd to modules
The notebook finishes by rebuilding Parameter, Module, and Linear, then training MNIST on the tiny engine you wrote. That closes the loop from week 2: the abstractions were never magic, just carefully packaged graph bookkeeping.
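As a rough picture of what that packaging looks like, here is a stripped-down sketch in plain Python. It is not the notebook's implementation: the classes below only mirror the names Parameter, Module, and Linear, store plain lists rather than tensors, and use a fixed initialisation so the output is predictable. TwoLayer is a made-up example model.

```python
class Parameter:
    """Marker wrapper: a value an optimizer is allowed to update."""
    def __init__(self, value):
        self.value = value

class Module:
    def parameters(self):
        """Yield Parameters on this module and, recursively, on child modules."""
        for attr in vars(self).values():
            if isinstance(attr, Parameter):
                yield attr
            elif isinstance(attr, Module):
                yield from attr.parameters()

class Linear(Module):
    def __init__(self, in_features, out_features):
        # Zero weights and unit biases keep the example deterministic;
        # a real layer would initialise the weights randomly.
        self.weight = Parameter([[0.0] * in_features for _ in range(out_features)])
        self.bias = Parameter([1.0] * out_features)

    def forward(self, x):
        # One output per weight row: dot(row, x) + bias
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weight.value, self.bias.value)
        ]

class TwoLayer(Module):
    def __init__(self):
        self.first = Linear(3, 4)
        self.second = Linear(4, 2)

model = TwoLayer()
n_params = len(list(model.parameters()))
out = Linear(3, 2).forward([1.0, 2.0, 3.0])
print(n_params)  # 4: a weight and a bias for each of the two layers
print(out)       # [1.0, 1.0]
```

The only bookkeeping Module does here is walking its attributes to find parameters, which is the part an optimizer relies on.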
Pair-session guide
Core work is ARENA 0.4 sections 1 and 2: primitive backward functions and the backprop engine. Stretch work is rebuilding the neural-network library and training MNIST. Have one person draw the graph while the other writes code for the backward pass; switch often.
What you should see
Your backprop() should agree with PyTorch .backward() on the same small graphs. In the stretch section, MNIST should train using your own mini-autograd engine. The win condition is being able to say where each gradient came from, not memorising derivative notation.
this week's pair session
core
- Sections 1–2: primitive backward functions and the backprop engine
stretch
- Full nn rebuild + MNIST