week 03 / 12
How training actually works
Training = rolling downhill on a loss landscape.
works through arena 0.3 · 0.3 Optimization
Loss is the terrain
The loss is one number measuring how bad the model is on a batch. Imagine that number as height and every model parameter as a coordinate. Real models have millions of coordinates, so the picture is impossible to draw, but the local idea still works: if the loss slopes upward in one direction, move the weights in the opposite direction.
SGD is the plain version:
weight = weight - lr * grad
Here lr is the learning rate. Too high and the optimizer jumps over the valley or diverges; too low and training crawls.
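To make the learning-rate trade-off concrete, here is a minimal sketch that runs the update above on a made-up one-parameter loss, loss(w) = w**2. The function name run_sgd, the starting point, and the three learning rates are assumptions chosen for illustration, not values from the notebook.

```python
def run_sgd(lr, steps=20, start=1.0):
    """Minimise the toy loss w**2 with plain SGD and return the final weight."""
    weight = start
    for _ in range(steps):
        grad = 2.0 * weight          # derivative of w**2 with respect to w
        weight = weight - lr * grad  # the SGD update from the text
    return weight

# Three illustrative learning rates: too low, reasonable, too high.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final weight = {run_sgd(lr):.4g}")
```

On this loss each step multiplies the weight by (1 - 2 * lr), so the small rate barely moves, the middle rate shrinks the weight toward zero, and the large rate makes the weight grow in magnitude while flipping sign every step.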
Momentum and Adam are engineering fixes
Plain SGD can zig-zag in narrow valleys. Momentum remembers recent direction, like a ball that has velocity:
velocity = 0.9 * velocity + grad
weight = weight - lr * velocity
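One way to see the difference is a toy narrow valley that is shallow along x and steep along y. The sketch below is an illustration under assumed values: the loss, the starting point, and lr=0.015 are all made up, and the outcome depends on those choices.

```python
def grad(x, y):
    """Gradient of the toy loss 0.5 * (x**2 + 100 * y**2): shallow in x, steep in y."""
    return x, 100.0 * y

def loss(x, y):
    return 0.5 * (x ** 2 + 100.0 * y ** 2)

def descend(lr, momentum, steps=100, start=(10.0, 1.0)):
    """Run the momentum update from the text; momentum=0.0 reduces it to plain SGD."""
    x, y = start
    vx, vy = 0.0, 0.0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = momentum * vx + gx   # velocity = momentum * velocity + grad
        vy = momentum * vy + gy
        x = x - lr * vx           # weight = weight - lr * velocity
        y = y - lr * vy
        path.append((x, y))
    return path

sgd_path = descend(lr=0.015, momentum=0.0)
mom_path = descend(lr=0.015, momentum=0.9)
print("SGD final loss:     ", loss(*sgd_path[-1]))
print("momentum final loss:", loss(*mom_path[-1]))
```

With these settings plain SGD flips sign along the steep y direction on every step while crawling along x. The momentum run builds up velocity along x and ends at a much lower loss after the same number of steps.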
Adam goes further by keeping two running estimates for each parameter, one of the gradient and one of the squared gradient, then using them to adapt the step size per parameter. You do not need to worship the acronym; in the notebook you implement it as bookkeeping over tensors.
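As a rough picture of that bookkeeping, here is a sketch of one Adam step over plain Python lists instead of tensors. The function name adam_step and the example gradients are assumptions for illustration. The default values follow the commonly used Adam defaults, and the notebook's own signature may differ.

```python
import math

def adam_step(weights, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over parallel lists; returns new (weights, m, v)."""
    new_w, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(weights, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction (t starts at 1)
        v_hat = v_i / (1 - beta2 ** t)
        new_w.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Two parameters whose gradients differ by a factor of 10,000 (made-up values).
weights = [1.0, 1.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
grads = [0.01, 100.0]
weights, m, v = adam_step(weights, grads, m, v, t=1)
print(weights)
```

On this first step both weights move by roughly lr even though their gradients differ enormously, which is the per-parameter adaptation in action.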
The optimizer loop
Here is the whole shape of one training step, with PyTorch doing the gradient calculation. Real training repeats the last six lines over many batches:
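import torch
# A one-layer model and a plain SGD optimizer over its parameters.
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# A random batch: 8 examples with 2 features each, and 1 target each.
x = torch.randn(8, 2)
y = torch.randn(8, 1)
pred = model(x)                    # forward pass
loss = ((pred - y) ** 2).mean()    # mean squared error on the batch
opt.zero_grad()                    # clear gradients left over from earlier steps
loss.backward()                    # compute the gradient of the loss for each parameter
opt.step()                         # apply weight = weight - lr * grad
print(loss.item())                 # the loss measured before this update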
Week 2 introduced this loop as ritual. This week names each moving part and asks you to implement optimizers yourself.
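To name the moving parts without any library, the sketch below repeats the same step for a one-feature linear model with hand-derived gradients. The dataset, learning rate, and step count are made up for illustration. The comments map each stage to the PyTorch call it stands in for.

```python
# Tiny dataset drawn from y = 2x + 1 (made up for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
lr = 0.05
losses = []

for step in range(200):
    # Forward pass: predictions and mean squared error (model(x) and the loss line).
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)
    losses.append(loss)

    # Stand-in for zero_grad() + backward(): gradients are recomputed from scratch
    # each step. PyTorch accumulates gradients instead, hence the explicit zero_grad().
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2 * e for e in errors) / len(xs)

    # Stand-in for step(): the plain SGD update.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.6f}")
```

The fitted w and b should land close to 2 and 1, and the recorded losses give the kind of curve you will later compare across optimizers.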
Hyperparameters and experiment tracking
A hyperparameter is a setting you choose rather than learn: learning rate, batch size, momentum, weight decay, schedule. The optimizer can only optimize weights; humans still choose the training recipe.
Weights & Biases (W&B) is a dashboard for comparing runs. A sweep is a structured set of runs over different hyperparameters. The value is not the fancy UI; it is being able to answer “which change actually helped?” without relying on memory.
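The idea of a sweep can be sketched without any dashboard: run every combination in a grid, record the settings next to the result, and sort. This is not the W&B API. The function train_once is a hypothetical stand-in for a real training run, and the grid values are made up.

```python
import itertools

def train_once(lr, momentum, steps=50):
    """Hypothetical stand-in for a training run: minimise w**2, return the final loss."""
    weight, velocity = 5.0, 0.0
    for _ in range(steps):
        grad = 2.0 * weight
        velocity = momentum * velocity + grad
        weight = weight - lr * velocity
    return weight ** 2

# The grid of hyperparameters to try (illustrative values).
grid = {"lr": [0.001, 0.01, 0.1], "momentum": [0.0, 0.9]}

runs = []
for lr, momentum in itertools.product(grid["lr"], grid["momentum"]):
    runs.append({"lr": lr, "momentum": momentum, "final_loss": train_once(lr, momentum)})

# Sort so the best configuration comes first, then print one line per run.
runs.sort(key=lambda run: run["final_loss"])
for run in runs:
    print(run)
```

Keeping the settings and the outcome in the same record is the part that matters; a tracking tool adds storage, plots, and search on top of that.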
Pair-session guide
Core work is ARENA 0.3 section 1: implement SGD, momentum, RMSprop, and Adam, then compare optimizer trajectories. Stretch work is W&B sweeps on the image classifier and the distributed-training reading. Pair rule: when an optimizer test fails, compare update equations line by line before changing code.
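RMSprop is the one optimizer in that list not described earlier, so here is a minimal sketch of its update for a single scalar weight. The function name rmsprop_step, the toy loss, and the hyperparameter values are assumptions for illustration; compare against the notebook's own equations, which may include extra terms such as momentum or weight decay.

```python
import math

def rmsprop_step(weight, grad, sq_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSprop update for a single scalar weight; returns (weight, sq_avg)."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad * grad   # running mean of squared gradients
    weight = weight - lr * grad / (math.sqrt(sq_avg) + eps)
    return weight, sq_avg

weight, sq_avg = 3.0, 0.0
for _ in range(500):
    grad = 2.0 * weight          # gradient of the toy loss weight**2
    weight, sq_avg = rmsprop_step(weight, grad, sq_avg)
print(weight)
```

The division by the running root-mean-square of the gradient is the piece Adam reuses; Adam adds a running mean of the gradient itself plus bias correction.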
What you should see
You should see optimizer trajectory plots where SGD zig-zags, momentum smooths the path, and Adam often reaches the basin quickly. If you do the W&B stretch, you should see multiple runs on your dashboard and be able to point to the learning rate or schedule that changed the curve.
this week's pair session
core
- Section 1: implement SGD, momentum, RMSprop, and Adam
stretch
- W&B sweeps
- Distributed training overview