week 03 / 12

How training actually works

Training = rolling downhill on a loss landscape.

works through arena 0.3 · 0.3 Optimization

Loss is the terrain

The loss is one number measuring how bad the model is on a batch. Imagine that number as height and every model parameter as a coordinate. Real models have millions of coordinates, so the picture is impossible to draw, but the local idea still works: if the loss slopes upward in one direction, move the weights in the opposite direction.

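SGD (stochastic gradient descent) is the plain version. The gradient, grad, points uphill, so each step subtracts a small multiple of it: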

weight = weight - lr * grad

The lr is the learning rate. Too high and the optimizer jumps over the valley or diverges; too low and training crawls.
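To make the learning-rate trade-off concrete, here is a minimal sketch in plain Python. The toy loss (w - 3) ** 2, the helper names grad and run_sgd, and the three learning rates are assumptions chosen for illustration; none of them come from the ARENA notebook.

```python
def grad(w):
    # Gradient of the toy loss (w - 3) ** 2, whose minimum sits at w = 3.
    return 2.0 * (w - 3.0)


def run_sgd(lr, steps=50, w=0.0):
    # Repeat the plain SGD update: weight = weight - lr * grad.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Hypothetical learning rates: one too low, one reasonable, one too high.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {run_sgd(lr):.4g}")
```

With these settings the smallest rate should barely leave the starting point, the middle one should land close to 3, and the largest should overshoot further on every step.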

Momentum and Adam are engineering fixes

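Plain SGD can zig-zag in narrow valleys. Momentum remembers the recent direction of travel, like a ball that has velocity. The 0.9 below is the momentum coefficient, which controls how much of the old velocity survives each step: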

velocity = 0.9 * velocity + grad
weight = weight - lr * velocity
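The zig-zag is easier to believe once you watch it. The sketch below runs the update above on a hand-made narrow valley, once with the coefficient set to 0.0 (plain SGD) and once with 0.9. The loss 0.5 * (x**2 + 100 * y**2), the starting point, the learning rate 0.018, and the helper names are all assumptions picked so the effect is visible; other settings can behave differently.

```python
import math


def grad(x, y):
    # Gradient of the toy loss 0.5 * (x**2 + 100 * y**2): a valley that is
    # 100 times steeper across (y) than along (x).
    return x, 100.0 * y


def run(lr, beta, steps=200, start=(10.0, 1.0)):
    """beta = 0.0 gives plain SGD; beta = 0.9 matches the update above."""
    x, y = start
    vx, vy = 0.0, 0.0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x = x - lr * vx
        y = y - lr * vy
        path.append((x, y))
    return path


def distance_to_minimum(point):
    # The minimum of the toy loss is at the origin.
    return math.hypot(point[0], point[1])


sgd_path = run(lr=0.018, beta=0.0)
momentum_path = run(lr=0.018, beta=0.9)

print("SGD y, first steps:", [round(p[1], 3) for p in sgd_path[:5]])
print("SGD distance left:     ", distance_to_minimum(sgd_path[-1]))
print("momentum distance left:", distance_to_minimum(momentum_path[-1]))
```

In the SGD run, y flips sign on every step while x shrinks slowly. On this toy, with the same learning rate, the momentum run should finish much closer to the minimum.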

Adam goes further by keeping two running estimates for each parameter, the average gradient and the average squared gradient, then using them to adapt the step size per parameter. You do not need to worship the acronym; in the notebook you implement it as bookkeeping over tensors.
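As a rough picture of that bookkeeping, here is one Adam update written over plain Python lists instead of tensors. It follows the standard published update rule with its usual default coefficients. The function name adam_step, the state dictionary layout, and the example gradients are assumptions for illustration, and the notebook's version may be organised differently.

```python
import math


def adam_step(weights, grads, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over plain Python lists; state holds the bookkeeping."""
    state["t"] += 1
    t = state["t"]
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        # Running average of the gradient (which way to go) ...
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        # ... and of the squared gradient (how large gradients tend to be).
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Both averages start at zero, so early values are scaled back up.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Dividing by the gradient size gives each parameter its own step size.
        new_weights.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_weights


# Two hypothetical parameters whose gradients differ by a factor of a million.
weights = [0.0, 0.0]
grads = [1000.0, 0.001]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
weights = adam_step(weights, grads, state)
print(weights)
```

Both parameters should move by roughly lr on this first step even though their gradients are wildly different in scale. That is the per-parameter adaptation in action; plain SGD would move the first parameter a million times further than the second.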

The optimizer loop

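Here is the whole shape of one training step, with PyTorch doing the gradient calculation. A real run repeats everything from the forward pass through opt.step() for many batches: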

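import torch

# A tiny model and a plain SGD optimizer over its parameters.
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Random inputs and targets stand in for one batch of real data.
x = torch.randn(8, 2)
y = torch.randn(8, 1)

pred = model(x)                  # forward pass
loss = ((pred - y) ** 2).mean()  # mean squared error on the batch
opt.zero_grad()                  # clear gradients left over from earlier steps
loss.backward()                  # fill in .grad for every parameter
opt.step()                       # apply weight = weight - lr * grad
print(loss.item())               # the loss measured before this update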

Week 2 introduced this loop as ritual. This week names each moving part and asks you to implement optimizers yourself.

Hyperparameters and experiment tracking

A hyperparameter is a setting you choose rather than learn: learning rate, batch size, momentum, weight decay, schedule. The optimizer can only optimize weights; humans still choose the training recipe.

Weights & Biases (W&B) is a dashboard for comparing runs. A sweep is a structured set of runs over different hyperparameters. The value is not fancy UI; it is being able to answer “which change actually helped?” without relying on memory.
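The idea of a sweep can be sketched without any dashboard at all. The code below is a plain-Python stand-in and does not use the W&B API: it tries every combination in a small grid, records the final loss for each run, and sorts the results. The train function, the toy task of fitting y = 2x with a single weight, and the grid values are all hypothetical.

```python
import itertools


def train(lr, momentum, steps=100):
    """Hypothetical training run: fit y = 2x with one weight, return final loss."""
    data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        g = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        velocity = momentum * velocity + g
        w = w - lr * velocity
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


# The grid is the "structured set of runs": every lr paired with every momentum.
grid = {"lr": [0.001, 0.01, 0.1], "momentum": [0.0, 0.9]}
runs = []
for lr, momentum in itertools.product(grid["lr"], grid["momentum"]):
    runs.append({"lr": lr, "momentum": momentum, "final_loss": train(lr, momentum)})

# Best runs first, so the comparison does not rely on memory.
for run in sorted(runs, key=lambda r: r["final_loss"]):
    print(run)
```

A tracking tool adds persistence, plots, and smarter search strategies on top of this, but the record it keeps has the same shape: one row per run, with the settings next to the result.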

Pair-session guide

Core work is ARENA 0.3 section 1: implement SGD, momentum, RMSprop, and Adam, then compare optimizer trajectories. Stretch work is W&B sweeps on the image classifier and the distributed-training reading. Pair rule: when an optimizer test fails, compare update equations line by line before changing code.

What you should see

You should see optimizer trajectory plots where SGD zig-zags, momentum smooths the path, and Adam often reaches the basin quickly. If you do the W&B stretch, you should see multiple runs on your dashboard and be able to point to the learning rate or schedule that changed the curve.

this week's pair session

core

1: implement SGD, momentum, RMSprop, and Adam

stretch

W&B sweeps
Distributed training overview