week 07 / 12

Superposition & toy models

Features ≠ neurons — models pack more concepts than they have dimensions.

works through ARENA 1.5.4 · Toy Models of Superposition & SAEs

What is a feature?

A feature is a human-meaningful property represented in an activation: “this token is inside a Python comment”, “the sentence is in French”, “the next word should be a name”. Features are often sparse: for any one input, most possible features are off. That sparsity is the loophole that makes superposition possible.

If all five features were active on every input, a two-dimensional hidden space could not store them cleanly. But if only one or two are usually active at a time, the model can let their directions overlap a little and use a ReLU to clip the small accidental leakage.

The pentagon trick

The toy model trains on synthetic sparse features, so we know the ground truth. With two hidden dimensions and five sparse features, the learned directions often spread into a pentagon. Each feature gets part of a dimension rather than owning one neuron.
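To make "synthetic sparse features" concrete, here is a minimal sketch of one way to sample them. The helper name `sample_sparse_features` and the choice of uniform values in [0, 1) are assumptions for illustration, not necessarily what the notebook uses.

```python
import torch

torch.manual_seed(0)

def sample_sparse_features(batch_size: int, n_features: int, p_active: float) -> torch.Tensor:
    """Each feature is independently active with probability p_active.

    Active features take a value uniform in [0, 1); inactive ones are exactly 0.
    """
    values = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) < p_active
    return values * mask

batch = sample_sparse_features(batch_size=1000, n_features=5, p_active=0.1)
active_per_input = (batch > 0).sum(dim=1).float()
print(batch[:4])
print("mean active features per input:", active_per_input.mean().item())
```

With `p_active=0.1` and five features, the expected number of active features per input is 0.5, so most inputs have zero or one feature on. That is the regime where overlapping directions are cheap.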

import torch
import torch.nn.functional as F

# rows are three feature directions in a tiny 2-D hidden space
W = F.normalize(torch.tensor([[1.0, 0.0], [0.3, 0.95], [-0.8, 0.6]]), dim=1)
leakage = W @ W.T
print(leakage.round(decimals=2))
# off-diagonal values show features interfering with each other

The off-diagonal entries are interference. Superposition is the model deciding that small interference is cheaper than dedicating full dimensions to rare features.
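The same check can be run on an idealized pentagon, together with the ReLU clipping described earlier. This is a hand-built sketch, not a trained model: the directions are placed exactly 72 degrees apart, and the bias value is a hand-picked assumption chosen to sit just above the positive leakage.

```python
import math
import torch

# Five unit-length feature directions spaced 72 degrees apart in a 2-D hidden space.
angles = torch.arange(5) * (2 * math.pi / 5)
W = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)  # (5 features, 2 hidden)

# 1.0 on the diagonal; cos(72°) ≈ 0.31 for neighbours, cos(144°) ≈ -0.81 otherwise.
interference = W @ W.T
print(interference.round(decimals=2))

# Activate only feature 0, compress to 2-D, then read every feature back out.
x = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])
hidden = x @ W        # (2,)
readout = W @ hidden  # (5,) feature 0 plus leakage onto the other four

# Assumed bias: slightly larger in magnitude than cos(72°) ≈ 0.309.
bias = -0.31
cleaned = torch.relu(readout + bias)
print(readout.round(decimals=2))
print(cleaned.round(decimals=2))
```

Negative leakage is removed by the ReLU alone, and the small negative bias removes the positive leakage onto the two neighbouring features. The cost is that the active feature is read out a little below its true value.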

Why neurons become polysemantic

A polysemantic neuron responds to multiple unrelated features because the neuron basis is not the basis the features are stored in. Week 6 found some clean attention heads, but that is the lucky case. For many model parts, “look at the biggest neuron” is the wrong microscope.

The notebook explores importance and sparsity sweeps, privileged versus non-privileged bases, feature geometry, and double descent. The central question is always the same: when is it worth packing features together?
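A sparsity sweep can be sketched in a few lines. The version below assumes a tied-weight model of the form x_hat = ReLU(x Wᵀ W + b) with an importance-weighted squared error. The function name `train_toy_model`, the `0.9 ** i` importance schedule, and all hyperparameters are illustrative assumptions, not the notebook's settings.

```python
import torch

def train_toy_model(p_active: float, n_features: int = 5, n_hidden: int = 2,
                    steps: int = 1000, batch_size: int = 1024, seed: int = 0):
    """Train x_hat = ReLU(x W^T W + b) on synthetic sparse features."""
    gen = torch.Generator().manual_seed(seed)
    W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features, generator=gen))
    b = torch.nn.Parameter(torch.zeros(n_features))
    # Assumed importance schedule: each feature matters a bit less than the previous one.
    importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)
    opt = torch.optim.Adam([W, b], lr=1e-2)

    def sample(n: int) -> torch.Tensor:
        values = torch.rand(n, n_features, generator=gen)
        mask = torch.rand(n, n_features, generator=gen) < p_active
        return values * mask

    def loss_fn(x: torch.Tensor) -> torch.Tensor:
        hidden = x @ W.T                    # (batch, n_hidden)
        x_hat = torch.relu(hidden @ W + b)  # (batch, n_features)
        return (importance * (x - x_hat) ** 2).mean()

    eval_batch = sample(4096)  # fixed batch so before/after losses are comparable
    initial_loss = loss_fn(eval_batch).item()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(sample(batch_size)).backward()
        opt.step()
    final_loss = loss_fn(eval_batch).item()
    return W.detach(), initial_loss, final_loss

for p_active in (1.0, 0.05):
    W, initial_loss, final_loss = train_toy_model(p_active)
    norms = W.norm(dim=0)  # one norm per feature; near zero means "not represented"
    print(f"p_active={p_active}: loss {initial_loss:.4f} -> {final_loss:.4f}, "
          f"feature norms {[round(v, 2) for v in norms.tolist()]}")
```

The thing to compare across the sweep is how many features end up with a non-negligible norm. Dense inputs tend to keep only as many features as there are hidden dimensions, while sparse inputs tend to give more features a direction. A short run like this one may not show the effect cleanly.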

First contact with SAEs

A sparse autoencoder (SAE) tries to unmix superposed features. It learns a wider latent representation but penalizes using too many latents at once.

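import torch
import torch.nn.functional as F

# untrained toy SAE: random weights, just to show the shape of the loss
x = torch.randn(16, 2)      # 16 activations from a 2-D hidden space
W_enc = torch.randn(2, 8)   # encoder widens 2 dims into 8 latents
W_dec = torch.randn(8, 2)   # decoder maps the latents back to 2 dims
acts = F.relu(x @ W_enc)    # ReLU keeps latent activations non-negative
recon = acts @ W_dec
# reconstruction error plus an L1 penalty on the latent activations
loss = ((recon - x) ** 2).mean() + 0.01 * acts.abs().mean()
print(loss.item())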

The reconstruction term says “preserve the original activation.” The sparsity term says “explain it with few active latents.” Week 8 scales this idea to real LLMs.
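To see the two terms pull against each other during training, here is a minimal sketch that fits a small SAE to activations built from known pentagon directions. The ground-truth setup, the latent count, the L1 coefficient, and the optimizer settings are all assumptions chosen for illustration. A practical SAE usually also constrains the decoder norms, which this sketch omits.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Ground truth: five unit directions in 2-D (the pentagon), sparsely activated.
angles = torch.arange(5) * (2 * math.pi / 5)
true_dirs = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)  # (5, 2)

def sample_hidden(n: int, p_active: float = 0.1) -> torch.Tensor:
    feats = torch.rand(n, 5) * (torch.rand(n, 5) < p_active)
    return feats @ true_dirs  # (n, 2) superposed activations

# Assumed SAE size and sparsity coefficient.
n_latents, l1_coeff = 8, 0.01
W_enc = torch.nn.Parameter(0.1 * torch.randn(2, n_latents))
b_enc = torch.nn.Parameter(torch.zeros(n_latents))
W_dec = torch.nn.Parameter(0.1 * torch.randn(n_latents, 2))
opt = torch.optim.Adam([W_enc, b_enc, W_dec], lr=1e-2)

def sae_loss(x: torch.Tensor) -> torch.Tensor:
    acts = F.relu(x @ W_enc + b_enc)
    recon = acts @ W_dec
    recon_term = ((recon - x) ** 2).sum(dim=1).mean()  # preserve the activation
    sparsity_term = acts.abs().sum(dim=1).mean()       # use few latents
    return recon_term + l1_coeff * sparsity_term

eval_x = sample_hidden(4096)  # fixed batch so before/after losses are comparable
loss_before = sae_loss(eval_x).item()
for _ in range(2000):
    opt.zero_grad()
    sae_loss(sample_hidden(1024)).backward()
    opt.step()
loss_after = sae_loss(eval_x).item()

# For each true feature, how well does its best-matching decoder row line up?
cos = true_dirs @ F.normalize(W_dec.detach(), dim=1).T  # (5, n_latents)
print(f"loss {loss_before:.4f} -> {loss_after:.4f}")
print("best cosine per true feature:", [round(v, 2) for v in cos.max(dim=1).values.tolist()])
```

A best cosine near 1 for a feature means some decoder row points almost exactly along that feature's true direction. How many features reach that depends on the sparsity level, the coefficient, and the training length.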

Pair-session guide

Core work is ARENA 1.5.4 sections 1 and 2 plus the toy SAE section: train toy models, inspect feature geometry, and see an SAE recover known features. Stretch work is the deeper geometry and double-descent sections. Keep asking which dimension is batch, which is feature, and which is parallel model instance.
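One way to keep the three axes straight is to name them in an einsum string. The sketch below assumes a layout of (batch, instance, feature) for inputs and (instance, hidden, feature) for weights. The notebook's actual axis order may differ, so treat the letters as placeholders.

```python
import torch

torch.manual_seed(0)
n_inst, batch, n_feat, n_hidden = 3, 4, 5, 2

# Assumed layout: one weight matrix and one bias per parallel model instance.
W = torch.randn(n_inst, n_hidden, n_feat)
bias = torch.zeros(n_inst, n_feat)
x = torch.rand(batch, n_inst, n_feat)

# b = batch, i = instance, f = feature, h = hidden.
# "i" appears in every operand and in the output, so instances never mix.
hidden = torch.einsum("bif,ihf->bih", x, W)
out = torch.relu(torch.einsum("bih,ihf->bif", hidden, W) + bias)  # bias broadcasts over batch
print(x.shape, hidden.shape, out.shape)

# Sanity check: instance 0 computed on its own matches the batched result.
solo = torch.relu(x[:, 0] @ W[0].T @ W[0] + bias[0])
print(torch.allclose(solo, out[:, 0], atol=1e-5))
```

If a shape error appears, the first thing to check is whether the instance axis was accidentally summed over or swapped with the batch axis.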

What you should see

You should see feature directions spread into a pentagon-like plot as sparsity increases. In the SAE section, decoder directions should snap toward the true feature directions. The conceptual win is replacing “one neuron equals one concept” with a geometry story.

this week's pair session

core

1–2 — toy models and privileged basis
5 — toy SAEs

stretch

Feature geometry
Double descent