arena-in-12-weeks

week 05 / 12

Build a transformer

Attention: tokens looking at other tokens.

works through ARENA 1.1 · Transformer from Scratch

What an LLM does

Text first becomes tokens: integers from a fixed vocabulary. For every position, the model outputs logits for the next token. Sampling turns those logits into actual text.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Mechanistic interpretability"))
# e.g. ['Me', 'chan', 'istic', 'Ġinterpret', 'ability'] or similar GPT-2 pieces
# ('Ġ' is how the GPT-2 tokenizer displays a leading space)

The exact split can vary by tokenizer version; the point is that models see chunks, not words.

The residual stream is the shared whiteboard

Each token position gets a vector of size d_model. Layer 0 writes token and position information onto that vector. Attention heads and MLPs then read the current vector and add their own updates. Because of the residual connection, later layers can use all earlier work instead of replacing it.
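The "add, do not replace" idea can be sketched in a few lines. This is a minimal illustration, not the ARENA implementation: the toy d_model, the names attn_stand_in and mlp_stand_in, and the use of plain linear layers in place of real attention are all assumptions made here, and LayerNorm is left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16  # toy size, chosen only for illustration

# Hypothetical stand-ins for real components.
# Each maps (batch, seq, d_model) -> (batch, seq, d_model).
attn_stand_in = nn.Linear(d_model, d_model)
mlp_stand_in = nn.Sequential(
    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
)

resid = torch.randn(2, 6, d_model)  # what layer 0 wrote: token + position information
start = resid.clone()

attn_update = attn_stand_in(resid)
resid = resid + attn_update  # add to the stream, do not overwrite it
mlp_update = mlp_stand_in(resid)
resid = resid + mlp_update

# The final stream is the starting vector plus every update written so far.
print(torch.allclose(resid, start + attn_update + mlp_update))
print(resid.shape)
```

Because each component only adds to the stream, the final vector decomposes into a sum of named contributions, which is what later interpretability work takes apart.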

This mental model pays off in every interpretability week: when we ask “where is the model storing the answer?”, we usually mean “which direction in the residual stream carries it?”

Positional encoding

Attention is a bag-of-tokens operation: on its own it has no notion of order, so shuffling the input would shuffle the output the same way. To break that symmetry the model adds a positional encoding (a vector that depends only on position) to each token’s embedding before any layer runs. Layer 0 then writes both token identity and position onto the residual stream.

GPT-2, the model this week rebuilds, learns these position vectors as parameters rather than using the fixed sinusoidal formula from the original Transformer; either way, the residual stream ends up carrying “what token” and “where” in the same vector.
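A learned positional embedding can be sketched as a second lookup table indexed by position. The sizes and the names token_embed and pos_embed below are illustrative assumptions, not GPT-2's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, n_ctx, d_model = 100, 8, 16  # toy sizes, not GPT-2's

token_embed = nn.Embedding(vocab, d_model)  # one learned vector per token id
pos_embed = nn.Embedding(n_ctx, d_model)    # one learned vector per position

tokens = torch.tensor([[5, 7, 5]])          # token 5 appears at positions 0 and 2
positions = torch.arange(tokens.shape[1])   # tensor([0, 1, 2])

# Position vectors broadcast over the batch dimension.
resid = token_embed(tokens) + pos_embed(positions)
print(resid.shape)  # torch.Size([1, 3, 16])

# Same token at two different positions gives two different residual vectors.
same = torch.allclose(resid[0, 0], resid[0, 2])
print(same)
```

Without the pos_embed term, the two occurrences of token 5 would enter the first layer as identical vectors.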

Attention in three passes

Plain English: a token asks a question, earlier tokens advertise keys, matching scores decide where to read, and values are copied back. In glossary terms, this is attention with query/key/value vectors.

Shape pass: starting from (batch, seq, d_model), a head projects to queries, keys, and values. Attention scores have shape (batch, heads, query_pos, key_pos). A causal mask prevents looking at future tokens.

import torch
from einops import einsum

q = torch.randn(1, 2, 4, 8)  # batch, heads, query positions, head dim
k = torch.randn(1, 2, 4, 8)
scores = einsum(q, k, "b h q d, b h k d -> b h q k")
print(scores.shape)
# torch.Size([1, 2, 4, 4])

The other transformer parts

The embedding is a lookup table from token id to vector. The unembedding maps final residual vectors back to logits. MLP blocks are per-token neural networks, often where factual or feature-like information appears. LayerNorm keeps activation scales stable. Sampling chooses the next token greedily or with temperature/top-k/top-p randomness.
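As a rough sketch of the sampling step, the function below covers greedy, temperature, and top-k choices on a single logits vector. The name sample_next_token and the convention that temperature 0 means greedy are assumptions of this example. Top-p is omitted to keep it short.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick one token id from a 1-D logits vector of shape (vocab,).

    temperature == 0 is treated as greedy decoding (a convention of this sketch).
    """
    if temperature == 0:
        return int(logits.argmax())
    logits = logits / temperature
    if top_k is not None:
        # Keep the k largest logits and push everything else to -inf.
        kth_largest = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    probs = logits.softmax(dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])  # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0))  # 0, the argmax
print(sample_next_token(logits, temperature=0.7, top_k=2))  # either 0 or 1
```

Lower temperatures sharpen the distribution toward the argmax, and top-k removes the long tail of unlikely tokens before sampling.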

The forward pass, as a shape table

Keep this shape scratchpad beside you while implementing. For a batch of 2 prompts of length 6:

# batch=2, seq=6, vocab=50257, d_model=768, n_heads=12, d_head=64
# tokens   # (2, 6)        integer token ids
# embed    # (2, 6, 768)   token + positional embeddings written into the residual stream
# attn_out # (2, 6, 768)   each position reads others via QK/OV, added back to the stream
# mlp_out  # (2, 6, 768)   per-token MLP block, added back to the stream
# logits   # (2, 6, 50257) final residual stream projected by the unembedding

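From the embedding onward, everything written to the residual stream has shape (2, 6, 768); only the last projection widens to the vocabulary. Shapes do change temporarily inside a block (per-head queries, keys, and values; the wider MLP hidden layer), but whatever is added back to the stream must be (2, 6, 768). If it is not, that is the bug.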

Pair-session guide

Core work is ARENA 1.1 sections 1 and 2: implement the transformer modules and load GPT-2 weights into your clean implementation. Stretch work is training and sampling. This notebook is heavy; it is normal not to finish in one sitting. Keep a shared scratchpad of shapes for Q, K, V, attention scores, and output.

What you should see

Your hand-built model should match GPT-2’s outputs closely after loading real weights. In the sampling section, it should generate coherent English. The payoff is not that you wrote the fastest transformer; it is that GPT-2 is now a collection of modules you can name.
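One way to make "match closely" concrete is to compare logits from the reference model and your own on the same tokens. The helper name max_logit_gap, the toy tensors, and the 1e-3 tolerance are all assumptions of this sketch. In practice the two tensors would come from the two real models.

```python
import torch

def max_logit_gap(reference_logits, my_logits):
    """Largest absolute difference between two logits tensors of the same shape."""
    assert reference_logits.shape == my_logits.shape, "shape mismatch: check the shape table first"
    return (reference_logits - my_logits).abs().max().item()

# Toy stand-ins for real model outputs of shape (batch, seq, vocab).
torch.manual_seed(0)
reference_logits = torch.randn(2, 6, 50257)
my_logits = reference_logits + 1e-5 * torch.randn(2, 6, 50257)  # tiny numerical noise

gap = max_logit_gap(reference_logits, my_logits)
print(f"max |difference| = {gap:.2e}")
print(torch.allclose(reference_logits, my_logits, atol=1e-3))
```

Small gaps are expected from floating-point differences. A large gap usually points at a mis-loaded weight or a transposed matrix rather than at noise.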

this week's pair session

core

  • 1–2 — inputs/outputs and clean implementation

stretch

  • Training it
  • Sampling