arena-in-12-weeks

week 11 / 12

Steering & function vectors

Once you've found a direction, you can push the model along it.

works through ARENA 1.3.2 · Function Vectors & Model Steering

Reading to writing

Weeks 6 through 10 mostly asked what is inside the model. This week asks what happens when you edit it. Steering is a causal test: if adding a direction changes behavior in the predicted way, the direction is doing real work.

A minimal mental model is activation addition:

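import torch

resid = torch.randn(1, 12, 768)         # stand-in residual stream: (batch, seq, d_model)
steering_vec = torch.randn(768)         # stand-in direction in d_model space
resid[:, -1, :] += 2.0 * steering_vec   # add at the final token only, with scale 2.0
print(resid.shape)                      # torch.Size([1, 12, 768])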

The hard part is not the addition. The hard part is finding a direction worth adding and choosing where to add it.
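One simple way to find a candidate direction is a difference of means over contrasting prompts. The sketch below uses synthetic activations with a planted direction in place of real model activations; the tensor names, sample counts, and the planted offset are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
d_model, n_prompts = 768, 512

# Hypothetical stand-ins for final-token residual activations at one layer,
# one row per prompt. The "positive" set carries a planted direction.
planted = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
acts_negative = torch.randn(n_prompts, d_model)
acts_positive = torch.randn(n_prompts, d_model) + 4.0 * planted

# Difference of means: what the positive prompts share that the negative ones lack.
steering_vec = acts_positive.mean(dim=0) - acts_negative.mean(dim=0)

# Sanity check: the recovered vector should point roughly along the planted direction.
cosine = torch.nn.functional.cosine_similarity(steering_vec, planted, dim=0)
print(steering_vec.shape, cosine.item() > 0.8)
```

With real activations there is no planted direction to compare against, so the check has to be behavioral: add the vector and see whether outputs move as predicted.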

Function vectors

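A function vector is a task direction extracted from examples. For antonyms, you show the model few-shot prompts built from pairs like hot -> cold, collect the outputs of a small set of attention heads at the final token, average each head's output across prompts, and sum the results into one vector. You then add that vector to the residual stream of a new prompt. The model can then perform the task without the examples being present.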

This is in-context learning made portable: a task inferred from context becomes a vector you can move between runs.
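The extraction step can be sketched with synthetic tensors. This assumes one common recipe (average each selected head's final-token output over few-shot prompts, then sum across heads); the tensor `head_out`, the list `selected_heads`, and all sizes are hypothetical stand-ins, and choosing the heads is a separate step not shown here.

```python
import torch

torch.manual_seed(0)
n_prompts, n_heads, d_model = 20, 8, 64

# Hypothetical stand-in: each head's write into the residual stream at the
# final token, collected over few-shot antonym prompts.
head_out = torch.randn(n_prompts, n_heads, d_model)

# Assumed to come from a separate ranking of heads by causal effect.
selected_heads = [1, 4, 6]

# Average each head over prompts, then sum the selected heads' means.
mean_head_out = head_out.mean(dim=0)                      # (n_heads, d_model)
function_vec = mean_head_out[selected_heads].sum(dim=0)   # (d_model,)

# Inject at the final token of a zero-shot prompt's residual stream.
zero_shot_resid = torch.randn(1, 5, d_model)
patched = zero_shot_resid.clone()
patched[:, -1, :] += function_vec
print(function_vec.shape, patched.shape)
```

Only the final position changes; earlier positions are left as the zero-shot run produced them.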

nnsight as an intervention tool

nnsight is another library for tracing and editing model internals. It is especially useful when the model is too large to run locally, because the same trace can be sent to a remote backend. Code inside a trace describes what to save or edit during the forward pass; it feels a bit like building a query plan.

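from nnsight import LanguageModel
model = LanguageModel("openai-community/gpt2")  # assumed model: GPT-2 small, d_model = 768
prompt = "The weather today is"                 # placeholder prompt; steering_vec as defined earlier

with model.trace(prompt):
    hidden = model.transformer.h[9].output[0].clone().save()  # read: copy of block 9's output before the edit
    model.transformer.h[9].output[0][:] += steering_vec       # write: add the direction at every position
    out = model.lm_head.output.argmax(-1).save()              # measure: predicted token ids after the edit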

TransformerLens and nnsight share the same conceptual verbs: read activations, write activations, measure outputs.
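The same three verbs can be seen without either library. The sketch below uses a plain PyTorch forward hook on a toy network; the toy model, the hooked layer, and the scale of 4.0 are assumptions chosen only to make the example small and self-contained.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, vocab = 16, 10

# Toy stand-in for a model; a real transformer block would be hooked the same way.
model = nn.Sequential(
    nn.Linear(d_model, d_model), nn.Tanh(),
    nn.Linear(d_model, d_model), nn.Tanh(),
    nn.Linear(d_model, vocab),
)
layer = model[2]
steering_vec = torch.randn(d_model)
cache = {}

def read_and_write(module, inputs, output):
    cache["clean"] = output.detach().clone()   # read: keep the unedited activation
    return output + 4.0 * steering_vec         # write: the returned value replaces the output

x = torch.randn(1, 5, d_model)
with torch.no_grad():
    baseline_logits = model(x)
    handle = layer.register_forward_hook(read_and_write)
    steered_logits = model(x)
    handle.remove()                            # always detach hooks after the experiment

# measure: how far the intervention moved the logits
shift = (steered_logits - baseline_logits).abs().max().item()
print(cache["clean"].shape, shift > 0)
```

Removing the hook matters in practice: a hook left attached silently steers every later forward pass.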

Connection to SAEs

SAE-latent steering is a more targeted recent variant. Instead of adding a dense direction found by contrastive prompts, you turn up a single sparse latent, such as a “Golden Gate Bridge” feature, and observe how the output changes. The evidence standard is the same: a specific intervention paired with a specific behavioral prediction.
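A minimal sketch of turning up one latent, assuming a standard ReLU SAE with unit-norm decoder rows. The weights here are random rather than trained, and `latent_idx` and `target_value` are hypothetical; a real experiment would load a trained SAE and pick a latent by inspecting what it fires on.

```python
import torch

torch.manual_seed(0)
d_model, d_sae = 64, 256

# Hypothetical SAE weights; a trained SAE would be loaded instead.
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.nn.functional.normalize(torch.randn(d_sae, d_model), dim=-1)

def encode(resid):
    return torch.relu(resid @ W_enc + b_enc)

latent_idx = 42       # hypothetical index of the feature to turn up
target_value = 10.0   # activation level to push it toward

resid = torch.randn(1, 5, d_model)
latents = encode(resid)                                  # (1, 5, d_sae)

# Add only the change in that one latent, times its decoder direction,
# so the SAE's reconstruction error never enters the residual stream.
delta = target_value - latents[..., latent_idx]          # (1, 5)
steered = resid + delta.unsqueeze(-1) * W_dec[latent_idx]
print(steered.shape)
```

Adding the delta along the decoder direction, rather than replacing the residual with the SAE's full reconstruction, keeps everything the SAE fails to capture unchanged.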

Pair-session guide

Core work is ARENA 1.3.2 sections 1 through 3: nnsight basics, task-encoding states, and function vectors. Stretch work is GPT2-XL steering, which may need more GPU memory than free Colab provides. Pair rule: always write down the target layer, token position, and scale before running a steering experiment.
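One lightweight way to follow the pair rule is to record the plan as data before the run. The `SteeringPlan` class and its field names are hypothetical, and the `prediction` field is an addition beyond the three items the rule names.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SteeringPlan:
    layer: int
    token_position: int   # -1 means the final token
    scale: float
    prediction: str       # what you expect to change, written before running

plan = SteeringPlan(
    layer=9,
    token_position=-1,
    scale=2.0,
    prediction="completions shift toward positive sentiment",
)

# Serialize so the plan can be pasted into notes or logged next to the results.
record = json.dumps(asdict(plan))
print(record)
```

Freezing the dataclass keeps the plan from being quietly edited after the results are in.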

What you should see

You should see a model complete antonym-style tasks because you injected a function vector, not because the prompt contained examples. In the steering stretch, you should see generation shift in sentiment or topic when you add the steering vector at the right layer and scale.

this week's pair session

core

  • 1–3 — nnsight, task-encoding states, function vectors

stretch

  • GPT2-XL steering