arena-in-12-weeks

week 08 / 12

SAEs on real models

Dictionary learning recovers human-readable features from real LLMs.

Works through ARENA 1.3.3: Interpretability with SAEs

From toy features to real activations

In week 7 we knew the true features because we generated the data. In a real model, we only have activations and behavior. An SAE gives a proposed dictionary of latent features, then humans test whether those latents are coherent.
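To make "proposed dictionary" concrete, here is a minimal sketch of an SAE forward pass on toy data. The weights are random, the sizes are tiny, and subtracting the decoder bias before encoding is one common convention rather than a universal rule. All names (sae_forward, W_enc, W_dec, and so on) are illustrative and do not come from any particular library.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into sparse latents, then decode them back.

    x: [n, d_model]; W_enc: [d_model, d_sae]; W_dec: [d_sae, d_model].
    """
    # Assumption: center by the decoder bias before encoding (a common convention).
    latents = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU keeps latents >= 0
    # Each row of W_dec is one dictionary direction; x_hat is their weighted sum.
    x_hat = latents @ W_dec + b_dec
    return latents, x_hat

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes; real dictionaries are far larger
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.full(d_sae, -0.1)  # a negative bias pushes more latents to zero
b_dec = np.zeros(d_model)

latents, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(latents.shape, x_hat.shape)
```

The latents are what you inspect; x_hat is the SAE's attempt to rebuild the original activation from them.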

The dictionaries are huge: thousands to millions of latents. Some are crisp, some are dead, some split one concept across several latents, and some are just not interpretable. That uncertainty is part of the field, not a failure of the exercise.

Reading a feature dashboard

A feature dashboard is a field microscope for one latent. Look for:

  • top-activating text examples: what reliably turns the latent on;
  • activation statistics: whether it is rare, common, or dead;
  • logit effects: which output tokens are promoted or suppressed;
  • manual tests: prompts you invent to check the story.
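The first three items in the list above can be sketched on toy data. This is a simplified illustration, not how any real dashboard is built: the activations, decoder, unembedding matrix, and vocabulary are invented, and the logit effect uses only the direct path (decoder direction times unembedding), ignoring later layers and the final layer norm. The function name latent_report is hypothetical.

```python
import numpy as np

def latent_report(acts, tokens, W_dec, W_U, vocab, latent, k=2):
    """Summarize one latent.

    acts: [n_tokens, d_sae]; W_dec: [d_sae, d_model]; W_U: [d_model, n_vocab].
    """
    a = acts[:, latent]
    top = np.argsort(-a)[:k]               # strongest activations first
    logit_effect = W_dec[latent] @ W_U     # direct-path effect on output logits
    order = np.argsort(-logit_effect)
    return {
        "firing_rate": float((a > 0).mean()),  # 0.0 means the latent is dead here
        "top_examples": [(tokens[i], float(a[i])) for i in top],
        "promoted": [vocab[j] for j in order[:k]],
        "suppressed": [vocab[j] for j in order[::-1][:k]],
    }

# Invented toy data: latent 0 fires on bridge-related tokens, latent 1 never fires.
tokens = ["the", "Golden", "Gate", "Bridge", "is", "red"]
acts = np.array([[0.0, 0.0], [0.2, 0.0], [0.9, 0.0],
                 [1.5, 0.0], [0.0, 0.0], [0.0, 0.0]])
W_dec = np.eye(2)
vocab = ["bridge", "river", "cat"]
W_U = np.array([[2.0, 0.5, -1.0],
                [0.0, 0.0, 0.0]])

report = latent_report(acts, tokens, W_dec, W_U, vocab, latent=0)
dead = latent_report(acts, tokens, W_dec, W_U, vocab, latent=1)
print(report)
```

The fourth item, manual tests, is the step no summary can do for you: write prompts that should and should not trigger the latent, then check whether the activations agree with your story.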

Before the session, open Neuronpedia and search for a model you recognize. Try to find one feature that looks semantic, one that looks syntactic, and one you cannot explain.

Running an SAE as an instrument

The code pattern is model plus SAE plus cache. The exact model names may change, but the mental model is stable: run text, collect SAE latent activations, inspect the strongest latents.

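from sae_lens import SAE, HookedSAETransformer

# Load GPT-2 small with SAE-aware hooks.
model = HookedSAETransformer.from_pretrained("gpt2-small")
# The three-way unpacking matches older sae_lens releases; check what your installed version returns.
sae, _, _ = SAE.from_pretrained(release="gpt2-small-res-jb", sae_id="blocks.7.hook_resid_pre")
# Run the prompt with the SAE spliced in and cache its latent activations.
_, cache = model.run_with_cache_with_saes("The Golden Gate Bridge", saes=[sae])
acts = cache["blocks.7.hook_resid_pre.hook_sae_acts_post"]  # [batch, position, latent]
# Indices of the five strongest latents at the final token.
print(acts[0, -1].topk(5).indices)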

The top indices are not answers by themselves. They are leads you investigate with dashboards and prompt tests.

Honest limitations

SAEs have reconstruction error: the decoded activation is not exactly the original. Latents can be dead, split, or too broad. A feature can look meaningful because the examples are cherry-picked. The cure is the evidence habit that week 9 will formalize: inspect, intervene, and track metrics.
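Reconstruction error is easy to put a number on. The sketch below computes mean squared error and the fraction of variance unexplained (FVU) on invented arrays. FVU is one common way to normalize the error, not the only one, and the function name reconstruction_metrics is hypothetical.

```python
import numpy as np

def reconstruction_metrics(x, x_hat):
    """Return (mse, fvu) for original activations x and reconstructions x_hat."""
    resid = x - x_hat
    mse = float((resid ** 2).mean())
    # Total variance is measured around the per-dimension mean of x.
    total_var = float(((x - x.mean(axis=0)) ** 2).sum())
    fvu = float((resid ** 2).sum() / total_var)
    return mse, fvu

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
perfect = reconstruction_metrics(x, x.copy())
# Predicting the mean everywhere explains no variance at all.
mean_only = reconstruction_metrics(x, np.tile(x.mean(axis=0), (3, 1)))
print(perfect, mean_only)
```

An FVU near 0 means the SAE rebuilds the activation almost exactly; an FVU near 1 means it does no better than always predicting the mean activation.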

Training your own SAE is a stretch goal. It adds data pipelines, L0-style sparsity metrics, and the loss-recovered metric. Useful, but not required to get the core idea.
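For orientation, the two metrics named above can be sketched in a few lines. The definition of loss recovered used here (comparing against a zero-ablation baseline) is one common convention and is an assumption on my part; the loss values and latent activations are made up for illustration, and both function names are hypothetical.

```python
import numpy as np

def mean_l0(latents):
    """Average number of active (nonzero) latents per token. latents: [n_tokens, d_sae]."""
    return float((latents > 0).sum(axis=-1).mean())

def loss_recovered(clean_loss, spliced_loss, ablated_loss):
    """Fraction of the ablation damage the SAE undoes.

    clean_loss:   model loss with original activations
    spliced_loss: loss with activations replaced by the SAE reconstruction
    ablated_loss: loss with activations zeroed out (assumed baseline)
    """
    return (ablated_loss - spliced_loss) / (ablated_loss - clean_loss)

# Invented values, for illustration only.
latents = np.array([[0.0, 1.2, 0.0, 0.3],
                    [0.0, 0.0, 0.0, 2.0]])
l0 = mean_l0(latents)
recovered = loss_recovered(clean_loss=3.0, spliced_loss=3.2, ablated_loss=8.0)
print(l0, recovered)
```

A low L0 means few latents fire per token, and a loss recovered near 1 means the model behaves almost the same with the SAE spliced in. The two usually trade off against each other.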

Pair-session guide

Core work is ARENA 1.3.3 section 1: run pretrained SAEs, inspect features, and build dashboards or dashboard-like reports. Stretch work is deeper latent analysis and training/evaluation. If free Colab runs out of memory on larger models, switch to GPT-2-small variants rather than debugging CUDA for an hour.

What you should see

You should produce or inspect at least one feature dashboard and be able to state a hypothesis like “this latent fires on bridge names” or “this one tracks Python exception code.” The best outcome is one genuine surprise where a latent looks far more human-readable than you expected.

this week's pair session

core

  • 1 — intro to SAE interp

stretch

  • Latent deep dive
  • Train and evaluate an SAE