Scaling Monosemanticity
The paper "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" applied sparse autoencoders at production scale to extract millions of interpretable features from a frontier language model. It found features corresponding to specific concepts (the Golden Gate Bridge, code errors, deception) that are multilingual, multimodal, and abstract.
Demonstrated that mechanistic interpretability can scale to production models, not just toy systems. Features could be clamped to steer model behavior (the "Golden Gate Claude" experiment). A major milestone in understanding what large language models learn internally. By Templeton, Conerly, Marcus, et al. on the Anthropic interpretability team led by Chris Olah.
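A minimal sketch of the two ideas above — a sparse autoencoder over model activations, and clamping a feature to steer the reconstruction — assuming the common one-hidden-layer ReLU architecture with an L1 sparsity penalty. All sizes, parameter values, and the feature index are toy placeholders, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64   # toy sizes; the paper scales features to millions

# Hypothetical parameters; a real SAE learns these by minimizing
# reconstruction error plus an L1 sparsity penalty on the feature activations.
W_enc = rng.normal(0, 0.1, (d_features, d_model))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_model, d_features))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: most feature activations land at zero, giving a sparse code
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def decode(f):
    # Reconstruct the activation as a sparse sum of feature directions
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    return np.sum((x - decode(f)) ** 2) + l1_coeff * np.sum(np.abs(f))

x = rng.normal(size=d_model)   # stand-in for a residual-stream activation
f = encode(x)

# "Clamping": pin one feature to a large value and decode, yielding a
# steered activation -- the mechanism behind the Golden Gate Claude demo.
f_clamped = f.copy()
f_clamped[7] = 10.0            # feature index 7 is arbitrary here
x_steered = decode(f_clamped)
```

In the actual experiments the clamped reconstruction is substituted back into the model's forward pass, so the steering shows up in generated text rather than in a single vector.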