Anatomy of Post-Training

Interpretability-guided post-training from Anthropic: identifies latent concepts that separate preferred from dispreferred generations, unifies feature-level and data-level interventions as forms of reward shaping, and uses the framework to diagnose spurious training signals such as sycophancy. Connects the interpretability agenda directly to post-training practice.
