Anatomy of Post-Training
paper
Interpretability-guided post-training from Anthropic: identifies latent concepts that separate preferred from dispreferred generations, unifies feature-level and data-level interventions as forms of reward shaping, and uses the framework to diagnose spurious training signals such as sycophancy. Connects the interpretability agenda directly to post-training practice.