"Constitutional AI: Harmlessness from AI Feedback" — introduced RLAIF (RL from AI Feedback) using a set of explicit principles (a "constitution") to guide model self-critique and revision. The model critiques its own harmful outputs, revises them, then trains a preference model on the AI-generated comparisons rather than human labels.

Constitutional AI demonstrated that alignment can be achieved with far fewer human preference labels while producing models that are both more helpful and more harmless. The approach became core to Anthropic's training pipeline and influenced the broader move toward scalable oversight and RLAIF across the industry. By Bai, Kadavath, Kundu, et al.

Paper

arXiv: 2212.08073

alignment · safety · foundational
