Reinforcement Learning Towards Broadly and Persistently Beneficial Models

An OpenAI alignment study asking whether RL on beneficial behavior in realistic domains generalizes alignment beyond the training distribution. The authors build a dataset of realistic situations designed to measure and train beneficial traits — truthfulness, fairness, risk awareness, and corrigibility — across varied domains including health, science, and education, then train models with RL on it and evaluate against 50+ independent alignment benchmarks.

Compared with a compute-matched baseline, beneficial-trait RL improves performance on over 80% of out-of-distribution (OOD) benchmarks. The key finding is broad OOD alignment transfer: an intervention confined to a single domain (health) yields improvements on non-health evaluations, including reduced reward hacking, deception, and general misalignment. Models also show greater persistence, meaning more resistance to adversarial prompting and harmful fine-tuning, which suggests that reinforcing beneficial behavior in realistic domains can produce models more robustly aligned with human flourishing.
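To make the headline metric concrete, here is a minimal sketch of how a "share of benchmarks improved" figure could be computed from per-benchmark scores. Everything in it is an assumption for illustration: the function name `fraction_improved`, the higher-is-better scoring convention, and the benchmark names and scores, which are made-up placeholders and not results from the paper.

```python
def fraction_improved(treated, baseline, higher_is_better=True):
    """Return the share of shared benchmarks where `treated` beats `baseline`."""
    # Only compare benchmarks that both models were evaluated on.
    shared = sorted(set(treated) & set(baseline))
    if not shared:
        raise ValueError("no benchmarks in common")

    wins = 0
    for name in shared:
        delta = treated[name] - baseline[name]
        if not higher_is_better:
            delta = -delta  # flip sign for lower-is-better metrics
        if delta > 0:       # ties do not count as improvements
            wins += 1
    return wins / len(shared)


# Hypothetical placeholder scores (NOT taken from the paper).
baseline_scores = {"bench_a": 0.50, "bench_b": 0.70, "bench_c": 0.40,
                   "bench_d": 0.60, "bench_e": 0.55}
rl_scores = {"bench_a": 0.60, "bench_b": 0.75, "bench_c": 0.45,
             "bench_d": 0.58, "bench_e": 0.65}

rate = fraction_improved(rl_scores, baseline_scores)
print(f"{rate:.0%} of benchmarks improved")
```

A simple win rate like this ignores effect sizes and statistical noise, so a fuller analysis would presumably also report per-benchmark deltas with confidence intervals.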

Paper (arXiv)

arXiv HTML

Authors: Akshay V. Jagadeesh · Rahul K. Arora · Khaled Saab · Ali Malik · Mikhail Trofimov · Foivos Tsimpourlas · Johannes Heidecke · Karan Singhal

Tags: alignment · safety · reinforcement-learning · research