Training a Helpful and Harmless Assistant (HH-RLHF)
paper"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" — Anthropic's foundational alignment paper. Studied the tension between helpfulness and harmlessness in RLHF, showing that larger models are both more helpful AND easier to make harmless through alignment training.
Released the HH-RLHF dataset (~170K human preference comparisons), one of the most widely used alignment datasets. Demonstrated that RLHF can make models less toxic without sacrificing capability when done at scale. By Bai, Jones, Ndousse, et al.
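To make the dataset and the training signal concrete, here is a minimal sketch, assuming the `datasets` and `torch` libraries and the Hugging Face dataset id `Anthropic/hh-rlhf` (where the released comparisons are hosted). The log-sigmoid pairwise loss matches the preference-modeling objective described in the paper; the `preference_loss` helper and the toy reward scores are illustrative, not the authors' code.

```python
# Sketch: inspect HH-RLHF comparisons and compute the pairwise preference loss.
import torch
import torch.nn.functional as F
from datasets import load_dataset

# Each record pairs a "chosen" and a "rejected" conversation transcript.
data = load_dataset("Anthropic/hh-rlhf", split="train")
example = data[0]
print(example["chosen"][:200])
print(example["rejected"][:200])

# A preference model is trained so the scalar reward of the chosen
# response exceeds that of the rejected one (Bradley-Terry style):
#   loss = -log sigmoid(r_chosen - r_rejected)
def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores standing in for a reward model's outputs on the pair above.
loss = preference_loss(torch.tensor([1.2]), torch.tensor([0.3]))
print(loss.item())
```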
Paper: arXiv:2204.05862