Policy Discriminative Learning (POLAR) is a pre-training approach that frames reward modeling as distinguishing between different policies. It reports significant improvements in preference accuracy across tasks, with predictable compute-performance scaling laws.

Paper

Library

reinforcement-learning, reasoning, research