An analysis of why some model families benefit dramatically from reinforcement learning while others barely move. The authors identify "distributional clarity" as the hidden structural property governing RL-friendliness: RL-responsive models exhibit intra-class compactness and inter-class separation in the probabilities they assign to correct vs. incorrect responses. They quantify this geometry with the Silhouette Coefficient and introduce a Silhouette-Aware Reweighting training strategy that directly improves it.
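To make the compactness-and-separation idea concrete, here is a minimal sketch of the standard Silhouette Coefficient applied to scalar response scores. It rests on assumptions not stated in this summary: each response is represented by a single number (for example a log-probability), distance is the absolute difference, and the sample values are hypothetical. The paper's exact formulation may differ.

```python
import numpy as np

def silhouette_coefficient(values, labels):
    """Mean silhouette over 1-D values grouped by class label.

    Assumption: each response is summarized by one scalar and the
    distance between two responses is the absolute difference.
    """
    x = np.asarray(values, dtype=float)
    y = np.asarray(labels)
    classes = np.unique(y)
    if classes.size < 2:
        raise ValueError("need at least two classes")

    dist = np.abs(x[:, None] - x[None, :])  # pairwise distances
    scores = np.zeros(x.size)
    for i in range(x.size):
        same = y == y[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # usual convention: a singleton class scores 0
        # a: mean distance to the other members of the same class
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to any other class
        b = min(dist[i, y == c].mean() for c in classes if c != y[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

# Hypothetical scores: label 1 = correct response, 0 = incorrect.
labels = [1, 1, 1, 0, 0, 0]
clear = silhouette_coefficient([-0.2, -0.3, -0.25, -2.0, -2.1, -1.9], labels)
mixed = silhouette_coefficient([-0.2, -1.9, -1.0, -0.3, -2.0, -1.1], labels)
```

In this toy setup, `clear` comes from two tight, well-separated groups and `mixed` from overlapping ones, so `clear` is the larger value. Values near 1 indicate compact, separated classes; values near 0 or below indicate overlap.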

The reweighting yields gains of up to 5.9 points on AIME24, turning otherwise RL-resistant models into ones that respond to RL. The work is a collaboration between Baidu (Mingzhu Cai, Huang He, Bingjin Chen, Siqi Bao, Hua Wu, and CTO Haifeng Wang) and Tsinghua University's Shenzhen International Graduate School (lead author Shaoning Sun, with Yujiu Yang).

Paper

Authors: Shaoning Sun · Mingzhu Cai · Huang He · Bingjin Chen · Siqi Bao · Yujiu Yang · Hua Wu · Haifeng Wang
Tags: reinforcement-learning · post-training · reasoning · research