Compact reasoning model post-trained from DeepSeek-R1-Distill-Qwen-1.5B using PRIME (Process Reinforcement through IMplicit rEwards) with token-level RLOO on ~400k math + ~25k code samples (NuminaMath-CoT, APPS, CodeContests, TACO, Codeforces). Generation length ramped 12k → 24k tokens over training.
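The RLOO side of the recipe uses a leave-one-out baseline: each sampled completion's advantage is its reward minus the mean reward of the other samples in its group. A minimal sketch (the function name and group size are illustrative, not from the PRIME codebase):

```python
import numpy as np

def rloo_advantages(rewards):
    """RLOO advantages for one prompt's group of K sampled completions.

    Each sample's baseline is the mean reward of the other K-1 samples,
    so the estimator stays unbiased without a learned value function.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.size
    # baseline_i = (sum of all rewards - r_i) / (k - 1)
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: 4 rollouts, only the first one solves the problem.
advs = rloo_advantages([1.0, 0.0, 0.0, 0.0])
# advs ≈ [1.0, -0.333, -0.333, -0.333]
```

In PRIME these scalar rewards are augmented with token-level implicit process rewards derived from a learned model, but the leave-one-out baseline itself works as above.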

Self-reported numbers: 88.34% MATH-500, 37.91% GPQA-Diamond, ~40% LeetCode — >50% improvement over the R1-Distill base at the same parameter count. An applied study in token-efficient reasoning at small scale. Not currently scored on Artificial Analysis.

Model Details

Architecture: Dense
Parameters: 1.5B
Base model: DeepSeek-R1-Distill-Qwen-1.5B

Benchmark Scores

Benchmark      Score
MATH-500       88.34%
GPQA Diamond   37.91%
LeetCode       ~40%
Tags: reasoning, open-weight
