"Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models." A process reward model purpose-built for financial-domain reasoning that combines step-level and trajectory-level reward signals into a unified ranking framework, applicable across offline SFT data selection, test-time best-of-N inference, and GRPO reinforcement learning reward shaping.

Built on Qwen2.5-7B-Instruct by Alibaba Cloud's Qwen DianJin Team with Soochow University and Osaka University. Trained on 3,000 step-by-step reasoning trajectories drawn from CFLUE (Chinese Financial Language Understanding Evaluation). Headline results: 58.2% on the CFLUE test set with Fin-PRM-curated SFT data; best-of-16 inference outperforms majority voting by 5.1% on CFLUE; GRPO with Fin-PRM as the reward yields 70.5% on CFLUE and 62.8% on FinQA. Accepted to IJCAI 2026.
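To make the best-of-N comparison concrete, here is a minimal sketch of how a reward-ranked selection can differ from majority voting. It is illustrative only and not the paper's implementation: the `Candidate` type, the `combined_score` function, the mean aggregation over steps, and the 0.5 weighting are all assumptions, and the scores are made-up toy values rather than Fin-PRM outputs.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Candidate:
    """One sampled reasoning trajectory (hypothetical structure)."""
    answer: str                 # final answer extracted from the trajectory
    step_scores: list[float]    # assumed per-step rewards in [0, 1]
    trajectory_score: float     # assumed whole-trajectory reward in [0, 1]


def combined_score(c: Candidate, step_weight: float = 0.5) -> float:
    """Blend the mean step reward with the trajectory reward.

    The mean aggregation and the weight are assumptions for illustration.
    """
    step_part = mean(c.step_scores) if c.step_scores else 0.0
    return step_weight * step_part + (1.0 - step_weight) * c.trajectory_score


def best_of_n(candidates: list[Candidate]) -> str:
    """Return the answer of the highest-scoring trajectory."""
    return max(candidates, key=combined_score).answer


def majority_vote(candidates: list[Candidate]) -> str:
    """Return the most frequent final answer, ignoring reward scores."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


# Toy data: two weakly reasoned trajectories agree on "A",
# one well-scored trajectory answers "B".
candidates = [
    Candidate("A", [0.4, 0.3], 0.2),
    Candidate("A", [0.5, 0.2], 0.3),
    Candidate("B", [0.9, 0.8], 0.9),
]

print("majority vote:", majority_vote(candidates))
print("best-of-N:    ", best_of_n(candidates))
```

In this toy case the two strategies disagree: voting follows the more frequent answer, while reward ranking follows the single trajectory with the strongest step and trajectory scores. The same ranking function could in principle be reused to filter SFT data or to supply a scalar reward, which is the sense in which the description above calls the framework unified.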

Paper

Venue IJCAI 2026
reasoning, research

Related