[2603.01571] Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
Computer Science > Artificial Intelligence

arXiv:2603.01571 (cs) [Submitted on 2 Mar 2026]

Title: Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma

Abstract: Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas ...
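To make the B-CoT / D-CoT distinction concrete, the following is a minimal, hypothetical Python sketch of what structured records for the two mechanisms might look like. The class names, fields, and rendering format are illustrative assumptions, not the paper's actual schema or synthesis pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical structures for the two reasoning mechanisms named in the
# abstract; field names are illustrative assumptions, not the paper's schema.

@dataclass
class BreadthCoT:
    # Breadth: one short judgment per evaluation principle
    # (e.g. helpfulness, safety), covering many dimensions.
    per_principle: dict[str, str] = field(default_factory=dict)

@dataclass
class DepthCoT:
    # Depth: a single line of judgment developed step by step to a verdict.
    steps: list[str] = field(default_factory=list)
    verdict: str = ""

def render_bcot(cot: BreadthCoT) -> str:
    """Serialize a Breadth-CoT into a flat rationale string."""
    return "\n".join(f"[{p}] {j}" for p, j in cot.per_principle.items())

def render_dcot(cot: DepthCoT) -> str:
    """Serialize a Depth-CoT into a flat rationale string."""
    lines = [f"Step {i + 1}: {s}" for i, s in enumerate(cot.steps)]
    lines.append(f"Verdict: {cot.verdict}")
    return "\n".join(lines)

if __name__ == "__main__":
    b = BreadthCoT(per_principle={
        "helpfulness": "Response A answers the question directly.",
        "safety": "Neither response raises safety concerns.",
    })
    d = DepthCoT(
        steps=["Check the arithmetic in response B.",
               "The final sum in B is off by one."],
        verdict="A",
    )
    print(render_bcot(b))
    print(render_dcot(d))
```

In this reading, a Breadth-CoT spreads evaluation across many principles while a Depth-CoT develops one judgment to a verdict, mirroring the abstract's coverage-versus-soundness distinction.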