[2509.21500] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Summary
This paper presents a rubric-based approach to reward modeling for large language models (LLMs), aimed at mitigating reward over-optimization during reinforcement fine-tuning (RFT).
Why It Matters
As LLMs become increasingly integral to a wide range of applications, ensuring the quality of their outputs is crucial. This research addresses reward over-optimization, a common failure mode in which the policy exploits the reward signal while output quality degrades. By focusing on rubric-based rewards, the study offers a promising way to improve LLM post-training and output quality.
Key Takeaways
- Reward over-optimization can degrade the quality of LLM outputs.
- Rubric-based rewards can effectively distinguish among high-quality responses (see the sketch after this list).
- The proposed method improves LLM performance by focusing on the high-reward tail.
- Off-policy examples can be leveraged while remaining insensitive to their artifacts.
- Empirical results show significant improvements in post-training outcomes.
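To make the rubric idea concrete, here is a minimal sketch (not from the paper) of how a rubric-based reward is commonly computed: a judge checks a response against explicit criteria, and the per-criterion verdicts are aggregated into a scalar reward. The RubricItem fields, the weights, and the judge callable are illustrative placeholders, not the authors' prompts or implementation.

```python
# Minimal sketch of a rubric-based reward, assuming the common setup in which
# an LLM judge checks a response against explicit criteria and the verdicts
# are aggregated into a scalar score. Names and weights are hypothetical.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RubricItem:
    criterion: str   # e.g. "cites evidence for every factual claim"
    weight: float    # relative importance of this criterion


def rubric_reward(
    prompt: str,
    response: str,
    rubric: Sequence[RubricItem],
    judge: Callable[[str, str, str], bool],  # (prompt, response, criterion) -> pass?
) -> float:
    """Weighted fraction of rubric criteria the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    passed = sum(
        item.weight for item in rubric if judge(prompt, response, item.criterion)
    )
    return passed / total
```

Because each criterion is checked independently, off-policy exemplars (e.g., from a stronger model) can inform which criteria to include without the reward latching onto stylistic artifacts of those exemplars.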
Computer Science > Machine Learning
arXiv:2509.21500 (cs) [Submitted on 25 Sep 2025 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Abstract: Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among grea...
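To illustrate the abstract's central claim about the high-reward tail, the toy simulation below (our illustration, not the paper's analysis) uses a proxy reward that tracks true quality well in the bulk but saturates in the top tail; optimizing that proxy then selects a response far from the genuinely best one, even though the overall correlation looks healthy. All quantities are synthetic.

```python
# Toy illustration of reward misspecification at the high-reward tail.
# We assume a latent "true quality" q per candidate response and a proxy
# reward that is faithful on average but cannot separate the very best
# ("Excellent") responses from the merely "Great" ones.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
q = rng.normal(size=n)                      # latent true quality of candidates

# Proxy reward: accurate in the bulk, but flat (saturated) above a threshold,
# so Excellent and Great responses receive indistinguishable scores.
tail = np.quantile(q, 0.95)
proxy = np.where(q < tail, q, tail) + rng.normal(scale=0.1, size=n)

best_by_proxy = np.argmax(proxy)            # what optimizing the proxy selects
best_by_quality = np.argmax(q)              # what we actually wanted

print(f"true quality of proxy-optimal response: {q[best_by_proxy]:.2f}")
print(f"true quality of the genuinely best response: {q[best_by_quality]:.2f}")
# The global correlation looks fine, yet the argmax is wrong: pushing the
# policy toward high proxy scores stops improving true quality in the tail.
print(f"overall correlation(proxy, quality): {np.corrcoef(proxy, q)[0, 1]:.2f}")
```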