[2505.15801] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Summary
The paper introduces VerifyBench, a benchmark for evaluating reference-based reward systems used in training large language models. It highlights a gap in existing reward benchmarks and proposes a standard for measuring verification accuracy.
Why It Matters
As large language models increasingly rely on reinforcement learning, effective evaluation of their reasoning capabilities is crucial. VerifyBench addresses the limitations of current benchmarks by focusing on verification against ground truth references, which is essential for enhancing model training and performance.
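To make the idea of reference-based reward verification concrete, here is a minimal, illustrative sketch of a rule-based verifier that scores a model's answer against a ground-truth reference. This is a generic example, not the paper's implementation: the function names and the normalization rules are assumptions for illustration.

```python
# Illustrative sketch of a reference-based reward for RL training.
# NOT the paper's method: normalization rules here are simplified assumptions.

def normalize(answer: str) -> str:
    """Canonicalize an answer: trim whitespace, lowercase, drop a trailing period."""
    return answer.strip().lower().rstrip(".")

def reference_reward(model_output: str, reference: str) -> float:
    """Binary reward: 1.0 if the model's answer matches the ground-truth
    reference after normalization, else 0.0."""
    return 1.0 if normalize(model_output) == normalize(reference) else 0.0

print(reference_reward("42.", "42"))   # matches after normalization -> 1.0
print(reference_reward("41", "42"))    # mismatch -> 0.0
```

Real verifiers used in reasoning-model training are typically far more involved (e.g., extracting a final answer from a chain of thought, or using a model-based judge), which is exactly why VerifyBench evaluates how reliably such systems agree with human judgment.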
Key Takeaways
- VerifyBench and its variant, VerifyBench-Hard, are designed to assess reference-based reward systems.
- Current benchmarks primarily focus on preference comparisons, neglecting verification against ground truth.
- Larger model-based verifiers show potential but need improvement on challenging instances.
- The paper provides insights into performance patterns across reasoning tasks.
- Establishing standardized benchmarks can enhance verification accuracy in reasoning models.
Computer Science > Computation and Language
arXiv:2505.15801 (cs)
[Submitted on 21 May 2025 (v1), last revised 18 Feb 2026 (this version, v4)]
Title: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Authors: Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang
Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have demonstrated remarkable performance in complex reasoning tasks. A critical component of their training is the incorporation of reference-based reward systems within reinforcement learning (RL), where model outputs are evaluated against ground truth references. However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning model training. In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Our comprehensive evaluation reveal...