[2505.15801] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

arXiv - AI · 4 min read

Summary

The paper introduces VerifyBench, a new benchmarking framework for evaluating reference-based reward systems in large language models, highlighting gaps in existing benchmarks and proposing improvements.

Why It Matters

As large language models increasingly rely on reinforcement learning, effective evaluation of their reasoning capabilities is crucial. VerifyBench addresses the limitations of current benchmarks by focusing on verification against ground truth references, which is essential for enhancing model training and performance.

Key Takeaways

  • VerifyBench and its variant, VerifyBench-Hard, are designed to assess reference-based reward systems.
  • Current benchmarks primarily focus on preference comparisons, neglecting verification against ground truth.
  • Larger model-based verifiers show potential but need improvement on challenging instances.
  • The paper provides insights into performance patterns across reasoning tasks.
  • Establishing standardized benchmarks can enhance verification accuracy in reasoning models.
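The reference-based reward the takeaways describe can be sketched as a simple rule-based check: extract the model's final answer and compare it to the ground-truth reference. The `\boxed{...}` answer convention and the normalization below are illustrative assumptions common in math-reasoning RL pipelines, not the paper's actual verifier.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods
    so the comparison tolerates trivial formatting differences."""
    return re.sub(r"[\s\.]+$", "", answer.strip().lower())

def reference_reward(model_output: str, reference: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference,
    else 0.0.

    Assumes the model wraps its final answer in \\boxed{...}; this is an
    illustrative sketch of a reference-based verifier, not the specific
    system evaluated in VerifyBench.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no extractable answer: treat as incorrect
    return 1.0 if normalize(match.group(1)) == normalize(reference) else 0.0
```

A binary reward like this is what RL training then maximizes; the hard cases VerifyBench-Hard targets are precisely those where such surface matching fails and a model-based verifier is needed.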

Computer Science > Computation and Language

arXiv:2505.15801 (cs) [Submitted on 21 May 2025 (v1), last revised 18 Feb 2026 (this version, v4)]

Title: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Authors: Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang

Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have demonstrated remarkable performance in complex reasoning tasks. A critical component of their training is the incorporation of reference-based reward systems within reinforcement learning (RL), where model outputs are evaluated against ground truth references. However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning model training. In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Our comprehensive evaluation reveals...
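A benchmark of this kind essentially scores a verifier against human-annotated judgments. The sketch below shows that evaluation loop under an assumed `(response, reference, label)` instance schema; the field layout and the exact-match verifier in the usage example are illustrative, not VerifyBench's actual format.

```python
from typing import Callable

def verifier_accuracy(
    verifier: Callable[[str, str], bool],
    instances: list[tuple[str, str, bool]],
) -> float:
    """Fraction of benchmark instances where the verifier's judgment
    agrees with the human-annotated correctness label.

    Each instance is (model_response, reference_answer, human_label);
    this schema is an assumption for illustration.
    """
    correct = sum(
        verifier(response, reference) == label
        for response, reference, label in instances
    )
    return correct / len(instances)

# Usage: score a naive exact-match verifier on three labeled instances.
exact_match = lambda response, reference: response.strip() == reference.strip()
labeled = [
    ("42", "42", True),   # exact match agrees with the human label
    ("41.0", "41", True), # equivalent answer the naive verifier misses
    ("7", "7", True),
]
accuracy = verifier_accuracy(exact_match, labeled)
```

Instances where surface matching disagrees with human judgment (like the `"41.0"` vs `"41"` case above) are exactly the kind of challenging example the paper's hard variant is built from.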
