[2510.23038] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

arXiv - Machine Learning · 4 min read

Summary

The paper presents TIR-Judge, a reinforcement learning framework that enhances Large Language Model (LLM) judges by integrating tool-based reasoning, improving evaluation accuracy across seven public benchmarks.

Why It Matters

As LLMs increasingly serve as evaluators, enhancing their reasoning capabilities is crucial for reliable assessments. This research introduces a novel approach that leverages tool integration, potentially transforming LLM evaluation processes and improving AI reliability.

Key Takeaways

  • TIR-Judge integrates a code executor to enhance LLM judges' evaluation capabilities.
  • The framework employs diverse training methods and flexible judgment formats.
  • TIR-Judge outperforms existing reasoning-based judges on multiple benchmarks.
  • The model demonstrates self-evolution through iterative reinforcement learning.
  • TIR-Judge-Zero achieves performance comparable to distilled models without requiring prior judge trajectories.
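The three judgment formats named above differ mainly in output shape. The following sketch illustrates that distinction with simple schemas; the field names are invented for illustration and are not specified by the paper.

```python
from dataclasses import dataclass
from typing import List

# Illustrative schemas for the three judgment formats (pointwise,
# pairwise, listwise). Field names are hypothetical, not from the paper.

@dataclass
class PointwiseJudgment:
    response_id: str
    score: float          # absolute quality score for a single response

@dataclass
class PairwiseJudgment:
    winner: str           # "A", "B", or "tie"

@dataclass
class ListwiseJudgment:
    ranking: List[str]    # response ids ordered best-first

# A pairwise verdict can be derived from two pointwise scores:
a = PointwiseJudgment("A", 0.9)
b = PointwiseJudgment("B", 0.4)
pair = PairwiseJudgment(
    "A" if a.score > b.score else "B" if b.score > a.score else "tie"
)
print(pair.winner)  # -> A
```

A judge trained to emit all three formats can be applied to single-response scoring, head-to-head comparison, and candidate ranking without retraining.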

Computer Science > Computation and Language

arXiv:2510.23038 (cs)

[Submitted on 27 Oct 2025 (v1), last revised 21 Feb 2026 (this version, v2)]

Title: Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu

Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameter...
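The abstract's core mechanism — a judge that delegates verifiable checks to a code executor instead of reasoning about them purely in text — can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: in TIR-Judge the verification code would be generated by the LLM itself as part of its reasoning trace, whereas here the check is passed in directly, and all function names and the word-count constraint are invented for illustration.

```python
# Minimal sketch of tool-integrated judging: rather than estimating a
# verifiable constraint in free text, the judge runs a small program
# and trusts the executor's result. All names here are illustrative.

def execute_check(code: str, response: str) -> bool:
    """Run judge-generated verification code in a restricted namespace."""
    namespace = {"response": response, "passed": False}
    exec(code, {"__builtins__": {"len": len}}, namespace)
    return bool(namespace["passed"])

def pairwise_judge(prompt: str, resp_a: str, resp_b: str, check: str) -> str:
    """Prefer the response that satisfies the executable constraint."""
    a_ok = execute_check(check, resp_a)
    b_ok = execute_check(check, resp_b)
    if a_ok and not b_ok:
        return "A"
    if b_ok and not a_ok:
        return "B"
    return "tie"  # fall back to intrinsic text reasoning in a real system

# Example: a constraint ("answer in exactly five words") that text-only
# judges often miscount but a one-line program verifies exactly.
check = "passed = len(response.split()) == 5"
verdict = pairwise_judge(
    "Answer in exactly five words.",
    "The capital is Paris, France.",          # 5 words
    "Paris is the capital city of France.",   # 7 words
    check,
)
print(verdict)  # -> A
```

This is why the executor helps most on verifiable domains (counting, formatting, arithmetic): the program's answer is exact where a purely textual judgment is an estimate.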
