[2510.23038] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Summary
The paper presents TIR-Judge, a reinforcement learning framework that enhances Large Language Model (LLM) judges by integrating tool-based reasoning: the judge can write and execute code during evaluation, surpassing strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise) across seven public benchmarks.
Why It Matters
As LLMs increasingly serve as evaluators, enhancing their reasoning capabilities is crucial for reliable assessments. Text-only judges struggle to verify complex constraints or perform accurate computation; by giving the judge access to a code executor, this work makes such checks explicit and verifiable, improving the reliability of LLM-based evaluation.
Key Takeaways
- TIR-Judge integrates a code executor to enhance LLM judges' evaluation capabilities.
- The framework employs diverse training methods and flexible judgment formats.
- TIR-Judge outperforms existing reasoning-based judges on multiple benchmarks.
- The model demonstrates self-evolution through iterative reinforcement learning.
- TIR-Judge-Zero achieves performance comparable to distilled models without requiring prior judge trajectories.
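The core idea of a tool-integrated judge can be sketched as a loop: the judge model either emits a code snippet to verify some property of the candidate responses, or emits a final verdict; tool outputs are appended to the transcript and fed back. The sketch below is a minimal illustration under assumed interfaces (the `model` callable and its `{"code": ...}` / `{"verdict": ...}` protocol are hypothetical, not the paper's actual implementation):

```python
import subprocess
import sys
import tempfile

def run_code(snippet: str, timeout: int = 5) -> str:
    """Execute a Python snippet in a subprocess; return stdout, or stderr on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "TIMEOUT"

def judge(prompt: str, response_a: str, response_b: str, model, max_tool_calls: int = 3) -> str:
    """Alternate between model reasoning and code execution until a verdict appears.

    `model` is a hypothetical callable: transcript -> {"code": str} or {"verdict": "A"|"B"}.
    """
    transcript = f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
    for _ in range(max_tool_calls):
        step = model(transcript)
        if "verdict" in step:
            return step["verdict"]
        # Tool call: run the proposed code and feed the result back to the judge.
        transcript += f"\nTool output: {run_code(step['code'])}"
    return "A"  # fallback if the tool budget is exhausted without a verdict
```

For example, when judging two answers to "compute 17 * 24", the model could emit `print(17 * 24)`, observe `408` in the tool output, and prefer whichever response matches it, rather than relying on intrinsic arithmetic.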
Paper Details
Computer Science > Computation and Language — arXiv:2510.23038 (cs)
Submitted on 27 Oct 2025 (v1); last revised 21 Feb 2026 (this version, v2)
Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
Abstract
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameter...
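To make the three judgment formats named in the abstract concrete: pointwise assigns a score to a single response, pairwise picks the better of two, and listwise ranks a set. As a toy sketch (not the paper's method), the relationship between them can be shown by deriving pairwise and listwise judgments from any pointwise scorer (the `Scorer` type here is hypothetical):

```python
from typing import Callable, List

# Hypothetical pointwise scorer: (prompt, response) -> quality score.
Scorer = Callable[[str, str], float]

def pointwise(score: Scorer, prompt: str, response: str) -> float:
    """Score one response in isolation."""
    return score(prompt, response)

def pairwise(score: Scorer, prompt: str, a: str, b: str) -> str:
    """Prefer the higher-scoring of two responses (ties go to A)."""
    return "A" if score(prompt, a) >= score(prompt, b) else "B"

def listwise(score: Scorer, prompt: str, responses: List[str]) -> List[int]:
    """Return response indices ranked best-first by score."""
    return sorted(range(len(responses)), key=lambda i: -score(prompt, responses[i]))
```

Training a single judge to emit all three formats directly, as TIR-Judge does, avoids the repeated scoring calls this reduction implies and lets the model reason across candidates jointly.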