[2510.23038] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Summary
The paper presents TIR-Judge, a reinforcement learning framework that enhances Large Language Model (LLM) judges by integrating tool-based reasoning: the judge can write and execute code during evaluation, surpassing strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise) across seven public benchmarks.
Why It Matters
As LLMs increasingly serve as evaluators, enhancing their reasoning capabilities is crucial for reliable assessments. Text-only judges struggle to verify complex constraints or perform accurate computation; by giving the judge access to a code executor, this work makes such checks explicit and verifiable, improving the reliability of LLM-based evaluation.
Key Takeaways
- TIR-Judge integrates a code executor to enhance LLM judges' evaluation capabilities.
- The framework employs diverse training methods and flexible judgment formats.
- TIR-Judge outperforms existing reasoning-based judges on multiple benchmarks.
- The model demonstrates self-evolution through iterative reinforcement learning.
- TIR-Judge-Zero achieves performance comparable to distilled models without requiring prior judge trajectories.
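The core idea of a tool-integrated judge can be sketched as a loop: the judge model either emits a code snippet to verify some property of the candidate responses, or emits a final verdict; tool outputs are appended to the transcript and fed back. The sketch below is a minimal illustration under assumed interfaces (the `model` callable and its `{"code": ...}` / `{"verdict": ...}` protocol are hypothetical, not the paper's actual implementation):

```python
import subprocess
import sys
import tempfile

def run_code(snippet: str, timeout: int = 5) -> str:
    """Execute a Python snippet in a subprocess; return stdout, or stderr on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "TIMEOUT"

def judge(prompt: str, response_a: str, response_b: str, model, max_tool_calls: int = 3) -> str:
    """Alternate between model reasoning and code execution until a verdict appears.

    `model` is a hypothetical callable: transcript -> {"code": str} or {"verdict": "A"|"B"}.
    """
    transcript = f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
    for _ in range(max_tool_calls):
        step = model(transcript)
        if "verdict" in step:
            return step["verdict"]
        # Tool call: run the proposed code and feed the result back to the judge.
        transcript += f"\nTool output: {run_code(step['code'])}"
    return "A"  # fallback if the tool budget is exhausted without a verdict
```

For example, when judging two answers to "compute 17 * 24", the model could emit `print(17 * 24)`, observe `408` in the tool output, and prefer whichever response matches it, rather than relying on intrinsic arithmetic.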
Paper Details
Computer Science > Computation and Language — arXiv:2510.23038 (cs)
Submitted on 27 Oct 2025 (v1); last revised 21 Feb 2026 (this version, v2)
Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
Abstract
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameter...
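To make the three judgment formats named in the abstract concrete: pointwise assigns a score to a single response, pairwise picks the better of two, and listwise ranks a set. As a toy sketch (not the paper's method), the relationship between them can be shown by deriving pairwise and listwise judgments from any pointwise scorer (the `Scorer` type here is hypothetical):

```python
from typing import Callable, List

# Hypothetical pointwise scorer: (prompt, response) -> quality score.
Scorer = Callable[[str, str], float]

def pointwise(score: Scorer, prompt: str, response: str) -> float:
    """Score one response in isolation."""
    return score(prompt, response)

def pairwise(score: Scorer, prompt: str, a: str, b: str) -> str:
    """Prefer the higher-scoring of two responses (ties go to A)."""
    return "A" if score(prompt, a) >= score(prompt, b) else "B"

def listwise(score: Scorer, prompt: str, responses: List[str]) -> List[int]:
    """Return response indices ranked best-first by score."""
    return sorted(range(len(responses)), key=lambda i: -score(prompt, responses[i]))
```

Training a single judge to emit all three formats directly, as TIR-Judge does, avoids the repeated scoring calls this reduction implies and lets the model reason across candidates jointly.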