[2510.25860] Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Summary
This article discusses a framework that enhances the reliability of large language model (LLM) raters by inferring thinking traces from label-only annotations, improving human-LLM agreement in evaluations.
Why It Matters
As LLMs become integral to evaluation tasks, ensuring their reliability is crucial. This research presents a method to infer the reasoning behind judgments, which can improve LLM rater performance and yield clearer annotation guidelines, ultimately enhancing the quality of AI evaluations.
Key Takeaways
- The proposed framework infers thinking traces from label-only annotations.
- Improved LLM-human agreement was observed across multiple datasets.
- Refined annotation guidelines enhance consistency across different LLMs.
- The method allows for the extension of label-only corpora into more reliable resources.
- This research supports the practical use of LLMs as proxies for human reasoning.
Paper Details
arXiv:2510.25860 (cs) — Computer Science > Artificial Intelligence
Submitted 29 Oct 2025 (v1); last revised 20 Feb 2026 (this version, v2)
Authors: Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei
Abstract
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical p...
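The rejection-sampling idea from the abstract can be sketched roughly as follows: repeatedly ask a rater model for a thinking trace together with its concluding label, and accept the first trace whose label matches the human annotation. This is a minimal illustration, not the paper's implementation; `sample_trace`, the label names, and the stub rater below are hypothetical stand-ins for an actual LLM call.

```python
import random
from typing import Callable, Optional, Tuple

def infer_trace(
    item: str,
    human_label: str,
    sample_trace: Callable[[str], Tuple[str, str]],
    max_attempts: int = 8,
) -> Optional[str]:
    """Rejection sampling over candidate thinking traces.

    Draws up to `max_attempts` (trace, predicted_label) samples and
    accepts the first trace whose label agrees with the human label.
    Returns None if every sample is rejected.
    """
    for _ in range(max_attempts):
        trace, predicted_label = sample_trace(item)
        if predicted_label == human_label:
            return trace  # accepted: trace is consistent with the human label
    return None  # all samples rejected; item gets no inferred trace

# Stub rater standing in for an LLM call: guesses a label at random
# and emits a trace that ends with that label.
def stub_rater(item: str) -> Tuple[str, str]:
    label = random.choice(["helpful", "unhelpful"])
    return f"Reasoning about {item!r} -> {label}", label

random.seed(0)
trace = infer_trace("example response", "helpful", stub_rater)
# Any accepted trace ends with the matched label by construction.
print(trace is None or trace.endswith("helpful"))  # → True
```

In the paper's setting the accepted traces are then used downstream (fine-tuning open raters, or distilling clearer guidelines); the sketch only covers the acceptance step.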