[2510.25860] Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

arXiv - AI · 4 min read

Summary

This article discusses a framework that enhances the reliability of large language model (LLM) raters by inferring thinking traces from label-only annotations, improving human-LLM agreement in evaluations.

Why It Matters

As LLMs become integral in evaluation tasks, ensuring their reliability is crucial. This research presents a method to infer reasoning behind judgments, which can improve LLM performance and provide clearer guidelines, ultimately enhancing the quality of AI evaluations.

Key Takeaways

  • The proposed framework infers thinking traces from label-only annotations.
  • Improved LLM-human agreement was observed across multiple datasets.
  • Refined annotation guidelines enhance consistency among different LLMs.
  • The method allows for the extension of label-only corpora into more reliable resources.
  • This research supports the practical use of LLMs as proxies for human reasoning.

Computer Science > Artificial Intelligence
arXiv:2510.25860 (cs)
[Submitted on 29 Oct 2025 (v1), last revised 20 Feb 2026 (this version, v2)]

Title: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Authors: Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical p...
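
The rejection-sampling idea described in the abstract can be illustrated with a minimal sketch: sample several candidate thinking traces for a labeled item, and keep only those whose concluding label agrees with the human annotation. The functions `generate_trace` and `infer_traces` below are hypothetical stand-ins (not the paper's actual implementation); in practice, `generate_trace` would be an LLM call that drafts a reasoning trace ending in a label.

```python
def generate_trace(item, seed):
    # Placeholder for an LLM call that drafts one candidate reasoning
    # trace for `item` and ends with a label. Here the concluding label
    # simply alternates with the seed so the sketch is deterministic.
    label = "accept" if seed % 2 == 0 else "reject"
    return (f"candidate trace {seed} for {item!r}", label)

def infer_traces(item, human_label, n_samples=8):
    """Rejection sampling: keep only candidate traces whose implied
    label matches the human's label-only annotation."""
    kept = []
    for seed in range(n_samples):
        trace, implied_label = generate_trace(item, seed)
        if implied_label == human_label:
            kept.append(trace)
    return kept

# Every surviving trace is consistent with the human label by construction.
traces = infer_traces("Is this summary faithful to the source?", "accept")
```

The kept traces could then serve the two uses the paper names: as fine-tuning data for open LLM raters, or as raw material for distilling clearer annotation guidelines.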

Related Articles

Building knowledge bases from YouTube data using LLMs -- my workflow after 52 guides

I've been building a system that turns YouTube channels into structured knowledge bases. Thought I'd share the workflow since Karpathy's ...

Reddit - Artificial Intelligence · 1 min

What is AI, how do apps like ChatGPT work and why are there concerns?

AI is transforming modern life, but some critics worry about its potential misuse and environmental impact.

AI News - General · 7 min

[2603.29957] Think Anywhere in Code Generation

Abstract page for arXiv paper 2603.29957: Think Anywhere in Code Generation

arXiv - Machine Learning · 3 min

[2603.16880] NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning

Abstract page for arXiv paper 2603.16880: NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectr...

arXiv - Machine Learning · 4 min
