[2510.25860] Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Summary
This article discusses a framework that enhances the reliability of large language model (LLM) raters by inferring thinking traces from label-only annotations, improving human-LLM agreement in evaluations.
Why It Matters
As LLMs become integral to evaluation tasks, ensuring their reliability is crucial. This research presents a method to infer the reasoning behind judgments, which can improve LLM rater performance and yield clearer annotation guidelines, ultimately enhancing the quality of AI evaluations.
Key Takeaways
- The proposed framework infers thinking traces from label-only annotations.
- Improved LLM-human agreement was observed across multiple datasets.
- Refined annotation guidelines enhance consistency across different LLMs.
- The method allows for the extension of label-only corpora into more reliable resources.
- This research supports the practical use of LLMs as proxies for human reasoning.
Paper Details
arXiv:2510.25860 (cs) — Computer Science > Artificial Intelligence
Submitted 29 Oct 2025 (v1); last revised 20 Feb 2026 (this version, v2)
Authors: Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei
Abstract
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical p...
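The rejection-sampling idea from the abstract can be sketched roughly as follows: repeatedly ask a rater model for a thinking trace together with its concluding label, and accept the first trace whose label matches the human annotation. This is a minimal illustration, not the paper's implementation; `sample_trace`, the label names, and the stub rater below are hypothetical stand-ins for an actual LLM call.

```python
import random
from typing import Callable, Optional, Tuple

def infer_trace(
    item: str,
    human_label: str,
    sample_trace: Callable[[str], Tuple[str, str]],
    max_attempts: int = 8,
) -> Optional[str]:
    """Rejection sampling over candidate thinking traces.

    Draws up to `max_attempts` (trace, predicted_label) samples and
    accepts the first trace whose label agrees with the human label.
    Returns None if every sample is rejected.
    """
    for _ in range(max_attempts):
        trace, predicted_label = sample_trace(item)
        if predicted_label == human_label:
            return trace  # accepted: trace is consistent with the human label
    return None  # all samples rejected; item gets no inferred trace

# Stub rater standing in for an LLM call: guesses a label at random
# and emits a trace that ends with that label.
def stub_rater(item: str) -> Tuple[str, str]:
    label = random.choice(["helpful", "unhelpful"])
    return f"Reasoning about {item!r} -> {label}", label

random.seed(0)
trace = infer_trace("example response", "helpful", stub_rater)
# Any accepted trace ends with the matched label by construction.
print(trace is None or trace.endswith("helpful"))  # → True
```

In the paper's setting the accepted traces are then used downstream (fine-tuning open raters, or distilling clearer guidelines); the sketch only covers the acceptance step.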