[2602.13110] SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Summary
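The paper presents SCOPE, a framework for selective pairwise evaluation with large language model (LLM) judges. SCOPE lets the judge abstain on uncertain comparisons and calibrates the abstention threshold so that the error rate among accepted judgments stays at or below a user-specified level. It pairs this with an uncertainty score that is invariant to the order in which the two responses are shown.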
Why It Matters
As LLM judges increasingly replace costly human preference labels, their miscalibration and systematic biases become a practical concern. SCOPE addresses this by accepting only the judgments it can support statistically: under exchangeability, it gives a finite-sample guarantee on the error rate among non-abstained judgments.
Key Takeaways
- SCOPE offers a framework for selective pairwise judging with statistical guarantees.
- Introduces Bidirectional Preference Entropy (BPE), an entropy-based uncertainty score computed from judge queries under both response orders.
- Accepts more judgments while still meeting the target risk level.
- Retains good coverage on MT-Bench, RewardBench, and Chatbot Arena, outperforming naive baselines.
- Reduces sensitivity to response order, a systematic bias in LLM judgments.
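To make the terms "risk" and "coverage" in the takeaways concrete, here is a minimal Python sketch of selective judging on a labeled calibration set. It is an illustration under stated assumptions, not the paper's calibration procedure: the function names, the "accept if uncertainty is at or below the threshold" rule, and the conservative `(errors + 1) / (accepted + 1)` estimate are all hypothetical simplifications, and this simple scan does not by itself reproduce the finite-sample guarantee the paper claims.

```python
from typing import Optional, Sequence, Tuple


def selective_risk_and_coverage(
    uncertainty: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (risk, coverage) when accepting items with uncertainty <= threshold.

    risk     = error rate among accepted (non-abstained) judgments
    coverage = fraction of all judgments that are accepted
    """
    accepted = [c for u, c in zip(uncertainty, correct) if u <= threshold]
    if not accepted:
        return 0.0, 0.0
    risk = 1.0 - sum(accepted) / len(accepted)
    coverage = len(accepted) / len(uncertainty)
    return risk, coverage


def calibrate_threshold(
    uncertainty: Sequence[float], correct: Sequence[bool], alpha: float
) -> Optional[float]:
    """Pick the largest threshold whose conservative risk estimate is <= alpha.

    Assumption: the estimate (errors + 1) / (accepted + 1) is an illustrative
    conservative correction, not the bound used in the paper.
    Returns None when no threshold qualifies (abstain on everything).
    """
    pairs = sorted(zip(uncertainty, correct))
    best = None
    errors = 0
    for k, (u, c) in enumerate(pairs, start=1):
        errors += 0 if c else 1
        # Only evaluate a threshold once all items tied at this score are counted.
        if k < len(pairs) and pairs[k][0] == u:
            continue
        if (errors + 1) / (k + 1) <= alpha:
            best = u
    return best


if __name__ == "__main__":
    # Hypothetical calibration data: lower uncertainty tends to mean a correct judgment.
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    labels = [True] * 7 + [False, True, False]
    tau = calibrate_threshold(scores, labels, alpha=0.25)
    print(tau, selective_risk_and_coverage(scores, labels, tau))
```

Raising the threshold accepts more judgments (higher coverage) at the cost of a higher error rate among them, which is the trade-off a better uncertainty signal is meant to improve.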
arXiv:2602.13110 (cs), Computer Science > Computation and Language
Submitted on 13 Feb 2026
Title: SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Authors: Sher Badshah, Ali Emami, Hassan Sajjad
Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage ...
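The three steps the abstract gives for BPE (query both orders, aggregate, convert to entropy) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the abstract does not state the aggregation rule, so simple averaging is an assumption here, as are the function names, the use of base-2 entropy, and the premise that the judge exposes a probability that response A is preferred.

```python
import math
from typing import Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) preference; 0 at p in {0, 1}, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bidirectional_preference_entropy(
    p_a_when_a_first: float, p_a_when_b_first: float
) -> Tuple[float, float]:
    """Return (aggregated P(A preferred), entropy-based uncertainty).

    p_a_when_a_first: judge's probability that A wins when A is shown first.
    p_a_when_b_first: judge's probability that A wins when B is shown first.
    Assumption: the two probabilities are averaged; the result is symmetric in
    the two orderings, so it does not depend on which order was queried first.
    """
    p_a = 0.5 * (p_a_when_a_first + p_a_when_b_first)
    return p_a, binary_entropy(p_a)


if __name__ == "__main__":
    # Hypothetical judge that favors whichever response is shown first.
    print(bidirectional_preference_entropy(0.9, 0.1))
    # Hypothetical judge that prefers A regardless of position.
    print(bidirectional_preference_entropy(0.9, 0.8))
```

In the first call the judge flips its preference with the ordering, so the averaged probability is 0.5 and the uncertainty is at its maximum; a single-order confidence of 0.9 would have hidden that disagreement. Under a selective scheme, such a comparison would be a natural candidate for abstention.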