[2602.13110] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

arXiv - AI 4 min read Article

Summary

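The paper presents SCOPE, a framework for selective pairwise evaluation with large language model (LLM) judges. SCOPE lets the judge abstain on uncertain comparisons and calibrates the acceptance threshold so that the error rate among accepted judgments stays at or below a user-specified level. It pairs this with Bidirectional Preference Entropy (BPE), an uncertainty score designed to be invariant to the order in which the two responses are shown.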

Why It Matters

As LLMs increasingly replace human annotators in pairwise evaluation, their miscalibration and systematic biases become a reliability problem. SCOPE addresses this by accepting only the judgments it can back with a finite-sample statistical guarantee and abstaining on the rest, so users can set the error level they are willing to tolerate.

Key Takeaways

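  • SCOPE is a framework for selective pairwise judging with finite-sample statistical guarantees under exchangeability.
  • Introduces Bidirectional Preference Entropy (BPE), an order-invariant uncertainty score computed by querying the judge under both response positions.
  • Calibrates an acceptance threshold so that the error rate among non-abstained judgments is at most a user-specified level.
  • Meets the target risk level while retaining good coverage on MT-Bench, RewardBench, and Chatbot Arena, outperforming naive baselines.
  • Targets the miscalibration and systematic biases of LLM judges, such as sensitivity to response order.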

Computer Science > Computation and Language

arXiv:2602.13110 (cs)
[Submitted on 13 Feb 2026]

Title: SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Authors: Sher Badshah, Ali Emami, Hassan Sajjad

Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage ...
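The two ideas in the abstract can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the two preference probabilities are aggregated, so the sketch assumes a simple average. The threshold rule is also a stand-in: it uses a conservative empirical estimate, (errors + 1) / (accepted + 1), and does not by itself reproduce the paper's finite-sample guarantee. The function names and the toy calibration data are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bidirectional_preference_entropy(
    p_a_shown_first: float, p_a_shown_second: float
) -> Tuple[float, float]:
    """Return (aggregated P(A preferred), entropy-based uncertainty).

    Both inputs are the judge's probability that response A is better,
    obtained once with A in the first position and once with A in the second.
    ASSUMPTION: the two probabilities are combined by a plain average.
    """
    p = 0.5 * (p_a_shown_first + p_a_shown_second)
    return p, binary_entropy(p)


def calibrate_threshold(
    scores: List[float], correct: List[bool], alpha: float
) -> Optional[float]:
    """Pick the largest uncertainty threshold whose conservative error
    estimate on the calibration set is at most alpha.

    A judgment is accepted when its score is <= the returned threshold.
    Returns None when no threshold qualifies (abstain on everything).
    ASSUMPTION: (errors + 1) / (accepted + 1) is an illustrative rule only.
    """
    pairs = sorted(zip(scores, correct))
    best = None
    errors = 0
    for k, (score, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        # Evaluate only at the last item of a run of tied scores.
        if k < len(pairs) and pairs[k][0] == score:
            continue
        if (errors + 1) / (k + 1) <= alpha:
            best = score
    return best


# Toy calibration set (made-up numbers): low uncertainty tends to be correct.
cal_scores = [0.1, 0.2, 0.3, 0.9, 1.0]
cal_correct = [True, True, True, False, False]
threshold = calibrate_threshold(cal_scores, cal_correct, alpha=0.3)

# A new comparison: the judge favors A strongly in one order, weakly in the other.
p_agg, uncertainty = bidirectional_preference_entropy(0.9, 0.5)
accept = threshold is not None and uncertainty <= threshold
```

Because the average does not depend on which ordering is queried first, the score is unchanged when the two orderings are swapped, and because binary entropy is symmetric around 0.5, it is also unchanged when the labels A and B are exchanged. In the toy example the judge disagrees with itself across orderings, so the uncertainty is high and the comparison is abstained on.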
