[2602.15481] LLM-as-Judge on a Budget
Summary
The paper presents an approach to efficiently evaluating large language models (LLMs) under a fixed query budget, using multi-armed bandit theory to minimize estimation error when scoring prompt-response pairs.
Why It Matters
As LLMs become integral to a growing range of applications, optimizing their evaluation is crucial for ensuring reliability and safety. This research provides a framework that makes LLM-as-judge assessments more query-efficient, which matters for AI alignment and automated evaluation in practice.
Key Takeaways
- Introduces a variance-adaptive method for LLM evaluation.
- Utilizes multi-armed bandit theory for optimal query allocation.
- Demonstrates significant error reduction compared to uniform allocation.
- Establishes a theoretical foundation for efficient LLM assessments.
- Highlights implications for AI safety and model alignment.
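The variance-adaptive allocation idea in the takeaways can be sketched as follows. This is a minimal illustration of the general principle (spend remaining budget where the estimated per-pair error is largest), not the authors' exact algorithm; the judge score distributions below are invented for the example.

```python
# Sketch of variance-adaptive query allocation: after a short uniform
# warm-up, each remaining query goes to the pair whose current estimated
# standard error term (sample variance / n_i) is largest.
import random
import statistics

def adaptive_allocate(score_fns, budget, warmup=5, seed=0):
    """Return estimated mean scores and query counts per pair."""
    rng = random.Random(seed)
    k = len(score_fns)
    samples = [[] for _ in range(k)]
    # Warm-up: query every pair a few times to seed variance estimates.
    for i in range(k):
        for _ in range(warmup):
            samples[i].append(score_fns[i](rng))
    # Adaptive phase: spend the rest of the budget where uncertainty is highest.
    for _ in range(budget - k * warmup):
        crit = [statistics.pvariance(s) / len(s) for s in samples]
        i = max(range(k), key=lambda j: crit[j])
        samples[i].append(score_fns[i](rng))
    means = [statistics.mean(s) for s in samples]
    counts = [len(s) for s in samples]
    return means, counts

# Hypothetical judge scores: pair 0 is noisy, pair 1 nearly deterministic.
judges = [lambda rng: rng.gauss(0.6, 0.5), lambda rng: rng.gauss(0.8, 0.05)]
means, counts = adaptive_allocate(judges, budget=200)
# The noisy pair ends up receiving most of the budget.
```

Under uniform allocation each pair would get 100 queries; the adaptive rule instead concentrates queries on the high-variance pair, which is exactly where extra samples reduce the worst-case estimation error.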
Computer Science > Machine Learning
arXiv:2602.15481 (cs)
[Submitted on 17 Feb 2026]
Title: LLM-as-Judge on a Budget
Authors: Aadirupa Saha, Aniket Wagde, Branislav Kveton
Abstract: LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how should queries be allocated across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, where $\sigma_i^2$ is the unknown score variance for pair $i \in [K]$, with near-optimal budget allocation. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a t...
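One back-of-envelope way to see why the bound takes this form (a sketch, not the paper's proof, which must also handle the $\sigma_i^2$ being unknown): allocating $b_i \propto \sigma_i^2$ equalizes the per-pair standard error. With
$$b_i = B \cdot \frac{\sigma_i^2}{\sum_{j=1}^K \sigma_j^2},$$
the standard error of the mean-score estimate for every pair $i$ is
$$\frac{\sigma_i}{\sqrt{b_i}} = \sqrt{\frac{\sum_{j=1}^K \sigma_j^2}{B}},$$
so the worst case over pairs matches the stated rate, with the $\tilde{O}$ absorbing logarithmic factors from estimating the variances adaptively.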