[2602.15481] LLM-as-Judge on a Budget
Summary
The paper presents an approach to efficiently evaluating large language models (LLMs) under a fixed query budget, using multi-armed bandit theory to minimize estimation error when scoring prompt-response pairs.
Why It Matters
As LLMs become integral to a growing range of applications, optimizing their evaluation is crucial for ensuring reliability and safety. This research provides a framework that makes LLM-as-judge assessments more query-efficient, which matters for AI alignment and automated evaluation in practice.
Key Takeaways
- Introduces a variance-adaptive method for LLM evaluation.
- Utilizes multi-armed bandit theory for optimal query allocation.
- Demonstrates significant error reduction compared to uniform allocation.
- Establishes a theoretical foundation for efficient LLM assessments.
- Highlights implications for AI safety and model alignment.
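The variance-adaptive allocation idea in the takeaways can be sketched as follows. This is a minimal illustration of the general principle (spend remaining budget where the estimated per-pair error is largest), not the authors' exact algorithm; the judge score distributions below are invented for the example.

```python
# Sketch of variance-adaptive query allocation: after a short uniform
# warm-up, each remaining query goes to the pair whose current estimated
# standard error term (sample variance / n_i) is largest.
import random
import statistics

def adaptive_allocate(score_fns, budget, warmup=5, seed=0):
    """Return estimated mean scores and query counts per pair."""
    rng = random.Random(seed)
    k = len(score_fns)
    samples = [[] for _ in range(k)]
    # Warm-up: query every pair a few times to seed variance estimates.
    for i in range(k):
        for _ in range(warmup):
            samples[i].append(score_fns[i](rng))
    # Adaptive phase: spend the rest of the budget where uncertainty is highest.
    for _ in range(budget - k * warmup):
        crit = [statistics.pvariance(s) / len(s) for s in samples]
        i = max(range(k), key=lambda j: crit[j])
        samples[i].append(score_fns[i](rng))
    means = [statistics.mean(s) for s in samples]
    counts = [len(s) for s in samples]
    return means, counts

# Hypothetical judge scores: pair 0 is noisy, pair 1 nearly deterministic.
judges = [lambda rng: rng.gauss(0.6, 0.5), lambda rng: rng.gauss(0.8, 0.05)]
means, counts = adaptive_allocate(judges, budget=200)
# The noisy pair ends up receiving most of the budget.
```

Under uniform allocation each pair would get 100 queries; the adaptive rule instead concentrates queries on the high-variance pair, which is exactly where extra samples reduce the worst-case estimation error.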
Computer Science > Machine Learning
arXiv:2602.15481 (cs)
[Submitted on 17 Feb 2026]
Title: LLM-as-Judge on a Budget
Authors: Aadirupa Saha, Aniket Wagde, Branislav Kveton
Abstract: LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how should queries be allocated across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, where $\sigma_i^2$ is the unknown score variance for pair $i \in [K]$, with near-optimal budget allocation. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a t...
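One back-of-envelope way to see why the bound takes this form (a sketch, not the paper's proof, which must also handle the $\sigma_i^2$ being unknown): allocating $b_i \propto \sigma_i^2$ equalizes the per-pair standard error. With
$$b_i = B \cdot \frac{\sigma_i^2}{\sum_{j=1}^K \sigma_j^2},$$
the standard error of the mean-score estimate for every pair $i$ is
$$\frac{\sigma_i}{\sqrt{b_i}} = \sqrt{\frac{\sum_{j=1}^K \sigma_j^2}{B}},$$
so the worst case over pairs matches the stated rate, with the $\tilde{O}$ absorbing logarithmic factors from estimating the variances adaptively.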