[2602.21585] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Summary
The paper presents Duel-Evolve, an evolutionary optimization algorithm that improves large language model (LLM) outputs at test time using pairwise self-preferences elicited from the model itself, rather than a traditional scalar reward model.
Why It Matters
Duel-Evolve addresses a key limitation of existing test-time optimization methods: their reliance on scalar rewards, which can be unreliable, sparse, or simply unavailable. By eliciting pairwise comparisons from the LLM itself, the approach enables test-time scaling without any external supervision signal.
Key Takeaways
- Duel-Evolve replaces external scalar rewards with pairwise preferences from LLMs.
- The method shows significant accuracy improvements over existing techniques.
- It operates without the need for ground-truth labels or hand-crafted scoring functions.
- Aggregates noisy comparisons with a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality.
- Allocates the comparison budget via Double Thompson Sampling, steering duels toward plausible optima.
- Demonstrates effective optimization over large, discrete output spaces.
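The Bradley-Terry aggregation mentioned in the takeaways can be sketched in a few lines. Here is a minimal, hypothetical implementation (the function name `bradley_terry_map` and all hyperparameters are illustrative, not from the paper): it computes a MAP estimate of latent quality scores from noisy (winner, loser) pairs under a Gaussian prior, which is one simple way to get the regularized, uncertainty-aware scores the paper describes.

```python
import math

def bradley_terry_map(n, comparisons, prior_var=1.0, lr=0.1, steps=500):
    """MAP estimate of latent Bradley-Terry quality scores.

    Under the Bradley-Terry model, P(i beats j) = sigmoid(s_i - s_j).
    We place a Gaussian prior N(0, prior_var) on each score and run
    gradient ascent on the log-posterior.  `comparisons` is a list of
    (winner, loser) index pairs over n candidates.
    """
    s = [0.0] * n
    for _ in range(steps):
        # Gradient of the Gaussian log-prior: -s_i / prior_var.
        grad = [-si / prior_var for si in s]
        for w, l in comparisons:
            # p = P(w beats l) under the current scores.
            p = 1.0 / (1.0 + math.exp(s[l] - s[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        s = [si + lr * gi for si, gi in zip(s, grad)]
    return s
```

For example, if candidate 0 wins most of its duels against candidates 1 and 2, its estimated score comes out highest; a fully Bayesian variant would additionally keep a posterior over the scores rather than a point estimate.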
Computer Science > Machine Learning
arXiv:2602.21585 (cs) [Submitted on 25 Feb 2026]
Title: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Authors: Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality pare...
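The abstract's comparison-budget allocation via Double Thompson Sampling can be illustrated with a simplified dueling-bandit sketch. This is a hypothetical implementation (the class name `DoubleThompson` and its interface are ours, and the full D-TS algorithm additionally prunes candidates with confidence intervals, which is omitted here): each pairwise win probability gets a Beta posterior, the first duelist is chosen by a sampled Copeland score, and the second is the strongest sampled challenger to the first.

```python
import random

class DoubleThompson:
    """Simplified Double Thompson Sampling for dueling bandits.

    wins[i][j] counts how often candidate i beat candidate j; each
    pairwise win probability gets a Beta(wins[i][j]+1, wins[j][i]+1)
    posterior (uniform prior).
    """

    def __init__(self, n, seed=0):
        self.n = n
        self.wins = [[0] * n for _ in range(n)]
        self.rng = random.Random(seed)

    def _sample_prob(self, i, j):
        # Draw a plausible value of P(i beats j) from its Beta posterior.
        return self.rng.betavariate(self.wins[i][j] + 1, self.wins[j][i] + 1)

    def select_duel(self):
        # First duelist: maximize the sampled Copeland score, i.e. the
        # number of opponents beaten with sampled probability > 1/2.
        theta = [[0.5 if i == j else self._sample_prob(i, j)
                  for j in range(self.n)] for i in range(self.n)]
        copeland = [sum(theta[i][j] > 0.5 for j in range(self.n) if j != i)
                    for i in range(self.n)]
        best = max(copeland)
        a1 = self.rng.choice([i for i, c in enumerate(copeland) if c == best])
        # Second duelist: resample match-ups against a1 independently
        # and pick the strongest challenger.
        a2 = max((j for j in range(self.n) if j != a1),
                 key=lambda j: self._sample_prob(j, a1))
        return a1, a2

    def update(self, winner, loser):
        self.wins[winner][loser] += 1
```

In a test-time scaling loop, `select_duel` would pick which two candidate outputs to show the LLM judge next, and `update` would record its stated preference, concentrating the comparison budget on plausible optima as the abstract describes.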