[2602.21585] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Summary
The paper presents Duel-Evolve, an evolutionary optimization algorithm that improves large language model (LLM) outputs at test time using pairwise self-preferences elicited from the model itself, rather than a traditional scalar reward model.
Why It Matters
Duel-Evolve addresses a key limitation of existing test-time optimization methods: their reliance on scalar rewards, which can be unreliable, sparse, or simply unavailable. By eliciting pairwise comparisons from the LLM itself, the approach enables test-time scaling without any external supervision signal.
Key Takeaways
- Duel-Evolve replaces external scalar rewards with pairwise preferences from LLMs.
- The method shows significant accuracy improvements over existing techniques.
- It operates without the need for ground-truth labels or hand-crafted scoring functions.
- Aggregates noisy comparisons with a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality.
- Allocates the comparison budget via Double Thompson Sampling, steering duels toward plausible optima.
- Demonstrates effective optimization over large, discrete output spaces.
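The Bradley-Terry aggregation mentioned in the takeaways can be sketched in a few lines. Here is a minimal, hypothetical implementation (the function name `bradley_terry_map` and all hyperparameters are illustrative, not from the paper): it computes a MAP estimate of latent quality scores from noisy (winner, loser) pairs under a Gaussian prior, which is one simple way to get the regularized, uncertainty-aware scores the paper describes.

```python
import math

def bradley_terry_map(n, comparisons, prior_var=1.0, lr=0.1, steps=500):
    """MAP estimate of latent Bradley-Terry quality scores.

    Under the Bradley-Terry model, P(i beats j) = sigmoid(s_i - s_j).
    We place a Gaussian prior N(0, prior_var) on each score and run
    gradient ascent on the log-posterior.  `comparisons` is a list of
    (winner, loser) index pairs over n candidates.
    """
    s = [0.0] * n
    for _ in range(steps):
        # Gradient of the Gaussian log-prior: -s_i / prior_var.
        grad = [-si / prior_var for si in s]
        for w, l in comparisons:
            # p = P(w beats l) under the current scores.
            p = 1.0 / (1.0 + math.exp(s[l] - s[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        s = [si + lr * gi for si, gi in zip(s, grad)]
    return s
```

For example, if candidate 0 wins most of its duels against candidates 1 and 2, its estimated score comes out highest; a fully Bayesian variant would additionally keep a posterior over the scores rather than a point estimate.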
Computer Science > Machine Learning
arXiv:2602.21585 (cs) [Submitted on 25 Feb 2026]
Title: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Authors: Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality pare...
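The abstract's comparison-budget allocation via Double Thompson Sampling can be illustrated with a simplified dueling-bandit sketch. This is a hypothetical implementation (the class name `DoubleThompson` and its interface are ours, and the full D-TS algorithm additionally prunes candidates with confidence intervals, which is omitted here): each pairwise win probability gets a Beta posterior, the first duelist is chosen by a sampled Copeland score, and the second is the strongest sampled challenger to the first.

```python
import random

class DoubleThompson:
    """Simplified Double Thompson Sampling for dueling bandits.

    wins[i][j] counts how often candidate i beat candidate j; each
    pairwise win probability gets a Beta(wins[i][j]+1, wins[j][i]+1)
    posterior (uniform prior).
    """

    def __init__(self, n, seed=0):
        self.n = n
        self.wins = [[0] * n for _ in range(n)]
        self.rng = random.Random(seed)

    def _sample_prob(self, i, j):
        # Draw a plausible value of P(i beats j) from its Beta posterior.
        return self.rng.betavariate(self.wins[i][j] + 1, self.wins[j][i] + 1)

    def select_duel(self):
        # First duelist: maximize the sampled Copeland score, i.e. the
        # number of opponents beaten with sampled probability > 1/2.
        theta = [[0.5 if i == j else self._sample_prob(i, j)
                  for j in range(self.n)] for i in range(self.n)]
        copeland = [sum(theta[i][j] > 0.5 for j in range(self.n) if j != i)
                    for i in range(self.n)]
        best = max(copeland)
        a1 = self.rng.choice([i for i, c in enumerate(copeland) if c == best])
        # Second duelist: resample match-ups against a1 independently
        # and pick the strongest challenger.
        a2 = max((j for j in range(self.n) if j != a1),
                 key=lambda j: self._sample_prob(j, a1))
        return a1, a2

    def update(self, winner, loser):
        self.wins[winner][loser] += 1
```

In a test-time scaling loop, `select_duel` would pick which two candidate outputs to show the LLM judge next, and `update` would record its stated preference, concentrating the comparison budget on plausible optima as the abstract describes.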