[2602.21585] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

arXiv - AI · 4 min read

Summary

The paper presents Duel-Evolve, an innovative algorithm that optimizes large language model outputs at test time using pairwise self-preferences instead of traditional reward models.

Why It Matters

Duel-Evolve addresses a limitation of existing test-time optimization methods: they rely on scalar rewards, which can be unreliable, sparse, or simply unavailable. By eliciting pairwise comparisons from the LLM itself, the approach improves test-time scaling and accuracy without any external supervision, making it applicable in settings where no calibrated evaluator exists.

Key Takeaways

  • Duel-Evolve replaces external scalar rewards with pairwise preferences from LLMs.
  • The method shows significant accuracy improvements over existing techniques.
  • It operates without the need for ground-truth labels or hand-crafted scoring functions.
  • Utilizes Bayesian Bradley-Terry models for uncertainty-aware candidate quality estimates.
  • Demonstrates effective optimization in large, discrete output spaces.
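To make the Bradley-Terry takeaway concrete, here is a minimal sketch of how noisy pairwise duel outcomes can be aggregated into uncertainty-aware quality estimates. This is not the paper's implementation: the function name, the MAP fit via gradient ascent, and the diagonal Laplace approximation for uncertainty are illustrative assumptions.

```python
import math

def fit_bayesian_bt(n, duels, iters=500, lr=0.1, prior_var=1.0):
    """Illustrative Bayesian Bradley-Terry fit (not from the paper).

    n      -- number of candidates
    duels  -- list of (winner, loser) index pairs
    Returns MAP skill estimates and approximate posterior std devs
    from a diagonal Laplace approximation around the MAP.
    """
    s = [0.0] * n
    for _ in range(iters):
        # Gradient of the Gaussian prior log-density
        grad = [-si / prior_var for si in s]
        for w, l in duels:
            # d/ds_w log sigmoid(s_w - s_l) = 1 - sigmoid(s_w - s_l)
            p = 1.0 / (1.0 + math.exp(s[w] - s[l]))
            grad[w] += p
            grad[l] -= p
        s = [si + lr * g for si, g in zip(s, grad)]
    # Curvature of the negative log posterior: prior precision
    # plus p(1-p) per duel involving each candidate
    curv = [1.0 / prior_var] * n
    for w, l in duels:
        p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
        curv[w] += p * (1.0 - p)
        curv[l] += p * (1.0 - p)
    std = [1.0 / math.sqrt(c) for c in curv]
    return s, std
```

For example, a candidate that consistently wins its duels ends up with a higher skill estimate, while candidates with few duels retain wider posterior uncertainty.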

Computer Science > Machine Learning · arXiv:2602.21585 (cs)
[Submitted on 25 Feb 2026]

Title: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Authors: Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei

Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality pare...
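The abstract's Double Thompson Sampling step can be sketched as follows. This is a simplified variant under an assumed Gaussian skill posterior per candidate (the paper's exact formulation may differ): two independent posterior draws select the duel pair, so comparison budget concentrates on plausible optima while uncertain challengers still get sampled.

```python
import random

def double_thompson_pair(mean, std, rng=random):
    """Simplified Double Thompson Sampling over Gaussian skill
    posteriors (illustrative, not the paper's exact algorithm).

    mean, std -- per-candidate posterior mean and std dev
    Returns the indices of the two candidates to duel next.
    """
    n = len(mean)
    # First draw: sample skills, pick the apparent best candidate
    draw1 = [rng.gauss(m, s) for m, s in zip(mean, std)]
    first = max(range(n), key=lambda i: draw1[i])
    # Second, independent draw: pick the strongest challenger,
    # so uncertain candidates can still be selected for a duel
    draw2 = [rng.gauss(m, s) for m, s in zip(mean, std)]
    second = max((i for i in range(n) if i != first),
                 key=lambda i: draw2[i])
    return first, second
```

Repeating this selection, running the chosen duel with the LLM judge, and refitting the Bradley-Terry posterior yields the iterative propose-compare-refine loop the abstract describes.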

Related Articles

[2603.17839] How do LLMs Compute Verbal Confidence
arXiv - AI · 4 min

[2603.15970] 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
arXiv - AI · 4 min

[2603.10062] Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead
arXiv - AI · 3 min

[2603.09085] Not All News Is Equal: Topic- and Event-Conditional Sentiment from Finetuned LLMs for Aluminum Price Forecasting
arXiv - AI · 4 min