[2602.21189] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Summary
The paper analyzes the trade-off between the Pass@k and Pass@1 performance metrics in large language models, showing how optimizing for Pass@k can degrade Pass@1 through prompt interference.
Why It Matters
Understanding the relationship between Pass@k and Pass@1 is crucial for optimizing large language models, especially in applications where single-shot accuracy is essential. This research highlights the complexities of model fine-tuning and the potential pitfalls of current optimization strategies.
Key Takeaways
- Pass@k optimization can lead to a decrease in Pass@1 performance.
- The trade-off is significant for applications requiring reliable single-shot responses.
- Prompt interference is a key factor in the observed performance degradation.
- The study provides a theoretical framework for understanding these dynamics.
- Experiments validate the theoretical findings in the context of mathematical reasoning tasks.
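For concreteness, pass@k is commonly estimated from n sampled completions with the standard unbiased combinatorial estimator (a widely used evaluation convention, not something specific to this paper) — a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples, c of which pass the verifier.

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that a uniformly
    random subset of k of the n samples contains at least one passing solution.
    """
    if n - c < k:
        return 1.0  # fewer than k failing samples: every size-k subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per prompt, 3 of which pass the verifier.
print(pass_at_k(10, 3, 1))  # pass@1 = 0.3
print(pass_at_k(10, 3, 5))  # pass@5 ≈ 0.917
```

Note how sharply the metric rises with k even at modest single-sample accuracy, which is exactly why multi-sample inference is attractive — and why pass@1 can silently erode while pass@k still looks healthy.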
Computer Science > Machine Learning · arXiv:2602.21189 (cs) · Submitted on 24 Feb 2026
Authors: Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi
Abstract: Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of $k$ independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@$k$. However, prior work reports a recurring trade-off: pass@$k$ improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@$k$ policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@$k$ policy gradients can conflict with pass@1 gradients because pass@$k$ optimization implicitly reweights prompts toward low-success prompts; when these pro...
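The reweighting effect the abstract describes can be seen in closed form: if a prompt's single-sample success probability is p, then pass@k = 1 - (1-p)^k, whose gradient with respect to p is k(1-p)^(k-1). This weight grows as p shrinks, so pass@k optimization concentrates updates on low-success prompts. A toy numeric sketch (illustrative only, not the paper's actual objective or proof):

```python
def pass_at_k_smooth(p: float, k: int) -> float:
    """Expected pass@k for a prompt with per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

def gradient_weight(p: float, k: int) -> float:
    """d/dp pass@k = k * (1 - p)^(k - 1): the implicit per-prompt weight."""
    return k * (1.0 - p) ** (k - 1)

k = 8
for p in (0.05, 0.3, 0.6, 0.9):
    print(f"p={p:.2f}  pass@{k}={pass_at_k_smooth(p, k):.3f}  "
          f"weight={gradient_weight(p, k):.4g}")
# For k=8, the weight at p=0.05 is ~5.6 while at p=0.9 it is ~8e-7: the
# pass@k gradient is dominated by hard (low-p) prompts. If those prompts'
# updates conflict with the ones that keep easy prompts accurate, pass@1
# can fall -- the prompt-interference mechanism the abstract points to.
```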