[2602.21420] Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Summary
This paper introduces the Asymmetric Confidence-aware Error Penalty (ACE), a reinforcement learning method that penalizes overconfident errors more strongly so they no longer suppress valid exploratory trajectories.
Why It Matters
The research highlights a critical flaw in existing reinforcement learning methods that penalize errors uniformly, which can hinder model performance. By proposing ACE, the authors aim to improve reasoning in large language models, making this work significant for advancements in AI and machine learning.
Key Takeaways
- Current reinforcement learning methods fail to differentiate between types of errors, allowing overconfident mistakes to persist.
- The proposed ACE method introduces a dynamic penalty system that adjusts based on the confidence of errors.
- ACE has been tested on multiple model families and consistently improves performance across various benchmarks.
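To make the idea in the takeaways concrete, here is a minimal sketch of a confidence-aware asymmetric penalty. The paper's exact metric and formula are not fully reproduced in this summary, so the function name, the use of mean token log-probability as a confidence proxy, and the `alpha` scaling are illustrative assumptions, not the authors' formulation.

```python
import math

def asymmetric_error_penalty(rollout_logprobs, is_correct,
                             base_penalty=1.0, alpha=1.0):
    """Illustrative sketch (not the paper's exact ACE formula):
    incorrect rollouts on which the model was more confident
    receive a larger penalty; correct rollouts get none.

    rollout_logprobs: per-rollout lists of token log-probabilities
    is_correct: per-rollout verifier outcomes
    """
    penalties = []
    for logps, correct in zip(rollout_logprobs, is_correct):
        if correct:
            penalties.append(0.0)
            continue
        # Confidence proxy: geometric-mean token probability (assumption).
        confidence = math.exp(sum(logps) / len(logps))
        # Overconfident errors are penalized more than uncertain ones.
        penalties.append(base_penalty * (1.0 + alpha * confidence))
    return penalties
```

For example, an incorrect rollout generated with high token probabilities ends up with a larger penalty than an incorrect rollout the model was unsure about, which is the asymmetry the paper argues is missing from uniform-penalty RLVR schemes.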
Computer Science > Machine Learning
arXiv:2602.21420 (cs) [Submitted on 24 Feb 2026]
Title: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(p...