[2603.11321] Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Computer Science > Machine Learning

arXiv:2603.11321 (cs)

[Submitted on 11 Mar 2026 (v1), last revised 4 Apr 2026 (this version, v2)]

Title: Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

Authors: Yuning Wu, Ke Wang, Devin Chen, Kai Wei

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves \textit{asymptotic consistency}: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures...
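The abstract only names the gating rule, so the following is a minimal, hypothetical Python sketch of how a Thompson-sampling-style gate for teacher injection could behave, assuming a per-task Beta-Bernoulli posterior over the policy's success rate. The class and every name in it (ThompsonGate, should_inject, update, the 0.5 threshold) are illustrative assumptions, not the paper's implementation.

```python
import random


class ThompsonGate:
    """Hypothetical sketch of a Thompson-sampling-style injection gate.

    Per task, we track a Beta(alpha, beta) posterior over the policy's
    success probability. Before each rollout group we sample a success
    rate from the posterior and inject a teacher demonstration only when
    the sampled rate is low, i.e., when the policy would likely fail.
    """

    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha = alpha  # pseudo-count of observed successes
        self.beta = beta    # pseudo-count of observed failures

    def should_inject(self) -> bool:
        # Sample a plausible success rate from the posterior and gate
        # the teacher demo on the event "likely failure".
        p = random.betavariate(self.alpha, self.beta)
        return p < 0.5  # hypothetical threshold, not from the paper

    def update(self, group_successes: int, group_size: int) -> None:
        # Fold one rollout group's verifiable rewards back into the
        # posterior so the curriculum paces itself.
        self.alpha += group_successes
        self.beta += group_size - group_successes
```

Under these assumptions, successes accumulate in alpha, the sampled success rate drifts upward as the policy improves, and the injection probability decays toward zero, which mirrors the annealing-to-on-policy behavior the abstract describes.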