[2509.02522] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Summary
The paper presents PACS, a novel framework for Reinforcement Learning with Verifiable Rewards (RLVR) that addresses sparse reward signals and unstable policy updates by reformulating the problem as a supervised learning task.
Why It Matters
This research proposes a new method to improve the stability and efficiency of RLVR training, which is crucial for strengthening the performance of large language models on complex reasoning tasks such as mathematics and programming. The findings could benefit AI applications that depend on reliable, verifiable reward signals.
Key Takeaways
- PACS reformulates RLVR into a supervised learning task for better stability.
- The framework shows significant performance improvements over existing models.
- It addresses challenges like sparse rewards and unstable policy updates.
- A gradient analysis shows the supervised formulation recovers the classical policy gradient update.
- Open-source code and data are available for further research and application.
Computer Science > Computation and Language
arXiv:2509.02522 (cs)
[Submitted on 2 Sep 2025 (v1), last revised 16 Feb 2026 (this version, v2)]
Title: Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, inherent to RL-based approaches. To address the challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive exp...
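The core idea in the abstract, treating the verifiable 0/1 outcome reward as a supervised label for a score derived from the policy, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the score is a sigmoid of the policy's sequence log-probability, and uses plain NumPy in place of an autodiff framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def supervised_rlvr_loss(logprob_sum, reward):
    """Binary cross-entropy between a sigmoid score of the policy's
    sequence log-probability and the verifiable 0/1 outcome reward.

    logprob_sum: scalar, sum of token log-probs under the policy
                 (hypothetical choice of score input).
    reward:      0.0 or 1.0, the verifiable outcome label.
    """
    p = sigmoid(logprob_sum)
    return -(reward * np.log(p) + (1.0 - reward) * np.log(1.0 - p))

def loss_grad_wrt_logprob(logprob_sum, reward):
    """Gradient of the BCE loss w.r.t. the score input z:
        d/dz BCE(sigmoid(z), r) = sigmoid(z) - r.
    Since z here is the policy log-probability, the overall gradient is an
    advantage-like weight (score - reward) times the classical
    policy-gradient term, illustrating how a supervised loss can recover
    a policy-gradient-shaped update."""
    return sigmoid(logprob_sum) - reward
```

In this toy setup, a correct answer (reward = 1) yields a negative gradient on the log-probability, pushing the policy to raise the probability of that sequence, and an incorrect answer (reward = 0) pushes it down; the actual score parameterization in PACS may differ.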