[2603.21383] PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
Computer Science > Artificial Intelligence
arXiv:2603.21383 (cs)
[Submitted on 22 Mar 2026]

Title: PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
Authors: Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, Ritu Gala, Charles Wang, Sugam Dipak Devare, Khushi Bhardwaj, Abhibha Gupta, Oleksii Kuchaiev, Jiantao Jiao, Jian Zhang, Venkat Srinivasan

Abstract: Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots, informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient nor...
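The abstract's two mechanisms can be pictured with a small sketch. The helpers below (`sample_action`, `rollout_outcome`, `execute`, the threshold value, and the trajectory format) are illustrative assumptions based only on the abstract's description, not the paper's actual API or algorithm:

```python
import random
import statistics

def find_pivots(trajectory, sample_action, rollout_outcome,
                num_samples=8, var_threshold=0.15):
    """Return indices of intermediate turns whose sampled actions show
    high variance in downstream outcomes -- "pivots" in the abstract's
    terminology. All helper signatures here are hypothetical."""
    pivots = []
    for t, state in enumerate(trajectory):
        # Local on-policy rollouts: sample several candidate actions at
        # this turn and score each one's outcome.
        outcomes = [rollout_outcome(state, sample_action(state))
                    for _ in range(num_samples)]
        # Keep turns where outcomes disagree strongly (informative turns).
        if statistics.pvariance(outcomes) > var_threshold:
            pivots.append(t)
    return pivots

def functional_reward(action, demo_action, execute):
    """Reward based on functional equivalence: the sampled action earns
    full reward if executing it yields the same effect as the SFT
    demonstration, regardless of surface-string differences."""
    return 1.0 if execute(action) == execute(demo_action) else 0.0

# Toy usage: even-numbered turns flip a fair coin (high outcome
# variance -> pivot candidates); odd turns always succeed (zero
# variance -> filtered out).
random.seed(0)
traj = list(range(6))
pivots = find_pivots(
    traj,
    sample_action=lambda s: None,
    rollout_outcome=lambda s, a: random.random() < 0.5 if s % 2 == 0 else 1.0,
)
```

Under this reading, zero-variance turns contribute no learning signal and are skipped, which is how the local rollouts stay cheap relative to full E2E RL.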