[2510.25992] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Summary
The paper presents Supervised Reinforcement Learning (SRL), a framework that enhances reasoning in Large Language Models (LLMs) by reformulating problem solving as generating a sequence of logical actions, with step-wise rewards based on similarity to expert actions, improving performance on complex multi-step tasks.
Why It Matters
This research addresses the limitations of existing reinforcement learning methods in LLMs, particularly in multi-step reasoning tasks. By introducing SRL, the authors provide a new approach that could significantly improve the training and effectiveness of AI models in various applications, including software engineering.
Key Takeaways
- SRL reformulates problem-solving as generating sequences of logical actions.
- It provides smoother rewards based on expert actions, enhancing learning signals.
- SRL allows small models to tackle previously unsolvable problems.
- Initializing with SRL before RLVR leads to superior performance.
- The framework generalizes well to agentic software engineering tasks.
Computer Science > Computation and Language
arXiv:2510.25992 (cs)
[Submitted on 29 Oct 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Title: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Authors: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee
Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables sm...
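The step-wise reward the abstract describes can be illustrated with a minimal sketch. The similarity metric (a character-level `difflib` ratio), the plain-string action format, and the averaging over the expert trajectory are assumptions for illustration, not the paper's exact implementation:

```python
# Hypothetical sketch of an SRL-style step-wise reward: score a model
# rollout against an expert trajectory action-by-action, so the model
# earns partial credit even when the full solution is wrong.
from difflib import SequenceMatcher


def action_similarity(predicted: str, expert: str) -> float:
    """Character-level similarity in [0, 1] between two action strings."""
    return SequenceMatcher(None, predicted, expert).ratio()


def stepwise_reward(predicted_actions: list[str], expert_actions: list[str]) -> float:
    """Average per-step similarity to the expert trajectory.

    Averaging over the expert length (not the rollout length) penalizes
    missing steps while still rewarding correct prefixes.
    """
    if not expert_actions:
        return 0.0
    scores = [
        action_similarity(p, e)
        for p, e in zip(predicted_actions, expert_actions)
    ]
    return sum(scores) / len(expert_actions)


expert = ["isolate x", "divide both sides by 2", "substitute x = 3"]
rollout = ["isolate x", "divide by 2"]  # imperfect rollout, partial reward
reward = stepwise_reward(rollout, expert)
assert 0.0 < reward < 1.0
```

Unlike a binary verifiable reward, this signal stays informative when every rollout is incorrect, which is the failure mode of RLVR that the paper targets.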