[2601.20802] Reinforcement Learning via Self-Distillation
Summary
This paper introduces Self-Distillation Policy Optimization (SDPO), a reinforcement learning method that converts rich textual feedback from verifiable environments into a dense learning signal, improving sample efficiency and accuracy across several domains.
Why It Matters
The research addresses a limitation of traditional reinforcement learning methods that rely solely on a scalar outcome reward per attempt. By exploiting rich textual feedback, such as runtime errors or judge evaluations, SDPO improves sample efficiency and accuracy, which matters for post-training language models in complex verifiable environments like code and math.
Key Takeaways
- SDPO converts rich feedback into dense learning signals without external teachers.
- The method improves sample efficiency and accuracy over existing RLVR baselines.
- SDPO can leverage successful rollouts as implicit feedback for failed attempts.
- In binary-reward tasks, the approach discovers successful solutions in fewer attempts.
- It demonstrates effectiveness across diverse applications like scientific reasoning and programming.
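The core idea, as described in the abstract, is that the model conditioned on feedback acts as a "self-teacher" whose next-token distribution is distilled back into the unconditioned policy. The toy sketch below illustrates that distillation signal on a three-token vocabulary; the function names and logit values are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_distill_loss(student_logits, teacher_logits):
    """Cross-entropy from the feedback-conditioned 'self-teacher'
    distribution to the plain policy: a dense, per-token signal of the
    kind SDPO distills back into the policy (hypothetical sketch)."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

# Toy vocabulary of 3 tokens. Seeing the feedback in-context shifts the
# same model's prediction toward token 2 (the correction).
student_logits = [2.0, 1.0, 0.0]   # policy without feedback
teacher_logits = [0.0, 1.0, 2.0]   # same model, feedback in context

loss = self_distill_loss(student_logits, teacher_logits)
aligned = self_distill_loss(teacher_logits, teacher_logits)

# The loss is minimized when the policy matches its self-teacher, so
# gradient descent on it pulls the policy toward the corrected prediction.
print(loss > aligned)
```

Minimizing this loss over every token gives a dense signal even when the scalar reward is zero, which is why the method sidesteps the credit-assignment bottleneck of outcome-only RLVR.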
Computer Science > Machine Learning
arXiv:2601.20802 (cs)
[Submitted on 28 Jan 2026 (v1), last revised 16 Feb 2026 (this version, v2)]
Title: Reinforcement Learning via Self-Distillation
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
Abstract: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scie...