[2505.19590] Learning to Reason without External Rewards
Computer Science > Machine Learning
arXiv:2505.19590 (cs)
[Submitted on 26 May 2025 (v1), last revised 2 Mar 2026 (this version, v3)]

Title: Learning to Reason without External Rewards
Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

Abstract: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by its reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks such as code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at this https URL.
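As a rough illustration of the idea in the abstract, the sketch below computes a self-certainty-style confidence score as the average KL divergence between a uniform distribution over the vocabulary and the model's per-token predictive distributions, then turns a group of such scores into group-relative (z-normalized) advantages in the style of GRPO. The function names, the tiny hand-made distributions, and the exact normalization are illustrative assumptions, not the paper's reference implementation.

```python
import math

def self_certainty(token_dists):
    """Illustrative confidence score: mean KL(U || p_i) over the
    token-level predictive distributions p_i of one sampled response.
    A peaked (confident) distribution yields a higher score than a
    flat one, for which KL(U || U) = 0."""
    total = 0.0
    for p in token_dists:
        V = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j)
        total += sum((1.0 / V) * math.log((1.0 / V) / pj) for pj in p)
    return total / len(token_dists)

def group_relative_advantages(scores):
    """GRPO-style normalization (assumed form): z-score each response's
    reward within its sampled group, so no external reward is needed --
    only the relative confidence ranking inside the group matters."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(s - mean) / std for s in scores]

# Toy example: one confident response vs. one uncertain response.
confident = [[0.97, 0.01, 0.01, 0.01]]   # peaked distribution
uncertain = [[0.25, 0.25, 0.25, 0.25]]   # uniform distribution
scores = [self_certainty(confident), self_certainty(uncertain)]
advantages = group_relative_advantages(scores)
print(advantages[0] > 0 > advantages[1])
```

In this toy group the confident response receives a positive advantage and the uncertain one a negative advantage, which is the mechanism by which intrinsic confidence alone can steer the policy update.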