[2510.25992] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

arXiv - Machine Learning · 4 min read

Summary

The paper presents Supervised Reinforcement Learning (SRL), a framework that improves multi-step reasoning in Large Language Models (LLMs) by reformulating problem solving as generating a sequence of logical actions, each rewarded by its similarity to the corresponding expert action.

Why It Matters

This research addresses the limitations of existing training methods for LLMs on multi-step reasoning tasks: reinforcement learning with verifiable rewards fails when correct solutions are rarely sampled, and supervised fine-tuning overfits long demonstrations. By introducing SRL, the authors provide an approach that could significantly improve the training and effectiveness of AI models across applications, including software engineering.

Key Takeaways

  • SRL reformulates problem-solving as generating sequences of logical actions.
  • It provides smoother rewards based on expert actions, enhancing learning signals.
  • SRL allows small models to tackle previously unsolvable problems.
  • Initializing with SRL before RLVR leads to superior performance.
  • The framework generalizes well to agentic software engineering tasks.

Computer Science > Computation and Language

arXiv:2510.25992 (cs)

[Submitted on 29 Oct 2025 (v1), last revised 25 Feb 2026 (this version, v2)]

Title: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Authors: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables sm...
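The abstract describes SRL's core reward: a dense, step-wise signal based on the similarity between the model's generated actions and expert actions from the SFT dataset. The paper's exact similarity metric is not given in this excerpt, so the sketch below is a hypothetical minimal version that assumes a simple sequence-matching ratio per step, averaged over the expert trajectory; the function name `step_wise_reward` is illustrative, not from the paper.

```python
from difflib import SequenceMatcher


def step_wise_reward(model_actions: list[str], expert_actions: list[str]) -> float:
    """Average per-step similarity between model and expert actions.

    Hypothetical sketch of SRL's dense reward: unlike a binary
    correct/incorrect outcome reward, partial matches still yield a
    nonzero signal, so learning proceeds even when every rollout is
    wrong overall. Missing model steps score 0 for that step.
    """
    if not expert_actions:
        return 0.0
    total = 0.0
    for i, expert in enumerate(expert_actions):
        model = model_actions[i] if i < len(model_actions) else ""
        total += SequenceMatcher(None, model, expert).ratio()
    return total / len(expert_actions)


# A rollout that matches the expert on step 1 but diverges on step 2
# still earns a partial reward, illustrating the "smoother" signal.
reward = step_wise_reward(["factor the quadratic", "guess randomly"],
                          ["factor the quadratic", "apply the root formula"])
```

The design point is that the reward is computed per action against the expert trajectory, rather than only on the final answer, which is what lets small models receive gradient signal on problems they cannot yet solve end to end.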

Related Articles


Popular AI gateway startup LiteLLM ditches controversial startup Delve | TechCrunch

LiteLLM had obtained two security compliance certifications via Delve and fell victim to some horrific credential-stealing malware last w...

TechCrunch - AI · 3 min · Llms

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Greetings all - I've posted mostly in r/claudecode and r/aigamedev a couple of times previously. Working with CC for personal projects re...

Reddit - Artificial Intelligence · 1 min · Llms

World models will be the next big thing, bye-bye LLMs

Was at Nvidia's GTC conference recently and honestly, it was one of the most eye-opening events I've attended in a while. There was a lot...

Reddit - Artificial Intelligence · 1 min · Llms

we open sourced a tool that auto generates your AI agent context from your actual codebase, just hit 250 stars

hey everyone. been lurking here for a while and wanted to share something we been building. the problem: ai coding agents are only as goo...

Reddit - Artificial Intelligence · 1 min · Llms

