[2604.02288] Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Computer Science > Machine Learning

arXiv:2604.02288 (cs) [Submitted on 2 Apr 2026]

Title: Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Authors: Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO ...
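The routing rule described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the function names, the binary-reward convention, and the specific loss forms (a REINFORCE-style surrogate for the GRPO branch, a KL term against a frozen self-teacher for the SDPO branch) are all assumptions made for the sketch; the paper's actual objectives may differ.

```python
# Hypothetical sketch of SRPO-style sample routing (illustrative names, not
# the authors' code): correct rollouts get a group-relative, reward-aligned
# policy-gradient term; failed rollouts get a logit-level self-distillation
# term (KL against an assumed frozen self-teacher distribution).
import math

def grpo_advantages(rewards):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

def kl(p, q):
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def srpo_loss(rollouts):
    """rollouts: dicts with 'reward' (1.0 correct / 0.0 failed), 'logprob'
    (sequence log-prob under the policy), and, for failed samples, 'policy'
    and 'teacher' token distributions. Routes each sample by correctness."""
    advs = grpo_advantages([r["reward"] for r in rollouts])
    loss = 0.0
    for r, a in zip(rollouts, advs):
        if r["reward"] > 0:
            # correct sample -> GRPO branch: reward-aligned reinforcement
            loss += -a * r["logprob"]
        else:
            # failed sample -> SDPO branch: targeted logit-level correction
            loss += kl(r["teacher"], r["policy"])
    return loss / len(rollouts)
```

Under this sketch, correct samples never receive a distillation signal (avoiding the optimization ambiguity the abstract attributes to distilling on already-correct samples), and failed samples receive dense per-token supervision rather than a uniform penalty.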