[2604.02288] Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Computer Science > Machine Learning

arXiv:2604.02288 (cs) [Submitted on 2 Apr 2026]

Title: Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Authors: Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO ...
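The routing rule described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the function names, the binary-reward convention, and the specific loss forms (a REINFORCE-style surrogate for the GRPO branch, a KL term against a frozen self-teacher for the SDPO branch) are all assumptions made for the sketch; the paper's actual objectives may differ.

```python
# Hypothetical sketch of SRPO-style sample routing (illustrative names, not
# the authors' code): correct rollouts get a group-relative, reward-aligned
# policy-gradient term; failed rollouts get a logit-level self-distillation
# term (KL against an assumed frozen self-teacher distribution).
import math

def grpo_advantages(rewards):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

def kl(p, q):
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def srpo_loss(rollouts):
    """rollouts: dicts with 'reward' (1.0 correct / 0.0 failed), 'logprob'
    (sequence log-prob under the policy), and, for failed samples, 'policy'
    and 'teacher' token distributions. Routes each sample by correctness."""
    advs = grpo_advantages([r["reward"] for r in rollouts])
    loss = 0.0
    for r, a in zip(rollouts, advs):
        if r["reward"] > 0:
            # correct sample -> GRPO branch: reward-aligned reinforcement
            loss += -a * r["logprob"]
        else:
            # failed sample -> SDPO branch: targeted logit-level correction
            loss += kl(r["teacher"], r["policy"])
    return loss / len(rollouts)
```

Under this sketch, correct samples never receive a distillation signal (avoiding the optimization ambiguity the abstract attributes to distilling on already-correct samples), and failed samples receive dense per-token supervision rather than a uniform penalty.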