[2603.01162] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Computer Science > Machine Learning
arXiv:2603.01162 (cs)
[Submitted on 1 Mar 2026]

Title: Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Authors: Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Shijin Gong, Chengchun Shi

Abstract: Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling the reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework for understanding GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE) and to derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm -- one with access to a value function that quantifies the goodness of its learning policy at each training iteration -- and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offe...
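
To make the abstract's central claim concrete, below is a minimal sketch of the group-relative advantage and per-prompt policy-gradient estimate in the standard GRPO form popularized by DeepSeekMath. The function names, toy inputs, and the simplified (non-clipped, single-update) objective are assumptions made for illustration, not the paper's implementation:

    import numpy as np

    def grpo_advantages(rewards):
        """Group-relative advantages in the standard GRPO form: each reward is
        normalized by the mean and std of its own group of G samples, so every
        advantage depends on *all* G responses to the prompt."""
        rewards = np.asarray(rewards, dtype=float)
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    def grpo_policy_gradient(score_vectors, rewards):
        """Monte-Carlo policy-gradient estimate for one prompt q: the average of
        A_i * grad log pi(o_i | q) over the G sampled responses. `score_vectors`
        holds per-sample score vectors grad log pi(o_i | q) (hypothetical
        precomputed inputs for this illustration)."""
        advantages = grpo_advantages(rewards)
        grads = np.stack(score_vectors)          # shape (G, num_params)
        return (advantages[:, None] * grads).mean(axis=0)

    # Toy usage: G = 4 responses to one prompt, a 3-parameter policy.
    rng = np.random.default_rng(0)
    scores = [rng.normal(size=3) for _ in range(4)]  # stand-ins for grad log pi
    rewards = [1.0, 0.0, 1.0, 0.0]                   # e.g. correctness rewards
    print(grpo_policy_gradient(scores, rewards))

Because the group mean and standard deviation enter every advantage, each term of the average couples all G samples, and the per-group estimator is a symmetric function of the sampled responses. This symmetric-average structure is what the paper identifies as a U-statistic, which is what permits the MSE characterization and asymptotic analysis described in the abstract.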