[2505.17508] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Summary
This paper presents a unified framework for KL-regularized policy gradient algorithms aimed at enhancing reasoning in large language models (LLMs), addressing off-policy settings and improving accuracy in mathematical reasoning benchmarks.
Why It Matters
As LLMs become increasingly integral to various applications, optimizing their reasoning capabilities is crucial. This research offers a structured approach to policy gradient algorithms, potentially leading to more effective and scalable LLM training methods, which can impact AI development and deployment.
Key Takeaways
- Introduces the Regularized Policy Gradient (RPG) framework for LLM reasoning.
- Unifies various KL-regularization techniques and clarifies their applications.
- Demonstrates improved accuracy on reasoning benchmarks with RPG-REINFORCE.
- Identifies and corrects off-policy importance-weighting mismatches.
- Offers a scalable RL algorithm for LLMs through KL-correct objectives and clipped sampling.
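The takeaways above can be made concrete with a toy sketch of an importance-weighted, KL-penalized REINFORCE-style surrogate. The function name, the clipping scheme, and the choice of penalizing via the $k_3$ estimator are illustrative assumptions, not the paper's exact RPG-REINFORCE objective.

```python
import numpy as np

def rpg_style_surrogate(logp_new, logp_old, rewards, beta=0.1, clip=10.0):
    """Toy off-policy surrogate: importance-weighted reward minus a
    k3-style (unnormalized-KL) penalty toward the behavior policy.
    Names and weighting are illustrative assumptions, not the paper's method."""
    log_ratio = logp_new - logp_old              # log pi_theta - log pi_old, per sample
    log_ratio = np.clip(log_ratio, -clip, clip)  # clip log-ratios for numerical stability
    ratio = np.exp(log_ratio)                    # importance weight pi_theta / pi_old
    k3 = ratio - 1.0 - log_ratio                 # k3 KL estimator (pointwise >= 0)
    return np.mean(ratio * rewards - beta * k3)  # surrogate to maximize

# On-policy sanity check: with identical policies the importance weights are 1
# and the KL penalty vanishes, so the surrogate reduces to the mean reward.
logp = np.log(np.array([0.2, 0.5, 0.3]))
rewards = np.array([1.0, 0.0, 2.0])
val = rpg_style_surrogate(logp, logp, rewards)   # → 1.0
```

In the off-policy case (`logp_new != logp_old`), the importance weights reweight rewards toward the current policy while the `k3` term penalizes drift from the behavior policy, which is the design question the paper studies.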
Computer Science > Machine Learning
arXiv:2505.17508 (cs)
Submitted on 23 May 2025 (v1); last revised 19 Feb 2026 (this version, v4)
Title: On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface (the choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, $k_1/k_2/k_3$) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch...
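The $k_1/k_2/k_3$ estimators named in the abstract are standard Monte Carlo estimators of a KL divergence from samples. A small numerical check on a toy categorical pair (the distributions here are arbitrary, chosen only for illustration) shows their behavior, including the nonnegativity of $k_3$ that connects it to an unnormalized KL:

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])              # sampling (behavior) distribution
p = np.array([0.4, 0.4, 0.2])              # target distribution

exact = float(np.sum(q * np.log(q / p)))   # KL(q || p) in closed form

x = rng.choice(3, size=200_000, p=q)       # draw samples x ~ q
log_r = np.log(p[x] / q[x])                # per-sample log ratio log p(x)/q(x)

k1 = -log_r                    # unbiased, high variance, can go negative
k2 = 0.5 * log_r**2            # biased, low variance, always >= 0
k3 = np.expm1(log_r) - log_r   # r - 1 - log r: unbiased AND pointwise >= 0

# k1.mean() and k3.mean() both converge to `exact`; k3 never dips below zero,
# which is why it is the popular choice as a per-token KL penalty.
```

The paper's claim that the $k_3$ penalty "is exactly the unnormalized KL" refers to this same quantity: its expectation matches the KL while each sample contribution stays nonnegative.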