[2505.17508] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

arXiv - AI 4 min read Article

Summary

This paper presents a unified framework for KL-regularized policy gradient algorithms aimed at enhancing reasoning in large language models (LLMs), addressing off-policy settings and improving accuracy in mathematical reasoning benchmarks.

Why It Matters

As LLMs become increasingly integral to various applications, optimizing their reasoning capabilities is crucial. This research offers a structured approach to KL-regularized policy gradient algorithms that could make reinforcement-learning training for LLMs more effective and scalable, with direct implications for how such models are developed and deployed.

Key Takeaways

  • Introduces the Regularized Policy Gradient (RPG) framework for LLM reasoning.
  • Unifies various KL-regularization techniques and clarifies their applications.
  • Demonstrates improved accuracy on reasoning benchmarks with RPG-REINFORCE.
  • Identifies and corrects off-policy importance-weighting mismatches.
  • Offers a scalable RL algorithm for LLMs through KL-correct objectives and clipped sampling.

Computer Science > Machine Learning · arXiv:2505.17508 (cs)

[Submitted on 23 May 2025 (v1), last revised 19 Feb 2026 (this version, v4)]

Title: On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface (choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, $k_1/k_2/k_3$) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importan...
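The abstract contrasts the $k_1/k_2/k_3$ KL estimators. As a minimal sketch of what these refer to (using the standard Monte Carlo estimator definitions for $\mathrm{KL}(p\,\|\,q)$ with $r = q(x)/p(x)$ and $x \sim p$, not code from the paper), the three can be computed side by side:

```python
import math
import random

def kl_estimators(logp, logq, samples):
    """Monte Carlo estimators of KL(p || q) = E_{x~p}[log p(x) - log q(x)].

    With log r = log q(x) - log p(x) for x ~ p:
      k1 = -log r              (unbiased, high variance)
      k2 = (log r)^2 / 2       (biased, lower variance)
      k3 = (r - 1) - log r     (unbiased, low variance, pointwise >= 0)
    """
    k1 = k2 = k3 = 0.0
    for x in samples:
        logr = logq(x) - logp(x)
        k1 += -logr
        k2 += 0.5 * logr * logr
        k3 += math.exp(logr) - 1.0 - logr
    n = len(samples)
    return k1 / n, k2 / n, k3 / n

# Toy check: p = N(0, 1), q = N(0.5, 1); true KL(p || q) = 0.5**2 / 2 = 0.125.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
logp = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
logq = lambda x: -0.5 * (x - 0.5) ** 2 - 0.5 * math.log(2 * math.pi)
print(kl_estimators(logp, logq, xs))  # all three near 0.125; k2 slightly biased
```

In LLM fine-tuning, `logp` and `logq` would be per-token log-probabilities under the current policy and the reference model; the paper's contribution is working out which importance weights make each of these estimators yield the exact gradient of the intended objective in the off-policy case.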

Related Articles

  • Attention Is All You Need, But All You Can't Afford | Hybrid Attention
    Repo: https://codeberg.org/JohannaJuntos/Sisyphus I've been building a small Rust-focused language model from scratch in PyTorch. Not a f...
    Reddit - Artificial Intelligence · 1 min

  • The "Agony" of ChatGPT: Would You Let AI Write Your Wedding Speech?
    AI Tools & Products · 12 min

  • Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute
    AI Tools & Products · 3 min

  • How I use Claude for strategy, Gemini for research and ChatGPT for 'the grind'
    AI Tools & Products · 9 min
