[2509.21154] GRPO is Secretly a Process Reward Model
Summary
The paper presents a theoretical proof that the GRPO algorithm, typically viewed as relying on an outcome reward model, can be interpreted as a process reward model. Using this view, it identifies a flaw in the GRPO objective that hinders both exploration and exploitation, and proposes a modification, $\lambda$-GRPO, that improves downstream performance.
Why It Matters
Understanding the relationship between process reward models and outcome reward models is crucial for improving reinforcement learning algorithms. This research provides insights into optimizing GRPO, potentially leading to better performance in AI applications, particularly in large language models.
Key Takeaways
- GRPO can be viewed as a process reward model, contrary to its traditional classification.
- A flaw in GRPO affects its exploration and exploitation capabilities.
- The proposed modification, $\lambda$-GRPO, enhances performance without significant additional training cost.
- LLMs tuned with $\lambda$-GRPO outperform those tuned with standard GRPO, and reach peak performance more rapidly.
- This research contributes to the understanding of reward structures in reinforcement learning.
Computer Science > Machine Learning
arXiv:2509.21154 (cs) [Submitted on 25 Sep 2025 (v1), last revised 20 Feb 2026 (this version, v3)]
Title: GRPO is Secretly a Process Reward Model
Authors: Michael Sullivan, Alexander Koller
Abstract: Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs tuned with $\lambda$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, ...
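For context, the outcome-reward setup the abstract refers to can be sketched as the standard GRPO group-relative advantage: each of G rollouts for a prompt receives a single scalar reward, normalized against the group, and every token in a trajectory inherits that trajectory's advantage. This is a minimal illustration of vanilla GRPO only; the paper's $\lambda$-GRPO reweighting and the implicit Monte-Carlo PRM construction are not reproduced here.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO credit assignment: normalize each rollout's scalar
    (outcome) reward against the group of rollouts sampled for the same
    prompt. Every token in rollout i then shares the advantage A_i, which
    is the per-step structure the paper reinterprets as an implicit PRM."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # eps guards against a zero-variance group
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 4 rollouts for one prompt with binary outcome rewards.
# Successful rollouts get positive advantages, failed ones negative,
# and the group's advantages sum to zero.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that because every token shares its trajectory's advantage, longer and shorter rollouts contribute unequally at the token level, which is the kind of imbalance across process steps the abstract says the flaw interacts with.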