[2601.10498] PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
Summary
The paper introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting high-variance components out of the policy gradient.
Why It Matters
This research presents a new approach to improving the efficiency and performance of reinforcement learning algorithms, in particular by keeping policy updates proximal without requiring a reference policy. This could simplify and strengthen AI applications that rely on reinforcement learning.
Key Takeaways
- PROMA offers a reference-free method for proximal policy updates.
- The accumulation-based variant achieves tighter per-step KL control than GRPO with PPO clipping.
- The intra-microbatch variant achieves the best validation performance of the methods compared.
- The method is compatible with standard data-parallel training.
- This research could enhance reinforcement learning applications across various domains.
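The core mechanism behind the accumulation-based variant, projecting the running gradient orthogonal to each microbatch's sequence-wise log-probability gradients, can be sketched in a few lines. This is an illustrative reconstruction from the abstract's description, not the paper's implementation; the function name and the toy dimensions are assumptions.

```python
import numpy as np

def project_out(g, directions):
    """Project gradient g onto the orthogonal complement of the given
    direction vectors (e.g. sequence-wise log-probability gradients)."""
    # Orthonormalize the directions with a reduced QR decomposition.
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))
    # Subtract the component of g that lies in span(directions).
    return g - Q @ (Q.T @ g)

# Toy accumulation loop: add each microbatch gradient to the running
# gradient, then project the result orthogonal to that microbatch's
# log-probability gradient (hypothetical stand-in vectors).
rng = np.random.default_rng(0)
running = np.zeros(8)
for _ in range(4):
    g_micro = rng.normal(size=8)   # microbatch policy gradient
    g_logp = rng.normal(size=8)    # sequence-wise log-prob gradient
    running = project_out(running + g_micro, [g_logp])

# After the loop, the running gradient has no component along the last
# microbatch's log-prob gradient.
print(abs(running @ g_logp) < 1e-8)
```

Removing the component along the log-probability gradient is what suppresses first-order drift in the policy's log-likelihoods, which is how the method controls KL without a reference model.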
Computer Science > Machine Learning

arXiv:2601.10498 (cs) [Submitted on 15 Jan 2026 (v1), last revised 17 Feb 2026 (this version, v4)]

Title: PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
Authors: Nilin Abrahamsen

Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2601.10498 [cs.LG] (or arXiv:2601.10498v4 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2601.10498
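The intra-microbatch variant's "factored projection using dominant subspaces of activations and gradient outputs" can be sketched for a single linear layer, whose weight gradient is an outer product of output gradients and activations. This is a hedged reconstruction: the use of top-k right singular vectors, the value of k, and all function and variable names are assumptions, not the paper's specification.

```python
import numpy as np

def factored_project_out(G, acts, grad_outs, k=2):
    """Remove from a weight gradient G (d_out x d_in) the component lying
    in the factored span of the top-k dominant directions of the
    microbatch activations and output gradients (illustrative sketch)."""
    # Dominant input-side directions from the activations' row space.
    _, _, Vt_a = np.linalg.svd(acts, full_matrices=False)
    Va = Vt_a[:k].T                      # (d_in, k), orthonormal columns
    # Dominant output-side directions from the output gradients' row space.
    _, _, Vt_d = np.linalg.svd(grad_outs, full_matrices=False)
    Vd = Vt_d[:k].T                      # (d_out, k), orthonormal columns
    # Project out the factored subspace Vd (x) Va from G. Because the
    # projection uses only this microbatch's statistics, each data-parallel
    # worker can apply it independently before gradient all-reduce.
    return G - Vd @ (Vd.T @ G @ Va) @ Va.T

# Toy microbatch for a 6 -> 4 linear layer (hypothetical shapes).
rng = np.random.default_rng(1)
acts = rng.normal(size=(16, 6))       # microbatch activations
grad_outs = rng.normal(size=(16, 4))  # microbatch output gradients
G = grad_outs.T @ acts                # weight gradient of the layer
G_proj = factored_project_out(G, acts, grad_outs, k=2)
```

The factored form keeps the projection cheap: it costs two small SVDs plus low-rank matrix products per layer, rather than an SVD of the full weight gradient.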