[2601.10498] PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates


Summary

The paper introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy update method for reinforcement learning that controls KL divergence by projecting away high-variance components of the policy gradient.

Why It Matters

This research is significant because it presents a new way to improve the efficiency and performance of reinforcement learning algorithms, managing the size of policy updates without requiring a reference policy. This could benefit AI applications that rely on reinforcement learning, such as fine-tuning language models.

Key Takeaways

  • PROMA offers a reference-free method for proximal policy updates.
  • The accumulation-based variant achieves tighter per-step KL control than GRPO with PPO clipping.
  • The intra-microbatch variant shows superior validation performance.
  • The method is compatible with standard data-parallel training.
  • This research could enhance reinforcement learning applications across various domains.
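The accumulation-based variant's core operation, as described in the abstract, is to project the running gradient orthogonal to the sequence-wise log-probability gradients of each microbatch. A minimal NumPy sketch of that projection step is below; the function name and the use of a QR decomposition are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def project_out(grad, logp_grads):
    """Remove from `grad` its components along the span of the
    sequence-wise log-probability gradients (rows of `logp_grads`)."""
    # Orthonormal basis for the span of the log-prob gradients via QR.
    q, _ = np.linalg.qr(logp_grads.T)          # shape (d, k)
    # Subtract the projection of `grad` onto that subspace.
    return grad - q @ (q.T @ grad)

rng = np.random.default_rng(0)
d = 8
g = rng.normal(size=d)                          # running (accumulated) gradient
U = rng.normal(size=(3, d))                     # 3 sequence log-prob gradients
g_proj = project_out(g, U)
# g_proj is now orthogonal to every row of U.
```

After the projection, the accumulated update no longer moves along the directions that most directly change each sequence's log-probability, which is the mechanism the paper credits for tighter KL control.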

Computer Science > Machine Learning

arXiv:2601.10498 (cs) [Submitted on 15 Jan 2026 (v1), last revised 17 Feb 2026 (this version, v4)]

Title: PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
Authors: Nilin Abrahamsen

Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2601.10498 [cs.LG] (or arXiv:2601.10498v4 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2601.10498
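The intra-microbatch variant instead applies a factored projection built from dominant subspaces of the layer activations and output gradients. The abstract does not specify the construction, so the sketch below is one plausible reading under stated assumptions: the weight gradient's components lying in the top-k left subspace of the output gradients and top-k right subspace of the activations are treated as the high-variance part and removed. The function name, the use of SVD, and the choice of k are all hypothetical:

```python
import numpy as np

def factored_project(G, acts, grads_out, k=2):
    """Hypothetical factored projection: remove from the weight gradient G
    (shape d_out x d_in) the components lying in the top-k subspace of the
    output gradients (left factor) and of the activations (right factor)."""
    # Dominant subspaces from SVDs of the stacked per-example vectors.
    U, _, _ = np.linalg.svd(grads_out.T, full_matrices=False)  # (d_out, .)
    V, _, _ = np.linalg.svd(acts.T, full_matrices=False)       # (d_in, .)
    Uk, Vk = U[:, :k], V[:, :k]
    # Subtract the doubly-projected component Uk Uk^T G Vk Vk^T.
    return G - Uk @ (Uk.T @ G @ Vk) @ Vk.T

rng = np.random.default_rng(1)
acts = rng.normal(size=(4, 5))       # 4 examples, d_in = 5
grads_out = rng.normal(size=(4, 6))  # 4 examples, d_out = 6
G = rng.normal(size=(6, 5))          # weight gradient for this microbatch
Gp = factored_project(G, acts, grads_out, k=2)
```

Because each microbatch builds its projection only from its own activations and output gradients, no cross-worker state is needed, which is consistent with the paper's claim of compatibility with standard data-parallel training.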
