[2601.10498] PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates


Summary

The paper introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy update method for reinforcement learning that controls KL divergence by projecting away high-variance components of the policy gradient.

Why It Matters

This research is significant because it presents a new way to improve the efficiency and performance of reinforcement learning algorithms, managing the size of policy updates without requiring a reference policy. This could benefit AI applications that rely on reinforcement learning, such as fine-tuning language models.

Key Takeaways

  • PROMA offers a reference-free method for proximal policy updates.
  • The accumulation-based variant achieves tighter per-step KL control than GRPO with PPO clipping.
  • The intra-microbatch variant shows superior validation performance.
  • The method is compatible with standard data-parallel training.
  • This research could enhance reinforcement learning applications across various domains.
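The accumulation-based variant's core operation, as described in the abstract, is to project the running gradient orthogonal to the sequence-wise log-probability gradients of each microbatch. A minimal NumPy sketch of that projection step is below; the function name and the use of a QR decomposition are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def project_out(grad, logp_grads):
    """Remove from `grad` its components along the span of the
    sequence-wise log-probability gradients (rows of `logp_grads`)."""
    # Orthonormal basis for the span of the log-prob gradients via QR.
    q, _ = np.linalg.qr(logp_grads.T)          # shape (d, k)
    # Subtract the projection of `grad` onto that subspace.
    return grad - q @ (q.T @ grad)

rng = np.random.default_rng(0)
d = 8
g = rng.normal(size=d)                          # running (accumulated) gradient
U = rng.normal(size=(3, d))                     # 3 sequence log-prob gradients
g_proj = project_out(g, U)
# g_proj is now orthogonal to every row of U.
```

After the projection, the accumulated update no longer moves along the directions that most directly change each sequence's log-probability, which is the mechanism the paper credits for tighter KL control.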

Computer Science > Machine Learning

arXiv:2601.10498 (cs) [Submitted on 15 Jan 2026 (v1), last revised 17 Feb 2026 (this version, v4)]

Title: PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
Authors: Nilin Abrahamsen

Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2601.10498 [cs.LG] (or arXiv:2601.10498v4 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2601.10498
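The intra-microbatch variant instead applies a factored projection built from dominant subspaces of the layer activations and output gradients. The abstract does not specify the construction, so the sketch below is one plausible reading under stated assumptions: the weight gradient's components lying in the top-k left subspace of the output gradients and top-k right subspace of the activations are treated as the high-variance part and removed. The function name, the use of SVD, and the choice of k are all hypothetical:

```python
import numpy as np

def factored_project(G, acts, grads_out, k=2):
    """Hypothetical factored projection: remove from the weight gradient G
    (shape d_out x d_in) the components lying in the top-k subspace of the
    output gradients (left factor) and of the activations (right factor)."""
    # Dominant subspaces from SVDs of the stacked per-example vectors.
    U, _, _ = np.linalg.svd(grads_out.T, full_matrices=False)  # (d_out, .)
    V, _, _ = np.linalg.svd(acts.T, full_matrices=False)       # (d_in, .)
    Uk, Vk = U[:, :k], V[:, :k]
    # Subtract the doubly-projected component Uk Uk^T G Vk Vk^T.
    return G - Uk @ (Uk.T @ G @ Vk) @ Vk.T

rng = np.random.default_rng(1)
acts = rng.normal(size=(4, 5))       # 4 examples, d_in = 5
grads_out = rng.normal(size=(4, 6))  # 4 examples, d_out = 6
G = rng.normal(size=(6, 5))          # weight gradient for this microbatch
Gp = factored_project(G, acts, grads_out, k=2)
```

Because each microbatch builds its projection only from its own activations and output gradients, no cross-worker state is needed, which is consistent with the paper's claim of compatibility with standard data-parallel training.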
