[2604.01597] Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

arXiv:2604.01597 (cs)

[Submitted on 2 Apr 2026]

Title: Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

Authors: Dong Shu, Denghui Zhang, Jessica Hullman

Abstract: Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose Influence-Guided PPO (I-PPO), a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively reducing unfaithful CoT reasoning.

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2604.01597 [cs.LG] (or arXiv:2604.01597v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.01...
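The filtering step the abstract describes (score each episode against a validation gradient, then drop the anti-aligned ones) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the influence score is a plain dot product between a flattened per-episode gradient and the validation gradient, which is one common first-order approximation. The names `filter_episodes`, `episode_grads`, `val_grad`, and `threshold` are hypothetical, and the toy gradients are made up for illustration.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def filter_episodes(
    episode_grads: Sequence[Sequence[float]],
    val_grad: Sequence[float],
    threshold: float = 0.0,
) -> Tuple[List[int], List[float]]:
    """Score each episode and keep those aligned with the validation gradient.

    Assumption (not taken from the paper): the influence score is the dot
    product between an episode's policy gradient and the validation gradient.
    A score at or below `threshold` is treated as anti-aligned and dropped.
    """
    scores = [dot(g, val_grad) for g in episode_grads]
    kept = [i for i, s in enumerate(scores) if s > threshold]
    return kept, scores


# Toy example with 2-dimensional "gradients" (illustrative values only).
val_grad = [1.0, 0.0]
episode_grads = [
    [2.0, 1.0],    # aligned with the validation gradient
    [-1.0, 3.0],   # anti-aligned: would be removed from the buffer
    [0.5, -0.5],   # weakly aligned
]
kept, scores = filter_episodes(episode_grads, val_grad)
print(kept)    # indices of episodes that would be passed on to the PPO update
print(scores)  # per-episode influence scores
```

In a real training loop the retained indices would select the subset of the rollout buffer used for the PPO update. If no episode passes the filter, the update has nothing to train on, which is one way to read the abstract's remark that filtering acts as an intrinsic early stopping mechanism.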

Originally published on April 03, 2026. Curated by AI News.
