[2604.01597] Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Computer Science > Machine Learning
arXiv:2604.01597 (cs)
[Submitted on 2 Apr 2026]

Title: Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Authors: Dong Shu, Denghui Zhang, Jessica Hullman

Abstract: Traditional RL algorithms such as Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, under the assumption that every generated episode provides a beneficial optimization signal. In practice, however, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow training. In this paper, we propose \textbf{Influence-Guided PPO (I-PPO)}, a novel framework that integrates data attribution into the RL post-training loop. By computing an influence score for each episode with a gradient-based approximation, I-PPO identifies and discards episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We further show that the filtering process acts as an intrinsic early-stopping mechanism, improving training efficiency while reducing unfaithful chain-of-thought (CoT) reasoning.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2604.01597 [cs.LG] (or arXiv:2604.01597v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.01597
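The abstract's core mechanism, filtering rollouts by gradient alignment with a validation signal, can be sketched in a few lines. The sketch below is not the authors' implementation: it assumes the influence score is approximated as the dot product between each episode's flattened policy gradient and a gradient computed on a held-out validation batch, and that episodes with negative scores are dropped before the PPO update. The names `influence_scores` and `filter_rollouts` are hypothetical.

```python
import torch

def influence_scores(episode_grads: torch.Tensor, val_grad: torch.Tensor) -> torch.Tensor:
    """First-order influence: dot product of each episode's flattened
    policy gradient with the validation gradient. A negative score means
    training on that episode moves parameters against the validation
    objective, i.e. the episode is anti-aligned (assumed approximation)."""
    # episode_grads: (num_episodes, num_params); val_grad: (num_params,)
    return episode_grads @ val_grad

def filter_rollouts(episodes, episode_grads, val_grad):
    """Keep only episodes whose influence score is positive; the
    surviving subset is what a PPO update would then train on."""
    scores = influence_scores(episode_grads, val_grad)
    keep = scores > 0
    kept = [ep for ep, k in zip(episodes, keep.tolist()) if k]
    return kept, scores

# Toy usage: 8 episodes for a 10-parameter policy.
episodes = [f"episode_{i}" for i in range(8)]
episode_grads = torch.randn(8, 10)   # per-episode policy gradients
val_grad = torch.randn(10)           # gradient on a validation batch
kept, scores = filter_rollouts(episodes, episode_grads, val_grad)
print(f"kept {len(kept)}/{len(episodes)} episodes; scores = {scores.tolist()}")
```

In a real post-training loop the per-episode gradients would come from the policy loss on each rollout, and the dot product would likely be computed with a low-rank or last-layer approximation rather than over all parameters; the abstract only states that the approximation is gradient-based.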