[2603.21177] Prompt Replay: Speeding up GRPO with On-Policy Reuse of High-Signal Prompts
Computer Science > Machine Learning

arXiv:2603.21177 (cs) [Submitted on 22 Mar 2026]

Title: Prompt Replay: Speeding up GRPO with On-Policy Reuse of High-Signal Prompts
Authors: Andrei Baroian, Rutger Berger

Abstract: Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capabilities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), preserving on-policy optimization. After each step, we insert medium-difficulty prompts into a buffer and prioritize prompts whose pass rate is closest to 0.5 (half of the answers correct, half wrong), maximizing the advantage and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and a maximum reuse count controlling aggressiveness versus the risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated by average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases the mean absolute advantage, and shows faster initial accuracy gains. Yet it plateaus and converges with the baseline, as too aggressive conf...
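The selection rule described in the abstract (buffer medium-difficulty prompts, prioritize pass rates near 0.5, cap reuse with cooldown steps and a maximum reuse count, then mix with fresh samples) can be sketched as follows. This is a minimal illustration, not the authors' implementation; all class names, parameter names, and default values (`cooldown_steps`, `max_reuse`, the 0.2-0.8 difficulty window, `replay_frac`) are assumptions for exposition.

```python
from dataclasses import dataclass


@dataclass
class PromptEntry:
    prompt: str
    pass_rate: float       # fraction of rollouts answered correctly at last use
    last_used_step: int
    reuse_count: int = 0


class PromptReplayBuffer:
    """Sketch of a prompt-reuse buffer in the spirit of the abstract.

    Prompts with a pass rate near 0.5 carry the largest GRPO advantage
    signal, so they are prioritized for reuse; cooldown and max-reuse
    caps trade aggressiveness against the risk of overfitting.
    """

    def __init__(self, cooldown_steps=2, max_reuse=3, low=0.2, high=0.8):
        self.cooldown_steps = cooldown_steps
        self.max_reuse = max_reuse
        self.low, self.high = low, high   # assumed medium-difficulty window
        self.entries: list[PromptEntry] = []

    def insert(self, prompt, pass_rate, step):
        # Keep only medium-difficulty prompts: zero-variance prompts
        # (pass rate 0 or 1) yield no advantage and are discarded.
        if self.low <= pass_rate <= self.high:
            self.entries.append(PromptEntry(prompt, pass_rate, step))

    def sample(self, k, step):
        # Eligible = past the cooldown window and under the reuse cap.
        eligible = [e for e in self.entries
                    if step - e.last_used_step >= self.cooldown_steps
                    and e.reuse_count < self.max_reuse]
        # Prioritize pass rates closest to 0.5 (maximum |advantage| signal).
        eligible.sort(key=lambda e: abs(e.pass_rate - 0.5))
        chosen = eligible[:k]
        for e in chosen:
            e.reuse_count += 1
            e.last_used_step = step
        return [e.prompt for e in chosen]


def form_batch(buffer, fresh_prompts, batch_size, replay_frac, step):
    """Mix reused prompts with fresh samples into one training batch."""
    k = int(batch_size * replay_frac)
    reused = buffer.sample(k, step)
    fresh = fresh_prompts[: batch_size - len(reused)]
    return reused + fresh
```

In use, the trainer would call `insert` with each prompt's empirical pass rate after every GRPO step, then call `form_batch` to build the next batch; because only prompts (not stored trajectories) are replayed, every rollout is still generated by the current policy, keeping the optimization on-policy.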