[2602.02555] Learning to Explore with Parameter-Space Noise: A Deep

[2602.02555] Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

arXiv - Machine Learning March 03, 2026 4 min read

About this article

Abstract page for arXiv paper 2602.02555: Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

Computer Science > Machine Learning arXiv:2602.02555 (cs) [Submitted on 30 Jan 2026 (v1), last revised 28 Feb 2026 (this version, v3)] Title:Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards Authors:Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen View a PDF of the paper titled Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards, by Bizhe Bai and 3 other authors View PDF HTML (experimental) Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consiste...

Originally published on March 03, 2026. Curated by AI News.

Llms

Claude Mythos and misguided open-weight fearmongering

AI Tools & Products · 9 min · about 3 hours ago

Llms

Anthropic Agrees to Rent CoreWeave AI Capacity to Power Claude

AI Tools & Products · 1 min · about 3 hours ago

Llms

CoreWeave strikes a deal to power Anthropic's Claude AI models — and the stock surges 12%

AI Tools & Products · 3 min · about 3 hours ago

Llms

Walmart’s AI Push Links Gemini App Experience With U.S. Manufacturing Shift

AI Tools & Products · 6 min · about 3 hours ago

[2602.02555] Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

About this article

Related Articles

Claude Mythos and misguided open-weight fearmongering

Anthropic Agrees to Rent CoreWeave AI Capacity to Power Claude

CoreWeave strikes a deal to power Anthropic's Claude AI models — and the stock surges 12%

Walmart’s AI Push Links Gemini App Experience With U.S. Manufacturing Shift

No comments

Stay updated with AI News