[2604.04894] Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation
Computer Science > Computation and Language

arXiv:2604.04894 (cs) [Submitted on 6 Apr 2026]

Title: Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation

Authors: Hengrui Gu, Xiaotian Han, Yujing Bian, Kaixiong Zhou

Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed \textit{restricted exploration}, where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach for sustaining exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into \textit{informative entropy}, which preserves diverse solution paths, and \textit{spurious entropy}, which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires \textit{entropy refinement} ...
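For context on the two standard ingredients the abstract contrasts, here is a minimal sketch of group-relative advantage estimation (the GRPO-style normalization A_i = (r_i - mean(r)) / (std(r) + eps) over G rollouts of the same prompt) combined with a token-level entropy bonus scaled by a coefficient beta. This is not the paper's bidirectional entropy modulation; all tensor shapes, names, and the value of beta are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative (GRPO-style) advantage: normalize the verifiable
    # rewards of G rollouts sampled for the same prompt.
    #   A_i = (r_i - mean(r)) / (std(r) + eps)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Average token-level policy entropy H(pi) = -sum_v p_v log p_v,
    # the quantity an entropy-regularized objective tries to keep high.
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

# Toy usage: G = 4 rollouts for one prompt, binary verifiable rewards.
G, T, V = 4, 16, 1000                       # rollouts, tokens, vocab (illustrative)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards)    # symmetric +/- advantages within the group

logits = torch.randn(G, T, V, requires_grad=True)
tokens = torch.randint(0, V, (G, T))        # placeholder sampled token ids
logp = torch.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

beta = 0.01                                 # the entropy coefficient the abstract
                                            # describes as highly sensitive for LLMs
pg_loss = -(adv[:, None] * logp).mean()     # REINFORCE-style surrogate
loss = pg_loss - beta * mean_token_entropy(logits)
loss.backward()
```

In this formulation, the hyperparameter sensitivity the abstract criticizes lives entirely in beta: too small and the policy collapses onto a narrow set of solutions (the restricted-exploration failure mode), too large and the uniform entropy bonus inflates what the paper would call spurious entropy rather than preserving diverse solution paths.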