[2510.03669] Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Summary
This paper introduces the Token Hidden Reward (THR), a token-level metric for explicitly steering training between exploration and exploitation in Group Relative Policy Optimization (GRPO), yielding better control of training dynamics and higher accuracy on reasoning tasks.
Why It Matters
The development of the Token Hidden Reward metric addresses a critical challenge in reinforcement learning, particularly in large language models. By providing a mechanism to steer training towards either exploration or exploitation, this research has significant implications for improving model performance in reasoning-intensive applications, potentially leading to more robust AI systems.
Key Takeaways
- Token Hidden Reward (THR) quantifies token influence on model performance.
- Positive THR values enhance exploitation, while negative values support exploration.
- A THR-guided reweighting algorithm can improve training outcomes in RL-tuned LLMs.
- The algorithm shows effectiveness across various architectures and RL objectives.
- Improvements in reasoning tasks suggest THR's potential for fine-tuning AI models.
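The takeaways above can be illustrated with a minimal sketch of THR-guided reweighting. The function name, the normalization, and the linear weighting scheme below are illustrative assumptions, not the paper's exact formulation: it simply upweights positive-THR tokens when steering toward exploitation (alpha > 0) and upweights negative-THR tokens when steering toward exploration (alpha < 0).

```python
import numpy as np

def thr_reweight(advantages, thr, alpha=0.5):
    """Sketch of THR-guided reweighting of GRPO token-level signals.

    advantages: per-token group-relative advantage estimates, shape (T,)
    thr:        per-token Token Hidden Reward values, shape (T,)
    alpha:      steering coefficient in [-1, 1]; positive biases training
                toward exploitation (amplify positive-THR tokens),
                negative toward exploration (amplify negative-THR tokens).

    All names and the weighting rule are hypothetical, for illustration.
    """
    # Normalize THR to [-1, 1] so the reweighting stays bounded.
    scale = max(float(np.max(np.abs(thr))), 1e-8)
    thr_norm = thr / scale
    # Linear reweighting: tokens whose THR sign matches the steering
    # direction get weight > 1, opposite-sign tokens get weight < 1.
    weights = np.clip(1.0 + alpha * thr_norm, 0.0, 2.0)
    return advantages * weights
```

With alpha = 0.5, a token with strongly positive THR has its learning signal amplified (exploitation), while with alpha = -0.5 the negative-THR tokens, which preserve probability mass for alternative outputs, are amplified instead (exploration).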
Paper Details
Computer Science > Machine Learning. arXiv:2510.03669 (cs). Submitted 4 Oct 2025 (v1); last revised 15 Feb 2026 (v4).
Authors: Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
Abstract
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the eff...