[2602.13035] Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Summary
This paper introduces Introspective LLM, a hierarchical reinforcement learning framework that optimizes sampling temperature in large language models (LLMs) based on internal states, enhancing exploration and performance in tasks like mathematical reasoning.
Why It Matters
The study addresses the limitations of static temperature settings in LLMs by proposing a dynamic approach that adapts to task-level rewards. This advancement could significantly improve the efficiency and effectiveness of LLMs in various applications, particularly in complex reasoning tasks.
Key Takeaways
- Introduces a hierarchical reinforcement learning framework for LLMs.
- Optimizes sampling temperature dynamically based on internal states.
- Demonstrates improved performance in mathematical reasoning tasks.
- Offers interpretable exploration behaviors aligned with reasoning uncertainty.
- Challenges traditional static temperature settings in LLM training.
Computer Science > Machine Learning
arXiv:2602.13035 (cs) [Submitted on 13 Feb 2026]
Title: Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Authors: Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
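The per-step mechanism in the abstract — choose a temperature from the hidden state, then sample the next token from the rescaled distribution — can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the linear temperature policy, the candidate temperature set `TEMPERATURES`, and the weight matrix `W` are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete set of temperatures the policy may choose from.
TEMPERATURES = np.array([0.3, 0.7, 1.0, 1.5])

def temperature_policy(hidden_state, W):
    """Toy linear temperature policy: map the hidden state to a
    softmax distribution over the candidate temperatures."""
    logits = hidden_state @ W                     # shape: (num_temps,)
    probs = np.exp(logits - logits.max())         # stable softmax
    return probs / probs.sum()

def sample_step(hidden_state, token_logits, W):
    """One decoding step: pick a temperature from the hidden state,
    then sample the next token from the temperature-scaled distribution."""
    temp_probs = temperature_policy(hidden_state, W)
    temp = rng.choice(TEMPERATURES, p=temp_probs)
    scaled = token_logits / temp                  # higher temp -> flatter dist
    token_probs = np.exp(scaled - scaled.max())
    token_probs /= token_probs.sum()
    token = rng.choice(len(token_logits), p=token_probs)
    return token, temp

# Toy example: 8-dim hidden state, vocabulary of 5 tokens.
h = rng.standard_normal(8)
W = rng.standard_normal((8, len(TEMPERATURES)))
token_logits = rng.standard_normal(5)
token, temp = sample_step(h, token_logits, W)
print(token, temp)
```

In the paper's framework both the temperature policy and the token policy are trained jointly from downstream rewards via coordinate ascent; the sketch above shows only the forward sampling path that makes temperature a per-step, state-dependent decision rather than a fixed decoding hyperparameter.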