[2602.13035] Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Summary
This paper introduces Introspective LLM, a hierarchical reinforcement learning framework that optimizes sampling temperature in large language models (LLMs) based on internal states, enhancing exploration and performance in tasks like mathematical reasoning.
Why It Matters
The study addresses the limitations of static temperature settings in LLMs by proposing a dynamic approach that adapts to task-level rewards. This advancement could significantly improve the efficiency and effectiveness of LLMs in various applications, particularly in complex reasoning tasks.
Key Takeaways
- Introduces a hierarchical reinforcement learning framework for LLMs.
- Optimizes sampling temperature dynamically based on internal states.
- Demonstrates improved performance in mathematical reasoning tasks.
- Offers interpretable exploration behaviors aligned with reasoning uncertainty.
- Challenges traditional static temperature settings in LLM training.
Computer Science > Machine Learning
arXiv:2602.13035 (cs) [Submitted on 13 Feb 2026]
Title: Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Authors: Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
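The per-step mechanism in the abstract — choose a temperature from the hidden state, then sample the next token from the rescaled distribution — can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the linear temperature policy, the candidate temperature set `TEMPERATURES`, and the weight matrix `W` are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete set of temperatures the policy may choose from.
TEMPERATURES = np.array([0.3, 0.7, 1.0, 1.5])

def temperature_policy(hidden_state, W):
    """Toy linear temperature policy: map the hidden state to a
    softmax distribution over the candidate temperatures."""
    logits = hidden_state @ W                     # shape: (num_temps,)
    probs = np.exp(logits - logits.max())         # stable softmax
    return probs / probs.sum()

def sample_step(hidden_state, token_logits, W):
    """One decoding step: pick a temperature from the hidden state,
    then sample the next token from the temperature-scaled distribution."""
    temp_probs = temperature_policy(hidden_state, W)
    temp = rng.choice(TEMPERATURES, p=temp_probs)
    scaled = token_logits / temp                  # higher temp -> flatter dist
    token_probs = np.exp(scaled - scaled.max())
    token_probs /= token_probs.sum()
    token = rng.choice(len(token_logits), p=token_probs)
    return token, temp

# Toy example: 8-dim hidden state, vocabulary of 5 tokens.
h = rng.standard_normal(8)
W = rng.standard_normal((8, len(TEMPERATURES)))
token_logits = rng.standard_normal(5)
token, temp = sample_step(h, token_logits, W)
print(token, temp)
```

In the paper's framework both the temperature policy and the token policy are trained jointly from downstream rewards via coordinate ascent; the sketch above shows only the forward sampling path that makes temperature a per-step, state-dependent decision rather than a fixed decoding hyperparameter.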