[2602.13035] Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

arXiv - Machine Learning

Summary

This paper introduces Introspective LLM, a hierarchical reinforcement learning framework that optimizes sampling temperature in large language models (LLMs) based on internal states, enhancing exploration and performance in tasks like mathematical reasoning.

Why It Matters

The study addresses the limitations of static temperature settings in LLMs by proposing a dynamic approach that adapts to task-level rewards. This advancement could significantly improve the efficiency and effectiveness of LLMs in various applications, particularly in complex reasoning tasks.

Key Takeaways

  • Introduces a hierarchical reinforcement learning framework for LLMs.
  • Optimizes sampling temperature dynamically based on internal states.
  • Demonstrates improved performance in mathematical reasoning tasks.
  • Offers interpretable exploration behaviors aligned with reasoning uncertainty.
  • Challenges traditional static temperature settings in LLM training.

Computer Science > Machine Learning
arXiv:2602.13035 (cs) [Submitted on 13 Feb 2026]

Title: Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Authors: Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making the decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
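The core mechanism the abstract describes — a temperature policy that reads the model's hidden state at each decoding step, picks a temperature, and then samples the next token from the temperature-scaled distribution — can be sketched in a few lines. The sketch below is a toy illustration, not the authors' implementation: the discrete temperature set, the linear scoring head, and all dimensions are assumptions for demonstration, and the RL training of both policies from downstream rewards is omitted.

```python
import math
import random

random.seed(0)

# Assumed discrete set of temperature actions (the paper's actual action
# space is not specified here).
TEMPS = [0.3, 0.7, 1.0, 1.3]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs):
    """Draw an index from a categorical distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1


def temperature_policy(hidden, weights):
    """Score each candidate temperature with a linear head over the
    hidden state, then sample one (a stochastic higher-level policy)."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in weights]
    return TEMPS[sample_index(softmax(scores))]


def sample_token(logits, temp):
    """Sample the next token from the temperature-scaled distribution."""
    return sample_index(softmax([l / temp for l in logits]))


# One toy decoding step: hidden state -> temperature -> token.
hidden = [0.2, -0.5, 1.1]                      # stand-in for the LLM hidden state
weights = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1],  # one row per temperature action
           [0.2, 0.1, 0.0], [-0.1, 0.2, 0.1]]
logits = [2.0, 1.0, 0.1]                       # stand-in next-token logits

temp = temperature_policy(hidden, weights)
token = sample_token(logits, temp)
print(temp, token)
```

In the paper's framework, both `temperature_policy` (the high-level policy) and the token distribution (the low-level policy) would be optimized jointly from task-level rewards via coordinate ascent; here the weights are fixed purely to show the control flow of one decoding step.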
