[2602.12642] Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
Summary
This paper reinterprets the partition function learned during GFlowNet-based training as a per-prompt difficulty signal and uses it to schedule training prompts, improving the sample efficiency of reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs).
Why It Matters
Reward-maximizing RL improves the reasoning performance of LLMs but often reduces the diversity of their outputs. GFlowNet-based distribution matching addresses this trade-off, and the PACED-RL framework aims to make that training more sample-efficient by putting the already-learned partition function to use instead of treating it only as a normalizer.
Key Takeaways
- The partition function can be used as a per-prompt expected-reward signal.
- PACED-RL improves sample efficiency by prioritizing informative prompts during training.
- Both components of the framework reuse information already produced during GFlowNet training, which limits additional computational cost.
- Extensive experiments show significant performance improvements over existing methods.
- This approach could lead to more efficient distribution-matching training for LLMs.
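To make the first two takeaways concrete, here is a minimal sketch of accuracy-guided prompt prioritization. It is not the authors' implementation. The mapping from the partition function to accuracy is an assumption chosen for illustration (a binary reward r in {0, 1} with Z(x) = E[exp(beta * r)]); the paper derives its own relationship, which this summary does not reproduce. The function names, the `beta` parameter, and the p * (1 - p) weighting are likewise hypothetical.

```python
import math
import random


def accuracy_from_log_z(log_z: float, beta: float = 1.0) -> float:
    """Hypothetical mapping from a learned log-partition value to an accuracy estimate.

    Assumption (not taken from the paper): Z(x) = E[exp(beta * r)] with a
    binary reward r in {0, 1}, so Z(x) = 1 + p * (exp(beta) - 1), where p is
    the per-prompt accuracy. Solving for p gives the expression below.
    """
    p = (math.exp(log_z) - 1.0) / (math.exp(beta) - 1.0)
    return min(1.0, max(0.0, p))  # clamp, since a learned estimate can overshoot


def informativeness(p: float) -> float:
    """Bernoulli reward variance p * (1 - p): largest at p = 0.5, zero at 0 and 1."""
    return p * (1.0 - p)


def sample_prompts(log_z_by_prompt, k, beta=1.0, rng=None):
    """Draw k prompts with replacement, favoring those of intermediate difficulty."""
    rng = rng or random.Random(0)
    prompts = list(log_z_by_prompt)
    # A small floor keeps every prompt reachable even when its estimate is 0 or 1.
    weights = [
        informativeness(accuracy_from_log_z(log_z_by_prompt[q], beta)) + 1e-6
        for q in prompts
    ]
    return rng.choices(prompts, weights=weights, k=k)
```

Under these assumptions, prompts whose estimated accuracy is near 0 or 1 are rarely drawn, since their rewards carry little learning signal, while prompts near 0.5 dominate the batch. The estimates come from values the model already learns, so the scheduler needs no extra rollouts.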
Computer Science > Computation and Language
arXiv:2602.12642 (cs)
[Submitted on 13 Feb 2026]
Title: Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
Authors: Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung
Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amo...