[2602.20197] Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Summary
The paper presents CalibRL, a hybrid-policy RLVR framework that makes exploration in multi-modal reasoning tasks controllable, balancing exploration against exploitation through expert-guided strategies.
Why It Matters
This research addresses the challenges of reinforcement learning in multi-modal large language models, particularly the issues of entropy collapse and policy degradation. By proposing a controllable exploration method, it contributes to more effective training strategies, which can lead to improved performance in AI applications that require reasoning across different modalities.
Key Takeaways
- CalibRL enhances exploration in reinforcement learning by using expert guidance.
- The framework employs distribution-aware advantage weighting to calibrate updates.
- Asymmetric activation functions help moderate overconfident updates while preserving direction.
- Extensive experiments show consistent improvements across multiple benchmarks.
- The approach aims to alleviate distributional mismatches between model policies and expert trajectories.
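To make the "distribution-aware advantage weighting" takeaway concrete, the sketch below scales group-relative advantages by the inverse frequency (rareness) of each reward outcome within a sampled group, so rare outcomes contribute larger updates and exploration is preserved. This is an illustration of the general idea under assumed details; the function name, the binning by exact reward value, and the mean-normalization are not taken from the paper.

```python
import numpy as np

def rareness_weighted_advantages(rewards):
    """Hypothetical sketch of distribution-aware advantage weighting.

    Within a group of sampled rollouts, the group-mean-subtracted advantage
    is scaled by the inverse empirical frequency ("rareness") of each reward
    outcome. Illustrative only; not CalibRL's actual formulation.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()          # group-relative (GRPO-style) advantage
    # Empirical frequency of each distinct reward outcome in the group.
    values, counts = np.unique(rewards, return_counts=True)
    freq = dict(zip(values, counts / len(rewards)))
    rareness = np.array([1.0 / freq[r] for r in rewards])
    rareness /= rareness.mean()             # normalize so the mean weight is 1
    return adv * rareness

# Example: one rare success among three failures gets an amplified update.
advs = rareness_weighted_advantages([0, 0, 0, 1])
```

With rewards `[0, 0, 0, 1]`, the lone success has advantage 0.75 and rareness weight 2 after normalization, so its update (1.5) is amplified relative to the common failures.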
Computer Science > Machine Learning
arXiv:2602.20197 (cs) [Submitted on 22 Feb 2026]
Title: Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Authors: Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
Abstract: Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This calls for an exploration strategy that maintains productive stochasticity while avoiding the inefficiency of uncontrolled random sampling. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Second, an asymmetric activation function (LeakyReLU) leverages expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided ma...
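The abstract's second mechanism, asymmetric moderation of updates, can be sketched with a LeakyReLU-style transform on the advantage: positive advantages pass through unchanged, while negative ones are shrunk by a small slope but keep their sign, so the corrective direction survives while the magnitude is moderated. The function name and the choice to apply the transform directly to the advantage are assumptions for illustration, not the paper's exact formulation (which calibrates against an expert baseline).

```python
import numpy as np

def moderated_advantage(adv, negative_slope=0.2):
    """Hypothetical sketch of asymmetric (LeakyReLU-style) moderation.

    Positive advantages are kept as-is; negative advantages are scaled by
    `negative_slope`, preserving their corrective direction while damping
    potentially overconfident updates. Illustrative only.
    """
    adv = np.asarray(adv, dtype=float)
    return np.where(adv > 0, adv, negative_slope * adv)

out = moderated_advantage([1.5, -2.0])
# out[0] == 1.5 (unchanged), out[1] == -0.4 (damped, same sign)
```

The asymmetry is the point: a plain ReLU would zero out negative advantages and discard their direction, whereas the leaky slope keeps a (moderated) corrective signal.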