[2602.20197] Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Summary
The paper presents CalibRL, a hybrid-policy RLVR framework that makes exploration in multi-modal reasoning tasks controllable, balancing exploration against exploitation through expert-guided strategies.
Why It Matters
This research addresses the challenges of reinforcement learning in multi-modal large language models, particularly the issues of entropy collapse and policy degradation. By proposing a controllable exploration method, it contributes to more effective training strategies, which can lead to improved performance in AI applications that require reasoning across different modalities.
Key Takeaways
- CalibRL enhances exploration in reinforcement learning by using expert guidance.
- The framework employs distribution-aware advantage weighting to calibrate updates.
- Asymmetric activation functions help moderate overconfident updates while preserving direction.
- Extensive experiments show consistent improvements across multiple benchmarks.
- The approach aims to alleviate distributional mismatches between model policies and expert trajectories.
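To make the "distribution-aware advantage weighting" takeaway concrete, the sketch below scales group-relative advantages by the inverse frequency (rareness) of each reward outcome within a sampled group, so rare outcomes contribute larger updates and exploration is preserved. This is an illustration of the general idea under assumed details; the function name, the binning by exact reward value, and the mean-normalization are not taken from the paper.

```python
import numpy as np

def rareness_weighted_advantages(rewards):
    """Hypothetical sketch of distribution-aware advantage weighting.

    Within a group of sampled rollouts, the group-mean-subtracted advantage
    is scaled by the inverse empirical frequency ("rareness") of each reward
    outcome. Illustrative only; not CalibRL's actual formulation.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()          # group-relative (GRPO-style) advantage
    # Empirical frequency of each distinct reward outcome in the group.
    values, counts = np.unique(rewards, return_counts=True)
    freq = dict(zip(values, counts / len(rewards)))
    rareness = np.array([1.0 / freq[r] for r in rewards])
    rareness /= rareness.mean()             # normalize so the mean weight is 1
    return adv * rareness

# Example: one rare success among three failures gets an amplified update.
advs = rareness_weighted_advantages([0, 0, 0, 1])
```

With rewards `[0, 0, 0, 1]`, the lone success has advantage 0.75 and rareness weight 2 after normalization, so its update (1.5) is amplified relative to the common failures.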
Computer Science > Machine Learning
arXiv:2602.20197 (cs) [Submitted on 22 Feb 2026]
Title: Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Authors: Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
Abstract: Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This calls for an exploration strategy that maintains productive stochasticity while avoiding the inefficiency of uncontrolled random sampling. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Second, an asymmetric activation function (LeakyReLU) leverages expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided ma...
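The abstract's second mechanism, asymmetric moderation of updates, can be sketched with a LeakyReLU-style transform on the advantage: positive advantages pass through unchanged, while negative ones are shrunk by a small slope but keep their sign, so the corrective direction survives while the magnitude is moderated. The function name and the choice to apply the transform directly to the advantage are assumptions for illustration, not the paper's exact formulation (which calibrates against an expert baseline).

```python
import numpy as np

def moderated_advantage(adv, negative_slope=0.2):
    """Hypothetical sketch of asymmetric (LeakyReLU-style) moderation.

    Positive advantages are kept as-is; negative advantages are scaled by
    `negative_slope`, preserving their corrective direction while damping
    potentially overconfident updates. Illustrative only.
    """
    adv = np.asarray(adv, dtype=float)
    return np.where(adv > 0, adv, negative_slope * adv)

out = moderated_advantage([1.5, -2.0])
# out[0] == 1.5 (unchanged), out[1] == -0.4 (damped, same sign)
```

The asymmetry is the point: a plain ReLU would zero out negative advantages and discard their direction, whereas the leaky slope keeps a (moderated) corrective signal.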