[2602.20197] Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

arXiv - Machine Learning

Summary

The paper presents CalibRL, a hybrid-policy RLVR framework that makes exploration in multi-modal reasoning tasks controllable, using expert guidance to balance exploration against exploitation.

Why It Matters

This research addresses the challenges of reinforcement learning in multi-modal large language models, particularly the issues of entropy collapse and policy degradation. By proposing a controllable exploration method, it contributes to more effective training strategies, which can lead to improved performance in AI applications that require reasoning across different modalities.

Key Takeaways

  • CalibRL enhances exploration in reinforcement learning by using expert guidance.
  • The framework employs distribution-aware advantage weighting to calibrate updates.
  • Asymmetric activation functions help moderate overconfident updates while preserving direction.
  • Extensive experiments show consistent improvements across multiple benchmarks.
  • The approach aims to alleviate distributional mismatches between model policies and expert trajectories.
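The distribution-aware advantage weighting in the second takeaway can be illustrated with a small sketch. The paper's exact formula is not given in this summary, so the function name, the inverse-square-root rareness weight, and the normalization below are all illustrative assumptions: group-relative (GRPO-style) advantages are scaled up for rare rollout outcomes and down for common ones, so rare behaviors keep influencing the update and exploration is preserved.

```python
import numpy as np

def rareness_weighted_advantages(rewards, counts):
    """Hypothetical sketch of distribution-aware advantage weighting.

    `rewards` are per-rollout rewards within one group; `counts` are how
    often each rollout's outcome occurs in that group. The weighting
    scheme here (1/sqrt(count), mean-normalized) is an assumption, not
    the paper's formula.
    """
    rewards = np.asarray(rewards, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Group-relative advantage: reward minus the group mean, normalized.
    adv = rewards - rewards.mean()
    std = rewards.std()
    if std > 0:
        adv = adv / std
    # Rare outcomes (low count) get a larger weight; normalize so the
    # overall update scale stays comparable across groups.
    weights = 1.0 / np.sqrt(counts)
    weights = weights / weights.mean()
    return adv * weights
```

With rewards `[1, 0, 0, 1]` and counts `[1, 3, 3, 1]`, the rare positive rollouts end up with larger-magnitude advantages than the common negative ones, while all signs are preserved.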

Computer Science > Machine Learning · arXiv:2602.20197 (cs) · Submitted on 22 Feb 2026

Title: Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Authors: Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han

Abstract: Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the inefficient exploration of uncontrolled random sampling. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Second, an asymmetric activation function (LeakyReLU) leverages expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided ma...
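The asymmetric activation mechanism described in the abstract can be sketched as applying a LeakyReLU-shaped gate to the calibration signal: updates on one side of the expert baseline pass at full scale, while overconfident updates on the other side are damped but keep their corrective sign. The pivot at zero, the slope value, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def leaky_calibrate(signal, slope=0.1):
    """Hypothetical sketch of asymmetric (LeakyReLU-style) calibration.

    `signal` is an update term measured relative to an expert-derived
    baseline. Non-negative components pass through unchanged; negative
    (overconfident) components are scaled down by `slope`, moderating
    their magnitude while preserving their direction.
    """
    s = np.asarray(signal, dtype=float)
    return np.where(s >= 0, s, slope * s)
```

For example, `leaky_calibrate(np.array([2.0, -2.0]))` keeps the positive component at 2.0 and shrinks the negative one to -0.2, so the update's direction is never flipped, only its confidence is tempered.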
