[2602.22623] ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Summary
The paper presents ContextRL, a framework that enhances knowledge discovery efficiency in multimodal large language models (MLLMs) through context augmentation and improved reward modeling.
Why It Matters
As AI models become increasingly complex, optimizing their knowledge discovery processes is crucial. ContextRL addresses two key challenges in reward modeling, identifiability and reachability, enabling more accurate reward signals and more efficient learning. This research contributes to the ongoing development of robust AI systems and highlights the importance of contextual information in reinforcement learning.
Key Takeaways
- ContextRL improves knowledge discovery efficiency in MLLMs.
- The framework uses context to enhance reward model accuracy.
- Experimental results show significant performance gains over traditional methods.
- Multi-turn sampling with mistake reports helps the policy recover correct responses from all-negative groups.
- ContextRL effectively mitigates reward hacking in reinforcement learning.
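The first takeaway, using context to improve reward-model accuracy, can be illustrated with a minimal sketch. The paper gives no code, so the judge below is a hypothetical stand-in: it receives the full reference solution as context and rejects "false positives", samples with the right final answer but unsupported reasoning. The function names and the toy substring check are assumptions for illustration only.

```python
# Hypothetical sketch of context-augmented reward verification: the reward
# model sees the full reference solution and can reject false positives
# (right final answer, flawed reasoning). The toy judge below is
# illustrative, not the paper's implementation.

def judge_with_context(sample: dict, reference_solution: str) -> float:
    """Return 1.0 only if the final answer matches AND every reasoning
    step is supported by the reference solution (toy substring check)."""
    if sample["answer"] != sample["gold_answer"]:
        return 0.0  # wrong answer: ordinary negative
    # Process verification: with the reference in context, flag samples
    # whose reasoning steps the reference does not support.
    supported = all(step in reference_solution for step in sample["steps"])
    return 1.0 if supported else 0.0  # filters the false positive

reference = "step1: expand the square; step2: cancel terms; answer: 4"

good = {"answer": "4", "gold_answer": "4",
        "steps": ["step1: expand the square", "step2: cancel terms"]}
false_positive = {"answer": "4", "gold_answer": "4",
                  "steps": ["guess based on pattern"]}

rewards = [judge_with_context(s, reference) for s in (good, false_positive)]
print(rewards)  # [1.0, 0.0]: the lucky guess is rejected despite the right answer
```

An answer-only verifier would score both samples 1.0; the reference solution in context is what lets the judge separate them.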
Computer Science > Machine Learning
arXiv:2602.22623 (cs)
[Submitted on 26 Feb 2026]
Title: ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Authors: Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan
Abstract: We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reve...
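The abstract's multi-turn sampling strategy can also be sketched. When every rollout in a group fails (an "all-negative" group, which yields no learning signal in standard RLVR), a reward model writes a mistake report and the policy resamples conditioned on it. The toy policy and reporter below are stand-ins, not the paper's models; all names are assumptions.

```python
# Hypothetical sketch of multi-turn sampling: a second sampling turn,
# conditioned on a reward-model mistake report, recovers positives from
# an all-negative group. Toy functions only, not the paper's method.
import random

def policy(question: str, hint: str = "") -> str:
    # Toy policy: it answers wrongly on its own but recovers given the hint.
    if "off-by-one" in hint:
        return "10"
    return random.choice(["8", "9", "11"])

def mistake_report(question: str, failed: list) -> str:
    # Toy reward-model feedback summarizing what went wrong in failed attempts.
    return "off-by-one error: remember the range is inclusive"

def sample_group(question: str, gold: str, group_size: int = 4) -> list:
    attempts = [policy(question) for _ in range(group_size)]
    rewards = [1.0 if a == gold else 0.0 for a in attempts]
    if not any(rewards):  # all-negative group: run a second turn with feedback
        hint = mistake_report(question, attempts)
        attempts = [policy(question, hint) for _ in range(group_size)]
        rewards = [1.0 if a == gold else 0.0 for a in attempts]
    return list(zip(attempts, rewards))

group = sample_group("How many integers are in [1, 10]?", gold="10")
print(any(r == 1.0 for _, r in group))  # True: the second turn recovers positives
```

The recovered positives give the policy a gradient signal on questions that a single-turn sampler would have wasted entirely, which is the reachability improvement the abstract describes.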