[2602.12566] To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
Summary
This paper examines the effectiveness of multi-domain reinforcement learning for large language models, comparing mixed multi-task training with separate per-domain training followed by model merging. It presents qualitative and quantitative analyses across math, coding, science, and instruction-following domains.
Why It Matters
Understanding how reinforcement learning can be optimized for multi-domain applications is crucial for advancing the capabilities of large language models. This research offers insight into which training paradigm better preserves per-domain gains, guidance that is relevant to AI developers and researchers working to improve LLMs.
Key Takeaways
- Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning in LLMs.
- Mixed multi-task training and separate training followed by merging are two primary paradigms for multi-domain RLVR.
- The study reveals minimal interference between domains, with reasoning-intensive domains showing mutually synergistic effects.
- Qualitative and quantitative experiments were conducted using open-source datasets.
- Insights into weight space geometry and model behavior provide a deeper understanding of mutual gains.
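The "separate training followed by merging" paradigm combines independently RL-trained checkpoints in weight space. A minimal sketch of the simplest such scheme, uniform parameter averaging, is below; the function name and the use of plain floats in place of real LLM parameter tensors are illustrative assumptions, and production merging methods typically add per-task scaling or interference resolution on top of this.

```python
def merge_state_dicts(state_dicts, weights=None):
    """Merge domain-specific checkpoints by a weighted average of
    each parameter, assuming all checkpoints share the same keys.

    state_dicts: list of {param_name: value} dicts (one per domain).
    weights: optional mixing coefficients; defaults to uniform.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        # Weighted sum of the same parameter across all checkpoints.
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged


# Toy usage: merge a "math" and a "coding" checkpoint uniformly.
math_ckpt = {"layer.weight": 1.0}
code_ckpt = {"layer.weight": 3.0}
merged = merge_state_dicts([math_ckpt, code_ckpt])
```

With uniform weights, each merged parameter is the arithmetic mean of the per-domain values; non-uniform weights let one domain dominate the merge.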
Computer Science > Artificial Intelligence
arXiv:2602.12566 (cs) [Submitted on 13 Feb 2026]
Title: To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
Authors: Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general multi-domain expert-level model is required, the collaboration of RLVR across different domains must be considered carefully. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and reasoning-intensive domains demonstrate mutually synergistic effects...
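The "verifiable rewards" the abstract refers to are programmatic checks that score a model's output against ground truth, rather than a learned reward model. A minimal sketch for a math-style task follows; the function name and the `\boxed{...}` answer convention are illustrative assumptions, not the paper's implementation.

```python
import re


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the completion's final boxed answer matches the
    reference answer exactly (after stripping whitespace), else 0.0.

    Assumes the model is prompted to emit its answer as \\boxed{...},
    a common convention in math RLVR setups.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # No parseable answer: no reward.
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is a deterministic check, it can be computed cheaply at scale; coding domains typically swap the string match for unit-test execution.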