[2505.19892] OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
Computer Science > Artificial Intelligence

arXiv:2505.19892 (cs)

[Submitted on 26 May 2025 (v1), last revised 3 Mar 2026 (this version, v3)]

Title: OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Authors: Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao

Abstract: Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs), which extend LLMs through large-scale multimodal training, have gained traction. However, the field lacks a model merging benchmark that clearly separates the tasks used for MLLM training and evaluation. In this paper, $\textbf{(i)}$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine diff...
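To make the idea of model merging concrete, the following is a minimal sketch of the simplest variant, weighted parameter averaging of expert models fine-tuned from a shared base. This is an illustrative toy, not the paper's OptMerge method; the expert names (`vqa_expert`, `ocr_expert`) and the use of plain float lists in place of weight tensors are assumptions made for the example.

```python
# Toy illustration of model merging via weighted parameter averaging.
# Each "model" is a dict mapping parameter names to lists of floats,
# standing in for the weight tensors of experts fine-tuned from one base.

def merge_models(models, weights=None):
    """Merge expert models by a (weighted) elementwise average of parameters."""
    if weights is None:
        # Default to a uniform average over all experts.
        weights = [1.0 / len(models)] * len(models)
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights))
            for i in range(len(models[0][name]))
        ]
    return merged

# Hypothetical experts for two of the benchmark's task types.
vqa_expert = {"layer.w": [1.0, 2.0]}
ocr_expert = {"layer.w": [3.0, 4.0]}

merged = merge_models([vqa_expert, ocr_expert])
print(merged)  # {'layer.w': [2.0, 3.0]}
```

Methods studied in the merging literature differ mainly in how these per-parameter weights are chosen and in how conflicting parameter updates between experts are resolved.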