[2603.20808] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.20808 (cs)
[Submitted on 21 Mar 2026]

Title: Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Authors: Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng

Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost that their language-driven training imposes on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis that unveils a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representations in the middle layers of the LLM exhibit degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective: the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and we propose Predictive Regularization (PRe), which forces degraded intermediate features to predict the initial visual features, thereby maintaining the inherent visual attributes of...
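To make the idea concrete, below is a minimal PyTorch sketch of how a predictive-regularization objective of this kind could look: intermediate hidden states at the visual-token positions are passed through a small predictor head and regressed onto the frozen initial visual features, with the resulting loss added to the usual text-generation loss. The two-layer MLP predictor, the MSE loss form, the choice of middle layer, and the 0.1 weight are all illustrative assumptions; the abstract does not specify how PRe instantiates them.

```python
# Hedged sketch of a predictive-regularization-style auxiliary loss.
# All architectural choices here are assumptions, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveRegularizer(nn.Module):
    """Predicts initial visual features from intermediate LLM hidden states."""

    def __init__(self, hidden_dim: int, visual_dim: int):
        super().__init__()
        # Hypothetical lightweight predictor head (two-layer MLP).
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, mid_hidden: torch.Tensor, init_visual: torch.Tensor) -> torch.Tensor:
        # mid_hidden:  (batch, num_visual_tokens, hidden_dim)
        #   hidden states at the visual-token positions of a middle LLM layer.
        # init_visual: (batch, num_visual_tokens, visual_dim)
        #   the initial visual features, used as a frozen regression target.
        pred = self.predictor(mid_hidden)
        # Detach the target so gradients only reshape the intermediate
        # representation, never the initial visual features themselves.
        return F.mse_loss(pred, init_visual.detach())

# Usage: combine with the standard language-modeling loss.
reg = PredictiveRegularizer(hidden_dim=4096, visual_dim=1024)
mid = torch.randn(2, 576, 4096)   # e.g. 576 visual tokens per image
tgt = torch.randn(2, 576, 1024)
lm_loss = torch.tensor(1.0)       # placeholder for the text-generation loss
total_loss = lm_loss + 0.1 * reg(mid, tgt)  # 0.1 is an assumed weight
```

Detaching the target is the key design choice in this sketch: it keeps the initial visual features as a fixed anchor, so the regularizer counteracts degradation in the intermediate layers rather than collapsing the target toward them.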