[2601.13622] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.13622 (cs)

[Submitted on 20 Jan 2026 (v1), last revised 26 Mar 2026 (this version, v3)]

Title: CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

Authors: Donghee Lee, Rui Cai, Zhe Zhao

Abstract: Large vision-language models (LVLMs) are typically trained with autoregressive language modeling objectives, which align visual representations with the linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight the visual and textual modalities and to capture diverse aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing...
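The abstract does not specify the architecture in detail, but a minimal sketch of the context-aware ensemble idea it describes could look like the following. All module and variable names (`ContextAwareEnsemble`, `vision_proj`, `gate`, dimensions, etc.) are hypothetical illustrations, not the authors' implementation: raw vision-encoder features are projected into the LLM space by a "vision-integration" layer, then ensembled with the language-aligned image tokens via a gate conditioned on the text context.

```python
# Hypothetical sketch of a context-aware ensemble (not the authors' code):
# raw vision-encoder features and LLM-aligned image features are combined
# with a gate conditioned on a pooled text-context embedding.
import torch
import torch.nn as nn

class ContextAwareEnsemble(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # "Vision-integration layer": project raw vision features into LLM space.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Gate producing per-dimension mixing weights from the text context.
        self.gate = nn.Sequential(nn.Linear(llm_dim, llm_dim), nn.Sigmoid())

    def forward(self, raw_vision_feats, aligned_feats, context_emb):
        # raw_vision_feats: (B, N, vision_dim) from the base vision encoder
        # aligned_feats:    (B, N, llm_dim) language-aligned image tokens
        # context_emb:      (B, llm_dim) pooled text-context representation
        v = self.vision_proj(raw_vision_feats)       # (B, N, llm_dim)
        alpha = self.gate(context_emb).unsqueeze(1)  # (B, 1, llm_dim)
        # Adaptively weight vision-centric vs. language-aligned features.
        return alpha * v + (1.0 - alpha) * aligned_feats

# Usage sketch with made-up dimensions (e.g., a ViT encoder and a 4096-d LLM).
model = ContextAwareEnsemble(vision_dim=1024, llm_dim=4096)
raw = torch.randn(2, 196, 1024)
aligned = torch.randn(2, 196, 4096)
ctx = torch.randn(2, 4096)
out = model(raw, aligned, ctx)  # (2, 196, 4096)
```

Under this reading, the gate lets the model lean on raw encoder features for vision-centric queries (e.g., image classification) and on aligned features for language-heavy reasoning; how CARPE actually parameterizes the ensemble is specified in the full paper.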