[2603.02748] iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.02748 (cs)
[Submitted on 3 Mar 2026]

Title: iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
Authors: HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

Abstract: Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are used in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves the task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, w...
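To make the dual-branch idea concrete, the sketch below shows one plausible reading of the abstract: a frozen vision encoder supplies task-agnostic features, and a lightweight conditioning branch applies instruction-dependent affine modulation through AdaLN. This is a minimal illustration, not the paper's implementation; the module names, dimensions, and the way the instruction embedding is produced are assumptions.

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Hypothetical AdaLN block: an instruction embedding predicts a per-channel
    scale and shift that modulate layer-normalized visual features."""
    def __init__(self, vis_dim: int, instr_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(vis_dim, elementwise_affine=False)
        # Predict affine parameters (gamma, beta) from the instruction embedding.
        self.to_scale_shift = nn.Linear(instr_dim, 2 * vis_dim)

    def forward(self, vis_tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim); instr_emb: (B, instr_dim)
        gamma, beta = self.to_scale_shift(instr_emb).chunk(2, dim=-1)
        x = self.norm(vis_tokens)
        # Broadcast the instruction-dependent affine transform over all visual tokens.
        return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

class DualBranchEncoder(nn.Module):
    """Hypothetical dual-branch wrapper: the frozen representation branch keeps
    pre-trained visual priors intact, while the conditioning branch is trainable."""
    def __init__(self, vision_encoder: nn.Module, vis_dim: int, instr_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False  # frozen representation branch
        self.modulation = AdaLNModulation(vis_dim, instr_dim)

    def forward(self, images: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.vision_encoder(images)  # assumed shape (B, N, vis_dim)
        return self.modulation(feats, instr_emb)
```

Under this reading, only the conditioning branch (the AdaLN projection) receives gradients, which is consistent with the abstract's claim of preserving the structural integrity and stability of the pre-trained visual representations.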