[2505.23004] QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
Computer Science > Machine Learning
arXiv:2505.23004 (cs)
[Submitted on 29 May 2025 (v1), last revised 25 Mar 2026 (this version, v2)]

Title: QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
Authors: Kyle R. Chickering, Bangzheng Li, Muhao Chen

Abstract: Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate cross-modal representation learning. The CLIP model is a widely adopted foundational vision-language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations, including being constrained to a fixed input resolution and failing to produce well-separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational cost, because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors underlying the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we propose QLIP, a drop-in replacement for CLIP that can be seamlessly integrated with existing MLLMs with only a few ...
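To illustrate the general idea behind a dynamic quadtree prior (not the paper's actual method, whose details are not given in this abstract), the sketch below recursively subdivides an image into quadrants wherever local pixel variance is high, so detailed regions receive finer partitions than homogeneous ones. The function name, the variance criterion, and the `thresh`/`min_size` parameters are illustrative assumptions.

```python
import numpy as np

def quadtree_split(img, thresh=0.01, min_size=8):
    """Recursively split a 2-D array into quadrants while a region's
    pixel variance exceeds `thresh`, returning the leaf regions as
    (row, col, height, width) tuples.

    Illustrative sketch only: QLIP's actual partitioning criterion
    is not specified in the abstract.
    """
    boxes = []

    def recurse(r, c, h, w):
        region = img[r:r + h, c:c + w]
        # Stop splitting when the region is homogeneous or too small.
        if region.var() <= thresh or h <= min_size or w <= min_size:
            boxes.append((r, c, h, w))
            return
        h2, w2 = h // 2, w // 2
        recurse(r, c, h2, w2)                      # top-left
        recurse(r, c + w2, h2, w - w2)             # top-right
        recurse(r + h2, c, h - h2, w2)             # bottom-left
        recurse(r + h2, c + w2, h - h2, w - w2)    # bottom-right

    recurse(0, 0, img.shape[0], img.shape[1])
    return boxes
```

Under such a scheme, a mostly flat image yields a handful of coarse leaves, while a textured image is decomposed into many fine leaves, giving a resolution-adaptive tokenization rather than a fixed grid.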