[2505.24840] The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
arXiv:2505.24840 (cs)
[Submitted on 30 May 2025 (v1), last revised 26 Mar 2026 (this version, v2)]

Title: The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Authors: Yuwen Tan, Yuan Qing, Boqing Gong

Abstract: This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency more than the vision LLMs'. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2505.24...
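
To make the abstract's setup concrete, the following is a minimal sketch of how a four-choice question might be built from a taxonomy and how hierarchical consistency might be scored. Everything in it is an assumption for illustration: the toy `PARENT` map, the helper names `path_to_root`, `make_question`, and `hierarchical_consistency`, the question wording, the distractor sampling, and the "correct at every level" scoring rule are not taken from the paper, whose actual construction and metric may differ.

```python
import random

# Hypothetical child -> parent map; not one of the paper's six taxonomies.
PARENT = {
    "anemone fish": "fish",
    "goldfish": "fish",
    "fish": "vertebrate",
    "bald eagle": "bird",
    "bird": "vertebrate",
    "monarch butterfly": "insect",
    "insect": "invertebrate",
    "garden snail": "mollusk",
    "mollusk": "invertebrate",
    "vertebrate": "animal",
    "invertebrate": "animal",
}


def path_to_root(label):
    """Return [label, parent, grandparent, ...] ending at the root."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def make_question(leaf, level, rng):
    """Build a four-choice question about the ancestor `level` steps above `leaf`.

    Distractors are drawn from nodes that are not on the leaf's path,
    so none of them is a correct label for the image at any level.
    """
    path = path_to_root(leaf)
    answer = path[level]
    all_nodes = set(PARENT) | set(PARENT.values())
    candidates = sorted(all_nodes - set(path))
    choices = rng.sample(candidates, 3) + [answer]
    rng.shuffle(choices)
    return {
        "question": "Which category does the object in the image belong to?",
        "choices": choices,
        "answer": answer,
    }


def hierarchical_consistency(results):
    """Fraction of images answered correctly at every level of their path.

    `results` holds one list of per-level booleans per image.
    """
    return sum(all(levels) for levels in results) / len(results)


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_question("anemone fish", 2, rng))
    # One image right at all levels, one that misses a coarser level.
    print(hierarchical_consistency([[True, True, True], [True, False, True]]))
```

Under this scoring rule, a model that names the leaf (Anemone Fish) but misses a coarser level (Vertebrate) counts as inconsistent for that image, which is the failure mode the abstract describes.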