[2604.09531] VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Computer Science > Computer Vision and Pattern Recognition

arXiv:2604.09531 (cs)

[Submitted on 10 Apr 2026]

Title: VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Authors: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu

Abstract: Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input, uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP...
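
The abstract describes a three-stage pipeline (LLM-generated QA pairs and T2I prompts, T2I image synthesis, VLM-based consistency filtering). The sketch below is a hypothetical illustration of that flow, not the authors' released code: `llm`, `t2i`, and `verifier_vlm` are assumed callables standing in for an LLM, a text-to-image model, and the proprietary verifier VLM mentioned in the abstract.

```python
"""Minimal sketch of a VisionFoundry-style generation loop (hypothetical)."""
from dataclasses import dataclass

@dataclass
class VQATriple:
    image_path: str
    question: str
    answer: str

def generate_triples(task_name, llm, t2i, verifier_vlm, n_samples=10):
    """Generate image-question-answer triples from only a task keyword."""
    triples = []
    for i in range(n_samples):
        # 1. LLM expands the task keyword (e.g. "Depth Order") into a
        #    question, its answer, and a text-to-image prompt.
        spec = llm(
            f"Task: {task_name}. Produce a JSON object with keys "
            "'question', 'answer', and 't2i_prompt' that tests this skill."
        )
        # 2. A T2I model synthesizes an image from the generated prompt.
        image_path = t2i(spec["t2i_prompt"], out_path=f"{task_name}_{i}.png")
        # 3. A verifier VLM checks image/question/answer consistency;
        #    inconsistent samples are discarded.
        verdict = verifier_vlm(
            image_path,
            f"Question: {spec['question']}\n"
            f"Proposed answer: {spec['answer']}\n"
            "Is the answer correct for this image? Reply yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            triples.append(
                VQATriple(image_path, spec["question"], spec["answer"])
            )
    return triples
```

No reference images or human annotations enter the loop; the task name alone drives generation, and the verifier stage is the only quality gate, matching the pipeline as summarized in the abstract.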