[2510.15040] Composition-Grounded Data Synthesis for Visual Reasoning
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.15040 (cs)
[Submitted on 16 Oct 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: Composition-Grounded Data Synthesis for Visual Reasoning
Authors: Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

Abstract: Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human-annotated reasoning datasets. We introduce COGS (COmposition-Grounded data Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially imp...
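The sketch below is a minimal illustration of the recompose-then-reward idea described in the abstract: perception and reasoning factors are crossed with new images to synthesize question-answer pairs with subquestions, and a factor-level process reward gives partial credit for correct intermediate answers. All names (PerceptionFactor, ReasoningFactor, recompose, process_reward) and the specific reward weighting are hypothetical conveniences for this sketch, not the paper's actual pipeline.

from dataclasses import dataclass
from itertools import product

@dataclass
class PerceptionFactor:          # e.g. "read the value of the tallest bar"
    template: str                # subquestion phrasing over a chart element
    extract: callable            # maps an image (here: a dict of chart data) to an intermediate answer

@dataclass
class ReasoningFactor:           # e.g. "double a retrieved value"
    template: str                # question template with a slot for the perception subquestion
    combine: callable            # maps intermediate answers to the final answer

def recompose(perception, reasoning, images):
    """Cross perception and reasoning factors with new images to synthesize QA pairs,
    keeping subquestions and intermediate answers for factor-level supervision."""
    examples = []
    for img, p, r in product(images, perception, reasoning):
        sub_answer = p.extract(img)
        examples.append({
            "image": img,
            "question": r.template.format(entity=p.template),
            "subquestions": [(p.template, sub_answer)],
            "answer": r.combine(sub_answer),
        })
    return examples

def process_reward(pred_subanswers, pred_answer, example):
    """Toy factor-level process reward: half credit for correct intermediate answers,
    half for the final answer (weights are illustrative only)."""
    gold_subs = [a for _, a in example["subquestions"]]
    sub_score = sum(p == g for p, g in zip(pred_subanswers, gold_subs)) / max(len(gold_subs), 1)
    return 0.5 * sub_score + 0.5 * float(pred_answer == example["answer"])

if __name__ == "__main__":
    charts = [{"A": 3, "B": 7}, {"A": 10, "B": 4}]          # stand-ins for rendered chart images
    perception = [PerceptionFactor("the maximum bar value", lambda c: max(c.values()))]
    reasoning = [ReasoningFactor("What is {entity} doubled?", lambda v: 2 * v)]
    data = recompose(perception, reasoning, charts)
    print(data[0]["question"], "->", data[0]["answer"])      # "What is the maximum bar value doubled? -> 14"
    print(process_reward([7], 14, data[0]))                  # 1.0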