[2508.03284] ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
Computer Science > Artificial Intelligence

arXiv:2508.03284 (cs)

[Submitted on 5 Aug 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
Authors: Shaofeng Yin, Ting Lei, Yang Liu

Abstract: Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhancing their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings that require multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, aligning more closely with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an ...
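The abstract credits ToolEngine with two interlocking mechanisms: a depth-first search over tool-call sequences and dynamic in-context example matching at each step. The following is a minimal runnable sketch of how these two ideas can combine; the tool names, the prefix-overlap similarity heuristic, and the "answer"-based stopping rule are all assumptions for illustration, since the abstract does not specify them (the actual pipeline uses an LFM to guide tool selection).

from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    steps: List[str] = field(default_factory=list)  # tool calls made so far

# Hypothetical in-context example pool: each entry is a known-good tool sequence.
EXAMPLE_POOL = [
    ["ocr", "search", "answer"],
    ["detect", "caption", "answer"],
    ["ocr", "calculator", "answer"],
]

def match_examples(state: State, k: int = 2) -> List[List[str]]:
    """Dynamically re-select the k examples whose prefix best overlaps the
    current partial trajectory (a stand-in for the paper's matcher)."""
    def overlap(ex: List[str]) -> int:
        return sum(a == b for a, b in zip(ex, state.steps))
    return sorted(EXAMPLE_POOL, key=overlap, reverse=True)[:k]

def propose_tools(state: State) -> List[str]:
    """Propose next tools from the matched examples; the real pipeline would
    instead query an LFM conditioned on these examples."""
    i = len(state.steps)
    return list(dict.fromkeys(ex[i] for ex in match_examples(state) if i < len(ex)))

def dfs(state: State, max_depth: int = 4):
    """Depth-first search: extend the trajectory one tool at a time,
    yielding complete trajectories and backtracking on dead ends."""
    if state.steps and state.steps[-1] == "answer":  # assumed stopping rule
        yield state.steps
        return
    if len(state.steps) >= max_depth:
        return
    for tool in propose_tools(state):
        yield from dfs(State(state.steps + [tool]), max_depth)

if __name__ == "__main__":
    for traj in dfs(State()):
        print(" -> ".join(traj))

Because the example pool is re-matched after every tool call, the proposals adapt to the partial trajectory rather than being fixed up front, which is the property the abstract describes as "dynamic" in-context matching.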