[2603.29902] ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation
Computer Science > Artificial Intelligence
arXiv:2603.29902 (cs)
[Submitted on 31 Mar 2026]
Authors: Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Abstract: Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end-to-end execution and changing tool backends, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system...
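
To make the "when, where, and which" framing concrete, the following is a minimal sketch of how an interleaved tool plan might be represented and inspected without running any tool. It is not the format used by ATP-Bench or the MAM system, which the abstract does not specify. The class names `Text` and `ToolCall`, the tool names `retrieve_image` and `generate_image`, and the helper functions are all hypothetical, chosen only to mirror the retrieval and generation paths the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Text:
    """A span of plain text in the interleaved response."""
    content: str


@dataclass
class ToolCall:
    """A placeholder for an image at this position (hypothetical schema)."""
    tool: str   # assumed tool name, e.g. "retrieve_image" or "generate_image"
    query: str  # what the tool should fetch or render


Segment = Union[Text, ToolCall]


def tool_positions(plan: List[Segment]) -> List[Tuple[int, str]]:
    """List (index, tool) for every tool call: the 'where' and 'which' of a plan.

    An empty result means the planner decided no tool was needed ('when').
    """
    return [(i, seg.tool) for i, seg in enumerate(plan) if isinstance(seg, ToolCall)]


def render(plan: List[Segment], backends: Dict[str, Callable[[str], str]]) -> List[str]:
    """Execute a plan with pluggable backends; swapping backends leaves the plan unchanged."""
    output = []
    for seg in plan:
        if isinstance(seg, Text):
            output.append(seg.content)
        else:
            output.append(backends[seg.tool](seg.query))
    return output


# Illustrative plan: a factual photo is retrieved, a creative poster is generated.
plan = [
    Text("The Eiffel Tower was completed in 1889."),
    ToolCall("retrieve_image", "Eiffel Tower photograph"),
    Text("A stylized poster could look like this:"),
    ToolCall("generate_image", "art deco poster of the Eiffel Tower"),
]

# Stub backends standing in for real retrieval and generation services.
stub_backends = {
    "retrieve_image": lambda q: f"[retrieved: {q}]",
    "generate_image": lambda q: f"[generated: {q}]",
}

positions = tool_positions(plan)
rendered = render(plan, stub_backends)
```

Because `tool_positions` reads only the plan, a judge built along these lines could score planning decisions separately from whatever the backends return, which is the kind of separation the abstract describes.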