[2601.11109] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.11109 (cs)

[Submitted on 16 Jan 2026 (v1), last revised 6 Apr 2026 (this version, v3)]

Title: Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Authors: Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng

Abstract: Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework in which symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: it synthesizes symbolic programs, projects them into visual states, and inspects discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging...
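
The abstract's code-render-inspect loop can be pictured as a simple iterative control flow. The following is a minimal Python sketch of such a loop under stated assumptions: the method names `propose_program`, `inspect`, and `edit_program`, the `Critique` and `Memory` types, and the `matches` flag are all illustrative placeholders, not the paper's actual API, and the real VIGA system (semantic skills, Blender integration) is far richer.

```python
# Minimal sketch of a code-render-inspect loop, as described in the abstract.
# All names below are hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass, field
from typing import Any, List, Protocol, Tuple


@dataclass
class Critique:
    """Outcome of visually inspecting a render against the target image."""
    matches: bool   # True once perception confirms the symbolic program
    feedback: str   # described discrepancies that guide the next edit


@dataclass
class Memory:
    """Evolving multimodal memory of past programs, renders, and critiques."""
    steps: List[Tuple[str, Any, Critique]] = field(default_factory=list)


class Agent(Protocol):
    def propose_program(self, target: Any) -> str: ...
    def inspect(self, target: Any, render: Any, memory: Memory) -> Critique: ...
    def edit_program(self, program: str, critique: Critique) -> str: ...


class Renderer(Protocol):
    def render(self, program: str) -> Any: ...


def code_render_inspect(target: Any, agent: Agent, renderer: Renderer,
                        max_steps: int = 10) -> str:
    """Iterate code -> render -> inspect until the render matches the target."""
    memory = Memory()
    program = agent.propose_program(target)               # synthesize a symbolic program
    for _ in range(max_steps):
        render = renderer.render(program)                 # project the program to a visual state
        critique = agent.inspect(target, render, memory)  # cross-verify symbols against pixels
        memory.steps.append((program, render, critique))
        if critique.matches:                              # perception confirms the program
            break
        program = agent.edit_program(program, critique)   # evidence-based, targeted edit
    return program
```

Note how the accumulated `Memory` is passed back into each inspection step; this mirrors the abstract's claim that an evolving multimodal memory is what sustains evidence-based edits over long horizons rather than one-shot generation.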