[2601.11109] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning


arXiv - AI 3 min read

About this article

Abstract page for arXiv paper 2601.11109: Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.11109 (cs) [Submitted on 16 Jan 2026 (v1), last revised 6 Apr 2026 (this version, v3)]

Title: Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Authors: Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng

Abstract: Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework in which symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: it synthesizes symbolic programs, projects them into visual states, and inspects discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging...
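The code-render-inspect loop described in the abstract can be sketched as a simple control flow. The sketch below is purely illustrative and hypothetical: the actual system uses a VLM to propose program edits and a renderer such as Blender to produce visual states, whereas here `render`, `inspect`, and the edit policy are toy stand-ins (scenes as sets of named elements) invented to show the loop structure only.

```python
def render(program):
    # Project the symbolic program into a "visual" state
    # (toy stand-in: the set of scene elements it declares).
    return set(program)

def inspect(rendered, target):
    # Compare the rendered state against the target observation
    # (toy stand-in: set differences instead of visual comparison).
    missing = target - rendered
    extra = rendered - target
    return missing, extra

def viga_loop(target, max_steps=10):
    program = []  # start from an empty symbolic program
    memory = []   # evolving memory of past edits and evidence
    for step in range(max_steps):
        rendered = render(program)
        missing, extra = inspect(rendered, target)
        if not missing and not extra:
            break  # rendered state matches the target: done
        # Evidence-based edit: repair one discrepancy per iteration.
        if missing:
            edit = ("add", next(iter(missing)))
            program.append(edit[1])
        else:
            edit = ("remove", next(iter(extra)))
            program.remove(edit[1])
        memory.append((step, edit))
    return program, memory

program, memory = viga_loop({"cube", "sphere", "light"})
print(sorted(program))  # → ['cube', 'light', 'sphere']
```

Here the target scene is recovered in three iterations, one discrepancy at a time; the key point is that every edit is driven by an inspected mismatch between the rendered and target states, not by a one-shot guess.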

Originally published on April 07, 2026. Curated by AI News.

Related Articles

When Robots Have Their ChatGPT Moment, Remember These Pincers | WIRED

From sorting chicken nuggets to screwing in light bulbs, Eka’s robots are eerily lifelike. But do they have real physical smarts?

Wired - AI · 13 min

87% Cost Savings & Sub-3s Latency: I built a "Warm-Cache" harness for persistent Claude agents.

**The "Goldfish Problem" is expensive. I decided to fix the plumbing.** Most Claude implementations leave 90% of their money on the table...

Reddit - Artificial Intelligence · 1 min

What are people using for low-latency autocomplete in production? [P]

I’ve been looking into autocomplete/typeahead systems recently, especially in contexts where latency really matters (e.g. search-as-you-t...

Reddit - Machine Learning · 1 min ·

General Motors is adding Gemini to four million cars | The Verge

General Motors is planning to bring Google’s Gemini AI assistant to around four million vehicles across the US.

The Verge - AI · 4 min

