[2601.11109] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.11109 (cs)

[Submitted on 16 Jan 2026 (v1), last revised 6 Apr 2026 (this version, v3)]

Title: Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Authors: Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng

Abstract: Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework in which symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: it synthesizes symbolic programs, projects them into visual states, and inspects discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging...
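
The abstract's code-render-inspect loop can be pictured as a simple iterative control flow. The following is a minimal Python sketch of such a loop under stated assumptions: the method names `propose_program`, `inspect`, and `edit_program`, the `Critique` and `Memory` types, and the `matches` flag are all illustrative placeholders, not the paper's actual API, and the real VIGA system (semantic skills, Blender integration) is far richer.

```python
# Minimal sketch of a code-render-inspect loop, as described in the abstract.
# All names below are hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass, field
from typing import Any, List, Protocol, Tuple


@dataclass
class Critique:
    """Outcome of visually inspecting a render against the target image."""
    matches: bool   # True once perception confirms the symbolic program
    feedback: str   # described discrepancies that guide the next edit


@dataclass
class Memory:
    """Evolving multimodal memory of past programs, renders, and critiques."""
    steps: List[Tuple[str, Any, Critique]] = field(default_factory=list)


class Agent(Protocol):
    def propose_program(self, target: Any) -> str: ...
    def inspect(self, target: Any, render: Any, memory: Memory) -> Critique: ...
    def edit_program(self, program: str, critique: Critique) -> str: ...


class Renderer(Protocol):
    def render(self, program: str) -> Any: ...


def code_render_inspect(target: Any, agent: Agent, renderer: Renderer,
                        max_steps: int = 10) -> str:
    """Iterate code -> render -> inspect until the render matches the target."""
    memory = Memory()
    program = agent.propose_program(target)               # synthesize a symbolic program
    for _ in range(max_steps):
        render = renderer.render(program)                 # project the program to a visual state
        critique = agent.inspect(target, render, memory)  # cross-verify symbols against pixels
        memory.steps.append((program, render, critique))
        if critique.matches:                              # perception confirms the program
            break
        program = agent.edit_program(program, critique)   # evidence-based, targeted edit
    return program
```

Note how the accumulated `Memory` is passed back into each inspection step; this mirrors the abstract's claim that an evolving multimodal memory is what sustains evidence-based edits over long horizons rather than one-shot generation.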