[2603.24866] How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
Computer Science > Artificial Intelligence
arXiv:2603.24866 (cs)
[Submitted on 25 Mar 2026]

Title: How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
Authors: Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen

Abstract: The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to co...
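The abstract's central premise is that correctness in this domain is objectively verifiable: a generated structure either satisfies all four constraint categories (geometric, structural, constructability, code-compliance) or it does not. The paper's actual verifier is not shown on this page, so the sketch below is purely illustrative: every name in it (Structure, Member, the four check functions, verify_structure) is a hypothetical stand-in, and each check is a deliberately simplified toy proxy for the kind of codified rule the abstract alludes to, not the benchmark's real criteria.

```python
# Hypothetical sketch of a multi-constraint verifier in the spirit of the
# abstract; none of these names or rules come from the paper itself.
from dataclasses import dataclass, field


@dataclass
class Member:
    """A single timber member: endpoints in metres plus a cross-section label."""
    start: tuple[float, float, float]
    end: tuple[float, float, float]
    section: str  # e.g. "2x4", "2x6"


@dataclass
class Structure:
    members: list[Member] = field(default_factory=list)


def check_geometric(s: Structure) -> bool:
    """Geometric validity (toy proxy): every member has nonzero length."""
    return all(m.start != m.end for m in s.members)


def check_structural(s: Structure) -> bool:
    """Structural validity (toy proxy): at least one member touches the ground plane z = 0."""
    return any(min(m.start[2], m.end[2]) == 0.0 for m in s.members)


def check_constructability(s: Structure) -> bool:
    """Constructability (toy proxy): member count stays within a buildable budget."""
    return 0 < len(s.members) <= 10_000


def check_code_compliance(s: Structure) -> bool:
    """Code compliance (toy proxy): only sanctioned cross-sections are used."""
    allowed = {"2x4", "2x6", "2x8"}
    return all(m.section in allowed for m in s.members)


CHECKS = {
    "geometric": check_geometric,
    "structural": check_structural,
    "constructability": check_constructability,
    "code_compliance": check_code_compliance,
}


def verify_structure(s: Structure) -> dict[str, bool]:
    """Run every constraint category; a structure passes only if all four hold."""
    return {name: check(s) for name, check in CHECKS.items()}


if __name__ == "__main__":
    frame = Structure(members=[
        Member((0, 0, 0), (0, 0, 2.4), "2x4"),  # a single vertical stud
    ])
    report = verify_structure(frame)
    print(report, "PASS" if all(report.values()) else "FAIL")
```

The design point this sketch illustrates is conjunctive scoring: unlike perceptual-realism metrics, which reward partial visual plausibility, a constraint-based benchmark of this kind can require every category to pass before a generated artifact counts as correct.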