[2601.01547] Vision-language models lag human performance on physical dynamics and intent reasoning
arXiv:2601.01547 (cs)
Computer Science > Computer Vision and Pattern Recognition
[Submitted on 4 Jan 2026 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: Vision-language models lag human performance on physical dynamics and intent reasoning
Authors: Tianjun Gu, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan, Athanasios V

Abstract: Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the latent human intentions that drive spatial change. We introduce Teleo-Spatial Intelligence (TSI), a reasoning capability that links spatiotemporal change to goal-directed structure. To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set. Across 27 state-of-the-art vision-language models and an independent analysis of first-pass human responses from 11 annotators, we identify a persistent teleo-spatial reasoning gap: the strongest proprietary model achieves 57.26% overall accuracy, far below first-pa...