[2602.15918] EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
Summary
The paper presents EarthSpatialBench, a benchmark designed to evaluate spatial reasoning capabilities of multimodal large language models (MLLMs) using Earth imagery, addressing gaps in existing benchmarks.
Why It Matters
Spatial reasoning is critical for embodied AI and agentic systems that must interact precisely with the physical world, yet existing benchmarks rarely test it on Earth imagery. EarthSpatialBench fills this gap by enabling rigorous assessment of MLLMs' ability to reason over georeferenced images, a prerequisite for many real-world geospatial applications.
Key Takeaways
- EarthSpatialBench includes over 325K question-answer pairs for comprehensive spatial reasoning evaluation.
- The benchmark supports both qualitative and quantitative reasoning about spatial relationships, including distances, directions, and topological relations.
- It addresses limitations of existing benchmarks by including complex object geometries and systematic topological relations.
- Extensive experiments highlight the current limitations of MLLMs in spatial reasoning tasks.
- The benchmark is crucial for advancing embodied AI and improving interaction with physical environments.
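To make "systematic topological relations" concrete, here is a minimal illustrative sketch (not code from the paper) that classifies the topological relation between two axis-aligned bounding boxes, the simplest geometry type the benchmark goes beyond:

```python
# Illustrative sketch: classifying the topological relation between two
# axis-aligned bounding boxes given as (xmin, ymin, xmax, ymax) tuples.
# The function name and relation vocabulary are assumptions for this example,
# not the benchmark's actual taxonomy.

def bbox_relation(a, b):
    """Return 'disjoint', 'contains', 'within', or 'overlaps' for boxes a, b."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # No overlap on either axis -> the boxes are disjoint
    if ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0:
        return "disjoint"
    # a fully encloses b
    if ax0 <= bx0 and ay0 <= by0 and ax1 >= bx1 and ay1 >= by1:
        return "contains"
    # b fully encloses a
    if bx0 <= ax0 and by0 <= ay0 and bx1 >= ax1 and by1 >= ay1:
        return "within"
    return "overlaps"

print(bbox_relation((0, 0, 10, 10), (2, 2, 5, 5)))  # contains
print(bbox_relation((0, 0, 4, 4), (5, 5, 9, 9)))    # disjoint
```

For the polyline and polygon geometries the benchmark emphasizes, the same idea generalizes to richer predicate sets (e.g., DE-9IM-style relations such as touches, crosses, and overlaps), which simple bounding boxes cannot express.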
Subjects: Computer Science > Computer Vision and Pattern Recognition (cs.CV)
arXiv:2602.15918 [Submitted on 17 Feb 2026]
Authors: Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao, Zibo Liu, Shigang Chen, Da Yan, Zhe Jiang
Abstract
Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose **EarthSpatialBench**, a comprehensi...
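The abstract's notion of quantitative distance and direction reasoning over georeferenced coordinates can be sketched as follows. This is an illustrative example of the underlying geometry, not the benchmark's implementation; the function names are hypothetical, and it uses the standard haversine formula and initial-bearing formula on latitude/longitude centroids:

```python
# Hypothetical sketch of quantitative spatial reasoning between two
# georeferenced object centroids (latitude/longitude in degrees):
# great-circle distance and initial compass bearing.
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula (mean Earth radius)."""
    r = 6371.0  # km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees [0, 360)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return (math.degrees(math.atan2(y, x)) + 360) % 360

# Object B one degree of longitude due east of object A at the equator:
print(round(distance_km(0.0, 0.0, 0.0, 1.0)))  # ~111 km per degree of longitude
print(bearing_deg(0.0, 0.0, 0.0, 1.0))         # 90.0 (due east)
```

Answering a benchmark-style question such as "how far, and in which direction, is the stadium from the river crossing?" reduces to exactly this computation once both objects are grounded to georeferenced coordinates.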