[2510.16714] SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.16714 (cs)
[Submitted on 19 Oct 2025 (v1), last revised 5 Mar 2026 (this version, v3)]

Title: SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Authors: Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang

Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potenti...