[2509.24385] Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy
Computer Science > Computer Vision and Pattern Recognition
arXiv:2509.24385 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 2 Mar 2026 (this version, v4)]
Title: Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy
Authors: Haijier Chen, Bo Xu, Shoujian Zhang, Haoze Liu, Jiaxuan Lin, Jingrong Wang
Abstract: Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision-Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, geometric priors are used directly to improve scene perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction...
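The abstract does not specify the internals of the Cross-Task Adapter. As an illustration only, below is a minimal PyTorch sketch of one plausible design, assuming the adapter injects geometric prior tokens into the vision-language token stream via cross-attention with a residual connection. All module names, dimensions, and the cross-attention choice are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a Cross-Task Adapter (CTA). The paper only states that
# the CTA aligns 3D geometric priors with vision-language representations; the
# design below (cross-attention + residual) is an assumption for illustration.
import torch
import torch.nn as nn

class CrossTaskAdapter(nn.Module):
    def __init__(self, vl_dim: int = 1024, geo_dim: int = 256, n_heads: int = 8):
        super().__init__()
        # Project geometric features (e.g., depth or point tokens) into the VL space.
        self.geo_proj = nn.Linear(geo_dim, vl_dim)
        # VL tokens attend to geometric tokens to pick up 3D cues.
        self.cross_attn = nn.MultiheadAttention(vl_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vl_dim)

    def forward(self, vl_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # vl_tokens:  (B, N_vl, vl_dim)   vision-language tokens from the MLLM
        # geo_tokens: (B, N_geo, geo_dim) geometric prior tokens from reconstruction
        geo = self.geo_proj(geo_tokens)
        fused, _ = self.cross_attn(query=vl_tokens, key=geo, value=geo)
        # Residual connection keeps the adapter lightweight and easy to insert
        # between existing MLLM layers.
        return self.norm(vl_tokens + fused)

# Example usage with dummy tensors:
if __name__ == "__main__":
    cta = CrossTaskAdapter()
    vl = torch.randn(2, 196, 1024)   # e.g., 196 patch tokens per frame
    geo = torch.randn(2, 64, 256)    # e.g., 64 geometric tokens
    print(cta(vl, geo).shape)        # torch.Size([2, 196, 1024])
```

A residual adapter like this leaves the base MLLM's token shapes unchanged, which is one common way a "compact" alignment module can be slotted into a pretrained model without retraining it end to end.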