[2603.18856] Motion-o: Trajectory-Grounded Video Reasoning
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.18856 (cs)

[Submitted on 19 Mar 2026 (v1), last revised 7 May 2026 (this version, v2)]

Title: Motion-o: Trajectory-Grounded Video Reasoning
Authors: Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas

Abstract: Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding \emph{where} and \emph{when} evidence appears, they often leave the motion connecting observations, the \textit{how}, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with a Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete \texttt{<motion/>} tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary...
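The abstract's supervision recipe, centroid displacement for direction and speed plus box-area change for scale, is concrete enough to sketch. Below is a minimal illustration of how such descriptors might be quantized into a discrete \texttt{<motion/>} tag; the thresholds, direction bins, and tag attributes here are assumptions for illustration, not the paper's actual scheme.

```python
import math

# Hypothetical thresholds and labels; the paper's actual quantization
# scheme for the <motion/> tag is not specified in this abstract.
SPEED_STILL = 2.0   # px/frame below which the object counts as static
SCALE_EPS = 0.1     # relative area change below which scale is "constant"

def motion_descriptor(track):
    """Summarize a track of (x1, y1, x2, y2) boxes into coarse
    direction / speed / scale-change labels, in the spirit of the
    abstract's centroid-displacement and box-area-change descriptors."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = track[0], track[-1]
    # Centroid displacement across the track (image coords: y grows down).
    dx = (bx1 + bx2) / 2 - (ax1 + ax2) / 2
    dy = (by1 + by2) / 2 - (ay1 + ay2) / 2
    speed = math.hypot(dx, dy) / max(len(track) - 1, 1)
    if speed < SPEED_STILL:
        direction = "static"
    else:
        # Quantize the displacement angle into 8 compass-style bins.
        angle = math.degrees(math.atan2(dy, dx)) % 360
        bins = ["right", "down-right", "down", "down-left",
                "left", "up-left", "up", "up-right"]
        direction = bins[int((angle + 22.5) // 45) % 8]
    # Box-area change as a proxy for approaching/receding scale change.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    ratio = area_b / max(area_a, 1e-6)
    if ratio > 1 + SCALE_EPS:
        scale = "growing"
    elif ratio < 1 - SCALE_EPS:
        scale = "shrinking"
    else:
        scale = "constant"
    return f'<motion direction="{direction}" speed="{speed:.1f}" scale="{scale}"/>'

# Example: a track drifting right while approaching the camera.
boxes = [(10, 40, 50, 80), (30, 40, 74, 84), (50, 38, 98, 86)]
print(motion_descriptor(boxes))
# -> <motion direction="right" speed="22.0" scale="growing"/>
```

A descriptor like this is cheap to recompute from any predicted track, which is presumably what makes the MCoT trajectory claims verifiable rather than free-form text.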