[2603.04976] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.04976 (cs) [Submitted on 5 Mar 2026]

Title: 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Authors: Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, then applies reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) using strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more...
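The abstract describes rewards derived directly from evaluation metrics such as 3D IoU. As a minimal sketch of what a metric-based verifiable reward could look like (illustrative only; function names, the box encoding, and the threshold are assumptions, not the paper's actual design), one can score a predicted axis-aligned 3D bounding box against the ground truth:

```python
# Hedged sketch: a verifiable reward built from axis-aligned 3D IoU,
# in the spirit of metric-based rewards for RFT. The paper's exact
# reward functions may differ; this only illustrates the idea.

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU. Boxes are (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:          # no overlap on this axis -> empty intersection
            return 0.0
        inter *= hi - lo

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    union = volume(box_a) + volume(box_b) - inter
    return inter / union

def iou_reward(pred_box, gt_box, threshold=0.25):
    """Reward for fine-tuning: raw IoU, gated by a (hypothetical) threshold."""
    iou = iou_3d(pred_box, gt_box)
    return iou if iou >= threshold else 0.0
```

Because the reward is computed from the same quantity the benchmark measures, the training objective and the evaluation metric coincide, which is the misalignment the abstract says SFT's cross-entropy proxy suffers from.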