[2605.07575] Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2605.07575 (cs)
[Submitted on 8 May 2026]

Title: Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Authors: Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu, Qingfeng He, Ziheng Wang, Xu Wang, Qifeng Chen, Zhiwen Yu, Yunhao Liu

Abstract: Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" [decisions]. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive a...
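To make the three-stage flow in the abstract concrete, here is a toy sketch in Python. It is not the authors' implementation. It assumes scene graphs are plain sets of (subject, relation, object) triples that have already been extracted for each clip (stage 1 is stubbed out). It uses Jaccard overlap as a stand-in for semantic retrieval (stage 2). It replaces the Video-LLM trigger prompt with a simple subset check (stage 3). All names (`GraphMemory`, `decide`, `run_stream`) are hypothetical.

```python
from dataclasses import dataclass, field

# A scene graph is modeled here as a set of (subject, relation, object) triples.
# This representation is an assumption made for illustration only.


@dataclass
class GraphMemory:
    """Stores one scene graph per processed clip (hypothetical helper)."""
    graphs: list = field(default_factory=list)

    def add(self, graph):
        self.graphs.append(frozenset(graph))

    def retrieve(self, condition, k=2):
        """Return up to k stored graphs, ranked by Jaccard overlap with the condition."""
        def jaccard(g):
            union = g | condition
            return len(g & condition) / len(union) if union else 0.0
        return sorted(self.graphs, key=jaccard, reverse=True)[:k]


def decide(current, retrieved, condition):
    """Toy trigger: respond once the pooled evidence covers every condition triple.

    The paper prompts a Video-LLM at this point; a subset check stands in for it.
    """
    evidence = set(current)
    for g in retrieved:
        evidence |= g
    return "response" if condition <= evidence else "silence"


def run_stream(clips, condition, k=2):
    """Process per-clip scene graphs in order and emit one decision per clip."""
    memory = GraphMemory()
    cond = frozenset(condition)
    decisions = []
    for graph in clips:  # stage 1 (graph generation) is assumed already done
        retrieved = memory.retrieve(cond, k)             # stage 2: retrieval
        decisions.append(decide(graph, retrieved, cond))  # stage 3: trigger
        memory.add(graph)
    return decisions


# Usage: the query expects a person holding a cup that was seen on a table.
condition = {("person", "holds", "cup"), ("cup", "on", "table")}
clips = [
    {("person", "near", "table")},
    {("cup", "on", "table")},
    {("person", "holds", "cup")},
]
print(run_stream(clips, condition))  # ['silence', 'silence', 'response']
```

The point of the sketch is that the decision at each step compares accumulated evidence and the query's response condition in the same triple vocabulary, which is what makes the trigger inspectable.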