[2602.05847] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
Summary
The paper introduces OmniVideo-R1, a reinforced framework designed to enhance audio-visual reasoning through query intention and modality attention, and shows that it consistently outperforms strong baselines on multiple audio-visual benchmarks.
Why It Matters
As AI systems increasingly rely on multi-modal inputs, improving audio-visual reasoning is crucial for applications in robotics, video analysis, and human-computer interaction. OmniVideo-R1 targets the documented weaknesses of existing omnivideo models on audio-visual understanding tasks.
Key Takeaways
- OmniVideo-R1 enhances mixed-modality reasoning using query-intensive grounding.
- The framework employs modality-attentive fusion based on contrastive learning.
- Extensive experiments show OmniVideo-R1 outperforms strong baselines in audio-visual tasks.
- The model's design allows for better generalization across various benchmarks.
- Self-supervised learning paradigms are integral to its effectiveness.
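The paper does not release its implementation details here, but the idea behind modality-attentive fusion can be illustrated generically: score each modality embedding against the query, normalize the scores with a softmax, and fuse the embeddings by their attention weights. The sketch below is a minimal, hypothetical illustration of that pattern, not the authors' actual method; the function name `modality_attentive_fusion` and the toy embeddings are assumptions for exposition.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw relevance scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def modality_attentive_fusion(query, modality_embeddings):
    """Illustrative sketch (not the paper's implementation):
    weight each modality embedding by its relevance to the query,
    then return the attention-weighted sum as the fused representation."""
    weights = softmax([dot(query, emb) for emb in modality_embeddings])
    dim = len(query)
    fused = [0.0] * dim
    for w, emb in zip(weights, modality_embeddings):
        for i in range(dim):
            fused[i] += w * emb[i]
    return fused, weights

# Toy example: a query aligned with the audio embedding pulls the
# fusion toward audio, so the audio weight exceeds the video weight.
query = [1.0, 0.0]
audio = [1.0, 0.0]   # hypothetical audio-branch embedding
video = [0.0, 1.0]   # hypothetical video-branch embedding
fused, weights = modality_attentive_fusion(query, [audio, video])
```

In this toy setup the query-aligned audio modality receives the larger attention weight, which is the qualitative behavior "modality-attentive fusion" names; a real system would learn the scoring function (e.g. via contrastive objectives, as the paper states) rather than use a raw dot product.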
Computer Science > Artificial Intelligence
arXiv:2602.05847 (cs)
[Submitted on 5 Feb 2026 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
Abstract: While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)