[2412.20816] MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval
Summary
The paper presents MomentMix, a data augmentation technique, together with a Length-Aware Decoder for DETR-based models. The two are aimed at improving video moment retrieval, particularly for short moments, and the combined method is reported to outperform existing DETR-based models on benchmark datasets.
Why It Matters
As platforms like YouTube are increasingly used for information retrieval, demand for moment retrieval techniques is growing. This research targets short moments, which recent DETR-based models still struggle to localize accurately, with the aim of improving the accuracy of video retrieval systems.
Key Takeaways
- MomentMix generates new short-moment samples with two augmentation strategies, ForegroundMix and BackgroundMix.
- The Length-Aware Decoder improves localization accuracy for short moments.
- The proposed method outperforms existing DETR-based models on key benchmarks.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2412.20816 (cs)
[Submitted on 30 Dec 2024 (v1), last revised 26 Feb 2026 (this version, v3)]
Title: MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval
Authors: Seojeong Park, Jiho Choi, Kyungjune Baek, Hyunjung Shim
Abstract: Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is significantly growing. Recent DETR-based models have made notable advances in performance but still struggle with accurately localizing short moments. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix generates new short-moment samples by employing two augmentation strategies: ForegroundMix and BackgroundMix, each enhancing the ability to understand the query-relevant and irrelevant frames, respectively. Additionally, our analysis of prediction bias revealed that short moments particularly struggle with accurately predicting their center positions and length of moments. To address this, we propose a Length-Aware Decoder, which conditions length through a novel bipartite matching...
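The abstract names ForegroundMix and BackgroundMix but does not spell out how they work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a video is a sequence of clip-level feature vectors and a moment is a (start, end) span of clip indices with an exclusive end. The function names `foreground_mix` and `background_mix`, their parameters, and the toy data are all hypothetical.

```python
import numpy as np

def foreground_mix(feats, span, num_pieces, rng):
    """Hypothetical sketch: cut one long moment into short pieces and
    interleave them with background clips of the same video.

    feats: (num_clips, dim) array; span: (start, end), end exclusive.
    Returns the re-ordered features and one short span per piece.
    """
    start, end = span
    fg_pieces = np.array_split(feats[start:end], num_pieces)
    bg = np.concatenate([feats[:start], feats[end:]], axis=0)
    bg_pieces = np.array_split(bg, num_pieces + 1)
    order = rng.permutation(num_pieces)  # shuffle which piece lands in which slot
    chunks, spans, cursor = [], [], 0
    for slot, piece_id in enumerate(order):
        chunks.append(bg_pieces[slot])
        cursor += len(bg_pieces[slot])
        piece = fg_pieces[piece_id]
        chunks.append(piece)
        spans.append((cursor, cursor + len(piece)))  # new short-moment label
        cursor += len(piece)
    chunks.append(bg_pieces[-1])
    return np.concatenate(chunks, axis=0), spans

def background_mix(feats, span, donor_feats, rng):
    """Hypothetical sketch: keep the query-relevant clips and replace every
    background clip with a randomly chosen clip from another (donor) video."""
    start, end = span
    mask = np.ones(feats.shape[0], dtype=bool)
    mask[start:end] = False  # True marks background clips
    donor_idx = rng.integers(0, donor_feats.shape[0], size=int(mask.sum()))
    out = feats.copy()
    out[mask] = donor_feats[donor_idx]
    return out, (start, end)

# Toy data: 10 clips with 2-dim features, ground-truth moment at clips 2..7.
rng = np.random.default_rng(0)
video = np.arange(20, dtype=float).reshape(10, 2)
donor = -np.ones((5, 2))
fg_feats, short_spans = foreground_mix(video, (2, 8), num_pieces=3, rng=rng)
bg_feats, kept_span = background_mix(video, (2, 8), donor, rng)
```

In this reading, the first function turns one long moment into several short labeled moments, while the second leaves the moment intact and varies only the query-irrelevant context. How the paper actually selects segments, donor videos, and labels may differ.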