[2601.16333] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
Computer Science > Computer Vision and Pattern Recognition
arXiv:2601.16333 (cs)
[Submitted on 22 Jan 2026 (v1), last revised 5 Mar 2026 (this version, v2)]

Title: Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
Authors: Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle

Abstract: Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging the human preferences for importance that are implicit in football game highlight reels, incurring no additional annotation costs. Using our dataset, we compare several state-of-the-art multimodal models and show that their performance is not far from chance level. Analyses of the models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness at synthesizing n...