[2604.08762] InstrAct: Towards Action-Centric Understanding in Instructional Videos
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.08762 (cs)
[Submitted on 9 Apr 2026]

Title: InstrAct: Towards Action-Centric Understanding in Instructional Videos

Authors: Zhuoyi Yang, Jiapeng Yu, Reuben Tan, Boyang Li, Huijuan Xu

Abstract: Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos' action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method cons...