[2508.09428] What-Meets-Where: Unified Learning of Action and Contact Localization in Images
Computer Science > Computer Vision and Pattern Recognition

arXiv:2508.09428 (cs)

[Submitted on 13 Aug 2025 (v1), last revised 28 Mar 2026 (this version, v2)]

Title: What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Authors: Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu

Abstract: People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider \textbf{what} action is occurring and \textbf{where} it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interact...
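
The abstract only names PaIR-Net's three modules and their roles; it does not specify their internals. The following is a minimal, hypothetical PyTorch sketch of how such a what-meets-where pipeline could be wired together: CPAM produces per-part contact priors, PGCS uses those priors to condition pixel-wise contact segmentation, and IIM classifies the global action. All module internals, dimensions, and the stand-in backbone here are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CPAM(nn.Module):
    """Contact Prior Aware Module (sketch): scores contact-relevant body parts."""

    def __init__(self, feat_dim: int, num_parts: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, num_parts)
        )

    def forward(self, img_feat):  # img_feat: (B, feat_dim) pooled image feature
        # Per-part probability that the part is involved in contact.
        return torch.sigmoid(self.score(img_feat))  # (B, num_parts)


class PGCS(nn.Module):
    """Prior-Guided Concat Segmenter (sketch): per-pixel contact masks,
    conditioned on the part priors by channel-wise concatenation."""

    def __init__(self, feat_dim: int, num_parts: int):
        super().__init__()
        self.head = nn.Conv2d(feat_dim + num_parts, num_parts, kernel_size=1)

    def forward(self, fmap, part_prior):
        # fmap: (B, feat_dim, H, W); part_prior: (B, num_parts)
        B, _, H, W = fmap.shape
        prior_map = part_prior[:, :, None, None].expand(-1, -1, H, W)
        return self.head(torch.cat([fmap, prior_map], dim=1))  # (B, num_parts, H, W) logits


class IIM(nn.Module):
    """Interaction Inference Module (sketch): global action classification."""

    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_actions)

    def forward(self, img_feat):
        return self.cls(img_feat)  # (B, num_actions) action logits


class PaIRNetSketch(nn.Module):
    """Hypothetical composition of the three modules; the backbone is a placeholder."""

    def __init__(self, feat_dim: int = 256, num_parts: int = 17, num_actions: int = 80):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)  # stand-in
        self.cpam = CPAM(feat_dim, num_parts)
        self.pgcs = PGCS(feat_dim, num_parts)
        self.iim = IIM(feat_dim, num_actions)

    def forward(self, img):  # img: (B, 3, H, W)
        fmap = self.backbone(img)
        pooled = fmap.mean(dim=(2, 3))       # global image feature
        prior = self.cpam(pooled)            # which parts make contact
        masks = self.pgcs(fmap, prior)       # where the contact is, per pixel
        action = self.iim(pooled)            # what action is occurring
        return action, masks


if __name__ == "__main__":
    model = PaIRNetSketch()
    action, masks = model(torch.randn(2, 3, 64, 64))
    print(action.shape, masks.shape)  # torch.Size([2, 80]) torch.Size([2, 17, 64, 64])
```

The sketch mirrors the abstract's split of the problem: "what" comes from a global classification head, while "where" comes from a dense segmentation head that is explicitly conditioned on the contact priors, so part relevance and pixel-level contact are learned jointly rather than in isolation.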