[2602.13640] Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation
Summary
This article presents a novel hierarchical framework for integrating audio, visual, and proprioceptive data to enhance robotic manipulation, addressing the difficulty existing vision- and proprioception-based methods have in inferring contact-related states in partially observable real-world scenarios.
Why It Matters
The research highlights the importance of acoustic cues, which are often overlooked in robotic manipulation. By proposing a hierarchical fusion approach, the study aims to improve task performance in environments where visual data may be insufficient, thus advancing the field of robotics and multimodal learning.
Key Takeaways
- Introduces a hierarchical representation fusion framework for robotic manipulation.
- Demonstrates the effectiveness of integrating audio cues with visual and proprioceptive data.
- Shows superior performance on tasks such as liquid pouring and cabinet opening compared with existing methods.
- Highlights the role of acoustic information in enhancing multimodal interactions.
- Provides a mutual information analysis to interpret the impact of audio cues (see the sketch after this list).
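To make the mutual-information analysis concrete, here is a minimal toy sketch. It assumes scikit-learn's nonparametric estimator and synthetic stand-in features with a hypothetical "contact occurred" label; it is an illustration of the general idea, not the paper's actual procedure.

```python
# Toy mutual-information check: are audio features more informative about a
# contact-related label than visual features? All data here is synthetic.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: a binary "contact occurred" label, audio features that
# partially encode it, and visual features that are independent of it.
contact = rng.integers(0, 2, size=n)
audio_feats = contact[:, None] * 0.8 + rng.normal(size=(n, 8)) * 0.5
visual_feats = rng.normal(size=(n, 8))

mi_audio = mutual_info_classif(audio_feats, contact, random_state=0)
mi_visual = mutual_info_classif(visual_feats, contact, random_state=0)
print(f"mean MI(audio; contact)  = {mi_audio.mean():.3f} nats")
print(f"mean MI(visual; contact) = {mi_visual.mean():.3f} nats")
# Higher MI for the audio features would support the claim that acoustic cues
# carry contact information that the other modalities lack.
```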
Computer Science > Robotics
arXiv:2602.13640 (cs)
[Submitted on 14 Feb 2026]
Title: Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation
Authors: Siyuan Li, Jiani Lu, Yu Song, Xianren Li, Bo An, Peng Liu
Abstract: Existing robotic manipulation methods primarily rely on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet they remain underexploited in the current multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities, and thus design flat and symmetric fusion structures. However, this assumption is ill-suited for acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic cues, and then explicitly models higher-order cross-modal interactions to capture complementary dependencies among modalities. The fused representation is leveraged by a diffusion-base...
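The two-stage design described in the abstract (unimodal representations conditioned on audio, followed by explicit cross-modal interaction) can be sketched in PyTorch. The paper's implementation is not public here, so the FiLM-style conditioning, the attention layer, and all module names and dimensions below are illustrative assumptions rather than the authors' architecture.

```python
# A minimal sketch of hierarchical audio-conditioned fusion (not the authors'
# code): audio first modulates visual and proprioceptive features, then
# attention over modality tokens models higher-order cross-modal interactions.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Scale-and-shift conditioning of a feature vector on acoustic cues."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma * feat + beta

class HierarchicalFusion(nn.Module):
    def __init__(self, audio_dim=128, vis_dim=256, prop_dim=64, d_model=256):
        super().__init__()
        # Stage 1: condition vision and proprioception on audio.
        self.film_vis = FiLM(audio_dim, vis_dim)
        self.film_prop = FiLM(audio_dim, prop_dim)
        self.proj_vis = nn.Linear(vis_dim, d_model)
        self.proj_prop = nn.Linear(prop_dim, d_model)
        self.proj_audio = nn.Linear(audio_dim, d_model)
        # Stage 2: model higher-order interactions across modality tokens.
        self.cross = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, audio, vision, proprio):
        # audio: (B, audio_dim), vision: (B, vis_dim), proprio: (B, prop_dim)
        v = self.proj_vis(self.film_vis(vision, audio))
        p = self.proj_prop(self.film_prop(proprio, audio))
        a = self.proj_audio(audio)
        tokens = torch.stack([a, v, p], dim=1)          # (B, 3, d_model)
        fused, _ = self.cross(tokens, tokens, tokens)   # attention over modalities
        # Pool to one vector, e.g. as conditioning for a diffusion policy head.
        return self.out(fused.mean(dim=1))

fusion = HierarchicalFusion()
z = fusion(torch.randn(2, 128), torch.randn(2, 256), torch.randn(2, 64))
print(z.shape)  # torch.Size([2, 256])
```

The asymmetry is the point: audio acts as a conditioning signal before any symmetric fusion happens, reflecting the abstract's argument that sparse, contact-driven acoustic signals should not be treated as just another flat input stream.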