[2502.05435] Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Summary
This paper presents the unbiased sliced Wasserstein RBF (USW-RBF) kernel, which targets two weaknesses of audio captioning systems: exposure bias introduced by teacher-forcing training, and poor temporal alignment between audio and text.
Why It Matters
Audio captioning underpins applications in accessibility and content generation. This research addresses two limitations of existing methods, exposure bias and weak temporal alignment, with the aim of producing higher-quality and more accurate audio descriptions.
Key Takeaways
- Introduces the USW-RBF kernel to mitigate exposure bias in audio captioning.
- Enhances temporal alignment between acoustic and linguistic modalities.
- Demonstrates improved caption quality and lexical diversity through extensive experiments.
- Shows generalizability of the kernel in audio reasoning tasks.
- Improves reasoning accuracy in benchmarks by 4%.
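As a rough illustration of the kernel named in the takeaways above, the sketch below estimates a sliced Wasserstein RBF kernel between two embedding sequences by Monte Carlo. It assumes the kernel has the form E_theta[exp(-gamma * W_2^2(theta#mu, theta#nu))], which this summary does not spell out. The function name `usw_rbf_kernel`, the parameters `gamma` and `n_projections`, and the restriction to equal-length, uniformly weighted inputs are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def usw_rbf_kernel(x, y, gamma=1.0, n_projections=64, rng=None):
    """Monte Carlo estimate of E_theta[exp(-gamma * W_2^2(theta#x, theta#y))].

    x, y: arrays of shape (n, d), treated as uniform empirical measures.
    Assumption: both inputs have the same number of points, so each
    one-dimensional Wasserstein distance reduces to sorting.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes x and y have the same shape")
    rng = np.random.default_rng(rng)
    d = x.shape[1]

    # Sample projection directions uniformly on the unit sphere.
    theta = rng.standard_normal((n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project both point sets onto every direction and sort each projection.
    px = np.sort(x @ theta.T, axis=0)  # shape (n, n_projections)
    py = np.sort(y @ theta.T, axis=0)

    # Squared 1D Wasserstein-2 distance per direction (sorted matching).
    w2 = np.mean((px - py) ** 2, axis=0)  # shape (n_projections,)

    # Average the exponentials, not the distances.
    return float(np.mean(np.exp(-gamma * w2)))
```

Averaging the exponentials over directions gives an unbiased estimate of the assumed expectation, whereas exponentiating an averaged distance would not (by Jensen's inequality). That is one plausible reading of "unbiased" here, and of why the kernel would suit stochastic gradient optimization.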
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2502.05435 (eess)
[Submitted on 8 Feb 2025 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Authors: Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu
Abstract: Audio captioning systems face a fundamental challenge: teacher-forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real-world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and ...
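The abstract pairs the kernel with rotary positional embedding to preserve temporal information. The following is a minimal sketch of the standard rotary scheme applied to a sequence of frame or token embeddings. The function name `apply_rope`, the `base` value of 10000, and the interleaved pairing of dimensions are assumptions for illustration; the abstract does not say how the paper configures it.

```python
import numpy as np


def apply_rope(x, base=10000.0):
    """Rotate consecutive pairs of feature dimensions by position-dependent angles.

    x: array of shape (n, d) with d even; row m is the embedding at position m.
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")

    # One rotation frequency per pair of dimensions.
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # shape (d/2,)
    angles = np.arange(n)[:, None] * inv_freq[None]  # shape (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each pair is only rotated, vector norms are unchanged, and the inner product between two rotated vectors depends on their position offset rather than their absolute positions. That property is what would let position information carry through to a comparison between the two modalities.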