
[2502.05435] Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

arXiv - Machine Learning, February 27, 2026

Summary

This paper presents the Unbiased Sliced Wasserstein RBF kernel, a novel approach for enhancing audio captioning systems by addressing exposure bias and improving temporal alignment between audio and text.

Why It Matters

Audio captioning matters for accessibility and content generation. This research targets two limitations of existing methods: exposure bias introduced by teacher-forcing training, and contrastive objectives that miss the temporal relationship between audio and text. Addressing both is intended to yield more accurate and more varied audio descriptions.

Key Takeaways

Introduces the USW-RBF kernel to mitigate exposure bias in audio captioning.
Enhances temporal alignment between acoustic and linguistic modalities.
Demonstrates improved caption quality and lexical diversity through extensive experiments.
Shows generalizability of the kernel in audio reasoning tasks.
Improves reasoning accuracy in benchmarks by 4%.

Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2502.05435 (eess)
Submitted on 8 Feb 2025 (v1), last revised 26 Feb 2026 (this version, v2)

Title: Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Authors: Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu

Abstract: Audio captioning systems face a fundamental challenge: teacher-forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real-world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and ...
