[2502.05435] Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning



Summary

This paper presents the Unbiased Sliced Wasserstein RBF kernel, a novel approach for enhancing audio captioning systems by addressing exposure bias and improving temporal alignment between audio and text.

Why It Matters

Effective audio captioning matters for accessibility and content generation. This research targets two limitations of existing methods: exposure bias introduced by teacher-forcing training, and weak temporal alignment between audio and text. Addressing both is intended to yield more accurate and more descriptive audio captions.

Key Takeaways

  • Introduces the USW-RBF kernel to mitigate exposure bias in audio captioning.
  • Enhances temporal alignment between acoustic and linguistic modalities.
  • Demonstrates improved caption quality and lexical diversity through extensive experiments.
  • Shows generalizability of the kernel in audio reasoning tasks.
  • Improves reasoning accuracy in benchmarks by 4%.

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2502.05435 (eess)

[Submitted on 8 Feb 2025 (v1), last revised 26 Feb 2026 (this version, v2)]

Title: Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Authors: Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu

Abstract: Audio captioning systems face a fundamental challenge: teacher-forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real-world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and ...
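The abstract does not spell out the kernel's formula, so the following is only a rough sketch of how a sliced Wasserstein RBF kernel might be estimated by Monte Carlo. It assumes the kernel is an expectation over random unit directions of exp(-gamma * W_p^p) between the projected point sets, which is why averaging the exponentials over sampled directions gives an unbiased estimate of that expectation. The function name `sliced_wasserstein_rbf` and the parameters `n_projections`, `gamma`, `p`, and `seed` are hypothetical, both inputs are assumed to have the same number of equally weighted points, and the rotary positional embedding described in the abstract is omitted.

```python
import numpy as np


def sliced_wasserstein_rbf(x, y, n_projections=64, gamma=1.0, p=2, seed=0):
    """Monte Carlo estimate of E_theta[exp(-gamma * W_p^p(theta#x, theta#y))].

    Illustrative sketch only; the kernel form is an assumption, not the
    paper's confirmed definition. x and y are (n, d) arrays of equally
    weighted points with the same n.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("x and y must both have shape (n, d)")

    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Sample directions uniformly on the unit sphere by normalizing Gaussians.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project both point sets onto every direction: shape (n_projections, n).
    px = np.sort(theta @ x.T, axis=1)
    py = np.sort(theta @ y.T, axis=1)

    # For equal-size 1-D empirical measures, W_p^p is the mean p-th power
    # gap between sorted samples.
    w_pp = np.mean(np.abs(px - py) ** p, axis=1)

    # Average the RBF values per projection (not the distances), so the
    # estimator is unbiased for the expectation over directions.
    return float(np.mean(np.exp(-gamma * w_pp)))
```

By contrast, exponentiating the averaged distance, `np.exp(-gamma * w_pp.mean())`, would in general be a biased estimate of the exponential of the expected sliced distance, because the exponential is nonlinear.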
