[2509.14659] Aligning Audio Captions with Human Preferences
Summary
The paper presents a framework for audio captioning that aligns generated captions with human preferences via Reinforcement Learning from Human Feedback (RLHF), producing captions that human evaluators prefer over baseline models while matching the quality of supervised approaches.
Why It Matters
This research addresses a core limitation of current audio captioning systems: their reliance on costly, manually curated paired audio-caption data. By aligning captions with human preferences instead, the proposed framework improves the quality and relevance of audio descriptions, which matters for applications in accessibility and content creation.
Key Takeaways
- Introduces a preference-aligned audio captioning framework using RLHF.
- Utilizes a Contrastive Language-Audio Pretraining (CLAP) model for reward assessment.
- Demonstrates improved caption quality through extensive human evaluations.
- Achieves performance comparable to supervised methods without requiring ground-truth data.
- Offers scalability for real-world applications in audio processing.
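The CLAP-based reward model in the takeaways above is trained on human-labeled pairwise preferences. A standard objective for that setup is the Bradley-Terry pairwise loss, sketched minimally below; the function name and the use of raw similarity scores are illustrative assumptions, not the paper's exact formulation.

```python
import math

def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_preferred - r_rejected).

    r_preferred / r_rejected stand in for reward-model scores (e.g. CLAP
    audio-text similarities) of the human-preferred and rejected captions.
    The loss pushes the model to score the preferred caption higher.
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the preferred caption's score pulls ahead:
losses = [bradley_terry_loss(m, 0.0) for m in (0.0, 1.0, 2.0)]
```

When the two captions score equally (margin 0), the loss is ln 2 ≈ 0.693; it decreases monotonically as the margin grows.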
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2509.14659 (eess)
[Submitted on 18 Sep 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Authors: Kartik Hegde, Rehana Mahfuz, Yinyi Guo, Erik Visser
Abstract
Current audio captioning relies on supervised learning with paired audio-caption data, which is costly to curate and may not reflect human preferences in real-world scenarios. To address this, we propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback (RLHF). To capture nuanced preferences, we train a Contrastive Language-Audio Pretraining (CLAP) based reward model using human-labeled pairwise preference data. This reward model is integrated into an RL framework to fine-tune any baseline captioning system without ground-truth annotations. Extensive human evaluations across multiple datasets show that our method produces captions preferred over baseline models, particularly when baselines fail to provide correct and natural captions. Furthermore, our framework achieves performance comparable to supervised approaches with ground-truth data, demonstrating effective alignment with human preferences and scalability in real-world use.
Subjects: Audio and Speech Processing (eess.AS)
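The RL fine-tuning step described in the abstract — updating a captioning policy using only reward-model scores, with no ground-truth captions — can be illustrated with a toy REINFORCE update on a categorical policy. This is a sketch only: the paper does not specify its policy-gradient algorithm here, and the learning rate, baseline, and categorical toy policy are assumptions.

```python
import math

def reinforce_step(probs, action, reward, baseline, lr=0.1):
    """One REINFORCE update on a categorical policy.

    probs: current action probabilities (here standing in for the
    caption model's token distribution); action: the sampled choice;
    reward: the reward-model score; baseline: variance-reduction term.
    Uses d log pi(a) / d logit_i = 1[i == a] - probs[i].
    """
    advantage = reward - baseline
    logits = [math.log(p) for p in probs]
    new_logits = [
        l + lr * advantage * ((1.0 if i == action else 0.0) - probs[i])
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
    z = sum(math.exp(l) for l in new_logits)  # renormalize via softmax
    return [math.exp(l) / z for l in new_logits]

# A positively rewarded action becomes more probable after the update:
updated = reinforce_step([0.5, 0.5], action=0, reward=1.0, baseline=0.0)
```

In the paper's setting, the reward would come from the CLAP-based reward model scoring a sampled caption against the input audio, so no reference caption is ever needed.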