[2509.14659] Aligning Audio Captions with Human Preferences

arXiv - Machine Learning

Summary

The paper presents a novel framework for audio captioning that aligns captions with human preferences using Reinforcement Learning from Human Feedback (RLHF), demonstrating improved performance over traditional methods.

Why It Matters

This research addresses the limitations of current audio captioning systems that rely on costly supervised learning. By aligning captions with human preferences, the proposed framework enhances the quality and relevance of audio descriptions, making it significant for applications in accessibility and content creation.

Key Takeaways

  • Introduces a preference-aligned audio captioning framework using RLHF.
  • Trains a Contrastive Language-Audio Pretraining (CLAP) based reward model on human-labeled pairwise preference data.
  • Demonstrates improved caption quality through extensive human evaluations.
  • Achieves performance comparable to supervised methods without requiring ground-truth data.
  • Offers scalability for real-world applications in audio processing.
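The reward-model idea in the takeaways above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins for real CLAP audio/text outputs, and the function names are hypothetical. The reward is a cosine score between audio and caption embeddings, and the pairwise preference data is fit with a Bradley-Terry style loss.

```python
import numpy as np

def cosine_reward(audio_emb, caption_emb):
    """Score a caption by cosine similarity between CLAP-style
    audio and text embeddings (random stand-ins here)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = caption_emb / np.linalg.norm(caption_emb)
    return float(a @ t)

def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss on one human preference pair:
    -log sigmoid(r_chosen - r_rejected)."""
    return float(-np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected)))))

rng = np.random.default_rng(0)
audio = rng.normal(size=512)                 # placeholder audio embedding
good = audio + 0.1 * rng.normal(size=512)    # caption embedding near the audio
bad = rng.normal(size=512)                   # unrelated caption embedding

r_good = cosine_reward(audio, good)
r_bad = cosine_reward(audio, bad)
loss = pairwise_preference_loss(r_good, r_bad)
print(r_good > r_bad, loss > 0)
```

Minimizing this loss over many labeled pairs pushes the reward model to score human-preferred captions higher, which is what lets the RL stage run without ground-truth captions.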

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2509.14659 (eess) [Submitted on 18 Sep 2025 (v1), last revised 25 Feb 2026 (this version, v2)]

Title: Aligning Audio Captions with Human Preferences

Authors: Kartik Hegde, Rehana Mahfuz, Yinyi Guo, Erik Visser

Abstract: Current audio captioning relies on supervised learning with paired audio-caption data, which is costly to curate and may not reflect human preferences in real-world scenarios. To address this, we propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback (RLHF). To capture nuanced preferences, we train a Contrastive Language-Audio Pretraining (CLAP) based reward model using human-labeled pairwise preference data. This reward model is integrated into an RL framework to fine-tune any baseline captioning system without ground-truth annotations. Extensive human evaluations across multiple datasets show that our method produces captions preferred over baseline models, particularly when baselines fail to provide correct and natural captions. Furthermore, our framework achieves performance comparable to supervised approaches with ground-truth data, demonstrating effective alignment with human preferences and scalability in real-world use.

Subjects: Audio and Speech Processing (eess.AS); Mac...
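The second stage the abstract describes, fine-tuning a captioner against the learned reward, can be illustrated with a toy REINFORCE loop. This is a deliberately simplified sketch, not the paper's RL setup: caption generation is reduced to a bandit choice among three fixed candidates, and the `rewards` array stands in for scores a trained CLAP-based reward model would produce.

```python
import numpy as np

# Toy REINFORCE loop: pick among candidate captions; the reward
# scores are illustrative stand-ins for a CLAP-based reward model.
rewards = np.array([0.9, 0.2, 0.1])  # reward-model score per candidate
logits = np.zeros(3)                 # policy parameters
rng = np.random.default_rng(0)
lr = 0.5

for _ in range(200):
    p = np.exp(logits) / np.exp(logits).sum()  # softmax policy
    a = rng.choice(3, p=p)                     # sample a caption
    baseline = p @ rewards                     # variance-reducing baseline
    grad = -p
    grad[a] += 1.0                             # d log p(a) / d logits
    logits += lr * (rewards[a] - baseline) * grad

p = np.exp(logits) / np.exp(logits).sum()
print(int(p.argmax()))
```

After training, the policy concentrates its probability mass on the candidate the reward model scores highest, which mirrors how RLHF steers a captioner toward human-preferred outputs without any ground-truth captions.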

