[2602.14612] LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio
Summary
The paper presents LongAudio-RAG, a framework for event-grounded question answering over multi-hour audio recordings that improves accuracy by grounding answers in structured, timestamped event records rather than raw audio.
Why It Matters
As audio data becomes increasingly prevalent, effective tools for processing and querying long audio streams are essential. LongAudio-RAG addresses the challenges of context-length limits in existing models, offering a practical solution that combines edge and cloud computing for improved performance.
Key Takeaways
- LongAudio-RAG improves question answering accuracy by grounding LLM outputs in timestamped event detections.
- The framework converts multi-hour audio into structured event records, facilitating efficient query resolution.
- It demonstrates enhanced performance over traditional RAG and text-to-SQL methods.
- The hybrid edge-cloud architecture allows for low-latency processing and high-quality reasoning.
- A synthetic benchmark was created to evaluate the system's effectiveness in various tasks.
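The core idea in the takeaways above — turning a long audio stream into timestamped event records in a SQL database and retrieving only the events a query needs — can be sketched as follows. The schema, field names, and detector output here are illustrative assumptions, not the authors' exact design.

```python
import sqlite3

# Minimal sketch of an event-record store: each row is one timestamped
# acoustic event detection. Schema is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           id INTEGER PRIMARY KEY,
           label TEXT,          -- detected acoustic event, e.g. 'dog_bark'
           start_s REAL,        -- onset, seconds from recording start
           end_s REAL,          -- offset, seconds from recording start
           confidence REAL      -- detector confidence score
       )"""
)

# Stand-in for the output of an audio event detector run over a
# multi-hour stream (fabricated values for illustration).
detections = [
    ("dog_bark", 120.5, 121.9, 0.91),
    ("siren", 3600.0, 3612.4, 0.87),
    ("dog_bark", 7205.2, 7206.0, 0.78),
]
conn.executemany(
    "INSERT INTO events (label, start_s, end_s, confidence) VALUES (?, ?, ?, ?)",
    detections,
)

def retrieve(label: str, t0: float, t1: float):
    """Fetch only the events matching a query's label and time window."""
    return conn.execute(
        "SELECT label, start_s, end_s FROM events "
        "WHERE label = ? AND start_s >= ? AND end_s <= ? ORDER BY start_s",
        (label, t0, t1),
    ).fetchall()

# "How many dog barks in the first hour?" -> count the retrieved events.
print(len(retrieve("dog_bark", 0.0, 3600.0)))  # -> 1
```

Because the LLM only ever sees the retrieved rows (constrained evidence), its answer is grounded in detections with explicit timestamps rather than in a long raw-audio context.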
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2602.14612 (eess) [Submitted on 16 Feb 2026]
Title: LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio
Authors: Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
Abstract: Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, cou...
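The abstract's inference-time steps — resolving a natural-language time reference and classifying query intent before retrieval — can be illustrated with a toy rule-based version. The paper likely uses an LLM for these steps; the regex rules, intent labels, and assumed recording start time below are purely hypothetical stand-ins.

```python
import re
from datetime import datetime, timedelta

# Assumed start time of the multi-hour stream (illustrative only).
RECORDING_START = datetime(2026, 2, 16, 9, 0)

def resolve_time(query: str, now: datetime):
    """Map a time phrase to a (start_s, end_s) window on the stream.

    Only handles 'last N minutes' here; a real resolver would cover
    absolute clock times, dates, and relative ranges.
    """
    m = re.search(r"last (\d+) minutes", query.lower())
    if m:
        t1 = (now - RECORDING_START).total_seconds()
        t0 = max(t1 - int(m.group(1)) * 60, 0.0)
        return t0, t1
    # Default: the whole recording so far.
    return 0.0, (now - RECORDING_START).total_seconds()

def classify_intent(query: str) -> str:
    """Crude keyword-based intent classifier (count / detect / summary)."""
    q = query.lower()
    if "how many" in q or "count" in q:
        return "count"
    if q.startswith("did") or "was there" in q:
        return "detect"
    return "summary"

# Example: a query issued two hours into the recording.
now = RECORDING_START + timedelta(hours=2)
query = "How many sirens in the last 30 minutes?"
print(resolve_time(query, now))   # -> (5400.0, 7200.0)
print(classify_intent(query))     # -> count
```

The resolved window and intent then parameterize a SQL query over the event database, so only the relevant rows ever reach the LLM.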