[2602.23070] Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Summary
This paper tackles long-form Bengali Automatic Speech Recognition (ASR) and speaker diarization, introducing the 882-hour Lipi-Ghor-882 dataset and showing that targeted fine-tuning with aligned annotations and synthetic acoustic degradation outperforms raw data scaling.
Why It Matters
As Bengali ASR and speaker diarization are under-researched, this study addresses critical gaps by providing a large dataset and demonstrating effective methods for processing long-duration audio. This work could significantly enhance speech technology for Bengali speakers and contribute to advancements in low-resource language processing.
Key Takeaways
- Introduces Lipi-Ghor-882, an 882-hour dataset for Bengali ASR.
- Highlights the ineffectiveness of raw data scaling for ASR improvement.
- Demonstrates that targeted fine-tuning with aligned annotations is crucial.
- Finds that heuristic post-processing is more effective than model retraining for diarization.
- Establishes a benchmark for low-resource, long-form speech processing with a Real-Time Factor of ~0.019.
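For context on the last takeaway, Real-Time Factor (RTF) is the ratio of processing time to audio duration; values below 1.0 mean faster-than-real-time processing. A minimal sketch (the function name and the 68.4 s example timing are illustrative, not from the paper):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio.
    Values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# At an RTF of ~0.019, a one-hour recording (3600 s) is processed
# in roughly 3600 * 0.019 ≈ 68 seconds.
print(real_time_factor(68.4, 3600.0))  # ≈ 0.019
```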
Computer Science > Sound
arXiv:2602.23070 (cs) [Submitted on 26 Feb 2026]
Authors: Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan
Abstract
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the singular most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded n...
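The abstract credits synthetic acoustic degradation (noise and reverberation) as the key ASR ingredient, but this excerpt does not specify the paper's pipeline. Below is a minimal, generic NumPy sketch of the two degradations under common assumptions: additive white noise mixed at a target SNR, and reverberation via convolution with a synthetic exponentially decaying impulse response. All function names and parameters (`snr_db`, `rt60_s`) are illustrative, not the authors' implementation.

```python
import numpy as np

def add_noise(speech: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Mix white Gaussian noise into the signal at a target SNR in dB."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so speech_power / noise_power hits the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech: np.ndarray, rt60_s: float = 0.4,
               sr: int = 16000, rng=None) -> np.ndarray:
    """Convolve with a synthetic impulse response that decays ~60 dB over rt60_s."""
    rng = rng or np.random.default_rng(1)
    n = int(rt60_s * sr)
    decay = np.exp(-6.9 * np.arange(n) / n)  # ln(10**3) ≈ 6.9 → 60 dB decay
    ir = decay * rng.standard_normal(n)
    ir /= np.max(np.abs(ir))
    wet = np.convolve(speech, ir)[: len(speech)]
    return wet / np.max(np.abs(wet))

# Stand-in for a speech waveform: one second of a 220 Hz tone at 16 kHz.
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
degraded = add_reverb(add_noise(clean, snr_db=5.0))
print(degraded.shape)  # (16000,)
```

In practice such augmentations are applied on the fly during fine-tuning, which is one way to "make it hard to hear" at training time while keeping the aligned transcripts clean.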