[2602.21741] Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
Summary
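This paper presents an end-to-end system for Bangla long-form speech recognition and speaker diarization, submitted to the DL Sprint 4.0 competition on Kaggle. For ASR it combines a BengaliAI fine-tuned Whisper medium model with Demucs vocal separation and silence-boundary chunking; for diarization it swaps in a Bengali-fine-tuned segmentation model.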
Why It Matters
The research addresses the difficulty of processing Bangla speech: the language has a large phoneme inventory, significant dialectal variation, and frequent code-mixing with English, and large-scale labelled corpora are relatively scarce. By improving automatic speech recognition (ASR) and speaker diarization for a low-resource language, this work supports more inclusive speech technology and better accessibility for Bengali speakers.
Key Takeaways
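- Achieved a best public Word Error Rate (WER) of 0.36137 and private WER of 0.37738 for Bangla ASR.
- Reached a best public Diarization Error Rate (DER) of 0.20936 and private DER of 0.27671 for speaker diarization.
- Used Demucs source separation for vocal isolation and silence-boundary chunking for long-form audio.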
- Fine-tuning domain-specific models significantly improved performance.
- Addressed challenges of dialectal variation and code-mixing in Bangla.
- Identified the relative scarcity of large-scale labelled corpora as a core constraint for Bangla speech tasks.
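For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the competition's actual text normalization and scoring script are not described in the source.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no further
    normalization (punctuation, casing, Unicode forms are left as-is).
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition a WER of 0.36137 corresponds to roughly 36 word-level errors per 100 reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.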
arXiv:2602.21741 (cs), submitted on 25 Feb 2026
Subject: Computer Science > Computation and Language
Authors: MD. Sagor Chowdhury, Adiba Fairooz Chowdhury
Abstract
We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the this http URL pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the s...
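Silence-boundary chunking splits long audio at quiet points so that no chunk cuts through a word. The following is a minimal sketch of one way this could work, not the authors' implementation. The function names, the RMS threshold, the 20 ms frame size, and the greedy backward search are all assumptions. The 30-second default reflects Whisper's fixed 30-second input window; the paper's actual chunk length is not given in the text above.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies


def silence_boundary_chunks(
    samples: Sequence[float],
    sample_rate: int,
    max_chunk_s: float = 30.0,   # assumption: matches Whisper's 30 s window
    frame_ms: int = 20,          # assumption: analysis frame size
    silence_rms: float = 0.01,   # assumption: frames below this are silent
) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) pairs no longer than max_chunk_s.

    Each chunk ends just after the latest silent frame inside the
    length limit; if the window has no silent frame, it is cut at the limit.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    rms = frame_rms(samples, frame_len)
    max_frames = max(1, int(max_chunk_s * 1000 / frame_ms))

    chunks: List[Tuple[int, int]] = []
    start = 0  # index of the first frame in the current chunk
    while start < len(rms):
        limit = min(start + max_frames, len(rms))
        cut = limit
        if limit < len(rms):
            # Walk backwards from the limit to find the latest silent frame.
            for f in range(limit - 1, start, -1):
                if rms[f] < silence_rms:
                    cut = f + 1
                    break
        chunks.append((start * frame_len, min(cut * frame_len, len(samples))))
        start = cut
    return chunks


if __name__ == "__main__":
    # 1 s of signal, 0.2 s of silence, 1 s of signal at a toy 1 kHz rate.
    audio = [0.5] * 1000 + [0.0] * 200 + [0.5] * 1000
    print(silence_boundary_chunks(audio, sample_rate=1000, max_chunk_s=1.5))
```

Each returned span can then be transcribed independently and the texts concatenated in order. A production system would more likely use a trained voice activity detector than a fixed energy threshold, particularly after source separation changes the noise floor.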