
[2602.21741] Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization

arXiv - Machine Learning · February 26, 2026 · 3 min read · Article

Summary

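This article presents an end-to-end system for long-form Bangla speech recognition and speaker diarization, submitted to the DL Sprint 4.0 competition on Kaggle. For ASR it combines a BengaliAI fine-tuned Whisper medium model with Demucs vocal isolation and silence-boundary chunking; for diarization it swaps in a Bengali-fine-tuned segmentation model.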

Why It Matters

The research addresses the complexities of processing Bangla speech, a language with unique phonetic and dialectal characteristics. By improving automatic speech recognition (ASR) and speaker diarization for low-resource languages, this work contributes to the advancement of inclusive AI technologies and enhances accessibility for Bengali speakers.

Key Takeaways

Achieved a best public Word Error Rate (WER) of 0.36137 and a private WER of 0.37738 for Bangla ASR.
Implemented effective vocal source separation and silence-aware chunking.
Fine-tuning domain-specific models significantly improved performance.
Addressed challenges of dialectal variation and code-mixing in Bangla.
Identified the relative scarcity of large-scale labelled corpora as a core obstacle for Bangla ASR.
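
For readers unfamiliar with the metric in the takeaways, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below is a minimal illustration, not the competition's scoring code: the function name `word_error_rate` is hypothetical, and plain whitespace tokenization is an assumption (real scorers often normalise punctuation and casing first).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a score of 0.36137 corresponds to roughly 36 word-level errors per 100 reference words. Because insertions count as errors, WER can exceed 1.0.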

Computer Science > Computation and Language
arXiv:2602.21741 (cs)
[Submitted on 25 Feb 2026]

Title: Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization

Authors: MD. Sagor Chowdhury, Adiba Fairooz Chowdhury

Abstract: We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the this http URL pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the s...
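
The abstract names silence-boundary chunking but does not spell out the procedure. The sketch below shows one plausible reading, under stated assumptions: long audio is split into chunks no longer than a fixed limit (30 seconds here, matching Whisper's input window), and each cut is moved back to the most recent low-energy frame so that words are not split mid-utterance. The function names, the RMS energy test, the 20 ms frame size, and the threshold are all hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def silent_frames(samples: Sequence[float], frame_len: int,
                  threshold: float) -> List[bool]:
    """Flag each fixed-size frame whose RMS energy is below the threshold."""
    flags = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        flags.append(rms < threshold)
    return flags


def chunk_at_silence(samples: Sequence[float], sample_rate: int,
                     max_chunk_s: float = 30.0, frame_s: float = 0.02,
                     threshold: float = 0.01) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges, each at most max_chunk_s long.

    Each chunk ends on the latest silent frame inside its window when one
    exists; otherwise it falls back to a hard cut at the length limit.
    """
    frame_len = max(1, round(sample_rate * frame_s))
    max_frames = max(1, round(max_chunk_s / frame_s))
    silent = silent_frames(samples, frame_len, threshold)
    n_frames = len(silent)

    chunks = []
    start = 0
    while start < n_frames:
        end = min(start + max_frames, n_frames)
        if end < n_frames:
            # Walk back from the hard limit to the latest silent frame.
            cut = end
            while cut > start + 1 and not silent[cut - 1]:
                cut -= 1
            if cut > start + 1:
                end = cut
        chunks.append((start * frame_len, min(end * frame_len, len(samples))))
        start = end
    return chunks
```

Each returned range could then be transcribed independently and the texts concatenated in order. A production system would more likely use a dedicated voice activity detector than a fixed RMS threshold, particularly after source separation has changed the noise floor.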


