[2504.02293] Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation
Computer Science > Computation and Language
arXiv:2504.02293 (cs)
Submitted on 3 Apr 2025 (v1), last revised 22 Mar 2026 (this version, v2)

Title: Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation

Authors: Sharif Mohammad Abdullah, Abhijit Paul, Shubhashis Roy Dipta, Zarif Masud, Shebuti Rayana, Ahmedul Kabir

Abstract: Gloss is a written approximation that bridges Sign Language (SL) and its corresponding spoken language. Despite a deaf and hard-of-hearing population of at least 3 million in Bangladesh, Bangla Sign Language (BdSL) remains largely understudied, with no prior work on Bangla text-to-gloss translation and no publicly available datasets. To address this gap, we construct the first Bangla text-to-gloss dataset, consisting of 1,000 manually annotated and 4,000 synthetically generated Bangla sentence-gloss pairs, along with 159 expert human-annotated pairs used as a test set. Our experimental framework compares several fine-tuned open-source models against a leading closed-source LLM to evaluate their performance in low-resource BdSL translation. GPT-5.4 achieves the best overall performance, while a fine-tuned mBART model performs competitively despite being approximately 100x smaller. Qwen-3 outperforms all other mod...