[2602.23388] Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages
Computer Science > Computation and Language
arXiv:2602.23388 (cs)
[Submitted on 16 Feb 2026]

Authors: Swati Sharma, Divya V. Sharma, Anubha Gupta

Abstract: The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. F...