[2603.21478] TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

arXiv - Machine Learning March 24, 2026 4 min read

Computer Science > Computation and Language
arXiv:2603.21478 (cs)
[Submitted on 23 Mar 2026]

Title: TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

Authors: Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang, Yun-Shao Tsai, Chien-Cheng Chen, Yu Tsao, Yuan-Fu Liao, Shrikanth Narayanan, James Glass, Hung-yi Lee

Abstract: Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (also known as Taiwanese Hokkien or Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design e...
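The keyword-match mining strategy named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which the abstract does not describe in detail. It only shows the general idea of selecting utterances whose transcript contains an intent keyword and attaching a pseudo-label. The intent names, keyword lists, function names, and example transcripts are all hypothetical, and the LLM pseudo-labeling step via an intermediate language is left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical intents and keywords, chosen for illustration only.
# They are not taken from the TaigiSpeech paper.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "call_for_help": ["help", "emergency"],
    "lights_on": ["turn on the light", "lights on"],
}


def mine_by_keyword(
    transcript: str,
    intent_keywords: Dict[str, List[str]] = INTENT_KEYWORDS,
) -> Optional[str]:
    """Return the first intent whose keyword occurs in the transcript, else None."""
    text = transcript.lower()
    for intent, keywords in intent_keywords.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def mine_corpus(transcripts: List[str]) -> List[Tuple[str, str]]:
    """Keep only transcripts that match a keyword, paired with a pseudo-label."""
    mined = []
    for transcript in transcripts:
        label = mine_by_keyword(transcript)
        if label is not None:
            mined.append((transcript, label))
    return mined
```

In a setup like this, the transcripts would typically come from an intermediate-language transcription or translation of unlabeled audio, and the matched utterances would still need a second labeling or filtering pass, since plain substring matching produces false positives.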

Originally published on March 24, 2026. Curated by AI News.
