[2602.15373] Far Out: Evaluating Language Models on Slang in Australian and Indian English
Summary
This paper evaluates the performance of language models on slang in Australian and Indian English, revealing significant gaps in understanding non-standard language varieties.
Why It Matters
Understanding how language models handle slang is crucial for improving their effectiveness in diverse linguistic contexts. This research highlights the need for better model training on variety-specific language, which is essential for applications in natural language processing and AI development.
Key Takeaways
- Language models show performance gaps in understanding slang.
- Australian English slang is less accurately processed than Indian English slang.
- Models perform better on real-world data compared to synthetically generated examples.
- Target word selection tasks yield higher accuracy than prediction tasks.
- The study underscores the importance of training models on diverse language varieties.
Computer Science > Computation and Language
arXiv:2602.15373 (cs)
Submitted on 17 Feb 2026
Authors: Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi
Abstract: Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). Our results reveal four key findings: (1) higher average model performance on TWS versus TWP and TWP*, with average accuracy increasing from 0.03 to 0.49 respectively; (2) stronger average model performance on the WEB versus GEN datasets, with average similarity score increasing by 0.03 and 0.05 across TWP...
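To make the task setup concrete, the sketch below shows how accuracy for a target word selection (TWS) style task could be computed: the model chooses a slang term for each masked usage example, and accuracy is the fraction of choices matching the gold term. This is a minimal illustration, not the paper's evaluation code; the example data and slang glosses are invented for demonstration.

```python
def tws_accuracy(examples):
    """Accuracy for a target-word-selection task.

    Each example pairs the model's selected candidate with the gold
    slang term; a case-insensitive exact match counts as correct.
    """
    if not examples:
        return 0.0
    correct = sum(1 for selected, gold in examples
                  if selected.lower() == gold.lower())
    return correct / len(examples)


# Illustrative (invented) en-AU / en-IN pairs: (model choice, gold term)
examples = [
    ("arvo", "arvo"),        # en-AU slang for "afternoon"
    ("servo", "servo"),      # en-AU slang for "service station"
    ("prepone", "prepone"),  # en-IN slang for "move to an earlier time"
    ("bogan", "chockers"),   # an incorrect selection
]
print(round(tws_accuracy(examples), 2))  # 0.75
```

The same harness shape works for the prediction tasks (TWP, TWP*) by swapping the exact-match check for a similarity score, which is the metric the abstract reports for those tasks.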