Multilingual Large Language Models do not comprehend all natural languages to equal degrees (arXiv:2602.20065)
Summary
This article examines the performance of multilingual large language models (LLMs) across various languages, revealing that comprehension varies significantly across languages and that English is not always the best-performing one.
Why It Matters
Understanding the limitations of LLMs in comprehending different languages is crucial for improving AI accessibility and performance globally. This research highlights the need for more inclusive benchmarks that account for low-resource languages, ultimately guiding future developments in AI and NLP.
Key Takeaways
- LLMs do not perform equally across all languages, with notable discrepancies in comprehension.
- English is not the best-performing language for LLMs; several Romance languages outperform it.
- The study emphasizes the importance of diverse training data and benchmarks that include low-resource languages.
Computer Science > Computation and Language
arXiv:2602.20065 (cs) [Submitted on 23 Feb 2026]

Title: Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Authors: Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi, Walid Irhaymi, Jin Yan, Tamara Serrano, Elena Pagliarini, Fritz Günther, Evelina Leivada

Abstract: Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind h...
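The evaluation protocol the abstract describes, prompting several models on the same comprehension task in each language and comparing per-language accuracy, can be pictured with a minimal sketch. The sketch below is illustrative only: the model names, the `query_model` stub, and the toy items are assumptions, not the authors' actual setup or data.

```python
from collections import defaultdict

# Hypothetical placeholder, not a real API: swap in an actual client
# (e.g. an HTTP call to a hosted model) to run a real evaluation.
def query_model(model: str, prompt: str) -> str:
    return "yes"  # stub answer so the sketch runs end to end

# Toy items: (language, comprehension question, expected answer).
# The study covers 12 languages across 5 families; these three
# items are invented placeholders.
ITEMS = [
    ("English", "Does 'the cat the dog chased ran' mention a dog?", "yes"),
    ("Spanish", "Does 'el gato corrió' mention a dog?", "no"),
    ("Turkish", "Does 'kedi koştu' mention a dog?", "no"),
]

MODELS = ["model-a", "model-b", "model-c"]  # stand-ins for the 3 LLMs

def evaluate() -> None:
    correct = defaultdict(int)  # (model, language) -> correct answers
    total = defaultdict(int)    # (model, language) -> items asked
    for model in MODELS:
        for language, question, expected in ITEMS:
            answer = query_model(model, question).strip().lower()
            correct[(model, language)] += int(answer == expected)
            total[(model, language)] += 1
    # Per-language accuracy within each model is the comparison the
    # paper draws: it exposes which languages a model handles worse.
    for key in sorted(total):
        model, language = key
        print(f"{model} / {language}: {correct[key] / total[key]:.0%}")

if __name__ == "__main__":
    evaluate()
```

Holding the task constant across languages, as this loop does, is what lets differences in accuracy be attributed to the language rather than to the task.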