[2510.22747] Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

arXiv - AI 4 min read Article

Summary

This article explores the adaptation of large language models (LLMs) for low-resource dialects, focusing on the Québec French dialect using continual pre-training techniques.

Why It Matters

The study highlights the potential for improving accessibility and performance of language models in minority dialects, addressing the disparity in language resources. This research is crucial for linguistic diversity and the inclusion of underrepresented communities in AI advancements.

Key Takeaways

  • Continual pre-training (CPT) can effectively adapt LLMs to low-resource dialects.
  • The study demonstrates improvements in dialect benchmarks with minimal impact on high-resource language performance.
  • Parameter-efficient fine-tuning (PEFT) allows for sustainable language resource creation.
  • The research provides the first Québec French LLMs available on HuggingFace.
  • Corpus composition significantly influences the effectiveness of dialect adaptation.

Computer Science > Computation and Language

arXiv:2510.22747 (cs) [Submitted on 26 Oct 2025 (v1), last revised 12 Feb 2026 (this version, v2)]

Title: Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Authors: Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim

Abstract: Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages with abundant training data. Recently, continual pre-training (CPT) has emerged as a means of fine-tuning these models on low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect on a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate improvements on the minority-dialect benchmarks with minimal regression on the prestige-language benchmarks, with under 1% of model parameters updated. Analysis of the results shows that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dia...
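The "under 1% of model parameters updated" figure follows from how LoRA works: each adapted weight matrix W (d_out × d_in) is frozen, and only a low-rank update B·A is trained, adding r·(d_in + d_out) trainable parameters per matrix. A minimal arithmetic sketch, using hypothetical dimensions loosely modeled on a 7B-class transformer (not the paper's actual models or LoRA configuration):

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA freezes W (d_out x d_in) and trains two low-rank
    # factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

# Hypothetical setup: 32 layers, LoRA rank 16 applied to the four
# attention projections (q, k, v, o), each 4096 x 4096.
d_model, rank, n_layers, mats_per_layer = 4096, 16, 32, 4
trainable = n_layers * mats_per_layer * lora_trainable_params(d_model, d_model, rank)
total = 7_000_000_000  # rough parameter count of the base model
print(f"trainable: {trainable:,} ({100 * trainable / total:.2f}% of total)")
# -> trainable: 16,777,216 (0.24% of total)
```

Even with generous rank and coverage choices, the trainable fraction stays well under 1%, which is what makes CPT with PEFT feasible under the paper's tight compute budget.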

Related Articles

LLMs

Attention Is All You Need, But All You Can't Afford | Hybrid Attention

Repo: https://codeberg.org/JohannaJuntos/Sisyphus I've been building a small Rust-focused language model from scratch in PyTorch. Not a f...

Reddit - Artificial Intelligence · 1 min ·
LLMs

The “Agony” of ChatGPT: Would You Let AI Write Your Wedding Speech?

AI Tools & Products · 12 min ·
LLMs

Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute

AI Tools & Products · 3 min ·
LLMs

How I use Claude for strategy, Gemini for research and ChatGPT for 'the grind'

AI Tools & Products · 9 min ·

