[2602.12937] Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

arXiv - Machine Learning

Summary

This paper treats Arabic Dialect Identification (ADI) as a multi-label classification task and uses curriculum learning and pseudo-labeling to improve how well the resulting models generalize.

Why It Matters

ADI has long been modeled as a single-label task, and no large-scale multi-label dataset exists for training. The paper works around this by generating automatic multi-label annotations with GPT-4o and binary dialect acceptability classifiers, then training a multi-label classifier on the result. Recognizing that one sentence can be acceptable in several dialects matters for any NLP application that has to handle dialectal Arabic.

Key Takeaways

  • ADI is reframed as a multi-label classification task, because many sentences are acceptable in more than one dialect.
  • A BERT-based classifier is trained with curriculum learning strategies aligned with dialectal complexity.
  • The LAHJATBERT model achieved a macro F1 score of 0.69, outperforming previous systems.
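
For readers unfamiliar with the metric in the last takeaway, the following is a minimal sketch of how a macro F1 score can be computed for multi-label predictions: F1 is computed per label and then averaged with equal weight. The three-label setup, the toy data, and the choice to score a label with no positives as 0 are assumptions made for illustration; this is not the paper's evaluation code.

```python
from typing import List


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Macro-averaged F1 over labels for multi-hot label matrices.

    Each row is one sentence; each column is one dialect label (0 or 1).
    """
    n_labels = len(y_true[0])
    per_label_f1 = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no true or predicted positives scores 0.
        per_label_f1.append(2 * tp / denom if denom else 0.0)
    return sum(per_label_f1) / n_labels


# Hypothetical toy data: 4 sentences, 3 dialect labels.
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every label counts equally, a rarely predicted dialect pulls the score down as much as a common one, which is why macro averaging is a natural fit when label frequencies are uneven.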

Computer Science > Computation and Language
arXiv:2602.12937 (cs)
[Submitted on 12 Feb 2026]

Title: Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Authors: Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov

Abstract: Being modeled as a single-label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label car...
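
The general idea of curriculum learning mentioned in the abstract can be sketched as follows: order the training samples from easy to hard and expose the model to progressively larger, harder subsets. The abstract is cut off before it finishes describing the authors' criteria, so the difficulty proxy below (the number of dialects marked acceptable), the staging scheme, and all names are hypothetical and chosen only to make the example concrete.

```python
import math
from typing import List, Tuple

# (sentence, multi-hot dialect labels); sentences are placeholders.
Sample = Tuple[str, List[int]]


def difficulty(sample: Sample) -> int:
    """Assumed difficulty proxy: how many dialects accept the sentence."""
    return sum(sample[1])


def curriculum_stages(samples: List[Sample], n_stages: int) -> List[List[Sample]]:
    """Return cumulative training subsets ordered from easy to hard.

    Stage k contains the easiest k/n_stages fraction of the data, so the
    final stage is the full training set.
    """
    ordered = sorted(samples, key=difficulty)  # stable: ties keep input order
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = math.ceil(len(ordered) * k / n_stages)
        stages.append(ordered[:cutoff])
    return stages


# Hypothetical toy data with 3 dialect labels.
data: List[Sample] = [
    ("sentence_a", [1, 0, 0]),
    ("sentence_b", [1, 1, 1]),
    ("sentence_c", [0, 1, 1]),
    ("sentence_d", [0, 0, 1]),
]

for i, stage in enumerate(curriculum_stages(data, n_stages=2), start=1):
    print(f"stage {i}:", [text for text, _ in stage])
```

In a real training loop, each stage would be used for one or more epochs before moving on to the next; whether fewer labels actually means "easier" for a given model is an empirical question, not something this sketch settles.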
