[2602.17054] ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Summary
The paper introduces ALPS, a diagnostic challenge set designed to evaluate Arabic linguistic and pragmatic reasoning, highlighting the limitations of existing benchmarks and the performance of various models.
Why It Matters
ALPS addresses a critical gap in Arabic NLP by providing a dataset that emphasizes linguistic depth over scale. This is essential for improving AI models' understanding of Arabic, which is often overlooked in favor of broader benchmarks that may not capture cultural nuances.
Key Takeaways
- ALPS consists of 531 questions across 15 tasks, focusing on deep semantics and pragmatics.
- Existing benchmarks often rely on synthetic or translated data, which can introduce artifacts that distort measures of linguistic understanding.
- The study reveals a significant performance gap between commercial models and Arabic-native models.
- High fluency in models does not equate to understanding morpho-syntactic dependencies.
- The best Arabic-specific model approaches human performance but does not fully match it.
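The accuracy figures above (e.g. models scored against the 84.6% human baseline across 15 tasks) come from per-task aggregation. As a minimal sketch of how such scores might be tallied, assuming a simple list of (task, is_correct) records — the record format and task names here are illustrative, not the paper's actual data schema:

```python
# Hypothetical per-task accuracy aggregation for a diagnostic
# challenge set like ALPS. Task names and records are invented
# for illustration only.
from collections import defaultdict

def per_task_accuracy(records):
    """records: iterable of (task, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, ok in records:
        totals[task] += 1
        correct[task] += int(ok)
    return {t: correct[t] / totals[t] for t in totals}

# Illustrative example records
records = [
    ("deep_semantics", True),
    ("deep_semantics", False),
    ("pragmatics", True),
    ("pragmatics", True),
]
print(per_task_accuracy(records))
# {'deep_semantics': 0.5, 'pragmatics': 1.0}
```

A model's overall score would then be a (possibly weighted) mean of these per-task accuracies, compared against the same aggregation over human answers.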
Computer Science > Computation and Language
arXiv:2602.17054 (cs) [Submitted on 19 Feb 2026]
Title: ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Authors: Hussein S. Al-Olimat, Ahmad Alshareef
Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritic...