[2602.18776] ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Summary
The paper introduces ArabicNumBench, a benchmark for evaluating Arabic number reading capabilities in large language models, revealing significant performance variations across models and prompting strategies.
Why It Matters
As Arabic NLP continues to grow, understanding how well language models perform on specific tasks like number reading is crucial. This benchmark provides insights into model capabilities, guiding developers in selecting appropriate models for applications in Arabic-speaking contexts.
Key Takeaways
- ArabicNumBench evaluates 71 models on Arabic number reading tasks.
- Performance varies widely, with accuracy from 14.29% to 99.05%.
- Few-shot Chain-of-Thought prompting significantly improves accuracy.
- High accuracy often correlates with unstructured output generation.
- Only 6 models consistently produce structured outputs across all tasks.
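The headline comparison above (few-shot CoT at 80.06% vs. zero-shot at 28.76%, a 2.8x gap) boils down to simple exact-match scoring. A minimal sketch, assuming hypothetical lists of predicted and gold number readings; the paper's actual harness is not shown here:

```python
def accuracy(predictions, golds):
    """Fraction of exact matches between predicted and gold readings."""
    assert len(predictions) == len(golds)
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return correct / len(golds)

# Illustrative aggregate figures reported in the paper.
few_shot_cot, zero_shot = 0.8006, 0.2876
print(f"improvement: {few_shot_cot / zero_shot:.1f}x")  # -> improvement: 2.8x
```

The same accuracy function, run once per model and prompting strategy, is all that is needed to reproduce the kind of per-strategy comparison the benchmark reports.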
Paper Details
Computer Science > Computation and Language
arXiv:2602.18776 (cs) [Submitted on 21 Feb 2026]
Title: ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Authors: Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan
Abstract: We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured outputs...
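The abstract notes that the evaluation tracks extraction methods to tell structured from unstructured responses. A minimal sketch of such a classifier, assuming answers are flagged by an Arabic marker line "الإجابة:" ("the answer:"); the marker and fallback logic are hypothetical conventions, not necessarily the ones ArabicNumBench uses:

```python
import re

# Hypothetical answer marker; responses containing it count as "structured".
ANSWER_MARKER = re.compile(r"الإجابة\s*:\s*(.+)")

def extract_answer(response: str):
    """Return (answer, structured_flag); fall back to the last non-empty line."""
    m = ANSWER_MARKER.search(response)
    if m:
        return m.group(1).strip(), True  # structured: explicit marker found
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return (lines[-1].strip() if lines else ""), False  # unstructured fallback

ans, structured = extract_answer("خطوات الحل\nالإجابة: مئة وخمسة")
print(structured)  # -> True
```

Counting how often the fallback path fires, per model and per task, yields exactly the kind of structured-output statistic the paper uses to show that elite-accuracy models often answer without CoT markers.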