[2602.18776] ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Summary
The paper introduces ArabicNumBench, a benchmark for evaluating Arabic number reading capabilities in large language models, revealing significant performance variations across models and prompting strategies.
Why It Matters
As Arabic NLP continues to grow, understanding how well language models perform on specific tasks like number reading is crucial. This benchmark provides insights into model capabilities, guiding developers in selecting appropriate models for applications in Arabic-speaking contexts.
Key Takeaways
- ArabicNumBench evaluates 71 models on Arabic number reading tasks.
- Performance varies widely, with accuracy from 14.29% to 99.05%.
- Few-shot Chain-of-Thought prompting significantly improves accuracy.
- High accuracy often correlates with unstructured output generation.
- Only 6 models consistently produce structured outputs across all tasks.
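The headline comparison above (few-shot CoT at 80.06% vs. zero-shot at 28.76%, a 2.8x gap) boils down to simple exact-match scoring. A minimal sketch, assuming hypothetical lists of predicted and gold number readings; the paper's actual harness is not shown here:

```python
def accuracy(predictions, golds):
    """Fraction of exact matches between predicted and gold readings."""
    assert len(predictions) == len(golds)
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return correct / len(golds)

# Illustrative aggregate figures reported in the paper.
few_shot_cot, zero_shot = 0.8006, 0.2876
print(f"improvement: {few_shot_cot / zero_shot:.1f}x")  # -> improvement: 2.8x
```

The same accuracy function, run once per model and prompting strategy, is all that is needed to reproduce the kind of per-strategy comparison the benchmark reports.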
Paper Details
Computer Science > Computation and Language
arXiv:2602.18776 (cs) [Submitted on 21 Feb 2026]
Title: ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Authors: Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan
Abstract: We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured outputs...
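The abstract notes that the evaluation tracks extraction methods to tell structured from unstructured responses. A minimal sketch of such a classifier, assuming answers are flagged by an Arabic marker line "الإجابة:" ("the answer:"); the marker and fallback logic are hypothetical conventions, not necessarily the ones ArabicNumBench uses:

```python
import re

# Hypothetical answer marker; responses containing it count as "structured".
ANSWER_MARKER = re.compile(r"الإجابة\s*:\s*(.+)")

def extract_answer(response: str):
    """Return (answer, structured_flag); fall back to the last non-empty line."""
    m = ANSWER_MARKER.search(response)
    if m:
        return m.group(1).strip(), True  # structured: explicit marker found
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return (lines[-1].strip() if lines else ""), False  # unstructured fallback

ans, structured = extract_answer("خطوات الحل\nالإجابة: مئة وخمسة")
print(structured)  # -> True
```

Counting how often the fallback path fires, per model and per task, yields exactly the kind of structured-output statistic the paper uses to show that elite-accuracy models often answer without CoT markers.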