[2602.18776] ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

arXiv - AI · 4 min read

Summary

The paper introduces ArabicNumBench, a benchmark for evaluating Arabic number reading capabilities in large language models, revealing significant performance variations across models and prompting strategies.

Why It Matters

As Arabic NLP continues to grow, understanding how well language models perform on specific tasks like number reading is crucial. This benchmark provides insights into model capabilities, guiding developers in selecting appropriate models for applications in Arabic-speaking contexts.

Key Takeaways

  • ArabicNumBench evaluates 71 models on Arabic number reading tasks.
  • Performance varies widely, with accuracy from 14.29% to 99.05%.
  • Few-shot Chain-of-Thought prompting significantly improves accuracy (a prompt sketch follows this list).
  • High accuracy often correlates with unstructured output generation.
  • Only 6 models consistently produce structured outputs across all tasks.
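
Below is a minimal sketch of what a few-shot CoT prompt for this task might look like. The exemplars, phrasing, and task framing (writing digits out as Arabic number words) are illustrative assumptions, not the paper's actual prompts.

# Illustrative sketch only: a few-shot Chain-of-Thought prompt for Arabic
# number reading. Exemplars and wording are assumptions, not taken from
# the paper.

FEW_SHOT_COT_TEMPLATE = """\
Read the number in each sentence and write it out in Arabic words.

Q: السعر ٢٥ ريالاً
Reasoning: ٢٥ is the Eastern Arabic-Indic form of 25, i.e. twenty-five.
A: خمسة وعشرون

Q: الموعد يوم ٧ مارس
Reasoning: ٧ is the Eastern Arabic-Indic form of 7, i.e. seven.
A: سبعة

Q: {question}
Reasoning:"""

def build_prompt(question: str) -> str:
    """Fill the template with a new test question."""
    return FEW_SHOT_COT_TEMPLATE.format(question=question)

print(build_prompt("الكمية ٤٢ صندوقاً"))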

Computer Science > Computation and Language
arXiv:2602.18776 (cs) · Submitted on 21 Feb 2026

Title: ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Authors: Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

Abstract: We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured outputs across all tasks.
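
As a concrete illustration of the two numeral systems the benchmark covers, here is a minimal normalization-and-scoring sketch, assuming an exact-match metric after mapping Eastern Arabic-Indic digits (Unicode U+0660 through U+0669) to Western Arabic digits; this is an assumed implementation, not the paper's pipeline.

# Assumed normalization/scoring sketch; the paper's actual extraction and
# metric may differ. Eastern Arabic-Indic digits are U+0660 through U+0669.

EASTERN_TO_WESTERN = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize_digits(text: str) -> str:
    """Map Eastern Arabic-Indic digits to their Western equivalents."""
    return text.translate(EASTERN_TO_WESTERN)

def exact_match(response: str, gold: str) -> bool:
    """Digit-normalized exact match over the extracted answer."""
    return normalize_digits(response).strip() == normalize_digits(gold).strip()

assert normalize_digits("السعر ٢٥ ريالاً") == "السعر 25 ريالاً"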


