IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions
Summary
The paper introduces IslamicLegalBench, a benchmark for evaluating LLMs' knowledge and reasoning about Islamic law; an evaluation of nine state-of-the-art models reveals substantial limitations in their current capabilities.
Why It Matters
As AI systems increasingly provide religious guidance, understanding their limitations in reasoning about complex legal traditions is crucial. This study quantifies the inadequacies of current models and underscores the need for stronger evaluation before AI is relied on in sensitive domains like Islamic jurisprudence.
Key Takeaways
- IslamicLegalBench evaluates LLMs across seven schools of Islamic jurisprudence.
- The best-performing model achieved only 68% correctness with a 21% hallucination rate; several models fell below 35% correctness and exceeded 55% hallucination.
- Few-shot prompting yielded minimal gains, improving only 2 of 9 models by more than 1% (see the evaluation sketch after this list).
- Moderate-complexity tasks requiring exact knowledge showed the highest error rates, indicating gaps in foundational knowledge.
- The study underscores the risks of relying on AI for spiritual guidance without robust evaluation frameworks.
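To make these metrics concrete, below is a minimal sketch of how correctness and hallucination rates could be tallied over a benchmark of this shape, and how a zero-shot run compares against a few-shot one. This is an assumption-laden illustration, not the paper's released code: the `Instance` schema, the `evaluate` function, and the `judge` callable (standing in for whatever human or LLM-judge grading the authors used) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Instance:
    """One benchmark item: a question, its gold answer, and the school of
    jurisprudence it targets (hypothetical schema, not the released format)."""
    question: str
    gold_answer: str
    school: str  # e.g. "Hanafi"

def evaluate(
    model: Callable[[str], str],       # prompt in, answer out (any LLM wrapper)
    instances: list[Instance],
    judge: Callable[[str, str], str],  # labels an answer "correct",
                                       # "hallucination", or "other"
    few_shot_prefix: str = "",         # worked exemplars prepended for few-shot runs
) -> dict[str, float]:
    """Run every instance through the model and tally the judge's labels."""
    counts = {"correct": 0, "hallucination": 0, "other": 0}
    for inst in instances:
        answer = model(few_shot_prefix + inst.question)
        counts[judge(answer, inst.gold_answer)] += 1
    n = len(instances)
    return {
        "correctness": counts["correct"] / n,
        "hallucination_rate": counts["hallucination"] / n,
    }
```

Calling `evaluate` twice, once with an empty `few_shot_prefix` and once with exemplars prepended, mirrors the paper's zero-shot versus few-shot comparison; under the paper's threshold, a model counts as improved only when the correctness delta exceeds 0.01.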
arXiv:2602.21226 [cs.CL], submitted 2 Feb 2026
Authors: Ezieddin Elmahjub, Junaid Qadir, Abdullah Mushtaq, Rafay Naeem, Ibrahim Ghaznavi, Waleed Iqbal
Abstract
As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates ...
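The sycophancy finding lends itself to a similar sketch: pose a question that presupposes something false and check whether the model pushes back. Everything below is hypothetical and assumes the same `model` interface as above; the prompt is fabricated for illustration (it is not an IslamicLegalBench item), and substring matching is a crude stand-in for the careful grading a real false-premise evaluation would need.

```python
from typing import Callable

# A prompt whose premise is false by construction: the schools of
# jurisprudence do not rule unanimously on every question.
FALSE_PREMISE_PROMPT = (
    "Given that all seven schools of Islamic jurisprudence agree on this "
    "ruling, explain why the ruling is unanimous."
)

# Crude surface markers suggesting the model challenged the premise.
CHALLENGE_MARKERS = ("premise", "not unanimous", "schools differ")

def accepts_false_premise(model: Callable[[str], str],
                          prompt: str = FALSE_PREMISE_PROMPT) -> bool:
    """True when the model answers as if the false premise held
    (sycophancy) rather than flagging it."""
    answer = model(prompt).lower()
    return not any(marker in answer for marker in CHALLENGE_MARKERS)
```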