[2602.16467] IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Summary
IndicEval is a bilingual evaluation framework for large language models that assesses their performance on real examination questions in English and Hindi, highlighting gaps in bilingual reasoning.
Why It Matters
As large language models become integral to educational settings, a robust evaluation framework like IndicEval is crucial for ensuring these models perform effectively across languages and academic standards. By grounding assessment in questions drawn from actual examinations, the framework provides a more realistic measure of how reliably AI performs in multilingual contexts.
Key Takeaways
- IndicEval benchmarks LLMs against real exam questions from UPSC, JEE, and NEET.
- Chain-of-Thought prompting significantly enhances reasoning accuracy across subjects.
- Multilingual performance shows notable degradation, particularly in Hindi.
- The framework supports modular integration of new models and languages.
- Cross-model performance disparities highlight the need for improved bilingual reasoning.
Computer Science > Computation and Language
arXiv:2602.16467 (cs) [Submitted on 18 Feb 2026]
Title: IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Authors: Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas
Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly...
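The abstract describes two mechanisms: swappable prompting strategies (Zero-Shot, Few-Shot, CoT) and modular integration of new models. The sketch below is not the authors' code; it is a minimal illustration, under assumed names (ExamQuestion, ModelAdapter, evaluate, etc.), of how such a harness might separate prompt construction from the model backend so either can be replaced independently.

```python
# Hypothetical sketch of an IndicEval-style harness; all names are assumptions.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExamQuestion:
    text: str            # question stem in English or Hindi
    options: list[str]   # multiple-choice options
    answer: str          # gold option label, e.g. "B"
    language: str        # "en" or "hi"


class ModelAdapter(Protocol):
    """Any LLM backend only needs to expose a generate() method."""
    def generate(self, prompt: str) -> str: ...


def zero_shot_prompt(q: ExamQuestion) -> str:
    # Zero-shot: question and options only, no reasoning instruction.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(q.options))
    return f"Answer with the option letter only.\n\n{q.text}\n{opts}\nAnswer:"


def cot_prompt(q: ExamQuestion) -> str:
    # Chain-of-Thought: ask the model to reason step by step before answering.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(q.options))
    return (f"{q.text}\n{opts}\n"
            "Think step by step, then give the final option letter "
            "on a new line as 'Answer: <letter>'.")


def evaluate(model: ModelAdapter, questions: list[ExamQuestion],
             build_prompt=cot_prompt) -> float:
    """Return accuracy of `model` under the chosen prompting strategy."""
    correct = 0
    for q in questions:
        reply = model.generate(build_prompt(q))
        # Naive answer extraction: take the last option letter in the reply.
        letters = [c for c in reply.upper() if c in "ABCD"]
        if letters and letters[-1] == q.answer:
            correct += 1
    return correct / len(questions) if questions else 0.0
```

Under this design, swapping in a different prompt builder or wrapping a new LLM API behind ModelAdapter leaves the evaluation loop untouched, which is the kind of modular integration of models and strategies the abstract describes.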