[2601.02158] FormationEval, an open multiple-choice benchmark for petroleum geoscience
Summary
FormationEval introduces a benchmark for evaluating language models in petroleum geoscience, featuring 505 questions across seven subsurface domains and an evaluation of 72 models from major providers.
Why It Matters
This benchmark addresses the need for standardized evaluation tools in petroleum geoscience, facilitating better assessment of AI models' capabilities in this specialized field. It highlights performance disparities and encourages improvements in model accuracy, particularly for open-weight alternatives.
Key Takeaways
- FormationEval includes 505 multiple-choice questions from seven geoscience domains.
- Top AI models achieved over 97% accuracy, with strong results from both closed and open-weight models: Gemini 3 Pro Preview leads overall at 99.8%, and GLM-4.7 leads open-weight models at 98.6%.
- Petrophysics is identified as the most challenging domain for AI models.
- The benchmark and evaluation results are publicly accessible, promoting transparency.
- Bias mitigation strategies were applied to address length discrepancies in the dataset.
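To make the evaluation setup concrete, a minimal sketch of how accuracy could be scored overall and per domain on a multiple-choice benchmark like this one. The record fields (`domain`, `answer`) and the example questions are hypothetical, not FormationEval's actual schema:

```python
from collections import defaultdict

# Hypothetical question records; FormationEval's real schema may differ.
questions = [
    {"domain": "petrophysics", "answer": "B"},
    {"domain": "petroleum geology", "answer": "A"},
    {"domain": "reservoir engineering", "answer": "C"},
]

def score(predictions, questions):
    """Return overall and per-domain accuracy for a list of predicted letters."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, q in zip(predictions, questions):
        total[q["domain"]] += 1
        if pred == q["answer"]:
            correct[q["domain"]] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_domain

overall, per_domain = score(["B", "A", "B"], questions)
print(overall)                      # 2 of 3 correct
print(per_domain["petrophysics"])   # 1.0
```

Reporting per-domain accuracy alongside the overall score is what surfaces findings like petrophysics being the hardest domain.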
Paper Details
arXiv:2601.02158 [cs] (Computer Science > Computation and Language). Submitted 5 Jan 2026 (v1); last revised 14 Feb 2026 (v2). Author: Almaz Ermilov.
Abstract
This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most cha...
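The length-discrepancy concern mentioned above can be checked with a simple diagnostic: if correct options are systematically longer (or shorter) than distractors, a model could score well from option length alone. A minimal sketch, with hypothetical field names (`options`, `answer_idx`) and made-up example items rather than FormationEval's actual data:

```python
def length_gap(items):
    """Mean character-length difference between the correct option
    and the average distractor; values far from 0 suggest length bias."""
    gaps = []
    for it in items:
        correct_len = len(it["options"][it["answer_idx"]])
        distractor_lens = [
            len(opt) for i, opt in enumerate(it["options"]) if i != it["answer_idx"]
        ]
        gaps.append(correct_len - sum(distractor_lens) / len(distractor_lens))
    return sum(gaps) / len(gaps)

# Toy items for illustration only.
items = [
    {"options": ["gamma ray", "neutron porosity log", "sonic"], "answer_idx": 1},
    {"options": ["anticline trap", "fault", "salt dome"], "answer_idx": 0},
]
print(length_gap(items))  # positive: correct answers tend to be longer
```

A gap near zero across the dataset is one piece of evidence that length-based shortcuts have been mitigated; rewriting or rebalancing options is the usual fix when it is not.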