[2601.02158] FormationEval, an open multiple-choice benchmark for petroleum geoscience

arXiv - Machine Learning

Summary

FormationEval is a benchmark of 505 multiple-choice questions for evaluating language models on petroleum geoscience, spanning seven subsurface domains, together with evaluation results for models from major providers.

Why It Matters

This benchmark addresses the need for standardized evaluation tools in petroleum geoscience, facilitating better assessment of AI models' capabilities in this specialized field. It highlights performance disparities and encourages improvements in model accuracy, particularly for open-weight alternatives.

Key Takeaways

  • FormationEval includes 505 multiple-choice questions from seven geoscience domains.
  • Top models exceed 97% accuracy, with both closed and open-weight models among the leaders (Gemini 3 Pro Preview at 99.8%; GLM-4.7 at 98.6%).
  • Petrophysics is identified as the most challenging domain for AI models.
  • The benchmark and evaluation results are publicly accessible, promoting transparency.
  • Bias mitigation strategies were implemented to address dataset length discrepancies.

Computer Science > Computation and Language
arXiv:2601.02158 (cs)
[Submitted on 5 Jan 2026 (v1), last revised 14 Feb 2026 (this version, v2)]

Title: FormationEval, an open multiple-choice benchmark for petroleum geoscience
Authors: Almaz Ermilov

Abstract: This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains, including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging...
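The headline numbers above are overall and per-domain accuracies on a multiple-choice set. As a minimal sketch of how such scoring works in general (the record schema and field names here are assumptions for illustration, not FormationEval's actual data format):

```python
from collections import defaultdict

def score_by_domain(records):
    """Return overall and per-domain accuracy for model answers.

    Each record is a dict with assumed keys:
      "domain" - e.g. "petrophysics"
      "answer" - gold option letter, e.g. "B"
      "choice" - the option letter the model selected
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        correct[r["domain"]] += r["choice"] == r["answer"]
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_domain

# Toy example: 2/3 correct overall, petrophysics the weaker domain.
records = [
    {"domain": "petrophysics", "answer": "B", "choice": "B"},
    {"domain": "petrophysics", "answer": "C", "choice": "A"},
    {"domain": "petroleum geology", "answer": "D", "choice": "D"},
]
overall, per_domain = score_by_domain(records)
```

Breaking accuracy out per domain, as above, is what lets a benchmark report findings like "petrophysics is the most challenging domain" rather than a single aggregate number.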

