[2603.19252] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark

[2603.19252] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

arXiv - AI March 23, 2026 3 min read

About this article

Abstract page for arXiv paper 2603.19252: GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

Computer Science > Computation and Language arXiv:2603.19252 (cs) [Submitted on 25 Feb 2026] Title:GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams Authors:Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, Jun Liu View a PDF of the paper titled GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams, by Yushun Zhang and 7 other authors View PDF HTML (experimental) Abstract:Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and...

Originally published on March 23, 2026. Curated by AI News.

Llms

[R] BraiNN: An Experimental Neural Architecture with Working Memory, Relational Reasoning, and Adaptive Learning

BraiNN An Experimental Neural Architecture with Working Memory, Relational Reasoning, and Adaptive Learning BraiNN is a compact research‑...

Reddit - Machine Learning · 1 min · 26 minutes ago

Llms

We hit 150 stars on our AI setup tool!

yo folks, we just hit 150 stars on our open source tool that auto makes AI context files. got 90 PRs merged and 20 issues that ppl are pi...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Llms

Is ai getting dummer?

Over the past month, it feels like GPT and Gemini have been giving wrong answers a lot. Do you feel the same, or am I exaggerating? submi...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Llms

If AI is really making us more productive... why does it feel like we are working more, not less...?

The promise of AI was the ultimate system optimisation: Efficiency. On paper, the tools are delivering something similar to what they pro...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

[2603.19252] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

About this article

Related Articles

[R] BraiNN: An Experimental Neural Architecture with Working Memory, Relational Reasoning, and Adaptive Learning

We hit 150 stars on our AI setup tool!

Is ai getting dummer?

If AI is really making us more productive... why does it feel like we are working more, not less...?

No comments

Stay updated with AI News