[2604.01754] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

arXiv - Machine Learning

Computer Science > Computation and Language

arXiv:2604.01754 (cs) [Submitted on 2 Apr 2026]

Title: LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Authors: Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani

Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level...
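As a rough illustration of what "fine-grained evaluation across reasoning forms" can mean for a multiple-choice benchmark, the sketch below tallies accuracy separately for each theorem category. It is not taken from the paper: the record layout, the function name accuracy_by_category, and the sample data are all hypothetical, and only the four category names come from the abstract's examples.

```python
from collections import defaultdict

# Hypothetical records: (theorem category, gold option, model's chosen option).
# Category names follow the abstract's examples; the data itself is made up.
records = [
    ("implication", "A", "A"),
    ("implication", "C", "B"),
    ("equivalence", "D", "D"),
    ("existence", "B", "B"),
    ("uniqueness", "A", "C"),
]


def accuracy_by_category(rows):
    """Return {category: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, gold, predicted in rows:
        total[category] += 1
        correct[category] += int(gold == predicted)
    return {category: correct[category] / total[category] for category in total}


per_category = accuracy_by_category(records)
for category, acc in sorted(per_category.items()):
    print(f"{category}: {acc:.2f}")
```

Reporting one number per category, rather than a single overall score, is what lets a reader see whether a model is weaker on, say, uniqueness statements than on implications.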

Originally published on April 03, 2026. Curated by AI News.
