[2512.24796] LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)


Summary

LeanCat introduces a benchmark suite of 100 formalized category-theory tasks in Lean, shows that current LLMs struggle to reason with the high-level abstractions these tasks demand, and evaluates LeanBridge, a retrieval-augmented agent, as a way to improve performance.

Why It Matters

This research addresses a significant gap in the evaluation of large language models' capabilities in formal theorem proving, particularly in category theory. By establishing LeanCat, it provides a structured way to measure and enhance the performance of AI in abstract reasoning tasks, which is crucial for advancements in both mathematics and software engineering.

Key Takeaways

  • LeanCat presents 100 fully formalized category-theory tasks in Lean to benchmark library-grounded abstract reasoning.
  • Current models' performance collapses on harder tasks (from 55.0% on Easy to 0.0% on High-difficulty at pass@4), indicating a failure of compositional generalization.
  • LeanBridge, a retrieval-augmented agent, improves performance through dynamic library retrieval and an iterative retrieve-generate-verify loop.
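
To make the subject matter concrete: the benchmark targets statements over category-theory interfaces of the kind Mathlib provides. The following is a hedged illustration only, not a task from LeanCat itself (the paper's tasks are not reproduced here); it shows what a minimal 1-category goal looks like in Lean 4 with Mathlib:

```lean
import Mathlib.CategoryTheory.Category.Basic

open CategoryTheory

-- A toy goal in the spirit of interface-level reasoning:
-- composing with identity morphisms on either side leaves f unchanged.
example {C : Type*} [Category C] {X Y : C} (f : X ⟶ Y) :
    𝟙 X ≫ f ≫ 𝟙 Y = f := by
  simp  -- closed by the `id_comp` / `comp_id` simp lemmas
```

Even such toy goals force a prover to work against the `Category` type class interface rather than concrete objects, which is the "library-grounded abstraction" the benchmark stresses.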

Computer Science > Logic in Computer Science

arXiv:2512.24796 (cs) [Submitted on 31 Dec 2025 (v1), last revised 25 Feb 2026 (this version, v2)]

Title: LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)

Authors: Rongge Xu, Hui Dai, Yiming Fu, Jiedong Jiang, Tianjiao Nie, Junkai Wang, Holiverse Yang, Zhi-Hao Zhang

Abstract: While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test for structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure in compositional generalization. To overcome this, we evaluate LeanBridge, a retrieval-augmented agent that employs a retrieve-generate-verify loop. LeanBridge achieves a peak success rate of ...

