[2512.24796] LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)
Summary
LeanCat introduces a benchmark suite of formalized category-theory tasks in Lean, highlighting how current models struggle to reason with high-level abstractions, and evaluates LeanBridge, a retrieval-augmented agent that substantially improves performance.
Why It Matters
This research addresses a significant gap in the evaluation of large language models' capabilities in formal theorem proving, particularly in category theory. By establishing LeanCat, it provides a structured way to measure and enhance the performance of AI in abstract reasoning tasks, which is crucial for advancements in both mathematics and software engineering.
Key Takeaways
- LeanCat presents 100 formalized tasks in category theory to benchmark AI reasoning.
- Current AI models collapse from 55.0% success on Easy tasks to 0.0% on High-difficulty tasks, indicating a need for better compositional generalization.
- LeanBridge, a retrieval-augmented agent, improves performance by utilizing dynamic library retrieval and iterative refinement.
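To give a sense of what interface-level reasoning in Lean looks like, here is a small illustrative statement in the style of such tasks, using Mathlib's `CategoryTheory` library. It is not one of the benchmark's actual tasks; the proof leans on the reusable interface lemma `Functor.mapIso` rather than unfolding definitions.

```lean
import Mathlib.CategoryTheory.Functor.Basic

open CategoryTheory

-- Functors preserve isomorphisms: an interface-level fact that follows
-- directly from the high-level API rather than from first principles.
example {C D : Type*} [Category C] [Category D] (F : C ⥤ D)
    {X Y : C} (e : X ≅ Y) : F.obj X ≅ F.obj Y :=
  F.mapIso e
```

Tasks of this kind test whether a model can locate and apply the right abstraction in the library instead of reconstructing the proof from scratch.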
Computer Science > Logic in Computer Science
arXiv:2512.24796 (cs)
[Submitted on 31 Dec 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Authors: Rongge Xu, Hui Dai, Yiming Fu, Jiedong Jiang, Tianjiao Nie, Junkai Wang, Holiverse Yang, Zhi-Hao Zhang
Abstract: While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test for structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure in compositional generalization. To overcome this, we evaluate LeanBridge, a retrieval-augmented agent that employs a retrieve-generate-verify loop. LeanBridge achieves a peak success rate of ...
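The retrieve-generate-verify loop the abstract describes can be sketched schematically. The sketch below is a hypothetical illustration, not the paper's implementation: `retrieve`, `generate`, and `verify` are stand-ins (naive token-overlap ranking, a canned proof template, and a trivial checker) for the library retriever, the LLM, and the Lean kernel respectively.

```python
# Hypothetical sketch of a retrieve-generate-verify loop in the style of
# LeanBridge. All function names and behaviors are illustrative assumptions.

def retrieve(goal, library, k=3):
    """Rank library lemmas by naive token overlap with the goal statement."""
    def score(lemma):
        return len(set(goal.split()) & set(lemma.split()))
    return sorted(library, key=score, reverse=True)[:k]

def generate(goal, context):
    """Stand-in for an LLM call: propose a proof term from retrieved context."""
    return f"by exact {context[0].split()[0]}" if context else "sorry"

def verify(proof):
    """Stand-in for the Lean checker: accept any proof without 'sorry'."""
    return "sorry" not in proof

def prove(goal, library, max_iters=4):
    """Iteratively retrieve context, generate a candidate, and verify it."""
    for _ in range(max_iters):
        context = retrieve(goal, library)
        candidate = generate(goal, context)
        if verify(candidate):
            return candidate
        library = library[1:]  # refinement step: drop the unhelpful top hit
    return None

lemmas = ["Functor.mapIso preserves isomorphisms",
          "Category.assoc associativity of composition"]
print(prove("isomorphisms preserved by functors", lemmas))
```

In the real system the verifier is the Lean kernel itself, so a returned candidate is a machine-checked proof rather than a heuristic guess; the loop's value is that verification failures feed back into retrieval and regeneration.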