[2512.24796] LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)
Summary
LeanCat introduces a benchmark suite of formalized category-theory tasks in Lean, highlighting how current models struggle to reason with high-level abstractions, and evaluates LeanBridge, a retrieval-augmented agent that substantially improves performance.
Why It Matters
This research addresses a significant gap in the evaluation of large language models' capabilities in formal theorem proving, particularly in category theory. By establishing LeanCat, it provides a structured way to measure and enhance the performance of AI in abstract reasoning tasks, which is crucial for advancements in both mathematics and software engineering.
Key Takeaways
- LeanCat presents 100 formalized tasks in category theory to benchmark AI reasoning.
- Current AI models collapse from 55.0% success on Easy tasks to 0.0% on High-difficulty tasks, indicating a need for better compositional generalization.
- LeanBridge, a retrieval-augmented agent, improves performance by utilizing dynamic library retrieval and iterative refinement.
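To give a sense of what interface-level reasoning in Lean looks like, here is a small illustrative statement in the style of such tasks, using Mathlib's `CategoryTheory` library. It is not one of the benchmark's actual tasks; the proof leans on the reusable interface lemma `Functor.mapIso` rather than unfolding definitions.

```lean
import Mathlib.CategoryTheory.Functor.Basic

open CategoryTheory

-- Functors preserve isomorphisms: an interface-level fact that follows
-- directly from the high-level API rather than from first principles.
example {C D : Type*} [Category C] [Category D] (F : C ⥤ D)
    {X Y : C} (e : X ≅ Y) : F.obj X ≅ F.obj Y :=
  F.mapIso e
```

Tasks of this kind test whether a model can locate and apply the right abstraction in the library instead of reconstructing the proof from scratch.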
Computer Science > Logic in Computer Science
arXiv:2512.24796 (cs)
[Submitted on 31 Dec 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Authors: Rongge Xu, Hui Dai, Yiming Fu, Jiedong Jiang, Tianjiao Nie, Junkai Wang, Holiverse Yang, Zhi-Hao Zhang
Abstract: While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test for structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure in compositional generalization. To overcome this, we evaluate LeanBridge, a retrieval-augmented agent that employs a retrieve-generate-verify loop. LeanBridge achieves a peak success rate of ...
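The retrieve-generate-verify loop the abstract describes can be sketched schematically. The sketch below is a hypothetical illustration, not the paper's implementation: `retrieve`, `generate`, and `verify` are stand-ins (naive token-overlap ranking, a canned proof template, and a trivial checker) for the library retriever, the LLM, and the Lean kernel respectively.

```python
# Hypothetical sketch of a retrieve-generate-verify loop in the style of
# LeanBridge. All function names and behaviors are illustrative assumptions.

def retrieve(goal, library, k=3):
    """Rank library lemmas by naive token overlap with the goal statement."""
    def score(lemma):
        return len(set(goal.split()) & set(lemma.split()))
    return sorted(library, key=score, reverse=True)[:k]

def generate(goal, context):
    """Stand-in for an LLM call: propose a proof term from retrieved context."""
    return f"by exact {context[0].split()[0]}" if context else "sorry"

def verify(proof):
    """Stand-in for the Lean checker: accept any proof without 'sorry'."""
    return "sorry" not in proof

def prove(goal, library, max_iters=4):
    """Iteratively retrieve context, generate a candidate, and verify it."""
    for _ in range(max_iters):
        context = retrieve(goal, library)
        candidate = generate(goal, context)
        if verify(candidate):
            return candidate
        library = library[1:]  # refinement step: drop the unhelpful top hit
    return None

lemmas = ["Functor.mapIso preserves isomorphisms",
          "Category.assoc associativity of composition"]
print(prove("isomorphisms preserved by functors", lemmas))
```

In the real system the verifier is the Lean kernel itself, so a returned candidate is a machine-checked proof rather than a heuristic guess; the loop's value is that verification failures feed back into retrieval and regeneration.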