[2601.15812] ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Summary
This article introduces ErrorMap and ErrorAtlas, tools for analyzing and categorizing the failure patterns of large language models (LLMs) to improve model evaluation and debugging.
Why It Matters
Understanding the reasons behind LLM failures is crucial for improving their performance and reliability. ErrorMap and ErrorAtlas provide a framework for identifying specific error types, which can guide developers in refining models and aligning benchmarks with real-world applications.
Key Takeaways
- ErrorMap charts unique failure signatures of LLMs, enhancing debugging.
- ErrorAtlas categorizes model errors, revealing underexplored failure types.
- The approach shifts focus from success metrics to understanding failure causes.
- The tools apply to any model or dataset, enabling comparable analyses across settings.
- Public availability of the taxonomy and code supports ongoing research and development.
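To make the idea of a "failure signature" concrete, here is a purely illustrative Python sketch, not the paper's actual ErrorMap pipeline: it tags each wrong answer with a coarse cause (formatting mismatch, dataset noise, or a genuine content error) and aggregates the tags per model. All function names and heuristics below are hypothetical assumptions for illustration.

```python
# Purely illustrative sketch (NOT the paper's ErrorMap method): classify each
# wrong answer into a coarse failure category, mirroring the idea that a miss
# may reflect formatting, dataset noise, or weak content rather than one cause.

def normalize(text: str) -> str:
    """Strip case, whitespace, and punctuation that often cause 'formatting' misses."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def classify_failure(prediction: str, gold: str) -> str:
    """Return a coarse error label for a (prediction, gold) pair."""
    if prediction == gold:
        return "correct"
    if normalize(prediction) == normalize(gold):
        return "formatting_error"   # right content, wrong surface form
    if gold.strip() == "":
        return "dataset_noise"      # gold answer itself is unusable
    return "content_error"          # genuinely wrong answer

def failure_signature(pairs):
    """Aggregate per-example labels into a model-level 'failure signature'."""
    counts = {}
    for pred, gold in pairs:
        label = classify_failure(pred, gold)
        counts[label] = counts.get(label, 0) + 1
    return counts

pairs = [("42", "42"), (" 42. ", "42"), ("41", "42"), ("yes", "")]
print(failure_signature(pairs))
# → {'correct': 1, 'formatting_error': 1, 'content_error': 1, 'dataset_noise': 1}
```

Comparing such signatures across models and benchmarks is, in spirit, what lets an atlas-style taxonomy reveal which failure types dominate where.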
Computer Science > Artificial Intelligence
arXiv:2601.15812 (cs)
[Submitted on 22 Jan 2026 (v1), last revised 17 Feb 2026 (this version, v2)]
Authors: Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen
Abstract: Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details...