IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
Summary
IBM Research and UC Berkeley examine why enterprise agents fail at IT automation, using ITBench and MAST to diagnose failures and improve reliability.
Why It Matters
Understanding why enterprise agents fail is crucial for developing more reliable AI systems in IT automation. This research provides actionable insights that can help engineers enhance the performance of AI agents, ultimately leading to better operational efficiency in critical IT tasks.
Key Takeaways
- MAST provides a structured approach to diagnosing agent failures.
- Frontier models tend to fail at isolated bottlenecks, while large open models suffer cascading failures.
- External verification and clear termination conditions are essential for robust agent performance; a minimal sketch of both guardrails follows this list.
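To make the last point concrete, here is a minimal sketch of a tool loop with both guardrails. This is not the paper's implementation: `verify`, `propose_action`, and `execute` are hypothetical stand-ins for an external checker, the LLM policy, and the tool runtime, and the probe string is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

MAX_STEPS = 10  # explicit termination condition: hard bound on the tool loop

@dataclass
class StepResult:
    action: str
    observation: str

def verify(result: StepResult) -> bool:
    # External verification: re-check the environment instead of trusting
    # the agent's self-reported success. The probe string is a hypothetical
    # stand-in for, e.g., re-querying the alerting system.
    return "alert_resolved" in result.observation

def run_agent(task: str,
              propose_action: Callable[[str], str],
              execute: Callable[[str], StepResult]) -> str:
    for step in range(MAX_STEPS):
        action = propose_action(task)   # LLM policy picks the next tool call
        result = execute(action)        # run it against the environment
        if verify(result):              # independent check, not model self-report
            return f"resolved in {step + 1} steps"
    # Clear termination: surface a failure instead of looping forever
    return "terminated: step budget exhausted without verified resolution"
```

The design point is that success is decided by re-inspecting the environment, never by the model's own summary, and that the loop always ends with an explicit outcome.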
Published February 18, 2026
Authors: Ayhan Sebin, Saurabh Jha, Rohan Arora, Daby Sow, Mert Cemri, Melissa Pan, Ion Stoica
Resources: ITBench HF Space · ITBench HF Dataset · MAST HF Dataset · ITBench Github · MAST Github

IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation: long-horizon tool loops spanning incident triage, log and metric queries, and Kubernetes actions. Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To open up this black box, we applied MAST (Multi-Agent System Failure Taxonomy), an emerging practice for diagnosing agentic reliability. By using MAST to analyze ITBench, the industry benchmark for SRE, Security, and FinOps automation, we turned raw execution traces into structured failure signatures, revealing exactly what broke and how to fix it. We annotated 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.

Key Findings
- Frontier models like Gemini-3-Flash fail cleanly (2.6 failure modes/trace), typically hitting isolated bottlenecks such as verification.
- Large open models like GPT-OSS-120B suffer from cascading failure modes (5.3 failure modes/trace...
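To make the failure-modes-per-trace numbers above concrete, here is a minimal sketch of the aggregation step, assuming annotated traces are available as per-trace lists of MAST labels. The trace IDs and mode names below are invented for illustration, not drawn from the released dataset; the real MAST taxonomy (Cemri et al.) groups failure modes into specification, inter-agent misalignment, and verification categories.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-trace MAST annotations (labels are illustrative).
annotations = {
    "trace_001": ["step_repetition", "incorrect_verification"],
    "trace_002": ["premature_termination"],
    "trace_003": ["step_repetition", "information_withholding",
                  "incorrect_verification", "premature_termination"],
}

def failure_signature(traces: dict[str, list[str]]) -> Counter:
    """Collapse per-trace labels into a model-level failure signature."""
    sig = Counter()
    for modes in traces.values():
        sig.update(modes)
    return sig

def modes_per_trace(traces: dict[str, list[str]]) -> float:
    """Average failure modes per trace, the headline metric above
    (e.g., 2.6 for Gemini-3-Flash vs. 5.3 for GPT-OSS-120B)."""
    return mean(len(modes) for modes in traces.values())

print(failure_signature(annotations).most_common(3))
print(f"{modes_per_trace(annotations):.1f} failure modes/trace")
```

A low average with one dominant mode matches the "clean, isolated bottleneck" pattern, while a high average spread across many modes matches the cascading-failure pattern.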