[2602.20292] Quantifying the Expectation-Realisation Gap for Agentic AI Systems
Summary
This article examines the expectation-realisation gap in agentic AI systems, revealing discrepancies between anticipated productivity gains and actual outcomes across various domains, including software engineering and clinical documentation.
Why It Matters
Understanding the expectation-realisation gap is crucial for stakeholders in AI deployment, as it highlights the need for realistic assessments of AI capabilities and the importance of integrating human oversight into planning frameworks. This research can inform better decision-making in AI investments and implementations.
Key Takeaways
- Agentic AI systems often underperform compared to initial expectations.
- In a controlled software-development trial, experienced developers were slowed down by AI tools despite expecting a speedup.
- Clinical documentation tools deliver far smaller time savings than vendors promise.
- The gap is influenced by integration challenges and measurement mismatches.
- Structured planning frameworks that record explicit, quantified benefit expectations are needed; a hypothetical sketch of such a record follows this list.
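The last takeaway suggests recording expectations in a form that can be checked after deployment. As a purely illustrative sketch, not the paper's framework, the class and field names below are invented assumptions, a pre-deployment record might pair each quantified benefit claim with the measurement construct it will later be judged against:

```python
from dataclasses import dataclass


@dataclass
class BenefitExpectation:
    """A pre-deployment expectation record (hypothetical schema).

    Field names are invented for illustration; the paper's actual
    planning framework is only partially described in this summary.
    """
    system: str                  # what is being deployed
    metric: str                  # measurement construct the claim is judged by
    expected_change_pct: float   # quantified benefit claim; positive = improvement
    realised_change_pct: float | None = None  # filled in after deployment

    def gap_pp(self) -> float | None:
        """Expectation-realisation gap in percentage points, once measured."""
        if self.realised_change_pct is None:
            return None
        return self.expected_change_pct - self.realised_change_pct


# Pre-registration: a quantified claim with no realised value yet.
claim = BenefitExpectation(
    system="AI coding assistant",
    metric="developer task completion time",
    expected_change_pct=24.0,
)
print(claim.gap_pp())  # None until a post-deployment measurement exists
```

Once a post-deployment measurement exists, setting realised_change_pct makes the gap directly computable and auditable rather than a matter of retrospective impression.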
arXiv:2602.20292 (cs) [Submitted on 23 Feb 2026]
Subject: Computer Science > Software Engineering
Title: Quantifying the Expectation-Realisation Gap for Agentic AI Systems
Authors: Sebastian Lobentanzer
Abstract: Agentic AI systems are deployed with expectations of substantial productivity gains, yet rigorous empirical evidence reveals systematic discrepancies between pre-deployment expectations and post-deployment outcomes. We review controlled trials and independent validations across software engineering, clinical documentation, and clinical decision support to quantify this expectation-realisation gap. In software development, experienced developers expected a 24% speedup from AI tools but were slowed by 19%, a 43 percentage-point calibration error. In clinical documentation, vendor claims of multi-minute time savings contrast with measured reductions of less than one minute per note, and one widely deployed tool showed no statistically significant effect. In clinical decision support, externally validated performance falls substantially below developer-reported metrics. These shortfalls are driven by workflow integration friction, verification burden, measurement construct mismatches, and systematic heterogeneity in treatment effects. The evidence motivates structured planning frameworks that require explicit, quantified benefit expectations with ...
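To make the headline number concrete: the 43-point calibration error in the abstract is simply the signed difference between the expected and realised effects, using the paper's reported figures (the symbol names are our shorthand, not necessarily the paper's notation):

\[
\text{gap} = \text{expected} - \text{realised} = (+24) - (-19) = 43 \ \text{percentage points}.
\]

The sign convention matters here: an expected speedup of 24% and an observed slowdown of 19% are on opposite sides of zero, which is why the gap exceeds either number on its own.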