[2503.14499] Measuring AI Ability to Complete Long Software Tasks
Summary
The paper introduces a new metric, the 50%-task-completion time horizon, for measuring how long a software task AI systems can complete, and reports that this horizon has grown rapidly since 2019.
Why It Matters
Understanding AI's ability to handle complex tasks is crucial for industries relying on automation. This research provides a framework for assessing AI performance against human benchmarks, highlighting the rapid evolution of AI capabilities and potential impacts on the workforce and software development.
Key Takeaways
- A new metric, the 50%-task-completion time horizon, expresses AI capability as the time humans typically take to complete tasks that AI models can complete with a 50% success rate.
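- Current frontier models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes: they succeed about half the time on tasks that take human experts roughly 50 minutes.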
- The frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024.
- Improved reliability and logical reasoning are key drivers of AI's enhanced capabilities.
- Extrapolating this trend suggests that, within five years, AI could automate software tasks that currently take humans a month.
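The arithmetic behind the doubling and extrapolation takeaways can be sketched as follows. This is a rough illustration, not the paper's method: it assumes the horizon starts at 50 minutes, keeps doubling every seven months with no change in pace, and that a "month" of human work means 167 working hours. That last figure and the function names are assumptions made for this example.

```python
import math


def horizon_after(months, current_minutes=50.0, doubling_months=7.0):
    """Time horizon (in minutes) after `months` of steady exponential growth."""
    return current_minutes * 2 ** (months / doubling_months)


def months_to_reach(target_minutes, current_minutes=50.0, doubling_months=7.0):
    """Months until the horizon reaches `target_minutes` at a fixed doubling time."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


# Assumption for illustration: one month of human work = 167 working hours.
WORK_MONTH_MINUTES = 167 * 60

months = months_to_reach(WORK_MONTH_MINUTES)
print(f"{months:.1f} months (~{months / 12:.1f} years)")
# prints "53.5 months (~4.5 years)"
```

Under these assumptions, about 7.6 doublings are needed, which lands a little under five years. A shorter doubling time, as the possible 2024 acceleration would imply, shortens the estimate proportionally.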
Computer Science > Artificial Intelligence
arXiv:2503.14499 (cs)
[Submitted on 18 Mar 2025 (v1), last revised 25 Feb 2026 (this version, v3)]
Title: Measuring AI Ability to Complete Long Software Tasks
Authors: Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
Abstract: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons se...