[2512.18470] SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Computer Science > Software Engineering
arXiv:2512.18470 (cs)
[Submitted on 20 Dec 2025 (v1), last revised 4 Apr 2026 (this version, v5)]

Title: SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Authors: Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

Abstract: Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO, versus the 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon...