[2512.18470] SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Computer Science > Software Engineering
arXiv:2512.18470 (cs)
[Submitted on 20 Dec 2025 (v1), last revised 4 Apr 2026 (this version, v5)]

Title: SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Authors: Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

Abstract: Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO, versus the 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon...