[2603.03823] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Computer Science > Software Engineering
arXiv:2603.03823 (cs)
[Submitted on 4 Mar 2026]

Title: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Authors: Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao

Abstract: Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. In the real world, however, mature software develops through complex requirement changes and long-term feature iteration, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built on the Continuous Integration loop, which aims to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to resolve these tasks systematically through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can...
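
To make the CI-loop evaluation concrete, the sketch below shows one plausible shape such a harness could take: each task carries an ordered sequence of requirement changes, and the agent gets repeated analysis/coding rounds per requirement until the CI suite passes. This is a minimal, hypothetical illustration only; the names Task, run_agent, run_ci, and evaluate are assumptions for exposition and are not taken from the paper's released code.

    """Minimal sketch of a CI-loop evaluation harness (hypothetical, not the authors' code)."""
    from dataclasses import dataclass


    @dataclass
    class Task:
        repo: str                # repository under evaluation
        requirements: list[str]  # ordered requirement changes (avg. 71 commits per task)


    def run_agent(repo: str, requirement: str) -> str:
        """Placeholder: ask an LLM agent to produce a patch for one requirement."""
        return f"patch for: {requirement}"


    def run_ci(repo: str, patch: str) -> bool:
        """Placeholder: apply the patch and run the repository's CI test suite."""
        return True  # stub; a real harness would build the repo and execute its tests


    def evaluate(task: Task, max_rounds: int = 30) -> float:
        """Resolve each requirement via repeated agent/CI rounds; return the pass rate."""
        passed = 0
        for req in task.requirements:
            for _ in range(max_rounds):       # dozens of analysis and coding iterations
                patch = run_agent(task.repo, req)
                if run_ci(task.repo, patch):  # CI green -> requirement resolved
                    passed += 1
                    break
        return passed / len(task.requirements)


    if __name__ == "__main__":
        demo = Task(repo="example/repo", requirements=["add feature A", "fix bug B"])
        print(f"pass rate: {evaluate(demo):.2f}")

In a real harness the stubbed run_ci step would check out the repository state for the current commit, apply the agent's patch, and run the project's actual test suite, so that long-term maintainability is measured against the repository's own integration signal rather than a one-shot correctness check.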