[2603.25764] Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
Computer Science > Software Engineering
arXiv:2603.25764 (cs) [Submitted on 26 Mar 2026]
Title: Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
Authors: Aman Mehta

Abstract: As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves similar early strategic agreement as Claude (diverging at step 3....
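The abstract reports variance as a coefficient of variation (CV). As a minimal sketch of how such a figure could be computed from per-run scores, assuming CV is defined as the population standard deviation divided by the mean (the paper may use the sample standard deviation instead), and using illustrative run counts rather than the paper's data:

```python
# Illustrative sketch: coefficient of variation (CV) over repeated runs.
# Assumption: CV = population std dev / mean, as a percentage.
# The run scores below are hypothetical, not taken from the paper.
import statistics

def coefficient_of_variation(scores):
    """Return the CV of a list of scores as a percentage."""
    mean = statistics.mean(scores)
    return statistics.pstdev(scores) / mean * 100

# Hypothetical resolved-task counts over 5 runs of the same 10 tasks.
runs = [6, 5, 6, 6, 5]
cv = coefficient_of_variation(runs)
print(f"CV: {cv:.1f}%")  # lower CV = more run-to-run consistency
```

A lower CV indicates the agent resolves a similar number of tasks on every run; a higher CV indicates its outcomes fluctuate between identical attempts.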