[2602.19008] Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Summary
This article explores the reliability failures of language agents in long-horizon tasks, attributing these failures to deviations from canonical solution paths rather than a lack of capability.
Why It Matters
Understanding the causal mechanisms behind agent failures is crucial for improving AI reliability. This research highlights that enhancing agent performance requires more than just scaling capabilities; it necessitates monitoring adherence to established solution paths.
Key Takeaways
- Agent failures in long-horizon tasks are often due to stochastic deviations from canonical paths.
- Successful task completion is significantly correlated with adherence to these canonical paths.
- A monitoring intervention can improve success rates by restarting runs whose adherence metrics indicate drift from the canonical path.
Paper Details
Computer Science > Computation and Language — arXiv:2602.19008 (cs)
Submitted on 22 Feb 2026
Authors: Wilson Y. Lee
Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs), and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model$\times$task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs ($+$0.060 Jaccard, $p<0.0001$, $n=488$ units, 95% CI [+0.043, +0.077]). This...
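The adherence idea in the abstract can be sketched in code: approximate the canonical path as the set of tool invocations shared across successful runs, score a new run by its Jaccard similarity to that set, and restart runs whose score falls below a threshold. This is a minimal illustrative sketch, assuming set-valued tool traces; the function names and the 0.5 threshold are hypothetical and not taken from the paper.

```python
# Illustrative sketch of canonical-path adherence (names and threshold are
# assumptions, not the paper's implementation).

def canonical_path(successful_runs: list[set[str]]) -> set[str]:
    """Approximate the canonical path as the tools invoked in
    every successful run (their convergent core)."""
    canonical = set(successful_runs[0])
    for run in successful_runs[1:]:
        canonical &= run
    return canonical

def jaccard_adherence(run: set[str], canonical: set[str]) -> float:
    """Jaccard similarity between a run's tool set and the canonical set."""
    union = run | canonical
    return len(run & canonical) / len(union) if union else 1.0

def should_restart(run: set[str], canonical: set[str],
                   threshold: float = 0.5) -> bool:
    """Monitoring intervention: flag a run for restart when its
    adherence drops below the (hypothetical) threshold."""
    return jaccard_adherence(run, canonical) < threshold

# Example: two successful traces define the core; a drifted run scores low.
successes = [{"search", "read_file", "write_file"},
             {"search", "read_file", "write_file", "list_dir"}]
core = canonical_path(successes)        # {"search", "read_file", "write_file"}
drifted = {"search", "shell_exec"}
print(jaccard_adherence(drifted, core))  # 1 shared tool / 4 in union = 0.25
print(should_restart(drifted, core))     # True
```

In practice one would score adherence incrementally as a trajectory unfolds rather than only at the end, so low-adherence runs can be restarted early.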