[2602.19008] Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks


arXiv - Machine Learning · 4 min read

Summary

This article explores the reliability failures of language agents in long-horizon tasks, attributing these failures to deviations from canonical solution paths rather than a lack of capability.

Why It Matters

Understanding the causal mechanisms behind agent failures is crucial for improving AI reliability. This research highlights that enhancing agent performance requires more than just scaling capabilities; it necessitates monitoring adherence to established solution paths.

Key Takeaways

  • Agent failures in long-horizon tasks are often due to stochastic deviations from canonical paths.
  • Successful task completion is significantly correlated with adherence to these canonical paths.
  • A monitoring intervention that restarts runs with low adherence to the canonical path can improve success rates.
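The restart intervention in the last takeaway can be sketched as a simple threshold check: once a run has taken a few steps, compare its tool set against the canonical path and flag it for restart if adherence is too low. All names and thresholds here are illustrative assumptions, not from the paper's code.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two tool sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def should_restart(tools_so_far: set, canonical: set,
                   threshold: float = 0.3, min_tools: int = 3) -> bool:
    """Flag a run for early restart once it has used enough tools to
    judge, and its tool set has drifted too far from the canonical path.
    threshold/min_tools are hypothetical tuning parameters."""
    if len(tools_so_far) < min_tools:
        return False  # too early to judge adherence reliably
    return jaccard(tools_so_far, canonical) < threshold


canonical = {"read_file", "edit_file", "run_tests"}
on_track = {"read_file", "edit_file", "run_tests"}
drifting = {"web_search", "list_dir", "read_file", "fetch_url"}
print(should_restart(on_track, canonical))  # False: high adherence
print(should_restart(drifting, canonical))  # True: adherence below threshold
```

In a real monitoring loop this check would run after each tool call, trading the cost of abandoned steps against the expected gain from a fresh sample.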

Computer Science > Computation and Language
arXiv:2602.19008 (cs) [Submitted on 22 Feb 2026]

Title: Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Authors: Wilson Y. Lee

Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs), and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model×task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs (+0.060 Jaccard, p < 0.0001, n = 488 units, 95% CI [+0.043, +0.077]). This...
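The abstract defines the canonical path as a convergent set of tool invocations shared across successful runs, and measures adherence with Jaccard similarity. A minimal sketch of how such a metric could be computed follows; the majority-vote construction of the canonical set is one plausible operationalization, not necessarily the paper's exact method.

```python
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two tool sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def canonical_path(successful_runs: list) -> set:
    """Approximate the canonical path as the tools invoked by a majority
    of successful runs on the same task (an illustrative assumption)."""
    counts = Counter(tool for run in successful_runs for tool in run)
    majority = len(successful_runs) / 2
    return {tool for tool, c in counts.items() if c > majority}


# Toy example: three successful runs of one task, plus one failed run.
runs = [
    {"read_file", "grep", "edit_file", "run_tests"},
    {"read_file", "edit_file", "run_tests"},
    {"read_file", "edit_file", "run_tests", "git_diff"},
]
canon = canonical_path(runs)  # {"read_file", "edit_file", "run_tests"}
failed = {"read_file", "web_search", "edit_file"}
print(jaccard(runs[0], canon))  # 0.75: successful run, high adherence
print(jaccard(failed, canon))   # 0.5: failed run drifted from the path
```

The paper's +0.060 Jaccard gap is the within-unit difference between exactly these two kinds of scores, averaged over units where the same model both succeeds and fails.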

Related Articles

AI Agents

Considering NeurIPS submission [D]

Wondering if it's worth submitting a paper I'm working on to NeurIPS. I have a formal mathematical proof for convergence of a novel agentic sys...

Reddit - Machine Learning · 1 min ·
AI Agents

Agent frameworks waste ~350,000+ tokens per session resending static files. 95% reduction benchmarked.

Measured the actual token waste on a local Qwen 3.5 122B setup. The numbers are unreal. Found a compile-time approach that cuts query con...

Reddit - Artificial Intelligence · 1 min ·
AI Agents

OpenClaw gives users yet another reason to be freaked out about security - Ars Technica

The viral AI agentic tool let attackers silently gain unauthenticated admin access.

Ars Technica - AI · 5 min ·
Robotics

What happens when you let AI agents run a sitcom 24/7 with zero human involvement

Ran an experiment — gave AI agents full control over writing, character creation, and performing a sitcom. Left it running nonstop for ov...

Reddit - Artificial Intelligence · 1 min ·

