[2603.27343] Beyond Completion: Probing Cumulative State Tracking to

[2603.27343] Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance

arXiv - AI March 31, 2026 3 min read

About this article

Abstract page for arXiv paper 2603.27343: Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance

Computer Science > Artificial Intelligence arXiv:2603.27343 (cs) [Submitted on 28 Mar 2026] Title:Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance Authors:Dengzhe Hou, Lingyu Jiang, Deng Li, Zirui Li, Fangzhou Lin, Kazunori D Yamada View a PDF of the paper titled Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance, by Dengzhe Hou and 5 other authors View PDF HTML (experimental) Abstract:Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF-AM predicts agent performance with Kendall's tau = 0.612 (p < 0.001, 95% CI [0.360, 0.814]); exploratory partial-tau analyses suggest this signal persists after controlling for completion score and model scale. Three construct-isolation ablations (K = 1 control, non-arithmetic ceiling, yoked cancellation) support the interpretation that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary difficulty source. K-calibration keeps the probe in a discriminative range where p...

Originally published on March 31, 2026. Curated by AI News.

Llms

[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

The problem If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an after...

Reddit - Machine Learning · 1 min · 44 minutes ago

Llms

I have been coding for 11 years and I caught myself completely unable to debug a problem without AI assistance last month. That scared me more than anything I have seen in this industry.

I want to be honest about something that happened to me because I think it is more common than people admit. Last month I hit a bug in a ...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Llms

OpenClaw security checklist: practical safeguards for AI agents

Here is one of the better quality guides on the ensuring safety when deploying OpenClaw: https://chatgptguide.ai/openclaw-security-checkl...

Reddit - Artificial Intelligence · 1 min · about 8 hours ago

Llms

I let Gemini in Google Maps plan my day and it went surprisingly well | The Verge

Gemini in Google Maps is a surprisingly useful way to explore new territory.

The Verge - AI · 11 min · about 9 hours ago

[2603.27343] Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance

About this article

Related Articles

[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

I have been coding for 11 years and I caught myself completely unable to debug a problem without AI assistance last month. That scared me more than anything I have seen in this industry.

OpenClaw security checklist: practical safeguards for AI agents

I let Gemini in Google Maps plan my day and it went surprisingly well | The Verge

No comments

Stay updated with AI News