[2602.07672] Debugging code world models
Summary
The paper studies Code World Models (CWMs), language models that simulate program execution by predicting runtime state, and identifies the sources of their errors from two perspectives: local semantic execution and long-horizon state tracking.
Why It Matters
CWMs offer an execution-based alternative to natural-language chain-of-thought reasoning, letting a model verify code behavior internally. Pinpointing where and why they fail can guide future improvements in AI programming tools, improving the efficiency and accuracy of model-assisted software development.
Key Takeaways
- CWMs simulate program execution by predicting runtime states.
- Errors in CWMs primarily arise from token-budget exhaustion on token-intensive execution traces and from failures concentrated in string-valued state, attributed to subword tokenization.
- Long-horizon degradation in CWMs is mainly due to incorrect action generation.
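To make the first takeaway concrete, here is a minimal sketch (not the paper's actual trace format, which is not given here) of the kind of step-by-step execution trace a CWM is trained to predict: the explicit runtime state after every executed command. Note how even a four-command program yields a full state snapshot per step, which is why dense traces exhaust token budgets on long execution histories.

```python
# Build a ground-truth execution trace: run each command and
# snapshot the complete variable state afterwards.
commands = [
    "x = 3",
    "s = 'ab'",
    "x = x * 2",
    "s = s + 'c'",
]

state = {}   # runtime state, updated in place by exec()
trace = []   # list of (command, state-snapshot) pairs
for cmd in commands:
    exec(cmd, {}, state)              # execute one command
    trace.append((cmd, dict(state)))  # record state after it

for cmd, snapshot in trace:
    print(f"{cmd!r} -> {snapshot}")
# The final snapshot is {'x': 6, 's': 'abc'}.
```

A CWM's task is to emit those snapshots itself rather than run the code, so every step costs tokens proportional to the size of the live state.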
Computer Science > Software Engineering
arXiv:2602.07672 (cs)
[Submitted on 7 Feb 2026 (v1), last revised 14 Feb 2026 (this version, v2)]
Title: Debugging code world models
Authors: Babak Rahmani
Abstract: Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural-language chain-of-thought reasoning. However, the sources of CWMs' errors and the nature of their limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, revealing dense runtime state produces token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures concentrate disproportionately in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long-horizon behavior, we use a controlled permutation-tracking benchmark that isolates state propagation under action execution. We show that long-horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground-truth commands, a Transformer-...
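The permutation-tracking setup can be sketched as follows. This is a hedged illustration, not the paper's benchmark code: the state is a permutation of n items, each action swaps two positions, and the task is to propagate the state through the action sequence. It also shows why incorrect action generation dominates long-horizon error: a single wrong action early on corrupts every subsequent state, whereas with ground-truth actions the propagation itself is mechanical.

```python
def apply_actions(state, actions):
    """Propagate a permutation state through (i, j) swap actions."""
    state = list(state)
    for i, j in actions:
        state[i], state[j] = state[j], state[i]
    return state

start = list(range(5))                  # initial state [0, 1, 2, 3, 4]
actions = [(0, 1), (2, 4), (1, 3)]      # ground-truth action sequence

print(apply_actions(start, actions))    # final state under correct actions

# One wrong action at step 1 ((0, 2) instead of (0, 1)) changes
# every state from that point on -- errors compound over the horizon.
corrupted = [(0, 2)] + actions[1:]
print(apply_actions(start, corrupted))
```

Under ground-truth actions the final state is [1, 3, 4, 0, 2]; the corrupted sequence instead ends at [2, 3, 4, 1, 0], diverging at every later step.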