[2602.20502] ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
Summary
The paper presents ActionEngine, a training-free framework that moves GUI agents from reactive, step-by-step execution to programmatic planning backed by a state-machine memory, with the goal of improving both efficiency and accuracy in task execution.
Why It Matters
Current GUI agents call a vision language model at every step, so cost and latency grow with the number of reasoning steps, and accuracy is limited because the agent keeps no persistent memory of pages it has already visited. ActionEngine's two-agent architecture targets both problems, which makes it relevant to developers and researchers working on GUI automation in AI and machine learning.
Key Takeaways
- ActionEngine improves GUI agent efficiency by using state machine memory.
- The paper reports a 95% task success rate on Reddit tasks, with reduced cost and latency.
- It employs a Crawling Agent for memory construction and an Execution Agent for task execution.
- When an action fails, a vision-based re-grounding fallback repairs it and updates the memory, which keeps the agent robust to interface changes.
- The approach combines programmatic planning with localized action validation.
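The takeaways above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class `StateMachineMemory`, the functions `execute`, `perform`, and `reground`, and the page and action names are all hypothetical. The memory is modeled as a graph of page states connected by GUI actions, planning is a plain breadth-first search over that graph (the paper's Execution Agent synthesizes full Python programs instead), and the vision-based re-grounding step is replaced by a stub.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StateMachineMemory:
    """Hypothetical memory: page states are nodes, GUI actions are labeled edges."""
    # transitions[state][action] -> next state
    transitions: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_transition(self, src: str, action: str, dst: str) -> None:
        """Record that performing `action` on page `src` leads to page `dst`."""
        self.transitions.setdefault(src, {})[action] = dst
        self.transitions.setdefault(dst, {})

    def plan(self, start: str, goal: str) -> Optional[List[str]]:
        """Breadth-first search for the shortest action sequence from start to goal."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, actions = queue.popleft()
            if state == goal:
                return actions
            for action, nxt in self.transitions.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, actions + [action]))
        return None  # goal is not reachable in the recorded memory


def execute(
    memory: StateMachineMemory,
    start: str,
    plan: List[str],
    perform: Callable[[str], None],
    reground: Callable[[str, str], str],
) -> List[str]:
    """Run a precomputed plan; on failure, repair the action and update the memory."""
    state = start
    executed = []
    for action in plan:
        nxt = memory.transitions[state][action]
        try:
            perform(action)
        except RuntimeError:
            # Fallback: ask the (stubbed) re-grounding step for a replacement action.
            repaired = reground(state, action)
            perform(repaired)
            # Replace the stale edge so later plans use the repaired action.
            del memory.transitions[state][action]
            memory.transitions[state][repaired] = nxt
            action = repaired
        executed.append(action)
        state = nxt
    return executed


# Offline phase (stand-in for the Crawling Agent): record observed transitions.
memory = StateMachineMemory()
memory.add_transition("home", "click:forums", "forum_list")
memory.add_transition("forum_list", "click:create_post", "post_form")
memory.add_transition("home", "click:profile", "profile")

# Online phase (stand-in for the Execution Agent): plan once, then execute.
plan = memory.plan("home", "post_form")


def perform(action: str) -> None:
    # Simulated browser: the "create_post" button has been renamed on the live site.
    if action == "click:create_post":
        raise RuntimeError("element not found")


def reground(state: str, action: str) -> str:
    # Stub for vision-based re-grounding; a real system would inspect a screenshot.
    return "click:new_post"


executed = execute(memory, "home", plan, perform, reground)
print(plan)      # ['click:forums', 'click:create_post']
print(executed)  # ['click:forums', 'click:new_post']
```

In this sketch the whole action sequence is computed up front from memory, and the fallback runs only for the single action that failed, which is the sense in which planning is programmatic and repair is local.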
Computer Science > Artificial Intelligence
arXiv:2602.20502 (cs)
[Submitted on 24 Feb 2026]
Title: ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
Authors: Hongbin Zhong, Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Kexin Rong, Suman Nath
Abstract: Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision language models--taking a screenshot, reasoning about the next action, executing it, then repeating on the new page--resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks fr...