[2502.08047] WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
Summary
WorldGUI introduces a benchmark for evaluating desktop GUI automation agents under varied initial conditions, addressing the challenges of real-world applications.
Why It Matters
This research exposes a key limitation of current GUI agents: they struggle to adapt when the environment deviates from its default initial state, and it provides a framework for improving their robustness. By establishing a benchmark that reflects realistic mid-workflow user interactions, it lays the groundwork for more dependable AI-driven automation tools across diverse applications.
Key Takeaways
- WorldGUI benchmark evaluates GUI agents under diverse initial conditions.
- Current state-of-the-art agents struggle with non-default environments.
- The WorldGUI-Agent framework enhances planning and execution reliability.
- Research aims to improve adaptability of GUI automation tools.
- The benchmark and code are publicly available for further research.
Computer Science > Artificial Intelligence
arXiv:2502.08047 (cs)
[Submitted on 12 Feb 2025 (v1), last revised 22 Feb 2026 (this version, v4)]
Title: WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
Authors: Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou
Abstract: Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execut...
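To make the evaluation protocol concrete, here is a minimal sketch of how one might score an agent across multiple initial states per task, in the spirit of the setup the abstract describes. All names here (`Task`, `DefaultOnlyAgent`, `evaluate`) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical harness: score a GUI agent per task, averaged over
# systematically varied initial states of the application.
from dataclasses import dataclass, field


@dataclass
class Task:
    instruction: str
    # Each variant perturbs the app's starting configuration,
    # e.g. a settings panel already open or a step already done.
    initial_states: list = field(default_factory=list)


def evaluate(agent, tasks):
    """Return per-task success rate over each task's initial-state variants."""
    results = {}
    for task in tasks:
        successes = 0
        for state in task.initial_states:
            # The agent must plan from this (possibly non-default) state.
            successes += int(agent.run(task.instruction, state))
        results[task.instruction] = successes / len(task.initial_states)
    return results


# Toy agent that only succeeds from the default state, mimicking the
# fragility that task-state variability is designed to expose.
class DefaultOnlyAgent:
    def run(self, instruction, state):
        return state == "default"


tasks = [Task("mute notifications",
              ["default", "settings_open", "already_muted"])]
print(evaluate(DefaultOnlyAgent(), tasks))
```

An agent that handles only the canonical start scores 1/3 here, which is the kind of gap between default-state and varied-state performance the benchmark is built to surface.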