[2602.13559] OpAgent: Operator Agent for Web Navigation
Summary
The paper presents OpAgent, an online reinforcement learning agent for web navigation that achieves a state-of-the-art success rate of 71.6%.
Why It Matters
As web environments become increasingly complex, traditional methods for training autonomous agents fall short. OpAgent addresses these challenges by employing a robust online learning framework that adapts to dynamic web conditions, enhancing the capabilities of AI agents in real-world applications.
Key Takeaways
- OpAgent utilizes hierarchical multi-task fine-tuning for enhanced instruction-following.
- The model employs online reinforcement learning to adapt in real-time to web environments.
- A hybrid reward mechanism effectively addresses credit assignment challenges in navigation tasks.
- OpAgent's modular framework improves error recovery and self-correction.
- The model achieves a significant performance improvement over existing baselines.
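The paper does not spell out the hybrid reward at this level of detail; as a hedged illustration only, a reward that blends a dense per-step progress signal with a sparse task-outcome bonus (one common way to ease credit assignment in long navigation episodes) might look like the following. The weights, signal names, and length discounting here are assumptions, not OpAgent's actual formulation.

```python
def hybrid_reward(step_progress, task_success, step_index, max_steps,
                  w_step=0.3, w_outcome=1.0):
    """Blend a dense per-step signal with a sparse outcome bonus.

    step_progress: float in [0, 1], e.g. fraction of subgoals completed
    task_success:  bool, whether the final instruction was fulfilled
    step_index:    index of the current step in the episode
    max_steps:     episode step budget
    """
    # Dense shaping term: rewards intermediate navigation progress,
    # giving the learner a signal before the episode ends.
    dense = w_step * step_progress

    # Sparse outcome term: paid only on success, mildly discounted by
    # episode length so shorter successful trajectories score higher.
    if task_success:
        sparse = w_outcome * (1.0 - step_index / (2 * max_steps))
    else:
        sparse = 0.0

    return dense + sparse
```

For example, a successful 10-step episode out of a 20-step budget with half of its subgoals completed would earn `0.3 * 0.5 + 1.0 * 0.75 = 0.9` under these assumed weights.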
Computer Science > Artificial Intelligence
arXiv:2602.13559 (cs) [Submitted on 14 Feb 2026]
Title: OpAgent: Operator Agent for Web Navigation
Authors: Yuyu Guo, Wenjie Yang, Siyuan Yang, Ziyang Liu, Cheng Chen, Yuan Wei, Yun Hu, Yang Huang, Guoliang Hao, Dongsheng Yuan, Jianming Wang, Xin Chen, Hang Yu, Lei Lei, Peng Di
Abstract: To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatile nature of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets. However, these methods suffer from severe distributional shifts, as offline trajectories fail to capture the stochastic state transitions and real-time feedback of unconstrained wide web environments. In this paper, we propose a robust Online Reinforcement Learning WebAgent, designed to optimize its policy through direct, iterative interactions with unconstrained wide websites. Our approach comprises three core innovations: 1) Hierarchical Multi-Task Fine-tuning: We curate a comprehensive mixture of datasets categorized by functional primitives -- Planning, Acting, and Grounding -- establishing a Vision-Language Model (VLM) with strong instruction-following capabilities for Web GUI tasks. 2) Online Agentic RL in the Wild: We develop an online inte...
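The abstract describes optimizing the policy through direct, iterative interaction with live websites rather than static datasets. A minimal sketch of what such an online loop looks like is below; the environment, action space, and update hook are hypothetical stand-ins, not OpAgent's actual architecture.

```python
import random

class ToyWebEnv:
    """Stand-in for a live website: episodic tasks with a step budget.

    A real setup would drive a browser and expose page observations
    (screenshot, DOM); this toy keeps the same reset/step interface.
    """
    def reset(self):
        self.steps = 0
        return {"url": "about:blank", "dom": []}

    def step(self, action):
        self.steps += 1
        done = self.steps >= 5 or action == "submit"
        # Sparse outcome reward: paid only when the task finishes via submit.
        reward = 1.0 if (done and action == "submit") else 0.0
        return {"url": "page", "dom": []}, reward, done

def run_online_rl(env, choose_action, update_policy, episodes=3):
    """Collect trajectories from live interaction, updating after each episode."""
    for _ in range(episodes):
        obs, trajectory, done = env.reset(), [], False
        while not done:
            action = choose_action(obs)
            next_obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs
        # In practice this would be a policy-gradient or similar RL update;
        # here it is an injected callback so the loop stays self-contained.
        update_policy(trajectory)
```

The key contrast with offline RL is that each trajectory is generated by the current policy against the live environment, so the training distribution tracks the stochastic transitions the abstract highlights.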