[2602.12544] Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

arXiv - AI

Summary

This paper presents a scalable pipeline for generating high-quality training data for web agents, introducing a novel evaluation framework for assessing task completion progress.

Why It Matters

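A major obstacle to collecting training data for web agents is trajectory evaluation: quantifying how much progress an agent made toward completing a task. Fine-grained evaluation makes partially successful trajectories usable rather than discarded, which significantly expands the pool of training data. According to the abstract, this lets a significantly smaller distilled student model outperform open-source approaches and match or exceed commercial systems on complex booking tasks.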

Key Takeaways

  • Introduces a scalable pipeline for automatic training data generation.
  • Presents a constraint-based evaluation framework for task completion.
  • Reports that the distilled student model outperforms open-source approaches and matches or exceeds commercial systems while being significantly smaller.
  • Expands usable training data by leveraging partially successful trajectories.
  • Proposes a new benchmark, BookingArena, consisting of complex booking tasks across 20 popular websites.

Computer Science > Artificial Intelligence
arXiv:2602.12544 (cs)
[Submitted on 13 Feb 2026]

Title: Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

Authors: Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Creighton Glasscock, Honglak Lee

Abstract: We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex struc...
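To make the idea of constraint-based progress scoring concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how constraints are defined or aggregated, so the `Constraint` class, the `progress_score` and `select_trajectories` functions, the dictionary-based state, the example booking fields, and the 0.5 threshold are all assumptions made for illustration. The sketch scores a trajectory as the fraction of task constraints its final state satisfies, then keeps trajectories at or above a threshold so that partially successful ones remain usable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: the final booking state reached by a trajectory.
State = Dict[str, str]


@dataclass
class Constraint:
    """One requirement of the task (hypothetical structure)."""
    name: str
    check: Callable[[State], bool]


def progress_score(state: State, constraints: List[Constraint]) -> float:
    """Return the fraction of constraints satisfied, between 0.0 and 1.0."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for c in constraints if c.check(state))
    return satisfied / len(constraints)


def select_trajectories(
    scored: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Keep trajectory ids whose score meets the (assumed) threshold."""
    return [traj_id for traj_id, score in scored if score >= threshold]


# Hypothetical task: book a Paris hotel for 2 guests with breakfast.
constraints = [
    Constraint("destination", lambda s: s.get("city") == "Paris"),
    Constraint("check_in", lambda s: s.get("check_in") == "2026-03-01"),
    Constraint("guests", lambda s: s.get("guests") == "2"),
    Constraint("breakfast", lambda s: s.get("breakfast") == "yes"),
]

# Three illustrative final states: fully, partially, and not successful.
states = {
    "full": {"city": "Paris", "check_in": "2026-03-01",
             "guests": "2", "breakfast": "yes"},
    "partial": {"city": "Paris", "check_in": "2026-03-01", "guests": "1"},
    "failed": {"city": "Rome"},
}

scored = [(tid, progress_score(s, constraints)) for tid, s in states.items()]
kept = select_trajectories(scored, threshold=0.5)
print(scored)
print(kept)
```

A binary success check would keep only the first trajectory here. A graded score also admits the second, which is the sense in which fine-grained evaluation expands the usable data.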
