[2602.14257] AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Summary
The paper introduces AD-Bench, a benchmark for evaluating Large Language Model (LLM) agents in real-world advertising analytics, highlighting performance gaps in complex tasks.
Why It Matters
As LLMs become integral to various domains, understanding their performance in real-world scenarios, especially in advertising, is crucial. AD-Bench addresses the limitations of existing benchmarks by focusing on practical applications, enabling better evaluation and improvement of LLM capabilities in marketing contexts.
Key Takeaways
- AD-Bench is designed to evaluate LLM agents in real-world advertising scenarios.
- The benchmark categorizes tasks into three difficulty levels (L1-L3) to assess agent capabilities.
- Current state-of-the-art models show significant performance gaps in complex marketing tasks.
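Since AD-Bench pairs each request with a reference tool-call trajectory, a trajectory-aware metric must compare the agent's sequence of tool calls against the expert's. The paper's exact scoring rule is not detailed here, so the sketch below uses a hypothetical metric (normalized longest-common-subsequence overlap) purely to illustrate the idea; the tool names are invented for the example.

```python
# Hypothetical sketch of trajectory-aware scoring: measure how much of an
# expert reference tool-call trajectory the agent reproduced, in order.
# This is NOT AD-Bench's actual metric, just an illustration of the concept.

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def trajectory_score(predicted, reference):
    """Fraction of the reference trajectory recovered by the agent, in order."""
    if not reference:
        return 1.0
    return lcs_length(predicted, reference) / len(reference)

# Invented tool names for illustration only.
reference = ["fetch_campaign_stats", "segment_audience", "compute_roi"]
predicted = ["fetch_campaign_stats", "compute_roi"]
print(round(trajectory_score(predicted, reference), 3))  # 2 of 3 calls, in order
```

An order-sensitive metric like this rewards agents that follow the expert's analysis workflow rather than merely invoking the right tools in any order, which is what distinguishes trajectory-aware evaluation from answer-only scoring.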
Computer Science > Computation and Language
arXiv:2602.14257 (cs)
[Submitted on 15 Feb 2026]

Title: AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Authors: Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

Abstract: While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, G...