[2603.03565] Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Computer Science > Artificial Intelligence
arXiv:2603.03565 (cs)
[Submitted on 3 Mar 2026]

Title: Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Authors: Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood, Hongtai Wei, Sudeep Das

Abstract: Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-opti...