[2603.26796] Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
Computer Science > Machine Learning
arXiv:2603.26796 (cs)
[Submitted on 25 Mar 2026]

Title: Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

Authors: Jelena Markovic-Voronov, Kayhan Behdin, Yuanda Xu, Zhengze Zhou, Zhipeng Wang, Rahul Mazumder

Abstract: We study the problem of routing queries to large language models (LLMs) under cost, GPU resources, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1-14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to a non-optimized...
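
The idea of batch-level routing under a cost budget and per-model capacity limits can be illustrated with a toy example. The sketch below is not the authors' formulation, which the abstract does not spell out; it is a minimal brute-force search over all assignments, practical only for very small batches. The function name `route_batch`, the `quality`, `cost`, and `uncertainty` tables, and the penalty-based "robust" score (predicted quality minus a multiple of an uncertainty estimate) are all assumptions made for illustration.

```python
from itertools import product


def route_batch(quality, cost, budget, capacity, uncertainty=None, penalty=0.0):
    """Pick one model per query to maximize total (penalized) predicted quality.

    All inputs are hypothetical stand-ins for illustration:
      quality[i][j]     predicted quality of model j on query i
      cost[i][j]        cost of answering query i with model j
      budget            maximum total cost allowed for the whole batch
      capacity[j]       maximum number of queries model j may take in this batch
      uncertainty[i][j] optional uncertainty of the quality prediction
      penalty           weight on uncertainty (0.0 gives the non-robust score)
    Returns (assignment, score), or (None, -inf) if no assignment is feasible.
    """
    n_queries = len(quality)
    n_models = len(capacity)
    best_score, best_assignment = float("-inf"), None

    # Enumerate every way of assigning a model index to each query.
    for assignment in product(range(n_models), repeat=n_queries):
        # Batch-level cost constraint: the total, not each query, is bounded.
        total_cost = sum(cost[i][j] for i, j in enumerate(assignment))
        if total_cost > budget:
            continue
        # Per-model capacity constraint.
        if any(assignment.count(j) > capacity[j] for j in range(n_models)):
            continue
        # Score: predicted quality, optionally reduced by an uncertainty penalty.
        score = sum(
            quality[i][j] - (penalty * uncertainty[i][j] if uncertainty else 0.0)
            for i, j in enumerate(assignment)
        )
        if score > best_score:
            best_score, best_assignment = score, assignment

    return best_assignment, best_score


# Toy batch: 3 queries, model 0 is cheap, model 1 is expensive.
quality = [[0.5, 0.9], [0.6, 0.7], [0.2, 0.8]]
cost = [[1, 4], [1, 4], [1, 4]]
capacity = [3, 2]
budget = 9

plain_assignment, plain_score = route_batch(quality, cost, budget, capacity)

# Same batch, but the prediction for (query 0, model 1) is treated as unreliable.
uncertainty = [[0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
robust_assignment, robust_score = route_batch(
    quality, cost, budget, capacity, uncertainty=uncertainty, penalty=1.0
)
```

In this toy batch the budget admits at most two expensive calls. The non-penalized search spends them on queries 0 and 2, where the predicted gain is largest, while the penalized search moves the expensive call from query 0 to query 1 because the prediction for query 0 is marked as uncertain. A realistic system would replace the exhaustive loop with an integer-programming or relaxation-based solver.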