[2604.00136] ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Computer Science > Machine Learning
arXiv:2604.00136 (cs)
[Submitted on 31 Mar 2026]
Title: ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Authors: Annette Taberner-Miller

Abstract: Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four d...
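
The three mechanisms named in the abstract can be illustrated with a small sketch. This is not the paper's implementation or API: the class and method names (`PacedRouter`, `ArmStats`, `add_model`, `remove_model`, `select`, `update`) and all default parameters are hypothetical. It also makes simplifying assumptions that the abstract does not state: the bandit is non-contextual, each model has a known fixed cost per request, forced exploration is a single pull, and the dual step is normalized by the budget.

```python
import math
from dataclasses import dataclass


@dataclass
class ArmStats:
    """Discounted sufficient statistics for one model (hypothetical layout)."""
    cost: float              # dollars per request, assumed known and fixed
    weight: float = 0.0      # geometrically discounted pull count
    reward_sum: float = 0.0  # geometrically discounted sum of quality scores

    def mean(self) -> float:
        return self.reward_sum / self.weight if self.weight > 0 else 0.0


class PacedRouter:
    """Illustrative non-contextual router; an assumption, not the paper's code."""

    def __init__(self, budget, gamma=0.99, eta=0.05, bonus=0.5):
        self.budget = budget  # target average dollars per request
        self.gamma = gamma    # forgetting factor in (0, 1]
        self.eta = eta        # dual (price) step size
        self.bonus = bonus    # scale of the UCB exploration term
        self.lam = 0.0        # dual variable: current price of spending
        self.arms = {}

    def add_model(self, name, cost):
        # Hot-swap: a new arm starts with no statistics.
        self.arms[name] = ArmStats(cost=cost)

    def remove_model(self, name):
        self.arms.pop(name, None)

    def select(self):
        if not self.arms:
            raise ValueError("no models registered")
        # Forced exploration, simplified here to one pull per newcomer.
        for name, arm in self.arms.items():
            if arm.weight == 0.0:
                return name
        total = sum(a.weight for a in self.arms.values())

        def score(item):
            _, arm = item
            ucb = arm.mean() + self.bonus * math.sqrt(
                math.log(1.0 + total) / arm.weight)
            # Primal step: quality estimate minus the priced cost.
            return ucb - self.lam * arm.cost

        return max(self.arms.items(), key=score)[0]

    def update(self, name, quality):
        # Geometric forgetting: decay every arm's statistics, then add
        # the new observation to the arm that was played.
        for arm in self.arms.values():
            arm.weight *= self.gamma
            arm.reward_sum *= self.gamma
        arm = self.arms[name]
        arm.weight += 1.0
        arm.reward_sum += quality
        # Dual step: raise the price after overspending, lower it after
        # underspending, and never let it go negative.
        overspend = (arm.cost - self.budget) / self.budget
        self.lam = max(0.0, self.lam + self.eta * overspend)
```

In this sketch, a serving loop would call `select()` for each request, dispatch to the chosen model, score the response, and pass that score to `update()`. Because `lam` rises whenever the chosen model costs more than `budget` and falls otherwise, expensive models are priced out after sustained overspending without any offline penalty tuning, which is the closed-loop behavior the abstract attributes to the budget pacer.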