[2602.24044] Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2602.24044 (cs)
[Submitted on 27 Feb 2026]

Title: Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Authors: Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral

Abstract: Large Language Model (LLM) adapters enable low-cost model specialization, but they introduce complex caching and scheduling challenges in distributed serving systems, where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model trained on DT-generated data, and (iii) a greedy placement algorithm that exploi...
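The abstract's third component, a greedy placement algorithm that packs adapters onto the fewest GPUs subject to per-GPU throughput and memory limits, is essentially a constrained bin-packing problem. The paper's actual algorithm is not shown on this page; the sketch below is only an illustrative first-fit-decreasing variant under assumed inputs (the `Adapter` fields, `gpu_throughput`, and `gpu_mem_gb` are hypothetical stand-ins for the pipeline's ML-predicted per-GPU capacities).

```python
from dataclasses import dataclass

@dataclass
class Adapter:
    name: str
    demand: float   # predicted request rate this adapter must sustain (req/s)
    mem_gb: float   # GPU memory footprint of the adapter weights

def greedy_placement(adapters, gpu_throughput, gpu_mem_gb):
    """First-fit-decreasing sketch: try each adapter on the first GPU with
    spare predicted throughput and memory; open a new GPU otherwise."""
    gpus = []  # each GPU: {"adapters": [...], "load": float, "mem": float}
    for a in sorted(adapters, key=lambda a: a.demand, reverse=True):
        for g in gpus:
            if (g["load"] + a.demand <= gpu_throughput
                    and g["mem"] + a.mem_gb <= gpu_mem_gb):
                g["adapters"].append(a.name)
                g["load"] += a.demand
                g["mem"] += a.mem_gb
                break
        else:  # no existing GPU fits: provision a new one
            gpus.append({"adapters": [a.name], "load": a.demand, "mem": a.mem_gb})
    return gpus

# Toy workload: four adapters packed onto GPUs with a 60 req/s cap and 8 GB each.
plan = greedy_placement(
    [Adapter("a1", 40, 2), Adapter("a2", 35, 2),
     Adapter("a3", 30, 2), Adapter("a4", 20, 2)],
    gpu_throughput=60, gpu_mem_gb=8)
print(len(plan))  # number of GPUs the greedy plan uses
```

Sorting by descending demand before first-fit tends to reduce the GPU count, mirroring the abstract's goal of serving the workload with the minimum number of GPUs while the capacity checks guard against starvation and memory errors.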