[2602.23722] SLA-Aware Distributed LLM Inference Across Device-RAN-Cloud
Computer Science > Networking and Internet Architecture

arXiv:2602.23722 (cs)

[Submitted on 27 Feb 2026]

Title: SLA-Aware Distributed LLM Inference Across Device-RAN-Cloud

Authors: Hariz Yet, Nguyen Thanh Tam, Mao V. Ngo, Lim Yi Shen, Lin Wei, Jihong Park, Binbin Chen, Tony Q. S. Quek

Abstract: Embodied AI requires sub-second inference near the Radio Access Network (RAN), but deployments span heterogeneous tiers (on-device, RAN-edge, cloud) and must not disrupt real-time baseband processing. We report measurements from a 5G Standalone (SA) AI-RAN testbed using a fixed baseline policy for repeatability. The setup includes an on-device tier, a three-node RAN-edge cluster co-hosting a containerized 5G RAN, and a cloud tier. We find that on-device execution remains multi-second and fails to meet sub-second budgets. At the RAN edge, SLA feasibility is primarily determined by model variant choice: quantized models concentrate below 0.5\,s, while unquantized and some larger quantized models incur deadline misses due to stalls and queuing. In the cloud tier, meeting a 0.5\,s deadline is challenging on the measured WAN path (up to 32.9\% of requests complete within 0.5\,s), but all evaluated variants meet a 1.0\,s deadline (100\% within 1.0\,s). Under saturated downlink traffic and up to $N{=}20$ concurrent inference clients, Multi-I...
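
The SLA figures above (e.g., up to 32.9\% of requests within 0.5\,s on the WAN path, 100\% within 1.0\,s) quantify feasibility as the fraction of requests whose end-to-end latency falls within a deadline. A minimal sketch of that metric, assuming per-request latency samples in seconds; the function name and the synthetic lognormal samples below are illustrative stand-ins, not taken from the paper:

    import numpy as np

    def sla_attainment(latencies_s, deadline_s):
        # Fraction of requests whose end-to-end latency meets the deadline.
        latencies = np.asarray(latencies_s, dtype=float)
        return float(np.mean(latencies <= deadline_s))

    # Hypothetical latency samples for one model variant on one tier.
    rng = np.random.default_rng(seed=0)
    samples = rng.lognormal(mean=-0.6, sigma=0.5, size=1000)

    for deadline_s in (0.5, 1.0):
        print(f"{deadline_s:.1f} s deadline: "
              f"{sla_attainment(samples, deadline_s):.1%} of requests within budget")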