[2602.22812] Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
Summary
The paper presents a method for accelerating local large language model (LLM) inference on resource-constrained edge devices through distributed prompt caching, in which multiple low-end devices cooperatively share intermediate processing states to significantly reduce inference times.
Why It Matters
As the demand for AI applications on edge devices grows, optimizing LLM performance is crucial. This research addresses the limitations of such devices, enabling more efficient use of AI in various applications, from IoT to mobile computing.
Key Takeaways
- Distributed prompt caching can significantly enhance LLM performance on low-end devices.
- The proposed method reduces Time to First Token (TTFT) by 93.12% and Time to Last Token (TTLT) by 50.07% on average.
- A Bloom-filter-based catalog lets each device check whether a peer likely holds the desired cached states, suppressing unnecessary communication over the wireless network.
- Support for partial matching lets the cache exploit similarity between prompts, so a prompt can reuse cached states even without an exact match.
- The approach is validated with the Gemma-3 270M model on the MMLU dataset, running on the Raspberry Pi Zero 2W platform.
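The catalog described above can be illustrated with a minimal Bloom filter: each device advertises a compact bitmap summarizing which cached states it holds, so a peer can cheaply rule out remote lookups that would miss. This is a sketch under assumptions; the paper does not specify the filter size, hash count, or key format, and the `BloomCatalog` class and its parameters here are hypothetical.

```python
import hashlib

class BloomCatalog:
    """Illustrative Bloom-filter catalog: a compact, shareable summary of
    which cache keys (e.g. hashes of prompt prefixes) a device holds.
    Sizes and hash count are illustrative assumptions, not from the paper."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False => key is definitely absent, so the network query is skipped.
        # True  => key is possibly present (small false-positive rate),
        #          so it is worth contacting the remote device.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

A querying device would first test `might_contain` against each peer's catalog and only send a request when it returns `True`; a Bloom filter never yields false negatives, so no genuinely cached state is missed.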
arXiv:2602.22812 [cs.LG]
[Submitted on 26 Feb 2026]
Title: Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
Authors: Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura
Abstract: Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively.
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2602.22812 [cs.LG]...
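The partial matching mentioned in the abstract can be pictured as longest-prefix reuse: given a tokenized prompt, find the longest cached prefix whose intermediate states (e.g. KV-cache entries) can be reused, so only the remaining suffix must be prefilled. The function below is a hypothetical sketch; the paper's actual matching granularity and data structures are not specified here.

```python
def longest_cached_prefix(prompt_tokens, cache):
    """Return (prefix, cached_state) for the longest cached prefix of
    prompt_tokens, or ((), None) on a complete miss.

    `cache` maps token tuples to opaque cached states. A linear scan from
    the longest candidate down is used purely for clarity; a real system
    would likely use a trie or hashed prefix blocks. All names here are
    illustrative assumptions, not the paper's API.
    """
    for end in range(len(prompt_tokens), 0, -1):
        key = tuple(prompt_tokens[:end])
        if key in cache:
            return key, cache[key]
    return (), None
```

With such a lookup, a device reuses the cached states for the matched prefix (fetched locally or from a peer whose catalog reports a likely hit) and runs prefill only on the unmatched suffix, which is where the TTFT savings come from.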