[2508.07675] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Summary
This article presents a framework for semantic caching in large language models (LLMs) to reduce inference costs by leveraging semantic similarities between queries.
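The core mechanism can be illustrated with a minimal sketch: embed each query, and on a new query return the cached response whose embedding is most similar, provided the similarity clears a threshold. The bag-of-words embedding and the `SemanticCache` class below are illustrative stand-ins, not the paper's implementation; a real system would use a learned sentence-embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real sentence-embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Serve a cached response when a new query is semantically close enough."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached response)

    def get(self, query: str):
        q = embed(query)
        best, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        # Unlike exact-match caching, a near-duplicate query is still a hit.
        return best if best_sim >= self.threshold else None

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))
```

A paraphrased query (e.g. reordered words) hits the cache and skips the LLM forward pass, while an unrelated query falls through to recomputation.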
Why It Matters
As LLMs become integral to information systems, their high inference costs hinder scalability. This research proposes a principled caching method that improves efficiency and adapts to real-world uncertainty, making it directly relevant to developers and researchers building LLM-backed systems.
Key Takeaways
- Semantic caching improves efficiency by retrieving responses based on semantic similarity.
- The proposed framework addresses cache eviction challenges under uncertainty.
- The proposed algorithms outperform existing semantic caching methods in the authors' evaluation.
- The framework includes both offline optimization and online learning components.
- Understanding query distribution is essential for effective cache management.
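The eviction and distribution-learning takeaways can be combined in a small sketch: each cached entry's expected saving is its estimated arrival probability (learned online from observed hits) times the cost saved by serving it instead of the LLM, minus its mismatch cost; the entry with the lowest expected saving is evicted. The scoring rule and the `CostAwareCache` class here are a hypothetical illustration of the idea, not the paper's actual algorithm.

```python
class CostAwareCache:
    """Hypothetical cost-aware eviction: estimate each entry's arrival
    probability from empirical hit counts, and evict the entry whose
    expected saving (probability * (LLM cost - mismatch cost)) is lowest."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}   # query -> {"response", "hits", "mismatch_cost"}
        self.total_hits = 0

    def expected_saving(self, entry: dict, llm_cost: float) -> float:
        # Empirical arrival probability; unknown a priori, learned online.
        p = entry["hits"] / self.total_hits if self.total_hits else 0.0
        return p * (llm_cost - entry["mismatch_cost"])

    def record_hit(self, query: str):
        self.entries[query]["hits"] += 1
        self.total_hits += 1

    def put(self, query: str, response: str,
            mismatch_cost: float, llm_cost: float = 1.0):
        if query in self.entries:
            return
        if len(self.entries) >= self.capacity:
            # Evict the entry that saves the least in expectation.
            victim = min(self.entries, key=lambda q: self.expected_saving(
                self.entries[q], llm_cost))
            del self.entries[victim]
        self.entries[query] = {"response": response, "hits": 0,
                               "mismatch_cost": mismatch_cost}
```

With a capacity-2 cache where one entry has been hit twice and another once, inserting a third entry evicts the rarely hit one, since its estimated arrival probability (and hence expected saving) is lowest.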
Computer Science > Machine Learning
arXiv:2508.07675 (cs)
[Submitted on 11 Aug 2025 (v1), last revised 13 Feb 2026 (this version, v3)]
Title: Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Authors: Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, Carlee Joe-Wong
Abstract: Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In th...