[2508.07675] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

arXiv - Machine Learning · 4 min read · Article

Summary

This article presents a framework for semantic caching in large language models (LLMs) to reduce inference costs by leveraging semantic similarities between queries.

Why It Matters

As LLMs become integral to information systems, their high operational costs hinder scalability. This research proposes a novel caching method that not only enhances efficiency but also adapts to real-world uncertainties, making it crucial for developers and researchers in AI and machine learning.

Key Takeaways

  • Semantic caching improves efficiency by retrieving cached responses for semantically similar queries (see the sketch after this list).
  • The proposed framework addresses the cache eviction problem under uncertainty, where mismatch costs between incoming queries and cached responses must be accounted for.
  • The developed algorithms show superior performance compared to existing methods.
  • The framework combines an offline optimization component with an online learning component.
  • Understanding the query distribution is essential for effective cache management.
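To make the first takeaway concrete, here is a minimal sketch of a semantic cache: incoming queries are embedded, compared against cached query embeddings by cosine similarity, and served from the cache only when the best match clears a similarity threshold. The embed_fn argument, the 0.85 threshold, and the FIFO eviction below are illustrative assumptions, not the algorithm proposed in the paper.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Small epsilon guards against division by zero for degenerate embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class SemanticCache:
    """Toy semantic cache: threshold-based nearest-neighbor lookup over embeddings."""

    def __init__(self, embed_fn, threshold: float = 0.85, capacity: int = 1000):
        self.embed_fn = embed_fn      # assumed: callable mapping text -> np.ndarray
        self.threshold = threshold    # minimum similarity that counts as a hit
        self.capacity = capacity
        self.entries = []             # list of (embedding, query, response) tuples

    def lookup(self, query: str):
        """Return a cached response for a semantically similar query, else None."""
        q = self.embed_fn(query)
        best_sim, best_response = -1.0, None
        for emb, _, response in self.entries:
            sim = cosine_similarity(q, emb)
            if sim > best_sim:
                best_sim, best_response = sim, response
        if best_sim >= self.threshold:
            return best_response      # semantic hit: the LLM call is skipped
        return None                   # miss: the caller must query the LLM

    def insert(self, query: str, response: str):
        """Store a fresh LLM response; evicts the oldest entry when full."""
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)       # placeholder FIFO eviction; the paper studies
                                      # cost-aware eviction under uncertainty instead
        self.entries.append((self.embed_fn(query), query, response))

In a real deployment, embed_fn would be a sentence-embedding model and the FIFO eviction would be replaced by a cost-aware policy of the kind the paper analyzes; the sketch only shows where semantic similarity enters the serving path.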

Computer Science > Machine Learning
arXiv:2508.07675 (cs)
[Submitted on 11 Aug 2025 (v1), last revised 13 Feb 2026 (this version, v3)]

Title: Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Authors: Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, Carlee Joe-Wong

Abstract: Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In th...
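The abstract's notion of mismatch cost can be made concrete with a small, hedged decision rule: serve the nearest cached response only when its estimated mismatch cost is lower than the cost of a fresh LLM call. The linear mismatch model and the parameter names below are illustrative assumptions, not the paper's formulation, which additionally learns unknown quantities such as query arrival probabilities and serving costs online.

# Illustrative serve-or-recompute rule (an assumption, not the paper's model):
# serve from the cache only if the estimated semantic-mismatch cost of the
# closest cached entry is lower than the cost of running the LLM again.

def should_serve_from_cache(similarity: float,
                            llm_cost: float,
                            mismatch_penalty: float) -> bool:
    """similarity: cosine similarity to the nearest cached query, in [0, 1].
    llm_cost: estimated cost of a fresh LLM forward pass.
    mismatch_penalty: hypothetical cost per unit of semantic mismatch."""
    mismatch_cost = mismatch_penalty * (1.0 - similarity)
    return mismatch_cost < llm_cost

# With llm_cost=1.0 and mismatch_penalty=10.0, only near-duplicates
# (similarity above 0.9) are served from the cache.
print(should_serve_from_cache(0.95, llm_cost=1.0, mismatch_penalty=10.0))  # True
print(should_serve_from_cache(0.70, llm_cost=1.0, mismatch_penalty=10.0))  # False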

Related Articles

AI: Fragility of today's Claude Cowork type AI Agent Apps. RTZ 1061
...realities like memory management, highlight a longer road to resilient AI Agents and AGI
AI Tools & Products · 11 min · Llms

Gemini caught a $280M crypto exploit before it hit the news, then retracted it as a hallucination because I couldn't verify it - because the news hadn't dropped yet
So this happened mere hours ago and I feel like I genuinely stumbled onto something worth documenting for people interested in AI behavio...
Reddit - Artificial Intelligence · 1 min · Llms

GPT-4 vs Claude vs Gemini for coding — honest breakdown after 3 months of daily use
I am a solo developer who has been using all three seriously. Here is what I actually think: GPT-4o — Strengths: Large context window, st...
Reddit - Artificial Intelligence · 1 min · Llms

You're giving feedback on a new version of ChatGPT
So I will be paying attention to these system messages more now- the last time I got one of these not so long back the 'tone' changed to ...
Reddit - Artificial Intelligence · 1 min · Llms