[2602.13165] Asynchronous Verified Semantic Caching for Tiered LLM Architectures
Summary
The paper introduces Krites, an asynchronous caching policy for large language models (LLMs) that expands semantic-cache coverage without adding serving latency, significantly improving response accuracy in conversational and search tasks.
Why It Matters
As LLMs become integral to various applications, optimizing their performance through effective caching strategies is crucial. Krites addresses the trade-off between response accuracy and latency, making it a significant advancement in LLM deployment strategies.
Key Takeaways
- Krites improves static caching efficiency for LLMs without increasing latency.
- The policy allows for asynchronous verification of cached responses, enhancing accuracy.
- Simulations show up to a 3.9× increase in effective static-answer usage.
Computer Science > Information Retrieval · arXiv:2602.13165 (cs) · Submitted on 13 Feb 2026
Authors: Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu

Abstract: Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding-similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promo...
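The mechanism the abstract describes can be sketched in a few lines: serve a static hit only at or above the usual similarity threshold, and when the nearest neighbor lands just below it, submit the pair to an LLM judge off the critical path, promoting approved matches into the static tier. This is a minimal illustrative sketch, not the paper's implementation; the threshold values, the `judge_fn`/`embed_fn` callables, and the `KritesStyleCache` class are all assumptions for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor

STATIC_THRESHOLD = 0.90  # serve immediately at/above this similarity (illustrative value)
JUDGE_BAND = 0.80        # near-miss band that triggers async verification (illustrative value)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class KritesStyleCache:
    """Sketch of an asynchronous, judge-verified static-cache policy."""

    def __init__(self, judge_fn, embed_fn):
        self.entries = []          # list of (embedding, prompt, response)
        self.judge_fn = judge_fn   # hypothetical LLM judge: (prompt, response) -> bool
        self.embed_fn = embed_fn   # hypothetical embedding model: prompt -> vector
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending = []          # in-flight verification futures

    def lookup(self, prompt):
        emb = self.embed_fn(prompt)
        best = max(self.entries, key=lambda e: cosine(emb, e[0]), default=None)
        if best is None:
            return None
        sim = cosine(emb, best[0])
        if sim >= STATIC_THRESHOLD:
            return best[2]  # critical path: behaves like a standard threshold policy
        if sim >= JUDGE_BAND:
            # Near miss: verify off the critical path; this request is unaffected.
            self.pending.append(self.pool.submit(self._verify, prompt, emb, best[2]))
        return None  # fall through to the dynamic tier / model inference

    def _verify(self, prompt, emb, response):
        if self.judge_fn(prompt, response):
            # Promote: future lookups for this prompt hit the static tier directly.
            self.entries.append((emb, prompt, response))
```

In this sketch the first near-miss lookup still falls through to the backend, so serving decisions never change; only subsequent, judge-approved repeats gain a static hit, which is the source of the expanded coverage.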