[2502.03771] vCache: Verified Semantic Prompt Caching
Summary
The paper presents vCache, a verified semantic prompt caching system that improves LLM inference efficiency by learning a per-prompt similarity threshold online, raising cache hit rates while reducing cache error rates.
Why It Matters
As large language models (LLMs) become integral to various applications, optimizing their performance is crucial. vCache addresses limitations of traditional caching methods by providing user-defined error rate guarantees, which can lead to more reliable and efficient LLM deployments. This innovation is significant for researchers and developers working on AI infrastructure and applications.
Key Takeaways
- vCache introduces a dynamic thresholding method for semantic caching.
- It provides user-defined error rate guarantees for improved reliability.
- The system outperforms existing static-threshold methods in cache efficiency.
- vCache demonstrates up to 12.5x higher cache hit rates and 26x lower error rates.
- The implementation and benchmarks are made available for further research.
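The dynamic-thresholding idea in the takeaways can be made concrete with a small sketch: for one cached prompt, pick the lowest similarity threshold whose empirical error rate on past cache hits stays within a user-defined budget. This is an illustrative simplification under assumed bookkeeping (a list of `(similarity, was_correct)` pairs per cached prompt), not vCache's actual online learning algorithm; the function name and parameters are hypothetical.

```python
def estimate_threshold(observations, max_error_rate=0.02):
    """Choose the lowest similarity threshold whose empirical error
    rate over past observations stays within the error budget.

    observations: list of (similarity, was_correct) pairs recorded
    for a single cached prompt. Hypothetical format for illustration.
    """
    # Candidate thresholds: every observed similarity, highest first.
    candidates = sorted({sim for sim, _ in observations}, reverse=True)
    best = 1.0  # most conservative choice: effectively never hit
    for t in candidates:
        accepted = [ok for sim, ok in observations if sim >= t]
        errors = sum(1 for ok in accepted if not ok)
        if accepted and errors / len(accepted) <= max_error_rate:
            best = t  # this lower threshold still meets the budget
        else:
            break  # relaxing further would exceed the error budget
    return best
```

A lower threshold means more cache hits; the loop stops relaxing as soon as the observed error rate would exceed the user-defined target, which mirrors the paper's framing of trading hit rate against a bounded error rate.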
Computer Science > Machine Learning
arXiv:2502.03771 (cs)
[Submitted on 6 Feb 2025 (v1), last revised 21 Feb 2026 (this version, v5)]
Title: vCache: Verified Semantic Prompt Caching
Authors: Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez
Abstract: Semantic caches return cached responses for semantically similar prompts to reduce LLM inference latency and cost. They embed cached prompts and store them alongside their responses in a vector database. Embedding similarity metrics assign a numerical score to quantify the similarity between a request and its nearest-neighbor prompt from the cache. Existing systems use the same static similarity threshold across all requests to determine whether two prompts can share similar responses. However, we observe that static thresholds do not give formal correctness guarantees, result in unexpected error rates, and lead to suboptimal cache hit rates. This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees for predictable performance. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training. Our experiments show that vCache consistently meets the spe...
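The lookup path the abstract describes (embed the request, find the nearest cached prompt, compare the similarity score against a per-entry threshold) can be sketched as below. This is a toy stand-in, not vCache's implementation: a character-bigram embedding substitutes for a real embedding model, a linear scan substitutes for a vector database, and all names are hypothetical.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model: bag of character bigrams.
    vec = {}
    for i in range(len(text) - 1):
        bigram = text[i:i + 2].lower()
        vec[bigram] = vec.get(bigram, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Semantic cache where each entry carries its own threshold,
    rather than one static threshold shared by all requests."""

    def __init__(self, default_threshold=0.8):
        self.entries = []  # list of (embedding, response, threshold)
        self.default_threshold = default_threshold

    def lookup(self, prompt):
        # Find the nearest cached prompt by embedding similarity.
        q = embed(prompt)
        best_entry, best_sim = None, -1.0
        for emb, response, threshold in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_entry, best_sim = (response, threshold), sim
        # Hit only if similarity clears that entry's own threshold.
        if best_entry is not None and best_sim >= best_entry[1]:
            return best_entry[0]
        return None  # miss: caller queries the LLM, then inserts

    def insert(self, prompt, response, threshold=None):
        t = self.default_threshold if threshold is None else threshold
        self.entries.append((embed(prompt), response, t))
```

In a real system the per-entry threshold would be updated online from observed correctness feedback, which is the part vCache's verified learning algorithm formalizes with an error-rate guarantee.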