Bala Pillai
@balapillai

Your LLM inference is burning 50% of its compute on work it has already done. If you are running RAG or multi-turn chat, you are likely recomputing the KV cache for the same documents over and over again.

I found the open-source library that solves this. It's called LMCache. It makes the KV cache persistent and shareable across different engine instances (vLLM, SGLang).

The "cheat code" for AI infrastructure: instead of the cache dying when a request finishes, LMCache offloads it to a shared layer (CPU/disk/network). This unlocks architecture patterns that were previously impractical:

1./ Instant RAG: Process a 100-page PDF once and store the KV cache. Any later query against that document starts almost instantly (near-zero Time-To-First-Token).

2./ Disaggregated Serving: Run the heavy "prefill" on H100s, then stream the cache to cheaper L4s for "decoding."

3./ Context Sharing: Multiple users asking about the same context? Compute it once, serve everyone.

🚀 Up to 15x throughput gain in multi-round QA workloads.
⚡ 3-10x reduction in Time-To-First-Token (TTFT).

It integrates directly with vLLM and SGLang; a minimal setup sketch is below. Stop letting your GPUs do the same homework twice.

GitHub Repo: github.com/lmcache/lmcache
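If you want to try it, here is a minimal sketch of the vLLM side, based on LMCache's documented KV-connector integration. The connector name (LMCacheConnectorV1), the LMCACHE_* environment variables, the model name, and the report.txt path are assumptions taken from the project's examples and may differ across versions, so check the repo before copying.

```python
# Minimal sketch (assumptions noted above): vLLM offline inference with
# LMCache as the KV-cache backend. Requires `pip install vllm lmcache`.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Tell LMCache where to keep offloaded KV blocks (CPU RAM here; disk or
# remote backends are configured the same way via env vars or a YAML file).
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"  # GB of CPU RAM for KV blocks
os.environ["LMCACHE_CHUNK_SIZE"] = "256"         # tokens per cached chunk

# Route vLLM's KV cache through the LMCache connector so a prefix computed
# once (e.g. a long RAG document) is reused by later requests.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",  # this instance both stores and retrieves KV cache
    ),
    gpu_memory_utilization=0.8,
)

long_document = open("report.txt").read()  # stand-in for the 100-page PDF text

# The first call pays the full prefill cost and populates the cache; later
# calls that share the same document prefix skip that prefill.
for question in ["Summarize the report.", "List the key risks."]:
    out = llm.generate(
        [f"{long_document}\n\nQuestion: {question}"],
        SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(out[0].outputs[0].text)
```

The same config shape covers the disaggregated-serving pattern in the vLLM/LMCache examples: prefill nodes run with kv_role set to "kv_producer" and decode nodes with "kv_consumer", with the cache streamed between them instead of kept locally.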