LMCache Lab

145 posts

@lmcache

🧪 Open-Source Team that maintains LMCache and Production Stack 🤖 Democratizing AI by providing efficient LLM serving for ALL

Github, Online · Joined September 2024
48 Following · 798 Followers
LMCache Lab@lmcache·
🧵 LMCache was spotlighted in Jensen Huang's GTC 2026 keynote — a real milestone for the community! A late post, intentionally: one more dose of GTC after the feed rush settles. ☕ For those new here: LMCache is a KV cache sharing layer that cuts LLM serving costs and latency. It works seamlessly with vLLM and SGLang, with minimal setup. But the real story isn't the tech. It's the community that built it. Whether you're a researcher 🧐, engineer 🧑‍💻, student 👩‍🎓, or just curious, there's a place for you here. 🔗 Explore LMCache 💻 Code: github.com/LMCache/LMCache 📖 Docs: docs.lmcache.ai 📝 Blog: blog.lmcache.ai/en/ ⭐ Star the repo, open an issue, submit a PR. Every contribution matters! The future of AI infrastructure is open. Come build it with us. #LMCache #KVCache #NVIDIAGTC #LLM #opensource
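For the curious, the "minimal setup" with vLLM looks roughly like this. A sketch, not an authoritative recipe: the model name is a placeholder, and exact flags and connector names can vary by vLLM/LMCache version — check docs.lmcache.ai for your release.

```shell
# Minimal LMCache + vLLM setup sketch (flags may differ by version).
pip install lmcache vllm

# Point LMCache at a config file; key names are illustrative,
# see docs.lmcache.ai for the supported options.
export LMCACHE_CONFIG_FILE=lmcache.yaml

# Serve with the LMCache KV connector so KV caches can be
# shared and offloaded across requests and instances.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```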
0 replies · 0 reposts · 2 likes · 142 views
LMCache Lab reposted
Tensormesh@tensormesh·
"Inference context is the new bottleneck" — Kevin Deierling, SVP Networking #NVIDIA. In his #GTC talk last week, he highlighted that CMX and CacheBlend from LMCache (@tensormesh) are part of the new KV cache memory stack for agents, and recognized @tensormesh among the CMX storage partners. As the stack evolves, @tensormesh keeps building for what's next. ▶️ Session replay: tinyurl.com/GTC-talk
0 replies · 2 reposts · 9 likes · 319 views
LMCache Lab@lmcache·
We ran a tiny experiment on a one-shot SWE-bench task with Claude Code to study the context engineering and reuse pattern: • 92 LLM calls invoked • ~2M input tokens • 13 minutes runtime • 92% prefix reuse rate. With prefix caching, this single task drops from $6.00 → $1.15 in input cost (≈ 81% savings) and dramatically reduces TTFT. The trace shows Claude Code is essentially a prefix-reuse machine: warm-up calls to prime the cache, a parallel multi-agent system, and a ReAct-style execution loop — all optimized for KV cache reuse. Blog post: huggingface.co/blog/kobe0938/… Raw trace: github.com/kobe0938/blog/… Trace visualizer: v0-llm-agent-dashboard.vercel.app If you care about context engineering, agent architecture, and KV cache economics, this is a concrete, end-to-end look under the hood. Paste the raw trace into the visualizer to explore further.
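The $6.00 → $1.15 figure can be reproduced with back-of-the-envelope arithmetic. A sketch assuming Anthropic-style prompt-caching rates ($3/MTok fresh input, 1.25x for cache writes, 0.1x for cache reads) — the rates are our assumption, not taken from the trace:

```python
# Back-of-the-envelope prefix-caching economics for the trace above.
# Assumed (illustrative) per-million-token rates, Anthropic-style:
FRESH = 3.00         # uncached input tokens, $/MTok
CACHE_WRITE = 3.75   # fresh tokens written into the prompt cache (1.25x)
CACHE_READ = 0.30    # cached tokens read back (0.1x)

total_mtok = 2.0     # ~2M input tokens across 92 LLM calls
reuse = 0.92         # 92% prefix reuse rate

# Without caching, every input token is billed at the fresh rate.
no_cache = total_mtok * FRESH

# With caching: the novel 8% is written to the cache at a premium,
# the reused 92% is read back at a steep discount.
with_cache = (total_mtok * (1 - reuse) * CACHE_WRITE
              + total_mtok * reuse * CACHE_READ)

print(f"without caching: ${no_cache:.2f}")   # $6.00
print(f"with caching:    ${with_cache:.2f}") # $1.15
print(f"savings:         {1 - with_cache / no_cache:.0%}")  # 81%
```

Under these assumed rates the numbers land exactly on the tweet's figures, which suggests the reported cost is dominated by the cache read/write split rather than output tokens.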
2 replies · 1 repost · 27 likes · 284.7K views
LMCache Lab@lmcache·
Help wanted: we would like to run another round of experiments on the agents listed at github.com/kyrolabs/aweso… (#software-development section) to analyze agents' reuse patterns and see whether LMCache can help accelerate open source agent applications. More details: github.com/LMCache/LMCach… github.com/LMCache/lmcach… If that sounds interesting to you or aligns with your research interests, DM us in the Slack channel 🫡
0 replies · 0 reposts · 2 likes · 410 views
LMCache Lab@lmcache·
Yesterday we hosted our first LMCache office hours! Jiayi Yao, Research Engineer at Tensormesh and one of our top contributors, covered the LMCache architecture, key performance optimizations, and benchmark results, based on the newly published technical report at arxiv.org/pdf/2510.09665. You can watch the recording here: youtu.be/y14ruG6CNGE?si… Join us for the next LMCache office hours on December 11. Register to get it added to your calendar: lmcache-officehours.zapier.app
0 replies · 1 repost · 6 likes · 784 views
LMCache Lab@lmcache·
Check out our new blog post, a collaboration with the Google GKE team, about LMCache on Google Kubernetes Engine: Boosting LLM Inference Performance with KV Cache on Tiered Storage blog.lmcache.ai/2025-10-07-LMC…
0 replies · 3 reposts · 6 likes · 492 views
LMCache Lab@lmcache·
Gotta Cache 'Em All.
0 replies · 0 reposts · 3 likes · 269 views
LMCache Lab@lmcache·
... Implementing the LMCache Plugin Framework and lmcache_frontend gave us an important insight: when handling scenario-specific requirements in open source projects, functional abstraction and universal design are crucial. The Plugin Framework succeeds because it doesn't implement each customization requirement directly; instead, it provides a flexible extension mechanism. Through the Plugin Framework, LMCache found a balance point: satisfying diverse requirements while keeping the project maintainable and extensible. The same design pattern appears in the LMCache Remote External Connector framework and the LMCache External Backend framework, and it is worth promoting in future development. By defining clear extension interfaces and specifications, we let the community meet specific requirements without modifying core code, supporting the long-term healthy development of the project as LMCache grows increasingly powerful.
0 replies · 0 reposts · 1 like · 232 views
LMCache Lab@lmcache·
In large-scale language model inference, efficient memory management and KV cache optimization are crucial. LMCache, a KV cache management system designed for vLLM, needs flexible extension mechanisms to support monitoring, troubleshooting, and state insight in complex production environments. Instead of customizing the LMCache core directly, we introduced the LMCache Plugin Framework: a lightweight yet powerful plugin system that lets developers run custom scripts inside LMCache processes. On top of this framework we built lmcache_frontend (github.com/LMCache/lmcach…), a monitoring and proxy service that runs as a subprocess on scheduler nodes only. It provides a web interface for cluster status visualization and implements request forwarding through an HTTP proxy service. This design not only simplifies deployment and management but also gives developers a solid example plugin, demonstrating how the Plugin Framework can enhance system observability and control. Read more: blog.lmcache.ai/2025-09-23-lmc… Doc: docs.lmcache.ai/developer_guid…
1 reply · 5 reposts · 13 likes · 858 views
Saikat Sur@sursaikat·
My writing on LMCache techniques: - CacheGen: how to quickly transfer KV caches to GPU memory. - CacheBlend: how to quickly combine multiple KV caches on demand. @lmcache @JunchenJiang LMCache GitHub: github.com/LMCache/LMCache
1 reply · 1 repost · 4 likes · 77 views
LMCache Lab reposted
EyeingAI@EyeingAI·
Wow… LLMs can now get insane speed & memory boosts. This open-source trick makes any large language model faster than you thought possible... LMCache caches and reuses key-value data across instances and hardware, so your AI: – Remembers context – Handles multi-round Q&A effortlessly – Runs faster and smoother 🔥 Why this is next-level: • Prompt Caching: Instantly pull long conversations, AI actually remembers stuff now. • Fast RAG: Combine cached data for lightning-accurate results. • Scale Like a Boss: No messy GPU routing. • Cheaper AF: Compression tech keeps costs low. • Lightning Speed: Streaming + decompression = almost zero lag. • Plug & Play: Works with vLLM, TGI, and all your favorite LLM engines. • Better Quality: Offline upgrades make AI smarter than ever. If your AI deals with context-heavy conversations or RAG, this is the hack that changes everything. Serve more users, slash compute waste, and see your AI dominate every task.
26 replies · 15 reposts · 66 likes · 52.7K views
LMCache Lab reposted
Daily Dose of Data Science@DailyDoseOfDS_·
The fastest serving engine for LLMs is here (open-source)! LMCache is an LLM serving engine designed to reduce time-to-first-token and increase throughput, especially under long-context scenarios. It boosts vLLM with 7x faster access to 100x more KV caches. 100% open-source!
15 replies · 190 reposts · 1.2K likes · 66.8K views
LMCache Lab reposted
Yacine Mahdid@yacinelearning·
I've got deep respect for niche open source projects in AI. You've gotta have deep expertise and a good heart to run those. Kudos to teams like LMCache for keeping the open source dream alive ❤️
5 replies · 27 reposts · 447 likes · 17.4K views