Kuntai Du

94 posts

Kuntai Du

@this_will_echo

Chief Scientist | Committer of vLLM / LMCache / Production Stack

Joined Ocak 2022

56 Following198 Followers

Kuntai Du@this_will_echo·3d

Physical LLM is on the way lol

Tensormesh@tensormesh

"𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸" — Kevin Deierling, SVP Networking #NVIDIA At his #GTC talk last week, he highlighted 𝗖𝗠𝗫 and 𝗖𝗮𝗰𝗵𝗲𝗕𝗹𝗲𝗻𝗱 from 𝗟𝗠𝗖𝗮𝗰𝗵𝗲 (@tensormesh) were part of the new KV Cache memory stack for agents, and recognized @tensormesh among the 𝗖𝗠𝗫 𝘀𝘁𝗼𝗿𝗮𝗴𝗲 𝗽𝗮𝗿𝘁𝗻𝗲𝗿𝘀. As the stack evolves, @tensormesh keeps building for what's next. ▶️ session Replay: tinyurl.com/GTC-talk

English

Kuntai Du retweeted

Tensormesh@tensormesh·17 Mar

🔴 Live from #GTC2026 On the floor with our Chief Scientist @this_will_echo and CTO #Yihua Chang — #KVCache is the hottest topic of the day. Even Jensen opened with it. 🎙️They covered topics like: #CacheBlend, @lmcache 0.4.0. and the super cool collab with @nvidia around a bot called #reachy using LMCache under the hood for 20x speedup #GTC2026 #KVCache #LMCache #TensorMesh

English

333

Kuntai Du@this_will_echo·14 Oca

By offloading KV caches to SSD, we managed to reduce the time-to-first-token for @gmi_cloud without ANY extra infra cost!

GMI Cloud@gmi_cloud

Happy 2026 🥂 First post of the year: a technical benchmark. In a joint study with @tensormesh , we achieved: - 4× TTFT improvement - Prefix cache hit rate >50% Using SSD-augmented KVCache on realistic multi-turn LLM traffic. Full write-up on GMI Cloud: gmicloud.ai/blog/gmi-cloud…

English

363

Kuntai Du retweeted

Junchen Jiang@JunchenJiang·10 Ara

🚀 LMCache has officially been out for 1.5 years now! Within its success, LMCache has become the default KV-cache library for open-source LLM inference (CPU offload, P2P sharing, multi-backend storage, vLLM/SGLang integration, and more). As a PyTorch Foundation Ecosystem project, LMCache is now used by enterprise leaders across the industry (GKE, AWS, Nvidia's Dynamo, llm-d…). 🤔What’s the secret to our product?? 🔎 Come see yourself: arxiv.org/pdf/2510.09665 ♥️ A huge thank you to our contributors and community, you’ve influenced what makes LMCache today. (@lmcache) #KVCache #LMCache #LLM #vLLM

English

1.6K

Kuntai Du@this_will_echo·19 Kas

Github is not acting normal... Our LMCache logo suddenly disappeared today, we didn't make any change. And we cannot even clone the repo using ssh. Github bad bad.

English

202

Kuntai Du retweeted

Akshay 🚀@akshay_pachaar·14 Kas

Meta just solved the biggest problem in RAG! Most RAG systems waste your money. They retrieve 100 chunks when you only need 10. They force the LLM to process thousands of irrelevant tokens. You pay for compute you don't need. Meta AI just solved this. They built REFRAG, a new RAG approach that compresses and filters context before it hits the LLM. The results are insane: - 30.85x faster time-to-first-token - 16x larger context windows - 2-4x fewer tokens processed - Outperforms LLaMA on 16 RAG benchmarks Here's what makes REFRAG different: Traditional RAG dumps everything into the LLM. Every chunk. Every token. Even the irrelevant stuff. REFRAG works at the embedding level instead: ↳ It compresses each chunk into a single embedding ↳ An RL-trained policy scores each chunk for relevance ↳ Only the best chunks get expanded and sent to the LLM ↳ The rest stay compressed or get filtered out entirely The LLM only processes what matters. The workflow is straightforward: 1. Encode your docs and store them in a vector database 2. When a query arrives, retrieve relevant chunks as usual 3. The RL policy evaluates compressed embeddings and picks the best ones 4. Selected chunks are expanded into full token embeddings 5. Rejected chunks stay as single compressed vectors 6. Everything goes to the LLM together This means you can process 16x more context at 30x the speed with zero accuracy loss. I have shared link to the paper in the next tweet!

English

279

1.4K

103.2K

Kuntai Du retweeted

LMCache Lab@lmcache·14 Kas

Yesterday we hosted our first LMCache office hours! Jiayi Yao, Research Engineer at Tensormesh and one of the top contributors, covered LMCache architecture, key performance optimizations, and benchmark results-based on the newly published technical report available at arxiv.org/pdf/2510.09665. You can watch the recording here: youtu.be/y14ruG6CNGE?si… Join us for the next LMCache office hours on December 11. Register to get it added to your calendar: lmcache-officehours.zapier.app

YouTube

English

784

Kuntai Du retweeted

vLLM@vllm_project·12 Kas

Thanks to @github for spotlighting vLLM in the Octoverse 2025 report — one of the fastest-growing open-source AI projects this year. 🏆 Top OSS by contributors 🚀 Fastest-growing by contributors 🌱 Attracting the most first-time contributors Trusted by leading open model communities and industry partners — including NVIDIA, Meta, Red Hat, DeepSeek, Qwen, Moonshot, and others — vLLM has become a preferred engine for efficient LLM inference. With almost 63K stars and 1800 contributors, this growth belongs to the community. Together, we’re building an easier, faster, and cheaper LLM serving for everyone. 👉gh.io/octoverse #vLLM #OpenSource #AIInfra #Octoverse

English

119

10.8K

Kuntai Du@this_will_echo·6 Kas

Happy to meet with @EmbeddedLLM people in person! Thanks for all the hardware supports in @vllm_project and @lmcache !

English

656

Kuntai Du@this_will_echo·6 Kas

vLLM team @vllm_project , reunited at Ray Summit! (Lowkey: can't wait to meet our magic @KaichaoYou and @darklight1337 in person!) Can't wait to meet our magic @KaichaoYou and @darklight1337 in person!

English

1.2K

Kuntai Du@this_will_echo·6 Kas

Ray summit 2025 takeaways: 1. Ray is now in PyTorch Foundation 2. Ray supports RDMA (finally) 3. Anyscale runtime, basically Ray + better perf and fault tolerance/observability etc. 4. Anyscale is building multi-resource cloud, and collaborating with Azure. #raysummit2025 #vllm

English

160

Kuntai Du retweeted

OpenAI Developers@OpenAIDevs·29 Eki

🧑‍💻 gpt-oss-safeguard Hackathon 🧑‍💻 Join us Dec. 8 in SF for the Open Safeguard Hackathon — a collaborative event by OpenAI, ROOST & @HuggingFace to explore how open models can shape safer digital spaces and explore the future of open-weight reasoning and online safety. Apply to participate: events.openai.com/gpt-oss-safegu…

English

237

34.9K

Kuntai Du@this_will_echo·30 Eki

Why computer system community has a paper acceptance rate <20%? Reason 1: elite community. The core conference OSDI was literally a group discussion between elite professors. Reason 2: Everyone knows each other. Submitting subpar paper ruins ur & ur professor's reputation.

English

Kuntai Du@this_will_echo·28 Eki

💰Want CHEAP GPU cloud? 💡Wanna store ALL your users’ history&docs as KV cache to save cost—but can’t get open source to run? Try TensorMesh SaaS: ⚡️$3.09/hr H100 😄No vendor lock-in 🧠Any open-source model 🪄OpenAI-API compatible Join the waitlist🚀: tensormesh.ai/beta-waitlist

English

192

Kuntai Du@this_will_echo·23 Eki

The paper of KTransformers is in SOSP 2025 🎉 dl.acm.org/doi/pdf/10.114… 💡 Motivation in plain words: 8x H100 GPUs to serve MoE? Move experts to CPU! 🚀 Intel AMX boosts prefill 8×. 🧠 “Expert deferral” overlaps CPU & GPU perfectly. 2.8x faster than Llama.cpp!

English

122

Kuntai Du@this_will_echo·21 Eki

#ByteDance’s Seed team published LLM communication debugging paper at #SOSP🚀 This paper arxiv.org/abs/2509.03018 modifies NCCL with near-zero overhead: ⚡ Detects 90% of issues in 15s ⚡ Finds root causes in 20s A must-read for anyone training large models #LLM #NCCL #SOSP

English

255

Kuntai Du@this_will_echo·19 Eyl

Nvudia dynamo is using lmcache🥹🥹

LMCache Lab@lmcache

A deep dive on LMCache x NVIDIA Dynamo: blog.lmcache.ai/2025-09-18-dyn… to learn how LMCache integrates with NVIDIA Dynamo to slash KV-cache bottlenecks and push LLM inference efficiency to the next level. We’re also honored to be featured in NVIDIA’s official blog on Dynamo: developer.nvidia.com/blog/how-to-re… #LMCache #NVIDIA #Dynamo #LLM #Inference #KVcache

Lietuvių

184

Kuntai Du@this_will_echo·5 Eyl

LMCache in Redis Release event! Plus random creation by Kobe from @lmcache

English

154

Kuntai Du retweeted

LMCache Lab@lmcache·4 Eyl

LMCache highlighted by CEO of Redis @rowantrollope at Redis Released SF 2025! 🎉 We’re thrilled to partner with Redis, bringing KV cache acceleration to the infra ecosystem. #Redis #LMCache #AIInfra #LLM #Caching #SFTech #RedisReleased2025 📌 PS: Our team @kobe_eee(Kobe) & @this_will_echo(Kuntai) are here at the conference today—happy to connect in person if you’re around!

English

651

Kuntai Du@this_will_echo·31 Ağu

Legit singing with jaw dropped

English

231

Discover

@lmcache @nvidia @gmi_cloud @github @EmbeddedLLM @vllm_project @KaichaoYou @darklight1337