Kuntai Du
@this_will_echo
Chief Scientist | Committer of vLLM / LMCache / Production Stack
94 posts · Joined January 2022
56 Following · 198 Followers
Kuntai Du @this_will_echo
Physical LLM is on the way lol
Quoting Tensormesh @tensormesh:

"𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸" — Kevin Deierling, SVP Networking #NVIDIA At his #GTC talk last week, he highlighted 𝗖𝗠𝗫 and 𝗖𝗮𝗰𝗵𝗲𝗕𝗹𝗲𝗻𝗱 from 𝗟𝗠𝗖𝗮𝗰𝗵𝗲 (@tensormesh) were part of the new KV Cache memory stack for agents, and recognized @tensormesh among the 𝗖𝗠𝗫 𝘀𝘁𝗼𝗿𝗮𝗴𝗲 𝗽𝗮𝗿𝘁𝗻𝗲𝗿𝘀. As the stack evolves, @tensormesh keeps building for what's next. ▶️ session Replay: tinyurl.com/GTC-talk

0 replies · 0 reposts · 1 like · 80 views
Kuntai Du retweeted
Junchen Jiang @JunchenJiang
🚀 LMCache has officially been out for 1.5 years now! In that time, LMCache has become the default KV-cache library for open-source LLM inference (CPU offload, P2P sharing, multi-backend storage, vLLM/SGLang integration, and more). As a PyTorch Foundation Ecosystem project, LMCache is now used by enterprise leaders across the industry (GKE, AWS, NVIDIA's Dynamo, llm-d…). 🤔 What's the secret behind it? 🔎 Come see for yourself: arxiv.org/pdf/2510.09665 (a toy sketch of the core KV-cache-sharing idea follows below). ♥️ A huge thank you to our contributors and community; you've shaped what LMCache is today. (@lmcache) #KVCache #LMCache #LLM #vLLM
[image attached]
0 replies · 2 reposts · 16 likes · 1.6K views
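For readers outside the inference world, the core idea behind a KV-cache library like the one celebrated above is worth a sketch: cache the attention KV tensors computed for a token prefix once, then let any later request that shares the prefix skip recomputing (prefilling) it. The class below is a purely hypothetical toy; the names PrefixKVStore, put, and get_longest_prefix are invented for illustration and this is not LMCache's actual API.

```python
# Hypothetical toy sketch of prefix-keyed KV-cache offload/sharing.
# NOT LMCache's real API; it only illustrates the core idea: store the
# KV tensors for a token prefix on CPU, and let any later request that
# shares the prefix reuse them instead of recomputing the prefill.
import hashlib
from typing import Optional

import torch


class PrefixKVStore:
    """Maps a hash of a token prefix to its offloaded KV tensors."""

    def __init__(self) -> None:
        self._store: dict[str, torch.Tensor] = {}

    @staticmethod
    def _key(token_ids: list[int]) -> str:
        return hashlib.sha256(str(token_ids).encode()).hexdigest()

    def put(self, token_ids: list[int], kv: torch.Tensor) -> None:
        # Offload to CPU so GPU memory is freed between requests.
        self._store[self._key(token_ids)] = kv.to("cpu")

    def get_longest_prefix(
        self, token_ids: list[int]
    ) -> tuple[int, Optional[torch.Tensor]]:
        """Return the longest cached prefix length and its KV tensors."""
        for end in range(len(token_ids), 0, -1):
            kv = self._store.get(self._key(token_ids[:end]))
            if kv is not None:
                return end, kv  # caller moves kv back to GPU and resumes
        return 0, None
```

A real system layers multiple backends (CPU RAM, local disk, remote storage) behind a lookup like this and streams tensors back to the GPU asynchronously; the linear prefix scan here is only for readability.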
Kuntai Du @this_will_echo
GitHub is not acting normally... Our LMCache logo suddenly disappeared today; we didn't make any changes. And we can't even clone the repo over SSH. GitHub bad bad.
[image attached]
0 replies · 0 reposts · 0 likes · 202 views
Kuntai Du retweeted
Akshay 🚀 @akshay_pachaar
Meta just solved the biggest problem in RAG!

Most RAG systems waste your money. They retrieve 100 chunks when you only need 10. They force the LLM to process thousands of irrelevant tokens. You pay for compute you don't need.

Meta AI just solved this. They built REFRAG, a new RAG approach that compresses and filters context before it hits the LLM. The results are insane:
- 30.85x faster time-to-first-token
- 16x larger context windows
- 2-4x fewer tokens processed
- Outperforms LLaMA on 16 RAG benchmarks

Here's what makes REFRAG different: traditional RAG dumps everything into the LLM. Every chunk. Every token. Even the irrelevant stuff. REFRAG works at the embedding level instead:
↳ It compresses each chunk into a single embedding
↳ An RL-trained policy scores each chunk for relevance
↳ Only the best chunks get expanded and sent to the LLM
↳ The rest stay compressed or get filtered out entirely
The LLM only processes what matters.

The workflow is straightforward (a sketch of the selection step follows below):
1. Encode your docs and store them in a vector database
2. When a query arrives, retrieve relevant chunks as usual
3. The RL policy evaluates compressed embeddings and picks the best ones
4. Selected chunks are expanded into full token embeddings
5. Rejected chunks stay as single compressed vectors
6. Everything goes to the LLM together

This means you can process 16x more context at 30x the speed with zero accuracy loss. I've shared the link to the paper in the next tweet!
[image attached]
49 replies · 279 reposts · 1.4K likes · 103.2K views
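The selection step described above (score compressed chunk embeddings, expand only the winners) is easy to sketch. The following is a hypothetical illustration, not Meta's REFRAG code: the function name is invented, and plain cosine similarity stands in for the RL-trained relevance policy.

```python
# Hypothetical sketch of REFRAG-style chunk filtering (not Meta's code).
# Each retrieved chunk is represented by one compressed embedding; a
# policy scores them against the query, and only the top-k winners are
# expanded into full token embeddings for the LLM. The rest stay as
# single compressed vectors, so the LLM processes far fewer tokens.
import torch


def select_and_expand(
    query_emb: torch.Tensor,           # (d,)
    chunk_embs: torch.Tensor,          # (n_chunks, d), one per chunk
    chunk_tokens: list[torch.Tensor],  # full token embeddings per chunk
    k: int,
) -> list[torch.Tensor]:
    """Return LLM inputs: expanded top-k chunks, compressed otherwise."""
    # Stand-in for the RL-trained relevance policy: cosine similarity.
    scores = torch.nn.functional.cosine_similarity(
        chunk_embs, query_emb.unsqueeze(0), dim=-1
    )
    top = set(torch.topk(scores, k=min(k, len(chunk_tokens))).indices.tolist())

    inputs = []
    for i, compressed in enumerate(chunk_embs):
        if i in top:
            inputs.append(chunk_tokens[i])          # expand: full tokens
        else:
            inputs.append(compressed.unsqueeze(0))  # keep: one vector
    return inputs
```

The design point is that rejected chunks still reach the LLM, but as one vector each instead of hundreds of tokens, which is where the token-count and time-to-first-token savings come from.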
Kuntai Du retweeted
LMCache Lab @lmcache
Yesterday we hosted our first LMCache office hours! Jiayi Yao, Research Engineer at Tensormesh and one of the top contributors, covered the LMCache architecture, key performance optimizations, and benchmark results, based on the newly published technical report available at arxiv.org/pdf/2510.09665. You can watch the recording here: youtu.be/y14ruG6CNGE?si… Join us for the next LMCache office hours on December 11. Register to get it added to your calendar: lmcache-officehours.zapier.app
[YouTube video]
0 replies · 1 repost · 6 likes · 784 views
Kuntai Du retweeted
vLLM @vllm_project
Thanks to @github for spotlighting vLLM in the Octoverse 2025 report as one of the fastest-growing open-source AI projects this year. 🏆 Top OSS by contributors 🚀 Fastest-growing by contributors 🌱 Attracting the most first-time contributors. Trusted by leading open model communities and industry partners, including NVIDIA, Meta, Red Hat, DeepSeek, Qwen, Moonshot, and others, vLLM has become a preferred engine for efficient LLM inference. With almost 63K stars and 1,800 contributors, this growth belongs to the community. Together, we're building easier, faster, and cheaper LLM serving for everyone. 👉 gh.io/octoverse #vLLM #OpenSource #AIInfra #Octoverse
[image attached]
7 replies · 22 reposts · 119 likes · 10.8K views
Kuntai Du @this_will_echo
Ray Summit 2025 takeaways:
1. Ray is now in the PyTorch Foundation.
2. Ray supports RDMA (finally).
3. Anyscale runtime: basically Ray plus better performance, fault tolerance, observability, etc.
4. Anyscale is building a multi-resource cloud and collaborating with Azure.
#raysummit2025 #vllm
[4 images attached]
0 replies · 0 reposts · 1 like · 160 views
Kuntai Du retweeted
OpenAI Developers @OpenAIDevs
🧑‍💻 gpt-oss-safeguard Hackathon 🧑‍💻 Join us Dec. 8 in SF for the Open Safeguard Hackathon, a collaborative event by OpenAI, ROOST & @HuggingFace, to explore how open models can shape safer digital spaces and the future of open-weight reasoning and online safety. Apply to participate: events.openai.com/gpt-oss-safegu…
9 replies · 26 reposts · 237 likes · 34.9K views
Kuntai Du @this_will_echo
Why does the computer systems community have a paper acceptance rate below 20%? Reason 1: it's an elite community; the core conference, OSDI, was literally a group discussion among elite professors. Reason 2: everyone knows each other, so submitting a subpar paper ruins your reputation and your professor's.
0 replies · 0 reposts · 2 likes · 98 views
Kuntai Du @this_will_echo
💰 Want a CHEAP GPU cloud? 💡 Wanna store ALL your users' history & docs as KV cache to save cost, but can't get open source to run? Try TensorMesh SaaS:
⚡️ $3.09/hr H100
😄 No vendor lock-in
🧠 Any open-source model
🪄 OpenAI-API compatible (a client sketch follows below)
Join the waitlist 🚀: tensormesh.ai/beta-waitlist
[image attached]
0 replies · 3 reposts · 3 likes · 192 views
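Since the service above is advertised as OpenAI-API compatible, the usual pattern is to point the standard OpenAI Python client at it by swapping the base URL. A minimal sketch; the base_url and model name below are placeholders assumed for illustration, not TensorMesh's documented values.

```python
# Minimal sketch of calling an OpenAI-API-compatible endpoint.
# The base_url and model name are placeholders, not TensorMesh's
# documented values; any OpenAI-compatible server follows this pattern.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-tensormesh-endpoint.com/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any hosted open model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```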
Kuntai Du @this_will_echo
The KTransformers paper is in SOSP 2025 🎉 dl.acm.org/doi/pdf/10.114… 💡 Motivation in plain words: need 8x H100 GPUs to serve an MoE model? Move the experts to the CPU instead! 🚀 Intel AMX boosts prefill 8x. 🧠 "Expert deferral" overlaps CPU & GPU perfectly (a sketch of the overlap pattern follows below). 2.8x faster than llama.cpp!
0 replies · 0 reposts · 1 like · 122 views
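The CPU/GPU overlap idea is worth a sketch: run the CPU-resident experts in a worker thread while the GPU-resident experts execute, then combine the partial outputs. This is a hypothetical illustration of the pattern, not KTransformers' implementation; the moe_layer function and the expert lists are stand-ins, and it assumes both lists are non-empty.

```python
# Hypothetical sketch of CPU/GPU overlap for MoE experts (NOT the
# KTransformers implementation). Hot experts stay on GPU; cold experts
# run on CPU in a worker thread while the GPU proceeds, and the two
# partial outputs are summed once both finish. PyTorch CPU kernels
# release the GIL, so the thread gives real concurrency here.
from concurrent.futures import ThreadPoolExecutor

import torch


def moe_layer(
    x: torch.Tensor,                     # (tokens, hidden) on the GPU
    gpu_experts: list[torch.nn.Module],  # experts resident on GPU
    cpu_experts: list[torch.nn.Module],  # experts offloaded to CPU
    pool: ThreadPoolExecutor,
) -> torch.Tensor:
    def run_cpu() -> torch.Tensor:
        x_cpu = x.to("cpu")  # ship activations, not expert weights
        return sum(e(x_cpu) for e in cpu_experts)

    # Launch the CPU experts asynchronously, then immediately run the
    # GPU experts; the two proceed concurrently instead of serially.
    cpu_future = pool.submit(run_cpu)
    gpu_out = sum(e(x) for e in gpu_experts)
    return gpu_out + cpu_future.result().to(x.device)
```

The paper's "expert deferral" is more sophisticated than this; the sketch only shows why running the two devices concurrently hides the CPU latency instead of adding it.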
Kuntai Du @this_will_echo
#ByteDance's Seed team published an LLM communication-debugging paper at #SOSP 🚀 The paper (arxiv.org/abs/2509.03018) modifies NCCL with near-zero overhead: ⚡ detects 90% of issues in 15s ⚡ finds root causes in 20s. A must-read for anyone training large models. #LLM #NCCL #SOSP
0 replies · 0 reposts · 2 likes · 255 views
Kuntai Du @this_will_echo
LMCache at the Redis release event! Plus a random creation by Kobe from @lmcache
[3 images attached]
0 replies · 1 repost · 2 likes · 154 views
Kuntai Du @this_will_echo
Legit singing with jaw dropped
0 replies · 0 reposts · 3 likes · 231 views