
Kuntai Du
@this_will_echo
Chief Scientist | Committer of vLLM / LMCache / Production Stack

GOBLIN MODE: ON
[ASCII goblin art]
[!] Goblin breach detected // LMCache docs > Click to make them go away ▓▒░ docs.lmcache.ai

Cutting a sweet mango with a machine

llm-d published a new post on KServe + llm-d + vLLM for production LLM inference on Kubernetes. Authors from @RedHat and Tesla describe how the stack addresses routing, customization, and day-2 operational challenges, citing 3x higher output tokens/s and 2x lower TTFT in one deployment after enabling prefix-cache-aware routing. By Yuan Tang, Scott Cabrinha, Robert Shaw, and Sai Krishna @CloudNativeFdn 🔗 @_llm_d_ llm-d.ai/blog/productio… #vLLM #KServe #Kubernetes #LLMOps #OpenSource
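For readers curious what prefix-cache-aware routing means mechanically, here is a minimal sketch. This is an illustration of the general idea only, not llm-d's actual scheduler; `BLOCK_SIZE`, `block_hashes`, and `pick_replica` are hypothetical names. The gist: hash the prompt's token blocks and send the request to the replica whose KV cache already covers the longest prefix.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical value)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes of each full prefix block, so equal prefixes hash equally."""
    hashes, prev = [], ""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def pick_replica(token_ids: list[int], replica_caches: dict[str, set[str]]) -> str:
    """Route to the replica whose cache covers the longest prompt prefix."""
    prompt_hashes = block_hashes(token_ids)
    def prefix_len(cache: set[str]) -> int:
        n = 0
        for h in prompt_hashes:
            if h not in cache:
                break
            n += 1
        return n
    return max(replica_caches, key=lambda r: prefix_len(replica_caches[r]))

replicas = {"pod-a": set(), "pod-b": set()}
prompt = list(range(64))
replicas["pod-b"].update(block_hashes(prompt)[:2])  # pod-b already holds 2 blocks
print(pick_replica(prompt, replicas))               # -> 'pod-b'
```

Hitting a warm replica means the prefill for the cached prefix is skipped, which is where the reported TTFT and throughput gains come from.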


ReasoningBank, a novel agent memory framework, enables LLM agents to continuously learn from both successful & failed experiences. Our evaluation shows that it enhances agent effectiveness, boosting success rates and efficiency. Learn more: goo.gle/4dWrPGb
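A rough sketch of the idea as I read it. Hedged: this is not the ReasoningBank code, and `MemoryItem`, `MemoryBank`, and the word-overlap retrieval are stand-ins for illustration. The core loop: distill each trajectory, successful or failed, into a reusable lesson, then retrieve relevant lessons when a similar task arrives.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    task: str
    strategy: str   # distilled lesson, e.g. "confirm dates before paying"
    success: bool   # lessons are kept from failures too

@dataclass
class MemoryBank:
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, item: MemoryItem) -> None:
        self.items.append(item)

    def retrieve(self, task: str, k: int = 3) -> list[MemoryItem]:
        # Toy relevance: word overlap; a real system would use embeddings.
        words = set(task.lower().split())
        scored = sorted(
            self.items,
            key=lambda m: len(words & set(m.task.lower().split())),
            reverse=True,
        )
        return scored[:k]

bank = MemoryBank()
bank.add(MemoryItem("book a flight", "confirm dates before paying", True))
bank.add(MemoryItem("book a hotel", "don't assume login persists", False))
print([m.strategy for m in bank.retrieve("book a flight to NYC")])
```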

A company with 60+ accounts just had its entire AI infrastructure taken offline by its provider. No reason was given; the only recourse offered was an appeal path via a Google Form. This is not a one-off: we have mapped the pattern across every major closed-weight provider, along with what enterprise teams can do about it. 📖 Read the full blog: tensormesh.ai/blog-posts/ent… 🚀 Try Tensormesh with $100 in free GPU Credits: app.tensormesh.ai/login?logged_o…

Someone built a transparent Mario game that runs OVER your IDE, so you can play while waiting for Copilot to write code.

⚡ Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀
A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
🔥 Agentic coding on par with models 10x its active size
📷 Strong multimodal perception and reasoning ability
🧠 Multimodal thinking + non-thinking modes
Efficient. Powerful. Versatile. Try it now 👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai
HuggingFace: huggingface.co/Qwen/Qwen3.6-3…
ModelScope: modelscope.cn/models/Qwen/Qw…
API ('Qwen3.6-Flash' on Model Studio): Coming soon~ Stay tuned
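The "35B total, 3B active" split is what sparse MoE routing buys you: a gate selects a few experts per token, so only a fraction of the parameters participate in each forward pass. A generic top-k gating sketch in NumPy (illustrative only; the shapes, ReLU experts, and softmax-over-selected-logits are my assumptions, not Qwen's architecture):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse MoE layer: route each token to its top-k experts.

    x:       (tokens, d_model) activations
    gate_w:  (d_model, n_experts) router weights
    experts: list of (w_in, w_out) per-expert MLP weights
    Only k of n_experts run per token, which is how a model with 35B
    total parameters can use only ~3B 'active' ones per forward pass.
    """
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        weights = np.exp(sel - sel.max())        # softmax over selected experts
        weights /= weights.sum()
        for w, e in zip(weights, topk[t]):
            w_in, w_out = experts[e]
            out[t] += w * (np.maximum(x[t] @ w_in, 0) @ w_out)  # ReLU MLP expert
    return out

rng = np.random.default_rng(0)
d, n, h = 8, 4, 16
x = rng.normal(size=(3, d))
gate_w = rng.normal(size=(d, n))
experts = [(rng.normal(size=(d, h)), rng.normal(size=(h, d))) for _ in range(n)]
print(moe_forward(x, gate_w, experts).shape)  # (3, 8)
```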

Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score
You can get a verification score in a single sampling pass per candidate pair.
Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io
Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
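The expected-score step is easy to reproduce if your serving stack exposes per-token log-probs. A minimal sketch (my reconstruction from the thread, not the authors' code; the example log-probs are made up): restrict to the k rating tokens, renormalize, and take the expectation.

```python
import math

def expected_score(rank_logprobs: dict[str, float]) -> float:
    """Turn a verifier's log-probs over rating tokens '1'..'k' into an expected score.

    rank_logprobs maps each rating token to its log-prob at the position
    where the verifier emits its rating. Renormalizing over just those k
    tokens and taking the expectation gives a smoother signal than the
    single sampled rating token.
    """
    probs = {tok: math.exp(lp) for tok, lp in rank_logprobs.items()}
    z = sum(probs.values())  # renormalize over the rating tokens only
    return sum(int(tok) * p / z for tok, p in probs.items())

# e.g. verifier puts most mass on '4' but some on '5' -> score ~4.06
print(expected_score({"1": -6.0, "2": -4.5, "3": -2.0, "4": -0.4, "5": -1.5}))
```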

GPU memory alone won’t carry the next generation of LLM serving. At #RaySummit, our Chief Scientist @this_will_echo shared how #LMCache offloads KV Cache across CPU RAM, local disk, Redis, and S3, while enabling cache reuse beyond basic prefix caching. Watch the full talk on YouTube: 👉🏻 youtube.com/watch?v=aVpkkV… #RaySummit #LMCache #Tensormesh #KVCache
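To make the tiering concrete, here is a toy two-tier cache (a sketch of the general pattern, not LMCache's API; the class and method names are invented): keep hot entries in CPU RAM with LRU eviction, spill cold ones to disk, and promote on hit. A remote tier like Redis or S3 would slot in the same way below disk.

```python
import os
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Toy tiered KV cache: hot entries in CPU RAM, overflow on local disk.
    A real system adds remote tiers (Redis, S3) and stores actual attention
    KV tensors; here values are opaque bytes for simplicity."""

    def __init__(self, ram_capacity: int = 2, disk_dir: str | None = None):
        self.ram: OrderedDict[str, bytes] = OrderedDict()  # LRU order
        self.ram_capacity = ram_capacity
        self.disk_dir = disk_dir or tempfile.mkdtemp()

    def _disk_path(self, key: str) -> str:
        return os.path.join(self.disk_dir, f"{key}.kv")

    def put(self, key: str, value: bytes) -> None:
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_capacity:   # evict LRU entry to disk
            old_key, old_val = self.ram.popitem(last=False)
            with open(self._disk_path(old_key), "wb") as f:
                f.write(old_val)

    def get(self, key: str) -> bytes | None:
        if key in self.ram:                     # tier 1: CPU RAM
            self.ram.move_to_end(key)
            return self.ram[key]
        path = self._disk_path(key)
        if os.path.exists(path):                # tier 2: local disk
            with open(path, "rb") as f:
                value = f.read()
            self.put(key, value)                # promote back to RAM
            return value
        return None                             # tier 3 (remote) omitted

cache = TieredKVCache(ram_capacity=2)
for i in range(3):
    cache.put(f"chunk{i}", f"kv-{i}".encode())
print(cache.get("chunk0"))  # b'kv-0', served from disk after eviction
```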

Some former colleagues from @lmcache shared this photo from the GTC Keynote. I am honestly surprised by how fast the team has been growing. (We were a research lab on 2 A40 GPUs in 2023!) btw I think they are hiring LLM hackers (or product hackers, I am not sure 🤪; you should just check with @JunchenJiang @ChengYihuaA) #GTC #LLM #Inference #Nvidia #LMCache #KVCache


"𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸" — Kevin Deierling, SVP Networking #NVIDIA At his #GTC talk last week, he highlighted 𝗖𝗠𝗫 and 𝗖𝗮𝗰𝗵𝗲𝗕𝗹𝗲𝗻𝗱 from 𝗟𝗠𝗖𝗮𝗰𝗵𝗲 (@tensormesh) were part of the new KV Cache memory stack for agents, and recognized @tensormesh among the 𝗖𝗠𝗫 𝘀𝘁𝗼𝗿𝗮𝗴𝗲 𝗽𝗮𝗿𝘁𝗻𝗲𝗿𝘀. As the stack evolves, @tensormesh keeps building for what's next. ▶️ session Replay: tinyurl.com/GTC-talk


Happy 2026 🥂 First post of the year: a technical benchmark. In a joint study with @tensormesh, we achieved:
- 4× TTFT improvement
- Prefix cache hit rate >50%
using SSD-augmented KV cache on realistic multi-turn LLM traffic. Full write-up on GMI Cloud: gmicloud.ai/blog/gmi-cloud…
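For anyone wanting to reproduce the TTFT side of such a benchmark: TTFT is just wall-clock time from sending the request to receiving the first streamed token. A hedged sketch against an OpenAI-compatible endpoint (the base URL and model name are placeholders, not GMI Cloud's setup):

```python
import time
from openai import OpenAI  # pip install openai; any OpenAI-compatible server works

# Placeholder endpoint; point this at your own serving stack.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft(prompt: str, model: str = "my-model") -> float:
    """Time from request send to the first streamed content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first token arrived
    return float("nan")

# To see the effect of a prefix cache, send the same long shared prefix
# twice and compare the cold vs. warm measurements.
```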


