KVCache.AI

37 posts

@KVCache_AI

Hi, this is the official account of https://t.co/EO7MXLjRIs. We build systems for efficient LLM serving, including KTransformers and Mooncake.

Beijing · Joined August 2018
99 Following · 322 Followers
KVCache.AI @KVCache_AI
Proud to collaborate with @Alibaba_Qwen, @lightseekorg, @NVIDIAAI, @PyTorch, and @tri_dao on this milestone 🚀 Together, we helped push Qwen3.5 on the TokenSpeed inference engine to a record-breaking 580 tokens/sec for agentic workloads on NVIDIA GPUs. From KV cache systems and runtime infrastructure to kernels, scheduling, and benchmarking, this was a true cross-stack co-design effort for high-performance open-source LLM inference. Full PyTorch blog 👇 pytorch.org/blog/up-to-580…
PyTorch @PyTorch

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 bit.ly/4uGUvIS This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

1
4
9
1.1K
KVCache.AI @KVCache_AI
@ursusspeculus Of course! Cohere Command series are now supported. Feel free to give it a try.
0
0
2
41
KVCache.AI @KVCache_AI
🚀 We just launched the open-source KV Cache Size Calculator by KVCache.ai! Calculate KV cache size for mainstream LLMs with flexible precision settings and detailed breakdowns. Supports DeepSeek, GLM, Kimi, Qwen3 and MiniMax. Try it now: kvcache.ai/tools/kv-cache…
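For readers who want a feel for what such a calculation involves, here is a minimal sketch of the textbook KV cache formula for standard multi-head or grouped-query attention. It is not the calculator's actual code. The `AttentionConfig` class, its field names, and the example numbers are all hypothetical, and architectures with compressed or latent KV representations need a different per-token formula.

```python
from dataclasses import dataclass


@dataclass
class AttentionConfig:
    """Hypothetical model description; field names are illustrative only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(cfg: AttentionConfig, num_tokens: int,
                   bytes_per_element: float = 2.0) -> float:
    """KV cache size for standard multi-head or grouped-query attention.

    Each token stores one key vector and one value vector per KV head
    per layer, hence the leading factor of 2.
    """
    per_token = (2 * cfg.num_layers * cfg.num_kv_heads
                 * cfg.head_dim * bytes_per_element)
    return per_token * num_tokens


# Made-up configuration, not taken from any specific model card.
example = AttentionConfig(num_layers=32, num_kv_heads=8, head_dim=128)

# 16-bit precision means 2 bytes per element.
size = kv_cache_bytes(example, num_tokens=8192, bytes_per_element=2.0)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

Changing `bytes_per_element` models the precision setting: halving it (for example, 8-bit storage) halves the cache size for the same context length.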
9
18
137
46.6K
KVCache.AI @KVCache_AI
@zaafonin Thanks for the feedback! Qwen 3.5/3.6 and Gemma 4 series are now supported. Feel free to give it a try.
0
0
2
66
zaafonin @zaafonin
@KVCache_AI Cool but really vibecodey... any info on Qwen3.5/3.6 and Gemma 4 models?
1
0
0
199
KVCache.AI @KVCache_AI
@PurplefinNep Thanks for the feedback! Qwen 3.5 is now supported. Feel free to give it a try.
0
0
1
57
KVCache.AI @KVCache_AI
Thanks so much for using the KV cache size calculator and for all the great suggestions! We’ve seen the requests for more models. We’ll do our best to add support as soon as possible. Really appreciate all the feedback!
0
0
2
209
KVCache.AI @KVCache_AI
🚀 Mooncake is powering agentic workload serving with @vllm_project. Agentic traces reach 80K+ tokens with highly reusable prefixes. By turning KV cache into a distributed, reusable resource, we eliminate redundant compute and unlock massive gains: 🚀 3.8x higher throughput, ⚡ 46x lower P50 TTFT, 🌐 near-linear scaling to 60 GB200 GPUs at >95% hit rate. Built in close collaboration with @Inferact 🤝
vLLM @vllm_project

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake. Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV cache pool, vLLM gets: 🚀 3.8x higher throughput ⚡ 46x lower P50 TTFT ⏱️ 8.6x lower E2E latency 📈 Cache hit rate 1.7% -> 92.2% 🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate 🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI 📖 Read more: vllm.ai/blog/mooncake-… 🧵👇
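To illustrate why a shared pool helps when prefixes repeat across requests, here is a toy sketch of prefix reuse with chained block hashes. It is a generic illustration, not the vLLM or Mooncake Store implementation. The `SharedPrefixPool` class, the `block_hashes` helper, and the block size are hypothetical, and the pool only tracks which blocks exist rather than storing any KV tensors.

```python
import hashlib


def block_hashes(tokens: list[int], block_size: int = 4) -> list[str]:
    """Hash each full block of tokens, chained with the hash of all earlier blocks.

    Chaining means two blocks match only if their entire prefixes match.
    """
    hashes = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class SharedPrefixPool:
    """Toy stand-in for a KV cache pool shared by several serving instances."""

    def __init__(self) -> None:
        self._blocks: set[str] = set()

    def lookup_and_store(self, tokens: list[int], block_size: int = 4) -> int:
        """Return how many leading tokens were already cached, then cache all blocks."""
        hashes = block_hashes(tokens, block_size)
        hit_blocks = 0
        for h in hashes:
            if h not in self._blocks:
                break  # the reusable prefix ends at the first miss
            hit_blocks += 1
        self._blocks.update(hashes)
        return hit_blocks * block_size


pool = SharedPrefixPool()
first_turn = list(range(12))                    # three full blocks
second_turn = list(range(8)) + [99, 98, 97, 96]  # shares the first two blocks

print(pool.lookup_and_store(first_turn))   # prints 0 (cold pool)
print(pool.lookup_and_store(second_turn))  # prints 8 (two blocks reused)
```

Because the pool is shared, the second request finds the cached prefix regardless of which instance served the first one, which is the situation where per-instance caches and routing misses lose hits.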

0
2
8
529
KVCache.AI @KVCache_AI
One of the biggest challenges with large-scale EP deployments is the expanding blast radius. Fault tolerance and recovery capabilities are critical for supporting truly large-scale EP, and they are also among the most difficult parts to implement. To address this, the Mooncake and SGLang teams jointly developed Elastic EP. If you’re interested in EP deployments, feel free to give it a try! Details: lmsys.org/blog/2026-03-2…
0
1
4
209
KVCache.AI @KVCache_AI
We’re excited to share our experience in improving the user experience of OpenClaw. By leveraging SGLang HiCache and Mooncake, we not only reduced fast-path latency, but also significantly improved TTFT tail latency. 🔗 Read our latest blog for more details: kvcache.ai/blog/openclaw-…
0
0
2
132
KVCache.AI @KVCache_AI
Great work! Scalable speculative decoding training is an important step forward as models continue to grow in size and context length. Excited to see Mooncake play a key role here by providing efficient and reliable streaming of hidden states, making fully disaggregated inference and training pipelines practical.
PyTorch @PyTorch

We’re excited to introduce TorchSpec, a torch-native framework for scalable speculative decoding training developed by the TorchSpec and Mooncake teams. By streaming hidden states from inference engines to training workers via Mooncake, TorchSpec enables fully disaggregated pipelines where inference and training scale independently. 🔗 Read our latest blog from TorchSpec & Mooncake teams: pytorch.org/blog/torchspec… @lightseekorg @KT_Project_AI #PyTorch #TorchSpec #Mooncake #OpenSourceAI

0
3
5
743
KVCache.AI @KVCache_AI
Huge congratulations to the @lmsysorg SGLang team and @nvidia on these impressive GB300 results! 🚀 Powerful hardware + excellent software optimization is exactly how you unlock the full potential of long-context inference. Glad that Mooncake, as the KV cache transfer component, could contribute to this milestone. Excited to see what’s next!
LMSYS Org @lmsysorg

🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference! Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving: ⚡ 226 TPS/GPU peak throughput (1.53X vs GB200) 🧠 1.87X TPS/User gain with MTP under matched throughput 💾 1.6X higher decode batch size via GB300's 288GB HBM3e ⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP 🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

1
1
7
439
KVCache.AI @KVCache_AI
Huge congrats to MiniMax: this awesome new model is now open source! KTransformers is happy to provide day-0 support for M2.5. You can use KTransformers to enjoy the cutting-edge capabilities of M2.5 with only one 5090 + 300GB DRAM!
MiniMax (official) @MiniMax_AI

MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows. Hugging Face: huggingface.co/MiniMaxAI/Mini… GitHub: github.com/MiniMax-AI/Min… Coding Plan: platform.minimax.io/subscribe/codi… Intelligence with Everyone

1
1
5
254
PyTorch @PyTorch
We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill decode disaggregation, global KVCache reuse, elastic expert parallelism, and serves as a fault tolerant PyTorch distributed backend. 🔗 hubs.la/Q042Zf9N0 #PyTorch #OpenSourceAI #LLM #AIInfrastructure
7
52
403
105K
KVCache.AI @KVCache_AI
🚀 Exciting news! Mooncake is now officially part of the PyTorch Ecosystem! Mooncake brings high-performance KVCache transfer and storage to PyTorch-native LLM serving, enabling better prefill-decode disaggregation, global KVCache reuse, elastic MoE support, and fault-tolerant PyTorch distributed backends. Mooncake is already integrated with engines like SGLang, vLLM, and TensorRT-LLM, and we are thrilled to build the future of scalable LLM serving together. 👉 Read more: pytorch.org/blog/mooncake-… #Mooncake #PyTorch #LLM #OpenSourceAI
PyTorch @PyTorch

We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill decode disaggregation, global KVCache reuse, elastic expert parallelism, and serves as a fault tolerant PyTorch distributed backend. 🔗 hubs.la/Q042Zf9N0 #PyTorch #OpenSourceAI #LLM #AIInfrastructure

0
1
3
179