KVCache.AI

59 posts

@KVCache_AI

Hi, this is the official https://t.co/EO7MXLjRIs account. We build systems for efficient LLM serving, including KTransformers and Mooncake.

Beijing · Joined August 2018

107 Following · 792 Followers

KVCache.AI@KVCache_AI·11h

@baboonAI4S Yes, this is our official account. Thanks for checking!

Claude Qin@baboonAI4S·17h

@KVCache_AI Is this really the official account?

KVCache.AI@KVCache_AI·1d

We're excited to open-source AgentENV! Built for large-scale Agentic RL, AgentENV provides secure Firecracker-based execution environments while reducing infrastructure costs by up to 96.8%. Already powering the Agentic RL training of advanced models, including Kimi K3 from @Kimi_Moonshot. 📖 Learn more: kvcache.ai/blog/agentenv-… ⭐ Try it on GitHub: github.com/kvcache-ai/Age… #OpenSource #AI #AgenticRL #LLM

4.2K

KVCache.AI@KVCache_AI·11h

@teortaxesTex @Kimi_Moonshot Yes! K3 is supported in the KV cache calculator. Feel free to give it a try!

12.1K

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·1d

@KVCache_AI @Kimi_Moonshot can we see K3 on kv cache calculator?

879

KVCache.AI@KVCache_AI·1d

Excited to be part of vLLM’s Day-0 support for Kimi K3! 🚀 Mooncake powers production-scale prefill/decode disaggregation in the vLLM serving stack, enabling efficient KV cache sharing for agentic workloads and scalable deployment of Kimi K3 from @Kimi_Moonshot. Huge congratulations to @vllm_project and everyone involved in bringing the largest open-source model into production! Learn more: vllm.ai/blog/2026-07-2… #KimiK3 #vLLM #Mooncake #LLMInference #OpenSourceAI

2.6K

KVCache.AI@KVCache_AI·21h

Following Mooncake, AgentENV marks the next project we've built and open-sourced together, supporting the infrastructure behind Kimi K3’s agentic RL training. Grateful to the @Kimi_Moonshot team for the close collaboration, and looking forward to building more open AI infrastructure together.

Kimi.ai@Kimi_Moonshot

We've open-sourced AgentENV in collaboration with kvcache-ai. AgentENV is a distributed system for running agent environments at scale. Its components power agentic RL training for Kimi K3, with fast snapshot, resume, and fork support for large-scale parallel agent workflows. Explore on GitHub: github.com/kvcache-ai/Age…

2.7K

KVCache.AI reposted

LightSeek Foundation@lightseekorg·1d

TokenSpeed is only two months old, yet over the past two weeks we've had the privilege of being the Day 0 inference partner for both @thinkymachines's Inkling and @Kimi_Moonshot's Kimi K3. Grateful to both teams for their trust, and proud of what our small, fast-moving team has accomplished so far. 🚀 #Kimi #K3 #Inkling huggingface.co/moonshotai/Kim… thinkingmachines.ai/news/introduci…

KVCache.AI@KVCache_AI·1d

Great to see TokenSpeed delivering Day-0 support for Kimi K3! 🚀 Mooncake powers the data plane behind TokenSpeed's disaggregated serving stack, transferring multimodal embeddings and KV cache across serving stages. Together with TokenSpeed's unified flat KV layout, this enables scalable LLM serving for next-generation model architectures. Congratulations to the @LightSeekOrg team and @Kimi_Moonshot on this impressive release. Looking forward to enabling even more next-generation model architectures together. More details: lightseek.org/blog/tokenspee… #KimiK3 #TokenSpeed #Mooncake #LLMInference #OpenSourceAI

1.5K

KVCache.AI@KVCache_AI·1d

Proud to be part of the Day-0 ecosystem for Kimi K3! 🚀 Mooncake powers SGLang's Day-0 support for Kimi K3, providing distributed cache storage for both KV cache and KDA state checkpoints, enabling efficient prefix cache reuse across instances while keeping the new KDA architecture production-ready. Huge congratulations to the @lmsysorg team and @Kimi_Moonshot on an incredible release. Looking forward to seeing what the community builds with K3!

LMSYS Org@lmsysorg

SGLang day-0 speed on Kimi K3: 423 tok/s (measured on gsm8k), plus RL support ready in Miles @radixark! How the largest open-source model runs this fast: we natively implemented and deeply optimized K3’s new architecture with fused KDA decode kernels, DP attention, DSpark, PD disagg, and KDA-aware prefix caching. We've passed Kimi Vendor Verifier and are ready for production! Thanks to @Kimi_Moonshot, @nvidia, @AMD, @KVCache_AI, @modal, and @baseten for building this with us, and to @googlecloud, @nebiustf, @fal, @digitalocean, @runpod, @DeepInfra and @gmi_cloud for serving K3 on SGLang. Blog, cookbook, benchmarks in the comments. P.S. This demo video? Kimi K3 made it itself. Play the game 👇

2.7K

KVCache.AI@KVCache_AI·3d

@teortaxesTex Exactly. MUSA architecture from Moore Threads.

183

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·3d

> • Distributed SSD-backed KV cache pooling
everyone doing this now
> • Published wheel packages for AWS EFA and MUSA
MUSA? As in Moore Threads?

KVCache.AI@KVCache_AI

🚀 Mooncake v0.3.12 is out! This release brings major upgrades across Mooncake Store, Transfer Engine, Expert Parallelism, and platform support. A huge thank you to our amazing community: we're thrilled to welcome 53 new contributors in this release! ❤️
✨ Highlights:
• Distributed SSD-backed KV cache pooling
• Smarter transfer scheduling with intent- & policy-based routing and deadline-aware QoS
• Broader platform support: TPU/PJRT, HPE Slingshot/CXI, AMD HIP/RDMA, and Sunrise
• Published wheel packages for AWS EFA and MUSA
• Official Docker Hub images
• Many performance optimizations and reliability fixes throughout the project
#Mooncake #AIInfrastructure #LLM #OpenSource

5.8K

KVCache.AI@KVCache_AI·3d

13.1K

KVCache.AI@KVCache_AI·Jul 15

🚀 Mooncake now supports SSD Offloading for KV Cache. As agentic workloads become the norm, KV cache lifetimes are getting much longer, but keeping everything in DRAM simply doesn't scale. With Mooncake's distributed SSD tier, you can:
✅ Expand KV cache capacity far beyond memory.
✅ Preserve long-tail KV caches instead of evicting them.
✅ Increase KV cache hit rates. Even small gains translate nonlinearly into significant prefill speedups.
Read more: kvcache.ai/blog/scaling-k…
#LLM #Inference #KVCache #AIInfrastructure #Mooncake

2.1K
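
A back-of-the-envelope sketch of why small hit-rate gains compound. It assumes prefill time is proportional to the number of uncached prompt tokens, which ignores attention's superlinear cost and the time needed to load cached KV from SSD. The function name is illustrative and is not part of Mooncake.

```python
def prefill_speedup(hit_rate: float) -> float:
    """Speedup over no caching when `hit_rate` of prompt tokens come from cache.

    Illustrative model only: prefill time is assumed linear in uncached tokens.
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


if __name__ == "__main__":
    # Print the modelled speedup at a few hit rates.
    for h in (0.50, 0.80, 0.90, 0.95):
        print(f"hit rate {h:.0%}: {prefill_speedup(h):.1f}x")
```

Under this simplified model, moving from a 90% to a 95% hit rate roughly doubles the speedup (about 10x to about 20x), while moving from 50% to 55% barely changes it.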

KVCache.AI@KVCache_AI·Jul 14

Thanks to @vllm_project, Mooncake is now available in the official vLLM Recipes! You can now deploy vLLM with Mooncake KV Store directly from the official Recipes page. The generated recipe includes the required configurations for both Mooncake and vLLM, making it much easier to deploy a distributed KV cache pool. Give it a try: recipes.vllm.ai/deepseek-ai/De… #Mooncake #vLLM #LLM #KVCache #OpenSource

1.9K

KVCache.AI@KVCache_AI·Jul 7

It's exciting to see Mooncake being used in the latest 1T-scale RL infrastructure from @PrimeIntellect. 🚀 As LLM systems continue to scale, efficient KV cache management is becoming critical infrastructure. Prime Intellect highlights Mooncake Store as a centralized KV cache backend that pools RAM/disk across nodes into one large cache, accessible by any inference worker from any node. This provides a significant advantage, especially when using more sophisticated routing strategies. Read more: primeintellect.ai/blog/rl-at-1t-… #LLM #RL #AIInfrastructure #KVCache

Prime Intellect@PrimeIntellect

One Mooncake store pools KV cache across all nodes, so any worker can reuse any prefix. The router picks workers by a score over load, queue depth, KV usage and prefix overlap. You get cross-replica cache hits with balanced routing across the whole deployment.
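
A minimal sketch of a router that scores workers over the four signals named above. The post does not give the actual formula, so the field names, the weights, and the `max_queue` normalisation are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    load: float            # fraction of compute in use, 0..1
    queue_depth: int       # requests currently waiting
    kv_usage: float        # fraction of KV cache memory in use, 0..1
    prefix_overlap: float  # fraction of the prompt with reusable cached KV, 0..1


def score(w: Worker, max_queue: int = 32) -> float:
    """Higher is better. The weights are made up for illustration."""
    queue_term = min(w.queue_depth / max_queue, 1.0)
    return 2.0 * w.prefix_overlap - 1.0 * w.load - 1.0 * queue_term - 0.5 * w.kv_usage


def pick_worker(workers: list[Worker]) -> Worker:
    """Return the worker with the highest score."""
    return max(workers, key=score)


if __name__ == "__main__":
    workers = [
        Worker(name="a", load=0.9, queue_depth=20, kv_usage=0.8, prefix_overlap=0.9),
        Worker(name="b", load=0.2, queue_depth=2, kv_usage=0.3, prefix_overlap=0.6),
    ]
    print(pick_worker(workers).name)
```

With these made-up weights, the lightly loaded worker "b" wins even though "a" has more prefix overlap, which is the kind of trade-off between cache reuse and balance that the post describes.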

KVCache.AI@KVCache_AI·Jul 3

Congrats to @vLLM_Project, and excited to see Mooncake powering the full online DSpark training pipeline on GB300 NVL72 🚀 Mooncake efficiently moves hidden states from vLLM nodes to Speculators through RDMA, eliminating the need for massive hidden-state storage in offline training. Excited to keep pushing the limits together!

Michael Goin@mgoin_

DSpark update: Turns out with a little Speculators+Mooncake, I'm able to scale training on GB300 NVL72! 9 vLLM nodes serve the full GLM 5.2 FP8 verifier -> Mooncake RDMA store -> 6 nodes train the DSpark with FSDP (DP=24). 125k prefill tok/s, 1.5 steps/s, full online training :)

4.3K

KVCache.AI@KVCache_AI·Jul 2

🎉 Excited to see Mooncake highlighted as a core middleware in the AI infrastructure stack. 🚀 Proud to contribute to the open-source ecosystem that's making LLM serving faster and more scalable. Try Mooncake: github.com/kvcache-ai/Moo…

NVIDIA@nvidia

NVIDIA inference software keeps driving down token costs, long after AI infrastructure is deployed. ⚡ In just one month on NVIDIA Blackwell, software optimizations improved DeepSeek V4 performance by up to 5×, reducing token costs to roughly one-fifth of previous levels. NVIDIA's integrated inference software stack compounds improvements across runtimes, kernels, networking, and hardware, delivering up to 20× higher throughput on the same GPU. Co-designed with NVIDIA GPUs, CPUs, networking, and systems, and powered by CUDA-native open source frameworks, NVIDIA's inference software stack ensures new model breakthroughs and optimizations run on NVIDIA from day zero, and keep improving throughput and lowering cost after deployment. See how @Baseten, @Cognition, @DeepInfra, @togethercompute, and @Cursor_ai are turning continuous software innovation into lower cost per token: nvda.ws/4eRT43m

1.4K

KVCache.AI@KVCache_AI·Jun 26

🧠 How much KV cache is enough for LLM serving?
📉 Too little → poor reuse.
📦 Too much → wasted storage.
⚡ The sweet spot is where the marginal prefill speedup starts to flatten.
🛠️ In this blog, we show a simple and practical way to find that sweet spot for KV cache capacity planning, using our online KV Cache Hit Rate Simulator.
Read more 👉 kvcache.ai/blog/calculate…
#KVCache #LLMServing #Optimization

1.7K
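
One way to read "where the marginal speedup flattens" is sketched below. This is not the method from the blog: the speedup model (1 / (1 - hit rate)), the threshold, and the sample curve are all assumptions made for illustration.

```python
def sweet_spot(curve, min_gain_per_gib=0.001):
    """Return the capacity after which extra storage stops paying off.

    `curve` is a list of (capacity_gib, hit_rate) pairs sorted by capacity,
    with every hit_rate below 1. Speedup is modelled as 1 / (1 - hit_rate);
    `min_gain_per_gib` is an arbitrary cut-off on speedup gained per extra GiB.
    """
    speedups = [(cap, 1.0 / (1.0 - hit)) for cap, hit in curve]
    for (cap_a, s_a), (cap_b, s_b) in zip(speedups, speedups[1:]):
        marginal = (s_b - s_a) / (cap_b - cap_a)
        if marginal < min_gain_per_gib:
            return cap_a
    # The curve never flattened within the measured range.
    return speedups[-1][0]


# Made-up curve for illustration, not measured data.
example = [(64, 0.30), (128, 0.50), (256, 0.60), (512, 0.62), (1024, 0.63)]
print(sweet_spot(example))  # prints 256
```

In practice the (capacity, hit rate) points would come from replaying a real trace, and the cut-off would reflect the cost of storage relative to the value of faster prefill.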

KVCache.AI@KVCache_AI·Jun 25

🚀 Mooncake is now officially available on Docker Hub! Thanks to the @Docker Sponsored Open Source program, pull rate limits are removed for Mooncake images, providing a smoother and more reliable experience for users.
Pull the image and get started:
docker pull kvcacheai/mooncake:latest
Docker Hub: hub.docker.com/r/kvcacheai/mo…
GitHub: github.com/kvcache-ai/Moo…
#Mooncake #Docker #LLMServing

508

KVCache.AI@KVCache_AI·Jun 18

GLM-5.2 vs GLM-5.1: What changed in the KV Cache? 🚀
1M-context cache comparison:
GLM-5.1: 41.8 GiB KV cache + 9.3 GiB Indexer cache
GLM-5.2: 41.8 GiB KV cache + 2.5 GiB Indexer cache
⚡ Key takeaway: GLM-5.2 uses a shared index, making long-context inference more practical.
Explore KV Cache with our calculator: kvcache.ai/tools/kv-cache…
#GLM #KVCache #LongContext #LLMInference

46.9K
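
The totals implied by the figures quoted above can be worked out directly. The snippet only does arithmetic on the numbers in the post; it does not derive them from model configurations.

```python
# Per-request cache sizes at 1M context, in GiB, as quoted in the post.
caches = {
    "GLM-5.1": {"kv": 41.8, "indexer": 9.3},
    "GLM-5.2": {"kv": 41.8, "indexer": 2.5},
}


def total_gib(model: str) -> float:
    """KV cache plus indexer cache for one model."""
    return sum(caches[model].values())


def indexer_reduction() -> float:
    """Fractional drop in indexer cache from GLM-5.1 to GLM-5.2."""
    old = caches["GLM-5.1"]["indexer"]
    new = caches["GLM-5.2"]["indexer"]
    return (old - new) / old


if __name__ == "__main__":
    for name in caches:
        print(f"{name}: {total_gib(name):.1f} GiB total")
    print(f"indexer cache reduced by {indexer_reduction():.0%}")
```

The KV cache itself is unchanged, so the whole saving (51.1 GiB down to 44.3 GiB per 1M-token request) comes from the indexer cache, which shrinks by about 73%.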

KVCache.AI@KVCache_AI·Jun 17

🚀 GLM-5.2 is here, and KVCache.ai is Day 0 ready.
🧠 Stable 1M context
💻 Stronger coding & agent capabilities
📜 MIT-licensed weights
⚡ KTransformers now supports running GLM-5.2 token service on edge devices, powered by SGLang + KT-Kernel.
👉 Get started: github.com/kvcache-ai/ktr…

Z.ai@Zai_org

Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: z.ai/blog/glm-5.2
Weights: huggingface.co/zai-org/GLM-5.2
API: docs.z.ai/guides/llm/glm…
Coding Plan: z.ai/subscribe
Chat: chat.z.ai

918

KVCache.AI@KVCache_AI·Jun 13

@1i__is Thank you! This means a lot to us 🙏 More open-source work in this area is coming soon. Stay tuned!

1iis, the New Team@1i__is·Jun 12

i love these guys they're hard on the narrow problem space that makes or breaks AI setups please don't stop keep being awesome we need you!

KVCache.AI@KVCache_AI

🚀 We just launched KV Cache Analyzer by KVCache.AI! 📊 Analyze KV cache hit rates and estimate prefill throughput speedup under different cache budgets and eviction policies. 🧪 Use preset traces or your own local traces, choose the model and parameters, and see how KV cache reuse improves LLM inference performance across different settings. 👉 Try it now: kvcache.ai/tools/kv-cache…

167
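
A minimal sketch of the kind of computation such an analysis involves: replaying a trace through a cache with a fixed budget and an eviction policy, then reporting the hit rate. The post does not describe the tool's internals, so the block-level granularity, the prefix-only hit rule, and the choice of LRU here are assumptions, and the function name is hypothetical.

```python
from collections import OrderedDict


def lru_prefix_hit_rate(trace, budget_blocks):
    """Replay `trace` through an LRU cache holding at most `budget_blocks` blocks.

    `trace` is a list of requests; each request is a list of hashable block ids
    in prompt order. A block counts as a hit only while the request's prefix is
    still unbroken, mirroring how prefix caching reuses KV.
    Returns hits divided by total blocks seen.
    """
    cache = OrderedDict()
    hits = total = 0
    for request in trace:
        prefix_intact = True
        for block in request:
            total += 1
            if prefix_intact and block in cache:
                hits += 1
            else:
                prefix_intact = False
            # Insert or refresh the block, then evict the least recently used.
            cache[block] = True
            cache.move_to_end(block)
            if len(cache) > budget_blocks:
                cache.popitem(last=False)
    return hits / total if total else 0.0


# Toy trace: three requests that share a system-prompt block.
toy = [["sys", "a1"], ["sys", "b1"], ["sys", "a1"]]
print(lru_prefix_hit_rate(toy, budget_blocks=3))  # prints 0.5
```

Shrinking the budget to two blocks evicts "a1" before it is requested again, and the hit rate on the same toy trace drops to one third, which is the budget sensitivity the analyzer is meant to expose.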
