KVCache.AI
@KT_Project_AI

Hi, this is the official https://t.co/EO7MXLjRIs account. We build systems for efficient LLM serving, including KTransformers and Mooncake.

Beijing · Joined August 2018
97 Following · 120 Followers
KVCache.AI @KT_Project_AI:
Great work! Scalable speculative decoding training is an important step forward as models continue to grow in size and context length. Excited to see Mooncake play a key role here by providing efficient and reliable streaming of hidden states, making fully disaggregated inference and training pipelines practical.
Quoted post from PyTorch @PyTorch:

We’re excited to introduce TorchSpec, a torch-native framework for scalable speculative decoding training developed by the TorchSpec and Mooncake teams. By streaming hidden states from inference engines to training workers via Mooncake, TorchSpec enables fully disaggregated…

KVCache.AI @KT_Project_AI:
Huge congratulations to the @lmsysorg SGLang team and @nvidia on these impressive GB300 results! 🚀 Powerful hardware + excellent software optimization is exactly how you unlock the full potential of long-context inference. Glad that Mooncake, as the KV cache transfer component, could contribute to this milestone. Excited to see what’s next!
Quoted post:

🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference! Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving: ⚡ 226 TPS/GPU peak throughput (1.53X vs GB200) 🧠 1.87X…

KVCache.AI @KT_Project_AI:
⚡ Day-0 support for Qwen3.5-397B-A17B just landed in KTransformers! This beast features Gated Delta Networks + sparse MoE (397B total, 17B active), unified vision-language, and 262K native context. Ready to run on your local machine.
Quoted post:

🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series. 🖼️ Native multimodal. Trained for real-world agents. ✨ Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling. ⚡ 8.6x–19.0x decoding throughput vs Qwen3-Max 🌍 201…

KVCache.AI @KT_Project_AI:
Huge congrats to MiniMax: this awesome new model is now open source! KTransformers is happy to provide day-0 support for M2.5. You can use KTransformers to enjoy the cutting-edge capabilities of M2.5 with just one 5090 + 300 GB of DRAM!
Quoted post:

MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows. Hugging Face: huggingface.co/MiniMaxAI/Mini… GitHub: …

PyTorch @PyTorch:
We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill decode disaggregation, global KVCache reuse, elastic expert parallelism, and serves as a fault tolerant PyTorch distributed backend. 🔗 hubs.la/Q042Zf9N0 #PyTorch #OpenSourceAI #LLM #AIInfrastructure
KVCache.AI @KT_Project_AI:
🚀 Exciting news! Mooncake is now officially part of the PyTorch Ecosystem! Mooncake brings high-performance KVCache transfer and storage to PyTorch-native LLM serving, enabling better prefill–decode disaggregation, global KVCache reuse, elastic MoE support, and fault-tolerant PyTorch distributed backends. Already integrated with engines like SGLang, vLLM & TensorRT-LLM, we are thrilled to build the future of scalable LLM serving together. 👉 Read more: pytorch.org/blog/mooncake-… #Mooncake #PyTorch #LLM #OpenSourceAI
Quoted post from PyTorch @PyTorch (the ecosystem announcement above).
KVCache.AI @KT_Project_AI:
Also, you can use KTransformers with LLaMA-Factory to fine-tune K2.5 locally on low-HBM hardware (96 GB) plus plenty of DDR5 DRAM!
KVCache.AI @KT_Project_AI:
We are excited to provide day-0 support for Kimi-K2.5. KTransformers is a growing open-source project that enables local deployment of large models in low-HBM scenarios (even around 64 GB).
KVCache.AI @KT_Project_AI:
RT @Kimi_Moonshot: 🥝 Meet Kimi K2.5, Open-Source Visual Agentic Intelligence. 🔹 Global SOTA on Agentic Benchmarks: HLE full set (50.2%), B…
KVCache.AI @KT_Project_AI:
Huge congrats to Kimi K2.5, the newest SOTA VL model with comprehensive agent support! 🎉🎉🎉 KTransformers is happy to provide day-0 support for K2.5 in local deployment scenarios; see our GitHub.
KVCache.AI @KT_Project_AI:
KTransformers v0.5.0 is released, and the new kt CLI is here! It is now very easy to manage your local AI deployment with `kt` commands.
KVCache.AI @KT_Project_AI:
🚀 KTransformers now supports MiniMax-M2.1 with native FP8 inference! On a single RTX 5090: ✅ Prefill: 2500+ tokens/s ✅ Decode: 33+ tokens/s. Compared to llama.cpp: 🚀 4.5x faster prefill 📈 30% faster decode. Run locally with just: `kt run m2` 🔗 github.com/kvcache-ai/ktr…
KVCache.AI reposted:
MiniMax M2.1 is OPEN SOURCE: SOTA for real-world dev & agents • SOTA on coding benchmarks (SWE / VIBE / Multi-SWE) • Beats Gemini 3 Pro & Claude Sonnet 4.5 • 10B active / 230B total (MoE) Not just SOTA, faster to infer, easier to deploy, and yes, you can even run it locally
KVCache.AI @KT_Project_AI:
🎉 As a KTransformers maintainer, I’m genuinely happy to say this: RL-DPO is now basically “plug-and-play” 😄 With LLaMA-Factory + LoRA + DPO: ✅ one line in YAML: `use_kt: true` ✅ one command: `USE_KT=1 llamafactory-cli train ...` …and you can start preference alignment on DeepSeek-V2-Lite-Chat (aka: models that speak more “human”) Full tutorial: blog.llamafactory.net/en/posts/ktran… #KTransformers #LLaMAFactory #DPO #RLHF
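A minimal sketch of what that LLaMA-Factory config could look like. Aside from the `use_kt: true` line and the `USE_KT=1 llamafactory-cli train` invocation quoted from the post, every field name, value, and the config filename below is an assumed/typical LLaMA-Factory setting, not taken from the tutorial:

```yaml
### model
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite-Chat

### method
stage: dpo              # preference alignment via DPO
finetuning_type: lora   # LoRA keeps the trainable parameter count small
use_kt: true            # the one-line switch: route compute through KTransformers

### dataset (hypothetical example dataset name)
dataset: dpo_en_demo
template: deepseek

### output
output_dir: saves/deepseek-v2-lite-chat-dpo
```

Then launch with something like `USE_KT=1 llamafactory-cli train deepseek_dpo.yaml` (config filename hypothetical); see the full tutorial linked above for the exact settings.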