KVCache.AI
@KT_Project_AI

24 posts
Hi, this is the official account of https://t.co/EO7MXLjRIs. We build systems for efficient LLM serving, including KTransformers and Mooncake.

Beijing · Joined August 2018
97 Following · 120 Followers

KVCache.AI@KT_Project_AI·
Great work! Scalable speculative decoding training is an important step forward as models continue to grow in size and context length. Excited to see Mooncake play a key role here by providing efficient and reliable streaming of hidden states, making fully disaggregated inference and training pipelines practical.
PyTorch@PyTorch

We’re excited to introduce TorchSpec, a torch-native framework for scalable speculative decoding training developed by the TorchSpec and Mooncake teams. By streaming hidden states from inference engines to training workers via Mooncake, TorchSpec enables fully disaggregated pipelines where inference and training scale independently. 🔗 Read our latest blog from TorchSpec & Mooncake teams: pytorch.org/blog/torchspec… @lightseekorg @KT_Project_AI #PyTorch #TorchSpec #Mooncake #OpenSourceAI

0 replies · 2 reposts · 3 likes · 356 views

KVCache.AI@KT_Project_AI·
Huge congratulations to the @lmsysorg SGLang team and @nvidia on these impressive GB300 results! 🚀 Powerful hardware + excellent software optimization is exactly how you unlock the full potential of long-context inference. Glad that Mooncake, as the KV cache transfer component, could contribute to this milestone. Excited to see what’s next!
LMSYS Org@lmsysorg

🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference! Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving: ⚡ 226 TPS/GPU peak throughput (1.53X vs GB200) 🧠 1.87X TPS/User gain with MTP under matched throughput 💾 1.6X higher decode batch size via GB300's 288GB HBM3e ⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP 🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, and orchestrated with NVIDIA Dynamo @NVIDIAAIDev

1 reply · 1 repost · 7 likes · 371 views

KVCache.AI@KT_Project_AI·
Huge congrats to MiniMax, this awesome new model is now open source! KTransformers is happy to provide day-0 support for M2.5. You can use KTransformers to enjoy the cutting-edge capabilities of M2.5 with only one RTX 5090 + 300 GB of DRAM!
MiniMax (official)@MiniMax_AI

MiniMax-M2.5 is now open source. Trained with reinforcement learning across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows. Hugging Face: huggingface.co/MiniMaxAI/Mini… GitHub: github.com/MiniMax-AI/Min… Coding Plan: platform.minimax.io/subscribe/codi… Intelligence with Everyone

1 reply · 1 repost · 5 likes · 236 views

PyTorch@PyTorch·
We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill decode disaggregation, global KVCache reuse, elastic expert parallelism, and serves as a fault tolerant PyTorch distributed backend. 🔗 hubs.la/Q042Zf9N0 #PyTorch #OpenSourceAI #LLM #AIInfrastructure
PyTorch tweet media
7 replies · 51 reposts · 401 likes · 103.6K views

KVCache.AI@KT_Project_AI·
🚀 Exciting news! Mooncake is now officially part of the PyTorch Ecosystem! Mooncake brings high-performance KVCache transfer and storage to PyTorch-native LLM serving, enabling better prefill–decode disaggregation, global KVCache reuse, elastic MoE support, and fault-tolerant PyTorch distributed backends. Already integrated with engines like SGLang, vLLM & TensorRT LLM, we are thrilled to build the future of scalable LLM serving together. 👉 Read more: pytorch.org/blog/mooncake-… #Mooncake #PyTorch #LLM #OpenSourceAI
PyTorch@PyTorch

We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill decode disaggregation, global KVCache reuse, elastic expert parallelism, and serves as a fault tolerant PyTorch distributed backend. 🔗 hubs.la/Q042Zf9N0 #PyTorch #OpenSourceAI #LLM #AIInfrastructure

0 replies · 1 repost · 3 likes · 153 views

KVCache.AI@KT_Project_AI·
Also, you can use KTransformers with LLaMA-Factory to fine-tune K2.5 on local low-HBM hardware (96 GB) plus plenty of DDR5 DRAM!
0 replies · 0 reposts · 3 likes · 98 views

KVCache.AI@KT_Project_AI·
We are excited to provide day-0 support for Kimi K2.5. KTransformers is a growing open-source project that enables local deployment of large models in low-HBM scenarios (even ~64 GB).
KVCache.AI tweet media
0 replies · 2 reposts · 2 likes · 173 views

KVCache.AI@KT_Project_AI·
RT @Kimi_Moonshot: 🥝 Meet Kimi K2.5, Open-Source Visual Agentic Intelligence. 🔹 Global SOTA on Agentic Benchmarks: HLE full set (50.2%), B…
0 replies · 2 reposts · 0 likes · 31 views

KVCache.AI@KT_Project_AI·
Huge congrats to Kimi K2.5! The newest SOTA VL model, with comprehensive agent support 🎉🎉🎉 KTransformers is happy to provide day-0 support for K2.5 in local deployment scenarios; please see our GitHub.
0 replies · 0 reposts · 1 like · 82 views

KVCache.AI@KT_Project_AI·
KTransformers v0.5.0 is released, and the new `kt` CLI is here! It is now very easy to manage your local AI deployment with `kt` commands.
KVCache.AI tweet media
0 replies · 0 reposts · 2 likes · 70 views

KVCache.AI@KT_Project_AI·
🚀 KTransformers now supports MiniMax-M2.1 with native FP8 inference! On a single RTX 5090:
✅ Prefill: 2500+ tokens/s
✅ Decode: 33+ tokens/s
Compared to llama.cpp:
🚀 4.5x faster prefill
📈 30% faster decode
Run locally with just: "kt run m2"
🔗 github.com/kvcache-ai/ktr…
KVCache.AI tweet media
1 reply · 4 reposts · 29 likes · 9K views

KVCache.AI reposted
MiniMax (official)@MiniMax_AI·
MiniMax M2.1 is OPEN SOURCE: SOTA for real-world dev & agents • SOTA on coding benchmarks (SWE / VIBE / Multi-SWE) • Beats Gemini 3 Pro & Claude Sonnet 4.5 • 10B active / 230B total (MoE) Not just SOTA, faster to infer, easier to deploy, and yes, you can even run it locally Weights: huggingface.co/MiniMaxAI/Mini…
MiniMax (official) tweet media
57 replies · 174 reposts · 1.5K likes · 1.1M views

KVCache.AI@KT_Project_AI·
🎉 As a KTransformers maintainer, I’m genuinely happy to say this: RL-DPO is now basically “plug-and-play” 😄 With LLaMA-Factory + LoRA + DPO: ✅ one line in YAML: `use_kt: true` ✅ one command: `USE_KT=1 llamafactory-cli train ...` …and you can start preference alignment on DeepSeek-V2-Lite-Chat (aka: models that speak more “human”) Full tutorial: blog.llamafactory.net/en/posts/ktran… #KTransformers #LLaMAFactory #DPO #RLHF
0 replies · 1 repost · 3 likes · 293 views
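To make the "plug-and-play" claim in the DPO post concrete, here is a minimal sketch of what such a LLaMA-Factory training YAML might look like. Only `use_kt: true`, the model (DeepSeek-V2-Lite-Chat), and LoRA + DPO are confirmed by the post; the remaining keys follow common LLaMA-Factory config examples and may differ by version, and the dataset name and output path are placeholders.

```yaml
### model
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite-Chat
use_kt: true              # route compute through KTransformers (from the post)

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all

### dataset (placeholder — substitute your preference dataset)
dataset: dpo_en_demo

### output (placeholder path)
output_dir: saves/deepseek-v2-lite-dpo
```

With a config like this saved as, say, `dpo_kt.yaml`, the post's one-command invocation would be `USE_KT=1 llamafactory-cli train dpo_kt.yaml`.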