Guanduo Chen

24 posts

Guanduo Chen banner
Guanduo Chen

Guanduo Chen

@GuanduoChen

Infra @Kimi_Moonshot

🤖 Katılım Mart 2024
302 Takip Edilen63 Takipçiler
Guanduo Chen retweetledi
DeepSeek
DeepSeek@deepseek_ai·
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/De… 🤗 Open Weights: huggingface.co/collections/de… 1/n
DeepSeek tweet media
English
1.6K
7.7K
45.5K
9.8M
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/Fla…
English
46
185
1.8K
212.5K
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
Meet Kimi K2.6: Advancing Open-Source Coding 🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python(86.7), Math Vision w/ python (93.2) What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization). 🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D. 🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files. 🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops. 🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop. - K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code - 🔗 API: platform.moonshot.ai 🔗 Tech blog: kimi.com/blog/kimi-k2-6 🔗 Weights & code: huggingface.co/moonshotai/Kim…
Kimi.ai tweet media
English
937
2.4K
18.1K
7.5M
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token. This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical. Validated on a 20x scaled-up Kimi Linear model: ✅ 1.54× throughput ✅ 64% ↓ P90 TTFT → Directly translating into lower token cost. More in Prefill-as-a-Service: arxiv.org/html/2604.1503…
Kimi.ai tweet media
English
73
343
2.8K
686.6K
Guanduo Chen
Guanduo Chen@GuanduoChen·
Manner 的 SOE 竟然涨价 25% 😭 给本不富裕的家庭雪上加霜
中文
0
0
0
31
Guanduo Chen retweetledi
Beidi Chen
Beidi Chen@BeidiChen·
??? So Anthropic’s showing who’s boss now — closed models trained on everyone’s data. SaaS playbook soon: cut engineers, pay Claude, pray the codebase doesn’t outgrow the only thing that can read it.
Boris Cherny@bcherny

Starting tomorrow at 12pm PT, Claude subscriptions will no longer cover usage on third-party tools like OpenClaw. You can still use these tools with your Claude login via extra usage bundles (now available at a discount), or with a Claude API key.

English
5
1
94
24.4K
小島みなこ 🏳️‍⚧️
“attention residual 的计算逻辑很简单,即使用大模型 vibe 也能搞出来一个访存性能比较好的算子(没有说 mhc 不好 vibe 的意思)” 。。。
Yu Zhang 🐙🌘@yzhang_cs

and I highly recommend this Zhihu answer. Shoutout to kimi infra team for backing AttnRes! 🙌 Kimi 新发布的「注意力残差」有什么亮点?马斯克和 Karpathy 为何同时点赞? - Reku的回答 - 知乎 zhihu.com/question/20169… translation kimi.com/share/19cfc9bc…

中文
1
0
20
3.8K
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…
Kimi.ai tweet media
English
336
2.1K
13.5K
5.1M
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
Kimi K2.5 has arrived! 🥝 Here are 2 things to know: Aesthetic Coding x Agent Swarm.
English
134
510
6.1K
616.3K
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
Introducing Kimi Code, an open-source coding agent under the Apache 2.0 License. 🔹 Python-based, easy to extend. 🔹 Fully transparent — clear, safe, reliable. 🔹 Seamlessly integrates with VS Code, Cursor, JetBrains, Zed, and more. 🔹 Fully-featured & out-of-the-box ready. Switch over in seconds. 🔹 Native Multimodal support Experience it with Kimi K2.5.
Kimi.ai tweet media
English
61
214
2.6K
257.8K
Guanduo Chen
Guanduo Chen@GuanduoChen·
RT @Kimi_Moonshot: 🥝Meet Kimi K2.5, Open-Source Visual Agentic Intelligence. 🔹Global SOTA on Agentic Benchmarks: HLE full set (50.2%), Bro…
English
0
42
0
21
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns. K2 Thinking is now live on kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API. 🔌 API is live: platform.moonshot.ai 🔗 Tech blog: moonshotai.github.io/Kimi-K2/thinki… 🔗 Weights & code: huggingface.co/moonshotai
Kimi.ai tweet media
English
578
1.5K
9.7K
4.8M
Guanduo Chen retweetledi
Kimi.ai
Kimi.ai@Kimi_Moonshot·
Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length. Key highlights: 🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule. 🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board. 🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints. The future of agentic-oriented attention is here! 💡
English
28
197
1.4K
522.5K
Guanduo Chen retweetledi
Horace He
Horace He@cHHillee·
I quite enjoyed this and it covers a bunch of topics without good introductory resources! 1. A bunch of GPU hardware details in one place (warp schedulers, shared memory, etc.) 2. A breakdown/walkthrough of reading PTX and SASS. 3. Some details/walkthroughs of a number of other hardware features (TMAs, wgmmas, etc.)
Aleksa Gordić (水平问题)@gordic_aleksa

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state of the art matmul kernels in CUDA read along. (Remember matmul is the single most important operation that transformers execute both during training and inference. Most of NVIDIA compute is spent on it. Gaining 1% in efficiency translates to massive savings in the order of many nuclear reactors :P) I, yet again, realized i underestimated the effort. 😅 Here is one more booklet (lol). 47 figures! I covered: * The fundamentals of the GPU architecture with an emphasis on the memory hierarchy, building mental models for GMEM, SMEM, and L1/L2, and then connecting them to the CUDA programming model. Along the way we also looked at the "speed of light," how it's bounded by power, with hardware reality leaking into our model. * PTX/SASS, and how to steer the compiler into generating what we actually want (is that loop being unrolled, are we using vectorized loads like LDG.128, etc.). I've annotated one PTX/SASS example for a simple matmul kernel in excruciating detail. Even if you're new to compilers you should find this useful. (i actually found various inefficiencies in both compilers - fun!) * Many core concepts such as tile/wave quantization, occupancy, ILP (instruction-level parallelism), roofline model, etc. Also building intuition around fundamental equivalences: dot product as a sum of partial outer products, why square tiles are the right shape for high arithmetic intensity, etc. * The warp tiling method - which is near SOTA assuming you can't use tensor cores, TMA, async mem instructions, and bf16. Just maximizing GPU's performance using nothing but CUDA cores, registers and shared memory. * Finally, we step into Hopper (H100): TMA, swizzling, tensor cores and the wgmma instruction, async load/store pipelines, scheduling policies like Hilbert curves, clusters with TMA multicast, faster PTX barriers, and more. As always lots of examples, lots of visuals. This is the first time i could see warp tiling kernel and be like "oh i get it completely". I just needed my mental image transformed into an actual image. A few years ago I was really inspired by @Si_Boehm's excellent blog post on how matmul works, but I also found it had several errors, some unclear explanations, and it was quite outdated. Building on @pranjalssh amazing work (who did a great job building sota kernels for H100) and my own research, this is the final result. --- Again a huge thank you to @Hyperstackcloud (GPU cloud) for giving me an H100 (PCIe) node to run some of the experiments and analysis that i needed to write this up. Also a big thank you to my friends Aroun (who did a very thorough review of the post; Aroun's doing cool GPU/AI stuff at Magic and was previously GPU architect at Apple and Imagine, he's one of the best GPU people i know and we worked together on llm.c w/ @karpathy) and the amazing @marksaroufim! (PyTorch) for taking the time during weekend when they didn't have to. :)

English
8
93
975
99.9K
GitHub
GitHub@github·
What inspired you to get into tech as a career?
English
447
59
1K
309.3K
Guanduo Chen retweetledi
Arena.ai
Arena.ai@arena·
🚨 BREAKING: @Kimi_Moonshot’s Kimi-K2 is now the #1 open model in the Arena! With over 3K community votes, it ranks #5 overall, overtaking DeepSeek as the top open model. Huge congrats to the Moonshot team on this impressive milestone! The leaderboard now features 7 different providers in the top 15 - the most competitive it’s ever been. More insights in the thread 🧵
Arena.ai tweet media
Kimi.ai@Kimi_Moonshot

🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build! 🔌 API is here: platform.moonshot.ai - $0.15 / million input tokens (cache hit) - $0.60 / million input tokens (cache miss) - $2.50 / million output tokens 🔗 Tech blog: moonshotai.github.io/Kimi-K2/ 🔗 Weights & code: huggingface.co/moonshotai 🔗 Github: github.com/MoonshotAI/Kim… Try it now at Kimi.ai or via API!

English
46
148
1.2K
275.7K