Zhiyuan1i

46 posts

Zhiyuan1i

@uniartisan

INTP @Kimi_Moonshot

Katılım Kasım 2017

59 Takip Edilen544 Takipçiler

Zhiyuan1i@uniartisan·1d

🚀🚀

Yu Zhang 🐙🌘@yzhang_cs

flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. github.com/fla-org/flash-…

ART

228

Zhiyuan1i retweetledi

Kimi.ai@Kimi_Moonshot·16 Mar

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

333

2.1K

13.6K

4.9M

Zhiyuan1i@uniartisan·3 Mar

@im_datta0 @hu_yifei please share a minimal script. FLA provide multiple ways to accelerate training. Even qwen3.5 itself use FLA. In my opinion, to avoid compile and h2d/d2h is the key

English

Datta Nimmaturi@im_datta0·2 Mar

@hu_yifei Idk the issue here but... I think FLA seems to be having repeated compilations at every step (at least that's what I faced when fine-tuning Qwen 3.5 with FLA)

English

751

Yifei Hu@hu_yifei·2 Mar

training infra is the moat

English

8.6K

Zhiyuan1i retweetledi

Dillon Uzar@DillonUzar·22 Kas

Context Arena Update: Added kimi-linear-48b-a3b-instruct [11-08] and kimi-k2 (Thinking) [11-06] to the MRCR leaderboards. The Linear 48b results are fascinating! It actually outperforms the new Gemini 3.0 Pro Thinking on 4-needle and 8-needle tasks at higher context lengths (512k+). I've added it to 2needle, 4needle, and 8needle. kimi-k2 (Thinking) lands lower on the leaderboards (Rank #22 for 2-needle AUC @ 128k), with a hard context ceiling around 262k. I did not run it for 2needle and 4needle. All results at: contextarena.ai The performance curve for the Linear model is distinct: while it underperforms Gemini 3 significantly at shorter contexts (<=256k) on the difficult 8-needle test, its degradation slope is much flatter. Gemini starts higher and drops fast; Kimi starts lower but holds steady, overtaking Gemini at the higher end. However, note that kimi-linear-48b has noticeable performance drops past 128k on the easier 2 & 4 needle tests. Additionally, due to lower token efficiency compared to Gemini/GPT, only ~60% of the 1M token tests successfully ran (hitting limits/OOM). So some caution with the results at the 1M level. kimi-linear-48b results: 2-Needle Performance (@ 128k / @ 1M): - AUC: 96.5% (vs Gem 3: 99.5%) / 81.7% (vs Gem 3: 85.5%) - Pointwise: 96.0% (vs Gem 3: 99.0%) / 77.0% (vs Gem 3: 72.2%) 4-Needle Performance (@ 128k / @ 1M): - AUC: 85.5% (vs 85.8%) / 62.7% (#1, beating Gem 3: 57.3%) - Pointwise: 83.7% (vs 80.8%) / 51.5% (#1, beating Gem 3: 34.3%) 8-Needle Performance (@ 128k / @ 1M): - AUC: 54.9% (vs 73.0%) / 43.8% (#1, beating Gem 3: 39.0%) - Pointwise: 49.0% (vs 54.2%) / 35.3% (#1, beating Gem 3: 24.5%) A very different architectural approach yielding impressive stability at scale. Because of its current price point, it is very competitive for long context (MRCR). Enjoy. @Kimi_Moonshot @GoogleDeepMind @googleaidevs @OpenAI @OpenAIDevs

English

472

294.1K

Zhiyuan1i@uniartisan·12 Kas

砍完发现我已经订阅过了，求员工优惠啊🤡

黑@Hx1u0

我发现你们都是砍价之王，就我自己砍不了自己🤡

中文

4.9K

Zhiyuan1i@uniartisan·11 Kas

Can't wait to see them

Lisan al Gaib@scaling01

from Kimi AMA: - K3 will likely use KDA or some other hybrid attention mechanism - Kimi-K2 will get vision

English

Zhiyuan1i@uniartisan·10 Kas

Serialization and then hashing, I remember even after optimization, 45us was needed. In this case, you can consider exporting the cubin after warming up and calling the cubin directly.

maharshi@maharshii

why is triton’s kernel launch cpu overhead so freaking high? the actual kernel takes 10x less execution time than to launch it and i can’t use cuda graphs because the shapes are dynamic.

English

1.8K

Zhiyuan1i@uniartisan·10 Kas

KIMI of course 🙋‍♂️

Vishal ▲@Vixhal

Your current favorite LLM, and why?

English

227

9.7K

Zhiyuan1i@uniartisan·9 Kas

Think deep. Work smart. Focus on the next six months, not the next ten years.

English

102

3.9K

Zhiyuan1i@uniartisan·9 Kas

@aj_kourabi The speed issue has been resolved, and the final problem was surprising - everyone's enthusiasm exposed our network bandwidth issue.

English

345

AJ Kourabi@aj_kourabi·9 Kas

we will see K2.1 in a few months to fix the verbosity, but a baseline speed of 8tok/s shows you that price is not everything and the prices the labs charge are more defensible than you think doesnt matter if a model is open source you still need compute to serve at scale

Artificial Analysis@ArtificialAnlys

Kimi K2 Thinking used the highest number of tokens ever across the evals in Artificial Analysis Intelligence Index

English

129

18.7K

Zhiyuan1i@uniartisan·8 Kas

@indra_himself You will enjoy Kimi Linear(48B). We build it for coding and provide emotional value. A really warm and smart model.

English

Kami Sama@indra_himself·7 Kas

@uniartisan I just want a kimi small LLM That rivals best of emotion chat persona At inference cost lower than gpt5nano if possible

English

115

Zhiyuan1i@uniartisan·7 Kas

I believe that we have the best algorithm and engineering teams, and what's important is that they work closely together like a family.

Kimi.ai@Kimi_Moonshot

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns. K2 Thinking is now live on kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API. 🔌 API is live: platform.moonshot.ai 🔗 Tech blog: moonshotai.github.io/Kimi-K2/thinki… 🔗 Weights & code: huggingface.co/moonshotai

English

507

42.9K

Zhiyuan1i@uniartisan·8 Kas

@galuh1300d @deepseek_ai Undoubtedly, I respect and learn from their work. We compete in different aspects, A single flower does not make spring, but a garden full of flowers does.

English

Galuh Budi@galuh1300d·7 Kas

@uniartisan @deepseek_ai Clearly have the best algorithm

English

135

Zhiyuan1i@uniartisan·6 Kas

Cheers! I'm pleased to see several of our PRs featured here. This will boost the broader hybrid model universe!

PyTorch@PyTorch

Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗 hubs.la/Q03RWWDD0 #vLLM #PyTorch #OpenSourceAI #HybridModels

English

1.4K

Zhiyuan1i@uniartisan·2 Kas

@HaveFunStayingP @teortaxesTex Each of us has our own direction, and we are extremely united and happy in our work.

English

Le Chiffre@HaveFunStayingP·2 Kas

@teortaxesTex Culture/incentives - meta is more short term with everyone pulling their own direction, China - more top down, hierarchical

English

327

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·1 Kas

So it's very interesting. When a Chinese company decides "ok time to do AGI", they spawn a lab and do it. 100+ people, DS-MoE, add GPUs, get near-frontier weights on the other end. When Meta tries to do it, Zuck trips and spaghetti falls out of his pockets. Again. And again. Why?

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media

Meituan LongCat@Meituan_LongCat

@xeophon Your “LLM delivery”has arrived! Please remember to leave a five-star review. 🐱miao ~~~

English

397

76.1K

Zhiyuan1i retweetledi

Songlin Yang@SonglinYang4·31 Eki

Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually appreciate Minimax’s openness here: they admitted the challenges and regrets of hybrid linear or sliding-window attention on multi-hop reasoning tasks, which not many labs would say out loud. That said, the “regrets” might not be as bad as they sound. Minimax used a very simple linear attention variant (largely due to insufficient evaluation at the time), so the performance gap was probably exaggerated. The continual pretraining strategy (i.e., switching from global attention to hybrid sliding-window attention) also seemed quite suboptimal. And afaik, hybrid linear attention can still perform very strongly on nearly all benchmarks except multi-hop reasoning. If the performance drop on multi-hop reasoning can be kept small enough to trade for better inference efficiency and data efficiency, hybrid linear attention still has plenty of room to grow. Better linear-complexity layers are still worth exploring, especially with improving infrastructure from frameworks like vLLM and SGLang. After all, we don’t want our agentic models to be forever bounded by context length - that’s a limitation we’ll have to overcome sooner or later

English

505

61.3K

Zhiyuan1i retweetledi

熊师傅 weight decay 了吗@bigeagle_xd·30 Eki

i still remember the discussions like: - @yzhang_cs listing latest exp results, "this means GDN outperforms ..." - @zxytim and others: "no, it's not fair", "no it's necessary but not sufficient", "is that cheating?", "you should compare ...", "i propose a new test ..." - @yzhang_cs like: yes, sir @yzhang_cs 太难了，都是泪😢

Xinyu Zhou@zxytim

You see: - a new arch that is better and faster than full attention verified with Kimi-style solidness. I see: - Starting with inferior performance even on short contexts. Nothing works and nobody knows why. - Tweaking every possible hyper-parameter to grasp what is wrong. - Trying to find the efficient chunkwise parallelizable form to squeeze juice out of the GPU - RoPE or NoPE, a question haunting for nights. - Fighting buggy implementation that causes one of the long-context benchmarks drops ~20 pts. - RL diverging. Aligning training-inference numerics. - Dedicated efforts to make sure comparisons are solid and fair. - Going back-and-forth in a pool of adversarial gate-keeping tests, and finally it survives. Great teamwork!

English

34.5K

Zhiyuan1i@uniartisan·30 Eki

@qubitium @crystalsssup I wrote this kernel. I will check this tomorrow, since it's late night. You could open a issue on fla to help us track this issue.

English

Qubitium@qubitium·30 Eki

@crystalsssup Please check this. Bug filed. Can't run the model. Thanks. x.com/qubitium/statu…

Qubitium@qubitium

@Kimi_Moonshot I cannot run Kimi Linear on A100 on Pytorch 2.9 + Cuda 13.0. Crashing in fla.kda.gate + triton.

English

261

Crystal@crystalsssup·30 Eki

build, ship, repeat

Kimi.ai@Kimi_Moonshot

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length. Key highlights: 🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule. 🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board. 🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints. The future of agentic-oriented attention is here! 💡

English

176

14.3K

Zhiyuan1i@uniartisan·30 Eki

There are a lot of works behind Kimi Linear. We've rethought efficient and expressive linear attention from infra. We even first discovered the attn matrix, and then the recurrent. No wait to check out the kda kernel in the FLA repo. We have much more work to do, to open.

Kimi.ai@Kimi_Moonshot

English

1.6K

Keşfet

@im_datta0 @hu_yifei @Kimi_Moonshot @GoogleDeepMind @googleaidevs @OpenAI @OpenAIDevs @aj_kourabi