Zhiyuan1i
@uniartisan
46 posts
INTP @Kimi_Moonshot
Joined November 2017
59 Following · 544 Followers
Zhiyuan1i reposted
Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
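A minimal sketch of the mechanism as described in the tweet, assuming single-head dot-product attention over depth; the names (AttnResBlock, history) and the shape of F are illustrative assumptions, not Moonshot's implementation — see the linked report for the real design.

```python
# Hypothetical sketch (not Moonshot's code): a residual block that attends over
# preceding layers' outputs instead of using the fixed sum x + F(x).
import torch
import torch.nn as nn

class AttnResBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for the block's main transform F (attention/MLP in practice).
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.q = nn.Linear(dim, dim)  # query from the current hidden state
        self.k = nn.Linear(dim, dim)  # keys from each preceding layer's output

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # history: outputs of all preceding layers, each of shape (batch, seq, dim)
        h = torch.stack(history)                                       # (L, B, S, D)
        scores = (self.k(h) * self.q(x)).sum(-1) / x.shape[-1] ** 0.5  # (L, B, S)
        w = torch.softmax(scores, dim=0).unsqueeze(-1)                 # attention over depth
        residual = (w * h).sum(dim=0)                                  # learned, input-dependent mix
        return residual + self.f(x)                                    # vs. fixed x + F(x)
```

Each layer would append its output to history, so deep layers can retrieve early representations directly rather than through diluted summation; per the tweet, Block AttnRes compresses history into per-block summaries to keep this affordable at scale.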
Zhiyuan1i @uniartisan
@im_datta0 @hu_yifei Please share a minimal script. FLA provides multiple ways to accelerate training; even Qwen3.5 itself uses FLA. In my opinion, avoiding recompilation and H2D/D2H transfers is the key.
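A minimal sketch of the failure modes being alluded to (my illustration, not FLA code): hidden device-to-host syncs in the training loop, and shape-triggered recompilation.

```python
# Illustrative only (not from the FLA codebase): two common causes of slow
# training steps - per-step d2h syncs and per-step recompilation.
import torch

losses = []

def train_step(model, batch, opt):
    opt.zero_grad(set_to_none=True)
    loss = model(**batch).loss
    loss.backward()
    opt.step()
    # Bad: loss.item() forces a device-to-host sync every step, stalling the GPU.
    # Better: keep the value on-device and sync once per logging interval.
    losses.append(loss.detach())

def mean_loss(last_n: int) -> float:
    # One amortized d2h transfer instead of one per step.
    return torch.stack(losses[-last_n:]).mean().item()

# If sequence lengths vary per batch, mark shapes dynamic so torch.compile
# does not recompile the graph for every new length:
# model = torch.compile(model, dynamic=True)
```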
Datta Nimmaturi @im_datta0
@hu_yifei Idk the issue here, but FLA seems to be recompiling at every step (at least that's what I ran into when fine-tuning Qwen 3.5 with FLA).
Yifei Hu @hu_yifei
training infra is the moat
Zhiyuan1i reposted
Dillon Uzar @DillonUzar
Context Arena Update: Added kimi-linear-48b-a3b-instruct [11-08] and kimi-k2 (Thinking) [11-06] to the MRCR leaderboards.

The Linear 48b results are fascinating! It actually outperforms the new Gemini 3.0 Pro Thinking on 4-needle and 8-needle tasks at higher context lengths (512k+). I've added it to 2needle, 4needle, and 8needle. kimi-k2 (Thinking) lands lower on the leaderboards (Rank #22 for 2-needle AUC @ 128k), with a hard context ceiling around 262k; I did not run it for 2needle and 4needle. All results at: contextarena.ai

The performance curve for the Linear model is distinct: while it underperforms Gemini 3 significantly at shorter contexts (<=256k) on the difficult 8-needle test, its degradation slope is much flatter. Gemini starts higher and drops fast; Kimi starts lower but holds steady, overtaking Gemini at the higher end. However, note that kimi-linear-48b has noticeable performance drops past 128k on the easier 2- and 4-needle tests. Additionally, due to lower token efficiency compared to Gemini/GPT, only ~60% of the 1M token tests ran successfully (hitting limits/OOM), so apply some caution to the results at the 1M level.

kimi-linear-48b results (@ 128k / @ 1M):

2-Needle:
- AUC: 96.5% (vs Gem 3: 99.5%) / 81.7% (vs Gem 3: 85.5%)
- Pointwise: 96.0% (vs Gem 3: 99.0%) / 77.0% (vs Gem 3: 72.2%)

4-Needle:
- AUC: 85.5% (vs 85.8%) / 62.7% (#1, beating Gem 3: 57.3%)
- Pointwise: 83.7% (vs 80.8%) / 51.5% (#1, beating Gem 3: 34.3%)

8-Needle:
- AUC: 54.9% (vs 73.0%) / 43.8% (#1, beating Gem 3: 39.0%)
- Pointwise: 49.0% (vs 54.2%) / 35.3% (#1, beating Gem 3: 24.5%)

A very different architectural approach yielding impressive stability at scale. Because of its current price point, it is very competitive for long context (MRCR). Enjoy. @Kimi_Moonshot @GoogleDeepMind @googleaidevs @OpenAI @OpenAIDevs
Zhiyuan1i @uniartisan
Think deep. Work smart. Focus on the next six months, not the next ten years.
Zhiyuan1i @uniartisan
@aj_kourabi The speed issue has been resolved, and the final problem was surprising: everyone's enthusiasm exposed our network bandwidth bottleneck.
Zhiyuan1i @uniartisan
@indra_himself You will enjoy Kimi Linear (48B). We built it for coding, and it provides emotional value too. A really warm and smart model.
Kami Sama @indra_himself
@uniartisan I just want a small Kimi LLM that rivals the best emotional-chat personas, at an inference cost lower than gpt5nano if possible.
Zhiyuan1i @uniartisan
@galuh1300d @deepseek_ai Undoubtedly. I respect and learn from their work. We compete in different aspects. A single flower does not make spring, but a garden full of flowers does.
Le Chiffre @HaveFunStayingP
@teortaxesTex Culture/incentives: Meta is more short-term, with everyone pulling in their own direction; China is more top-down and hierarchical.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
So it's very interesting. When a Chinese company decides "ok time to do AGI", they spawn a lab and do it. 100+ people, DS-MoE, add GPUs, get near-frontier weights on the other end. When Meta tries to do it, Zuck trips and spaghetti falls out of his pockets. Again. And again. Why?
Meituan LongCat @Meituan_LongCat
@xeophon Your “LLM delivery” has arrived! Please remember to leave a five-star review. 🐱miao ~~~
Zhiyuan1i reposted
Songlin Yang @SonglinYang4
Many people are confused by Minimax's recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi's later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next and Qwen3.5).

I actually appreciate Minimax's openness here: they admitted the challenges and regrets of hybrid linear or sliding-window attention on multi-hop reasoning tasks, which not many labs would say out loud.

That said, the "regrets" might not be as bad as they sound. Minimax used a very simple linear attention variant (largely due to insufficient evaluation at the time), so the performance gap was probably exaggerated. The continual pretraining strategy (i.e., switching from global attention to hybrid sliding-window attention) also seemed quite suboptimal. And afaik, hybrid linear attention can still perform very strongly on nearly all benchmarks except multi-hop reasoning.

If the performance drop on multi-hop reasoning can be kept small enough to trade for better inference efficiency and data efficiency, hybrid linear attention still has plenty of room to grow. Better linear-complexity layers are still worth exploring, especially with improving infrastructure from frameworks like vLLM and SGLang. After all, we don't want our agentic models to be forever bounded by context length - that's a limitation we'll have to overcome sooner or later.
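For readers unfamiliar with the term, a toy sketch of what a "hybrid" stack means structurally: mostly linear-complexity token mixers, with full softmax attention interleaved at a fixed ratio. The 3:1 ratio and module choices below are illustrative assumptions, not any lab's actual config.

```python
# Toy sketch of a hybrid attention stack (illustrative, not a real config):
# linear-complexity mixers for most layers, full attention every 4th layer to
# preserve the exact retrieval that multi-hop reasoning seems to rely on.
import torch.nn as nn

def build_hybrid_stack(dim: int, n_layers: int, full_attn_every: int = 4) -> nn.ModuleList:
    layers = nn.ModuleList()
    for i in range(n_layers):
        if (i + 1) % full_attn_every == 0:
            # Quadratic softmax attention: expensive, but gives exact token lookup.
            layers.append(nn.MultiheadAttention(dim, num_heads=8, batch_first=True))
        else:
            # Placeholder for an O(n) linear-attention layer (e.g. a gated
            # delta-rule block); a plain Linear stands in to keep this runnable.
            layers.append(nn.Linear(dim, dim))
    return layers
```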
Zhiyuan1i reposted
Zhiyuan1i @uniartisan
@qubitium @crystalsssup I wrote this kernel. I will check this tomorrow, since it's late at night here. You could open an issue on FLA to help us track it.
Crystal @crystalsssup
build, ship, repeat
Kimi.ai @Kimi_Moonshot
Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim…
Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance - ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length.
Key highlights:
🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule.
🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board.
🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints.
The future of agentic-oriented attention is here! 💡
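As a reference for what "refines the gated delta rule" points at, here is a naive sketch of the gated delta-rule recurrence itself, assuming per-token scalar gates; KDA's actual kernel is chunked and hardware-efficient with finer-grained gating, so this is only the math, not the implementation.

```python
# Naive reference recurrence for the gated delta rule (not the KDA kernel):
# decay the fast-weight state, then erase-and-write along the current key.
import torch

def gated_delta_rule(q, k, v, beta, alpha):
    """q, k: (T, d_k); v: (T, d_v); beta: (T,) write strength; alpha: (T,) decay gate."""
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)                    # fast-weight memory
    outs = []
    for t in range(len(q)):
        S = alpha[t] * S                         # forget (the "gated" part)
        # Delta rule: replace what the memory currently returns for k_t with v_t.
        S = S + beta[t] * torch.outer(k[t], v[t] - k[t] @ S)
        outs.append(q[t] @ S)                    # read-out for this step
    return torch.stack(outs)                     # (T, d_v)
```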

Zhiyuan1i @uniartisan
There is a lot of work behind Kimi Linear. We've rethought efficient and expressive linear attention from the infrastructure up. We even derived the attention matrix form first, and only then the recurrent form. Don't wait to check out the KDA kernel in the FLA repo. We have much more work to do, and to open-source.
Kimi.ai @Kimi_Moonshot
Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… (same announcement as quoted above)