Chenwei Cui

5.3K posts

@ccui42

CS PhD Student @ Kerner Lab @hannah_kerner @SCAI_ASU. I am interested in the science of machine learning.

Tempe, AZ · Joined March 2023
142 Following · 274 Followers
Pinned Tweet
Chenwei Cui @ccui42
Introducing Multi-Head LatentMoE 🚀 Turns out, making NVIDIA's LatentMoE [1] multi-head further unlocks O(1), balanced, and deterministic communication.
Our insight: Head Parallel; move routing from before the all-to-all to after it. Token duplication happens locally. Always uniform, always deterministic.
It works orthogonally to EP as a new dimension of parallelism. For example, use HP for intra-cluster all-to-all as a highway, then use EP locally.
We propose FlashAttention-like routing and expert computation, both exact, IO-aware, and constant memory. This is to handle the increased number of sub-tokens.
Results:
- We replicate LatentMoE and confirm it is indeed faster than MoE, with matching model performance. (See Design Principle IV in [1])
- Up to 1.61x faster training than MoE+EP with identical model performance.
- Higher model performance while still 1.11x faster with doubled granularity.
📄 Paper: arxiv.org/abs/2602.04870…
💻 Code: github.com/kerner-lab/Spa…
[1] Elango et al., "LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts", 2026. arxiv.org/abs/2601.18089
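A minimal single-process sketch of the head-parallel idea as described above (not the authors' implementation): each rank exchanges a fixed-size head slice of every token, and only after that uniform exchange does top-k routing run locally on the resulting sub-tokens. All names, shapes, and the per-sub-token router are illustrative assumptions.

```python
# Sketch of "move routing from before the all-to-all to after" (assumptions, not the paper's code).
import torch

R, T, H, D, E, TOPK = 4, 8, 8, 16, 16, 2      # ranks, tokens/rank, heads, head dim, experts, top-k
assert H % R == 0

x = torch.randn(R, T, H, D)                   # per-rank activations, already split into heads

# Head-Parallel all-to-all: every rank sends the same-sized head slice to every other
# rank, so communication is uniform and deterministic regardless of routing decisions.
x_slices = x.reshape(R, T, R, H // R, D)      # split the head dim into R contiguous slices
hp = x_slices.permute(2, 0, 1, 3, 4)          # (dst_rank, src_rank, T, H/R, D)
hp = hp.reshape(R, R * T, H // R, D)          # each rank now holds a head slice of *all* tokens

# Routing happens after the exchange, locally on each rank, so any token duplication
# for top-k experts stays on-device.
router = torch.nn.Linear(D, E, bias=False)
logits = router(hp)                           # one routing decision per sub-token (head slice)
topk_val, topk_idx = logits.topk(TOPK, dim=-1)

print("elements exchanged per (src, dst) rank pair:", x_slices[0, :, 0].numel())  # constant
print("local routing tensor per rank:", tuple(topk_idx.shape))
```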
Chenwei Cui retweeted
Yura Kuratov @yurakuratov
Introducing GradMem: writing context into memory with test-time gradient descent. Instead of encoding text with a forward pass, we optimize memory tokens per example with a reconstruction loss. So memory is written by running actual gradient descent on it at test time.
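A rough sketch of what "writing context into memory with test-time gradient descent" could look like (a toy stand-in, not the GradMem code): the network is frozen, a small set of memory embeddings is the only trainable tensor, and a few optimizer steps minimize a reconstruction loss on the context. The decoder, the teacher-forcing shift, and the step count are placeholder choices.

```python
# Toy illustration of test-time memory writing; the frozen decoder is a stand-in, not GradMem.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_mem, ctx_len = 1000, 64, 8, 32
dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)   # frozen "model"
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab)
for p in (*dec.parameters(), *embed.parameters(), *lm_head.parameters()):
    p.requires_grad_(False)                                             # the network itself is never updated

context = torch.randint(0, vocab, (1, ctx_len))                         # text to write into memory
causal = torch.triu(torch.full((ctx_len, ctx_len), float("-inf")), diagonal=1)

memory = nn.Parameter(0.02 * torch.randn(1, n_mem, d_model))            # the only trainable tensor
opt = torch.optim.Adam([memory], lr=1e-2)

for _ in range(50):                                                      # test-time optimization loop
    tgt = embed(torch.roll(context, 1, dims=1))                          # crude teacher forcing
    h = dec(tgt, memory, tgt_mask=causal)                                # context attends to the memory tokens
    loss = F.cross_entropy(lm_head(h).reshape(-1, vocab), context.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# `memory` now carries a gradient-written summary of `context`; pass it instead of the raw text.
```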
Chenwei Cui retweeted
Hanchi Sun @sun_hanchi
Introducing Expert Threshold Routing:
✅ load balance
✅ dynamic computation
✅ autoregressive
✅ zero train-inference mismatch
At 2.4B params, Expert Threshold achieves 0.067 lower CE loss than Token Choice (equivalent to 1.6× data efficiency).
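A minimal sketch of what threshold routing looks like (one reading of the thread, not necessarily the authors' exact formulation): each expert has a threshold, and a token is dispatched to every expert whose gate score clears it, so the decision depends only on the token itself while the thresholds set the average load.

```python
# Illustrative threshold routing; thresholds here are fixed constants, not the paper's learned ones.
import torch

n_tokens, d_model, n_experts = 16, 32, 8
x = torch.randn(n_tokens, d_model)
gate = torch.nn.Linear(d_model, n_experts, bias=False)
tau = torch.full((n_experts,), 0.55)          # per-expert thresholds (assumed; could be learned or calibrated)

scores = gate(x).sigmoid()                    # (tokens, experts), computed per token independently
dispatch = scores > tau                       # send the token to every expert it clears
weights = scores * dispatch                   # combine expert outputs with the clearing scores

print("tokens per expert:", dispatch.sum(0).tolist())   # load is governed by tau, not by other tokens
print("experts per token:", dispatch.sum(1).tolist())   # per-token compute is dynamic
```

Because no other token enters the decision, the same rule applies identically during autoregressive decoding, which is where the zero train-inference mismatch comes from.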
Chenwei Cui retweeted
Hanchi Sun @sun_hanchi
Conceptually, ET = Expert Choice on an infinitely large batch. As batch size grows, each token's influence on the threshold vanishes, making routing independent and causal. This also means ET enables causal inference for EC-trained models without retraining.
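One way to check the "Expert Choice on an infinitely large batch" intuition numerically (a toy, not from the paper): Expert Choice keeps the top capacity fraction of tokens per expert, i.e. an empirical per-expert quantile of gate scores, and that quantile settles to a fixed value as the batch grows, so each token's decision stops depending on its batch-mates.

```python
# Empirical Expert Choice cut-off converging to a fixed threshold as batch size grows.
import torch

torch.manual_seed(0)
capacity_frac = 0.25                                      # EC keeps the top 25% of tokens per expert
for batch in (64, 1_024, 16_384, 262_144):
    scores = torch.rand(batch)                            # gate scores of one expert over the batch
    k = int(capacity_frac * batch)
    cutoff = scores.topk(k).values.min()                  # lowest score Expert Choice still accepts
    print(f"batch={batch:>7d}  EC cutoff={cutoff.item():.4f}")   # approaches the fixed 0.75 quantile
```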
Chenwei Cui retweeted
Rosinality @rosinality
Updating reward model and policy together during RLHF in a per-batch manner, with active learning. It is possible here with model-based feedback. But how could it be implemented practically with actual human feedback?
Chenwei Cui retweeted
mehul @emptysaysstuff
cross-entropy loss creates a score for every token in the vocabulary (128K+ for Llama-3). you use it once and discard it. the fix: process it in chunks, keep a running total. same math, 97% less memory. opened a PR on JAX, the issue author asked for more features, built those too
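The "chunks plus a running total" trick, sketched in PyTorch for illustration (the PR mentioned above is against JAX, and the details there will differ): never materialize the full tokens-by-vocabulary logit matrix; compute logits one vocabulary chunk at a time, carry a running logsumexp and the target token's logit, and combine at the end. Function and argument names are placeholders.

```python
# Chunked cross-entropy: same loss as the dense version, memory ~ N * chunk instead of N * V.
import torch

def chunked_cross_entropy(hidden, weight, targets, chunk=8192):
    """hidden: (N, d) final hidden states; weight: (V, d) unembedding matrix; targets: (N,)."""
    N, V = hidden.shape[0], weight.shape[0]
    running_lse = torch.full((N,), float("-inf"), device=hidden.device)
    target_logit = torch.empty(N, device=hidden.device)
    for start in range(0, V, chunk):
        logits = hidden @ weight[start:start + chunk].T                   # only (N, chunk) is ever alive
        running_lse = torch.logaddexp(running_lse, logits.logsumexp(-1))  # fold this chunk into the total
        hit = (targets >= start) & (targets < start + chunk)
        target_logit[hit] = logits[hit, targets[hit] - start]             # remember the target's logit
    return (running_lse - target_logit).mean()                            # -log softmax at the target

# sanity check against the dense reference
h, w = torch.randn(16, 64), torch.randn(1000, 64)
t = torch.randint(0, 1000, (16,))
dense = torch.nn.functional.cross_entropy(h @ w.T, t)
print(torch.allclose(chunked_cross_entropy(h, w, t, chunk=128), dense, atol=1e-5))
```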
Chenwei Cui retweeted
OpenAI @OpenAI
GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it's 2x faster than GPT-5 mini. openai.com/index/introduc…
Chenwei Cui retweeted
Lancer @HOPPMOHUO
Similar to Kimi's AttnRes, both aim to solve information propagation along the depth dimension, but MoDA fuses it inside the attention operator, while AttnRes replaces the fixed weights at the residual-connection level.
Rosinality@rosinality

ByteDance also implemented attention over depth. They literally combined it with sequence attention.

Chenwei Cui retweeted
Harry Partridge @part_harry_
Attention residuals and mixture of expert reuse (x.com/yichen4nlp/sta…) are two independent results pointing in the same direction: a single transformer layer, looped n times, is more efficient than n independent transformer layers. As @willccbb has often remarked, the best, most enduring discoveries are when you get improved performance by making the architecture LESS complicated. It seems abundantly clear to me that a single ultra wide layer, looped n times, can be made into a strict generalisation of the current paradigm, whilst also being more elegant in its simplicity.
Kimi.ai@Kimi_Moonshot

Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
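A minimal sketch of the stated idea (not Moonshot's implementation, and without Block AttnRes): keep the outputs of all preceding layers and let each layer read its input as a learned, input-dependent attention over that depth-wise history, instead of the fixed residual sum. Block structure, scaling tricks, and the reported efficiency numbers are not reproduced here.

```python
# Illustrative depth-wise attention residual; the per-layer blocks are stand-in Linear layers.
import torch
import torch.nn as nn

class DepthAttentionResidual(nn.Module):
    """Replace h_{l+1} = h_l + f(h_l) with attention over all preceding layer outputs."""
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        history = [x]                                        # depth-wise history of layer outputs
        for block in self.blocks:
            past = torch.stack(history, dim=2)               # (batch, seq, depth, d)
            score = torch.einsum("bsd,bsld->bsl", self.q(history[-1]), self.k(past))
            mix = torch.einsum("bsl,bsld->bsd", (score / past.shape[-1] ** 0.5).softmax(-1), past)
            history.append(mix + block(mix))                 # learned, input-dependent aggregation
        return history[-1]

out = DepthAttentionResidual(d_model=32, n_layers=4)(torch.randn(2, 5, 32))
print(out.shape)                                             # torch.Size([2, 5, 32])
```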

Chenwei Cui retweeted
Lilian Weng @lilianweng
I've been telling people this a lot today: I enjoy so much working with people who care about what they are building and craftsmanship. It is a privilege to have a chance to work on something I'm passionate about, beyond making a living. I cherish it and don't take it for granted.
Chenwei Cui retweeted
Andrej Karpathy @karpathy
@Yulun_Du @ilyasut SGD is a ResNet too (the blocks of it are fwd+bwd), the residual stream is the weights so... 🤔 We're not taking the Attention is All You Need part literally enough? :D
Chenwei Cui retweeted
Yu Zhang 🐙🌘 @yzhang_cs
The idea of rotating attention by 90° is sooooooo cool (credits to @Jianlin_S's insights), and it surprisingly works. We (w/ the amazing @nathan) are so excited about this; been working on the paper for months and couldn't stop. Go give it a try. It's a drop-in replacement for standard residuals, born in 2015. really like the figs btw :-)
Kimi.ai @Kimi_Moonshot (quoting the Attention Residuals announcement reproduced above)
