Chenwei Cui

5.3K posts

@ccui42

CS PhD Student @ Kerner Lab @hannah_kerner @SCAI_ASU. I am interested in the science of machine learning.

Tempe, AZ · Joined March 2023
142 Following · 274 Followers
Pinned Tweet
Chenwei Cui @ccui42
Introducing Multi-Head LatentMoE 🚀 Turns out, making NVIDIA's LatentMoE [1] multi-head further unlocks O(1), balanced, and deterministic communication.
Our insight: Head Parallel; move routing from before the all-to-all to after it. Token duplication happens locally. Always uniform, always deterministic.
It works orthogonally to EP as a new dimension of parallelism. For example, use HP for intra-cluster all-to-all as a highway, then use EP locally.
We propose FlashAttention-like routing and expert computation, both exact, IO-aware, and constant memory. This is to handle the increased number of sub-tokens.
Results:
- We replicate LatentMoE and confirm it is indeed faster than MoE, with matching model performance. (See Design Principle IV in [1])
- Up to 1.61x faster training than MoE+EP with identical model performance.
- Higher model performance while still 1.11x faster with doubled granularity.
📄 Paper: arxiv.org/abs/2602.04870…
💻 Code: github.com/kerner-lab/Spa…
[1] Elango et al., "LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts", 2026. arxiv.org/abs/2601.18089
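A minimal single-process sketch of the head-parallel idea as described above (not the authors' implementation): each rank exchanges a fixed-size head slice of every token, and only after that uniform exchange does top-k routing run locally on the resulting sub-tokens. All names, shapes, and the per-sub-token router are illustrative assumptions.

```python
# Sketch of "move routing from before the all-to-all to after" (assumptions, not the paper's code).
import torch

R, T, H, D, E, TOPK = 4, 8, 8, 16, 16, 2      # ranks, tokens/rank, heads, head dim, experts, top-k
assert H % R == 0

x = torch.randn(R, T, H, D)                   # per-rank activations, already split into heads

# Head-Parallel all-to-all: every rank sends the same-sized head slice to every other
# rank, so communication is uniform and deterministic regardless of routing decisions.
x_slices = x.reshape(R, T, R, H // R, D)      # split the head dim into R contiguous slices
hp = x_slices.permute(2, 0, 1, 3, 4)          # (dst_rank, src_rank, T, H/R, D)
hp = hp.reshape(R, R * T, H // R, D)          # each rank now holds a head slice of *all* tokens

# Routing happens after the exchange, locally on each rank, so any token duplication
# for top-k experts stays on-device.
router = torch.nn.Linear(D, E, bias=False)
logits = router(hp)                           # one routing decision per sub-token (head slice)
topk_val, topk_idx = logits.topk(TOPK, dim=-1)

print("elements exchanged per (src, dst) rank pair:", x_slices[0, :, 0].numel())  # constant
print("local routing tensor per rank:", tuple(topk_idx.shape))
```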
Chenwei Cui retweeted
Yura Kuratov @yurakuratov
Introducing GradMem: writing context into memory with test-time gradient descent. Instead of encoding text with a forward pass, we optimize memory tokens per example with a reconstruction loss. So memory is written by running actual gradient descent on it at test time.
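A rough sketch of what "writing context into memory with test-time gradient descent" could look like (a toy stand-in, not the GradMem code): the network is frozen, a small set of memory embeddings is the only trainable tensor, and a few optimizer steps minimize a reconstruction loss on the context. The decoder, the teacher-forcing shift, and the step count are placeholder choices.

```python
# Toy illustration of test-time memory writing; the frozen decoder is a stand-in, not GradMem.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_mem, ctx_len = 1000, 64, 8, 32
dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)   # frozen "model"
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab)
for p in (*dec.parameters(), *embed.parameters(), *lm_head.parameters()):
    p.requires_grad_(False)                                             # the network itself is never updated

context = torch.randint(0, vocab, (1, ctx_len))                         # text to write into memory
causal = torch.triu(torch.full((ctx_len, ctx_len), float("-inf")), diagonal=1)

memory = nn.Parameter(0.02 * torch.randn(1, n_mem, d_model))            # the only trainable tensor
opt = torch.optim.Adam([memory], lr=1e-2)

for _ in range(50):                                                      # test-time optimization loop
    tgt = embed(torch.roll(context, 1, dims=1))                          # crude teacher forcing
    h = dec(tgt, memory, tgt_mask=causal)                                # context attends to the memory tokens
    loss = F.cross_entropy(lm_head(h).reshape(-1, vocab), context.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# `memory` now carries a gradient-written summary of `context`; pass it instead of the raw text.
```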
Chenwei Cui retweeted
Hanchi Sun @sun_hanchi
Introducing Expert Threshold Routing:
✅ load balance
✅ dynamic computation
✅ autoregressive
✅ zero train-inference mismatch
At 2.4B params, Expert Threshold achieves 0.067 lower CE loss than Token Choice (equivalent to 1.6× data efficiency).
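A minimal sketch of what threshold routing looks like (one reading of the thread, not necessarily the authors' exact formulation): each expert has a threshold, and a token is dispatched to every expert whose gate score clears it, so the decision depends only on the token itself while the thresholds set the average load.

```python
# Illustrative threshold routing; thresholds here are fixed constants, not the paper's learned ones.
import torch

n_tokens, d_model, n_experts = 16, 32, 8
x = torch.randn(n_tokens, d_model)
gate = torch.nn.Linear(d_model, n_experts, bias=False)
tau = torch.full((n_experts,), 0.55)          # per-expert thresholds (assumed; could be learned or calibrated)

scores = gate(x).sigmoid()                    # (tokens, experts), computed per token independently
dispatch = scores > tau                       # send the token to every expert it clears
weights = scores * dispatch                   # combine expert outputs with the clearing scores

print("tokens per expert:", dispatch.sum(0).tolist())   # load is governed by tau, not by other tokens
print("experts per token:", dispatch.sum(1).tolist())   # per-token compute is dynamic
```

Because no other token enters the decision, the same rule applies identically during autoregressive decoding, which is where the zero train-inference mismatch comes from.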
Chenwei Cui retweeted
Hanchi Sun @sun_hanchi
Conceptually, ET = Expert Choice on an infinitely large batch. As batch size grows, each token's influence on the threshold vanishes, making routing independent and causal. This also means ET enables causal inference for EC-trained models without retraining.
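One way to check the "Expert Choice on an infinitely large batch" intuition numerically (a toy, not from the paper): Expert Choice keeps the top capacity fraction of tokens per expert, i.e. an empirical per-expert quantile of gate scores, and that quantile settles to a fixed value as the batch grows, so each token's decision stops depending on its batch-mates.

```python
# Empirical Expert Choice cut-off converging to a fixed threshold as batch size grows.
import torch

torch.manual_seed(0)
capacity_frac = 0.25                                      # EC keeps the top 25% of tokens per expert
for batch in (64, 1_024, 16_384, 262_144):
    scores = torch.rand(batch)                            # gate scores of one expert over the batch
    k = int(capacity_frac * batch)
    cutoff = scores.topk(k).values.min()                  # lowest score Expert Choice still accepts
    print(f"batch={batch:>7d}  EC cutoff={cutoff.item():.4f}")   # approaches the fixed 0.75 quantile
```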
Chenwei Cui retweeted
Rosinality @rosinality
Updating reward model and policy together during RLHF in a per-batch manner, with active learning. It is possible here with model-based feedback. But how could it be implemented practically with actual human feedback?
Chenwei Cui retweeted
mehul @emptysaysstuff
cross-entropy loss creates a score for every token in the vocabulary (128K+ for Llama-3). you use it once and discard it. the fix: process it in chunks, keep a running total. same math, 97% less memory. opened a PR on JAX, the issue author asked for more features, built those too
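The "chunks plus a running total" trick, sketched in PyTorch for illustration (the PR mentioned above is against JAX, and the details there will differ): never materialize the full tokens-by-vocabulary logit matrix; compute logits one vocabulary chunk at a time, carry a running logsumexp and the target token's logit, and combine at the end. Function and argument names are placeholders.

```python
# Chunked cross-entropy: same loss as the dense version, memory ~ N * chunk instead of N * V.
import torch

def chunked_cross_entropy(hidden, weight, targets, chunk=8192):
    """hidden: (N, d) final hidden states; weight: (V, d) unembedding matrix; targets: (N,)."""
    N, V = hidden.shape[0], weight.shape[0]
    running_lse = torch.full((N,), float("-inf"), device=hidden.device)
    target_logit = torch.empty(N, device=hidden.device)
    for start in range(0, V, chunk):
        logits = hidden @ weight[start:start + chunk].T                   # only (N, chunk) is ever alive
        running_lse = torch.logaddexp(running_lse, logits.logsumexp(-1))  # fold this chunk into the total
        hit = (targets >= start) & (targets < start + chunk)
        target_logit[hit] = logits[hit, targets[hit] - start]             # remember the target's logit
    return (running_lse - target_logit).mean()                            # -log softmax at the target

# sanity check against the dense reference
h, w = torch.randn(16, 64), torch.randn(1000, 64)
t = torch.randint(0, 1000, (16,))
dense = torch.nn.functional.cross_entropy(h @ w.T, t)
print(torch.allclose(chunked_cross_entropy(h, w, t, chunk=128), dense, atol=1e-5))
```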
Chenwei Cui retweeted
OpenAI @OpenAI
GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it's 2x faster than GPT-5 mini. openai.com/index/introduc…
Chenwei Cui retweeted
Lancer @HOPPMOHUO
Similar to Kimi's AttnRes, both aim to solve information propagation along the depth dimension, but MoDA fuses it inside the attention operator, while AttnRes replaces the fixed weights at the residual-connection level.
Rosinality@rosinality

ByteDance also implemented attention over depth. They literally combined it with sequence attention.

Chenwei Cui retweeted
Harry Partridge @part_harry_
Attention residuals and mixture of expert reuse (x.com/yichen4nlp/sta…) are two independent results pointing in the same direction: a single transformer layer, looped n times, is more efficient than n independent transformer layers. As @willccbb has often remarked, the best, most enduring discoveries are when you get improved performance by making the architecture LESS complicated. It seems abundantly clear to me that a single ultra wide layer, looped n times, can be made into a strict generalisation of the current paradigm, whilst also being more elegant in its simplicity.
Kimi.ai@Kimi_Moonshot

Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
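A minimal sketch of the stated idea (not Moonshot's implementation, and without Block AttnRes): keep the outputs of all preceding layers and let each layer read its input as a learned, input-dependent attention over that depth-wise history, instead of the fixed residual sum. Block structure, scaling tricks, and the reported efficiency numbers are not reproduced here.

```python
# Illustrative depth-wise attention residual; the per-layer blocks are stand-in Linear layers.
import torch
import torch.nn as nn

class DepthAttentionResidual(nn.Module):
    """Replace h_{l+1} = h_l + f(h_l) with attention over all preceding layer outputs."""
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        history = [x]                                        # depth-wise history of layer outputs
        for block in self.blocks:
            past = torch.stack(history, dim=2)               # (batch, seq, depth, d)
            score = torch.einsum("bsd,bsld->bsl", self.q(history[-1]), self.k(past))
            mix = torch.einsum("bsl,bsld->bsd", (score / past.shape[-1] ** 0.5).softmax(-1), past)
            history.append(mix + block(mix))                 # learned, input-dependent aggregation
        return history[-1]

out = DepthAttentionResidual(d_model=32, n_layers=4)(torch.randn(2, 5, 32))
print(out.shape)                                             # torch.Size([2, 5, 32])
```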

Chenwei Cui retweeted
Lilian Weng @lilianweng
I've been telling people this a lot today: I enjoy so much working with people who care about what they are building and craftsmanship. It is a privilege to have a chance to work on something I'm passionate about, beyond making a living. I cherish it and don't take it for granted.
Chenwei Cui retweeted
Andrej Karpathy @karpathy
@Yulun_Du @ilyasut SGD is a ResNet too (the blocks of it are fwd+bwd), the residual stream is the weights so... 🤔 We're not taking the Attention is All You Need part literally enough? :D
Chenwei Cui retweeted
Yu Zhang 🐙🌘 @yzhang_cs
The idea of rotating attention by 90° is sooooooo cool (credits to @Jianlin_S's insights), and it surprisingly works. We (w/ the amazing @nathan) are so excited about this; been working on the paper for months and couldn't stop. Go give it a try. It's a drop-in replacement for standard residuals, born in 2015. really like the figs btw :-)
Kimi.ai @Kimi_Moonshot (quoting the Attention Residuals announcement reproduced above)
