Chenwei Cui

5.3K posts

@ccui42

CS PhD Student @ Kerner Lab @hannah_kerner @SCAI_ASU. I am interested in the science of machine learning.

Tempe, AZ · Joined March 2023
142 Following · 274 Followers
Pinned Tweet
Chenwei Cui @ccui42 ·
Introducing Multi-Head LatentMoE 🚀

Turns out, making NVIDIA's LatentMoE [1] multi-head further unlocks O(1), balanced, and deterministic communication.

Our insight: Head Parallel; move routing from before all-to-all to after. Token duplication happens locally. Always uniform, always deterministic.

It works orthogonally to EP as a new dimension of parallelism. For example, use HP for intra-cluster all-to-all as a highway, then use EP locally.

We propose FlashAttention-like routing and expert computation, both exact, IO-aware, and constant memory. This is to handle the increased number of sub-tokens.

Results:
- We replicate LatentMoE and confirm it is indeed faster than MoE, with matching model performance. (See Design Principle IV in [1])
- Up to 1.61x faster training than MoE+EP with identical model performance.
- Higher model performance while still 1.11x faster with doubled granularity.

📄 Paper: arxiv.org/abs/2602.04870…
💻 Code: github.com/kerner-lab/Spa…

[1] Elango et al., "LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts", 2026. arxiv.org/abs/2601.18089
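As a reading aid, here is a minimal NumPy sketch of the communication claim in the pinned post; the device count, shapes, head-to-device mapping, and stand-in router are illustrative assumptions, not the paper's implementation.

```python
# Minimal NumPy sketch of the communication claim above (assumptions, not the
# paper's code): device count, shapes, the head-to-device mapping, and the
# random stand-in router are all illustrative.
import numpy as np

D, T, H, d = 4, 8, 4, 16                    # devices, tokens per device, heads, head dim
rng = np.random.default_rng(0)
x = rng.normal(size=(D, T, H, d))           # each device holds its own tokens with all heads

# Expert parallelism: route first, then all-to-all. The per-pair send volume
# depends on the router's decisions, so it is unbalanced and nondeterministic.
dest = rng.integers(0, D, size=(D, T))      # toy top-1 routing: destination device per token
send_counts_ep = np.array([[(dest[s] == r).sum() for r in range(D)] for s in range(D)])

# Head parallelism: all-to-all first, route afterwards. Device h receives head h
# of every token from every source device, so the volume is fixed by (T, H, d)
# alone: always uniform, always deterministic. (Here H == D for simplicity.)
after_a2a = np.stack([x[:, :, h, :] for h in range(H)])   # (H, D, T, d): head h on device h
send_counts_hp = np.full((D, D), T)         # constant T head-slices per (source, destination)

print("EP send counts (routing-dependent):\n", send_counts_ep)
print("HP send counts (uniform):\n", send_counts_hp)
# Routing and token duplication then happen locally on each device, after the exchange.
```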
Chenwei Cui retweeted
Yura Kuratov @yurakuratov ·
Introducing GradMem: writing context into memory with test-time gradient descent. Instead of encoding text with a forward pass, we optimize memory tokens per example with a reconstruction loss. So memory is written by running actual gradient descent on it at test time.
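A minimal PyTorch sketch of the idea, assuming a tiny frozen network and a plain reconstruction loss; the model, shapes, optimizer, and training length are placeholders rather than GradMem's actual setup.

```python
# Minimal PyTorch sketch of the idea above (a stand-in, not GradMem's code): the
# network is frozen, and per-example memory tokens are optimized at test time
# with a reconstruction loss on the context. Model, shapes, optimizer, and loss
# are illustrative assumptions.
import torch
import torch.nn as nn

vocab, d_model, n_mem, seq_len = 1000, 64, 8, 32
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)
embed = nn.Embedding(vocab, d_model)
readout = nn.Linear(d_model, vocab)
for p in list(model.parameters()) + list(embed.parameters()) + list(readout.parameters()):
    p.requires_grad_(False)                        # weights stay frozen at test time

tokens = torch.randint(0, vocab, (1, seq_len))     # the context to "write" into memory
memory = torch.zeros(1, n_mem, d_model, requires_grad=True)  # per-example memory tokens
opt = torch.optim.Adam([memory], lr=1e-2)

for step in range(100):                            # actual gradient descent, on memory only
    inp = torch.cat([memory, embed(tokens)], dim=1)
    logits = readout(model(inp))[:, n_mem:, :]     # predictions at the context positions
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), tokens.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# `memory` now encodes the example and can be cached in place of the raw text.
# (A real setup would mask the context so reconstruction must flow through the memory.)
```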
Chenwei Cui retweeted
Hanchi Sun @sun_hanchi ·
Introducing Expert Threshold Routing:
- ✅ load balance
- ✅ dynamic computation
- ✅ autoregressive
- ✅ zero train-inference mismatch

At 2.4B params, Expert Threshold achieves 0.067 lower CE loss than Token Choice (equivalent to 1.6× data efficiency).
Chenwei Cui retweeted
Hanchi Sun @sun_hanchi ·
Conceptually, ET = Expert Choice on an infinitely large batch. As batch size grows, each token's influence on the threshold vanishes, making routing independent and causal. This also means ET enables causal inference for EC-trained models without retraining.
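A hedged sketch of what threshold routing could look like given these two posts, contrasted with per-token top-k (Token Choice); the fixed per-expert thresholds stand in for whatever statistic the method actually estimates, and the router and shapes are toy assumptions.

```python
# Hedged sketch of threshold-style routing as described above (shapes, router, and
# fixed thresholds are assumptions, not the paper's recipe): a token is dispatched
# to every expert whose affinity exceeds that expert's threshold. Since the decision
# never looks at other tokens in the batch, it is causal and identical at train and
# inference time, and the number of experts per token is dynamic.
import torch

n_tokens, d_model, n_experts = 16, 32, 4
torch.manual_seed(0)
x = torch.randn(n_tokens, d_model)
router = torch.nn.Linear(d_model, n_experts)
scores = router(x).sigmoid()                        # (tokens, experts) affinities in [0, 1]

# Token Choice baseline: top-k experts per token (load can pile onto a few experts).
tc_mask = torch.zeros_like(scores, dtype=torch.bool)
topk_idx = scores.topk(k=2, dim=1).indices
tc_mask[torch.arange(n_tokens).unsqueeze(1), topk_idx] = True

# Threshold routing: per-expert cutoffs, here constants standing in for whatever
# statistic the method actually uses (e.g. a quantile of the score distribution,
# which is what "Expert Choice on an infinitely large batch" suggests).
thresholds = torch.full((n_experts,), 0.6)
et_mask = scores > thresholds                       # no batch-level sort -> causal

print("tokens per expert (token choice):", tc_mask.sum(0).tolist())
print("tokens per expert (threshold):   ", et_mask.sum(0).tolist())
print("experts per token (threshold):   ", et_mask.sum(1).tolist())   # dynamic compute
```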
Chenwei Cui retweeted
Rosinality @rosinality ·
Updating reward model and policy together during RLHF in a per-batch manner, with active learning. It is possible here with model-based feedback. But how could it be implemented practically with actual human feedback?
Chenwei Cui retweeted
mehul @emptysaysstuff ·
cross-entropy loss creates a score for every token in the vocabulary (128K+ for Llama-3). you use it once and discard it. the fix: process it in chunks, keep a running total. same math, 97% less memory. opened a PR on JAX, the issue author asked for more features, built those too
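The same math can be sketched in a few lines of PyTorch (the actual change is a JAX PR; this is only an illustration of the chunking idea, with made-up shapes and chunk size).

```python
# Sketch of the chunked cross-entropy idea above, written in PyTorch rather than JAX
# (the real change is a JAX PR); shapes and chunk size are illustrative. The point:
# sweep the vocabulary in chunks and keep only a running log-sum-exp and the target
# logit, so the full (tokens, vocab) score matrix is never materialized.
import torch

n_tokens, d_model, vocab, chunk = 8, 64, 1000, 128
torch.manual_seed(0)
hidden = torch.randn(n_tokens, d_model)
unembed = torch.randn(vocab, d_model)               # output projection, (vocab, d_model)
targets = torch.randint(0, vocab, (n_tokens,))

# Dense reference: builds the full (n_tokens, vocab) matrix before reducing it.
dense_loss = torch.nn.functional.cross_entropy(hidden @ unembed.T, targets)

# Chunked version: peak live memory is O(n_tokens * chunk) instead of O(n_tokens * vocab).
running_lse = torch.full((n_tokens,), float("-inf"))
target_logit = torch.zeros(n_tokens)
for start in range(0, vocab, chunk):
    end = min(start + chunk, vocab)
    logits = hidden @ unembed[start:end].T          # (n_tokens, <= chunk)
    running_lse = torch.logaddexp(running_lse, logits.logsumexp(dim=1))
    in_chunk = (targets >= start) & (targets < end) # pick up each target's logit as we pass it
    target_logit[in_chunk] = logits[in_chunk, targets[in_chunk] - start]
chunked_loss = (running_lse - target_logit).mean()

print(float(dense_loss), float(chunked_loss))       # same math, much smaller peak memory
```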
Chenwei Cui retweeted
OpenAI @OpenAI ·
GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it’s 2x faster than GPT-5 mini. openai.com/index/introduc…
Chenwei Cui retweeted
Harry Partridge @part_harry_ ·
Attention residuals and mixture of expert reuse (x.com/yichen4nlp/sta…) are two independent results pointing in the same direction: a single transformer layer, looped n times, is more efficient than n independent transformer layers. As @willccbb has often remarked, the best, most enduring discoveries are when you get improved performance by making the architecture LESS complicated. It seems abundantly clear to me that a single ultra wide layer, looped n times, can be made into a strict generalisation of the current paradigm, whilst also being more elegant in its simplicity.
Kimi.ai @Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
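A toy sketch of the "attention over preceding layers" idea, under assumed dimensions and a placeholder per-layer block; it is not the Block AttnRes implementation described in the report.

```python
# Toy PyTorch sketch of "attention over preceding layers" as described above (not
# Kimi's Block AttnRes implementation): instead of the fixed residual update
# h_{l+1} = h_l + f_l(h_l), each layer mixes the stack of all earlier outputs with
# learned, input-dependent weights. Dimensions and the per-layer block are assumptions.
import torch
import torch.nn as nn

d_model, n_layers, seq_len = 64, 6, 10
torch.manual_seed(0)
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_layers)]
)
q_proj, k_proj = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

x = torch.randn(1, seq_len, d_model)
history = [x]                                     # outputs of all preceding layers (depth axis)
for block in blocks:
    h = torch.stack(history, dim=2)               # (batch, seq, depth_so_far, d_model)
    q = q_proj(history[-1]).unsqueeze(2)          # query from the most recent state
    attn = torch.softmax((q * k_proj(h)).sum(-1) / d_model ** 0.5, dim=2)
    mixed = (attn.unsqueeze(-1) * h).sum(2)       # input-dependent aggregation over depth
    history.append(mixed + block(mixed))          # replaces the uniform "+ previous state" sum
print(history[-1].shape)                          # torch.Size([1, 10, 64])
```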

Chenwei Cui retweeted
Lilian Weng @lilianweng ·
I’ve been telling people this a lot today: I enjoy so much working with people who care about what they are building and craftsmanship. It is a privilege to have a chance to work on something I’m passionate about, beyond making a living. I cherish it and don’t take it for granted.
Chenwei Cui retweeted
Andrej Karpathy @karpathy ·
@Yulun_Du @ilyasut SGD is a ResNet too (the blocks of it are fwd+bwd), the residual stream is the weights so... 🤔 We're not taking the Attention is All You Need part literally enough? :D