Yilong Chen
@Yichen4NLP
28 posts

PhD Student at UCAS. Research on efficiency and generalization in LLM architectures. Still a lot to explore in model architecture.

Joined January 2025
141 Following · 115 Followers
Pinned Tweet
Yilong Chen @Yichen4NLP
We introduce MoUE, a new MoE paradigm that boosts base-model performance by up to 1.3 points when trained from scratch and by up to 4.2 points on average in checkpoint conversion, without increasing either activated parameters or total parameters. The main idea is simple: a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE. arxiv.org/abs/2603.04971 huggingface.co/papers/2603.04… #MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining
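The core claim of the pinned tweet (a wide MoE layer whose experts are recursively reused by every layer strictly generalizes standard MoE) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: all names (`moe_layer`, `forward`), sizes, and the linear experts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K, N_LAYERS = 16, 8, 2, 4

# One global pool of expert weights, reused by every layer. A standard MoE
# is the special case where each layer routes only into its own disjoint
# slice of the pool, so the shared pool strictly generalizes it.
experts = rng.standard_normal((N_EXPERTS, D, D)) / np.sqrt(D)
routers = rng.standard_normal((N_LAYERS, D, N_EXPERTS)) / np.sqrt(D)

def moe_layer(x, router):
    """Top-k routing of a single token over the shared expert pool."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]            # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())  # stable softmax over top-k
    w /= w.sum()
    return sum(wi * (experts[e] @ x) for wi, e in zip(w, top))

def forward(x):
    # Each layer routes independently but draws from the same expert pool,
    # so total parameters stay fixed while depth-wise reuse adds capacity.
    for layer in range(N_LAYERS):
        x = x + moe_layer(x, routers[layer])     # residual connection
    return x

y = forward(rng.standard_normal(D))
```

Per-layer routers over one shared pool are what make the per-layer-experts model a special case: restrict each router's support to a disjoint slice and you recover vanilla MoE.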
Yilong Chen @Yichen4NLP
@YouJiacheng @classiclarryd We can quickly verify it on nano, but I don't see a particularly big gap between this article and other byte tokenizers, e.g. where the 20x gain would come from.
You Jiacheng @YouJiacheng
HUGE if true: this is probably a larger efficiency gain than ALL publicly available techniques since DeepSeekMoE (Jan 2024) COMBINED. And it could just win the modded-nanogpt speedrun. (1e18 is 250s @ 50% MFU, but the loss is significantly lower than 3.28.) cc @classiclarryd
Chen-Hao (Lance) Chao @chenhao_chao

(2/7) 💵 With training costs exceeding $100M for GPT-4, efficient alternatives matter. We show that diffusion LMs unlock a new paradigm for compute-optimal language pre-training.

Yilong Chen @Yichen4NLP
@karpathy Used your repo for my latest experiments, super cool stuff! I did notice that a lot of the gains come from hyperparameter tweaks and existing methods, though. Any ideas on how to take it a step further into some really original territory?
Yilong Chen reposted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well manually tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I do daily, for two decades now. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of experimental results and used them to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I hadn't found them manually, and they stack up and actually improved nanochat. Among the bigger things:
- It noticed an oversight that my parameterless QK norm didn't have a scale multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
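One of the concrete findings above (a parameterless QK norm leaving attention "too diffuse" without a scale multiplier) is easy to reproduce in miniature. This is a hedged sketch, not nanochat's code: the function name, shapes, and the value 12.0 are made up, and `scale` merely stands in for the learnable multiplier the agent added.

```python
import numpy as np

def qk_norm_attention(q, k, v, scale):
    """Attention over unit-normalized q/k with a scale multiplier.
    With parameterless QK norm the logits are cosine similarities in
    [-1, 1], so the softmax is nearly uniform ("too diffuse"); a
    learnable `scale` lets the model sharpen it."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = scale * (qn @ kn.T)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v, probs

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
_, diffuse = qk_norm_attention(q, k, v, scale=1.0)   # no multiplier
_, sharp = qk_norm_attention(q, k, v, scale=12.0)    # sharpened attention
```

With `scale=1.0` every attention row is close to uniform; a larger multiplier concentrates each row's mass, which is the effect the tweet describes.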
Yilong Chen @Yichen4NLP
Yeah, totally — I think this is a very meaningful point. And honestly, it’s super valuable for me to get feedback from someone with real large-scale training experience. I’ve been thinking about these issues a lot too. I think you’re right here: if we try to share across the whole model, at least for now it actually seems worse than more local / smaller-scale sharing, just because the systems complexity gets too high. Really appreciate the suggestion overall. And please feel free to keep the ideas coming — I’d love to hear more.
biased estimator @selfattentive
Maybe. To be clear I don’t mean to detract from your work, I like it a lot. I just think at large enough param scale one has to move from combining the entire model’s experts in one layer toward combining chunks of layers’ experts in one layer. Still gets many of the benefits of a small scale implementation though.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
This reframing is almost as important as the paper itself. DeepSeek-MoE, the archetype of the modern MoE shape, was aimed at «Ultimate Expert Specialization». But that's per layer. In a MoUE with layer-independent routing, you can have true specialization, and interpretability.
Yilong Chen @Yichen4NLP

So we changed both the architecture and the objective. MoUE uses:
- a Staggered Rotational Topology to localize search,
- UELB to balance experts by exposure, not raw count,
- a lightweight Universal Router for coherent multi-step routing.

Yilong Chen @Yichen4NLP
One more thought: if sharing is on the table, infra may get a new optimization lever. On an existing cluster, better PP / expert placement is not only a throughput issue — it can also shape expert locality, and possibly how much specialization MoUE can extract in practice. So in some cases, colocating experts may help quality too, not just systems efficiency.
Yilong Chen @Yichen4NLP
@selfattentive @teortaxesTex Our intuition is that PP and experts operate on different axes. PP slices the model by depth, while experts slice it by function. As long as a PP rank contains multiple experts, routing can still produce specialization locally and you still get benefits from parameter sharing.
Yilong Chen @Yichen4NLP
Also, increasing expert count often means increasing node count, which has real cost both in training and inference. So another angle is: instead of constantly adding parameters and hardware, can we organize the existing parameters better so they specialize more effectively? If that works, you can get algorithmic gains without paying the full economic cost of scaling out.
Yilong Chen @Yichen4NLP
@selfattentive @teortaxesTex Another lever here is experts per node. If expert capacity is useful, you don’t necessarily have to scale EP across more nodes. You can also increase the number of experts within a node, which keeps routing more local and reduces cross-node traffic.
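The locality argument in the tweet above (more experts per node means a token's top-k expert set spans fewer nodes, so less cross-node traffic) can be checked with a quick Monte-Carlo sketch. This assumes uniform routing and contiguous expert placement; the function name and the numbers are illustrative, not from the paper.

```python
import numpy as np

def expected_nodes_touched(total_experts, experts_per_node, top_k,
                           trials=5000, seed=0):
    """Monte-Carlo estimate of how many nodes a token's top-k experts
    span, assuming uniform routing and contiguous expert-to-node packing."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(trials):
        chosen = rng.choice(total_experts, size=top_k, replace=False)
        counts.append(len(set(chosen // experts_per_node)))  # distinct nodes
    return float(np.mean(counts))

# Same 64 experts and top-8 routing; packing more experts per node keeps
# a token's expert set on fewer nodes, i.e. less cross-node traffic.
few_per_node = expected_nodes_touched(64, experts_per_node=4, top_k=8)
many_per_node = expected_nodes_touched(64, experts_per_node=16, top_k=8)
```

Under these assumptions the 16-experts-per-node layout touches strictly fewer nodes per token than the 4-per-node layout, which is the systems benefit the tweet points at.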
Yilong Chen @Yichen4NLP
@selfattentive @teortaxesTex We may also want to revisit the optimal tradeoff among depth, width, and expert count. Since some capacity can be covered by shared experts, we might allocate those parameters to more important parts of the model instead, such as the hidden dimension.
biased estimator @selfattentive
@teortaxesTex Okay, but for a big enough model you will need pipeline parallelism, no? At that point why bother with parameter sharing; it seems better to just get new params for each PP group.
Yilong Chen @Yichen4NLP
@strong_signal1 Yes, that's a feasible approach for downstream systems built on open-source models. For model developers, though, it might work even better, since they have access to the original pretraining data and can maintain a more consistent expert distribution.
strongsignal @strong_signal1
@Yichen4NLP Would a feasible extension be initializing experts using MLP/MoE blocks from already-trained models and then pretraining to unify the system? There are tons of open-source specialized models out there, which seems to fit.
Yilong Chen @Yichen4NLP
@swonzon @pmarca haha I also hope OpenAI ends up using it someday then maybe $MOUE will finally pump :)
Swan @swonzon
@pmarca just followed this stacked chinese quant (@Yichen4NLP) who is working to improve the Mixture-of-Experts (MoE) systems used in large AI models like transformers, which is sick. MoE works by dividing the model into specialized "experts" that only activate for relevant parts of the input, saving compute power. He has gotten compliments from huge accounts like @huggingface; this could be an eye opener for new compute systems. Here is his github: github.com/arxiv Sending fees to him!
Yilong Chen @Yichen4NLP
@teortaxesTex @kalomaze Great intuition! Stronger routers can actually benefit from these harder load-balancing regimes. We’ve also tried a few small router improvements (Universal Router and I'll try yours). At this point standard LBL is mostly solved — maybe it’s time to focus on the harder cases. 😂
Yilong Chen @Yichen4NLP
@Xinyu2ML if we could truly solve load balancing, an infinitely wide MoE-MLP might already be enough for AGI lol.
Yilong Chen @Yichen4NLP
@rudzinskimaciej @teortaxesTex Yes! I have tried this idea, and it has some effect, but the improvement is not particularly significant. The challenge lies in achieving balanced training while ensuring good utilization of both local and shared experts.
Rudzinski Maciej @rudzinskimaciej
And I forgot to mention the maybe-obvious fact that inverted layers are free in terms of memory, so the gain is much larger than just expressivity; similarly with looping. With looping it also bothered me: can we reuse the attention from the first pass to keep cost down? Last thing: looping has the nice property that more expressive neurons start to make more sense. More nonlinearity, addition, or some exotic operations make a ton more sense in the looped case, and the cost amortizes.
Yilong Chen @Yichen4NLP
@rudzinskimaciej @teortaxesTex What you said makes a lot of sense! Considering that this topology can have countless combinations, there is a lot of work to be done to study how to reduce the complexity of the algorithm's search space while achieving good results.
Rudzinski Maciej @rudzinskimaciej
@teortaxesTex Even nicer would be a set of global experts plus local ones, to make it easier for the router. This should naturally allow for looping in the model. With some effort, a meta-router for layer choice or reuse per token... oh, now it is getting interesting :D
Yilong Chen @Yichen4NLP
Thank you for your interest! CPT does indeed yield excellent MoUE results. In our experiments we achieved good results even without complex design or hyperparameter search (for example, universal expert selection was simply randomized). However, CPT requires a very fine-grained warmup to ensure that the routing does not collapse.
Yilong Chen @Yichen4NLP
MoEUT is a representative work combining MoE and UT, and I really like it! However, it mainly focuses on model-level recursion, which differs from the layer-level reuse problem that MoUE addresses. We will add a discussion of it to the paper. Thank you for the suggestion!
Yilong Chen @Yichen4NLP
The result is a useful scaling trade: instead of buying capacity mainly with more activated compute or more stored parameters, we can trade algorithmic structure for capacity by increasing global reusable experts and their recursive compositions. In practice:
- up to +1.3 avg from scratch, with no increase in activated or total params
- ~+2.5 in depth expansion
- up to +4.2 avg in checkpoint conversion / CPT
Our bet is that MoE may scale not only by adding more experts, but by making experts more reusable, modular, and globally composable. That is the direction behind MoUE.
Yilong Chen @Yichen4NLP
The UELB point is central. Under reuse, load balancing should not just be layer-local. It should reflect the computation graph. That gives a new depth-wise / topology-aware view of load balancing: balance experts relative to where they can be used, not how often they appear globally. This is a different optimization problem from standard MoE.
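The exposure idea above can be illustrated with a toy loss. To be clear, this is a guess at the concept described in the thread, not UELB's actual formulation: it only balances each expert's share of routed tokens against how often the topology makes that expert reachable, rather than against a uniform global count.

```python
import numpy as np

def exposure_balanced_loss(assignments, exposure):
    """Toy topology-aware balancing loss: penalize each expert's share of
    routed tokens relative to its share of reachable (layer, step) slots,
    instead of balancing raw counts globally."""
    load = assignments / assignments.sum()
    reach = exposure / exposure.sum()
    normalized = load / reach        # > 1 means over-used for its exposure
    return float(np.square(normalized - 1.0).mean())

# Expert 0 is reachable at twice as many positions (exposure 2 vs 1), so
# double the raw load on it is perfectly balanced per unit of exposure...
even = exposure_balanced_loss(np.array([200.0, 100.0, 100.0]),
                              np.array([2.0, 1.0, 1.0]))
# ...while equal raw loads actually leave expert 0 under-used.
uneven = exposure_balanced_loss(np.array([100.0, 100.0, 100.0]),
                                np.array([2.0, 1.0, 1.0]))
```

A layer-local balancing loss would call the second case balanced and penalize the first; weighting by exposure reverses that, which is the "different optimization problem" the tweet describes.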