Yilong Chen
@Yichen4NLP
28 posts

PhD Student at UCAS. Research on efficiency and generalization in LLM architectures. Still a lot to explore in model architecture.

Joined January 2025
141 Following · 115 Followers
Pinned Tweet
Yilong Chen @Yichen4NLP
We introduce MoUE, a new MoE paradigm that boosts base-model performance by up to 1.3 points when trained from scratch and by up to 4.2 points on average in checkpoint conversion, without increasing either activated parameters or total parameters. The main idea is simple: a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE. arxiv.org/abs/2603.04971 huggingface.co/papers/2603.04… #MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining
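The core claim of the pinned tweet (a wide MoE layer whose experts are recursively reused by every layer strictly generalizes standard MoE) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: all names (`moe_layer`, `forward`), sizes, and the linear experts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K, N_LAYERS = 16, 8, 2, 4

# One global pool of expert weights, reused by every layer. A standard MoE
# is the special case where each layer routes only into its own disjoint
# slice of the pool, so the shared pool strictly generalizes it.
experts = rng.standard_normal((N_EXPERTS, D, D)) / np.sqrt(D)
routers = rng.standard_normal((N_LAYERS, D, N_EXPERTS)) / np.sqrt(D)

def moe_layer(x, router):
    """Top-k routing of a single token over the shared expert pool."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]            # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())  # stable softmax over top-k
    w /= w.sum()
    return sum(wi * (experts[e] @ x) for wi, e in zip(w, top))

def forward(x):
    # Each layer routes independently but draws from the same expert pool,
    # so total parameters stay fixed while depth-wise reuse adds capacity.
    for layer in range(N_LAYERS):
        x = x + moe_layer(x, routers[layer])     # residual connection
    return x

y = forward(rng.standard_normal(D))
```

Per-layer routers over one shared pool are what make the per-layer-experts model a special case: restrict each router's support to a disjoint slice and you recover vanilla MoE.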
Yilong Chen @Yichen4NLP
@YouJiacheng @classiclarryd We can quickly verify it on nano, but I don't see a particularly big gap between this article and other byte tokenizers, e.g. where the 20x gain would come from.
You Jiacheng @YouJiacheng
HUGE if true: this is probably a larger efficiency gain than ALL publicly available techniques since DeepSeekMoE (Jan 2024) COMBINED. And it could just win the modded-nanogpt speedrun. (1e18 is 250s @ 50% MFU, but the loss is significantly lower than 3.28.) cc @classiclarryd
Chen-Hao (Lance) Chao @chenhao_chao

(2/7) 💵 With training costs exceeding $100M for GPT-4, efficient alternatives matter. We show that diffusion LMs unlock a new paradigm for compute-optimal language pre-training.

Yilong Chen @Yichen4NLP
@karpathy Used your repo for my latest experiments, super cool stuff! I did notice that a lot of the gains come from hyperparameter tweaks and existing methods, though. Any ideas on how to take it a step further into some really original territory?
Yilong Chen reposted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well manually tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I do daily, for two decades now. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of experimental results and used them to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I hadn't found them manually, and they stack up and actually improved nanochat. Among the bigger things:
- It noticed an oversight that my parameterless QK norm didn't have a scale multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
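One of the concrete findings above (a parameterless QK norm leaving attention "too diffuse" without a scale multiplier) is easy to reproduce in miniature. This is a hedged sketch, not nanochat's code: the function name, shapes, and the value 12.0 are made up, and `scale` merely stands in for the learnable multiplier the agent added.

```python
import numpy as np

def qk_norm_attention(q, k, v, scale):
    """Attention over unit-normalized q/k with a scale multiplier.
    With parameterless QK norm the logits are cosine similarities in
    [-1, 1], so the softmax is nearly uniform ("too diffuse"); a
    learnable `scale` lets the model sharpen it."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = scale * (qn @ kn.T)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v, probs

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
_, diffuse = qk_norm_attention(q, k, v, scale=1.0)   # no multiplier
_, sharp = qk_norm_attention(q, k, v, scale=12.0)    # sharpened attention
```

With `scale=1.0` every attention row is close to uniform; a larger multiplier concentrates each row's mass, which is the effect the tweet describes.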
Yilong Chen @Yichen4NLP
Yeah, totally — I think this is a very meaningful point. And honestly, it’s super valuable for me to get feedback from someone with real large-scale training experience. I’ve been thinking about these issues a lot too. I think you’re right here: if we try to share across the whole model, at least for now it actually seems worse than more local / smaller-scale sharing, just because the systems complexity gets too high. Really appreciate the suggestion overall. And please feel free to keep the ideas coming — I’d love to hear more.
biased estimator @selfattentive
Maybe. To be clear I don’t mean to detract from your work, I like it a lot. I just think at large enough param scale one has to move from combining the entire model’s experts in one layer toward combining chunks of layers’ experts in one layer. Still gets many of the benefits of a small scale implementation though.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
This reframing is almost as important as the paper itself. DeepSeek-MoE, the archetype of the modern MoE shape, was aimed at «Ultimate Expert Specialization». But that's per layer. In a MoUE with layer-independent routing, you can have true specialization, and interpretability.
Yilong Chen @Yichen4NLP

So we changed both the architecture and the objective. MoUE uses:
- a Staggered Rotational Topology to localize search,
- UELB to balance experts by exposure, not raw count,
- a lightweight Universal Router for coherent multi-step routing.

Yilong Chen @Yichen4NLP
One more thought: if sharing is on the table, infra may get a new optimization lever. On an existing cluster, better PP / expert placement is not only a throughput issue — it can also shape expert locality, and possibly how much specialization MoUE can extract in practice. So in some cases, colocating experts may help quality too, not just systems efficiency.
Yilong Chen @Yichen4NLP
@selfattentive @teortaxesTex Our intuition is that PP and experts operate on different axes. PP slices the model by depth, while experts slice it by function. As long as a PP rank contains multiple experts, routing can still produce specialization locally and you still get benefits from parameter sharing.
Yilong Chen @Yichen4NLP
Also, increasing expert count often means increasing node count, which has real cost both in training and inference. So another angle is: instead of constantly adding parameters and hardware, can we organize the existing parameters better so they specialize more effectively? If that works, you can get algorithmic gains without paying the full economic cost of scaling out.
Yilong Chen @Yichen4NLP
@selfattentive @teortaxesTex Another lever here is experts per node. If expert capacity is useful, you don’t necessarily have to scale EP across more nodes. You can also increase the number of experts within a node, which keeps routing more local and reduces cross-node traffic.
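The locality argument in the tweet above (more experts per node means a token's top-k expert set spans fewer nodes, so less cross-node traffic) can be checked with a quick Monte-Carlo sketch. This assumes uniform routing and contiguous expert placement; the function name and the numbers are illustrative, not from the paper.

```python
import numpy as np

def expected_nodes_touched(total_experts, experts_per_node, top_k,
                           trials=5000, seed=0):
    """Monte-Carlo estimate of how many nodes a token's top-k experts
    span, assuming uniform routing and contiguous expert-to-node packing."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(trials):
        chosen = rng.choice(total_experts, size=top_k, replace=False)
        counts.append(len(set(chosen // experts_per_node)))  # distinct nodes
    return float(np.mean(counts))

# Same 64 experts and top-8 routing; packing more experts per node keeps
# a token's expert set on fewer nodes, i.e. less cross-node traffic.
few_per_node = expected_nodes_touched(64, experts_per_node=4, top_k=8)
many_per_node = expected_nodes_touched(64, experts_per_node=16, top_k=8)
```

Under these assumptions the 16-experts-per-node layout touches strictly fewer nodes per token than the 4-per-node layout, which is the systems benefit the tweet points at.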
Yilong Chen @Yichen4NLP
@selfattentive @teortaxesTex We may also want to revisit the optimal tradeoff among depth, width, and expert count. Since some capacity can be covered by shared experts, we might allocate those parameters to more important parts of the model instead, such as the hidden dimension.
biased estimator @selfattentive
@teortaxesTex Okay, but for a big enough model you will need pipeline parallelism, no? At that point why bother with parameter sharing; it seems better to just get new params for each PP group.
Yilong Chen @Yichen4NLP
@strong_signal1 Yes, that's a feasible approach for downstream systems built on open-source models. For model developers, though, it might work even better, since they have access to the original pretraining data and can maintain a more consistent expert distribution.
strongsignal @strong_signal1
@Yichen4NLP Would a feasible extension be initializing experts using MLP/MoE blocks from already-trained models and then pretraining to unify the system? There are tons of open-source specialized models out there, which seems to fit.
Yilong Chen @Yichen4NLP
@swonzon @pmarca haha I also hope OpenAI ends up using it someday then maybe $MOUE will finally pump :)
Swan @swonzon
@pmarca just followed this stacked chinese quant (@Yichen4NLP) who is working to improve the Mixture-of-Experts (MoE) systems used in large AI models like transformers, which is sick. MoE works by dividing the model into specialized "experts" that only activate for relevant parts of the input, saving compute power. He has gotten compliments from huge accounts like @huggingface; this could be an eye opener for new compute systems. Here is his github: github.com/arxiv Sending fees to him!
Yilong Chen @Yichen4NLP
@teortaxesTex @kalomaze Great intuition! Stronger routers can actually benefit from these harder load-balancing regimes. We’ve also tried a few small router improvements (Universal Router and I'll try yours). At this point standard LBL is mostly solved — maybe it’s time to focus on the harder cases. 😂
Yilong Chen @Yichen4NLP
@Xinyu2ML if we could truly solve load balancing, an infinitely wide MoE-MLP might already be enough for AGI lol.
Yilong Chen @Yichen4NLP
@rudzinskimaciej @teortaxesTex Yes! I have tried this idea, and it has some effect, but the improvement is not particularly significant. The challenge lies in achieving balanced training while ensuring good utilization of both local and shared experts.
Rudzinski Maciej @rudzinskimaciej
And I forgot to mention the maybe-obvious fact that inverted layers are free in terms of memory, so the gain is much larger than just expressivity; similarly with looping. With looping it also bothered me: can we reuse the attention from the first pass to keep cost down? Last thing: looping has the nice property that more expressive neurons start to make more sense. More nonlinearity, addition, or some exotic operations make a ton more sense in the looped case, and the cost amortizes.
Yilong Chen @Yichen4NLP
@rudzinskimaciej @teortaxesTex What you said makes a lot of sense! Considering that this topology can have countless combinations, there is a lot of work to be done to study how to reduce the complexity of the algorithm's search space while achieving good results.
Rudzinski Maciej @rudzinskimaciej
@teortaxesTex Even nicer would be a set of global experts plus local ones, to make it easier for the router. This should naturally allow for looping in the model. With some effort, a meta-router for layer choice or reuse per token... oh, now it is getting interesting :D
Yilong Chen @Yichen4NLP
Thank you for your interest! CPT does indeed yield excellent MoUE results. In our experiments we achieved good results even without complex design or hyperparameter search (for example, universal expert selection was simply randomized). However, CPT requires a very fine-grained warmup to ensure that the routing does not collapse.
Yilong Chen @Yichen4NLP
MoEUT is a representative work combining MoE and UT, and I really like it! However, it mainly focuses on model-level recursion, which differs from the layer-level reuse problem that MoUE addresses. We will add a discussion of it to the paper. Thank you for the suggestion!
Yilong Chen @Yichen4NLP
The result is a useful scaling trade: instead of buying capacity mainly with more activated compute or more stored parameters, we can trade algorithmic structure for capacity by increasing global reusable experts and their recursive compositions. In practice:
- up to +1.3 avg from scratch, with no increase in activated or total params
- ~+2.5 in depth expansion
- up to +4.2 avg in checkpoint conversion / CPT
Our bet is that MoE may scale not only by adding more experts, but by making experts more reusable, modular, and globally composable. That is the direction behind MoUE.
Yilong Chen @Yichen4NLP
The UELB point is central. Under reuse, load balancing should not just be layer-local. It should reflect the computation graph. That gives a new depth-wise / topology-aware view of load balancing: balance experts relative to where they can be used, not how often they appear globally. This is a different optimization problem from standard MoE.
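The exposure idea above can be illustrated with a toy loss. To be clear, this is a guess at the concept described in the thread, not UELB's actual formulation: it only balances each expert's share of routed tokens against how often the topology makes that expert reachable, rather than against a uniform global count.

```python
import numpy as np

def exposure_balanced_loss(assignments, exposure):
    """Toy topology-aware balancing loss: penalize each expert's share of
    routed tokens relative to its share of reachable (layer, step) slots,
    instead of balancing raw counts globally."""
    load = assignments / assignments.sum()
    reach = exposure / exposure.sum()
    normalized = load / reach        # > 1 means over-used for its exposure
    return float(np.square(normalized - 1.0).mean())

# Expert 0 is reachable at twice as many positions (exposure 2 vs 1), so
# double the raw load on it is perfectly balanced per unit of exposure...
even = exposure_balanced_loss(np.array([200.0, 100.0, 100.0]),
                              np.array([2.0, 1.0, 1.0]))
# ...while equal raw loads actually leave expert 0 under-used.
uneven = exposure_balanced_loss(np.array([100.0, 100.0, 100.0]),
                                np.array([2.0, 1.0, 1.0]))
```

A layer-local balancing loss would call the second case balanced and penalize the first; weighting by exposure reverses that, which is the "different optimization problem" the tweet describes.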