Yupei Du

40 posts

Yupei Du
@YupeiDu

Postdoc at Saarland University. LLM reasoning.

Saarbrücken, Germany · Joined July 2017
712 Following · 92 Followers
Pinned Tweet
Yupei Du @YupeiDu
How do language models memorize noise while reasoning impressively well? Our #EMNLP2025 paper (poster, Nov 5, 11:00–12:30, Hall C) shows that memorization reuses the internal mechanisms of generalization, even when the two are unrelated to each other! arxiv.org/abs/2507.04782
1 reply · 5 reposts · 25 likes · 2K views
Yupei Du reposted
Yilong Chen @Yichen4NLP
We introduce MoUE, a new MoE paradigm that boosts base-model performance by up to 1.3 points from scratch and by up to 4.2 points on average, without increasing either activated or total parameters. The main idea is simple: a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE. arxiv.org/abs/2603.04971 huggingface.co/papers/2603.04… #MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining
6 replies · 16 reposts · 112 likes · 32.8K views
Yupei Du reposted
Albert Gu @_albertgu
one question we got a lot about H-Net is how it compares to MoE. the idea is that both can be seen as dynamic or sparse computation methods that adjust the FLOPs-to-parameter ratio (in H-Net, via the chunking ratio; in MoE, via the number of experts). in other words, for a fixed FLOP budget, both methods can increase parameter count by sparsely activating parameters only on some tokens. in the original paper, we compared them while matching both FLOPs *and* parameters and showed that H-Net >> MoE on byte-level language modeling.

of course, H-Nets can be applied to any data, so an open question remained: is H-Net still better than MoE when applied directly to standard tokens instead of bytes? this paper answers the question affirmatively: H-Net seems to still consistently outperform MoE in resource-matched settings! they show this for standard language modeling (on top of BPE tokens) as well as in multimodal settings (vision-language models). there are a lot of other interesting ablation results inside the architecture here.

the results are cool, but the weirdest part of this paper is how hard it tried to avoid stating plainly what they did: it's literally H-Net on tokens. i think being more transparent would have helped rather than diminished the paper's results, by making what they did more accessible to the community, whereas the way it is written is a bit confusing 🤷‍♂️
7 replies · 24 reposts · 200 likes · 12.5K views
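The FLOPs-to-parameter trade-off described above can be sketched with back-of-envelope arithmetic. The sizes below are hypothetical and not taken from either paper; they only illustrate how top-1 routing in an MoE FFN multiplies parameters while leaving per-token FLOPs fixed:

```python
# Toy illustration (hypothetical sizes): a 4x-expansion FFN layer,
# compared as a dense block vs. a top-1-of-8 MoE block.

def ffn_params(d, experts=1):
    """Parameters of `experts` copies of a 4x-expansion FFN (two matmuls)."""
    return experts * 2 * d * 4 * d

def ffn_flops_per_token(d, active_experts=1):
    """~2 FLOPs per weight actually used on a given token."""
    return active_experts * 2 * (2 * d * 4 * d)

d = 1024
dense_p, dense_f = ffn_params(d), ffn_flops_per_token(d)
# MoE: 8 experts total, 1 routed per token -> same FLOPs, 8x parameters.
moe_p, moe_f = ffn_params(d, experts=8), ffn_flops_per_token(d, active_experts=1)

assert moe_f == dense_f            # FLOP budget is matched
print(moe_p / dense_p)             # → 8.0: parameters grew 8x at fixed FLOPs
```

In H-Net the analogous knob is the chunking ratio rather than the expert count, but the same accounting applies: fewer activations per parameter at a fixed compute budget.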
Yupei Du reposted
Karan Dalal @karansdalal
LLM memory is considered one of the hardest problems in AI. All we have today are endless hacks and workarounds. But the root solution has always been right in front of us. Next-token prediction is already an effective compressor. We don’t need a radical new architecture. The missing piece is to continue training the model at test time, using context as training data.

Our full release of End-to-End Test-Time Training (TTT-E2E) with @NVIDIAAI, @AsteraInstitute, and @StanfordAILab is now available.
Blog: nvda.ws/4syfyMN
Arxiv: arxiv.org/abs/2512.23675

This has been over a year in the making with @arnuvtandon and an incredible team.
90 replies · 324 reposts · 2.1K likes · 568.1K views
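The core idea above, "continue training the model at test time, using context as training data", can be sketched in a few lines. This is a toy illustration under stated assumptions, not the TTT-E2E implementation: a linear bigram next-token predictor whose weights are updated by SGD on the test-time context itself, so the model compresses its own context better before it is used:

```python
import numpy as np

# Minimal test-time-training sketch (hypothetical toy model):
# a linear bigram predictor, logits[next] = W[cur], trained by SGD
# on the context sequence at inference time.

VOCAB = 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(VOCAB, VOCAB))

def nll(W, tokens):
    """Average next-token negative log-likelihood over a sequence."""
    total = 0.0
    for cur, nxt in zip(tokens, tokens[1:]):
        logits = W[cur]
        logp = logits - np.log(np.exp(logits).sum())
        total -= logp[nxt]
    return total / (len(tokens) - 1)

def ttt_step(W, tokens, lr=0.5):
    """One SGD step on the context (softmax cross-entropy gradient)."""
    grad = np.zeros_like(W)
    for cur, nxt in zip(tokens, tokens[1:]):
        p = np.exp(W[cur]); p /= p.sum()
        p[nxt] -= 1.0                 # d(nll)/d(logits)
        grad[cur] += p
    return W - lr * grad / (len(tokens) - 1)

context = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]
before = nll(W, context)              # ~log(VOCAB) for a near-random model
for _ in range(50):
    W = ttt_step(W, context)          # "context as training data"
after = nll(W, context)
print(before, "->", after)            # loss on the context drops sharply
```

The same loop viewed as compression: lower next-token NLL on the context means fewer bits needed to encode it, which is the "next-token prediction is already an effective compressor" framing in the tweet.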
Yupei Du reposted
Shashwat Goel @ShashwatGoel7
Excited to announce the OpenForecaster project: we train models to reason about and predict the future. We won't get to AGI by maxxing STEM exam and coding benchmarks. That's not what most humans reason about in their day to day. Instead, we reason about uncertainty to make decisions, using our world-model of how society evolves. Yet there weren't any large-scale datasets to train AI for this form of reasoning. Until now.

We release OpenForesight, a training dataset of 52k forecasting questions, made from global news. Our recipe is fully automated and can be repeated for more, newer data at low cost. Using it, we RL-trained an 8B model, and it became competitive with much larger models like GPT-OSS-120B across benchmarks and metrics. And we want to keep building on this, in public.

Our paper with full details, dataset, code etc. in 🧵
Blog: openforecaster.github.io
20 replies · 53 reposts · 382 likes · 40.5K views
Yupei Du reposted
Songlin Yang @SonglinYang4
the residual stream should be viewed as a recurrence, and insights from the RNN literature should apply here
Tianyuan Zhang @tianyuanzhang99

mHC puts a lot of effort into training stability. In some respects, stable backprop through depth is similar to stable backprop through time (BPTT) for modern RNNs. Many RNNs can be written as S_{t+1} = Gate @ S_t + f(S_t), similar to mHC: x_{t+1} = H @ x_t + f(x_t). Backprop for both involves cumulative matmuls, whose eigenvalues might explode or vanish.

In RNNs, common stable parametrizations of the gate include:
1. Decay gate: a diagonal or scalar gate with values between 0 and 1. Used by RetNet and Mamba2.
2. Identity: same as the original residual connection.
3. Householder matrix: used by DeltaNet (if beta=2), a type of orthogonal matrix with singular values of 1, so the cumulative matmuls are also orthogonal.

mHC uses doubly stochastic matrices, and the cumulative matmuls also yield a doubly stochastic matrix. Interestingly, these design spaces for residual connections and RNNs might be shared and influence each other. A trickier point is that stability might not always mean effectiveness.

8 replies · 15 reposts · 209 likes · 21.6K views
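The stability argument above can be checked numerically: products of orthogonal (e.g. Householder) matrices stay orthogonal, and products of doubly stochastic matrices stay doubly stochastic, so the cumulative matmuls that appear in backprop neither explode nor vanish. A toy sketch, with 4x4 matrices that are illustrative only (not any model's actual gates), and Sinkhorn normalization as one hypothetical way to sample a doubly stochastic matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def householder(v):
    """Reflection I - 2 vv^T / (v^T v): orthogonal, all singular values 1."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def doubly_stochastic(rng, n, iters=300):
    """Random doubly stochastic matrix via Sinkhorn normalization:
    alternately rescale rows and columns to sum to 1."""
    M = rng.random((n, n)) + 1e-3
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M

# Cumulative product over "depth", as in backprop through the residual stream.
P_orth = np.eye(n)
P_ds = np.eye(n)
for _ in range(100):
    P_orth = householder(rng.normal(size=n)) @ P_orth
    P_ds = doubly_stochastic(rng, n) @ P_ds

# Orthogonal case: after 100 layers, every singular value is still 1.
print(np.linalg.svd(P_orth, compute_uv=False))
# Doubly stochastic case: row/column sums stay 1, so the top singular
# value is pinned at 1 and the product cannot blow up.
print(P_ds.sum(axis=0), P_ds.sum(axis=1))
```

Contrast this with an unconstrained gate: a product of 100 random Gaussian matrices would typically explode or vanish exponentially in depth, which is exactly the failure mode these parametrizations rule out.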
Yupei Du reposted
Zeyuan Allen-Zhu, Sc.D. @ZeyuanAllenZhu
(1/N)🚀Today we launch two tightly connected milestones in the Physics of LM series: a sharpened Part 4.1 (v2.0) and a brand new Part 4.2 — together forming a clear, reproducible, textbook-style reference for principled architecture research.

Part 4.1 introduced a synthetic pretraining playground — our Galileo experiment for LLMs🍎. Our v2.0 strengthens it with Gated DeltaNet (GDN) and stricter alignment, building an even cleaner “Pisa tower” for testing architectural limits.

Part 4.2 shows these synthetic predictions resonate in reality 🌍 — across 1–8B / 1T-token pretraining — confirming which design principles actually matter. Together, Parts 4.1 and 4.2 bring the synthetic and real worlds into surprising agreement 🤝 — one more step toward a more scientific understanding of LLM architectures.

If you’re curious about:
🧠 why some models reason deeper
⚙️ why linear models struggle at retrieval
🎶 why a tiny horizontal mixer (Canon) changes everything
… this release ties it all together. (Links at the end)
23 replies · 159 reposts · 1K likes · 190.2K views
Yupei Du reposted
Zachary Horvitz @zachary_horvitz
✨Masked Diffusion Language Models✨ are great for reasoning, but not just for the reasons you think! Fast parallel decoding? 🤔 Any-order decoding? 🤨 Plot twist: MDLMs offer A LOT MORE for inference and post-training! 🎢🧵
4 replies · 37 reposts · 163 likes · 20.9K views
Yupei Du reposted
Ronak Malde @rronak_
My takeaways from NeurIPS:
1. Continual learning. To support this next frontier, we’re going to need new architectures, new reward functions, new data sources, and new revenue models.
2. Neolabs. Frontier research for risky bets is being shared across multiple companies now.
3. San Diego has way better weather than SF 😭
27 replies · 36 reposts · 913 likes · 292.5K views