Zengzhi Wang (@SinclairWang1) - Twitter Profile

Pinned Tweet

Zengzhi Wang@SinclairWang1·26 Jun

What Makes a Base Language Model Suitable for RL?

Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL’s calm surface all thanks to Pre/Mid-training carrying the weight?

Why does RL on LLaMA consistently underperform compared to Qwen? What makes a base model truly ready for RL scaling? What are the secrets under the hood?

Due to the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.

💡Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but it depends on task similarity.
- Instruction data boosts QA’s effectiveness.
- More mid-training improves RL performance.

Armed with these insights, we apply a two-stage (stable+decay) mid-training strategy on LLaMA, scaling up to 200B tokens, and RL performance on LLaMA now matches Qwen!

To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face, so stay tuned! 📦 huggingface.co/datasets/OctoT…

Full construction details are in the paper; we hope it’s useful! arxiv.org/abs/2506.20512…

Getting SOTA with a strong foundation is great 🤩, but understanding the foundation (the know-how) matters just as much. Hope this analysis inspires the community, and feel free to cite us if it helps!

This work would be impossible without all the brilliant co-authors @FaZhou_998 @xuefengli0301 @stefan_fee !!!
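
For readers unfamiliar with the "stable+decay" wording in the pinned tweet, here is a minimal sketch of one common reading: hold the learning rate constant for a stable stage, then anneal it in a decay stage. The tweet does not give the schedule shape, step counts, or learning rates, so the cosine decay, the function name `stable_decay_lr`, and every number below are illustrative assumptions, not the paper's settings.

```python
import math

def stable_decay_lr(step, stable_steps, decay_steps, peak_lr, final_lr):
    """Hypothetical two-stage schedule: constant LR, then cosine decay.

    All arguments are illustrative; the paper's actual values are not
    stated in the tweet.
    """
    if step < stable_steps:
        # Stage 1 (stable): keep the learning rate at its peak value.
        return peak_lr
    # Stage 2 (decay): fraction of the decay stage completed, clamped to [0, 1].
    progress = min(1.0, (step - stable_steps) / decay_steps)
    # Cosine interpolation from peak_lr (progress=0) to final_lr (progress=1).
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Example with assumed values: 800 stable steps followed by 200 decay steps.
schedule = [stable_decay_lr(s, 800, 200, 3e-4, 3e-5) for s in (0, 799, 900, 1000)]
```

In this sketch the first two entries of `schedule` sit at the peak rate, the third is midway between peak and final, and the last has reached the final rate.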

10

86

513

93.1K

Zengzhi Wang@SinclairWang1·1h

@chenhao_chao the gif is so cool.

0

49

Chen-Hao (Lance) Chao@chenhao_chao·14h

(1/7) We introduce MDM-Prime-v2 which scales 21.8× better than autoregressive models (ARMs) in compute-optimal comparisons. 📎 Paper: arxiv.org/abs/2603.16077 🌟 Blog: chen-hao-chao.github.io/mdm-prime-v2 ⌨️ Github: github.com/chen-hao-chao/… Here’s how we did it👇:

GIF

6

28

164

12.2K

Zengzhi Wang@SinclairWang1·1h

@code_star @lukemerrick_ indeed

0

43

Cody Blakeney@code_star·7h

Found another great midtraining paper. I haven't seen it on my TL so thought I would share. Super excited to dig into it later but looks really promising. (ty @lukemerrick_ ) I love seeing more work unifying understanding of midtraining -> RL

3

18

122

5.7K

Zengzhi Wang retweeted

Patrick Pynadath@PatrickPyn35903·3d

New blog post with @thjashin and @ruqi_zhang! Minor entropy differences can completely flip model rankings in generative perplexity — a direct consequence of both metrics being components of KL divergence. We discuss what this means for model comparison. patrickpynadath1.github.io/blog/eval_meth…
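
The claim that both metrics are components of KL divergence can be checked on a toy example. For a model distribution q and a reference distribution p, KL(q‖p) = CE(q, p) − H(q), where CE(q, p) is the log of generative perplexity (samples from q scored under p) and H(q) is the model's entropy. The sketch below is not taken from the blog post: the three-outcome distributions and all names are assumptions, and the entropy gap is deliberately exaggerated to make the ranking flip easy to see.

```python
import math

def entropy(q):
    """Shannon entropy H(q) in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """E_q[-log p]; its exponential is the generative perplexity of q under p."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def kl(q, p):
    """KL(q || p), computed directly from its definition."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical reference distribution and two hypothetical models.
ref = [0.5, 0.3, 0.2]
model_a = [0.9, 0.05, 0.05]   # low entropy: concentrates on the most likely outcome
model_b = [0.5, 0.3, 0.2]     # matches the reference exactly

for name, q in (("A", model_a), ("B", model_b)):
    gen_ppl = math.exp(cross_entropy(q, ref))
    print(name, "gen PPL:", gen_ppl, "entropy:", entropy(q), "KL:", kl(q, ref))
```

Here model A wins on generative perplexity alone, yet model B is closer to the reference in KL once entropy is accounted for, which is the kind of ranking flip the tweet describes.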

GIF

4

14

78

23.9K

Zengzhi Wang retweeted

Alexander Doria@Dorialexander·4d

"Synthetic pretraining is the way frontier models are built" — by @fujikanaeda

Maarten Van Segbroeck@mvansegb

@inductionheads Spot on. We actually just gave a guest lecture at Berkeley EECS on this exact dynamic (L11: Synthetic Data Powering Pre-Training). @fujikanaeda Here are our slides if anyone wants to go down the rabbit hole: scalable-ai.eecs.berkeley.edu/assets/lecture…

5

38

498

45K

Zengzhi Wang retweeted

Maarten Van Segbroeck@mvansegb·4d

@inductionheads Spot on. We actually just gave a guest lecture at Berkeley EECS on this exact dynamic (L11: Synthetic Data Powering Pre-Training). @fujikanaeda Here are our slides if anyone wants to go down the rabbit hole: scalable-ai.eecs.berkeley.edu/assets/lecture…

4

20

166

52.9K

Zengzhi Wang retweeted

elie@eliebakouch·3d

visual summary of attention residuals by kimi, beautiful paper

Kimi.ai@Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…
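
As a rough illustration of "attention over preceding layers" versus uniform residual accumulation, the toy sketch below mixes earlier layer outputs with softmax weights instead of summing them equally. This is not the formulation from the Kimi report, which the tweet does not spell out: the dot-product scoring, the per-layer `query` vector, and the function names are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_residual(prev_outputs, query):
    """Mix earlier layer outputs with input-dependent weights (hypothetical form).

    prev_outputs: list of d-dimensional vectors, one per preceding layer.
    query: d-dimensional vector standing in for a learned per-layer parameter.
    """
    d = len(query)
    # Score each preceding layer by a scaled dot product with the query.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) / math.sqrt(d)
              for h in prev_outputs]
    weights = softmax(scores)
    # Weighted combination replaces the uniform sum of a standard residual stream.
    mixed = [sum(w * h[j] for w, h in zip(weights, prev_outputs)) for j in range(d)]
    return mixed, weights

prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_mix, uniform_w = attention_residual(prev, [0.0, 0.0])    # equal weights
selective_mix, selective_w = attention_residual(prev, [10.0, 0.0])
```

With a zero query every preceding layer gets the same weight, so the result is just their average; a non-zero query lets the layer down-weight representations it does not need, which is the selective retrieval the tweet alludes to.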

7

62

714

46.4K

Zengzhi Wang@SinclairWang1·3d

@latkins @Kimi_Moonshot lol😂🤣

0

82

Lucas Atkins@latkins·4d

This level of drop on a Sunday night is so annoying because we have so many p0 tasks to do on Monday morning but everyone is going to want to replicate this instead. Seriously uncool @Kimi_Moonshot (congrats and thank you for sharing, jokes aside 😁)

Kimi.ai@Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

5

4

133

12.2K

Zengzhi Wang retweeted

Kimi.ai@Kimi_Moonshot·4d

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

326

2K

13.4K

4.8M

Zengzhi Wang retweeted

Karina Nguyen@karinanguyen·11 Mar

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:

43

90

659

135.7K

Zengzhi Wang retweeted

Jessica Chudnovsky@jchudnov·11 Mar

Your deduplication pipeline was built for small models. At scale, it's broken. New preprint: "Scale Dependent Data Duplication" 1/10

6

28

113

24.7K

Zengzhi Wang retweeted

Joël Niklaus@joelniklaus·8 Mar

Introducing the Synthetic Data Playbook: We generated over 1T tokens in 90 experiments with 100k+ GPUh to figure out what makes good synthetic data and how to generate it at scale huggingface.co/spaces/Hugging…

28

215

1.4K

118.9K

Zengzhi Wang retweeted

Lewis Tunstall@_lewtun·15 Feb

We trained a tiny 4B model to reason for millions of tokens through IMO-level problems. Heaps excited to share our new blog post covering the full pipeline, from distilling the 🐳 to augmenting RL with a reasoning cache that unlocks extreme inference-time scaling for theorem proving. huggingface.co/spaces/lm-prov…

24

131

829

159.4K

Zengzhi Wang retweeted

Jonathan Frankle@jefrankle·5 Mar

Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵

9

46

241

67.7K

Zengzhi Wang retweeted

Rosinality@rosinality·4 Mar

RAE, MoE, Unified multimodality, and scaling law. Vision is data hungry, and ironically MoE reduces this optimal scaling gap between vision and language.

4

18

188

10.6K

Zengzhi Wang@SinclairWang1·4 Mar

@stingning yes

0

97

Ning Ding@stingning·4 Mar

Today I heard a line that stuck with me: "the real moat is the organizational structure."

4

5

51

6.4K

Zengzhi Wang retweeted

Ai2@allen_ai·3 Mar

📢 Update: the Molmo 2 codebase is now open source. We're releasing the code behind Molmo 2—our open model family for video & image understanding, pointing, tracking, & more. Now you can easily train Molmo 2 on your own data. 🧵

6

51

364

30.8K

Junyang Lin@JustinLin610·3 Mar

me stepping down. bye my beloved qwen.

1.7K

741

13.6K

6.5M

Zengzhi Wang@SinclairWang1·3 Mar

@JustinLin610 OMG😱

0

4

1.1K

Zengzhi Wang retweeted

Alex Wa@_djdumpling·18 Feb

new blog! What methodologies do labs use to train frontier models? The blog distills 7 open-weight model reports from frontier labs, covering architecture, stability, optimizers, data curation, pre/mid/post-training + RL, and behaviors/safety djdumpling.github.io/2026/01/31/fro…

34

287

2K

279.4K

Zengzhi Wang retweeted

Huaqing Zhang@zhqwqwq·18 Feb

🚀Introducing our new work: Configuration-to-Performance Scaling Law with Neural Ansatz. A language model trained on large-scale pretraining logs can accurately predict how training configurations influence pretraining performance and generalize to runs with 10x more compute.

15

31

162

45.1K
