Zengzhi Wang

1.7K posts


@SinclairWang1

PhDing @sjtu1896 #NLProc. Working on Pre-training Data Engineering for LLMs: MathPile (2023), 🫐 ProX (2024), 💎 MegaMath (2025), 🐙 OctoThinker (2025)

Joined November 2020
2.8K Following · 2.6K Followers
Pinned Tweet
Zengzhi Wang @SinclairWang1
What Makes a Base Language Model Suitable for RL?

Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries":
(1) Is the magic only happening on Qwen + math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL's calm surface all thanks to pre-/mid-training carrying the weight?

Why does RL on LLaMA consistently underperform compared to Qwen? What makes a base model truly ready for RL scaling? What are the secrets under the hood?

Due to the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.

💡 Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but its benefit depends on task similarity.
- Instruction data boosts QA's effectiveness.
- More mid-training improves RL performance.

Armed with these insights, we apply a two-stage (stable + decay) mid-training strategy on LLaMA, scaling up to 200B tokens, and RL performance on LLaMA now matches Qwen!

To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face; stay tuned! 📦 huggingface.co/datasets/OctoT…

Full construction details are in the paper; we hope it's useful! arxiv.org/abs/2506.20512…

Getting SOTA with a strong foundation is great 🤩, but understanding the foundation (the know-how) matters just as much. Hope this analysis inspires the community, and feel free to cite us if it helps!

This work would be impossible without all the brilliant co-authors @FaZhou_998 @xuefengli0301 @stefan_fee!!!
[4 attached images]
10 replies · 86 reposts · 513 likes · 93.1K views
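The "two-stage (stable + decay)" recipe above belongs to the warmup-stable-decay family of learning-rate schedules: hold a constant rate for most of the mid-training token budget, then anneal at the end. Here is a minimal sketch in Python; the function name and all hyperparameters (peak_lr, min_lr, warmup_steps, decay_frac) are illustrative assumptions, not values from the paper.

```python
import math

def stable_decay_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
                    warmup_steps=500, decay_frac=0.2):
    """Two-stage mid-training LR: short warmup, long stable plateau, final decay."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # brief linear warmup into the stable stage
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # stage 1 (stable): constant LR for most of the token budget
        return peak_lr
    # stage 2 (decay): cosine anneal from peak_lr down to min_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# LR at a few points of a 100k-step run: warmup, plateau, decay tail
for s in (0, 250, 50_000, 90_000, 100_000):
    print(s, f"{stable_decay_lr(s, 100_000):.2e}")
```

One practical appeal of this shape for controlled mid-training experiments: checkpoints from the stable stage can each be branched into their own decay stage, so many data-mixture variants can share a single long stable run.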
Cody Blakeney @code_star
Found another great midtraining paper. I hadn't seen it on my TL, so I thought I would share. Super excited to dig into it later, but it looks really promising. (ty @lukemerrick_) I love seeing more work unifying understanding of midtraining -> RL
[4 attached images]
3 replies · 18 reposts · 122 likes · 5.7K views
Zengzhi Wang reposted
Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
[attached image]
326 replies · 2K reposts · 13.4K likes · 4.8M views
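To make "learned, input-dependent attention over preceding layers" concrete, here is a minimal PyTorch-style sketch of the core idea: replace the uniform residual sum with a softmax-weighted mix over the stack of earlier hidden states. This is a guess at the general shape, not Kimi's implementation; the module name AttnResidual and the single-projection scoring are assumptions, and the scalable Block AttnRes variant is described only in the linked report.

```python
import torch
import torch.nn as nn

class AttnResidual(nn.Module):
    """Sketch: output = f(x) + attention-weighted mix of all preceding layer
    outputs, instead of the standard fixed sum x_l = x_{l-1} + f(x_{l-1})."""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, history: list[torch.Tensor], f_out: torch.Tensor) -> torch.Tensor:
        # history: hidden states of preceding layers, each (batch, seq, d_model)
        h = torch.stack(history, dim=2)                  # (B, S, L, D)
        q = self.query(f_out).unsqueeze(2)               # (B, S, 1, D)
        k = self.key(h)                                  # (B, S, L, D)
        scores = (q * k).sum(-1) / h.size(-1) ** 0.5     # (B, S, L)
        w = torch.softmax(scores, dim=-1).unsqueeze(-1)  # input-dependent weights
        context = (w * h).sum(dim=2)                     # selective retrieval of past layers
        return f_out + context                           # replaces the plain residual add
```

Because the softmax can concentrate mass on a few informative layers, this gets at the dilution problem the tweet mentions: under plain accumulation, early contributions are progressively washed out as depth grows.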
Zengzhi Wang reposted
Karina Nguyen @karinanguyen
Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:
43 replies · 90 reposts · 659 likes · 135.7K views
Zengzhi Wang reposted
Jessica Chudnovsky @jchudnov
Your deduplication pipeline was built for small models. At scale, it's broken. New preprint: "Scale Dependent Data Duplication" 1/10
[attached image]
6 replies · 28 reposts · 113 likes · 24.7K views
Zengzhi Wang reposted
Joël Niklaus @joelniklaus
Introducing the Synthetic Data Playbook: we generated over 1T tokens across 90 experiments with 100k+ GPU-hours to figure out what makes good synthetic data and how to generate it at scale. huggingface.co/spaces/Hugging…
[attached image]
28 replies · 215 reposts · 1.4K likes · 118.9K views
Zengzhi Wang reposted
Lewis Tunstall @_lewtun
We trained a tiny 4B model to reason for millions of tokens through IMO-level problems. Heaps excited to share our new blog post covering the full pipeline, from distilling the 🐳 to augmenting RL with a reasoning cache that unlocks extreme inference-time scaling for theorem proving. huggingface.co/spaces/lm-prov…
[attached image]
24 replies · 131 reposts · 829 likes · 159.4K views
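The "reasoning cache" is the mechanism behind that inference-time scaling claim: rather than re-deriving the same intermediate results across millions of tokens of search, verified fragments are stored and reused. A hypothetical sketch of the idea; the class name, keying scheme, and verified flag are all assumptions on my part, and the actual design is in the blog post.

```python
import hashlib

class ReasoningCache:
    """Memoize checker-verified proof fragments keyed by a normalized statement,
    so repeated inference-time search can reuse them instead of re-deriving."""
    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(statement: str) -> str:
        # normalize whitespace so trivially different phrasings share an entry
        return hashlib.sha256(" ".join(statement.split()).encode()).hexdigest()

    def lookup(self, statement: str) -> str | None:
        return self._store.get(self._key(statement))

    def add(self, statement: str, proof: str, verified: bool) -> None:
        if verified:  # only cache fragments that passed the proof checker
            self._store[self._key(statement)] = proof
```

A cache like this only pays off in domains with a verifier, such as theorem proving; caching unchecked generations would propagate errors instead of amortizing work.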
Zengzhi Wang reposted
Jonathan Frankle @jefrankle
Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵
[2 attached images]
9 replies · 46 reposts · 241 likes · 67.7K views
Zengzhi Wang reposted
Rosinality @rosinality
RAE, MoE, unified multimodality, and scaling laws. Vision is data-hungry, and ironically MoE reduces the optimal scaling gap between vision and language.
[attached image]
4 replies · 18 reposts · 188 likes · 10.6K views
Ning Ding @stingning
Today I heard a line that stuck with me: "the real moat is the organizational structure."
4 replies · 5 reposts · 51 likes · 6.4K views
Zengzhi Wang reposted
Ai2 @allen_ai
📢 Update: the Molmo 2 codebase is now open source. We're releasing the code behind Molmo 2—our open model family for video & image understanding, pointing, tracking, & more. Now you can easily train Molmo 2 on your own data. 🧵
[attached image]
6 replies · 51 reposts · 364 likes · 30.8K views
Junyang Lin @JustinLin610
me stepping down. bye my beloved qwen.
1.7K replies · 741 reposts · 13.6K likes · 6.5M views
Zengzhi Wang reposted
Alex Wa @_djdumpling
new blog! What methodologies do labs use to train frontier models? The blog distills 7 open-weight model reports from frontier labs, covering architecture, stability, optimizers, data curation, pre/mid/post-training + RL, and behaviors/safety djdumpling.github.io/2026/01/31/fro…
[attached image]
34 replies · 287 reposts · 2K likes · 279.4K views
Zengzhi Wang reposted
Huaqing Zhang @zhqwqwq
🚀 Introducing our new work: Configuration-to-Performance Scaling Law with Neural Ansatz. A language model trained on large-scale pretraining logs can accurately predict how training configurations influence pretraining performance and generalize to runs with 10x more compute.
[attached image]
15 replies · 31 reposts · 162 likes · 45.1K views
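For context on what a configuration-to-performance predictor does: the classical baseline is to fit a hand-written ansatz, such as the Chinchilla form L(N, D) = E + A/N^α + B/D^β, to logged runs and extrapolate. The tweet's neural ansatz replaces the hand-written form with a model over full training configurations; the sketch below shows only the classical baseline, and all log data in it is made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def ansatz(x, E, A, alpha, B, beta):
    # Chinchilla-style form: L(N, D) = E + A / N^alpha + B / D^beta
    N, D = x
    return E + A / N**alpha + B / D**beta

# hypothetical pretraining logs: (params N, tokens D) -> final loss
N = np.array([1e8, 3e8, 1e9, 3e9, 6e9, 1e10])
D = np.array([2e9, 6e9, 2e10, 6e10, 1.2e11, 2e11])
loss = np.array([3.45, 3.10, 2.78, 2.52, 2.40, 2.30])

popt, _ = curve_fit(ansatz, (N, D), loss,
                    p0=[1.7, 400.0, 0.3, 400.0, 0.3], maxfev=20_000)

# extrapolate to a run with roughly 10x more compute
print("predicted loss:", ansatz((3e10, 6e11), *popt))
```

The hand-written form sees only two scalars (N, D); the appeal of a learned ansatz is that it can also condition on the rest of the configuration, such as batch size, schedule, and data mix, which this baseline ignores.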