Zengzhi Wang

1.7K posts

@SinclairWang1

PhDing @sjtu1896. Working on Pre-training Data Engineering for Foundation Models: MathPile (2023), 🫐 ProX (2024), 💎 MegaMath (2025), 🐙 OctoThinker (2025)

Joined November 2020
2.8K Following · 2.6K Followers
Pinned Tweet
Zengzhi Wang @SinclairWang1
What Makes a Base Language Model Suitable for RL?

Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL's calm surface all thanks to pre/mid-training carrying the weight?

Why does RL on LLaMA consistently underperform compared to Qwen? What makes a base model truly ready for RL scaling? What are the secrets under the hood?

Given the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.

💡 Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but its value depends on task similarity.
- Instruction data boosts QA's effectiveness.
- More mid-training improves RL performance.

Armed with these insights, we apply a two-stage (stable + decay) mid-training strategy on LLaMA, scaling up to 200B tokens, and RL performance on LLaMA now matches Qwen!

To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face; stay tuned! 📦 huggingface.co/datasets/OctoT…

Full construction details are in the paper; we hope it's useful! arxiv.org/abs/2506.20512…

Getting SOTA with a strong foundation is great 🤩, but understanding the foundation (the know-how) matters just as much. Hope this analysis inspires the community, and feel free to cite us if it helps!

This work would be impossible without all the brilliant co-authors @FaZhou_998 @xuefengli0301 @stefan_fee!!!
[4 images attached]
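For readers who want to see the mechanics, here is a minimal sketch of a two-stage (stable + decay) learning-rate schedule of the kind the thread describes, assuming a flat stable stage and a linear decay tail; the peak LR, decay fraction, and tokens-per-step figures are hypothetical placeholders, not values from the paper.

```python
# A minimal stable + decay schedule sketch. All constants are
# illustrative assumptions, not taken from the OctoThinker paper.

def stable_decay_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
                    decay_frac: float = 0.2, min_lr: float = 3e-5) -> float:
    """Hold the LR constant for the 'stable' stage, then decay
    linearly to min_lr over the final decay_frac of training."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < decay_start:
        return peak_lr  # stable stage: flat LR
    # decay stage: linear ramp from peak_lr down to min_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * progress

# e.g. a 200B-token budget at a hypothetical 4M tokens/step is 50,000 steps
for s in (0, 39_999, 40_000, 45_000, 49_999):
    print(s, f"{stable_decay_lr(s, 50_000):.2e}")
```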
Zengzhi Wang reposted
Zengzhi Wang @SinclairWang1
It's time to change my work philosophy after witnessing the remarkable productivity of the frontier flagship model.

Past: I needed to prepare many useful tools for myself to improve efficiency in my workspace.

Now: I need to create a seamless workspace for AI, enabling it to interact with various tools, engines, and access permissions. It could boost my efficiency by at least 2 to 5 times. More importantly, the saved time lets me think from a global perspective, spares me a lot of manual labor, and even leaves room for a coffee along the way.

While I recognize that this change might be a bit late, I'm glad to know it's still not too late to adapt. 🚀🚀🚀
Zengzhi Wang reposted
himanshu @himanshustwts
The Arcee AI Podcast is here! In this episode, @latkins and @stochasticchasm join us to discuss the story of Trinity models and everything frontier. I can say this talk has been one of the most amazing and technical conversations we've had on Ground Zero.

0:00:00 - Intro
0:00:59 - Varun's transition from SWE to Pre-Training Lead
0:04:20 - Trinity Manifesto, Openclaw Ecosystem
0:12:15 - Arcee's Post-Training to Pre-Training Pivot
0:23:45 - Varun's first Pre-Training run (you can just do things!)
0:27:33 - Saturation in Pre-Training?, Mid-Training
0:37:00 - Tweaking the Training Architecture, Adam vs Muon, Evals
01:09:07 - Inference Engineering, Quick Fire, Post-Training Recipe
01:18:02 - Alpha in RL Envs, Harness Design
01:23:00 - American Open Source is trailing Chinese Competitors, Trinity Adoption
01:29:25 - Hiring at Arcee, Advice to 20yo
Zengzhi Wang reposted
Pengfei Liu @stefan_fee
Aha, thank you for the kind words! We're exploring what "frontier lab" means in academia, through democratizing cognition and embracing "less is more" & "simple is powerful".

Recent releases:
- Agentic intelligence: davinci-dev, davinci-agency, davinci-env
- Open foundation models: davinci-llm, davinci-magihuman
- Data efficiency: (lima) limo, limr, limi
- Benchmarks: agencybench, researcherbench, innovatorbench ...
- Data Darwinism Part I, Part II
- Interaction as Intelligence: Part I, Part II
- Engineering: prompt engineering, cognition engineering, context engineering 2.0

More at: scholar.google.com/citations?hl=e…

Our North Star: using AI technology to make life better for people around us. Would love to exchange ideas if any of these interest you!
Quoting CLS ✈️ ICLR'26 @ChengleiSi:

@stefan_fee is running a mini frontier lab in academia 🤯

Zengzhi Wang reposted
Greg Durrett @gregd_nlp
@beirmug @COLM_conf Hi Nandan, I believe this has not been used at COLM due to labels quickly going out of date with changes in the field (see: poster session topics at *ML confs, tracks at ACL). We expect automated paper matching will provide good reviewer fits (& continue to improve with time!).
Conference on Language Modeling
~45 hours until the abstract deadline! Submit abstracts on OpenReview by 3/26 11:59pm AOE, full papers 3/31. Final reminders & instructions for COLM are below (link in thread). Note that as of the March 31 deadline, papers must not be under review for ICML or committed to ACL.
[Image attached]
Zengzhi Wang reposted
Pengfei Liu @stefan_fee
Seedance 2.0 is impressive. But it's closed-source!

Introducing our daVinci-MagiHuman: a single-stream 15B Transformer trained from scratch that jointly generates video + audio. No cross-attention. No multi-stream branches. Just self-attention.

⚡ 5s 1080p video in 38s on a single H100
🏆 80% win rate vs Ovi 1.1 | 60.9% vs LTX 2.3 (2,000 human comparisons)
🌍 6 languages
📦 Fully open-source

Speed by simplicity. By @SII_GAIR × @SandAI_HQ

📄 arxiv.org/abs/2603.21986
💻 github.com/GAIR-NLP/daVin…
🤗 huggingface.co/spaces/SII-GAI…
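As a rough illustration of the single-stream idea described above (one self-attention stack over both modalities, with no cross-attention branches), here is a toy sketch; the token counts, dimensions, and the use of nn.TransformerEncoder are assumptions for illustration, not the daVinci-MagiHuman architecture.

```python
# Toy single-stream sketch: video and audio tokens share one sequence,
# distinguished only by modality embeddings, so a single self-attention
# stack models both jointly. Shapes and dims are placeholders.
import torch
import torch.nn as nn

d = 256
video_tok = torch.randn(1, 100, d)   # e.g. patchified video latents
audio_tok = torch.randn(1, 40, d)    # e.g. audio codec latents
modality = nn.Embedding(2, d)        # 0 = video, 1 = audio

stream = torch.cat([video_tok + modality(torch.tensor(0)),
                    audio_tok + modality(torch.tensor(1))], dim=1)

layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(stream)                          # joint self-attention
video_out, audio_out = out[:, :100], out[:, 100:]
```

The design point the tweet stresses is that nothing here routes information across modalities except ordinary self-attention over the shared sequence.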
elie @eliebakouch
very nice answer!! i'm super glad to see this outcome.

"base models" still confuses me here: since there's no public ckpt of k2.5 base, do you mean k2.5 post-trained, used as the base for the training?

also, "4x scale-up": is that compared to Composer 1.5/1 or to k2.5 training? if the latter, is it k2.5 full training, only k2 -> k2.5, or only k2.5 post-training?

would be nice to see more evals of k2.5 vs Composer 2 to see the improvement; it's a bit blurry if we look at the one in the blog post and compare to the k2.5 data point.
Aman Sanger @amanrsanger
We've evaluated a lot of base models on perplexity-based evals, and Kimi k2.5 proved to be the strongest! After that, we do continued pre-training and high-compute RL (a 4x scale-up). The combination of the strong base, CPT and RL, and Fireworks' inference and RL samplers makes Composer-2 frontier level.

It was a miss not to mention the Kimi base in our blog from the start. We'll fix that for the next model.
Quoting Kimi.ai @Kimi_Moonshot:

Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support. Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ's hosted RL and inference platform as part of an authorized commercial partnership.
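For context on what a perplexity-based base-model eval involves, here is a minimal sketch using the Hugging Face transformers API; the checkpoint name and eval text are stand-ins, and this is not Cursor's actual eval harness.

```python
# Minimal perplexity eval sketch for a causal LM. With labels equal to
# input_ids, the model returns the mean next-token cross-entropy, and
# exp(loss) is the perplexity on that text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

name = "gpt2"  # stand-in; swap in whichever base checkpoint you compare
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()
print(perplexity(lm, tok, "Residual connections have long relied on ..."))
```

In practice such evals average this quantity over a held-out corpus and rank candidate base checkpoints by it, which is presumably the spirit of the comparison described above.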
Cody Blakeney @code_star
Found another great midtraining paper. I haven't seen it on my TL so thought I would share. Super excited to dig into it later but looks really promising. (ty @lukemerrick_ ) I love seeing more work unifying understanding of midtraining -> RL
[4 images attached]
Zengzhi Wang reposted
Kimi.ai @Kimi_Moonshot
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
[Image attached]
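A toy sketch of the mechanism as the announcement describes it: replacing the fixed residual sum x_{l+1} = x_l + f(x_l) with learned, input-dependent attention over all preceding layers' outputs. This is my own reading of the tweet, not MoonshotAI's reference implementation, and it omits the Block AttnRes compression.

```python
# Toy "attention residual": each layer's input is an attention-weighted
# mix of every preceding layer's output, with weights computed from the
# current state, instead of the plain running sum.
import torch
import torch.nn as nn

class AttnResidualBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                               nn.Linear(d_model, d_model))
        self.q = nn.Linear(d_model, d_model)  # query from current state

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (layers_so_far, batch, seq, d_model)
        q = self.q(history[-1])                            # (B, S, D)
        # input-dependent scores over the depth axis
        scores = torch.einsum("bsd,lbsd->lbs", q, history)
        weights = torch.softmax(scores / history.shape[-1] ** 0.5, dim=0)
        mix = torch.einsum("lbs,lbsd->bsd", weights, history)
        return mix + self.f(mix)                # this layer's output

blocks = nn.ModuleList(AttnResidualBlock(64) for _ in range(4))
x = torch.randn(2, 8, 64)          # (batch, seq, d_model)
history = x.unsqueeze(0)           # keep the per-layer stack up front
for blk in blocks:
    history = torch.cat([history, blk(history).unsqueeze(0)], dim=0)
print(history.shape)               # torch.Size([5, 2, 8, 64])
```

Note the memory cost of keeping every layer's output, which is presumably what the Block AttnRes variant in the report is designed to amortize.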