




Zengzhi Wang

@SinclairWang1
PhDing @sjtu1896 Working on Pre-training Data Engineering for Foundation Models: MathPile (2023), 🫐 ProX (2024), 💎 MegaMath (2025),🐙 OctoThinker(2025)







🏋️Thinking Mid-training: RL of Interleaved Reasoning🎗️

We address the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase that teaches models how to think.

- Annotate pretraining data with interleaved thoughts
- SFT mid-training to learn when/what to think alongside the original content
- RL mid-training to optimize reasoning generation with a grounded reward from future-token prediction

Result: 3.2x improvement on reasoning benchmarks compared to direct RL post-training on base Llama-3-8B, plus gains over the SFT-only approach as well. Introducing reasoning earlier makes models better prepared for post-training!

Read more in the blog post: facebookresearch.github.io/RAM/blogs/thin…
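
A rough sense of the "grounded reward from future-token prediction" idea, as I read it: score a generated thought by how much it improves the likelihood of the upcoming pretraining tokens, relative to predicting them without the thought. The sketch below is only an illustration under that assumption; `lm_logprob` and `grounded_reward` are hypothetical stand-ins with a toy scoring rule, not the paper's actual API.

```python
# Toy sketch of a grounded reward for an interleaved thought: the thought is
# rewarded by how much it improves prediction of the *future* tokens.
# All names here are hypothetical, not from the paper.

def lm_logprob(context: str, continuation: str) -> float:
    """Placeholder for a language-model scoring call: total log-probability of
    `continuation` given `context`. In practice this would be the sum of token
    log-probs from a causal LM; here a toy word-overlap heuristic keeps the
    sketch runnable end-to-end."""
    ctx_words = set(context.lower().split())
    cont_words = continuation.lower().split()
    overlap = sum(1 for w in cont_words if w in ctx_words)
    return -len(cont_words) + overlap  # stand-in score, not a real log-prob


def grounded_reward(prefix: str, thought: str, future: str) -> float:
    """Reward a thought by the improvement in future-token likelihood it buys,
    relative to predicting the future tokens from the prefix alone."""
    with_thought = lm_logprob(prefix + " " + thought, future)
    without_thought = lm_logprob(prefix, future)
    return with_thought - without_thought


if __name__ == "__main__":
    prefix = "The derivative of x^2 is"
    future = "2x by the power rule"
    good_thought = "Apply the power rule: bring down the exponent and subtract one, giving 2x."
    weak_thought = "Cats are popular pets."
    print(grounded_reward(prefix, good_thought, future))  # higher reward
    print(grounded_reward(prefix, weak_thought, future))  # lower reward
```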



@stefan_fee is running a mini frontier lab in academia 🤯

New paper: We deploy Claude Code in an autoresearch loop to discover novel jailbreaking algorithms, and it works. It beats 30+ existing GCG-like attacks (with AutoML hyperparameter tuning). This is a strong sign that incremental safety and security research can now be automated.






Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining and high-compute RL training is exactly the open-model ecosystem we love to support. Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ's hosted RL and inference platform as part of an authorized commercial partnership.









@inductionheads Spot on. We actually just gave a guest lecture at Berkeley EECS on this exact dynamic (L11: Synthetic Data Powering Pre-Training). @fujikanaeda Here are our slides if anyone wants to go down the rabbit hole: scalable-ai.eecs.berkeley.edu/assets/lecture…



Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗Full report: github.com/MoonshotAI/Att…
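
For intuition, here is a minimal sketch of what "learned, input-dependent attention over preceding layers" could look like in place of a uniform residual sum. This is an assumption-laden toy (the `AttentionResidual` module name and its interface are made up), not Moonshot's implementation; see the linked report for the actual design, including Block AttnRes.

```python
# Toy sketch: the current layer's output attends over the stack of preceding
# layers' hidden states and adds a learned, input-dependent combination,
# instead of the usual uniformly accumulated residual stream.
# Hypothetical module, not the released implementation.

import torch
import torch.nn as nn


class AttentionResidual(nn.Module):
    """Input-dependent aggregation over preceding layers' hidden states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model, bias=False)
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # current: (batch, seq, d_model)            output of the current layer
        # history: (batch, seq, n_prev, d_model)    stacked outputs of earlier layers
        q = self.query(current).unsqueeze(-2)                        # (B, S, 1, D)
        k = self.key(history)                                        # (B, S, L, D)
        attn = torch.softmax((q * k).sum(-1) * self.scale, dim=-1)   # (B, S, L)
        aggregated = (attn.unsqueeze(-1) * history).sum(-2)          # (B, S, D)
        # Learned, per-position mixture of past layers replaces the uniform sum.
        return current + aggregated


if __name__ == "__main__":
    B, S, L, D = 2, 4, 6, 32
    history = torch.randn(B, S, L, D)   # previous layers' hidden states
    current = torch.randn(B, S, D)      # current layer's output
    block = AttentionResidual(D)
    print(block(current, history).shape)  # torch.Size([2, 4, 32])
```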


