Wei Ping

335 posts

Wei Ping

@_weiping

Distinguished Research Scientist & Director @Nvidia | Post-training, Reasoning, RL, Multimodal

San Francisco, CA Katılım Haziran 2020

372 Takip Edilen3.1K Takipçiler

Sabitlenmiş Tweet

Wei Ping@_weiping·10h

🚀 Introducing Nemotron-Cascade 2 🚀 Just 3 months after Nemotron-Cascade 1, we’re releasing Nemotron-Cascade 2: an open 30B MoE with 3B active parameters, delivering best-in-class reasoning and strong agentic capabilities. 🥇 Gold Medal-level performance on IMO 2025, IOI 2025, and ICPC World Finals 2025: • Capabilities once thought achievable only by frontier proprietary models (e.g. Gemini Deep Think) or frontier-scale open models (i.e. DeepSeek-V3.2-Speciale-671B-A37B). • Remarkably high intelligence density with 20× fewer parameters. 🏆 Best-in-class across math, code reasoning, alignment, and instruction following: • Outperforms the latest Qwen3.5-35B-A3B (2026-02-24) and even larger Qwen3.5-122B-A10B (2026-03-11). 🧠 Powered by Cascade RL + multi-domain on-policy distillation: • Significantly expand Cascade RL across a much broader range of reasoning and agentic domains than Nemotron-Cascade 1, while distilling from the strongest intermediate teacher models throughout training to recover regressions and sustain gains. 🤗 Model + SFT + RL data: 👉 huggingface.co/collections/nv… 📄 Technical report: 👉 research.nvidia.com/labs/nemotron/…

English

541

50.5K

Wei Ping retweetledi

Muyu He@HeMuyu0327·12 Mar

I like this Nvidia RL paper for its complete reproducibility, so much so that, in Feynman's language, you can "invent" the whole RL training pipeline yourself. A ton of persuasive ablations you would find missing even in those frontier model reports. Some takeaways: - RLHF is important as a warmup stage for even math and code RL. In a lot of tech reports we see that reasoning RL is often the first stage of training. The authors compare training math and coding directly after SFT vs after RLHF. They find lifted math/coding performances after RLHF on all benchmarks and even just training RLHF shows lifts. - It is beneficial to train math RL on different context lengths progressively. The authors train it in three stages: 24K, 32K, 40K. Each stage has its own use. The first two stages are motivated by the fact that models have a high probability of going over the max context length, so training them at 24/32K stabilizes the reasoning and makes it more effective, and is judged by the decrease in incomplete ratio. The final stage is motivated by the opposite problem: when extending to longer context (eg. 64K), the model cannot use all of the context effectively to solve hard AIME problems. So the authors introduce a third stage to let models extend their comfortable context length to 40K and see lift in performance. An interesting property is that when the context size is small (<=24K), ablations show that throwing away responses going over the context size is more beneficial than assigning them 0 reward. But at a longer context size (>=32K), assigning 0 reward is more beneficial. (p2) Another interesting property is that they empirically find that training with a temperature of 1 (as opposed to 0.6/0.8) leads to better performances of math and coding, and needs to be carefully maintained to not have entropy explode (p3). - Probably the most interesting part: you can improve SWE performances with RL that does not execute code in an environment at all. Specifically, to solve the code repair problem of a SWE task, the training setup is as simple as giving the model the problem files (with some noise) and asking it to come up with a patch. Since there is no code execution to give an outcome reward, the reward, which is very novel, is to let an LLM compare the predicted patch with the true patch, and get the probability of the model outputting the token "yes". As a probability, this reward naturally falls into [0, 1]. By training on this signal alone, the authors are able to scale up the number of tasks. And they find lift in SWE-Bench verified. - Plain old policy gradient work. The authors choose to keep the training fully on-policy. As a result, each rollout is trained for one gradient update, and therefore the importance sampling ratio is always 1. The authors state that this is for training stability and avoids entropy collapse. Overall, the pipeline is RLHF -> instruction following -> math -> coding -> SWE, and the authors keep track of benchmark performances after each stage to see the dynamics. They also go into details of data preparation, reward function, dynamic filtering for each stage. A great resource for the open source community.

English

330

24.8K

Wei Ping retweetledi

Bryan Catanzaro@ctnzr·11 Mar

Announcing NVIDIA Nemotron 3 Super! 💚120B-12A Hybrid SSM Latent MoE, designed for Blackwell 💚36 on AAIndex v4 💚up to 2.2X faster than GPT-OSS-120B in FP4 💚Open data, open recipe, open weights Models, Tech report, etc. here: research.nvidia.com/labs/nemotron/… And yes, Ultra is coming!

English

205

1.2K

200.6K

Wei Ping retweetledi

Oleksii Kuchaiev@kuchaev·11 Mar

Nemotron 3 Super is here — 120B total / 12B active, Hybrid SSM Latent MoE, designed for Blackwell. Truly open: permissive license, open data, open training infra. See analysis on @ArtificialAnlys Details in thread 🧵below:

English

275

28.7K

Wei Ping retweetledi

renjie pi@RenjiePi·10 Mar

Introducing Nemotron-Terminal: a systematic data engineering pipeline for scaling LLM Terminal Agents. We bridge the gap between open models and proprietary models with a fully open synthetic-to-real trajectory pipeline. 🤯The payoff: SFT on our Nemotron-Terminal-Corpus boosts Qwen3-32B from 3.4% → 27.4% on Terminal-Bench 2.0 (+24.0), rivaling models multiple its size. What makes it work? 🌟Terminal-Task-Gen: A lightweight data curation pipeline that seamlessly combines the adaptation of existing datasets with robust synthetic task construction. 🌟Nemotron-Terminal-Corpus: A massive, open-source dataset covering diverse terminal interactions, which contains explicit planning and execution traces for complex long-horizon tasks. And we’re releasing everything: 📦 Nemotron-Terminal-Corpus (Large-scale dataset) 🤖 Nemotron-Terminal models (8B, 14B, 32B) Paper: arxiv.org/abs/2602.21193 HF Daily: huggingface.co/papers/2602.21… Models & Data: huggingface.co/collections/nv… Our tech report just hit the #1 spot on Hugging Face Daily Papers! We're also incredibly excited to see the open-source community putting our work to the test, with the Nemotron-Terminal-Corpus dataset currently trending at over 1,800 downloads and counting. We can't wait to see what the community build with it!

English

208

16.9K

Wei Ping retweetledi

Nanbeige@nanbeige·11 Şub

🚀 Announcing Nanbeige4.1-3B – our latest open-source 3B model mastering reasoning, preference alignment, & agent capabilities! Try it now: huggingface.co/Nanbeige/Nanbe… reddit.com/r/LocalLLaMA/c…

English

122

879

194.5K

Wei Ping@_weiping·7 Şub

Cool demo samples here — check them out! 🔥 research.nvidia.com/labs/adlr/UALM/

zhifeng kong@ZhifengKong

The UALM paper is accepted as an Oral presentation at ICLR. Key takeaways: 1) a unified LM for audio understanding and text2audio generation 2) use a proper audio codec and delay pattern for audio outputs 3) scale data!!! 4) use classifier free guidance arxiv.org/abs/2510.12000

English

3.7K

Wei Ping@_weiping·3 Şub

@rogerliuty Congrats!

English

163

Tianyu Liu@rogerliuty·3 Şub

Feels great to see our efforts paying off!😁

Arena.ai@arena

🚨BREAKING: Kimi K2.5 by @Kimi_Moonshot is now the #1 open model in Code Arena! In Code Arena’s agentic coding evaluations, Kimi K2.5 is now: - #1 open model, surpassing GLM-4.7 - #5 overall, on par with top proprietary models like Gemini-3-Flash - The only open model in the top 5 🏆Kimi K2.5 is the best open model across Text, Vision, and Code Arena. Huge congrats to the @Kimi_Moonshot team for continuing to push the frontier of open models 👏

English

2.5K

Wei Ping@_weiping·1 Şub

@andrew_n_carr @Zhen4good Yes

Andrew Carr 🤸@andrew_n_carr·1 Şub

@Zhen4good @_weiping Text+vision is likely lower quality than text only - I believe that's the point being made

English

Wei Ping@_weiping·31 Oca

Very enlightening results from removing vision–text SFT 👀 • Strong vision–text pretraining + text-only SFT (zero vision) already boosts visual reasoning & tool use • Then, add multimodal RL → SOTA on both vision and text • Vision–text SFT hurts generalization ; IMO, likely due to lower-diversity, lower-quality trajectories compared to text-only SFT data

Kimi.ai@Kimi_Moonshot

Kimi K2.5 tech report just dropped! Quick hits: - Joint text–vision training: pretrained with 15T vision-text tokens, zero-vision SFT (text-only) to activate visual reasoning - Agent Swarm + PARL: dynamically orchestrated parallel sub-agents, up to 4.5× lower latency, 78.4% on BrowseComp - MoonViT-3D: a unified image–video encoder with 4× temporal compression, enabling 4× longer videos in the same context - Toggle: token-efficient RL, 25–30% fewer tokens with no accuracy drop Here's our work toward scalable, real-world agentic intelligence. More details in the report 👉github.com/MoonshotAI/Kim…

English

6.3K

Wei Ping@_weiping·1 Şub

Quantization-aware distillation (QAD) delivers near-BF16 accuracy with NVFP4! Amazing work led by @huizi_mao

NVIDIA AI Developer@NVIDIAAIDev

We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200. Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16. Nemotron 3 Nano NVFP4: nvda.ws/4t63z9y Tech Report: nvda.ws/4bj3pp0

English

300

33.7K

Wei Ping@_weiping·31 Oca

Very enlightening results from removing vision–text SFT!! • Strong vision–text pretraining + text-only SFT (zero vision) already boosts visual reasoning & tool use • Then, add multimodal RL → SOTA on both vision and text • Vision–text SFT hurts generalization ; IMO, likely due to lower-diversity, lower-quality trajectories compared to text-only SFT data

English

1.1K

Kimi.ai@Kimi_Moonshot·30 Oca

English

286

1.9K

311.2K

Wei Ping@_weiping·30 Oca

@Kimi_Moonshot so cool!

English

742

Wei Ping@_weiping·16 Oca

@jasondeanlee don’t even need to have a frontier model to be a frontier ai lab now?

English

6.9K

Jason Lee@jasondeanlee·16 Oca

What's the secret? Long vesting I bet

Yuchen Jin@Yuchenj_UW

Every frontier AI lab has lost co-founder(s): - OpenAI: 8 of 11 gone - Thinking Machines: 3 of 6 gone - SSI: 1 of 3 gone - DeepMind: 1 of 3 gone - xAI: 3 of 12 gone One exception: Anthropic. All 7 co-founders are still there. Anthropic culture is worth studying.

English

20.3K

Wei Ping@_weiping·15 Oca

@wzihanw code is cheap. non-happy-path tests are still costly.

English

141

Zihan "Zenus" Wang@wzenus·14 Oca

I'm seeing a ubiquitous trend where: Code is cheap; words are costly. Answers are cheap; questions are costly.

Claude@claudeai

Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code.

English

6.3K

Wei Ping retweetledi

Shizhe Diao@shizhediao·9 Oca

RLVR is powerful — but how do you train with multiple rewards effectively? 🤔 🎯GDPO (not GRPO) is coming. We introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new multi-reward RL algorithm that consistently improves per-reward convergence over GRPO across a wide range of tasks. (1/n)

English

132

815

82.5K

Wei Ping@_weiping·11 Oca

feels like me and my 7-yo son, with the engineer desperately holding the cute humanoid back from going feral😂

Eren Chen@ErenChenAI

Just an ordinary day at a robotics company.

English

832

Wei Ping@_weiping·7 Oca

@Guangxuan_Xiao @MITEECS @thinkymachines Congrats, Dr. @Guangxuan_Xiao!

Français

1.2K

Guangxuan Xiao@Guangxuan_Xiao·7 Oca

Life update: Wrapped up my PhD at @MITEECS 🎓 Super excited to start working on pre-training at @thinkymachines.

English

1.9K

73.1K

Wei Ping@_weiping·6 Oca

Thanks for sharing our work, @omarsar0 — excited to see the discussion it sparks!

elvis@omarsar0

Banger paper from NVIDIA. Training general-purpose reasoning models with RL is complicated. Different domains have wildly different response lengths and verification times. Math uses fast symbolic verification. Code requires slow execution-based verification. Alignment needs reward model scores. Blending all these heterogeneous prompts together makes the infrastructure complex, slows training, and makes hyperparameter tuning difficult. This new research introduces Cascade RL, a framework that trains models sequentially across domains rather than mixing everything together. First RLHF for alignment, then instruction-following RL, then math RL, then code RL, then software engineering RL. This sequential approach is resistant to catastrophic forgetting. In RL, the model generates its own experience, so old behaviors remain if they stay reward-relevant. Unlike supervised learning, where previous data disappears, RL optimizes cumulative reward rather than fitting exact targets. RLHF, as a pre-step, actually boosts reasoning ability far beyond mere preference optimization by reducing verbosity and repetition. Subsequent domain-specific RL stages rarely degrade earlier performance and may even improve it. Here are the results: Their 14B model outperforms its own SFT teacher, DeepSeek-R1-0528 (671B), on LiveCodeBench v5/v6/Pro. Nemotron-Cascade-8B achieves 71.1% on LiveCodeBench v6, comparable to DeepSeek-R1-0528 at 73.3% despite being 84x smaller. The 14B model achieved silver medal performance at IOI 2025. They also demonstrate that unified reasoning models can operate effectively in both thinking and non-thinking modes, closing the gap with dedicated thinking models while keeping everything in a single model. Paper: arxiv.org/abs/2512.13607 Learn to build effective AI Agents in our academy: dair-ai.thinkific.com

English

1.6K

Wei Ping@_weiping·2 Oca

Since the release, many have asked why cascaded, domain-wise RL is so resistant to catastrophic forgetting in Nemotron-Cascade. It really comes down to the nature of RL, the structure of the problem, and strong execution. We break it down in our report 👇

Wei Ping@_weiping

🚀 Introducing Nemotron-Cascade! 🚀 We’re thrilled to release Nemotron-Cascade, a family of general-purpose reasoning models trained with cascaded, domain-wise reinforcement learning (Cascade RL), delivering best-in-class performance across a wide range of benchmarks. 💻 Coding powerhouse After RL, our 14B model: • Surpasses DeepSeek-R1-0528 (671B) on LiveCodeBench v5/v6/Pro. • Achieves silver-medal performance at IOI 2025 🥈. • Reaches a 43.1% pass @1 on SWE-Bench Verified, and 53.8% with test-time scaling. 🧠 What is Cascade RL? Instead of mixing heterogeneous prompts across domains, Cascade RL trains sequentially, domain by domain, which reduces engineering complexity, mitigates heterogeneous verification latencies, and enables domain-specific curricula and tailored hyperparameter tuning. ✨ Key insight Using RLHF for alignment as a pre-step dramatically boosts complex reasoning—far beyond preference optimization. Subsequent domain-wise RLVR stages rarely hurt the benchmark performance attained in earlier domains and may even improve it, as illustrated in the following figure. 🤗 Models & training data 🔥 👉 huggingface.co/collections/nv… 📄 Technical report with detailed training and data recipes 👉 arxiv.org/pdf/2512.13607

English

2.6K

Keşfet

@ArtificialAnlys @rogerliuty @andrew_n_carr @Zhen4good @huizi_mao @Kimi_Moonshot @jasondeanlee @elonmusk