Wei Ping
@_weiping

335 posts
Distinguished Research Scientist & Director @Nvidia | Post-training, Reasoning, RL, Multimodal

San Francisco, CA · Joined June 2020
372 Following · 3.1K Followers

Pinned Tweet
Wei Ping @_weiping ·
🚀 Introducing Nemotron-Cascade 2 🚀

Just 3 months after Nemotron-Cascade 1, we’re releasing Nemotron-Cascade 2: an open 30B MoE with 3B active parameters, delivering best-in-class reasoning and strong agentic capabilities.

🥇 Gold medal-level performance on IMO 2025, IOI 2025, and the ICPC World Finals 2025:
• Capabilities once thought achievable only by frontier proprietary models (e.g., Gemini Deep Think) or frontier-scale open models (i.e., DeepSeek-V3.2-Speciale-671B-A37B).
• Remarkably high intelligence density with 20× fewer parameters.

🏆 Best-in-class across math, code reasoning, alignment, and instruction following:
• Outperforms the latest Qwen3.5-35B-A3B (2026-02-24) and even the larger Qwen3.5-122B-A10B (2026-03-11).

🧠 Powered by Cascade RL + multi-domain on-policy distillation:
• Significantly expands Cascade RL across a much broader range of reasoning and agentic domains than Nemotron-Cascade 1, while distilling from the strongest intermediate teacher models throughout training to recover regressions and sustain gains.

🤗 Model + SFT + RL data: 👉 huggingface.co/collections/nv…
📄 Technical report: 👉 research.nvidia.com/labs/nemotron/…
Wei Ping tweet media
Replies 25 · Reposts 87 · Likes 541 · Views 50.5K
Wei Ping retweeted
Muyu He @HeMuyu0327 ·
I like this Nvidia RL paper for its complete reproducibility — so much so that, in Feynman's language, you could "invent" the whole RL training pipeline yourself. A ton of persuasive ablations you would find missing even in frontier model reports. Some takeaways:

- RLHF is an important warmup stage even for math and code RL. In many tech reports, reasoning RL is the first stage of training. The authors compare training math and coding directly after SFT vs. after RLHF, and find improved math/coding performance on all benchmarks after RLHF — even RLHF alone shows lifts.

- It is beneficial to train math RL at progressively longer context lengths. The authors train in three stages: 24K, 32K, 40K. Each stage has its own purpose. The first two are motivated by the fact that models have a high probability of exceeding the max context length, so training at 24K/32K stabilizes the reasoning and makes it more effective, as judged by the decreasing ratio of incomplete responses. The final stage is motivated by the opposite problem: when extended to longer contexts (e.g., 64K), the model cannot use all of the context effectively to solve hard AIME problems, so a third stage extends the model's comfortable context length to 40K, with a corresponding lift in performance. An interesting property: when the context size is small (<=24K), ablations show that discarding responses that exceed the context size is more beneficial than assigning them zero reward; at longer context sizes (>=32K), assigning zero reward is more beneficial (p2). Another interesting finding: training with a temperature of 1 (as opposed to 0.6/0.8) empirically leads to better math and coding performance, but the temperature needs to be carefully maintained so that entropy does not explode (p3).

- Probably the most interesting part: you can improve SWE performance with RL that does not execute code in an environment at all. Specifically, to solve the code-repair portion of a SWE task, the training setup is as simple as giving the model the problem files (with some noise) and asking it to produce a patch. Since there is no code execution to provide an outcome reward, the (very novel) reward is to have an LLM compare the predicted patch with the true patch and take the probability of the model emitting the token "yes". As a probability, this reward naturally falls in [0, 1]. Training on this signal alone lets the authors scale up the number of tasks, and they find a lift on SWE-Bench Verified.

- Plain old policy gradient works. The authors keep training fully on-policy: each rollout is used for exactly one gradient update, so the importance sampling ratio is always 1. They state this is for training stability and to avoid entropy collapse.

Overall, the pipeline is RLHF -> instruction following -> math -> coding -> SWE, and the authors track benchmark performance after each stage to observe the dynamics. They also detail data preparation, reward functions, and dynamic filtering for each stage. A great resource for the open-source community.
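The execution-free reward described above can be sketched in a few lines. Everything here is illustrative (the function name and the toy logit dict are stand-ins, not from the paper): a judge LLM is asked whether the predicted patch matches the reference patch, and the reward is the softmax probability it assigns to the token "yes".

```python
import math

def yes_token_reward(judge_logits: dict[str, float], yes_token: str = "yes") -> float:
    """Return the judge's softmax probability of answering "yes", used
    directly as a scalar reward in [0, 1]. `judge_logits` stands in for the
    judge LLM's next-token logits; a tiny dict keeps the sketch self-contained."""
    z = max(judge_logits.values())  # subtract the max for numerical stability
    denom = sum(math.exp(v - z) for v in judge_logits.values())
    return math.exp(judge_logits[yes_token] - z) / denom

# A judge that is confident the predicted patch matches the reference
# yields a reward close to 1; a skeptical judge yields a reward near 0.
reward = yes_token_reward({"yes": 4.0, "no": 1.0})
```

Because the reward is a probability rather than a binary pass/fail, it gives a dense, bounded signal with no sandbox or test harness in the loop — which is what lets the task count scale.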
[3 images attached]
Replies 5 · Reposts 31 · Likes 330 · Views 24.8K
Wei Ping retweeted
Bryan Catanzaro @ctnzr ·
Announcing NVIDIA Nemotron 3 Super!
💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 Up to 2.2× faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights
Models, tech report, etc. here: research.nvidia.com/labs/nemotron/…
And yes, Ultra is coming!
Bryan Catanzaro tweet media
Replies 62 · Reposts 205 · Likes 1.2K · Views 200.6K
Wei Ping retweeted
Oleksii Kuchaiev @kuchaev ·
Nemotron 3 Super is here — 120B total / 12B active, Hybrid SSM Latent MoE, designed for Blackwell. Truly open: permissive license, open data, open training infra. See analysis on @ArtificialAnlys. Details in thread 🧵 below:
[2 images attached]
Replies 10 · Reposts 45 · Likes 275 · Views 28.7K
Wei Ping retweeted
renjie pi @RenjiePi ·
Introducing Nemotron-Terminal: a systematic data-engineering pipeline for scaling LLM terminal agents. We bridge the gap between open models and proprietary models with a fully open synthetic-to-real trajectory pipeline.

🤯 The payoff: SFT on our Nemotron-Terminal-Corpus boosts Qwen3-32B from 3.4% → 27.4% on Terminal-Bench 2.0 (+24.0), rivaling models many times its size.

What makes it work?
🌟 Terminal-Task-Gen: a lightweight data-curation pipeline that seamlessly combines the adaptation of existing datasets with robust synthetic task construction.
🌟 Nemotron-Terminal-Corpus: a massive, open-source dataset covering diverse terminal interactions, containing explicit planning and execution traces for complex long-horizon tasks.

And we’re releasing everything:
📦 Nemotron-Terminal-Corpus (large-scale dataset)
🤖 Nemotron-Terminal models (8B, 14B, 32B)

Paper: arxiv.org/abs/2602.21193
HF Daily: huggingface.co/papers/2602.21…
Models & Data: huggingface.co/collections/nv…

Our tech report just hit the #1 spot on Hugging Face Daily Papers! We're also incredibly excited to see the open-source community putting our work to the test, with the Nemotron-Terminal-Corpus dataset currently trending at over 1,800 downloads and counting. We can't wait to see what the community builds with it!
[3 images attached]
Replies 6 · Reposts 30 · Likes 208 · Views 16.9K
Tianyu Liu @rogerliuty ·
Feels great to see our efforts paying off!😁
Arena.ai @arena

🚨 BREAKING: Kimi K2.5 by @Kimi_Moonshot is now the #1 open model in Code Arena!

In Code Arena’s agentic coding evaluations, Kimi K2.5 is now:
- #1 open model, surpassing GLM-4.7
- #5 overall, on par with top proprietary models like Gemini-3-Flash
- The only open model in the top 5

🏆 Kimi K2.5 is the best open model across Text, Vision, and Code Arena. Huge congrats to the @Kimi_Moonshot team for continuing to push the frontier of open models 👏

Replies 2 · Reposts 1 · Likes 17 · Views 2.5K
Wei Ping @_weiping ·
Very enlightening results from removing vision–text SFT!!
• Strong vision–text pretraining + text-only SFT (zero vision) already boosts visual reasoning & tool use
• Then add multimodal RL → SOTA on both vision and text
• Vision–text SFT hurts generalization; IMO, likely due to lower-diversity, lower-quality trajectories compared to text-only SFT data
Replies 0 · Reposts 0 · Likes 4 · Views 1.1K
Kimi.ai @Kimi_Moonshot ·
Kimi K2.5 tech report just dropped! Quick hits:
- Joint text–vision training: pretrained with 15T vision–text tokens, zero-vision SFT (text-only) to activate visual reasoning
- Agent Swarm + PARL: dynamically orchestrated parallel sub-agents, up to 4.5× lower latency, 78.4% on BrowseComp
- MoonViT-3D: a unified image–video encoder with 4× temporal compression, enabling 4× longer videos in the same context
- Toggle: token-efficient RL, 25–30% fewer tokens with no accuracy drop

Here's our work toward scalable, real-world agentic intelligence. More details in the report 👉 github.com/MoonshotAI/Kim…
[4 images attached]
Replies 54 · Reposts 286 · Likes 1.9K · Views 311.2K
Wei Ping @_weiping ·
@jasondeanlee don’t even need to have a frontier model to be a frontier ai lab now?
Replies 0 · Reposts 1 · Likes 14 · Views 6.9K
Wei Ping @_weiping ·
@wzihanw code is cheap. non-happy-path tests are still costly.
Replies 0 · Reposts 0 · Likes 1 · Views 141
Wei Ping retweeted
Shizhe Diao @shizhediao ·
RLVR is powerful — but how do you train with multiple rewards effectively? 🤔
🎯 GDPO (not GRPO) is coming. We introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new multi-reward RL algorithm that consistently improves per-reward convergence over GRPO across a wide range of tasks. (1/n)
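The tweet doesn't spell out GDPO's exact update, but the decoupling idea it names can be sketched as follows (a minimal sketch under assumptions: function names, the `EPS` constant, and the GRPO baseline shown for contrast are mine, not from the paper):

```python
import statistics

EPS = 1e-8  # guards against zero variance within a group

def grpo_advantages(rewards_per_rollout: list[list[float]]) -> list[float]:
    """GRPO-style baseline: sum all reward dimensions per rollout first,
    then normalize the combined scalar within the group."""
    totals = [sum(dims) for dims in rewards_per_rollout]
    mu, sigma = statistics.mean(totals), statistics.pstdev(totals)
    return [(t - mu) / (sigma + EPS) for t in totals]

def gdpo_advantages(rewards_per_rollout: list[list[float]]) -> list[float]:
    """Decoupled sketch: normalize each reward dimension within the group
    separately, then sum the per-dimension advantages, so no single
    reward's scale or variance dominates the update."""
    n = len(rewards_per_rollout)
    advs = [0.0] * n
    for d in range(len(rewards_per_rollout[0])):
        col = [r[d] for r in rewards_per_rollout]
        mu, sigma = statistics.mean(col), statistics.pstdev(col)
        for i in range(n):
            advs[i] += (col[i] - mu) / (sigma + EPS)
    return advs
```

For a group of four rollouts scored on two rewards, e.g. `[[1, 0], [0, 1], [1, 1], [0, 0]]`, the decoupled version gives the all-correct rollout roughly +1 of advantage per reward dimension, whereas the coupled baseline only normalizes the summed score.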
Shizhe Diao tweet media
Replies 25 · Reposts 132 · Likes 815 · Views 82.5K
Wei Ping @_weiping ·
Thanks for sharing our work, @omarsar0 — excited to see the discussion it sparks!
elvis @omarsar0

Banger paper from NVIDIA.

Training general-purpose reasoning models with RL is complicated. Different domains have wildly different response lengths and verification times. Math uses fast symbolic verification. Code requires slow execution-based verification. Alignment needs reward model scores. Blending all these heterogeneous prompts together makes the infrastructure complex, slows training, and makes hyperparameter tuning difficult.

This new research introduces Cascade RL, a framework that trains models sequentially across domains rather than mixing everything together: first RLHF for alignment, then instruction-following RL, then math RL, then code RL, then software engineering RL.

This sequential approach is resistant to catastrophic forgetting. In RL, the model generates its own experience, so old behaviors persist as long as they stay reward-relevant. Unlike supervised learning, where previous data disappears, RL optimizes cumulative reward rather than fitting exact targets. RLHF, as a pre-step, actually boosts reasoning ability far beyond mere preference optimization by reducing verbosity and repetition. Subsequent domain-specific RL stages rarely degrade earlier performance and may even improve it.

Here are the results: their 14B model outperforms its own SFT teacher, DeepSeek-R1-0528 (671B), on LiveCodeBench v5/v6/Pro. Nemotron-Cascade-8B achieves 71.1% on LiveCodeBench v6, comparable to DeepSeek-R1-0528 at 73.3% despite being 84× smaller. The 14B model achieved silver-medal performance at IOI 2025.

They also demonstrate that unified reasoning models can operate effectively in both thinking and non-thinking modes, closing the gap with dedicated thinking models while keeping everything in a single model.

Paper: arxiv.org/abs/2512.13607
Learn to build effective AI Agents in our academy: dair-ai.thinkific.com
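The sequential recipe described above can be sketched as a plain training loop. The `train_stage`/`evaluate` interfaces below are hypothetical stand-ins (the real recipes and hyperparameters are in the report); the stage order follows the tweet.

```python
# Stage order as described in the thread.
STAGES = ["rlhf", "instruction_following", "math", "code", "swe"]

def cascade_rl(policy, train_stage, evaluate):
    """Run RL one domain at a time, then re-evaluate every domain trained
    so far — the check that makes forgetting (or its absence) visible."""
    history = []
    for i, stage in enumerate(STAGES):
        policy = train_stage(policy, stage)  # domain-specific RL stage
        scores = {s: evaluate(policy, s) for s in STAGES[: i + 1]}
        history.append((stage, scores))
    return policy, history
```

The point of the structure is that each domain gets its own curriculum, reward function, and hyperparameters, while the per-stage re-evaluation tracks whether earlier-domain performance survives later stages.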

Replies 0 · Reposts 1 · Likes 12 · Views 1.6K
Wei Ping @_weiping ·
Since the release, many have asked why cascaded, domain-wise RL is so resistant to catastrophic forgetting in Nemotron-Cascade. It really comes down to the nature of RL, the structure of the problem, and strong execution. We break it down in our report 👇
Wei Ping tweet media
Wei Ping@_weiping

🚀 Introducing Nemotron-Cascade! 🚀

We’re thrilled to release Nemotron-Cascade, a family of general-purpose reasoning models trained with cascaded, domain-wise reinforcement learning (Cascade RL), delivering best-in-class performance across a wide range of benchmarks.

💻 Coding powerhouse
After RL, our 14B model:
• Surpasses DeepSeek-R1-0528 (671B) on LiveCodeBench v5/v6/Pro.
• Achieves silver-medal performance at IOI 2025 🥈.
• Reaches 43.1% pass@1 on SWE-Bench Verified, and 53.8% with test-time scaling.

🧠 What is Cascade RL?
Instead of mixing heterogeneous prompts across domains, Cascade RL trains sequentially, domain by domain, which reduces engineering complexity, mitigates heterogeneous verification latencies, and enables domain-specific curricula and tailored hyperparameter tuning.

✨ Key insight
Using RLHF for alignment as a pre-step dramatically boosts complex reasoning—far beyond preference optimization. Subsequent domain-wise RLVR stages rarely hurt the benchmark performance attained in earlier domains and may even improve it, as illustrated in the attached figure.

🤗 Models & training data 🔥 👉 huggingface.co/collections/nv…
📄 Technical report with detailed training and data recipes 👉 arxiv.org/pdf/2512.13607

Replies 0 · Reposts 4 · Likes 29 · Views 2.6K