Yoshi Suhara

185 posts

@suhara

Building Small Language Models @nvidia

Santa Clara, CA · Joined June 2007
299 Following · 357 Followers
Yoshi Suhara @suhara
The model was compressed from Nano 9B v2 using Nemotron Elastic (arxiv.org/abs/2511.16664) and post-trained on Nemotron 3 post-training data using a new recipe designed to improve accuracy in reasoning-off mode. Please see the blog and the Nemotron Elastic paper for details.
0 replies · 0 reposts · 1 like · 104 views
Yoshi Suhara retweeted
NVIDIA AI Developer @NVIDIAAIDev
Introducing NVIDIA Nemotron 3 Super 🎉
Open 120B-parameter (12B active) hybrid Mamba-Transformer MoE model
Native 1M-token context
Built for compute-efficient, high-accuracy multi-agent applications
Plus, fully open weights, datasets and recipes for easy customization and deployment. 🧵
59 replies · 106 reposts · 819 likes · 134.7K views
Yoshi Suhara retweeted
NVIDIA Japan @NVIDIAJapan
[New model release 🚀 Nemotron-Nano-9B-v2-Japanese] Today, NVIDIA released NVIDIA Nemotron-Nano-9B-v2-Japanese, which achieves state-of-the-art (SOTA) performance on the Nejumi Leaderboard 4 among models with 10B parameters or fewer. It is available for commercial use. huggingface.co/blog/nvidia/ne…
10 replies · 330 reposts · 1.3K likes · 304.9K views
Yoshi Suhara retweeted
Pavlo Molchanov @PavloMolchanov
🚀 New NVIDIA report: NVFP4 + Quantization-Aware Distillation (QAD)
FP4 inference without quality collapse. Key idea: distill a BF16 teacher into an NVFP4 student using KL loss - much more robust than PTQ/QAT, especially after SFT/RL.
🔥 Near-BF16 accuracy
⚡ ~2-3× throughput, ~1.8× memory savings vs FP8
🧠 Works for LLMs and VLMs (Nemotron Nano, Super, VL)
Technical report: huggingface.co/nvidia/NVIDIA-…
Research blog: research.nvidia.com/labs/nemotron/…
Hugging Face models: research.nvidia.com/labs/nemotron/…

We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200. Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16. Nemotron 3 Nano NVFP4: nvda.ws/4t63z9y Tech Report: nvda.ws/4bj3pp0

4 replies · 17 reposts · 114 likes · 15.4K views
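As a rough illustration of the distillation objective described in the post above (not NVIDIA's actual NVFP4/QAD implementation), a minimal PyTorch sketch might look like the following. The `fake_quant` function is a generic quantize-dequantize stand-in for low-precision weights, and the loop assumes Hugging Face-style causal LM outputs:

```python
# Hedged sketch of quantization-aware distillation (QAD): a frozen BF16 teacher
# guides a student whose weights are fake-quantized in the forward pass.
# `fake_quant` is a generic quantize-dequantize stand-in, NOT real NVFP4.
import torch
import torch.nn.functional as F

def fake_quant(w: torch.Tensor, levels: int = 16) -> torch.Tensor:
    """Quantize-dequantize to `levels` uniform levels, with a straight-through
    estimator so gradients still reach the underlying high-precision weights."""
    scale = w.abs().amax() / (levels // 2 - 1) + 1e-8
    w_q = torch.clamp(torch.round(w / scale), -(levels // 2), levels // 2 - 1) * scale
    return w + (w_q - w).detach()

def qad_step(teacher, student, batch, optimizer, temperature: float = 1.0):
    """One QAD step: KL divergence between teacher and (fake-quantized) student logits."""
    with torch.no_grad():
        t_logits = teacher(**batch).logits      # BF16 teacher, frozen
    s_logits = student(**batch).logits          # student's linear layers apply fake_quant to their weights
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point made in the post is that matching the BF16 teacher's distribution, rather than re-running PTQ or plain QAT on the original training loss, is what keeps the FP4 student close to BF16 accuracy, particularly for checkpoints that have already been through SFT/RL.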
Yoshi Suhara retweeted
Jian Zhang @JianZhangCS
🚀 @Nvidia Nemotron 3 Nano is live! Nemotron 3 Nano is the world's most efficient open MoE, with a hybrid-MoE architecture and 1M context length.
🔥 Strong in reasoning, agentic, and chat tasks, with leading accuracy on the AA index, Tau2, and SWE-Bench
🔥 Up to 3.3X higher throughput compared to other open MoEs of similar size
🔥 A fully open recipe, with data and infra released to the community
Check out the new model architecture and reinforcement learning technologies we used below: 😊
Hugging Face: huggingface.co/collections/nv…
📢 Research blog: nvda.ws/48RusVt
🛣️ NeMo RL & NeMo Gym (RL environment orchestration): github.com/NVIDIA-NeMo/RL & github.com/NVIDIA-NeMo/Gym
Kudos to the teams for months of hard work! We are excited to keep building the Nemotron 3 model family and empower the community.
5 replies · 24 reposts · 247 likes · 25.3K views
Yoshi Suhara retweeted
Georgi Gerganov @ggerganov
In collaboration with NVIDIA, the new Nemotron 3 Nano model is fully supported in llama.cpp.

Nemotron 3 Nano features an efficient hybrid Mamba MoE architecture. It's a promising model, suitable for local AI applications on mid-range hardware, and the large context window makes it a great choice for a variety of use cases and applications.

The efficiency of llama.cpp and the unique context management features of the `llama-server` tool allow us to deploy and use this model on a wide range of hardware. With recent code contributions by engineering teams at NVIDIA and open-source collaborators, we can run this model very efficiently across the entire spectrum of NVIDIA GPUs.

Learn more at @NVIDIA_AI_PC developer.nvidia.com/blog/inside-nv…
8 replies · 41 reposts · 405 likes · 27.1K views
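For local use, a minimal sketch with llama-cpp-python (the Python bindings for llama.cpp) along these lines should work once a GGUF conversion of the model is on hand; the file name below is a placeholder, and serving through the `llama-server` binary mentioned above is the other common path:

```python
# Hedged sketch: running a GGUF build of Nemotron 3 Nano locally via llama-cpp-python.
# The model file name is a placeholder; download or convert an actual GGUF first.
from llama_cpp import Llama

llm = Llama(
    model_path="nemotron-3-nano.Q4_K_M.gguf",  # placeholder path to a local GGUF file
    n_ctx=32768,       # context window to allocate (the model itself supports far more)
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three local-AI use cases for a long-context model."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```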
Yoshi Suhara retweeted
Bryan Catanzaro @ctnzr
Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.
41 replies · 222 reposts · 1.2K likes · 504.7K views
Yoshi Suhara retweeted
Shizhe Diao @shizhediao
Thrilled to share that CLIMB has been accepted to the NeurIPS DB track! 🍀 Feeling so lucky to work with such an amazing team. #NeurIPS2025

Thrilled to share my first project at NVIDIA! ✨

Today's language models are pre-trained on vast and chaotic Internet texts, but these texts are unstructured and poorly understood. We propose CLIMB — Clustering-based Iterative Data Mixture Bootstrapping — a fully automated framework that reorganizes pre-training data into clusters and iteratively searches for the best mixture.

CLIMB does three things:
➤ Embeds and clusters web-scale data semantically.
➤ Searches, iteratively and efficiently, for optimal data mixtures using a lightweight proxy model + predictor loop.
➤ Learns how different domains interact, and how the right mix can unlock downstream performance we didn't know was possible.

On paper, the gains are real:
➤ Our 1B model, trained on CLIMB mixtures with 400B tokens, outperforms LLaMA 3.2-1B.
➤ In some specific domains, e.g., Social Sciences, we see up to +5% improvements.
➤ We open-sourced ClimbLab (1.2T tokens across 20 domains) and ClimbMix (400B tokens, outperforming existing baselines under the same budget).

The real win isn't just numbers, it's the idea that we can bootstrap searching 🔎. This improves data efficiency a lot. We hope CLIMB can be a small step toward more transparent, structured, and efficient pre-training. One where we curate not by filtering noise, but by discovering signal.

We'd love to hear from others exploring the frontiers of data-centric AI. Let's CLIMB together!

🔗 Read our paper: arxiv.org/abs/2504.13161
📂 Datasets available on Hugging Face: huggingface.co/collections/nv…
🌐 Project page: research.nvidia.com/labs/lpr/climb (check the cluster visualizations)
🗨️ Discussion: huggingface.co/papers/2504.13…

0 replies · 4 reposts · 57 likes · 5K views
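To make the cluster-then-search loop described above concrete, here is a hedged sketch of the idea (not the CLIMB implementation itself): `train_proxy_and_eval` is a hypothetical stand-in for training a small proxy model on a candidate mixture and scoring it downstream, and the cheap predictor pre-filters candidates so fewer expensive proxy runs are needed.

```python
# Hedged sketch of the CLIMB idea: cluster pre-training data, then iteratively
# search for a good cluster mixture with a proxy-model + predictor loop.
# `train_proxy_and_eval` is a hypothetical stand-in for an expensive proxy run.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

def climb_search(doc_embeddings, train_proxy_and_eval,
                 n_clusters=20, n_rounds=3, samples_per_round=16):
    # 1) Semantically cluster the corpus via its embeddings.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings)

    tried_w, tried_scores = [], []
    best_w, best_score = None, -np.inf
    for _ in range(n_rounds):
        # 2) Sample candidate mixture weights over clusters (points on the simplex).
        candidates = np.random.dirichlet(np.ones(n_clusters), size=samples_per_round)
        if tried_w:
            # 3) Fit a cheap predictor on (mixture -> score) pairs from earlier rounds
            #    and keep only the most promising candidates for real proxy training.
            predictor = GradientBoostingRegressor().fit(np.array(tried_w), np.array(tried_scores))
            keep = np.argsort(-predictor.predict(candidates))[: max(1, samples_per_round // 4)]
            candidates = candidates[keep]
        for w in candidates:
            score = train_proxy_and_eval(labels, w)  # train a small proxy on this mix, evaluate downstream
            tried_w.append(w)
            tried_scores.append(score)
            if score > best_score:
                best_w, best_score = w, score
    return best_w, best_score
```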
Yoshi Suhara retweeted
Artificial Analysis @ArtificialAnlys
NVIDIA has released Nemotron Nano 9B V2, a small 9B reasoning model that scores 43 on the Artificial Analysis Intelligence Index, the highest yet for <10B models.

Nemotron 9B V2 is the first Nemotron model pre-trained by @NVIDIA. Previous Nemotron models have been developed by post-training on Meta Llama models.

Architecture & Training: The model uses a hybrid Mamba-Transformer architecture. NVIDIA pre-trained a 12B parameter base model and applied post-training with a range of techniques including RLHF and GRPO. The final 9B size was pruned from this model and re-trained with the base model as a teacher.

Small-model frontier: with only 9B parameters, Nemotron Nano 9B V2 is placed ahead of Llama 4 Maverick on our leaderboard, equal to Solar Pro 2 with reasoning, and trails just behind gpt-oss-20B (high).

Along with this model, NVIDIA released a 6.6-trillion token subset of their pre-training data for public use on @huggingface.

Key model details:
➤ 128k token context window
➤ Supports reasoning and non-reasoning modes (with '/no_think' settings in the system prompt)
➤ Released under the NVIDIA Open Model License, and not additionally covered by Meta's Llama license like prior Nemotron models - this means there is no limitation on use by large companies or requirement to keep 'Nemotron' in the name of derivative models
➤ No serverless inference providers are yet serving the model, but it is available now on Hugging Face for local inference or self-deployment

See below for our full analysis and key announcement links from NVIDIA 👇
21 replies · 58 reposts · 528 likes · 69.4K views
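Based on the '/no_think' convention mentioned above, toggling the non-reasoning mode presumably looks something like the sketch below; the Hugging Face repo id and the exact system-prompt string are taken from the post and should be verified against the model card.

```python
# Hedged sketch: selecting non-reasoning mode via the system prompt, per the post above.
# The repo id and the "/no_think" convention are assumptions to confirm on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed Hugging Face repo name
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

messages = [
    {"role": "system", "content": "/no_think"},  # reasoning off per the post; omit the toggle to keep reasoning on
    {"role": "user", "content": "Summarize this release in one sentence."},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```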
Yoshi Suhara @suhara
RT @shizhediao: ✨ Alongside NVIDIA-Nemotron-Nano-v2-9B, we’re also open-sourcing its pre-training dataset. At NVIDIA, we remain committed…
0 replies · 1 repost · 0 likes · 47 views
Yoshi Suhara retweeted
NVIDIA AI Developer @NVIDIAAIDev
We're excited to share leaderboard-topping 🏆 NVIDIA Nemotron Nano 2, a groundbreaking 9B parameter open, multilingual reasoning model that's redefining efficiency in AI and earned the leading spot on the @ArtificialAnlys Intelligence Index leaderboard among open models within the same parameter range.

It's built on a unique hybrid Transformer-Mamba architecture, a combination that delivers the same accuracy you expect, but with higher throughput. This enables it to achieve high performance per cost, making it perfect for real-world applications like customer service agents and chatbots.

🏗️ Hybrid Architecture: By combining the strengths of the Transformer and Mamba architectures, it achieves up to 6X faster throughput compared to other 8B open models, with the highest reasoning accuracy.
🏦 Thinking Budget: Reduces unnecessary token generation to cut costs by up to 60%, making it an ideal solution for balancing performance and total cost of ownership (TCO).
🔢 Open Datasets: The training datasets of this model are fully open, giving maximum transparency in using the model for enterprise applications.

🤗 Technical details on @HuggingFace ➡️ nvda.ws/3JfcKST
🏆 Leaderboard ➡️ nvda.ws/47B7iUh
7 replies · 30 reposts · 141 likes · 8.1K views
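Conceptually, the "thinking budget" above is a cap on reasoning tokens at inference time. A rough, generic illustration follows; it assumes a `<think>...</think>` reasoning convention and is not the model's actual budget-control interface, which is documented on its model card.

```python
# Generic illustration of a runtime thinking budget: cap the reasoning tokens,
# then force the reasoning section closed so the model produces its final answer.
# Assumes a <think>...</think> convention; NOT the model's exact interface.
def generate_with_thinking_budget(model, tok, prompt, budget=256, answer_tokens=256):
    # Phase 1: let the model reason, but spend at most `budget` new tokens.
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    ids = model.generate(ids, max_new_tokens=budget)
    text = tok.decode(ids[0])

    # Phase 2: if the budget ran out before the reasoning closed, close it manually.
    if "</think>" not in text:
        text += "\n</think>\n"

    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=answer_tokens)
    return tok.decode(out[0])
```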
Yoshi Suhara retweeted
Oleksii Kuchaiev @kuchaev
We are excited to release Nvidia-Nemotron-Nano-V2 model! This is a 9B hybrid SSM model with open base model and training data. This model also supports runtime "thinking" budget control. HF collection with base and post trained models: huggingface.co/collections/nv…
9 replies · 63 reposts · 297 likes · 65.4K views
Yoshi Suhara retweeted
Pavlo Molchanov @PavloMolchanov
📢 New efficient Hybrid-SLM from NVIDIA: NVIDIA-Nemotron-Nano-v2-9B
❗️6x faster than Qwen3-8B thanks to the hybrid (Mamba2 + Attention) design.

We tried something new: pretrain & align a 12B reasoning model → compress to 9B. First real stab at reasoning-model compression.

Key takeaways from compression:
▪️ Target was 23GB GPUs + room for a 650M vision encoder → design via compression, not a bespoke architecture.
▪️ Distillation loss went down, but benchmarks didn't - unlike base-model compression.
▪️ Reasoning compression needs light post-training alignment.
▪️ Applied both Minitron + Puzzle.
▪️ Dropped 2 attention layers to hit 128k context; KV cache dominated.
▪️ Depth: 62 → 56 (fewer tanked accuracy).
▪️ FFN: 20,480 → 15,680 (−23%).
▪️ Hidden dimension: 5120 → 4480.
▪️ Mamba heads → small gains, <15%, mostly avoided.

Distilled on 136B tokens, context grown 8k → 262k. This is required to preserve the 128k long-context capability.

📰 Report: research.nvidia.com/labs/adlr/file…
🤗 HF: huggingface.co/collections/nv…
1 reply · 16 reposts · 82 likes · 6.2K views
Yoshi Suhara @suhara
A new video game benchmark for LLM agents, designed across various game titles! Happy to be part of this wonderful collaboration with @dongmin_park11 and the amazing team @Krafton_AI!
Dongmin Park @dongmin_park11

🚨New Paper Alert As a game company, @Krafton_AI is actively exploring how to apply LLM agents to video games. We present Orak—a foundational video gaming benchmark for LLM agents! Includes Pokémon, StarCraft II, Slay the Spire, Darkest Dungeon, Ace Attorney, and more in🧵

0 replies · 0 reposts · 5 likes · 744 views
Yoshi Suhara retweeted
Oleksii Kuchaiev @kuchaev
NeMo RL is now open source! It replaces NeMo-Aligner and is the toolkit we use to post-train the next generations of our models. Give it a try: github.com/NVIDIA/NeMo-RL
5 replies · 65 reposts · 394 likes · 25K views
Yoshi Suhara retweeted
Shaokun Zhang @ShaokunZhang1
Tool-using LLMs can learn to reason—without reasoning traces. 🔥

We present Nemotron-Research-Tool-N1, a family of tool-using reasoning LLMs trained entirely via rule-based reinforcement learning—no reasoning supervision, no distillation.

📄 Paper: arxiv.org/pdf/2505.00024
💻 Code: github.com/NVlabs/Tool-N1 (Please consider giving us a ⭐️ to stay updated on the upcoming code release!)

🧠 Why this matters:
Existing tool-call models rely heavily on supervised reasoning traces from stronger models—costly, brittle, and often imitative. We ask: can LLMs learn to reason directly from tool success signals?

📦 What we did:
– Train Qwen2.5-7B/14B with a simple binary reward on tool-call correctness + reasoning format in R1 style
– No reasoning traces needed
– Evaluate on BFCL, API-Bank, and ACEBench
– Also study the roles of SFT, RL, and the widely adopted SFT-then-RL recipe in training tool-calling models

📈 Key findings:
– Tool-N1-7B/14B clearly outperform GPT-4o and open baselines on all benchmarks
– The widely adopted SFT+RL paradigm doesn't necessarily lead to better performance than pure RL
– Binary reward > fine-grained reward, especially for real-world queries
– Scaling works: bigger = better gains under our RL setup

🌟 Takeaway: Reasoning doesn't have to be taught. With just a binary signal, LLMs can learn to reason and act. Tool-N1 sets a new direction for scalable, supervision-light tool-calling model training.
2 replies · 94 reposts · 358 likes · 40.4K views
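The binary reward described above is simple enough to sketch. In the spirit of the post, the reward is 1 only when the response both follows the expected reasoning/tool-call format and the parsed call matches ground truth; the tag names and JSON call format below are illustrative assumptions, not the exact Tool-N1 specification.

```python
# Hedged sketch of a rule-based binary reward: 1.0 only if the output has the
# expected <think>/<tool_call> structure AND the call matches the gold answer.
# Tag names and the JSON tool-call schema are illustrative assumptions.
import json
import re

def binary_tool_reward(response: str, gold_call: dict) -> float:
    fmt = re.fullmatch(r"\s*<think>.*?</think>\s*<tool_call>(.*?)</tool_call>\s*",
                       response, flags=re.DOTALL)
    if not fmt:
        return 0.0  # wrong format -> no reward
    try:
        call = json.loads(fmt.group(1))
    except json.JSONDecodeError:
        return 0.0  # unparsable call -> no reward
    # Exact match on tool name and arguments; no partial credit (binary reward).
    ok = call.get("name") == gold_call["name"] and call.get("arguments") == gold_call["arguments"]
    return 1.0 if ok else 0.0
```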