Yonggan Fu

6 posts

@YongganFu

Research Scientist @ NVIDIA Research; PhD @ Georgia Institute of Technology

Santa Clara, California · Joined October 2024
89 Following · 73 Followers
Pinned Tweet
Yonggan Fu @YongganFu ·
👀 Your small LMs (SLMs) are… not that fast?
🚀 At NVIDIA Research, we release 𝐍𝐞𝐦𝐨𝐭𝐫𝐨𝐧-𝐅𝐥𝐚𝐬𝐡 (NeurIPS 2025), a hybrid SLM family designed around real-world latency and trained from scratch at 1B/3B sizes, achieving SOTA accuracy, latency, and throughput.
🌟 𝐍𝐞𝐦𝐨𝐭𝐫𝐨𝐧-𝐅𝐥𝐚𝐬𝐡 𝐡𝐚𝐬 𝐛𝐞𝐞𝐧 𝐢𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐞𝐝 𝐢𝐧𝐭𝐨 𝐓𝐑𝐓𝐋𝐋𝐌 𝐟𝐨𝐫 𝐩𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧-𝐠𝐫𝐚𝐝𝐞 𝐢𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞 with up to 41K tokens/second on a single H100 GPU! Try it following the instructions in our HF repo.
Will share more details at NeurIPS’25 (poster on Thursday, 11am–2pm)!
𝐏𝐚𝐩𝐞𝐫 𝐋𝐢𝐧𝐤: arxiv.org/pdf/2511.18890
🤗 𝐇𝐅 𝐦𝐨𝐝𝐞𝐥𝐬:
Nemotron-Flash-1B: huggingface.co/nvidia/Nemotro…
Nemotron-Flash-3B: huggingface.co/nvidia/Nemotro…
Nemotron-Flash-3B-Instruct: huggingface.co/nvidia/Nemotro…
2 replies · 20 reposts · 48 likes · 16.3K views
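The pinned tweet above points to the HF repo for usage instructions. As an illustrative sketch only: the snippet below assumes the Nemotron-Flash-3B-Instruct checkpoint follows the standard Hugging Face `transformers` causal-LM interface and that the repo id is `nvidia/Nemotron-Flash-3B-Instruct` (the links in the tweet are truncated); the model card is the authoritative reference.

```python
# Hypothetical usage sketch: assumes the model exposes the standard Hugging Face
# causal-LM interface with a custom (trust_remote_code) hybrid architecture.
# See the model card on HF for the authoritative instructions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Flash-3B-Instruct"  # repo id assumed from the truncated links above

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # small model, fits comfortably on one GPU
    trust_remote_code=True,       # custom hybrid (attention + SSM) layers
).to("cuda")

# Chat-style prompt via the tokenizer's chat template (if the repo provides one).
messages = [{"role": "user", "content": "Summarize what a hybrid SLM is in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For the latency and throughput numbers quoted in the tweet (up to 41K tokens/second on a single H100), the TRTLLM integration rather than plain `transformers` is the intended serving path.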
Kaiyue Wen @wen_kaiyue ·
(1/n) Introducing Hyperball — an optimizer wrapper that keeps weight & update norm constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer.
27 replies · 121 reposts · 687 likes · 197.5K views
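To make the mechanism in the Hyperball tweet concrete, here is a minimal conceptual sketch, not the released Hyperball code: it wraps an inner torch optimizer, rescales each proposed update so its size is a fixed fraction (the angular step) of the weight norm, and projects the weight back onto its original norm sphere so the weight norm stays constant. All names and details below are illustrative assumptions.

```python
# Conceptual sketch only -- NOT the Hyperball implementation. It illustrates the
# idea stated in the tweet: keep each weight's norm fixed and control the update
# size as an angular step relative to that norm.
import torch

class AngularStepWrapper:
    """Wraps a torch optimizer; class name and behavior are illustrative assumptions."""

    def __init__(self, inner_optimizer, angular_step=0.02):
        self.inner = inner_optimizer
        self.angular_step = angular_step
        # Remember each parameter's initial norm so it can be kept constant.
        self.target_norms = {
            id(p): p.detach().norm()
            for group in inner_optimizer.param_groups
            for p in group["params"]
        }

    @torch.no_grad()
    def step(self):
        # Snapshot weights, then let the inner optimizer propose its update.
        before = {id(p): p.detach().clone()
                  for g in self.inner.param_groups for p in g["params"]}
        self.inner.step()

        for g in self.inner.param_groups:
            for p in g["params"]:
                old = before[id(p)]
                update = p.detach() - old
                if update.norm() == 0:
                    continue
                # Rescale the update so its norm is angular_step * ||w||,
                # then renormalize the weight back to its original norm.
                target = self.target_norms[id(p)]
                stepped = old + update * (self.angular_step * target / update.norm())
                p.copy_(stepped * (target / stepped.norm()))

    def zero_grad(self, set_to_none=True):
        self.inner.zero_grad(set_to_none=set_to_none)
```

Because the effective step is expressed as an angle rather than an absolute learning rate, this kind of wrapper is the sort of construction that could make step-size choices transfer across model scales, which is the hyperparameter-transfer claim in the tweet.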
Yonggan Fu retweeted
Shizhe Diao @shizhediao ·
🚀 Excited to share ToolOrchestra, an end-to-end RL training framework for orchestrating tools and agentic workflows.
Everyone’s building agent workflows these days — connecting tools, APIs, and LLMs like LEGO. 🧩
But here are our findings:
👉 Just prompting the agent workflow won’t cut it. It’s not how you build the best agent.
👉 Without learning, workflows plateau fast. It’s time to bring RL fine-tuning 🔥 back into agent development.
(1/n)
29 replies · 71 reposts · 348 likes · 67.7K views
Yonggan Fu retweeted
Pavlo Molchanov @PavloMolchanov ·
🚀 Introducing Hymba-1.5B: a new hybrid architecture for efficient small language models!
✅ Outperforms Llama, Qwen, and SmolLM2 with 6-12x less training
✅ Massive reductions in KV cache size & good throughput boost
✅ Combines Mamba & Attention in a Hybrid Parallel Architecture
✅ Base and Instruct with open license on HF
🤗 HF: tinyurl.com/hymba1-5b-hf
📚 Arxiv: arxiv.org/abs/2411.13676
🐙 GitHub: github.com/NVlabs/hymba
Long post with analysis and insights
4 replies · 57 reposts · 244 likes · 52.7K views
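The "Hybrid Parallel Architecture" bullet in the Hymba tweet above refers to running attention and Mamba-style branches side by side. The sketch below is only an illustration of that general idea, not the Hymba code: the same input feeds an attention branch and a simplified gated linear recurrence (standing in for Mamba, since a real selective-SSM kernel is out of scope here), and the two outputs are fused.

```python
# Illustrative sketch of a parallel attention + SSM-style block, assuming the
# high-level idea from the tweet. The recurrent branch is a plain gated linear
# recurrence used as a stand-in for Mamba; this is not the Hymba implementation.
import torch
import torch.nn as nn

class HybridParallelBlock(nn.Module):
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Simplified recurrent branch: per-channel decay plus an input projection.
        self.decay = nn.Parameter(torch.rand(dim))
        self.in_proj = nn.Linear(dim, dim)
        self.mix = nn.Linear(2 * dim, dim)   # fuse the two parallel branches

    def forward(self, x):                    # x: (batch, seq, dim)
        # Attention branch (causal mask omitted for brevity).
        attn_out, _ = self.attn(x, x, x, need_weights=False)

        # SSM-style branch: h_t = a * h_{t-1} + (1 - a) * proj(x_t).
        a = torch.sigmoid(self.decay)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.shape[1]):
            h = a * h + (1 - a) * u[:, t]
            states.append(h)
        ssm_out = torch.stack(states, dim=1)

        # Parallel fusion: concatenate branch outputs, project back, add residual.
        return x + self.mix(torch.cat([attn_out, ssm_out], dim=-1))
```

The attention branch gives precise token-to-token recall while the recurrent branch carries a compressed summary of the history at constant state size, which is the trade-off the Hymba work explores.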
Yonggan Fu retweeted
Pavlo Molchanov @PavloMolchanov ·
Sharing our team’s latest work on Hymba - an efficient small language model with a hybrid architecture.
Tech report: arxiv.org/abs/2411.13676
Discover the tradeoff between Mamba and Attention, how they can be combined, how the attention-sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers.
Learn how we built the model with an end-to-end ecosystem: data selection, architecture analysis and design, training Base and Instruct models, and opening them to the community.
Did I mention that our Hymba-1.5B Base model outperforms LLaMA 3.2-3B while being trained on 7× fewer tokens and achieving 12× higher throughput?
More details and model links coming soon!
10 replies · 90 reposts · 494 likes · 97.6K views
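The tech-report tweet also mentions sharing the KV cache across layers. As a rough illustration of what cross-layer KV sharing can look like (the exact Hymba scheme is in the report, and everything below, including the grouping pattern, is an assumption): only designated "producer" layers compute fresh keys and values, and the following layers reuse that cache with their own queries, cutting KV memory roughly in proportion to the group size.

```python
# Sketch of cross-layer KV sharing, assuming the high-level idea from the tweet.
# This is an illustration, not the Hymba implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    def __init__(self, dim, n_heads, produces_kv):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.produces_kv = produces_kv
        if produces_kv:                      # only "producer" layers own K/V projections
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q = split(self.q_proj(x))
        if self.produces_kv:
            shared_kv = (split(self.k_proj(x)), split(self.v_proj(x)))
        k, v = shared_kv                     # non-producer layers reuse the cached K/V
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out), shared_kv

# Usage: every other layer recomputes K/V; the layer after it reuses the same
# cache, so KV memory is shared between adjacent layers.
layers = [SharedKVAttention(256, 8, produces_kv=(i % 2 == 0)) for i in range(4)]
x, kv = torch.randn(2, 16, 256), None
for layer in layers:
    x, kv = layer(x, kv)
```

Combined with the hybrid attention/SSM layers, this kind of sharing is one way a 1.5B-parameter model can keep its KV cache small enough to reach the throughput numbers quoted in the tweet.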