Saurav Muralidharan
@srv_m

Research Scientist @NVIDIA | Making LLMs More Efficient

252 posts
Joined March 2008
247 Following · 193 Followers

Saurav Muralidharan @srv_m
Today we are releasing the first model in the NVIDIA Nemotron 3 family: Nemotron 3 Nano! Nemotron 3 Nano is truly open, efficient, and achieves class-leading accuracy on reasoning and agentic tasks. Check it out today! 🚀 research.nvidia.com/labs/nemotron/…

Saurav Muralidharan retweeted
Pavlo Molchanov @PavloMolchanov
Sharing our team's latest work on Hymba - an efficient small language model with a hybrid architecture.
Tech report: arxiv.org/abs/2411.13676
Discover the tradeoff between Mamba and Attention, how they can be combined, how the attention-sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers. Learn how we built the model with an end-to-end ecosystem: data selection, architecture analysis and design, training the Base and Instruct models, and opening them to the community.
Did I mention that our Hymba-1.5B Base model outperforms LLaMA 3.2-3B while being trained on 7× fewer tokens and achieving 12× higher throughput?
More details and model links coming soon!
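
The released Hymba code isn't shown here, so below is only a toy PyTorch sketch of the parallel attention-plus-SSM idea the tweet describes. The class name ToyHybridBlock and the gated linear recurrence standing in for a real Mamba head are illustrative assumptions; meta tokens, sliding-window attention, and cross-layer KV sharing are omitted.

```python
# Toy sketch of a "hybrid head" block in the spirit of Hymba (not the released code):
# an attention branch and an SSM-style branch process the same input in parallel and
# their normalized outputs are averaged. The recurrent branch is a simple gated
# linear recurrence used only as a stand-in for a real Mamba head.
import torch
import torch.nn as nn

class ToyHybridBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # value + gate for the recurrent branch
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay (as a logit)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ssm = nn.LayerNorm(d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention branch (causal mask omitted for brevity).
        attn_out, _ = self.attn(x, x, x, need_weights=False)

        # Recurrent branch: h_t = a * h_{t-1} + (1 - a) * v_t, gated elementwise.
        v, g = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                    # decay in (0, 1)
        h, state = [], torch.zeros_like(v[:, 0])
        for t in range(v.shape[1]):                      # O(T) scan instead of O(T^2) attention
            state = a * state + (1 - a) * v[:, t]
            h.append(state)
        ssm_out = torch.sigmoid(g) * torch.stack(h, dim=1)

        # Fuse the two branches and apply a residual connection.
        fused = 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))
        return x + self.out_proj(fused)
```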

Saurav Muralidharan retweeted
Pavlo Molchanov @PavloMolchanov
We are hiring researchers working on LLM and VLM efficiency! Applications are open for PhD students graduating in 2025 and for senior researchers with a PhD. Check the requirements for each position.
Apply here: nvidia.wd5.myworkdayjobs.com/NVIDIAExternal…
Senior researchers: nvidia.wd5.myworkdayjobs.com/NVIDIAExternal…
Check our team's webpage, nv-dler.github.io, to learn more about the work we are doing.

Saurav Muralidharan retweeted
Pavlo Molchanov @PavloMolchanov
🚀 @NeurIPSConf Spotlight! 🥳 Imagine fine-tuning an LLM with just a sparsity mask! In our latest work, we freeze the LLM and use 2:4 structured sparsity to learn binary masks for each linear layer. Thanks to NVIDIA Ampere's 2:4 sparsity, we can achieve up to 2x compute acceleration for downstream tasks! 💥
📄 Paper: arxiv.org/abs/2409.17481
🔗 Code: github.com/NVlabs/MaskLLM
🌐 Webpage: vainf.github.io/maskllm-projec…
🔑 Key insights:
- The Gumbel-Softmax trick enables differentiable binary mask training.
- Learnable sparsity scales effectively to large datasets and can fully leverage computational resources to learn precise masks through end-to-end training.
- Using mask priors (e.g., SparseGPT, Magnitude) boosts efficiency and quality.
- Annealing the stochastic sampling is crucial for effective mask learning.
- Maximizing the resulting weight magnitudes improves downstream task performance.
🏆 Results:
- An effective 1.4x GPU speedup and 73% memory reduction with near-lossless performance!
- MaskLLM outperforms one-shot techniques with just 1280 samples.
- Our method improves with more data, unlike previous one-shot approaches!
Kudos to Gongfan Fang (NUS) for a great internship! Together with @yin_hongxu @jankautz @srv_m @gLeHeinrich @XinchaoWang3 and Jeff Pool.
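
As a rough illustration of the mask-learning idea (not the NVlabs/MaskLLM code itself), here is a minimal PyTorch sketch: each group of four weights carries logits over the six valid 2:4 patterns, and Gumbel-Softmax keeps the pattern choice differentiable while the pretrained weights stay frozen. The class name Learnable24Mask and the hyperparameters are assumptions for illustration.

```python
# Hedged sketch of learnable 2:4 sparsity masks: per group of 4 weights, keep exactly 2.
import itertools
import torch
import torch.nn.functional as F

class Learnable24Mask(torch.nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        assert weight.numel() % 4 == 0
        self.register_buffer("weight", weight.detach().clone())   # frozen pretrained weight
        # All C(4,2) = 6 binary patterns with exactly two ones per group of four weights.
        patterns = [p for p in itertools.product([0.0, 1.0], repeat=4) if sum(p) == 2.0]
        self.register_buffer("patterns", torch.tensor(patterns))  # shape (6, 4)
        self.logits = torch.nn.Parameter(torch.zeros(weight.numel() // 4, 6))

    def masked_weight(self, tau: float = 1.0, hard: bool = True) -> torch.Tensor:
        # Sample one of the 6 patterns per group (straight-through estimator when hard=True).
        choice = F.gumbel_softmax(self.logits, tau=tau, hard=hard)  # (n_groups, 6)
        mask = choice @ self.patterns                               # (n_groups, 4)
        return self.weight * mask.reshape(self.weight.shape)

# Usage idea: substitute masked_weight() for a frozen linear layer's weight in its forward
# pass and backprop the task loss; tau is annealed so the masks become hard 2:4 masks.
```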

Saurav Muralidharan retweeted
NVIDIA AI Developer @NVIDIAAIDev
👀 Experience the high-efficiency NVIDIA Llama-3.1-Nemotron-51B - a NAS-optimized model that achieves 2x throughput while preserving accuracy and runs on a single H100 GPU.
✨ Try out the Llama-3.1-Nemotron-51B NIM through the API at ai.nvidia.com or download it from @huggingface.
Technical deep dive ➡️ nvda.ws/47AI6ve

Saurav Muralidharan retweeted
Wei Ping @_weiping
Introducing NVLM 1.0, a family of frontier-class multimodal LLMs that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training! We are working towards releasing the model weights very soon 🤗 and will open-source the training code for the community. For further details, please visit our project website: nvlm-project.github.io

Saurav Muralidharan retweeted
NVIDIA AI Developer @NVIDIAAIDev
Today we released Mistral-NeMo-Minitron 8B, a pruned and distilled version of the open @MistralAI NeMo 12B model, achieving high accuracy across nine popular benchmarks for chatbots, virtual assistants, content generation, coding, and educational tools. ➡️ nvda.ws/4cz17Pl ✨ Experience now on the NVIDIA API catalog or download from @HuggingFace.

Saurav Muralidharan retweeted
Pavlo Molchanov @PavloMolchanov
🌟 The best 8B Base model via pruning and distillation! 🚀 Introducing the Mistral-NeMo-Minitron-8B-Base model, derived from the recent Mistral-NeMo-12B. Our recipe: fine-tune the teacher on 100B tokens, prune to 8B parameters, then run teacher-student distillation on <400B tokens. Result: the best Base model in the 8B range.
📚 Technical report: d1qx31qr3h6wln.cloudfront.net/publications/m…
📝 Blog post: developer.nvidia.com/blog/mistral-n…
🤗 HF model checkpoint: huggingface.co/nvidia/Mistral…
📌 Key benefits:
• Significant training reduction: only 400B tokens
• Outperforms Mistral-7B and LLaMa-3.1-8B on 8/9 benchmarks
• Enhanced model accuracy via distillation
• Ready for commercial use with a permissive license
⚙️ Model architecture:
• 8.4B total parameters
• 7.3B active non-embedding parameters
• 131k vocab size
• Hidden size: 4096
• Depth: 40 layers
• MLP hidden size: 11520
• Query heads: 32
• Head dimension: 128
• Attention groups: 8
👐 Permissive license!
🔑 Key learnings:
• Fine-tune the teacher with 100B tokens before distillation when the original training data isn't available.
• Teacher fine-tuning can run in parallel with distillation.
• Pruning focuses on the MLP and embedding dimensions, leaving the Attention layers untouched.
• Handle embedding-dimension pruning carefully, especially on the LayerNorm side (details in the paper).
• Some benchmarks even show improvements over the original teacher model!
👥 Contributors:
Foundational model: Sharath Turuvekere Sreenivas*, Saurav Muralidharan*, Raviraj Joshi, Marcin Chochowski, Pavlo Molchanov, Mostofa Patwary, Daniel Korzekwa, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro and Jan Kautz
Alignment: Ameya Sunil Mahabaleshwarkar, Hayley Ross, Brandon Rowlett, Oluwatobi Olabiyi, Shizhe Diao and Yoshi Suhara
Datasets: Sanjeev Satheesh, Jupinder Parmar, Shengyang Sun, Jiaqi Zeng, Zhilin Wang, Yi Dong, Zihan Liu, Rajarshi Roy, Wei Ping, Makesh Narsimhan Sreedhar and Oleksii Kuchaiev
TensorRT-LLM: Bobby Chen, James Shen and Chenhan Yu
Hugging Face support: Ao Tang, Yoshi Suhara and Greg Heinrich
* Equal contribution.
#NVIDIAAI #NVIDIA #NVIDIAResearch @NVIDIAAI
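
A minimal sketch of the teacher-student distillation step described above, assuming plain logit-level KL distillation in PyTorch; the function names, the simple training loop, and the Hugging Face-style model wrappers are illustrative assumptions, not the Megatron-LM pipeline actually used for these models.

```python
# Hedged sketch: the pruned student mimics the fine-tuned teacher's token-level
# distributions on the distillation corpus.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL(teacher || student) over the vocabulary."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    # F.kl_div takes log-probs as input and, with log_target=True, log-probs as target.
    return F.kl_div(s, t, reduction="batchmean", log_target=True) * temperature ** 2

def train_step(student, teacher, input_ids, optimizer):
    # Assumes Hugging Face-style causal LM wrappers that return .logits.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```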

Mervin Praison @MervinPraison
NVIDIA Llama 3.1 Minitron 4B: created from Llama 3.1 8B. Here is how:
🚀 40x fewer tokens
💰 1.8x cost savings
📈 16% performance boost
🧠 4 billion parameters
⚖️ On par with 8B models
🔄 Pruning & distillation
⚡ Efficient AI model creation
🛠️ Less training data needed
nvda.ws/3WM4OeR
@NVIDIAAI @nvidia @AIatMeta @PavloMolchanov @Ahmad_Al_Dahle @darrinpjohnson @NVIDIAAIDev
Sub: youtube.com/@MervinPraison

Saurav Muralidharan retweeted
NVIDIA AI Developer @NVIDIAAIDev
See how our #NVIDIAResearch team has developed a method to efficiently create smaller, accurate language models using structured weight pruning and knowledge distillation - offering several advantages for developers:
✅ 16% better performance on MMLU scores
✅ 40x fewer tokens needed for training new models
✅ Up to 1.8x cost savings for training a family of models
🦙 The effectiveness of these strategies is demonstrated with the @AIatMeta Llama 3.1 8B model, which was refined into Llama-3.1-Minitron 4B. This new model will soon be available in the NVIDIA @huggingface collection.
➡️ Technical deep dive: developer.nvidia.com/blog/how-to-pr…
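
To make the structured-pruning half of that recipe concrete, here is a hedged sketch of activation-based importance scoring for MLP neurons, in the spirit of the approach described in the blog post rather than its exact code; the function name prune_mlp_neurons, the SiLU activation, and the calibration-batch format are illustrative assumptions, and the real recipe also covers gated MLPs, attention heads, and embedding channels.

```python
# Hedged sketch: score each MLP hidden neuron by its mean activation magnitude on a
# small calibration set, then keep only the highest-scoring neurons.
import torch
import torch.nn.functional as F

@torch.no_grad()
def prune_mlp_neurons(up_proj, down_proj, calib_batches, keep: int):
    """up_proj: (d_ff, d_model) weight; down_proj: (d_model, d_ff) weight.
    calib_batches: iterable of (tokens, d_model) hidden-state tensors."""
    scores = torch.zeros(up_proj.shape[0])
    for x in calib_batches:
        h = F.silu(x @ up_proj.T)            # (tokens, d_ff) neuron activations
        scores += h.abs().mean(dim=0)        # accumulate per-neuron importance
    keep_idx = scores.topk(keep).indices.sort().values
    # Slice both projections consistently so the smaller layer stays functional.
    return up_proj[keep_idx, :], down_proj[:, keep_idx]
```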

Saurav Muralidharan retweeted
Pavlo Molchanov @PavloMolchanov
🚀 We've pruned LLaMa3.1 down to 4B parameters, delivering a smaller and more efficient model! Based on our recent paper: arxiv.org/abs/2407.14679
📖 Learn all about it in our blog: developer.nvidia.com/blog/how-to-pr…
🔗 META's announcement: ai.meta.com/blog/nvidia-ll…
👐 Checkpoints on HF this week: huggingface.co/collections/nv…
Key learnings:
🌟 Width pruning delivers better accuracy, with MMLU at 60.5%, while depth pruning yields 58.7%.
🧠 Reasoning ability is more impacted: GSM8K accuracy is 41.24% for width pruning and 16.8% for depth pruning, consistent with findings from our recent depth-pruning paper (arxiv.org/abs/2407.16286).
⚡ Depth pruning boosts throughput, achieving a ~2.7x speedup over LLaMa-3.1-8B, while width pruning provides a ~1.7x speedup.
#NVIDIAAI #META #llama

Saurav Muralidharan @srv_m
@cataluna84 Hi, we do have plans to release the code, but the timeline is a bit unclear due to the legal approvals we need to obtain. In the next few weeks, hopefully!

Mayank Bhaskar @cataluna84
@srv_m Do you plan to release the full pruning & distillation code along with evaluation & benchmarks, so that we can try this method on new models?

Saurav Muralidharan @srv_m
🤖 Excited to announce Minitron, a new family of language models obtained through a combination of weight pruning and knowledge distillation! Our models are available on HF with a permissive license. Give them a try today!
Pavlo Molchanov @PavloMolchanov

🚀 40x Faster Model Training via Pruning and Distillation! Permissive Minitron-4B and Minitron-8B models!
🔗 Paper: arxiv.org/abs/2407.14679
🔗 GitHub: github.com/NVlabs/Minitron
🔗 Models on HF: bit.ly/4ffjnQj
Key highlights of the 4B/8B models:
📊 2.6B/6.2B active non-embedding parameters
⚡ Squared ReLU activation in the MLP – welcome back, sparsity!
🗜️ Grouped Query Attention with 24/48 heads and 8 queries
🌐 256K vocab size for multilingual support
🔒 Hidden size: 3072/4096
🔧 MLP hidden size: 9216/16384
📈 32 layers
👐 Permissive license!
Details below 🧵
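
For the "squared ReLU" highlight in the quoted list, a minimal PyTorch sketch of what such an MLP block looks like; the class name is illustrative, the dimensions are the 4B numbers quoted above, and this is not the released model code.

```python
# Hedged sketch: a gate-free MLP whose activation is relu(x)**2, which produces exact
# zeros and hence activation sparsity.
import torch
import torch.nn as nn

class SquaredReLUMLP(nn.Module):
    def __init__(self, d_model: int = 3072, d_ff: int = 9216):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x))
        return self.down(h * h)   # squared ReLU: relu(x)^2
```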


Saurav Muralidharan retweeted
Pavlo Molchanov @PavloMolchanov
🚀 Introducing Flextron - a many-in-one LLM - Oral at ICML! Train one model and get many models, each optimal for a given GPU at inference, without any additional retraining. 🌟
🔗 Paper: arxiv.org/abs/2406.10260
Main benefits with only 5% post-training finetuning:
✅ The best model for every GPU (small & large) without retraining
✅ Change inference cost on the fly based on load
✅ Input-adaptive inference (heterogeneous weight-shared MoE, Attention)
✅ Instead of training many models, we train only one: LLaMa2-7B ➡️ 3B, 4B, 5B, 6B, etc.
Method and observations in the thread. 🧵👇
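
A hedged, toy illustration of the elastic-layer idea behind such a many-in-one model (not Flextron's actual formulation or API): a single weight matrix whose importance-ordered neuron prefixes can be sliced to different widths at inference. The class name ElasticLinear, the fixed width choices, and the example dimensions are all assumptions for illustration.

```python
# Toy sketch: one checkpoint serves several widths by slicing a prefix of its neurons.
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    def __init__(self, d_in: int, d_out_full: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out_full, d_in) * d_in ** -0.5)

    def forward(self, x: torch.Tensor, width: int) -> torch.Tensor:
        # Use only the first `width` output neurons; during training the neurons are
        # assumed to be ordered by importance so any prefix is a usable sub-layer.
        return x @ self.weight[:width].T

# At deployment, a router (or a fixed per-GPU config) picks `width` per layer,
# trading accuracy for latency without retraining or storing extra checkpoints.
layer = ElasticLinear(4096, 11008)
x = torch.randn(2, 16, 4096)
small_out = layer(x, width=4096)    # budget configuration for a small GPU
large_out = layer(x, width=11008)   # full-capacity configuration
```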

Saurav Muralidharan @srv_m
Check out our latest work, Flextron, an elastic LLM that supports zero-shot flexible deployment at a variety of model scales and sizes. Flextron models achieve SoTA performance and are also input-adaptive (heterogeneous MoE).
Ruisi Cai @ccccrs_0908

Tired of training varying-size LLMs to fit various GPU memory and latency requirements? Check out Flextron! Our new ICML (Oral) paper shows how to train one model deployable across GPU series. Learn more: cairuisi.github.io/Flextron/🚀


Saurav Muralidharan retweeted
Corey Lynch @coreylynch
We are now having full conversations with Figure 01, thanks to our partnership with OpenAI. Our robot can: - describe its visual experience - plan future actions - reflect on its memory - explain its reasoning verbally Technical deep-dive 🧵: