
Saurav Muralidharan
@srv_m
Research Scientist @NVIDIA | Making LLMs More Efficient

🚀 40x Faster Model Training via Pruning and Distillation! Permissively licensed Minitron-4B and Minitron-8B models!

🔗 Paper: arxiv.org/abs/2407.14679
🔗 GitHub: github.com/NVlabs/Minitron
🔗 Models on HF: bit.ly/4ffjnQj

Key highlights of the 4B/8B models:
📊 2.6B/6.2B active non-embedding parameters
⚡ Squared ReLU activation in the MLP – welcome back, sparsity!
🗜️ Grouped Query Attention with 24/48 query heads and 8 KV heads
🌐 256K vocab size for multilingual support
🔒 Hidden size: 3072/4096
🔧 MLP hidden size: 9216/16384
📈 32 layers
👐 Permissive license!

Details below 🧵
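
For the curious, here is a minimal PyTorch sketch of what a squared-ReLU MLP block looks like at the 4B dimensions quoted above. The class and field names, and the bias-free, gateless layout, are my assumptions for illustration, not the released Minitron code:

```python
import torch
import torch.nn as nn

class SquaredReLUMLP(nn.Module):
    """Illustrative MLP block with a squared-ReLU activation.

    Dimensions follow the Minitron-4B numbers from the post
    (hidden size 3072, MLP hidden size 9216); everything else
    here is a sketch, not NVIDIA's implementation.
    """

    def __init__(self, hidden_size: int = 3072, mlp_hidden_size: int = 9216):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, mlp_hidden_size, bias=False)
        self.down_proj = nn.Linear(mlp_hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squared ReLU: relu(x)**2. Exactly zero below the threshold,
        # which is where the "welcome back, sparsity!" comes from.
        h = torch.relu(self.up_proj(x)).square()
        return self.down_proj(h)

mlp = SquaredReLUMLP()
out = mlp(torch.randn(1, 16, 3072))  # (batch, seq_len, hidden)
print(out.shape)                     # torch.Size([1, 16, 3072])
```

Because squared ReLU zeroes out sub-threshold activations entirely (unlike GELU or SiLU), the intermediate activations are sparse, which inference kernels can exploit.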

Tired of training LLMs at multiple sizes to fit different GPU memory and latency requirements? Check out Flextron! Our new ICML (Oral) paper shows how to train a single model that can be deployed across a range of GPUs. Learn more: cairuisi.github.io/Flextron/ 🚀
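
To give a flavor of the "one model, many deployed sizes" idea, here is a toy PyTorch sketch of elastic width slicing, where a narrower sub-network is carved out of full-width weights at inference time. This is only a generic illustration of elastic inference, not Flextron's actual training or routing recipe, and all names and dimensions are made up:

```python
import torch
import torch.nn as nn

class ElasticMLP(nn.Module):
    """Toy elastic MLP: train full-width weights once, then pick a
    narrower sub-network at deployment by slicing the leading
    rows/columns of the projections."""

    def __init__(self, hidden: int = 1024, mlp_hidden: int = 4096):
        super().__init__()
        self.up = nn.Linear(hidden, mlp_hidden, bias=False)
        self.down = nn.Linear(mlp_hidden, hidden, bias=False)

    def forward(self, x: torch.Tensor, width_frac: float = 1.0) -> torch.Tensor:
        # Choose how much of the intermediate width to use, e.g.
        # 0.25 on a memory-constrained GPU, 1.0 on a large one.
        k = int(self.up.out_features * width_frac)
        h = torch.relu(x @ self.up.weight[:k].T)  # slice rows of up-proj
        return h @ self.down.weight[:, :k].T      # slice cols of down-proj

mlp = ElasticMLP()
x = torch.randn(1, 8, 1024)
full = mlp(x, width_frac=1.0)    # full model
small = mlp(x, width_frac=0.25)  # same weights, 4x narrower MLP
print(full.shape, small.shape)   # both torch.Size([1, 8, 1024])
```

The point of the sketch: both calls share one set of trained weights, so a single checkpoint can serve deployment targets with very different memory and latency budgets.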
