Itay Levy

70 posts

@itayoush

Deep Learning Researcher @NVIDIA

Israel · Joined September 2020
3K Following · 313 Followers
Itay Levy retweeted
Jian Zhang@JianZhangCS·
Nemotron 3 Super is live! So far the most intelligent agentic reasoning model in the Nemotron family, with world-leading efficiency and openness. Super in particular marks our first infra & research milestone in scaling up agentic reinforcement learning. Stay tuned for more infra, data, and agentic generalization research we will open to the ecosystem.
🤗 Hugging Face: lnkd.in/gWfamwwX
📜 Tech Report: lnkd.in/gRFFJxKm
🤸‍♂️ NeMo-Gym (RL env data and orchestration): github.com/NVIDIA-NeMo/Gym
🤸 NeMo-RL (RL training): github.com/NVIDIA-NeMo/RL
Itay Levy retweeted
Ethan He@EthanHe_42·
My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there for seriously training mixture-of-experts models at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. arxiv.org/abs/2603.07685
Itay Levy@itayoush·
For reasoning models, tok/s isn't a sufficient metric because trace length can change, so we also measure request-level efficiency. On the accuracy–speed frontier, 88B outperforms 120B across all reasoning-effort settings, with up to 1.29× higher request rates.
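A minimal sketch of the request-level view described above, with purely illustrative numbers (nothing here is measured or from the paper): two models emitting tokens at the same rate can still differ in requests served per second when their reasoning traces differ in length.

```python
# Illustrative sketch: tokens/s alone can mislead for reasoning models,
# because a model that writes shorter traces completes more requests.
# All numbers below are made up for illustration.

def requests_per_second(tokens_per_second: float, avg_trace_tokens: float) -> float:
    """Request-level throughput = token throughput / tokens per request."""
    return tokens_per_second / avg_trace_tokens

tok_s = 1000.0  # identical token throughput for both hypothetical models
print(requests_per_second(tok_s, avg_trace_tokens=2000.0))  # 0.50 req/s
print(requests_per_second(tok_s, avg_trace_tokens=1550.0))  # ~0.645 req/s, ~1.29x
```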
Itay Levy@itayoush·
🧵 New paper: We compressed OpenAI's gpt-oss-120B into a smaller, faster derivative (gpt-oss-puzzle-88B) with no accuracy loss:
⚡ Up to 1.63× higher token throughput on 8×H100
⚡ Up to 2.82× on a single H100
Itay Levy@itayoush·
Try Llama Nemotron Ultra 253B, the smartest open reasoning model available today!
🏆 Tops scientific reasoning, complex math, and coding benchmarks
⚡️ 4× higher inference throughput than DeepSeek R1
Optimized with neural architecture search and FFN fusion.
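A hedged sketch of the FFN-fusion idea named above, as I read it (an illustration, not the released implementation): consecutive FFN-only blocks that would normally run sequentially are approximated by one parallel residual step, which is equivalent to a single wider FFN and removes the sequential dependencies between them.

```python
# Hedged illustration of FFN fusion: k consecutive FFN-only blocks
#   x -> x + ffn_1(x);  x -> x + ffn_2(x);  ...
# are approximated by one parallel step
#   x -> x + ffn_1(x) + ... + ffn_k(x),
# i.e. a single wider FFN, so the GPU runs one large matmul per
# projection instead of k small, strictly sequential ones.
import torch
import torch.nn as nn

class FusedFFN(nn.Module):
    def __init__(self, ffns: list[nn.Module]):
        super().__init__()
        self.ffns = nn.ModuleList(ffns)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel approximation of the original sequential residual stack.
        return x + sum(ffn(x) for ffn in self.ffns)
```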
Itay Levy@itayoush·
Very excited about the release of the Llama Nemotron Super 49B model 🚀 #GTC25 Using distillation-based NAS (Puzzle), we achieved a 5× throughput gain! After SFT and RL, this model tops reasoning benchmarks among open 70B-class models.
Itay Levy retweeted
The AI Timeline@TheAITimeline·
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Author's Explanation: x.com/itayoush/statu…

Overview: Puzzle is a distillation-based neural architecture search framework that optimizes LLM inference for specific hardware, achieving a 2.17× speedup while retaining 98.4% of the original model's capabilities via blockwise local knowledge distillation and mixed-integer programming. Applied to produce Nemotron-51B, it lets a single NVIDIA H100 GPU handle large batch sizes, and it needs only 45B training tokens versus the 15T used for the original model, prioritizing inference performance over parameter count in model selection.

Paper: arxiv.org/abs/2411.19146
Itay Levy@itayoush

Introducing Puzzle: Distillation-Based NAS for Inference-Optimized LLMs 🔗 arxiv.org/abs/2411.19146 🧵
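A toy sketch of the mixed-integer selection step the overview mentions (my reconstruction; the scores, latencies, and brute-force search below are illustrative stand-ins for Puzzle's distillation-derived quality scores, hardware profiling, and an actual MIP solver): pick one block variant per layer so total quality is maximized under a latency budget.

```python
# Toy stand-in for Puzzle's mixed-integer block selection: choose one
# variant per layer, maximizing summed quality under a latency budget.
# Numbers are invented; real runs use profiled latencies, distillation-
# based quality scores, and a MIP solver rather than brute force.
from itertools import product

# (quality_score, latency_ms) per candidate variant, per layer
layers = [
    [(1.00, 3.0), (0.97, 2.0), (0.90, 1.2)],  # layer 0: full / pruned FFN / no-attn
    [(1.00, 3.0), (0.96, 1.8), (0.88, 1.0)],  # layer 1
    [(1.00, 3.0), (0.99, 2.4), (0.93, 1.5)],  # layer 2
]
budget_ms = 6.5

best = max(
    (combo for combo in product(*layers)
     if sum(lat for _, lat in combo) <= budget_ms),
    key=lambda combo: sum(q for q, _ in combo),
)
print(best)  # highest-quality architecture that fits the latency budget
```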

Itay Levy@itayoush·
As a highlight, we present Nemotron-51B, derived from Llama-3.1-70B-Instruct, achieving a 2.17× inference-throughput speedup on a single NVIDIA H100 GPU while preserving 98.4% of the original model's capabilities. Dive into the details of how it all works in our new research paper!
Itay Levy@itayoush·
Puzzle accelerates LLM inference on specific hardware while preserving capabilities. Using decomposed NAS and knowledge distillation, we optimize LLMs under hardware constraints, requiring only a fraction of the original training compute.
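A minimal sketch of the blockwise distillation component described above (function names and hyperparameters here are hypothetical): each candidate child block is fit in isolation to reproduce its parent block's outputs, so many candidates can be trained and scored cheaply and in parallel.

```python
# Hedged sketch of blockwise local distillation: train a candidate child
# block to mimic its parent block's output on the parent's own input
# activations. The final error doubles as a cheap quality score for the
# architecture search. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn

def distill_block(parent: nn.Module, child: nn.Module,
                  acts: torch.Tensor, steps: int = 100) -> float:
    """Fit `child` to `parent` locally; return final MSE as a quality proxy."""
    opt = torch.optim.AdamW(child.parameters(), lr=1e-4)
    parent.eval()
    with torch.no_grad():
        target = parent(acts)          # parent block's local output (fixed)
    loss = torch.tensor(0.0)
    for _ in range(steps):
        loss = nn.functional.mse_loss(child(acts), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```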
Itay Levy retweeted
Pavlo Molchanov@PavloMolchanov·
🚀 @NeurIPSConf Spotlight! 🥳 Imagine fine-tuning an LLM with just a sparsity mask! In our latest work, we freeze the LLM and use 2:4 structured sparsity to learn binary masks for each linear layer. Thanks to NVIDIA Ampere's 2:4 sparsity, we can achieve up to 2× compute acceleration for downstream tasks! 💥
📄 Paper: arxiv.org/abs/2409.17481
🔗 Code: github.com/NVlabs/MaskLLM
🌐 Webpage: vainf.github.io/maskllm-projec…
🔑 Key insights:
- Gumbel-Softmax trick for differentiable binary mask training (see the sketch after this tweet)
- Learnable sparsity scales effectively to large datasets and can fully leverage compute to learn precise masks through end-to-end training
- Using mask priors (e.g., SparseGPT, magnitude) boosts efficiency & quality
- Annealed stochastic sampling is crucial for effective mask learning
- Maximizing the resulting weight magnitudes improves downstream task performance
🏆 Results:
- Effective 1.4× GPU speedup & 73% memory reduction with near-lossless performance!
- MaskLLM outperforms one-shot techniques with just 1280 samples.
- Our method improves with more data, unlike previous one-shot approaches!
Kudos to Gongfan Fang (NUS) for a great internship! Together with @yin_hongxu @jankautz @srv_m @gLeHeinrich @XinchaoWang3, Jeff Pool
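A hedged sketch of the first key insight above, the Gumbel-Softmax mask parameterization (my reconstruction, not the released MaskLLM code): each group of 4 weights keeps exactly 2, so there are only C(4,2) = 6 legal patterns per group, and we can learn logits over those candidates and sample from them differentiably.

```python
# Hedged sketch of differentiable 2:4 mask learning via Gumbel-Softmax:
# every group of 4 weights keeps exactly 2, giving 6 candidate binary
# masks per group; we learn logits over the candidates.
import torch
import torch.nn.functional as F

# The 6 legal 2:4 patterns (rows), shape (6, 4).
CANDIDATES = torch.tensor([
    [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1],
    [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1],
], dtype=torch.float32)

def sample_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: (num_groups, 6) learnable. Returns a (num_groups, 4) mask;
    hard=True gives discrete 2:4 masks with straight-through gradients."""
    onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot over 6
    return onehot @ CANDIDATES

w = torch.randn(8, 4)                           # toy weights: 8 groups of 4
logits = torch.zeros(8, 6, requires_grad=True)  # learnable mask parameters
masked_w = w * sample_mask(logits, tau=0.5)     # exactly 2 of 4 kept per group
```

Annealing `tau` toward zero over training, as the insights above note, makes the sampled masks increasingly discrete.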
Itay Levy retweeted
Talor Abramovich@AbramovichTalor·
We're launching EnIGMA, our state-of-the-art AI agent for offensive cybersec! It uses tools like Ghidra & pwntools, can debug, connect to servers, and exploit vulnerabilities to solve CTF challenges. Built with researchers from Princeton, NYU, and TAU. enigma-agent.github.io
Itay Levy retweeted
Pavlo Molchanov@PavloMolchanov·
🚀 Exciting news! We've just released a new LLM: Llama-3.1-Nemotron-51B = LLaMa-70B-Instruct + Block Distillation + NAS + Logits Distillation. Powered by a single H100 GPU with nearly the same accuracy! ⚡ This gives a 2.2× inference speedup, with MT-Bench 8.99 ➡️ 8.94.
🤗 HuggingFace model: huggingface.co/nvidia/Llama-3…
📝 Full details in our blog: developer.nvidia.com/blog/advancing…
This builds on block-level distillation, quick NAS, and knowledge distillation, along the lines of our previous work on LANA with @ArashVahdat, @yin_hongxu & @jankautz.
📄 LANA Paper: arxiv.org/abs/2107.10624
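A minimal sketch of the logits-distillation step named above (temperature and shapes are illustrative, not the values used for this model): the student is trained to match the teacher's temperature-softened token distribution with a KL loss.

```python
# Minimal logits (knowledge) distillation sketch: KL between the
# teacher's and student's temperature-softened next-token distributions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            tau: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) on softened distributions, scaled by tau^2."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

# Toy usage: 4 token positions over a 32k vocabulary.
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
kd_loss(student, teacher).backward()
```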