Itay Levy

70 posts

@itayoush

Deep Learning Researcher @NVIDIA

Israel · Joined September 2020
3K Following · 313 Followers
Itay Levy retweeted
Jian Zhang@JianZhangCS·
Nemotron 3 Super is live! So far the most intelligent agentic reasoning model in the Nemotron family, with world-leading efficiency and openness. Super in particular marks our first infra & research milestone in scaling up agentic reinforcement learning. Stay tuned for more infra, data, and agentic generalization research we will open to the ecosystem.
🤗 Hugging Face: lnkd.in/gWfamwwX
📜 Tech Report: lnkd.in/gRFFJxKm
🤸‍♂️ NeMo-Gym (RL env data and orchestration): github.com/NVIDIA-NeMo/Gym
🤸 NeMo-RL (RL training): github.com/NVIDIA-NeMo/RL
Itay Levy retweeted
Ethan He@EthanHe_42·
My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there for seriously training mixture-of-experts models at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. arxiv.org/abs/2603.07685
Itay Levy@itayoush·
For reasoning models, tok/s isn't a sufficient metric because trace length can change, so we also measure request-level efficiency. On the accuracy–speed frontier, 88B outperforms 120B across all reasoning-effort settings, with up to 1.29× higher request rates.
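A minimal sketch of the request-level view described above, with purely illustrative numbers (nothing here is measured or from the paper): two models emitting tokens at the same rate can still differ in requests served per second when their reasoning traces differ in length.

```python
# Illustrative sketch: tokens/s alone can mislead for reasoning models,
# because a model that writes shorter traces completes more requests.
# All numbers below are made up for illustration.

def requests_per_second(tokens_per_second: float, avg_trace_tokens: float) -> float:
    """Request-level throughput = token throughput / tokens per request."""
    return tokens_per_second / avg_trace_tokens

tok_s = 1000.0  # identical token throughput for both hypothetical models
print(requests_per_second(tok_s, avg_trace_tokens=2000.0))  # 0.50 req/s
print(requests_per_second(tok_s, avg_trace_tokens=1550.0))  # ~0.645 req/s, ~1.29x
```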
Itay Levy@itayoush·
🧵 New paper: We compressed OpenAI's gpt-oss-120B into a smaller, faster derivative (gpt-oss-puzzle-88B) with no accuracy loss:
⚡ Up to 1.63× higher token throughput on 8×H100
⚡ Up to 2.82× on a single H100
Itay Levy@itayoush·
Try Llama Nemotron Ultra 253B, the smartest open reasoning model available today!
🏆 Tops scientific reasoning, complex math, and coding benchmarks
⚡️ 4× higher inference throughput than DeepSeek R1
Optimized with neural architecture search and FFN fusion.
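A hedged sketch of the FFN-fusion idea named above, as I read it (an illustration, not the released implementation): consecutive FFN-only blocks that would normally run sequentially are approximated by one parallel residual step, which is equivalent to a single wider FFN and removes the sequential dependencies between them.

```python
# Hedged illustration of FFN fusion: k consecutive FFN-only blocks
#   x -> x + ffn_1(x);  x -> x + ffn_2(x);  ...
# are approximated by one parallel step
#   x -> x + ffn_1(x) + ... + ffn_k(x),
# i.e. a single wider FFN, so the GPU runs one large matmul per
# projection instead of k small, strictly sequential ones.
import torch
import torch.nn as nn

class FusedFFN(nn.Module):
    def __init__(self, ffns: list[nn.Module]):
        super().__init__()
        self.ffns = nn.ModuleList(ffns)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel approximation of the original sequential residual stack.
        return x + sum(ffn(x) for ffn in self.ffns)
```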
Itay Levy@itayoush·
Very excited about the release of the Llama Nemotron Super 49B model 🚀 #GTC25 Using distillation-based NAS (Puzzle), we achieved a 5× throughput gain! After SFT and RL, this model tops reasoning benchmarks among open 70B-class models.
Itay Levy retweeted
The AI Timeline@TheAITimeline·
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Author's Explanation: x.com/itayoush/statu…

Overview: Puzzle is a distillation-based neural architecture search framework that optimizes LLM inference for specific hardware, achieving a 2.17× speedup while retaining 98.4% of the original model's capabilities via blockwise local knowledge distillation and mixed-integer programming. Applied to produce Nemotron-51B, it lets a single NVIDIA H100 GPU handle large batch sizes, and it needs only 45B training tokens versus the 15T used for the original model, prioritizing inference performance over parameter count in model selection.

Paper: arxiv.org/abs/2411.19146
Itay Levy@itayoush

Introducing Puzzle: Distillation-Based NAS for Inference-Optimized LLMs 🔗 arxiv.org/abs/2411.19146 🧵
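A toy sketch of the mixed-integer selection step the overview mentions (my reconstruction; the scores, latencies, and brute-force search below are illustrative stand-ins for Puzzle's distillation-derived quality scores, hardware profiling, and an actual MIP solver): pick one block variant per layer so total quality is maximized under a latency budget.

```python
# Toy stand-in for Puzzle's mixed-integer block selection: choose one
# variant per layer, maximizing summed quality under a latency budget.
# Numbers are invented; real runs use profiled latencies, distillation-
# based quality scores, and a MIP solver rather than brute force.
from itertools import product

# (quality_score, latency_ms) per candidate variant, per layer
layers = [
    [(1.00, 3.0), (0.97, 2.0), (0.90, 1.2)],  # layer 0: full / pruned FFN / no-attn
    [(1.00, 3.0), (0.96, 1.8), (0.88, 1.0)],  # layer 1
    [(1.00, 3.0), (0.99, 2.4), (0.93, 1.5)],  # layer 2
]
budget_ms = 6.5

best = max(
    (combo for combo in product(*layers)
     if sum(lat for _, lat in combo) <= budget_ms),
    key=lambda combo: sum(q for q, _ in combo),
)
print(best)  # highest-quality architecture that fits the latency budget
```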

Itay Levy@itayoush·
As a highlight, we present Nemotron-51B, derived from Llama-3.1-70B-Instruct, achieving a 2.17× inference-throughput speedup on a single NVIDIA H100 GPU while preserving 98.4% of the original model's capabilities. Dive into the details of how it all works in our new research paper!
Itay Levy@itayoush·
Puzzle accelerates LLM inference on specific hardware while preserving capabilities. Using decomposed NAS and knowledge distillation, we optimize LLMs under hardware constraints, requiring only a fraction of the original training compute.
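A minimal sketch of the blockwise distillation component described above (function names and hyperparameters here are hypothetical): each candidate child block is fit in isolation to reproduce its parent block's outputs, so many candidates can be trained and scored cheaply and in parallel.

```python
# Hedged sketch of blockwise local distillation: train a candidate child
# block to mimic its parent block's output on the parent's own input
# activations. The final error doubles as a cheap quality score for the
# architecture search. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn

def distill_block(parent: nn.Module, child: nn.Module,
                  acts: torch.Tensor, steps: int = 100) -> float:
    """Fit `child` to `parent` locally; return final MSE as a quality proxy."""
    opt = torch.optim.AdamW(child.parameters(), lr=1e-4)
    parent.eval()
    with torch.no_grad():
        target = parent(acts)          # parent block's local output (fixed)
    loss = torch.tensor(0.0)
    for _ in range(steps):
        loss = nn.functional.mse_loss(child(acts), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```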
Itay Levy retweeted
Pavlo Molchanov@PavloMolchanov·
🚀 @NeurIPSConf Spotlight! 🥳 Imagine fine-tuning an LLM with just a sparsity mask! In our latest work, we freeze the LLM and use 2:4 structured sparsity to learn binary masks for each linear layer. Thanks to NVIDIA Ampere's 2:4 sparsity, we can achieve up to 2× compute acceleration for downstream tasks! 💥
📄 Paper: arxiv.org/abs/2409.17481
🔗 Code: github.com/NVlabs/MaskLLM
🌐 Webpage: vainf.github.io/maskllm-projec…
🔑 Key insights:
- Gumbel-Softmax trick for differentiable binary mask training (see the sketch after this tweet)
- Learnable sparsity scales effectively to large datasets and can fully leverage compute to learn precise masks through end-to-end training
- Using mask priors (e.g., SparseGPT, magnitude) boosts efficiency & quality
- Annealed stochastic sampling is crucial for effective mask learning
- Maximizing the resulting weight magnitudes improves downstream task performance
🏆 Results:
- Effective 1.4× GPU speedup & 73% memory reduction with near-lossless performance!
- MaskLLM outperforms one-shot techniques with just 1280 samples.
- Our method improves with more data, unlike previous one-shot approaches!
Kudos to Gongfan Fang (NUS) for a great internship! Together with @yin_hongxu @jankautz @srv_m @gLeHeinrich @XinchaoWang3, Jeff Pool
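A hedged sketch of the first key insight above, the Gumbel-Softmax mask parameterization (my reconstruction, not the released MaskLLM code): each group of 4 weights keeps exactly 2, so there are only C(4,2) = 6 legal patterns per group, and we can learn logits over those candidates and sample from them differentiably.

```python
# Hedged sketch of differentiable 2:4 mask learning via Gumbel-Softmax:
# every group of 4 weights keeps exactly 2, giving 6 candidate binary
# masks per group; we learn logits over the candidates.
import torch
import torch.nn.functional as F

# The 6 legal 2:4 patterns (rows), shape (6, 4).
CANDIDATES = torch.tensor([
    [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1],
    [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1],
], dtype=torch.float32)

def sample_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: (num_groups, 6) learnable. Returns a (num_groups, 4) mask;
    hard=True gives discrete 2:4 masks with straight-through gradients."""
    onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot over 6
    return onehot @ CANDIDATES

w = torch.randn(8, 4)                           # toy weights: 8 groups of 4
logits = torch.zeros(8, 6, requires_grad=True)  # learnable mask parameters
masked_w = w * sample_mask(logits, tau=0.5)     # exactly 2 of 4 kept per group
```

Annealing `tau` toward zero over training, as the insights above note, makes the sampled masks increasingly discrete.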
Itay Levy retweeted
Talor Abramovich@AbramovichTalor·
We're launching EnIGMA, our state-of-the-art AI agent for offensive cybersec! It uses tools like Ghidra & pwntools, can debug, connect to servers, and exploit vulnerabilities to solve CTF challenges. Built with researchers from Princeton, NYU, and TAU. enigma-agent.github.io
Itay Levy retweeted
Pavlo Molchanov@PavloMolchanov·
🚀 Exciting news! We've just released a new LLM: Llama-3.1-Nemotron-51B = LLaMa-70B-Instruct + Block Distillation + NAS + Logits Distillation. Powered by a single H100 GPU with nearly the same accuracy! ⚡ This gives a 2.2× inference speedup, with MT-Bench 8.99 ➡️ 8.94.
🤗 HuggingFace model: huggingface.co/nvidia/Llama-3…
📝 Full details in our blog: developer.nvidia.com/blog/advancing…
This builds on block-level distillation, quick NAS, and knowledge distillation, along the lines of our previous work on LANA with @ArashVahdat, @yin_hongxu & @jankautz.
📄 LANA Paper: arxiv.org/abs/2107.10624
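A minimal sketch of the logits-distillation step named above (temperature and shapes are illustrative, not the values used for this model): the student is trained to match the teacher's temperature-softened token distribution with a KL loss.

```python
# Minimal logits (knowledge) distillation sketch: KL between the
# teacher's and student's temperature-softened next-token distributions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            tau: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) on softened distributions, scaled by tau^2."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

# Toy usage: 4 token positions over a 32k vocabulary.
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
kd_loss(student, teacher).backward()
```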