Wonmin Byeon
@wonmin_byeon
Researcher · California · Joined March 2020
284 Following · 1.1K Followers · 96 posts

Pinned Tweet
Wonmin Byeon @wonmin_byeon
🚀 New paper: Mamba–Transformer hybrid VLMs can go fast without forgetting. We introduce stateful token reduction for long-video VLMs.
✅ Only 25% of visual tokens
🚀 3.8–4.2× faster prefilling (TTFT)
🎯 Near-baseline accuracy (can exceed baseline with light finetuning)
Wonmin Byeon retweeted
NVIDIA AI Developer @NVIDIAAIDev
Introducing NVIDIA Nemotron 3 Super 🎉
Open 120B-parameter (12B active) hybrid Mamba-Transformer MoE model
Native 1M-token context
Built for compute-efficient, high-accuracy multi-agent applications
Plus, fully open weights, datasets, and recipes for easy customization and deployment. 🧵
Wonmin Byeon retweeted
Bryan Catanzaro @ctnzr
Announcing NVIDIA Nemotron 3 Super!
💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 Up to 2.2X faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights
Models, tech report, etc. here: research.nvidia.com/labs/nemotron/…
And yes, Ultra is coming!
Wonmin Byeon retweeted
Bryan Catanzaro @ctnzr
One of the things that makes Nemotron 3 Super so fast is native multi-token prediction.
1. The model predicts several tokens rather than just one. This is essentially free because it is just a bit of extra work for the last layer of the model. The first token is accepted; the others are provisional.
2. On the next pass, the model verifies the extra tokens, accepting the ones that were correct and rejecting the ones that were wrong. Even on a misprediction we still accept one token, so autoregression always proceeds.
3. Verification is essentially free because, at small batch sizes, the GPU has a lot of extra compute capacity that would otherwise be wasted.
4. The speedup is therefore directly linked to the acceptance rate: the more accurate the multi-token prediction, the bigger the speedup.
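The accept/reject bookkeeping in that thread can be sketched in a few lines. This is a toy analogy, not NVIDIA's implementation: `true_next()` stands in for the model's normal next-token head, `draft_next()` for hypothetical multi-token heads whose provisional tokens are sometimes wrong, and the two forward passes are collapsed into one loop for brevity.

```python
import random

def true_next(seq):
    # Ground-truth "model": the next token is (last + 1) mod 10.
    return (seq[-1] + 1) % 10

def draft_next(seq, k, p_err):
    # Propose k provisional tokens in one step; each one is corrupted
    # with probability p_err, mimicking an imperfect draft head.
    drafts, cur = [], list(seq)
    for _ in range(k):
        t = true_next(cur)
        if random.random() < p_err:
            t = (t + 5) % 10
        drafts.append(t)
        cur.append(t)
    return drafts

def generate(prefix, n_tokens, k=3, p_err=0.0):
    out = list(prefix)
    steps = accepted = 0
    while len(out) - len(prefix) < n_tokens:
        steps += 1
        out.append(true_next(out))       # one token is always accepted
        drafts = draft_next(out, k, p_err)
        # Verify drafts left to right; stop at the first mismatch.
        for d in drafts:
            if len(out) - len(prefix) >= n_tokens or d != true_next(out):
                break
            out.append(d)                # provisional token accepted
            accepted += 1
    return out, steps, accepted

out, steps, acc = generate([0], 8, k=3, p_err=0.0)
print(out, steps, acc)  # perfect drafts: 2 passes produce 8 tokens
```

Note that the output sequence is always correct regardless of `p_err`, because only verified tokens are kept; a higher error rate only lowers the acceptance rate, which lowers the speedup, exactly as point 4 says.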
Wonmin Byeon @wonmin_byeon
@bronzeagepapi It can be used for Gated DeltaNet as well. Hopefully, someone tries it with Qwen3.5 VL ;)
Wonmin Byeon @wonmin_byeon
🚀 New paper: Mamba–Transformer hybrid VLMs can go fast without forgetting. We introduce stateful token reduction for long-video VLMs.
✅ Only 25% of visual tokens
🚀 3.8–4.2× faster prefilling (TTFT)
🎯 Near-baseline accuracy (can exceed baseline with light finetuning)
Wonmin Byeon @wonmin_byeon
5) Results: with only 25% visual tokens, we get 3.8–4.2× faster prefilling (TTFT) while keeping near-baseline accuracy. With light finetuning under reduction, accuracy improves further and can even surpass baseline.
Wonmin Byeon @wonmin_byeon
4) Key piece: language-aware token scoring across attention and SSM layers:
- attention layers (explicit attention)
- Mamba/SSM layers (implicit-attention proxy)
So selection stays prompt-aligned.
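A minimal sketch of what prompt-aligned selection could look like, under my reading of the thread (not the paper's code): visual tokens are scored by how strongly the text-prompt tokens attend to them, the per-layer scores (explicit attention, plus a random stand-in for the SSM implicit-attention proxy) are averaged, and only the top 25% survive.

```python
import numpy as np

def prompt_attention_scores(q_text, k_visual):
    # Explicit-attention score: softmax(Q K^T / sqrt(d)) averaged over
    # the prompt tokens, giving one relevance score per visual token.
    logits = q_text @ k_visual.T / np.sqrt(q_text.shape[-1])
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn.mean(axis=0)

def select_visual_tokens(visual, score_maps, keep_ratio=0.25):
    # Average the per-layer scores, keep the top-k tokens, and preserve
    # their original temporal/spatial order.
    scores = np.mean(score_maps, axis=0)
    k = max(1, int(len(visual) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return visual[keep], keep

rng = np.random.default_rng(0)
visual = rng.normal(size=(64, 16))   # 64 visual tokens, dim 16
q_text = rng.normal(size=(8, 16))    # 8 prompt-token queries
attn_s = prompt_attention_scores(q_text, visual)
ssm_s = rng.random(64)               # hypothetical SSM-layer proxy scores
kept, idx = select_visual_tokens(visual, [attn_s, ssm_s])
print(kept.shape)                    # (16, 16): 25% of the visual tokens
```

The sort of the kept indices matters: the surviving tokens stay in their original order, so the reduced sequence still reads as a (sparser) video.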
Jeongseok Hyun @Jeongseok_hyun
🎉Thrilled to share that I will be joining NVIDIA Taiwan (@NVIDIAAI) as a research intern! I'll be working on efficient modeling of Video LLMs under the supervision of Ryo Hachiuma (@RHachiuma). Looking forward to this exciting journey ahead!
Wonmin Byeon @wonmin_byeon
It's been months since we introduced STORM, and it's great to see so many video discussions! ⛈️🎥 To catch up on our token-efficient long-video understanding work, check out these reviews (some are AI-generated but very useful!). Huge thanks to the creators! 🙌 The links ↓
Wonmin Byeon @wonmin_byeon

🚀 New paper: STORM — Efficient VLM for Long Video Understanding STORM cuts compute costs by up to 8× and reduces decoding latency by 2.4–2.9×, while achieving state-of-the-art performance. Details + paper link in the thread ↓

Wonmin Byeon retweeted
Ali Hatamizadeh @ahatamiz1
What if your model could learn from its own drafts during RL training? 🚀
🔥 New paper: iGRPO: Iterative Group Relative Policy Optimization
We add a self-feedback loop to GRPO: the model drafts multiple solutions, picks its best one, then learns to refine beyond it.
Core idea:
Stage 1 → explore and select your strongest attempt.
Stage 2 → condition on that attempt and beat it.
Same scalar reward. No critics, no generated critiques, no verification text. The best draft is the only feedback the model needs.
📊 Results across 7B / 8B / 14B models:
• Nemotron-H-8B-Base-8K: 41.1% → 45.0% (+3.96 over GRPO)
• DeepSeek-R1-Distill-Qwen-7B: 68.3% → 69.9%
• OpenMath-Nemotron-14B: 76.7% → 78.0%
• OpenReasoning-Nemotron-7B on AceReason-Math: 85.62% AIME24 / 79.64% AIME25 🏆
The same two-stage wrapper also improves DAPO and GSPO; it is not tied to GRPO at all.
📄 Paper: arxiv.org/abs/2602.09000
This work was the result of a wonderful collaboration with @shrimai_ @igtmn @GXiming @SeungjuHan3 @_weiping @YejinChoinka @jankautz
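The two-stage loop can be illustrated with a toy numeric analogy. To be clear, this is NOT the paper's algorithm: a "policy" over real numbers drafts several answers, keeps its strongest one (Stage 1), then samples a refinement conditioned on that draft and tries to beat it (Stage 2); the only feedback used anywhere is the same scalar reward.

```python
import random

def reward(answer, target):
    # Hypothetical scalar reward: closer answers score higher.
    return -abs(answer - target)

def igrpo_step(policy, target, n_drafts=8):
    # Stage 1: explore and select the strongest attempt.
    drafts = [random.gauss(policy["mean"], policy["std"])
              for _ in range(n_drafts)]
    best = max(drafts, key=lambda a: reward(a, target))
    # Stage 2: condition on the best draft and try to exceed it.
    refined = random.gauss(best, policy["std"] * 0.5)
    winner = max(best, refined, key=lambda a: reward(a, target))
    # "Policy update": move toward whichever answer won, and anneal.
    policy["mean"] += 0.5 * (winner - policy["mean"])
    policy["std"] = max(policy["std"] * 0.95, 0.1)
    return policy

random.seed(0)
policy = {"mean": 0.0, "std": 4.0}
for _ in range(60):
    igrpo_step(policy, target=10.0)
print(round(policy["mean"], 1))
```

The point of the analogy is the wrapper shape: select-best, then refine-past-best, with no critic or critique text anywhere in the loop, which is why the tweet notes it also wraps DAPO and GSPO.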
Wonmin Byeon retweeted
Pavlo Molchanov @PavloMolchanov
🚀 Announcing C-RADIOv4: the latest evolution in our agglomerative vision backbone family is here!
We've built a unified student model that distills the best capabilities of multiple state-of-the-art teachers into a single, efficient architecture. DINOv3, SAM3, and SigLIP2, all in one forward pass, with better features. Better efficiency, robustness, and any-resolution support.
👐 Permissive and open-source!
Read the tech report: arxiv.org/pdf/2601.17237
Models: huggingface.co/nvidia/C-RADIO… huggingface.co/nvidia/C-RADIO…
GitHub: github.com/NVlabs/RADIO
🧵 1/N
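The agglomerative idea above can be sketched as multi-teacher feature distillation: one student matches the features of several teachers at once through a shared trunk plus one lightweight head per teacher. In this hedged toy the "teachers" are random linear maps, not DINOv3/SAM3/SigLIP2, the trunk is frozen, and training is plain full-batch gradient descent on a feature-matching MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_t = 64, 8, 4
X = rng.normal(size=(n, d_in))                      # fixed input batch
teachers = [rng.normal(size=(d_in, d_t)) for _ in range(3)]

trunk = rng.normal(size=(d_in, d_in)) / np.sqrt(d_in)  # shared, frozen
heads = [np.zeros((d_in, d_t)) for _ in range(3)]      # trainable heads

def distill_step(lr=0.05):
    feats = X @ trunk                # one forward pass through the trunk
    total = 0.0
    for i, (head, teacher) in enumerate(zip(heads, teachers)):
        err = feats @ head - X @ teacher           # student vs. teacher
        total += (err ** 2).mean()
        heads[i] = head - lr * feats.T @ err / n   # MSE gradient step
    return total / len(teachers)

losses = [distill_step() for _ in range(300)]
print(losses[0], losses[-1])         # matching loss shrinks over training
```

The design point the tweet emphasizes is the single shared forward pass: `feats` is computed once, and each teacher costs only a small head on top of it.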
Wonmin Byeon retweeted
Casper Hansen @casper_hansen_
Nvidia is fast becoming the open-source kings: they actually pre-trained, mid-trained, and post-trained, and got better results than Qwen3, which is impressive! Now we only need to bully them about the nvidia-open-model-license and multimodality...
Wonmin Byeon retweeted
Mohammad Shoeybi @MohammadShoeybi
We just released Nemotron 3 Nano 30B-A3B. It is fast, accurate, and truly open. Check it out! Nemotron 3 Super (~4x larger) and Ultra (~16x larger) are currently being trained and will be released in 2026! Blog post and technical report: nvda.ws/48RusVt
Wonmin Byeon @wonmin_byeon
NVIDIA’s hybrid model just got bigger, faster, and stronger.
Bryan Catanzaro @ctnzr

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.
