Zhiding Yu

202 posts

@ZhidingYu

Working to make machines understand the world like human beings. Words are my own.

Santa Clara · Joined July 2020

644 Following · 7.8K Followers

Pinned Tweet
Zhiding Yu @ZhidingYu
Thank you AK! Excited to introduce Eagle 2.5, NVIDIA's latest vision-language model that brings strong long-context capabilities across both image and video understanding — all with just 8B parameters.

Most existing VLMs struggle with high-res inputs and long video contexts. Eagle 2.5 is designed to tackle both — supporting up to 512 video frames and trained jointly on image + video data.

We introduce a new benchmark dataset, Eagle-Video-110K, with over 110K annotated samples, including QA, localization, and summarization. Videos range from a few minutes to 3 hours — pushing the limits of long-form visual reasoning.

Key techniques:
• Information-First Sampling: spatially aware, quality-preserving frame selection (a toy sketch follows the quoted post below)
• Mixed image-video training for generalization
• Progressive long-context training recipes up to 128K tokens
• Optimized decoding and inference for efficient deployment

Strong results across the board:
• SOTA on 6 out of 10 long-video benchmarks
• Outperforms GPT-4o (0806) on 3/5 video tasks
• Outperforms Gemini 1.5 Pro on 4/6 video tasks
• Matches or beats Qwen2.5-VL-72B on multiple key datasets
• Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL

Evaluated on Video-MME, MVBench, Charades-STA, 1-Hour Video QA, EgoSchema, MLVU, LVBench, and more — tasks that stress-test long-form visual understanding with dense supervision and temporal reasoning.

Model, demo, and dataset will be released soon.
Project: nvlabs.github.io/EAGLE/
Code: github.com/NVlabs/EAGLE
Tech Report: huggingface.co/papers/2504.15…

We're excited to contribute toward long-context, general-purpose VLMs — and would love to hear your feedback or ideas for collaboration.
Aran Komatsuzaki@arankomatsuzaki

Nvidia presents Eagle 2.5! - A family of frontier VLMs for long-context multimodal learning - Eagle 2.5-8B matches the results of GPT-4o and Qwen2.5-VL-72B on long-video understanding
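The post doesn't spell out how Information-First Sampling works; purely as an illustration of budgeted, quality-preserving frame selection, here is a minimal sketch. The difference-based information score and the greedy top-k selection are my assumptions, not the Eagle 2.5 recipe.

```python
# Hypothetical frame sampler (NOT the Eagle 2.5 implementation): when a video
# has more frames than the budget allows, keep the frames that contribute the
# most new information, scored here by a cheap inter-frame difference.
import numpy as np

def sample_frames(frames: np.ndarray, max_frames: int = 512) -> list:
    """frames: (T, H, W, C) uint8 video; returns sorted indices to keep."""
    t = len(frames)
    if t <= max_frames:
        return list(range(t))              # within budget: keep everything
    gray = frames.mean(axis=-1)            # luminance proxy per frame
    # how much each frame differs from its predecessor
    diffs = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))
    scores = np.concatenate([[diffs.max()], diffs])   # always favor frame 0
    keep = np.sort(np.argsort(scores)[-max_frames:])  # top-k, temporal order
    return keep.tolist()
```

A real information-first sampler would also have to stay spatially aware, e.g. by deciding how many tiles each high-resolution frame gets under the same token budget, which this toy score ignores.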

1 reply · 8 reposts · 53 likes · 18.6K views
Zhiding Yu retweeted
Zihan "Zenus" Wang
Zihan "Zenus" Wang@wzenus·
In Agent RL, models suffer from Template Collapse. They generate vast, diverse outputs (high entropy) that lose all meaningful connection to the input prompt (low mutual information). In other words, agents learn different ways to say nothing. 🚀 Introducing RAGEN-v2 — here's how we define and fix such silent failure modes in Agent RL. 🧵
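The thread's framing is information-theoretic, so here is a toy way to measure it; the discrete labels for prompts and responses (e.g. template IDs from clustering) are an assumption for illustration, not RAGEN-v2's actual diagnostic.

```python
# Toy "template collapse" check: high response entropy H(R) together with low
# mutual information I(P;R) means outputs vary a lot yet ignore the prompt.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def mutual_info(prompts, responses):
    """I(P;R) = H(P) + H(R) - H(P,R) over discrete labels."""
    return (entropy(prompts) + entropy(responses)
            - entropy(list(zip(prompts, responses))))

# Responses are diverse, but their distribution is identical for every prompt.
prompts   = ["p0", "p0", "p0", "p1", "p1", "p1"]
responses = ["r0", "r1", "r2", "r0", "r1", "r2"]
print(entropy(responses))               # ~1.585 bits: high output entropy
print(mutual_info(prompts, responses))  # ~0.0 bits: prompt carries no signal
```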
12 replies · 53 reposts · 209 likes · 129.3K views
Zhiding Yu @ZhidingYu
Today is a reflective, slightly wistful day for me. Having built VLMs at NVIDIA for over two years, I've always seen the Qwen model family as a role model in the field. I'm proud to see it stand as an exemplar of open-source innovation, driven by outstanding homegrown talent and strong engineering execution. For me, Qwen has felt more like a research partner than a competitor to surpass. Its technology has supported many of our flagship efforts and become part of our daily research life. My best wishes to Qwen and Junyang. Your excellence has made us better.
Junyang Lin@JustinLin610

me stepping down. bye my beloved qwen.

2 replies · 11 reposts · 218 likes · 24.8K views
Zhiding Yu retweeted
Fu-En (Fred) Yang @FuEnYang1
🚀 Excited to share that our paper Fast-ThinkAct has been accepted to #CVPR2026! 🎉
Efficient Vision-Language-Action reasoning via verbalizable latent planning — enabling embodied agents to think fast internally without lengthy textual reasoning.
⚡ Achieves 9.3× faster inference (89% latency reduction) than ThinkAct-7B — bringing reasoning VLA closer to real-time robotic control.
📄 arxiv.org/abs/2601.09708
🎥 jasper0314-huang.github.io/fast-thinkact/
🙌 Huge congrats to @chipinhxyz, @yunzeman, @ZhidingYu, @CMHungSteven, @jankautz, Yu-Chiang Frank Wang, @FuEnYang1
#EmbodiedAI #PhysicalAI #VLA #Robotics #NVIDIAResearch @NVIDIAAI @NVIDIARobotics
Fu-En (Fred) Yang@FuEnYang1

🤖 How can embodied agents think fast—like humans do internally—without lengthy textual reasoning, and still act effectively? 🚀 Introducing Fast-ThinkAct: compact, efficient, verbalizable latent reasoning for Vision-Language-Action models. Fast think, fast act. 🧠⚡🤲

3 replies · 11 reposts · 75 likes · 6.2K views
Zhiding Yu retweeted
Zhiding Yu @ZhidingYu
Very happy to be part of this cool work! I believe latent thinking presents a more promising paradigm for embodied AI. It doesn't make sense for a robot or self-driving car to spit out hundreds of text tokens just to reason about a decision; humans think fast in the physical world. With Fast-ThinkAct, the thinking budget is highly compressed and controlled while strong reasoning capabilities are preserved. This makes everything friendlier for edge deployment of PhysicalAI systems.
Fu-En (Fred) Yang@FuEnYang1

🤖 How can embodied agents think fast—like humans do internally—without lengthy textual reasoning, and still act effectively? 🚀 Introducing Fast-ThinkAct: compact, efficient, verbalizable latent reasoning for Vision-Language-Action models. Fast think, fast act. 🧠⚡🤲

0 replies · 2 reposts · 12 likes · 2.1K views
Zhiding Yu retweeted
Chanwoo Park @chanwoopark20
One of my favorite moments from Yejin Choi's NeurIPS keynote was this point: "it looks like a minor detail, but one thing I learned since joining and spending time at NVIDIA is that all these, like, minor details, implementation details matter a lot" — I think this is exactly what theory people often undervalue when it comes to empirical work.
[image attached]
22 replies · 76 reposts · 1.1K likes · 125.5K views
Xueyan Zou @xyz2maureen
I will join Tsinghua University, College of AI, as an Assistant Professor in the coming month. I am actively looking for 2026 spring interns and future PhDs (ping me if you are at #NeurIPS). It has been an incredible journey of 10 years since I attended an activity organized by Tsinghua University and, inspired by one of my teammates, decided to change my undergraduate major from Economics to Computer Science. Over these 10 years, I have met many wonderful researchers and professors whose appreciation led me to continued growth. 🐿️ My research focus will continue to be AI & Robotics, with a specific emphasis on Interactive Embodied Intelligence. You can check my homepage to learn more: maureenzou.github.io/lab.html. I am currently local to San Diego and will be attending #NeurIPS. Please ping me over WeChat or email if any old or new friends are interested in a coffee chat! (Really looking forward to meeting as many friends as possible at #NeurIPS) [The photo is one of the places I will miss most in the US]
[image attached]
69 replies · 87 reposts · 1.1K likes · 111.1K views
Zhiding Yu @ZhidingYu
What a brilliant connection back to texture synthesis with nonparametric sampling. I remember re-implementing this algorithm myself as an undergrad, and it worked like magic! A quarter century later, we are back to the simple but elegant basics. #NeurIPS25 #AutoregressiveModelsBeyondLanguage
[image attached]
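The algorithm referenced here is texture synthesis by nonparametric sampling (Efros & Leung, ICCV 1999). As a from-memory toy re-implementation, simplified to raster-order filling with a fixed window (the paper grows outward from the boundary of the filled region), it can look something like this:

```python
# Toy Efros-Leung texture synthesis: fill each unknown pixel by matching its
# partially-known neighborhood against all windows of the sample texture and
# sampling from the near-best matches. Parameters are arbitrary.
import numpy as np

def synthesize(sample, out_size, win=11, eps=0.1, seed=0):
    """sample: 2D float grayscale texture (>= win x win)."""
    rng = np.random.default_rng(seed)
    half = win // 2
    out = np.full((out_size, out_size), np.nan)
    out[:3, :3] = sample[:3, :3]                    # seed patch to start from
    cand = np.lib.stride_tricks.sliding_window_view(sample, (win, win))
    cand = cand.reshape(-1, win, win)               # all sample windows
    pad = np.pad(out, half, constant_values=np.nan)
    for i in range(out_size):
        for j in range(out_size):
            if not np.isnan(out[i, j]):
                continue
            nb = pad[i:i + win, j:j + win]          # neighborhood around (i,j)
            mask = ~np.isnan(nb)                    # which pixels are known
            d = np.sum((cand - np.where(mask, nb, 0.0)) ** 2 * mask,
                       axis=(1, 2))                 # masked SSD to candidates
            ok = np.flatnonzero(d <= d.min() * (1 + eps))
            out[i, j] = cand[rng.choice(ok)][half, half]  # copy center pixel
            pad[i + half, j + half] = out[i, j]
    return out
```

The connection the post is drawing: each pixel is sampled from a conditional distribution given its already-generated context, which is the same autoregressive recipe, a quarter century earlier.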
0 replies · 1 repost · 8 likes · 786 views
Zhiding Yu retweeted
Pengyu Zhao @zpysky1125
MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model?

On behalf of pre-training lead Haohai Sun. (zhihu.com/question/19653…)

I. Introduction

As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock and go with full attention in MiniMax M2?" After explaining the backstory in one chat after another, I figured it's time to write down our journey in a blog.

Honestly, I could give you the textbook debate. I could talk all afternoon about why you should build linear/sparse attention, then turn around and talk all afternoon about why you shouldn't. But what's the point of all that hand-waving? The real question is whether you should actually do it. So, let's start with the conclusion: we are always working on it. But in a real-world, industrial-grade system, the truth is that efficient attention still has some way to go before it can definitively beat full attention.

As LLMs have evolved, the entire stack has become monstrously complex. We serve more scenarios, and the architecture design trade-offs are exploding: "How does it perform on code and math? What about agent scenarios? How does it handle multimodality? Does long-chain CoT still hold up? Can RL scale on top of it? Are there hidden traps with low-precision compute? How do you implement interleaved thinking, caching, or speculative decoding? ..." In short, there's a vast difference between the promise on paper and its payoff in production. You only get to claim that payoff after satisfying Condition 1...n and solving Problem 1...n.

II. Why Efficient Attention?

Let's do a thought experiment. If you had infinite compute, would you even bother with linear or sparse attention? Some might bring up theoretical arguments about softmax attention "oversmoothing" in an infinite context... but who knows? Under the current compute bound, no model has truly pushed softmax attention to its absolute limit. So, for all practical purposes, the race for efficient attention is a race to save compute.

For our M2 design, could we aim to save tokens instead, achieving the same quality with fewer tokens? Well, if you believe in scaling laws, you'd probably bet on other paths to get there, not efficient attention. So the simple truth is this: compute is finite, and we need an architecture that makes better use of it — models that achieve higher performance under the same budget (training & inference).

III. The Real Bottlenecks

To build a model that can practically be deployed and used by the community, we have to start with what users care about: quality, speed (TPS), and price. Quality is non-negotiable: a useless model is useless even if it's free. So how do we make a linear/sparse/hybrid attention model that performs well enough? The biggest challenge here isn't the architecture design — the real bottleneck is the limitations of evaluation. (As for speed and price, those are heavily influenced by the inference stack — and great models tend to attract great engineers to optimize them.)

The Evaluation Trap: Goodhart's Law in Action

"As long as you build the benchmark, I'll find a way to beat it." Over the past few years of LLM development, the pace of leaderboard progress has been staggering. No matter how hard a benchmark is — even if the SOTA score starts in single digits — once it catches the industry's attention, it's usually crushed within a few iterations.
But how do you build an evaluation system that is comprehensive and actually reflects a model's true capabilities? That's one of the hardest — and most critical — problems in LLM development, and it becomes even more acute when you start messing with a component as fundamental as attention.

Benchmarks Are a Leaky Abstraction

There's no free lunch. When you reduce the complexity of attention, you pay a price. The question is: where? When we were developing MiniMax-Text-01, everyone was still evaluating MMLU, BBH, MATH, and LongBench (all of which are now saturated). From the perspective of a year ago, a hybrid of Lightning Attention and full attention looked just as good as pure full attention, and our own small-scale hybrid models confirmed this on the leaderboards. (Did we find a free lunch?) Not quite. The price became obvious at a larger scale: the model had clear deficits in complex, multi-hop reasoning tasks.

Okay, once a problem is exposed, you can fix it. We developed proxy metrics for this specific weakness and iterated until the hybrid model seemed to match MHA. But does that proxy metric still correlate with real-world downstream performance at an even larger scale? Are there other hidden weaknesses? Who knows — we haven't run those experiments yet. The better the models get, the harder they are to evaluate. But that's a necessary part of the journey — keep it up, eval teams!

The High Cost of Knowing Things

For complex reasoning tasks, we can sometimes find early proxy metrics that correlate well with final performance — but not for all tasks (at least, not yet). As tasks get harder, the amount of experiment compute required just to get a statistically significant signal on your metric grows astronomically — which is ironic, since we study efficient attention because compute is limited. And beyond the academic benchmarks, optimization issues often only surface at scale; you never really know what's going to happen until you scale up. Anyone who read our M1 paper will recall the serious precision issues we hit during RL training — problems we wish we had spotted earlier. Going back and analyzing Lightning Attention's numerical convergence with that experience in hand was incredibly clarifying. Discovering the real problems is often far harder than solving them.

A Symphony of Variables

There are just too many variables in model training. Different architectures behave very differently on different data distributions and with different optimizers. In a world where our data is constantly being updated, an experiment run on last month's data mix might yield the opposite conclusion today. We can't observe everything perfectly — but we're working on finding more reliable experimental strategies.

Infrastructure: Where Theory Meets Metal

Compared to full attention, the infrastructure for linear and sparse attention is much less mature. To actually get the promised results, there's still a lot of groundwork to fill in. Take linear attention as an example: if you analyze the compute intensity of existing linear architectures, many of them are memory-bound — even during training. Without extreme IO optimization, you're basically leaving a huge amount of GPU FLOPs on the table.
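To make the constant-memory point concrete, here is a minimal single-head sketch of linear attention in its recurrent form (generic illustrative code, not MiniMax's Lightning Attention kernel; the feature map is an arbitrary choice). The running state is a fixed d×d matrix however long the sequence gets, and each step does little arithmetic per byte of state touched, which is the memory-bound behavior described above.

```python
# Causal linear attention, recurrent form (toy sketch, not Lightning
# Attention). With feature map phi:
#   S_t = S_{t-1} + phi(k_t) v_t^T,   z_t = z_{t-1} + phi(k_t)
#   o_t = phi(q_t) S_t / (phi(q_t) . z_t)
# State size stays O(d^2) regardless of sequence length T.
import numpy as np

def phi(x):
    return np.maximum(x, 0) + 1e-6   # a simple positive feature map (assumed)

def linear_attention(q, k, v):
    """q, k, v: (T, d) for one head; returns (T, d) causal outputs."""
    T, d = q.shape
    S = np.zeros((d, d))             # running sum of phi(k) v^T (fixed size!)
    z = np.zeros(d)                  # running normalizer
    out = np.empty_like(v)
    for t in range(T):               # O(d^2) work per token: O(T) overall
        S += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = phi(q[t]) @ S / (phi(q[t]) @ z + 1e-9)
    return out
```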
And inference brings even more challenges than training: how do you deliver a service that is genuinely faster and cheaper? Linear attention has linear compute complexity and constant memory usage, which means there's a crossover point where it becomes more efficient than full attention in compute and memory. In theory, that point lies at a few thousand tokens — which isn't particularly long for today's large models (a back-of-the-envelope version of this arithmetic appears after this post). But that's just theory. We need to solve a few key problems to actually approach it:

- Low-precision state storage: linear attention is currently far more sensitive to numerical precision than full attention.
- Prefix caching: in real-world applications, the cache-hit rate for conversations is very high. A new architecture must handle this gracefully.
- Speculative decoding: how do you optimize speculative decoding with a linear attention backbone?

Fortunately, all of these seem solvable.

IV. What's Next

Scaling remains the name of the game, and context scaling is one of the key problems. Longer and longer context lengths are key in both pre-training and post-training. As GPU compute growth slows while data length keeps increasing, the benefits of linear and sparse attention will gradually emerge. We should start preparing now:

- Better data: more multimodal, information-rich long-context data.
- Better evaluation: more informative evaluation systems and experimental paradigms to speed up iteration.
- Better infrastructure: mature training and inference infrastructure to fully squeeze out GPU potential.

V. Addendum: the SWA Code...

We accidentally left the SWA inference code in the open-source release, and some people asked why it wasn't used in the final model. Simple answer: the performance wasn't good enough. That experiment was from quite early on, before GPT-OSS was open-sourced (we were pretty surprised to see its structure, by the way). But I can share a brief summary of our failed attempt.

We tried adapting the model via continued pre-training (CPT) into a hybrid SWA, testing both inter- and intra-layer mixing. The motivation for intra-layer mixing was to balance the compute intensity across all layers, which is friendly to both PP in training and PP or AFD during inference. Unfortunately, neither worked: performance degraded noticeably as context length grew, which is unacceptable in agentic scenarios.

Our analysis showed that many global attention patterns (like retrieval heads and induction heads) are already established early during pre-training, and CPT can hardly adjust those patterns afterwards. You can certainly mitigate the issue by using data probes to identify and keep those heads as full attention — but unfortunately, it's nearly impossible to discover them all from human priors. (And no, this issue isn't related to attention sinks.) If you're interested in this line of research, I recommend taking a closer look at GPT-OSS, CWM, and Gemma, especially their long-context performance.

Finally, we're hiring! If you want to join us, send your resume to guixianren@minimaxi.com.

References
- MiniMax-01: Scaling Foundation Models with Lightning Attention
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- CWM: An Open-Weights LLM for Research on Code Generation with World Models
- Qwen3-Next
- Gemma 3 Technical Report
- gpt-oss-120b & gpt-oss-20b Model Card
- Retrieval Head Mechanistically Explains Long-Context Factuality
- transformer-circuits.pub/2022/in-contex…
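Picking up the crossover arithmetic from section III of the blog above: a toy per-token cost model (the constant factors and effective width are my own illustrative assumptions, not MiniMax's numbers) does land the break-even point at a few thousand tokens.

```python
# Back-of-the-envelope crossover estimate (illustrative assumptions only).
# Per generated token, at effective attention width d:
#   full attention   ~ 2 * n * d FLOPs  (scores + weighted sum over n keys)
#   linear attention ~ c * d * d FLOPs  (update + read a d x d state)
# Equating the two gives n* ~ (c / 2) * d.
def crossover_tokens(d: int, c: float = 4.0) -> float:
    """Context length beyond which linear attention wins (toy model)."""
    return (c / 2) * d

for d in (1024, 2048, 4096):         # assumed effective widths
    print(d, crossover_tokens(d))    # -> 2048, 4096, 8192 tokens
```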
MiniMax (official)@MiniMax_AI

We're open-sourcing MiniMax M2 — Agent & Code Native, at 8% of Claude Sonnet's price and ~2x faster ⚡ Globally FREE for a limited time via MiniMax Agent & API
- Advanced coding capability: engineered for end-to-end developer workflows, with strong capability across a wide range of applications (Claude Code, Cursor, Cline, Kilo Code, Droid, etc.)
- High agentic performance: robust handling of long-horizon toolchains (MCP, shell, browser, retrieval, code)
- Smarter, faster, cheaper, with efficient parameter activation

23 replies · 117 reposts · 805 likes · 748.3K views
Zhiding Yu @ZhidingYu
Very useful work! I'd like to try it as part of the offline data generation pipeline for spatial intelligence. It has become a general trend to lift things from 2D to 3D, build a scene graph, and generate QAs in a scalable manner (a toy sketch of that last step follows the quoted post). Some relevant works from NVIDIA: github.com/NVlabs/OmniDri… github.com/AnjieCheng/Spa…
Smells Like ML@smellslikeml

@ZhidingYu You'd like VQASynth github.com/remyxai/VQASyn…: turn any image dataset from HF into one annotated with spatial relationships between objects, using a 3D scene reconstruction pipeline
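Neither post details the QA-generation step, so here is a generic illustration of the scene-graph-to-QA pattern; the object representation, the left/right convention, and the question template are all hypothetical, not taken from VQASynth or the NVIDIA repos.

```python
# Hypothetical sketch: turn pairwise 3D relations from a lifted scene graph
# into templated spatial QA pairs at scale.
from dataclasses import dataclass
import itertools

@dataclass
class Obj:
    name: str
    center: tuple          # (x, y, z) metric position from the 2D->3D lift

def spatial_qas(objects):
    """One QA pair per ordered object pair, from a left/right template."""
    qas = []
    for a, b in itertools.permutations(objects, 2):
        # assume a camera frame with +x pointing right
        rel = "left of" if a.center[0] < b.center[0] else "right of"
        qas.append((f"Is the {a.name} left of or right of the {b.name}?",
                    f"The {a.name} is {rel} the {b.name}."))
    return qas

scene = [Obj("chair", (0.4, 0.0, 2.1)), Obj("table", (1.3, 0.0, 2.0))]
for q, a in spatial_qas(scene):
    print(q, "->", a)
```

Real pipelines add depth and occlusion checks and many more relation types, but the scalability comes from exactly this kind of templating over a verified scene graph.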

1 reply · 2 reposts · 21 likes · 4.6K views