Fu-En (Fred) Yang

484 posts

@FuEnYang1

Research Scientist @NVIDIAAI | Ph.D. @NTU_TW | Prev. Research Intern @NVIDIAAI | Unifying World, Language & Action for Generalist Robotics

Joined February 2020
1.4K Following · 867 Followers
Pinned Tweet
Fu-En (Fred) Yang @FuEnYang1
🚀 Excited to share that our paper Fast-ThinkAct has been accepted to #CVPR2026! 🎉
Efficient Vision-Language-Action reasoning via verbalizable latent planning — enabling embodied agents to think fast internally without lengthy textual reasoning.
⚡ Achieves 9.3× faster inference (89% latency reduction) than ThinkAct-7B — bringing Reasoning VLA closer to real-time robotic control.
📄 arxiv.org/abs/2601.09708
🎥 jasper0314-huang.github.io/fast-thinkact/
🙌 Huge congrats to @chipinhxyz, @yunzeman, @ZhidingYu, @CMHungSteven, @jankautz, Yu-Chiang Frank Wang, @FuEnYang1
#EmbodiedAI #PhysicalAI #VLA #Robotics #NVIDIAResearch @NVIDIAAI @NVIDIARobotics
Fu-En (Fred) Yang @FuEnYang1

🤖 How can embodied agents think fast—like humans do internally—without lengthy textual reasoning, and still act effectively? 🚀 Introducing Fast-ThinkAct: compact, efficient, verbalizable latent reasoning for Vision-Language-Action models. Fast think, fast act. 🧠⚡🤲

3 replies · 11 reposts · 75 likes · 6.2K views
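The pinned tweet above hinges on replacing long textual chain-of-thought with a handful of latent plan tokens. The snippet below is a rough, generic sketch of that idea only: a fixed set of learned queries cross-attend to fused vision-language features in one pass and directly condition an action head. It is not the Fast-ThinkAct architecture; the `LatentPlanner` class, module names, and dimensions are all hypothetical.

```python
# Rough, illustrative sketch only, NOT the Fast-ThinkAct architecture.
# Idea being illustrated: replace long autoregressive chain-of-thought text with a
# small, fixed number of latent "plan" tokens that condition the action head.
# All module and parameter names here are hypothetical.
import torch
import torch.nn as nn

class LatentPlanner(nn.Module):
    def __init__(self, d_model=512, num_plan_tokens=8, action_dim=7):
        super().__init__()
        # Learned queries that get filled with plan content in a single pass.
        self.plan_queries = nn.Parameter(torch.randn(num_plan_tokens, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.action_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, vl_features):
        # vl_features: (batch, seq, d_model) fused vision-language features.
        b = vl_features.shape[0]
        queries = self.plan_queries.unsqueeze(0).expand(b, -1, -1)
        # One cross-attention pass produces all latent plan tokens at once,
        # instead of decoding a long textual rationale token by token.
        plan, _ = self.cross_attn(queries, vl_features, vl_features)
        # Pool the latent plan and predict a low-level action.
        return self.action_head(plan.mean(dim=1))

if __name__ == "__main__":
    planner = LatentPlanner()
    feats = torch.randn(2, 196, 512)   # e.g., patch + text tokens from a VLM
    print(planner(feats).shape)        # torch.Size([2, 7])
```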
Fu-En (Fred) Yang retweeted
Danfei Xu @danfei_xu
Introducing EgoVerse: an ecosystem for robot learning from egocentric human data. Built and tested by 4 research labs + 3 industry partners, EgoVerse enables both science and scaling: 1300+ hrs, 240 scenes, 2000+ tasks, and growing.
Dataset design, findings, and ecosystem 🧵
28 replies · 144 reposts · 720 likes · 173.6K views
Fu-En (Fred) Yang retweeted
Yunzhu Li @YunzhuLiYZ
For a long time, I was skeptical about action-conditioned video prediction for robotics. Many models look impressive, but once you ask them to handle long-horizon manipulation with real physical interaction, things quickly fall apart (e.g., Genie is amazing but mostly focused on navigation). This project changed my mind.

I'm beyond excited to share Interactive World Simulator, a project we have been working on for the past ~1.5 years 🤖 One of the first world models that produces convincing results for long-horizon robotic manipulation involving complex physical interactions, across a diverse range of objects (rigid objects, deformables, ropes, object piles). It directly unlocks scalable data generation for robotic policy training and policy evaluation.

Try it yourself (no installation needed): yixuanwang.me/interactive_wo… Play directly with the simulator in your browser.

Key Takeaways:
1️⃣ 15 Hz long-horizon action-conditioned video prediction for 10+ minutes on a single RTX 4090 GPU
2️⃣ Visual and dynamic fidelity: people often ask how much sim data equals one real data point. In our experiments, it turns out to be close to one-to-one using the Interactive World Simulator
3️⃣ Stress testing matters: we emphasize interactive stress testing to understand robustness and stability and to build trust in the simulator
4️⃣ The model is trained with only ~6 hours of real-world random interaction data on a single GPU. Imagine what happens if we scale this 1000× or even 1M×

Huge credit to @YXWangBot, who led this effort with countless hours of work on data collection, training recipes, and system design. I'm incredibly proud of the work he did here! Enjoy the demos and videos. We also fully open-sourced the codebase for anyone interested in applying this to their own tasks. #Robotics #RobotLearning #WorldModels #EmbodiedAI
Yixuan Wang @YXWangBot

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.
2/ Today we’re releasing Interactive World Simulator: An action-conditioned world model that supports stable long-horizon interaction.
3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090🔥
4/ Why this matters: it unlocks two critical robotics applications: 🚀 Scalable data generation for policy training 🧪 Faithful policy evaluation
5/ You can play with our world model NOW at yixuanwang.me/interactive_wo… NO git clone, NO pip install, NO python. Just click and play!
NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera
More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning

2 replies · 51 reposts · 366 likes · 73.7K views
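The thread above centers on action-conditioned, autoregressive video prediction: each predicted frame is fed back as context for the next step. Below is a toy sketch of that rollout pattern under stand-in assumptions; `ToyFramePredictor`, the latent-frame representation, and all shapes are placeholders rather than the released Interactive World Simulator code.

```python
# Minimal sketch of the general pattern behind an action-conditioned world model
# rollout: predict the next latent frame from recent frames + action, feed it back.
# Toy stand-in only; the model, shapes, and names are placeholders.
import torch
import torch.nn as nn

class ToyFramePredictor(nn.Module):
    def __init__(self, frame_dim=256, action_dim=7, context=4):
        super().__init__()
        self.context = context
        self.net = nn.Sequential(
            nn.Linear(context * frame_dim + action_dim, 512),
            nn.GELU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frames, action):
        # frames: (batch, context, frame_dim) latent frames; action: (batch, action_dim)
        x = torch.cat([frames.flatten(1), action], dim=-1)
        return self.net(x)  # next latent frame: (batch, frame_dim)

@torch.no_grad()
def rollout(model, init_frames, actions):
    """Autoregressive rollout: each predicted frame joins the context window."""
    frames = list(init_frames.unbind(dim=1))
    preds = []
    for t in range(actions.shape[1]):
        ctx = torch.stack(frames[-model.context:], dim=1)
        nxt = model(ctx, actions[:, t])
        frames.append(nxt)
        preds.append(nxt)
    return torch.stack(preds, dim=1)   # (batch, T, frame_dim)

if __name__ == "__main__":
    model = ToyFramePredictor()
    init = torch.randn(1, 4, 256)      # 4 seed frames
    acts = torch.randn(1, 150, 7)      # ~10 s of actions at 15 Hz
    print(rollout(model, init, acts).shape)   # torch.Size([1, 150, 256])
```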
Fu-En (Fred) Yang retweeted
Sergey Levine @svlevine
We made a memory system for our models at PI. We call it Multi-Scale Embodied Memory (MEM). It provides both short-term and long-term memory to enable very long tasks. We tested it on cleaning a kitchen (and yes, washing dishes), making grilled cheese, and more.
9 replies · 40 reposts · 469 likes · 30.6K views
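As a loose illustration of the short-term plus long-term split mentioned above, here is a minimal two-scale memory sketch: a rolling window of recent observations plus compressed summaries of older chunks. This is an assumption-laden toy, not the MEM system described in the tweet; the `TwoScaleMemory` class and its string-based summarizer are invented for illustration.

```python
# Illustrative two-scale memory: recent events kept verbatim, older chunks folded
# into summaries. NOT the MEM system above; all names and logic are placeholders.
from collections import deque

class TwoScaleMemory:
    def __init__(self, short_capacity=32, chunk=32):
        self.short = deque(maxlen=short_capacity)  # recent observations, verbatim
        self.long = []                              # compressed summaries of old chunks
        self._pending = []
        self.chunk = chunk

    def add(self, observation: str) -> None:
        self.short.append(observation)
        self._pending.append(observation)
        # Once enough recent context accumulates, fold it into long-term memory.
        if len(self._pending) >= self.chunk:
            self.long.append(self._summarize(self._pending))
            self._pending = []

    def _summarize(self, observations) -> str:
        # Placeholder: a real system would summarize with a model; here we keep
        # only the first and last event of the chunk.
        return f"{observations[0]} ... {observations[-1]} ({len(observations)} steps)"

    def context(self) -> str:
        # What a policy would condition on: all summaries + the recent window.
        return "\n".join(self.long + list(self.short))

if __name__ == "__main__":
    mem = TwoScaleMemory(short_capacity=8, chunk=8)
    for t in range(40):
        mem.add(f"step {t}: washed dish {t}")
    print(mem.context())
```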
Fu-En (Fred) Yang retweeted
Yuke Zhu @yukez
Today, we publicly released RoboCasa365, a large-scale simulation benchmark for training and systematically evaluating generalist robot models. Built upon our original RoboCasa framework, it offers:
• 2,500 realistic kitchen environments;
• 365 everyday tasks (basic skills + long-horizon mobile manipulation);
• Over 3,200 objects with many articulated fixtures/appliances.
All are designed for fully controlled, reproducible benchmarking of robotic policies.

Progress in robotic foundation models is real. But it’s still hard to answer basic questions like: How close are we to general-purpose autonomy? What factors drive generalization? What are the model/data scaling curves like? Real-world eval is slow and noisy, and existing sims (like LIBERO, which we built 3 years ago) often lack sufficient task and scene diversity.

This benchmark comes with 2,200+ hours of demonstrations and 500K+ trajectories to support studies of multi-task training, pretraining, and continual learning at scale. Check it out at robocasa.ai
13 replies · 60 reposts · 338 likes · 20.4K views
Fu-En (Fred) Yang retweeted
Qwen @Alibaba_Qwen
🚀 Introducing the Qwen 3.5 Small Model Series
Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B
✨ More intelligence, less compute. These small models are built on the same Qwen3.5 foundation — native multimodal, improved architecture, scaled RL:
• 0.8B / 2B → tiny, fast, great for edge devices
• 4B → a surprisingly strong multimodal base for lightweight agents
• 9B → compact, but already closing the gap with much larger models
And yes — we’re also releasing the Base models. We hope this better supports research, experimentation, and real-world industrial innovation.
Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
918 replies · 2.9K reposts · 21.4K likes · 8.9M views
Fu-En (Fred) Yang retweeted
Min-Hung (Steve) Chen @CMHungSteven
Our paper is Oral at @wacv_official THIS WEEK! 🎉🚀🔥
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
Tired of detectors just shouting "🚨anomaly!" with zero insight? 😩 VADER levels up BIG:
✅ Describes exactly what happened
✅ Explains the causal why 🤔
✅ Reasons step-by-step on object dynamics & interactions like a video detective 🕵️✨
Powered by:
🌟 CAES — smart keyframe sampling to catch the full causal story 📸
🌟 CORE — contrastive encoder for evolving relations, temporal links & volatility ⚡
SOTA on HIVAU-70k & HAWK benchmarks 📈
🌐 Project page: vader-vau.github.io
See us live at WACV!
🗣️ Oral (Session 8B – Video Rec & Understanding II): Tue Mar 10, 13:30–14:30, AZ Ballroom 7
🖼️ Poster (Session 6): Tue Mar 10, 15:45–17:30, Tucson Ballroom
See you in Tucson! 🌵 #ComputerVision #AnomalyDetection #VideoUnderstanding #MultimodalAI #LLM #CausalAI #WACV2026
1 reply · 4 reposts · 57 likes · 3.6K views
Fu-En (Fred) Yang retweeted
Jiafei Duan @DJiafei
Instead of asking a VLM to output progress, our method reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce TOPReward: a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way of doing reward modeling for robotics! Project: topreward.github.io/webpage/ 🧵👇
12 replies · 66 reposts · 362 likes · 106K views
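The mechanism described above, reading a reward off token probabilities rather than generated text, can be illustrated in a few lines: renormalize the logits of the candidate answer tokens for a progress question and treat p(yes) as the reward. The sketch below is only that illustration; the prompt wording, token ids, and function names are assumptions, not TOPReward's implementation.

```python
# Hedged sketch of "reward from token probabilities": ask a video VLM a yes/no
# question about task progress and read its belief off the answer-token logits.
# Conceptual illustration only; prompt, token ids, and names are assumptions.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def yes_probability_reward(next_token_logits, yes_token_id, no_token_id):
    """Reward = p(yes) renormalized over just the {yes, no} answer tokens."""
    pair = np.array([next_token_logits[yes_token_id], next_token_logits[no_token_id]])
    return float(softmax(pair)[0])

if __name__ == "__main__":
    # Pretend these are the next-token logits a pretrained video VLM produced after
    # a prompt like: "Is the robot making progress on the task? Answer yes or no."
    rng = np.random.default_rng(0)
    logits = rng.normal(size=32000)     # fake vocabulary-sized logit vector
    YES_ID, NO_ID = 3869, 1217          # hypothetical tokenizer ids for " yes" / " no"
    logits[YES_ID] += 4.0               # the model leans toward "yes"
    print(yes_probability_reward(logits, YES_ID, NO_ID))  # prints a value close to 1.0
```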
Fu-En (Fred) Yang retweeted
Jim Fan @DrJimFan
What can half of GPT-1 do? We trained a 42M transformer called SONIC to control the body of a humanoid robot. It takes a remarkable amount of subconscious processing for us humans to squat, turn, crawl, sprint. SONIC captures this "System 1" - the fast, reactive whole-body intelligence - in a single model that translates any motion command into stable, natural motor signals. And it's all open-source!!

The key insight: motion tracking is the one, true scalable task for whole body control. Instead of hand-engineering rewards for every new skill, we use dense, frame-by-frame supervision from human mocap data. The data itself encodes the reward function: "configure your limbs in any human-like position while maintaining balance".

We scaled humanoid motion RL to an unprecedented scale: 100M+ mocap frames and 500,000+ parallel robots across 128 GPUs. NVIDIA Isaac Lab allows us to accelerate physics at 10,000x faster tick, giving robots many years of virtual experience in only hours of wall clock time. After 3 days of training, the neural net transfers zero-shot to the real G1 robot with no finetuning. 100% success rate across 50 diverse real-world motion sequences.

One SONIC policy supports all of the following:
- VR whole-body teleoperation
- Human video. Just point a webcam to live stream motions.
- Text prompts. "Walk sideways", "dance like a monkey", "kick your left foot", etc.
- Music audio. The robot dances to the beat, adapting to tempo and rhythm.
- VLA foundation models. We plugged in GR00T N1.5 and achieved 95% success on mobile tasks.

We open-source the code and model checkpoints!! Deep dive in thread:
87 replies · 217 reposts · 1.5K likes · 218.3K views
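The tweet above argues that mocap data itself encodes the reward: dense, per-frame supervision for matching a human-like pose while staying balanced. The sketch below shows one generic form such a tracking reward can take (exponentiated pose and root-height errors); the weights, scales, and function name are assumptions and not SONIC's actual reward terms.

```python
# Illustrative per-frame motion-tracking reward: compare the simulated robot's
# joint positions against the reference mocap frame, plus a simple balance term.
# Generic sketch, not SONIC's reward; weights and names are assumptions.
import numpy as np

def tracking_reward(robot_joints, ref_joints, robot_root_height, ref_root_height,
                    w_pose=1.0, w_balance=0.5):
    """Dense reward: match the mocap pose while roughly keeping the root height."""
    pose_err = np.sum((robot_joints - ref_joints) ** 2)
    balance_err = (robot_root_height - ref_root_height) ** 2
    return w_pose * np.exp(-2.0 * pose_err) + w_balance * np.exp(-10.0 * balance_err)

if __name__ == "__main__":
    ref = np.zeros(23)                      # reference joint angles for one mocap frame
    good = ref + 0.02 * np.ones(23)         # robot closely tracks the pose
    bad = ref + 0.5 * np.ones(23)           # robot is far from the pose
    print(tracking_reward(good, ref, 0.78, 0.80))   # high reward
    print(tracking_reward(bad, ref, 0.50, 0.80))    # low reward
```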
Fu-En (Fred) Yang retweeted
Qwen @Alibaba_Qwen
🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series.
🖼️ Native multimodal. Trained for real-world agents.
✨ Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling.
⚡ 8.6x–19.0x decoding throughput vs Qwen3-Max
🌍 201 languages & dialects
📜 Apache 2.0 licensed
🔗 Dive in:
GitHub: github.com/QwenLM/Qwen3.5
Chat: chat.qwen.ai
API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Qwen Code: github.com/QwenLM/qwen-co…
Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
Blog: qwen.ai/blog?id=qwen3.5
271 replies · 878 reposts · 5.4K likes · 1.3M views
Fu-En (Fred) Yang retweeted
Jiafei Duan @DJiafei
What if robots could think longer on harder problems without saying a single word? 🤔 We introduce RD-VLA (Recurrent-Depth VLA): a latent, iterative reasoning architecture for robot control.
❌ No Chain-of-Thought tokens.
❌ No extra memory overhead.
✅ Just reasoning—directly in latent space. 🧠🤖
Project page: rd-vla.github.io 👇🧵
4 replies · 16 reposts · 111 likes · 13.2K views
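RD-VLA's pitch above is reasoning in latent space by iterating depth rather than emitting chain-of-thought tokens. The sketch below illustrates that recurrent-depth pattern in its most generic form: one shared transformer block applied a variable number of times before an action readout. It is a placeholder illustration, not the RD-VLA architecture; names and dimensions are assumptions.

```python
# Minimal recurrent-depth sketch: instead of emitting chain-of-thought tokens,
# repeatedly apply one shared block to a latent state, so harder inputs can get
# more iterations at test time. Not RD-VLA's actual architecture.
import torch
import torch.nn as nn

class RecurrentDepthPolicy(nn.Module):
    def __init__(self, d_model=512, action_dim=7):
        super().__init__()
        # One block, reused: depth comes from iteration count, not parameter count.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=2048, batch_first=True
        )
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens, num_iterations=4):
        h = obs_tokens                      # (batch, seq, d_model) observation/goal tokens
        for _ in range(num_iterations):     # "think longer" = more latent iterations
            h = self.block(h)
        return self.action_head(h[:, -1])   # read the action off the last token

if __name__ == "__main__":
    policy = RecurrentDepthPolicy()
    obs = torch.randn(2, 64, 512)
    print(policy(obs, num_iterations=2).shape)   # torch.Size([2, 7])
    print(policy(obs, num_iterations=8).shape)   # same parameters, more "thinking"
```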
Fu-En (Fred) Yang retweeted
Waymo @Waymo
We’re excited to introduce the Waymo World Model—a frontier generative model for large-scale, hyper-realistic autonomous driving simulation built on @GoogleDeepMind’s Genie 3. By simulating the “impossible”, we proactively prepare the Waymo Driver for some of the rarest and most complex scenarios—from tornadoes to planes landing on freeways—long before it encounters them in the real world. waymo.com/blog/2026/02/t…
130 replies · 487 reposts · 4K likes · 992.8K views
Fu-En (Fred) Yang retweeted
DailyPapers @HuggingPapers
NVIDIA just released GR00T N1.6 DROID on Hugging Face: a Vision-Language-Action model for generalist humanoid robots. Achieves SOTA on simulation benchmarks and runs on the Fourier GR-1 robot.
1 reply · 19 reposts · 112 likes · 6.1K views
Fu-En (Fred) Yang retweeted
Joel Jang @jang_yoel
Introducing DreamZero 🤖🌎 from @nvidia
> A 14B “World Action Model” that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots
> The key? Jointly predicting video & actions in the same diffusion forward pass
Project Page: dreamzero0.github.io 🧵 (1/10)
18 replies · 48 reposts · 258 likes · 57.9K views
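The key idea named above, predicting video and actions in the same diffusion forward pass, can be sketched as one denoiser run over a concatenated sequence of video latents and action tokens. The toy module below shows only that structural idea; it omits timestep conditioning and the actual diffusion training loop, and every name and shape is an assumption rather than DreamZero's 14B model.

```python
# Rough sketch of "predict video and actions in the same forward pass": concatenate
# noisy video latents and noisy action tokens, run a single denoiser over both, and
# split the outputs. Conceptual toy only; all names, shapes, and the (omitted)
# noise schedule are assumptions.
import torch
import torch.nn as nn

class JointVideoActionDenoiser(nn.Module):
    def __init__(self, d_model=256, action_dim=7, num_frames=8):
        super().__init__()
        self.num_frames = num_frames
        self.action_in = nn.Linear(action_dim, d_model)
        self.action_out = nn.Linear(d_model, action_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )

    def forward(self, noisy_frames, noisy_actions):
        # noisy_frames: (b, num_frames, d_model) video latents
        # noisy_actions: (b, num_actions, action_dim) action chunk
        tokens = torch.cat([noisy_frames, self.action_in(noisy_actions)], dim=1)
        denoised = self.backbone(tokens)     # one pass attends across both modalities
        frames = denoised[:, : self.num_frames]
        actions = self.action_out(denoised[:, self.num_frames :])
        return frames, actions

if __name__ == "__main__":
    model = JointVideoActionDenoiser()
    f = torch.randn(1, 8, 256)   # noisy future-frame latents
    a = torch.randn(1, 8, 7)     # noisy future-action chunk
    frames, actions = model(f, a)
    print(frames.shape, actions.shape)   # (1, 8, 256) (1, 8, 7)
```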
Fu-En (Fred) Yang retweeted
Jim Fan @DrJimFan
New milestone: we trained a robot foundation model on a world model backbone, and enabled zero-shot, open-world prompting capability for new verbs, nouns, and environments. If the world model can "dream" the right future in pixels, then the robot can execute well in motors. We call it "DreamZero", our first World Action Model (WAM). Our team had tons of fun at the lab typing anything we like into an open text prompt, and watching the robot perform tasks it was never trained on. An emergent capability we didn't quite expect. Obviously not GPT-3 reliable yet, but we are marching into the GPT-2 era.

Discoveries:
- Model and data recipe co-evolve. Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity >> repetitions.
- X-embodiment is extremely hard. Pixels are the answer. Different robot morphologies traditionally have a hard time sharing knowledge well. But if we put video first, pixels become the universal bridge connecting different hardware - even videos of human first-person view. DreamZero shows significant robot2robot and human2robot transfer. With only 55 trajectories on a *new*, unseen hardware (~30 min of teleop), it adapts quickly and retains zero-shot prompting ability.

Yesterday I posted about the "Second Pre-training Paradigm": world models are the next-gen foundation of Physical AI, not language backbones. Today, we are proving it works. And 2026 has just begun.

Paper: World Action Models are Zero-Shot Policies. Read it now: (thread)
47 replies · 113 reposts · 602 likes · 57.9K views
Fu-En (Fred) Yang retweeted
DailyPapers @HuggingPapers
NVIDIA Fast-ThinkAct: efficient VLA reasoning framework that compresses lengthy chain-of-thought into compact latent tokens, achieving 9.3x faster inference while maintaining strong performance on embodied manipulation tasks.
4 replies · 24 reposts · 192 likes · 10.6K views