Hanjung Kim

75 posts

@KimD0ing

Research Scientist Intern @nvidia GEAR | Ph.D. student @ Yonsei University | prev. @nyuuniversity

Santa Clara, CA · Joined February 2023
289 Following · 179 Followers
Pinned Tweet
Hanjung Kim
Hanjung Kim@KimD0ing·
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a scalable method for learning universal, cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
4 replies · 30 reposts · 187 likes · 28.4K views
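For readers curious how a cross-embodiment skill representation like this could be consumed downstream, here is a minimal, hypothetical sketch (the names and interfaces are my own, not the UniSkill code): a skill vector is extracted from a pair of video frames, whether a human or a robot appears in them, and a robot policy is conditioned on that vector.

```python
# Hypothetical sketch, not the UniSkill implementation: extract an
# embodiment-agnostic skill vector from a frame pair, then condition a policy on it.
import numpy as np

def extract_skill(skill_encoder, frame_t, frame_t_plus_k):
    """skill_encoder maps (current frame, future frame) -> fixed-size skill vector.
    The frames can come from human or robot video; the encoder never sees actions."""
    return skill_encoder(np.stack([frame_t, frame_t_plus_k]))

def act(policy, robot_obs, skill):
    """A robot policy conditioned on the skill vector instead of a task label."""
    return policy(np.concatenate([robot_obs.ravel(), skill]))
```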
Hanjung Kim retweeted
Max Fu
Max Fu@letian_fu·
Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵
19 replies · 128 reposts · 628 likes · 153.3K views
Hanjung Kim retweeted
Irmak Guzey
Irmak Guzey@irmakkguzey·
Learning from human data requires human-like hardware. Humans use their wrists constantly, but table-top manipulators lack this flexibility. We build upon RUKA and introduce RUKA-v2: a tendon-driven hand with a 2-DOF wrist and finger abduction/adduction 👋✌️
7 replies · 28 reposts · 113 likes · 8.1K views
Hanjung Kim retweeted
Danfei Xu
Danfei Xu@danfei_xu·
Introducing EgoVerse: an ecosystem for robot learning from egocentric human data. Built and tested by 4 research labs + 3 industry partners, EgoVerse enables both science and scaling: 1300+ hrs, 240 scenes, 2000+ tasks, and growing. Dataset design, findings, and ecosystem 🧵
33 replies · 159 reposts · 821 likes · 230.4K views
Chan Hee (Luke) Song
Chan Hee (Luke) Song@luke_ch_song·
🎓I defended my PhD at @OhioState! Grateful to my advisor @ysu_nlp and all my collaborators along the way :) Excited to be starting at @nvidia (just in time for #NVIDIAGTC😆) and continuing my research on spatial intelligence in multimodal foundation models.
[image]
9 replies · 3 reposts · 89 likes · 4.2K views
Hanjung Kim retweeted
Ruijie Zheng
Ruijie Zheng@ruijie_zheng12·
Proud to introduce EgoScale: We pretrained a GR00T VLA model on 20K+ hours of egocentric human video and discovered that robot dexterity can be scaled, not with more robots, but with more human data. A thread on 🧵what we learned. 👇
24 replies · 65 reposts · 331 likes · 94.8K views
Hanjung Kim retweeted
Jim Fan
Jim Fan@DrJimFan·
We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.

We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.

Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.

The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
148 replies · 286 reposts · 1.8K likes · 275.7K views
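The scaling-law claim above is easy to sanity-check on one's own runs. Below is a minimal sketch (placeholder numbers, not the paper's data) of fitting a log-linear relationship between human-video hours and action-prediction loss and reporting R², the statistic quoted in the tweet.

```python
# Minimal sketch with made-up placeholder data, not the EgoScale results:
# fit loss ≈ a·log(hours) + b and report the R² of the fit.
import numpy as np

hours = np.array([500, 1000, 2500, 5000, 10000, 20000], dtype=float)  # hypothetical
loss = np.array([0.92, 0.85, 0.76, 0.69, 0.62, 0.55])                 # hypothetical

x = np.log(hours)
a, b = np.polyfit(x, loss, 1)
pred = a * x + b
r2 = 1 - np.sum((loss - pred) ** 2) / np.sum((loss - loss.mean()) ** 2)
print(f"loss ≈ {a:.3f}·log(hours) + {b:.3f},  R² = {r2:.3f}")
```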
Hanjung Kim retweeted
Jim Fan
Jim Fan@DrJimFan·
Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill.

Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible. Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning *inside* a dream.

We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, retry—across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains—at a scale no robot fleet could match.

The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first.

Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, but done through gradient descent and generalizes far beyond the post-training dataset.

A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities:
- Live teleoperation *inside* a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090.
- Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor.
- Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains +17% real-world success out of the box on a fruit packing task.

We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too. 2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:
82 replies · 176 reposts · 1.2K likes · 204.8K views
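The "model-based planning" bullet above is the classic sample-and-score loop. Here is a minimal, hypothetical sketch (the `world_model.rollout` and `score_fn` interfaces are assumptions, not DreamDojo's actual API) of sampling action proposals, imagining each future in the world model, and executing the best candidate's first action.

```python
# Hypothetical sketch of sampling-based planning with a learned world model;
# the interfaces below are assumptions, not DreamDojo's actual API.
import numpy as np

def plan(world_model, score_fn, obs, num_proposals=16, horizon=12, action_dim=7, seed=0):
    rng = np.random.default_rng(seed)
    best_score, best_plan = -np.inf, None
    for _ in range(num_proposals):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))  # random-shooting proposals
        imagined_frames = world_model.rollout(obs, actions)           # simulate the future in pixels
        score = score_fn(imagined_frames)                             # e.g., task progress on the last frame
        if score > best_score:
            best_score, best_plan = score, actions
    return best_plan[0]  # execute only the first action, then replan (MPC-style)
```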
Hanjung Kim retweeted
Zhengyi “Zen” Luo
Zhengyi “Zen” Luo@zhengyiluo·
SONIC is now open-source! Generalist whole-body teleoperation for EVERYONE! Our team has long been building comprehensive pipelines for whole-body control, kinematic planning, and teleoperation, and they will all be shared. This will be a continuous update; the inference code + model are already there, with training code and GR00T integration coming soon!
Code: github.com/NVlabs/GR00T-W…
Docs: nvlabs.github.io/GR00T-WholeBod…
Site: nvlabs.github.io/GEAR-SONIC/
35 replies · 202 reposts · 905 likes · 210.6K views
Hanjung Kim retweeted
Seonghyeon Ye
Seonghyeon Ye@SeonghyeonYe·
VLAs (from VLMs) ❌ => WAMs (from Video Models) ✅
Why WAMs?
1️⃣ World Physics: VLMs know the internet, but Video Models implicitly model the physical laws essential for manipulation.
2️⃣ The "GPT Direction": VLAs are like BERT (rely heavily on task-specific post-training). WAMs are like GPT (pre-train & prompt), unlocking incredible zero-shot transfer!
What I want to see in 2026:
📈 Scaling Laws: We will see much clearer scaling laws for robotics compared to VLAs.
🤝 Human-to-Robot Transfer: Unlocking massive transfer capabilities using video as a shared representation space.
🤖 Zero-Shot Mastery: Moving from short-horizon tasks to long-horizon, dexterous manipulation without task-specific demonstrations.
We recently open-sourced the checkpoints, training and inference code. Dive into the research! 👇
📄 Paper: arxiv.org/abs/2602.15922
💻 Code: github.com/dreamzero0/dre…
🤗 HF: huggingface.co/GEAR-Dreams/Dr…
[image]
5 replies · 64 reposts · 517 likes · 74.7K views
Hanjung Kim retweeted
Siddhant Haldar
Siddhant Haldar@haldar_siddhant·
Robot foundation models are limited by costly real data, while simulation data is plentiful but visually mismatched to reality. We present Point Bridge, a method that enables zero-shot sim-to-real transfer for robot learning with minimal visual alignment. pointbridge3d.github.io
4 replies · 41 reposts · 221 likes · 19.2K views
Hanjung Kim retweeted
Mahi Shafiullah 🏠🤖
Mahi Shafiullah 🏠🤖@notmahi·
Why buy a robot when you can build your own? Meet YOR, our new open-source bimanual mobile manipulator robot – built for researchers and hackers alike for only ~$10k. 🧵👇
7 replies · 22 reposts · 171 likes · 37.3K views
Hanjung Kim retweeted
Jeff Cui
Jeff Cui@jeffacce·
We don't need the name of an object to pick it up; we simply need to know where it is and what it looks like. Introducing Contact-Anchored Policies (CAPs): instead of language, we explicitly condition on contacts. Our policy learns object pickup with only 16 hours of data! 🧵
5 replies · 28 reposts · 111 likes · 12.7K views
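As a rough illustration of conditioning on contacts rather than language, here is a hypothetical sketch (not the CAP code; the feature sizes and names are made up): the desired contact point is simply appended to the policy's observation vector in place of a text embedding.

```python
# Hypothetical sketch, not the Contact-Anchored Policies implementation:
# feed a 3D contact point to the policy instead of a language embedding.
import numpy as np

def build_policy_input(visual_features, proprio, contact_point_xyz):
    """contact_point_xyz: where the gripper should make contact, in the robot frame."""
    return np.concatenate([visual_features, proprio, np.asarray(contact_point_xyz, dtype=float)])

obs = build_policy_input(np.zeros(512), np.zeros(7), (0.42, -0.10, 0.03))
print(obs.shape)  # (522,)
```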
Hongsuk Benjamin Choi
Hongsuk Benjamin Choi@redstone_hong·
Some exciting takeaways in addition to Brent's post:
• We show flow policies working for sim2real humanoid locomotion & motion tracking without distillation or shortcut models.
• The same recipe works for both from-scratch RL and BC → RL fine-tuning for manipulation---no bells and whistles.
Code will be released: github.com/amazon-far/fpo…
Brent Yi@brenthyi

New project! Flow Policy Gradients for Robot Control tldr; a simple online RL recipe for training and fine-tuning flow policies for robots co-led w/ @redstone_hong: hongsukchoi.github.io/fpo-control

3 replies · 22 reposts · 104 likes · 9.9K views
Hanjung Kim retweeted
Seonghyeon Ye
Seonghyeon Ye@SeonghyeonYe·
We just gave robots "imagination," and the results are wild. 🤯 This robot wasn't trained to untie shoes or shake hands. It's never seen these tasks before. It simply "dreams" the future outcome, then acts to make it real. 🧵👇
4 replies · 22 reposts · 83 likes · 16.2K views
Hanjung Kim retweeted
Joel Jang
Joel Jang@jang_yoel·
Introducing DreamZero 🤖🌎 from @nvidia > A 14B “World Action Model” that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots > The key? Jointly predicting video & actions in the same diffusion forward pass Project Page: dreamzero0.github.io 🧵 (1/10)
18 replies · 49 reposts · 262 likes · 59.6K views
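The "jointly predicting video & actions in the same diffusion forward pass" idea can be pictured as denoising one concatenated sequence. Below is a minimal toy sketch (the `denoiser` interface and the update rule are assumptions of mine, not DreamZero's sampler).

```python
# Hypothetical toy sketch, not DreamZero's sampler: video tokens and action
# tokens are concatenated and denoised together in a single forward pass.
import numpy as np

def joint_denoise_step(denoiser, video_tokens, action_tokens, t, step_size=0.1):
    x = np.concatenate([video_tokens, action_tokens], axis=0)  # one joint sequence
    eps_hat = denoiser(x, t)                                   # single forward pass predicts noise for both
    x = x - step_size * eps_hat                                 # toy update; real diffusion samplers differ
    n_video = video_tokens.shape[0]
    return x[:n_video], x[n_video:]                             # split back into video / action predictions
```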
Hanjung Kim retweeted
Moo Jin Kim
Moo Jin Kim@moo_jin_kim·
We release Cosmos Policy 💫: a state-of-the-art robot policy built on a video diffusion model backbone.
- policy + world model + value function — in 1 model
- no architectural changes to the base video model
- SOTA in LIBERO (98.5%), RoboCasa (67.1%), & ALOHA tasks (93.6%) 🧵👇
17 replies · 110 reposts · 868 likes · 147.2K views
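A rough way to picture "policy + world model + value function in one model" is a shared backbone with three heads. The sketch below is a hypothetical illustration of that structure, not Cosmos Policy's actual architecture.

```python
# Hypothetical sketch of one backbone with three heads; this illustrates the
# idea in the tweet, not Cosmos Policy's actual architecture.
def forward(backbone, heads, observation):
    features = backbone(observation)                      # shared video-model representation
    action = heads["policy"](features)                    # policy head: what to do next
    next_frames = heads["world_model"](features, action)  # world-model head: what will happen
    value = heads["value"](features)                      # value head: how good the state is
    return action, next_frames, value
```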
Ulyana Piterbarg
Ulyana Piterbarg@ulyanapiterbarg·
Very happy to share that I moved to the Bay Area and joined the Gemini team at @googledeepmind ! Grateful to be working with a great team on long horizons, RL for LLMs, and agents I'm looking forward to seeing old friends again and making new ones, DMs are open :)
[image]
34 replies · 8 reposts · 458 likes · 23.6K views
Hanjung Kim retweeted
Irmak Guzey
Irmak Guzey@irmakkguzey·
We just released AINA, a framework for learning robot policies from Aria 2 demos, and are now open-sourcing the code: github.com/facebookresear…. It includes:
✅ Aria 2 data processing into 3D observations, as shown
✅ Training of point-based policies
✅ Calibration
Give it a try!
GIF
4 replies · 32 reposts · 139 likes · 22.4K views