Michael Cho - Rbt/Acc

2.5K posts


@micoolcho

I ❤️ robots, cheap hardware, steam engines, XGBoost, Liverpool FC & SG 🇸🇬 | Plane crash survivor | Building @BitRobotNetwork @frodobots

Singapore · Joined May 2010
2.4K Following · 5.6K Followers
Pinned Tweet
Michael Cho - Rbt/Acc @micoolcho
Super excited to work alongside a crew of talented researchers/builders to help organize this yr's WBCD competition at ICRA 2026! This is everything u want from a robotics competition:
- real robots, real-world evals
- 3 concurrent locations (so u can still participate even if u aren't traveling to ICRA)
- Simulation & UMI data support (for those sim-pilled or UMI-pilled)
- $200k prize pool

Really looking forward to this! Learn more during our Info Session on 24th Jan PST. More deets here: wbcdcompetition.github.io
WBCD@WBCDCompetition

Calling for Competing Teams! Help push the frontier of Bimanual Robotics at the 2nd What Bimanuals Can Do (WBCD) at ICRA2026!
- 3 concurrent locations (Vienna, Shanghai, Bay Area)
- 3 real-world applications
- Simulation & UMI Support
- $200k Prize Pool

A thread 🧵

Shenyuan Gao @ShenyuanGao
I believe it's not about "which world model will win". The action conditions of world models can be multi-modal, and a future decision-making system will be a hierarchical framework. Starting from a global instruction, it will first perform high-level planning using a high-level action-conditioned world model based on text instructions (DreamZero can play this role). Once the subtask is determined and an initial action proposal is sampled, the optimization of motor controls will utilize a low-level action-conditioned world model (like DreamDojo) to obtain the optimal action for execution. Such a two-layer world modeling system can operate robustly in new environments.
Anirudha Majumdar@Majumdar_Ani

x.com/i/article/2033…

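The two-layer setup described in the post above can be made concrete. Below is a minimal sketch, assuming a text-conditioned high-level world model that decomposes an instruction into subtasks and an action-conditioned low-level world model used to score sampled motor commands; every class and function name here (HighLevelWorldModel, LowLevelWorldModel, plan_motor_actions) is a hypothetical placeholder, not a published API.

```python
# Hypothetical sketch of the two-layer world-model planning loop described in the post
# above. All names are placeholders; nothing here corresponds to DreamZero or DreamDojo.
import numpy as np


class HighLevelWorldModel:
    """Text-conditioned model: decomposes a global instruction into subtasks."""

    def propose_subtasks(self, instruction: str, observation) -> list[str]:
        # e.g. "tidy the desk" -> ["pick up the mug", "place the mug on the tray", ...]
        raise NotImplementedError


class LowLevelWorldModel:
    """Action-conditioned model: predicts the next latent state for a motor command."""

    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        raise NotImplementedError


def plan_motor_actions(wm_low, state, subtask_goal, n_candidates=64, horizon=16, action_dim=7):
    """Sample candidate action sequences, roll each through the low-level world model,
    and keep the sequence whose predicted end state is closest to the subtask goal."""
    best_actions, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s = state
        for a in actions:
            s = wm_low.predict(s, a)
        cost = float(np.linalg.norm(s - subtask_goal))
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions


def run(instruction, observation, wm_high, wm_low, encode, goal_of, execute):
    """High level plans subtasks; low level optimizes motor actions for each subtask."""
    for subtask in wm_high.propose_subtasks(instruction, observation):
        state = encode(observation)                 # current latent state
        actions = plan_motor_actions(wm_low, state, goal_of(subtask))
        observation = execute(actions)              # run on the robot, observe the result
```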
Elvis Nava @elvisnavah
@micoolcho I think the terminology here is a bit confusing (partly due to Nvidia insisting on using the "world model" label for their models). WAM really should be VAM (it's fundamentally a video model) and the "action conditioned" model is the actual world model
Michael Cho - Rbt/Acc @micoolcho
Humans know how to run long before they know how to form sophisticated language. Animals obviously don't have grammar and yet have crazy physical intelligence. So I'd guess language is not needed, and in fact could potentially be a distraction. It is, however, a decent local maximum given the progress we already have with giant VLMs built on the back of internet-scale datasets, and it's a nice UX for humans so that we can talk to the robots in natural language.
Jiafei Duan @DJiafei
Great article! One thing that feels true about scientific discovery is that we often do not know the right answer when it first appears. We only realize it later, once it becomes so useful that we start using it almost unconsciously in everyday life. For robotics, I often ponder: is the right path really to use LLMs as the backbone for generalist robot models, lifting everything into the semantic space of language? Or is it to condition action generation on video and world-model-style learning? Or is the real answer something else entirely?
Anirudha Majumdar@Majumdar_Ani

x.com/i/article/2033…

Michael Cho - Rbt/Acc @micoolcho
@ericjang11 Yup! Can't wait for them to come on our pod to share next wk! Any questions u think we sld ask them?
Eric Jang @ericjang11
Amazing work!
Li Yi@ericyi0124

Tennis is an extremely challenging sport — even for humans. Yet together with @GalbotRobotics, we managed to teach humanoid robots to play it and sustain long rallies with human players. More importantly, it precisely returns incoming balls while maintaining highly human-like motion. Still far from professional athletes, but closer than ever. Congrats to the leading students @Zhikai273 @josh00_lu @LianYunrui and all other collaborators. Project: zzk273.github.io/LATENT/

Michael Cho - Rbt/Acc @micoolcho
@ai @DvijKalaria @berkeley_ai Hope I get the chance to try this out one day! Here's Zhi Su sharing the details: x.com/i/status/19860…
RoboPapers@RoboPapers

How can we make a humanoid robot play table tennis? The robot must hit a moving ball and return it over and over again, requiring precise whole-body control. @ZhiSu22 tells us how he developed a hierarchical approach for planning and whole-body control that lets people play with a humanoid robot. Watch Episode #41 of RoboPapers with @micoolcho and @chris_j_paxton now!

anand iyer @ai
Went to @DvijKalaria's lab @berkeley_ai and played ping pong against his robot, Oreo. I'd played a ton of ping pong as a kid. This felt appropriately surreal and one of those "I wish I could tell my high school self about this" moments.

Table tennis is one of the harder sports for robots to play. The ball can move up to 30+ mph with heavy spin, the human opponent's intent is hidden, and the whole body has to coordinate. Oreo is a full humanoid holding a real paddle, and it learned key motions like swings by watching Dvij demonstrate. No robot-collected training data. One person shows the motion, the policy generalizes.

The way it works, as I understood it:
- A smart system (a hierarchical planner) first figures out where the ball is going to fly and picks the best type of hit, like a forehand or backhand swing.
- This plan then helps train the robot's "brain" (an RL policy) in a virtual simulation. The brain learns by trial and error, getting rewards when it mimics a few example moves.
- Once trained in the sim, the whole setup gets applied to the actual physical robot so it can play for real.

The human demonstrations are essentially the reference motions. They are building a robot that has watched more human table tennis than any human has, and uses that to develop its own game.

I still won. (Barely. But that won't last)
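As a rough illustration of the recipe in the post above (a hierarchical planner that picks the swing from the predicted ball flight, plus an RL policy trained in simulation with demonstration-mimicking rewards), here is a toy sketch; the ballistic model, swing-selection rule, and reward weights are illustrative assumptions, not the actual system behind Oreo.

```python
# Toy sketch of the described pipeline: predict the ball's flight, pick a swing type,
# and shape the RL reward with a term that mimics human demonstrations.
# The ballistic model, thresholds, and weights are illustrative assumptions only.
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])


def predict_ball_trajectory(pos, vel, dt=0.005, steps=400):
    """Crude ballistic rollout of the ball (ignores spin and air drag)."""
    traj = []
    for _ in range(steps):
        vel = vel + GRAVITY * dt
        pos = pos + vel * dt
        traj.append(pos.copy())
    return np.array(traj)


def choose_swing(traj, robot_plane_y=0.0):
    """High-level planner: pick the hit point and a swing type (forehand/backhand)."""
    idx = int(np.argmin(np.abs(traj[:, 1] - robot_plane_y)))  # where the ball crosses the robot
    hit_point = traj[idx]
    swing = "forehand" if hit_point[0] > 0 else "backhand"
    return swing, hit_point


def shaped_reward(robot_pose, demo_pose, task_reward, w_mimic=0.5):
    """RL reward used in simulation: task success plus closeness to the human demo motion."""
    mimic = np.exp(-np.linalg.norm(robot_pose - demo_pose))
    return task_reward + w_mimic * mimic
```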
Michael Cho - Rbt/Acc @micoolcho
World models are everywhere you look because everyone uses the term for many different things. Great clarification post from @zhuokaiz!
Zhuokai Zhao@zhuokaiz

AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models. But almost nobody means the same thing by that term. Here are, in my view, five categories of world models.

---

1. Joint Embedding Predictive Architecture (JEPA)
Representatives: AMI Labs (@ylecun), V-JEPA 2

The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years — predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.

Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space — not in pixel space. This is a crucial design choice. A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.

V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction — no labels, no text. The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.

The data efficiency is the real technical headline. 62 hours is almost nothing. It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design — if your representations are good enough, you don't need to brute-force every task from scratch.

AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet — their CEO has openly said commercial products could be years away.

---

2. Spatial Intelligence (3D World Models)
Representative: World Labs (@drfeifei)

Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?" The thesis is that true understanding requires explicit spatial structure — geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints — not just temporal prediction. This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.

Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word — unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.

For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs. The challenge is maintaining physical plausibility — consistent geometry, reasonable lighting, sensible occlusion — across a generated world that never existed.

---

3. Learned Simulation (Generative Video + Latent-Space RL)
Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1

This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.

The video generation lineage. DeepMind's Genie 3 is the purest version — text prompt in, navigable environment out, 24 fps at 720p, with consistency for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory — turn away from a tree and turn back, and it needs to still be there. DeepMind reports consistency up to about a minute, which is impressive but still far from what you'd need for sustained agent training.

Runway's GWM-1 takes a similar foundation — autoregressive frame prediction built on Gen-4.5 — but splits into three products: Worlds, Robotics, and Avatars. The split into Worlds / Avatars / Robotics suggests the practical generality problem is still being decomposed by action space and use case.

The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning. Dreamer V3 was the first AI to get diamonds in Minecraft without human data. Dreamer 4 did the same purely offline — no environment interaction at all. Architecturally, Dreamer 4 moves from Dreamer's earlier recurrent-style lineage to a more scalable transformer-based world-model recipe, and introduced "shortcut forcing" — a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.

These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies. But Dreamer 4 blurred the line — humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents. The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.

The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics — the real world is messier.

---

4. Physical AI Infrastructure (Simulation Platform)
Representative: NVIDIA Cosmos

NVIDIA's play is don't build the world model, build the platform everyone else uses to build theirs. Cosmos launched at CES January 2025 and covers the full stack — data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices.

The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data. They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.

Three model families sit on top of this. Predict generates future video states from text, image, or video inputs — essentially video forecasting that can be post-trained for specific robot or driving scenarios. Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI — your model works great in simulation but breaks in the real world due to visual and dynamics gaps. Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes — spatiotemporal awareness, causal understanding of interactions, video Q&A.

---

5. Active Inference
Representative: VERSES AI (Karl Friston)

This is the outlier on the list — not from the deep learning tradition at all, but from computational neuroscience. Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise). Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.

VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation. The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations. Inference is Bayesian — beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.

In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining — you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.

They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.

---

imo, these five categories aren't really competing — they're solving different sub-problems. JEPA compresses physical understanding. Spatial intelligence reconstructs 3D structure. Learned simulation trains agents through generated experience. NVIDIA provides the picks and shovels. Active inference offers a fundamentally different computational theory of intelligence. My guess is the lines between them blur fast.
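To make the "optimize a policy in imagination" idea from the Dreamer discussion above more concrete, here is a minimal sketch: encode into a latent state, roll the learned latent dynamics forward under the current policy, and update the policy by backpropagating through the model's predicted rewards. The toy placeholder networks, dimensions, and single-rollout update are assumptions for illustration; this is not DreamerV3/4's actual recipe (no value function, no shortcut forcing).

```python
# Illustrative sketch (not DreamerV3/4's actual code) of training a policy purely in
# imagination: roll the learned latent dynamics forward and backprop through its rewards.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon, gamma = 32, 4, 15, 0.99

# Placeholder learned components (in practice these are trained from real trajectories).
dynamics = nn.GRUCell(action_dim, latent_dim)        # latent transition model
reward_head = nn.Linear(latent_dim, 1)               # predicts reward from a latent state
policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)


def imagine_and_update(z0):
    """One policy update from a single imagined rollout starting at latent state z0."""
    z, imagined_return = z0, 0.0
    for t in range(horizon):
        a = torch.tanh(policy(z))                     # act inside the model, not the real env
        z = dynamics(a, z)                            # predicted next latent state
        imagined_return = imagined_return + (gamma ** t) * reward_head(z)
    loss = -imagined_return.mean()                    # maximize the imagined return
    optimizer.zero_grad()
    loss.backward()                                   # backprop through the model's predictions
    optimizer.step()
    return float(loss)


imagine_and_update(torch.zeros(1, latent_dim))
```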

Michael Cho - Rbt/Acc @micoolcho
Better action tokenization may not sound as sexy as world models, but it is every bit as needed & impactful across many robotics projects. Tks @liu730chaoqi for sharing!
RoboPapers@RoboPapers

How should we represent robot actions for autoregressive transformers? Most robot policies use diffusion or flow to generate continuous action sequences, but this isn’t how large language models work; they predict output tokens, which has many advantages. But coming up with a set of useful action tokens, so we can skip the slow and expensive diffusion steps, is difficult. @liu730chaoqi says action tokens need three qualities: reasonable compression, universal decodability, and a left-to-right causally ordered token space, and he proposes Ordered Action Tokenization as a solution to all three. Watch Episode 66 of RoboPapers now, with @micoolcho and @chris_j_paxton, to learn more!

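As a toy illustration of the general problem discussed above, turning continuous action chunks into a discrete, left-to-right token stream that an autoregressive transformer can predict and that decodes back to actions, here is a simple per-dimension binning tokenizer. It is only a sketch of the interface; the bin count and action range are assumptions, and this is not the OAT method itself.

```python
# Toy sketch: continuous robot actions in, a causally ordered token stream out, and back.
# Simple per-dimension binning for illustration only, NOT the actual OAT algorithm.
import numpy as np

N_BINS = 256           # vocabulary size per action dimension (assumed)
LOW, HIGH = -1.0, 1.0  # assumed normalized action range


def tokenize(action_chunk: np.ndarray) -> list[int]:
    """Map a (T, D) chunk of continuous actions to a flat, ordered list of token ids."""
    clipped = np.clip(action_chunk, LOW, HIGH)
    ids = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)
    return ids.flatten().tolist()     # time-major order gives a left-to-right, causal stream


def detokenize(tokens: list[int], dim: int) -> np.ndarray:
    """Inverse map: a complete token list decodes back to a (T, D) action chunk."""
    ids = np.array(tokens).reshape(-1, dim)
    return ids / (N_BINS - 1) * (HIGH - LOW) + LOW


chunk = np.random.uniform(-1, 1, size=(8, 7))   # 8 timesteps of a 7-DoF action
assert np.allclose(detokenize(tokenize(chunk), 7), chunk, atol=(HIGH - LOW) / (N_BINS - 1))
```

Real ordered tokenizations aim for far better compression and decodability properties than naive binning; the point here is only the interface: an ordered discrete token stream in place of continuous action chunks.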
Michael Cho - Rbt/Acc @micoolcho
@DJiafei Hilarious! Exactly my comment in ur original thread...glad someone else went to try this out
Michael Cho - Rbt/Acc @micoolcho
This is awesome! Someone scraped our @RoboPapers episodes and created a proper list of "pain points" in robotics.
anand iyer@ai

Haptic scraped all 64 episodes of @RoboPapers (@chris_j_paxton + @micoolcho) and ranked every pain point in physical AI research. The top 10, by mention frequency:
1. Scalable data collection
2. Generalization / zero-shot robustness
3. Dexterous manipulation
4. Teleoperation / whole-body data
5. Sim-to-real transfer
6. Evaluation / benchmarking
7. VLAs / foundation models for control
8. Human video to robot transfer
9. Long-horizon memory
10. RL scaling / offline-to-online

Code keeps getting cheaper. Atoms stay expensive. That's the entire startup opportunity in physical AI right now. hapticlabs.ai/blog/2026/03/0…

Chris Paxton @chris_j_paxton
Even though this is teleoperated, I feel like a lot of folks don't understand how hard this is to do so well. Picking and placing little objects, carrying boxes, and plugging in a blender, all with one bot.
Chaoqi Liu @liu730chaoqi
My recent work ordered-action-tokenization.github.io (OAT) was featured on @RoboPapers, where I discussed the motivation and ideas behind the project with @micoolcho and @chris_j_paxton, as well as my perspective on action token modelability and what it means to make robot actions more “language-like” for large models. The full episode will be released soon!
RoboPapers@RoboPapers

Full episode dropping soon! Geeking out with @liu730chaoqi on OAT: Ordered Action Tokenization ordered-action-tokenization.github.io Co-hosted by @micoolcho @chris_j_paxton
