Nate Gillman
@GillmanLab

ML researcher, interning @Google, PhD-ing @BrownUniversity. I train deep generative models.

164 posts · Joined August 2021 · 453 Following · 810 Followers
Pinned Tweet
Nate Gillman @GillmanLab
Ever wish you could tell a video model what to achieve, rather than just how to move? Introducing our CVPR 2026 paper, Goal Force! Instead of simulating a direct push, our model plans the entire causal chain (the "how") to achieve your specified goal (the "what"). 🧵(1/n)
3 replies · 15 reposts · 64 likes · 9.1K views
Nate Gillman @GillmanLab
We've released our code (Wan2.2+ControlNet), synthetic training datasets, and model weights to help build the next generation of these physically-aware interactive world models. Explore the code and try the interactive demos on our project page! goal-force.github.io (n/n)
0 replies · 0 reposts · 3 likes · 210 views
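For readers unfamiliar with the ControlNet half of that stack, here is a minimal PyTorch sketch of the generic ControlNet pattern, not the actual Goal Force code: trainable copies of backbone blocks read a conditioning signal (the per-pixel "goal map" here is a hypothetical stand-in) and inject residuals into the frozen backbone through zero-initialized convolutions, so training starts as a no-op.

```python
import copy
import torch.nn as nn

def zero_module(m: nn.Module) -> nn.Module:
    # Zero-init so the control branch contributes nothing at step 0.
    for p in m.parameters():
        nn.init.zeros_(p)
    return m

class ControlBranch(nn.Module):
    """Generic ControlNet pattern (a sketch, not Goal Force itself):
    trainable copies of the backbone blocks consume the conditioning
    signal and feed residuals back through zero-initialized 1x1 convs."""

    def __init__(self, backbone_blocks, cond_channels, feat_channels):
        super().__init__()
        self.cond_in = nn.Conv2d(cond_channels, feat_channels, 1)
        self.blocks = nn.ModuleList(copy.deepcopy(b) for b in backbone_blocks)
        self.zero_convs = nn.ModuleList(
            zero_module(nn.Conv2d(feat_channels, feat_channels, 1))
            for _ in backbone_blocks
        )

    def forward(self, feats, cond):
        # feats: features from the frozen backbone's input stem
        # cond:  conditioning map, e.g. a hypothetical goal mask
        h = feats + self.cond_in(cond)
        residuals = []
        for block, zconv in zip(self.blocks, self.zero_convs):
            h = block(h)
            residuals.append(zconv(h))  # exactly zero at initialization
        return residuals  # added to the frozen backbone's feature maps
```

The zero-initialization is the key design choice: the pretrained video model's behavior is untouched until the control branch has actually learned something.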
Nate Gillman retweeted
Oscar Michel @ojmichel4
📢Current world models aren't really modeling the world; they're modeling one agent's view of it. Partial observations ≠ world state. Future world models will be independent of any one agent's perspective. You will be able to “drop in” any number of agents at any point in time, and a persistent world state will evolve with their interactions. Imagine a neural MMORPG server. 🧵[1/10]
13 replies · 87 reposts · 613 likes · 123.1K views
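As a toy illustration of the separation the thread argues for (everything below is hypothetical, not from the thread): the world state lives in one place, and each agent only ever receives a partial rendering of it.

```python
from dataclasses import dataclass, field

def transition(state: dict, agents: dict) -> dict:
    # Placeholder dynamics; a learned world model would go here.
    return {**state, "tick": state.get("tick", 0) + 1}

def partial_view(state: dict, pose) -> dict:
    # Placeholder egocentric rendering: a slice, never the whole state.
    return {"tick": state.get("tick", 0), "pose": pose}

@dataclass
class PersistentWorld:
    # One shared, persistent state; agents join without resetting it.
    state: dict = field(default_factory=dict)
    agents: dict = field(default_factory=dict)

    def join(self, agent_id: str, pose=(0.0, 0.0)) -> None:
        self.agents[agent_id] = pose  # "drop in" at any point in time

    def step(self, actions: dict) -> None:
        # One transition consumes every agent's action at once:
        # state_{t+1} = f(state_t, all agents), not per-agent rollouts.
        for aid, (dx, dy) in actions.items():
            x, y = self.agents[aid]
            self.agents[aid] = (x + dx, y + dy)
        self.state = transition(self.state, self.agents)

    def observe(self, agent_id: str) -> dict:
        # Partial observation != world state.
        return partial_view(self.state, self.agents[agent_id])

world = PersistentWorld()
world.join("a"); world.join("b", pose=(5.0, 5.0))  # two agents drop in
world.step({"a": (1.0, 0.0), "b": (0.0, -1.0)})
print(world.observe("a"), world.observe("b"))      # two views, one state
```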
Nate Gillman retweeted
Jim Fan @DrJimFan
We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.

We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.

Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.

The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
145 replies · 282 reposts · 1.7K likes · 268.7K views
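The scaling-law claim is easy to state concretely. Here is a minimal sketch of fitting such a log-linear law; the data points are made up for illustration, and the claimed R² = 0.998 comes from their real dataset, not this toy.

```python
import numpy as np

# Hypothetical data: human-video hours vs. action-prediction loss.
hours = np.array([100, 500, 1_000, 5_000, 10_000, 20_000])
loss = np.array([0.92, 0.78, 0.71, 0.57, 0.51, 0.45])

# Log-linear scaling law: loss ~ a + b * log(hours).
X = np.log(hours)
b, a = np.polyfit(X, loss, 1)  # slope first, then intercept

pred = a + b * X
ss_res = np.sum((loss - pred) ** 2)
ss_tot = np.sum((loss - loss.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"loss = {a:.3f} + {b:.3f} * ln(hours),  R^2 = {r2:.3f}")
```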
Nate Gillman retweeted
Xun Huang @xxunhuang
"Promptable event" was a low-hanging fruit for conditioning video world models. But text isn’t a great interface for real-time control: typing itself is too slow and can't keep up with real-time world progression. We need to focus on conditions that actually arrive in real time.
3 replies · 4 reposts · 79 likes · 5.4K views
Nate Gillman retweeted
Sander Dieleman @sedielem
Latent representations are pervasive in modern generative modelling at scale, because iterative refinement in a compact latent space is much more cost-effective, and latents can be decoded to pixels in a single forward pass.

...but what if your generative model itself only needs one step?🤔 Then the trade-off could change!

Pixel Mean Flows apply the recent "improved Mean Flow" formulation for learning flow maps (iMF, arxiv.org/abs/2512.02012) directly to pixels. The authors reparameterise the neural network to make predictions in input space, as opposed to velocity space, in order to deal with the larger number of input dimensions -- much like JiT (arxiv.org/abs/2511.13720) and BOOT (arxiv.org/abs/2306.05544).

This yields a very elegant algorithm for training a one-step generative model in pixel space from scratch👌

Paper: arxiv.org/abs/2601.22158
[media attached]
9 replies · 74 reposts · 509 likes · 35.3K views
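To make the reparameterisation concrete: under one common linear-interpolant convention (an assumption here; the papers' exact conventions may differ), a network that predicts the clean input induces a velocity estimate.

```latex
% Assumed convention: x_t = (1-t)\,x + t\,\varepsilon (data x, noise eps),
% so the conditional velocity is v = \varepsilon - x = (x_t - x)/t.
% An input-space network \hat{x}_\theta therefore induces
\hat{v}_\theta(x_t, t) \;=\; \frac{x_t - \hat{x}_\theta(x_t, t)}{t}
% i.e. the model predicts at the scale of the pixels themselves rather
% than in velocity space; iMF applies the analogous reparameterisation
% to the average velocity u(x_t, r, t) that Mean Flows learn.
```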
Nate Gillman retweeted
Google DeepMind @GoogleDeepMind
Step inside Project Genie: our experimental research prototype that lets you create, edit, and explore virtual worlds. 🌎
982 replies · 4.3K reposts · 34.7K likes · 13.4M views
Nate Gillman retweeted
Yinghao Xu @YinghaoXu1
🔥 Very excited to share that we're releasing LingBot-World 🌍 @robbyant_brain — an open-source frontier world model!

We're pushing the limits of:
🔹 High-Fidelity Simulation & Precise Control
🔹 Long-Horizon Consistency & Memory
🔹 Modeling Physical & Game Worlds

The most surprising part? The emergence of sophisticated behaviors that go beyond simple video generation.

👇 I'm obsessed with this dragon demo 🐉. It can roll out for 1 min while maintaining crisp visual dynamics and consistent memory!
10 replies · 54 reposts · 341 likes · 40.8K views
Nate Gillman retweeted
Moo Jin Kim @moo_jin_kim
We release Cosmos Policy 💫: a state-of-the-art robot policy built on a video diffusion model backbone.
- policy + world model + value function — in 1 model
- no architectural changes to the base video model
- SOTA in LIBERO (98.5%), RoboCasa (67.1%), & ALOHA tasks (93.6%)
🧵👇
17 replies · 110 reposts · 868 likes · 145.9K views
Nate Gillman retweeted
Peter Tong @TongPetersb
Last October, we introduced Representation Autoencoders (RAE), showing that training diffusion on frozen semantic representations works and outperforms VAEs on ImageNet. We received many questions: Can this scale to complex settings like T2I? Do the advantages hold? The answer is YES. 🧵
[media attached]
9 replies · 85 reposts · 444 likes · 101.3K views
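A rough sketch of the RAE recipe as the tweet describes it, with every module name and size below a placeholder stand-in: freeze a pretrained encoder, train the generative model on its latents, and train a decoder back to pixels.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained semantic encoder (RAE's point is
# that this replaces the usual VAE; the real encoder is much larger).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256)).eval()
for p in encoder.parameters():
    p.requires_grad_(False)  # frozen: only decoder + denoiser train

decoder = nn.Linear(256, 3 * 32 * 32)  # latents -> pixels
denoiser = nn.Sequential(nn.Linear(256 + 1, 512), nn.SiLU(), nn.Linear(512, 256))

def flow_matching_loss(z: torch.Tensor) -> torch.Tensor:
    # Train the generative model on frozen latents z, not VAE codes.
    t = torch.rand(z.size(0), 1)
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise   # linear interpolant
    target_v = noise - z            # conditional velocity
    pred_v = denoiser(torch.cat([z_t, t], dim=1))
    return ((pred_v - target_v) ** 2).mean()

x = torch.randn(8, 3, 32, 32)       # dummy image batch
with torch.no_grad():
    z = encoder(x)                  # frozen semantic latents
loss = flow_matching_loss(z) + ((decoder(z) - x.flatten(1)) ** 2).mean()
loss.backward()
```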
Nate Gillman retweeted
Runway @runwayml
Introducing Image to Video for Gen-4.5, the world's best video model. Built for longer stories. Precise camera control. Coherent narratives. And characters that stay consistent. Gen-4.5 Image to Video is available now for all paid plans.
228 replies · 475 reposts · 4.1K likes · 747.3K views
Nate Gillman retweeted
Xindi Wu @cindy_x_wu
New #NVIDIA Paper

We introduce Motive, a motion-centric, gradient-based data attribution method that traces which training videos help or hurt video generation. By isolating temporal dynamics from static appearance, Motive identifies which training videos shape motion in video generation.

🔗 research.nvidia.com/labs/sil/proje… 1/10
11 replies · 112 reposts · 541 likes · 72.9K views
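The tweet doesn't spell out Motive's mechanics, but the general family is easy to sketch: TracIn-style gradient attribution scores a training example by the dot product of its loss gradient with a test example's gradient (positive means "helps", negative means "hurts"). A toy version with a stand-in model, not Motive's actual method:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)  # stand-in; Motive attributes a video generator
loss_fn = nn.MSELoss()

def grad_vector(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Flattened gradient of the loss at one example.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# TracIn-style influence: <grad(train example), grad(test example)>.
x_test, y_test = torch.randn(1, 16), torch.randn(1, 1)
g_test = grad_vector(x_test, y_test)

for i in range(5):  # dummy "training videos"
    x_tr, y_tr = torch.randn(1, 16), torch.randn(1, 1)
    influence = torch.dot(grad_vector(x_tr, y_tr), g_test).item()
    print(f"train clip {i}: influence = {influence:+.4f}")
```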