Nate Gillman
168 posts

Nate Gillman
@GillmanLab
ML researcher, interning @Google, PhD-ing @BrownUniversity. I train deep generative models
Joined August 2021
458 Following · 807 Followers
Nate Gillman reposted

2/
Check out how Gemini 3.5 Flash instantly digests dense academic papers and autonomously codes a fully interactive, visual website explaining the intricacies of the research. It's an incredible stress test that seamlessly merges massive long context, deep reasoning, complex coding, and ultra-low latency.
It really helps you distill papers down to their essence and aids your understanding!
Nate Gillman reposted

My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks.
It's the longest one yet👀 Let me know what you think!
sander.ai/2026/05/06/flo…
Nate Gillman reposted

Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space.
Now it is 0.75, and can be even lower.
Many wonder how.
I thought it might end as a small FID prank: simple and deliberate.
It started with one question: can FID be optimized directly, and what does it reveal?
Introducing FD-loss.
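
For context, here is a minimal sketch of the standard Fréchet distance between two Gaussian-fitted feature sets, the quantity FID computes on Inception features. It is not the FD-loss from the post, whose details are not given here; the function name `frechet_distance` and the random features are illustrative assumptions.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Made-up features for illustration only (real FID uses Inception activations).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + 3.0  # same covariance, every mean component moved by 3

d_same = frechet_distance(real, real)
d_shift = frechet_distance(real, shifted)
print(d_same, d_shift)
```

With identical covariances only the mean term survives, so the shifted set sits at a distance of about 4 × 3² = 36 from the original.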

Nate Gillman reposted

vision🍌 is here vision-banana.github.io
if you got into computer vision the way I did, starting with pixel-level labeling tasks like segmentation, edges, depth, or surface normals, you’ll probably feel the same seeing these results -- something big has quietly shifted, and it’s going to change how we approach these problems for good 🧵

We've released our code (Wan2.2+ControlNet), synthetic training datasets, and model weights, to help build the next generation of these physically-aware interactive world models.
Explore the code and try the interactive demos on our project page!
goal-force.github.io (n/n)

This is a joint project with @zitian_tang @dakshces @mik3fr33man + Yinghua, Evan, Arjan, Charles, and advised by @jesu9 at @BrownUniversity. Collaboration between Brown and @Cornell (9/n)
Nate Gillman reposted

📢Current world models aren't really modeling the world; they're modeling one agent's view of it. Partial observations ≠ world state.
Future world models will be independent of any one agent's perspective. You will be able to “drop in” any number of agents at any point in time, and a persistent world state will evolve with their interactions. Imagine a neural MMORPG server. 🧵[1/10]
Nate Gillman reposted

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop.
Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.
Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.
Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.
The scalable path to robot dexterity was never more robots. It was always us.
Deep dives in thread:
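
As a side note on the log-linear scaling law quoted in the post above, this is a minimal sketch of how such a fit and its R² could be computed. The helper `fit_log_linear` is hypothetical and the data points are synthetic, made up purely for illustration; they are not EgoScale measurements.

```python
import numpy as np

def fit_log_linear(hours, loss):
    """Least-squares fit of loss = a + b * log10(hours); returns (a, b, r2)."""
    x = np.log10(np.asarray(hours, dtype=float))
    y = np.asarray(loss, dtype=float)
    b, a = np.polyfit(x, y, 1)  # polyfit returns slope first, then intercept
    pred = a + b * x
    ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic numbers (assumption): loss falls linearly in log10(hours), plus small noise.
rng = np.random.default_rng(0)
hours = np.array([1e3, 2e3, 5e3, 1e4, 2e4])
loss = 1.0 - 0.15 * np.log10(hours) + rng.normal(0.0, 0.002, size=hours.size)

a, b, r2 = fit_log_linear(hours, loss)
print(f"intercept={a:.3f} slope={b:.3f} R^2={r2:.4f}")
```

An R² close to 1 on the log-scaled axis means each doubling of data volume lowers the loss by a roughly constant amount.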
Nate Gillman reposted

Latent representations are pervasive in modern generative modelling at scale, because iterative refinement in a compact latent space is much more cost-effective, and latents can be decoded to pixels in a single forward pass.
...but what if your generative model itself only needs one step?🤔 Then the trade-off could change!
Pixel Mean Flows apply the recent "improved Mean Flow" formulation for learning flow maps (iMF, arxiv.org/abs/2512.02012) directly to pixels. The authors reparameterise the neural network to make predictions in input space, as opposed to velocity space, in order to deal with the larger number of input dimensions -- much like JiT (arxiv.org/abs/2511.13720) and BOOT (arxiv.org/abs/2306.05544).
This yields a very elegant algorithm for training a one-step generative model in pixel space from scratch👌
Paper: arxiv.org/abs/2601.22158


Nate Gillman reposted

🔥 Very excited to share that we’re releasing LingBot-World 🌍 @robbyant_brain — an open-source frontier world model!
We’re pushing the limits of:
🔹 High-Fidelity Simulation & Precise Control
🔹 Long-Horizon Consistency & Memory
🔹 Modeling Physical & Game Worlds
The most surprising part? The emergence of sophisticated behaviors that go beyond simple video generation.
👇I’m obsessed with this dragon demo 🐉. It can roll out for 1 min while maintaining crisp visual dynamics and consistent memory!

