

Andy Zeng
@andyzengineer
Building robot foundation models @GeneralistAI. Prev @GoogleDeepMind, PhD @Princeton. One experiment away from magic. ✗DMs → email



More pretraining improves GEN-0 real-robot performance (via blind A/B evals with closed-loop rollouts). Improvements are significant in the low-data regime, but the best models thrive with both pretraining and ample post-training. See blog addendum: generalistai.com/blog/nov-04-20…

ok actually i think this is probably the most underappreciated part. these guys are serious about scaling. it’s not just talk.


Introducing GEN-0, our latest 10B+ foundation model for robots
⏱️ built on Harmonic Reasoning, a new architecture that can think & act seamlessly
📈 strong scaling laws: more pretraining & model size = better
🌍 unprecedented corpus of 270,000+ hrs of dexterous data
Read more 👇


Super excited to finally share our work on “Self-Improving Embodied Foundation Models”!! (Also accepted at NeurIPS 2025)
• Online on-robot Self-Improvement
• Self-predicted rewards and success detection
• Orders-of-magnitude sample-efficiency gains compared to SFT alone
• Generalization enables novel skill acquisition
🧵👇 [1/11]
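The self-improvement loop the thread describes can be caricatured as: collect rollouts on-robot, let the model itself judge success (no external reward instrumentation), and reuse only self-labeled successes for further training. The sketch below is a hypothetical stand-in, not the paper's algorithm; `rollout` and `self_predicted_success` are invented stubs:

```python
import random

random.seed(0)

def rollout(policy_noise):
    """Stub episode: returns a short trajectory of scalar actions."""
    return [random.gauss(0.0, policy_noise) for _ in range(5)]

def self_predicted_success(traj):
    # Stand-in for the model's own success detector: in the real system
    # success is predicted from observations, not instrumented externally.
    return sum(abs(x) for x in traj) < 3.0

buffer = []                            # self-labeled successes kept for fine-tuning
for episode in range(50):
    traj = rollout(policy_noise=1.0)
    if self_predicted_success(traj):   # the model judges its own attempt
        buffer.append(traj)

print(f"kept {len(buffer)} / 50 self-labeled successes")
```

The key design point is that the filter and the policy come from the same model family, so improvement needs no hand-built reward function.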

This is one-shot assembly: you show examples of what to build, and the robot just does it. (see original post: generalistai.com/blog)

To share more on how this works, the robot is controlled in real time by a neural network that takes in video pixels and outputs 100Hz actions. The video below is part of the raw input passed directly into the model. I also like this view (at 1x speed) because it shows more of the (I think very cool) subtle moments of dexterity near the fingertips 👌

One-shot assembly seemed like a dream even just a year ago — it's not easy. It requires both the high-level reasoning of "what to build" (recognizing the geometry of the structures presented by the human), and the low-level visuomotor control of "how to build it" (purposefully re-orienting individual pieces and nudging them together in place).

While it would be possible to manually engineer a complex system for this (e.g. w/ hierarchical control, or explicit state representations), we were curious if our own foundation model could do it all end-to-end with just some post-training data. Surprisingly, it just worked. Nothing about the recipe is substantially different from any other demo we’ve run in the past, and we’re excited about its implications for model capabilities:

• On contextual reasoning, these models can (i) attend to task-related pixels in the peripheral view of the video inputs, and (ii) retain this knowledge in-context while ignoring irrelevant background. This is useful for generalizing to a wide range of real workflows: e.g. paying attention to what’s coming down the conveyor line, or glancing at the instructions displayed on a nearby monitor.

• On dexterity, these models can produce contact-rich "commonsense" behaviors that can be difficult to pre-program or write language instructions for, e.g. rolling a brick slightly to align its studs against the bottom of another, re-grasping to get a better grip or to move out of the way before a forceful press, or gently pushing the corners of a brick against the mat to rotate it in hand and stand it up vertically (i.e. extrinsic dexterity).

These aspects work together to form a capability that resembles fast adaptation — a hallmark of intelligence, relevant for real use cases. This has also expanded my own perspective on what's possible with robot learning, using a recipe that's repeatable for many more skills.

This milestone stands on top of the solid technical foundations we’ve built here at Generalist: hardcore controls & hardware, all in-house-built models, and a data engine that "just works." We're a small group of hyper-focused engineers, and hands-down the highest-talent-density team I’ve ever worked with. We're accelerating and scaling aggressively towards unlocking next-generation robot intelligence. Building Legos is just one example, and it's clear to me that we're headed towards a future where robots can do just about anything we want them to. It's coming, and we're going to make it happen.
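The control setup described above (video pixels in, actions out at 100Hz) amounts to a fixed-rate closed loop. A minimal sketch, where `Policy`, `capture_frame`, and the action dimension are all hypothetical stand-ins rather than Generalist's actual stack:

```python
import time

CONTROL_HZ = 100            # actions emitted at 100 Hz, per the post
DT = 1.0 / CONTROL_HZ

class Policy:
    """Hypothetical stand-in for the end-to-end visuomotor model."""
    def __init__(self, action_dim=14):  # e.g. 7 DoF per arm, bi-manual
        self.action_dim = action_dim

    def act(self, frame):
        # A real model maps raw pixels to motor commands;
        # here we just return a zero action of the right shape.
        return [0.0] * self.action_dim

def capture_frame():
    # Camera stub: a real system would return an image array.
    return [[0] * 64 for _ in range(64)]

def control_loop(policy, steps):
    """Run a fixed-rate closed loop: sense -> infer -> act."""
    actions = []
    next_tick = time.monotonic()
    for _ in range(steps):
        frame = capture_frame()
        action = policy.act(frame)  # pixels in, action out
        actions.append(action)      # a real robot would execute it here
        next_tick += DT
        time.sleep(max(0.0, next_tick - time.monotonic()))
    return actions

actions = control_loop(Policy(), steps=10)
print(len(actions), len(actions[0]))  # prints: 10 14
```

The absolute-deadline scheduling (`next_tick += DT`) keeps the loop at 100Hz without drift accumulating from per-iteration timing jitter.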





At Generalist, we’re working towards a future where robots can do anything. To that end, the robots build now, too. We’ve trained a robot to do one-shot assembly, constructing Legos end-to-end: no custom engineering, just pixels in → Lego copies out.


How can existing robot systems be replaced with foundation models? Check out our new survey paper on the real-world robot applications of foundation models: arxiv.org/abs/2402.05741 Thread👇


To see emergent behaviors from low-level policies was a first for many of us on the team. They don't happen often enough yet, but it certainly feels like we're headed in the right direction. Reach out if you're interested in working together.

Today we're excited to share a glimpse of what we're building at Generalist. As a first step towards our mission of making general-purpose robots a reality, we're pushing the frontiers of what end-to-end AI models can achieve in the real world. Here's a preview of our early results in autonomous general-purpose dexterous capabilities – fast, reactive, smooth, precise, bi-manual coordinated sensorimotor control.
