Pete Florence
@peteflorence
Co-Founder & CEO @GeneralistAI

Happening now at #NVIDIAGTC: Generalist’s GEN-0 model autonomously packing phones on @Universal_Robot arms in our first public demo. To move robotics beyond the lab, systems need to operate in real time on industrial hardware. See the demo below, and stop by booth #1840 👇🤖



The dark matter of robotics is “physical commonsense”: those tiny corrections and subtle recoveries that your hands do (and rarely notice). It’s everywhere, and yet hard to pin down. Second nature to us, but hard for machines. They’re starting to emerge in our foundation models 👇


I’m starting to worry about Massachusetts:
1. Biotech is way off from a few years ago.
2. Only 1 of the top 50 AI companies is in MA.
3. The federal research funding cuts hitting MIT, Harvard, and WHOI are brutal.
4. The millionaires tax is working in the short run, but I know a lot of wealthy folks preparing for a FL move.
5. A glut of empty condos.
6. It’s not “cool” for young folks.
7. It’s expensive as sh-t.
I honestly don’t think the MA/Boston govt can do that much about it, as these are kind of macro issues. I give them big credit for working on building more housing and fixing the T, which will help. I’m trying to help w HubSpot, partnering w WHOI, teaching at MIT. I’d like to help more. Specifically, I’d like to encourage and help more AI and climate companies in the state. I think AI and climate should be our dual growth engines.

Compelling advances in scaling laws for robotics from @GeneralistAI! Scaling laws are without a doubt one of the key components that enabled the rapid hyperscaling of language model pretraining over the past few years. Establishing predictive scaling laws would be a watershed moment for general-purpose robotics, and the recent analysis from Generalist is some of the most promising I've seen so far. But there are some nuances I'd like to raise; think of these ideas as an open bounty for high-impact future work 🚀

#1 Language modeling has found that scaling laws measured on metrics like training loss [1] or monorepo perplexity [2] provide accurate approximations of observed downstream evaluation performance. However, an open secret in robotics is that offline metrics, such as training or validation loss, exact-match token accuracy, or open-loop offline action MSE, have been notoriously uncorrelated with real-world closed-loop performance! Reliable offline evaluation metrics have been a holy grail for robotics that remains unsolved [3]. Oftentimes, model checkpoints with higher validation loss actually result in better end-to-end performance in the real world; this makes checkpoint selection, model iteration, and power-law analysis extremely difficult.

There are many reasons for this, but an intuitive illustration is that robot policies often have very jagged learned behavior: if model A fits the training distribution extremely well but is brittle and sometimes makes rare but catastrophic errors, it may exhibit much lower action MSE than a model B that is slightly worse on all states but never makes unrecoverable errors (a toy simulation of exactly this inversion is sketched below). There are many such considerations when comparing offline metrics with online closed-loop rollouts: compounding errors, model brittleness, and in-distribution vs. out-of-distribution generalization. Even offline closed-loop evaluation is difficult, and leveraging simulation [4] or world models [5] for evaluation is an open research problem.

Generalist proposed scaling laws across dataset size and model scale, measured against validation loss during a post-training phase on target tasks. This is an interesting result at a larger scale than previous work ([6], [7], [8]), but it is not as convincing as I would like: even if one grants that the proposed power laws hold, it is unclear whether they are meaningful power laws, because the correlation between validation loss and real-world performance depends heavily on the policy type, model class, task complexity, and deployment situation.

#2 Scaling laws in language modeling have been immensely useful for pretraining laddering (where ideas are explored at smaller scales and used to make critical decisions for larger hero runs) because of their predictive power. The trends and slopes of metrics like compute efficiency, measured on specific domains like code or Wikipedia, oftentimes accurately predict how general model intelligence will improve on tasks as varied as MMLU or GPQA. However, in robotics, scaling-law analysis is overfit to a single embodiment on a single set of tasks; it does not translate to a universal predictive scaling law that applies to other robots or other scenarios! A derived scaling law relating model size, FLOPs, and hours of demonstrations needed to solve Task A on an ALOHA may not tell you much at all about the scaling law for Task B on a humanoid.
In robotics, scaling laws are often only backward-looking: if you were to redo the same project under the exact same requirements, you could have saved X amount of time by collecting less data or training a smaller model. But that may not tell you anything about the next task or the next environment you deploy in.

Overall, I am excited and impressed by the scale of Generalist's immense data collection operations, the gorgeously smooth and performant model behaviors, and the scientifically rigorous push to make real progress on scaling laws for robotics. The team has been cooking! In the future, I look forward to extensions of this scaling analysis (from Generalist or the community!) that (A) move beyond validation loss to more trustworthy performance measurements and (B) demonstrate the predictive power of truly universal scaling laws, as opposed to backward-looking, task- and embodiment-specific ones. These are hard problems, and I look forward to seeing progress on this front!
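To make the offline-vs-closed-loop gap from #1 concrete, here is a minimal toy Monte Carlo sketch. Every number in it is invented for illustration (the error scales, episode length, failure threshold, and blowup probabilities are not from Generalist's setup or any real system): policy A has lower per-step action error on held-out states but a rare chance of a catastrophic, unrecoverable error; policy B is uniformly noisier but never catastrophic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration parameters -- not from any real robot system.
STEPS = 200       # steps per closed-loop episode
EPISODES = 2000   # Monte Carlo rollouts per policy
FAIL_ERR = 1.0    # a single per-step error this large is unrecoverable

def offline_mse(step_std, blowup_prob, blowup_err):
    """Expected squared per-step action error on held-out states."""
    return step_std**2 + blowup_prob * blowup_err**2

def closed_loop_success(step_std, blowup_prob, blowup_err):
    """Fraction of rollouts where no single step error exceeds FAIL_ERR."""
    noise = np.abs(rng.normal(0.0, step_std, size=(EPISODES, STEPS)))
    blowups = rng.random((EPISODES, STEPS)) < blowup_prob
    errors = noise + blowups * blowup_err
    return (errors.max(axis=1) < FAIL_ERR).mean()

# Policy A: very accurate on most states, but rare catastrophic errors.
A = dict(step_std=0.05, blowup_prob=0.001, blowup_err=2.0)
# Policy B: slightly worse on every state, but never catastrophic.
B = dict(step_std=0.15, blowup_prob=0.0, blowup_err=0.0)

for name, p in [("A", A), ("B", B)]:
    print(f"policy {name}: offline MSE = {offline_mse(**p):.4f}, "
          f"closed-loop success = {closed_loop_success(**p):.1%}")
```

Run as-is, policy A wins offline (MSE ≈ 0.007 vs. ≈ 0.023) yet loses in closed loop (success ≈ 82% vs. ≈ 100%), because a 0.1% per-step blowup rate compounds to roughly a 1 - 0.999^200 ≈ 18% failure rate over a 200-step episode. That inversion is exactly how offline metrics can rank checkpoints backwards.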

More pretraining improves GEN-0 real-robot performance (via blind A/B evals with closed-loop rollouts). Improvements are significant in the low-data regime, but the best models thrive with both pretraining and ample post-training. See blog addendum: generalistai.com/blog/nov-04-20…




Today, we’re excited to introduce Rnj-1, @essential_ai's first open model: a world-class 8B base + instruct pair, built with scientific rigor, intentional design, and a belief that the advancement and equitable distribution of AI depend on building in the open. We bring American open source on par with the best in the world.






Introducing GEN-0, our latest 10B+ foundation model for robots
⏱️ built on Harmonic Reasoning, a new architecture that can think & act seamlessly
📈 strong scaling laws: more pretraining & model size = better
🌍 an unprecedented corpus of 270,000+ hrs of dexterous data
Read more 👇
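For readers unfamiliar with how such scaling laws are extracted in practice, here is a minimal curve-fitting sketch, assuming the standard saturating power-law form L(D) = a·D^(-α) + c from the language-model scaling-law literature. The data points are invented placeholders (NOT GEN-0's numbers); only the fitting and extrapolation procedure is the point.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented placeholder measurements: (hours of pretraining data, validation loss).
# These are NOT GEN-0's numbers; they only illustrate the fitting procedure.
hours = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 2.7e5])
val_loss = np.array([1.29, 1.08, 0.90, 0.78, 0.68, 0.62])

def power_law(d, a, alpha, c):
    """Saturating power law: L(D) = a * D**(-alpha) + c."""
    return a * d**(-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, hours, val_loss, p0=(5.0, 0.3, 0.4))
print(f"fit: L(D) = {a:.2f} * D^(-{alpha:.3f}) + {c:.3f}")

# The value of a predictive law: extrapolate before committing to more data.
print(f"predicted loss at 5.4e5 hours: {power_law(5.4e5, a, alpha, c):.3f}")
```

The irreducible-loss term c matters: on log-log axes the curve only looks linear after subtracting it, and whether such a fit transfers across tasks and embodiments is precisely the open question raised in the commentary above.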

