Nicholas Pfaff
@NicholasEPfaff
Robotics PhD Student @MIT_CSAIL
72 posts

Joined March 2025
671 Following · 746 Followers
Pinned Tweet
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt. VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties. We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation. Website: scenesmith.github.io 👇🧵(1/8)
Nicholas Pfaff retweeted
jenny huang
jenny huang@JennyHuang99·
🧵1/ 🤔New paper: Do LLMs Benefit from Their Own Words? In multi-turn chats, models are typically given their own past responses as context. But do their own words always help… or can they sometimes be a distraction?
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
I haven't ever timed this 😅 However, we implemented a bunch of performance improvements targeting throughput over latency, so generating a single scene wouldn't be much faster than generating ~25 scenes in parallel. The biggest bottleneck is API response times. Hence, we have an option to opt into OpenAI's priority tier, which speeds this up by ~50% but is twice as expensive. Switching to Gemini Flash (or other speed-optimized models) should also make a big difference here.
Ali Shamsaddinlou
Ali Shamsaddinlou@alishams21_·
@NicholasEPfaff Very cool! How long does it take to generate each simulation-ready scene if you generate one scene per run?
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@vatsalbajaj We have not tried RL-based training yet. We do have teleop demos on the website that could be used for supervised learning. RL would be exciting to try!
Vatsal Bajaj
Vatsal Bajaj@vatsalbajaj·
@NicholasEPfaff this is incredible! Out of curiosity, have you used Scene Smith to generate RL environments in which to train robots? D’you have any demos of how one could train/ post train for robotics using SceneSmith?
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@YufeiWang25 Very cool! Thanks for sharing. Using image priors is a promising way to improve spatial reasoning.
Yufei Wang
Yufei Wang@YufeiWang25·
@NicholasEPfaff Really cool work! We had a very similar work Architect which also focuses on indoor scene generation (also specifically for small & manipulatable objects placement): wangyian-me.github.io/Architect/. Different method applied to similar problems, excited to see this new progress!
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Agreed here. VLMs seem to struggle with spatial imagination. Setting a table with place settings that face in different directions (and not just toward the current image render) is a revealing case of this. Image generative models are much better at this. Maybe the next version will use agentic video models?
Luke Hutchison
Luke Hutchison@LH·
@NicholasEPfaff Very cool... Although did anyone else notice that the knife was backwards on the table? 😁 There will always be a fundamental lack of actual understanding in ML/AI.
Nicholas Pfaff retweeted
Shivaram Kumar
Shivaram Kumar@shirakuex·
@allen_ai This is awesome work! Curious—any plans to integrate SceneSmith-like agentic scene generation into MolmoSpaces? It feels like a natural combo: MolmoSpaces benchmark + SceneSmith prompt-to-sim scenes = infinite evaluation distribution. scenesmith.github.io
Nicholas Pfaff retweeted
Ai2
Ai2@allen_ai·
Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖 230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.
Nicholas Pfaff retweeted
Ilir Aliu
Ilir Aliu@IlirAliu_·
Agentic Generation of Simulation-Ready Indoor Scenes and Robot Test Environments. 📍 Paper AND Code:
Instead of hand-building scenes in simulation, you write one prompt. SceneSmith builds the world for you.
> Room layout.
> Furniture.
> Wall and ceiling objects.
> Small movable items.
Each stage is handled by a team of VLM agents: one proposes, one critiques, one coordinates.
The result is not just pretty scenes, but physics-ready environments. Every object:
• Metric scale
• Collision geometry
• Estimated mass, inertia, friction
• <2% object collisions
• 96% stable under gravity
And it exports directly to MJX, USD, SDFormat.
If you train or evaluate robot policies, environment creation is usually the bottleneck. SceneSmith turns it into an on-demand layer. You can generate dozens of diverse scenes per task and automatically evaluate policies across them, with 99.7% agreement to human labels.
That means:
• More robust policies
• Faster benchmarking
• No hand-written success predicates
205 participants preferred SceneSmith scenes 92% of the time for realism and 91% for prompt faithfulness.
Environment generation is no longer the slow part of robot research. If you work on sim2real, policy scaling, or automated evaluation, this is worth bookmarking and sharing with your team.
📍 GitHub: scenesmith.github.io
Paper: arxiv.org/abs/2602.09153
Code: github.com/nepfaff/scenes…
—-
Weekly robotics and AI insights. Subscribe free: 22astronauts.com
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
We don't have an explicit argument for that. However, the input just gets sent to a VLM agent, which can natively take both text and image inputs. Hence, supporting this seems like a minor code change. You could already use a VLM to describe a set of images in great detail in text and use that. We have been doing that, and it works quite well.
Peng Xu
Peng Xu@sippeyxp·
@RussTedrake Yeah! Sim cannot unleash its power until building a sim becomes engineering-free. Thanks for sharing. Do you know if there is a future plan to support contact properties (e.g., friction) and actuated objects?
Russ Tedrake
Russ Tedrake@RussTedrake·
I've been saying for years that the biggest challenge for simulation in robotics is not actually the physics engine (although you do have to get that right). The real challenge is capturing the *diversity* of the real world. There was no doubt that generative AI had the potential to change that, but it's still amazing to see it take shape. Watching Nick's incredibly fast progress has convinced me that content generation might not actually be a bottleneck anymore. This is a beautiful combination of hardened tools for e.g. low-level mesh processing with the latest tools for generative asset creation, wrapped in a powerful agentic workflow. Please do give it a try and share your feedback.
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@maxzpchen Agreed. I think we are at the point where we have to push the robot side to see how far we can get with these environments and whether they start breaking down anywhere.
Max
Max@maxzpchen·
One thing I’m especially curious about is how far this kind of unbounded, VLM-driven object vocabulary and physics estimation can go before you start needing task-specific, human-curated distributions again—for example, for household manipulation vs. warehouse vs. surgery—and whether we’ll see “benchmark overfitting” in simulation the same way we did in vision and NLP
Ludwig_fr
Ludwig_fr@ludwig_fr·
@NicholasEPfaff Add some Steam VR teleop and I would give you some free teleop data ^^
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@hferrolho Glad you liked that one! A pain to simulate with those large forces 😂
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Correct. We worked on real2sim (replicating an actual environment in simulation) in the past: scalable-real2sim.github.io The goal of this approach is to match the distribution of real-world and simulated environments for increased scale, but it might not be great at replicating one particular environment from images (though we have used it for that as well).
Abdul R
Abdul R@cyrux004·
@NicholasEPfaff @GChongkai I also saw a post from somebody to create a gaussian splat of your personal space which can be used for simulation. This would be better suited if you want to train in a specific environment vs what you have here if you want to build something general purpose, right ?
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@JeremySMorgan3 Agreed. Maybe even some VLM-aided TAMP or similar to get diverse data from such a model-based planner. @cohnthomas43 set up a simple version of a model-based planner for these scenes, but we have only used it for evaluation so far.
JeremySMorgan
JeremySMorgan@JeremySMorgan3·
@NicholasEPfaff This looks great! Nice videos also. Would be cool to hook up a motion planner in here with a task sampler to generate diverse demonstrations for VLA training
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@ZhaoMandi Thank you! It ended up working way better than what I had hoped when we last talked about this
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
I think this will be very useful for all of training, evaluation, and development. Large-scale training data generation takes more effort to set up, but has a lot of potential. We had lots of fun testing one of the applications for automatic evaluation. We open-sourced all of our code and even released all of our already generated scenes on Hugging Face!
Fabrizio Milo
Fabrizio Milo@fabmilo·
I was prototyping a system like this after stumbling on the Google DeepMind object dataset. Is this more useful as an API service for evaluation/training, or both? Will big companies just implement it internally, or would they be open to using a service?
Nicholas Pfaff@NicholasEPfaff

This is an amazing collaboration with @cohnthomas43, @ZakharovSergeyN, @RickCory21, @RussTedrake 📄 arxiv.org/abs/2602.09153 💻 scenesmith.github.io 🔧 github.com/nepfaff/scenes… Explore our interactive 3D scenes on our website or download them from Hugging Face! 🧵(8/8)

Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
You should be able to send commands from your real-world teleop system to the simulator instead of to the robot directly. We did open-source a lightweight mobile iiwa teleop example with a space mouse a while back: github.com/nepfaff/scene_…. However, there are much better recent approaches that use a VR headset for this teleop. Also, check out this company: evolverobotics.com
Heeger
Heeger@GChongkai·
@NicholasEPfaff Great work! May I know how to teleoperate a mobile robot in simulation (with only one person)? Is there any mature pipeline for this?