Nicholas Pfaff

77 posts

@NicholasEPfaff

Robotics PhD Student @MIT_CSAIL

Joined March 2025
690 Following · 770 Followers
Pinned Tweet
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt. VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties. We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation. Website: scenesmith.github.io 👇🧵(1/8)
English
18
79
564
73.4K
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Integrating them could be really great, especially since limited articulated assets are a big limitation of SceneSmith at the moment. I also think that future scene generation systems could hugely benefit from some of your efficiency ideas, which would make it easier to scale them without huge budgets.
English
0
0
0
27
RuiningLi
RuiningLi@RayLi234·
@NicholasEPfaff Thanks! We don’t have a lot of money, so scaling to 10k assets requires the agent to be cheap. @Mattzh1314 made lots of tradeoffs between fidelity/realism and cost. Big fan of your work on SceneSmith; we should integrate these pipelines for more scalable real-to-sim!
English
1
0
3
511
RuiningLi
RuiningLi@RayLi234·
🚀 Introducing Articraft, a coding agent for articulated 3D asset creation. Articraft writes code, executes it, receives validation feedback, and refines the result into simulation-ready 3D assets with parts, joints, and motion. We’re also releasing Articraft-10K: 10,000+ articulated objects across 250 categories, unlocking large-scale interactive scenes for robotics simulation and physical AI. 🔗 Project page: articraft3d.github.io 💻 Code: github.com/mattzh72/artic…
English
23
107
743
180.3K
Nicholas Pfaff retweeted
Sergey Zakharov
Sergey Zakharov@ZakharovSergeyN·
Releasing RecGen: a collaboration between @ToyotaResearch, @toyota_europe, and @UvA_Amsterdam tackling a core 3D vision challenge: reconstructing complete multi-object scenes (parts, poses, textures, even occluded geometry) from just 1 to a few RGB-D views. Trained purely on synthetic data, RecGen achieves SOTA on real-world robotics and 6D pose benchmarks, handling occlusions, symmetry, and complex interactions. A step toward scalable, high-fidelity digital twins for robotics, and better evaluation and training of generalist policies. reconstruction-by-generation.github.io
English
2
35
220
26.6K
Nicholas Pfaff retweeted
Jean Mercat
Jean Mercat@MercatJean·
Releasing VLA Foundry: an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. End-to-end control from language pretraining to action-expert fine-tuning — no more stitching together incompatible repos.
English
10
76
489
74.1K
Nicholas Pfaff retweeted
jenny huang
jenny huang@JennyHuang99·
🧵1/ 🤔New paper: Do LLMs Benefit from Their Own Words? In multi-turn chats, models are typically given their own past responses as context. But do their own words always help… or can they sometimes be a distraction?
English
6
34
173
18.4K
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
I haven't ever timed this 😅 However, we implemented a bunch of performance improvements targeting throughput over latency. Hence, generating a single scene wouldn't be much faster than generating ~25 scenes or so in parallel. The biggest bottleneck is API response times. Hence, we have an option to opt into OpenAI's priority tier, which speeds this up by 50% but is twice as expensive. Switching to Gemini Flash (or other speed-optimized models) should also make a big difference here.
English
0
0
1
43
Ali Shamsaddinlou
Ali Shamsaddinlou@alishams21_·
@NicholasEPfaff Very cool! How long does it take to generate each simulation-ready scene if you generate one scene per run?
English
1
0
1
51
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@vatsalbajaj We have not tried RL-based training yet. We do have teleop demos on the website that could be used for supervised learning. RL would be exciting to try!
English
1
0
1
73
Vatsal Bajaj
Vatsal Bajaj@vatsalbajaj·
@NicholasEPfaff This is incredible! Out of curiosity, have you used SceneSmith to generate RL environments in which to train robots? D’you have any demos of how one could train/post-train for robotics using SceneSmith?
English
1
0
2
75
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@YufeiWang25 Very cool! Thanks for sharing. Using image priors is a promising way to improve spatial reasoning.
English
0
0
0
147
Yufei Wang
Yufei Wang@YufeiWang25·
@NicholasEPfaff Really cool work! We had a very similar work, Architect, which also focuses on indoor scene generation (and specifically on the placement of small, manipulable objects): wangyian-me.github.io/Architect/. A different method applied to similar problems; excited to see this new progress!
English
1
0
3
202
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Agreed here. VLMs seem to struggle with spatial imagination. Setting a table with place settings that face in different directions (and not just toward the current image render) is a revealing case of this. Image generative models are much better at this. Maybe the next version will use agentic video models?
English
0
0
1
92
Luke Hutchison
Luke Hutchison@LH·
@NicholasEPfaff Very cool... Although did anyone else notice that the knife was backwards on the table? 😁 There will always be a fundamental lack of actual understanding in ML/AI.
English
1
0
1
127
Nicholas Pfaff retweeted
Shivaram Kumar
Shivaram Kumar@shirakuex·
@allen_ai This is awesome work! Curious—any plans to integrate SceneSmith-like agentic scene generation into MolmoSpaces? It feels like a natural combo: MolmoSpaces benchmark + SceneSmith prompt-to-sim scenes = infinite evaluation distribution. scenesmith.github.io
English
1
1
4
1K
Nicholas Pfaff retweeted
Ai2
Ai2@allen_ai·
Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖 230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.
English
10
103
723
96.6K
Nicholas Pfaff retweeted
Ilir Aliu
Ilir Aliu@IlirAliu_·
Agentic Generation of Simulation-Ready Indoor Scenes and Robot Test Environments. 📍 Paper AND Code:
Instead of hand-building scenes in simulation, you write one prompt. SceneSmith builds the world for you.
> Room layout.
> Furniture.
> Wall and ceiling objects.
> Small movable items.
Each stage is handled by a team of VLM agents: one proposes, one critiques, one coordinates. The result is not just pretty scenes, but physics-ready environments.
Every object:
• Metric scale
• Collision geometry
• Estimated mass, inertia, friction
• <2% object collisions
• 96% stable under gravity
And it exports directly to MJX, USD, SDFormat.
If you train or evaluate robot policies, environment creation is usually the bottleneck. SceneSmith turns it into an on-demand layer. You can generate dozens of diverse scenes per task and automatically evaluate policies across them, with 99.7% agreement to human labels.
That means:
• More robust policies
• Faster benchmarking
• No hand-written success predicates
205 participants preferred SceneSmith scenes 92% of the time for realism and 91% for prompt faithfulness. Environment generation is no longer the slow part of robot research. If you work on sim2real, policy scaling, or automated evaluation, this is worth bookmarking and sharing with your team.
📍 GitHub: scenesmith.github.io
Paper: arxiv.org/abs/2602.09153
Code: github.com/nepfaff/scenes…
---
Weekly robotics and AI insights. Subscribe free: 22astronauts.com
English
5
1
44
3.9K
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
We don't have an explicit argument for that. However, the input just gets sent to a VLM agent, which can natively take both text and image inputs. Hence, supporting this seems like a minor code change. You could already use a VLM to describe a set of images in great detail in text and use that. We have been doing that, and it works quite well.
English
0
0
0
17
Peng Xu
Peng Xu@sippeyxp·
@RussTedrake Yeah! Sim cannot unleash its power until building a sim becomes engineering-free. Thanks for sharing. Do you know if there are future plans to support contact properties (e.g., friction) and actuated objects?
English
2
0
1
181
Russ Tedrake
Russ Tedrake@RussTedrake·
I've been saying for years that the biggest challenge for simulation in robotics is not actually the physics engine (although you do have to get that right). The real challenge is capturing the *diversity* of the real world. There was no doubt that generative AI had the potential to change that, but it's still amazing to see it take shape. Watching Nick's incredibly fast progress has convinced me that content generation might not actually be a bottleneck anymore. This is a beautiful combination of hardened tools for e.g. low-level mesh processing with the latest tools for generative asset creation, wrapped in a powerful agentic workflow. Please do give it a try and share your feedback.
Nicholas Pfaff@NicholasEPfaff

Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt. VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties. We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation. Website: scenesmith.github.io 👇🧵(1/8)

English
11
33
324
37.6K
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@maxzpchen Agreed. I think we are at the point where we have to push the robot side to see how far we can get with these environments and whether they start breaking down anywhere.
English
0
0
0
283
Max
Max@maxzpchen·
One thing I’m especially curious about is how far this kind of unbounded, VLM-driven object vocabulary and physics estimation can go before you start needing task-specific, human-curated distributions again—for example, for household manipulation vs. warehouse vs. surgery—and whether we’ll see “benchmark overfitting” in simulation the same way we did in vision and NLP
English
1
0
2
269
Ludwig_fr
Ludwig_fr@ludwig_fr·
@NicholasEPfaff Add some SteamVR teleop and I would give you some free teleop data ^^
English
1
0
3
283
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
@hferrolho Glad you liked that one! A pain to simulate with those large forces 😂
English
0
0
0
184
Nicholas Pfaff
Nicholas Pfaff@NicholasEPfaff·
Correct. We worked on real2sim (replicating an actual environment in simulation) in the past: scalable-real2sim.github.io The goal of this approach is to match the distribution of real-world and simulated environments for increased scale, but it might not be great at replicating one particular environment from images (though we have used it for that as well).
English
0
0
1
45
Abdul R
Abdul R@cyrux004·
@NicholasEPfaff @GChongkai I also saw a post from somebody about creating a Gaussian splat of your personal space that can be used for simulation. That would be better suited if you want to train in a specific environment, whereas what you have here is for building something general-purpose, right?
English
1
0
1
45