Nicholas Pfaff

77 posts

@NicholasEPfaff

Robotics PhD Student @MIT_CSAIL

Joined March 2025

690 Following · 770 Followers

Pinned Tweet

Nicholas Pfaff@NicholasEPfaff·11 Feb

Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt. VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties. We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation. Website: scenesmith.github.io 👇🧵(1/8)


564

73.4K

Nicholas Pfaff@NicholasEPfaff·16 May

Integrating them could be really great, especially as limited support for articulated objects is a big limitation of SceneSmith at the moment. I also think that future scene generation systems could hugely benefit from some of your efficiency ideas to make it easier to scale them without huge budgets.


RuiningLi@RayLi234·16 May

@NicholasEPfaff Thanks! We don’t have a lot of money, so scaling to 10k assets requires the agent to be cheap. @Mattzh1314 made lots of tradeoffs between fidelity/realism and cost. A big fan of your work, SceneSmith; we should integrate these pipelines for more scalable real-to-sim!


511

RuiningLi@RayLi234·15 May

🚀 Introducing Articraft, a coding agent for articulated 3D asset creation. Articraft writes code, executes it, receives validation feedback, and refines the result into simulation-ready 3D assets with parts, joints, and motion. We’re also releasing Articraft-10K: 10,000+ articulated objects across 250 categories, unlocking large-scale interactive scenes for robotics simulation and physical AI. 🔗 Project page: articraft3d.github.io 💻 Code: github.com/mattzh72/artic…


107

743

180.3K

Nicholas Pfaff retweeted

Sergey Zakharov@ZakharovSergeyN·29 Apr

Releasing RecGen: a collaboration between @ToyotaResearch, @toyota_europe, and @UvA_Amsterdam tackling a core 3D vision challenge: reconstructing complete multi-object scenes (parts, poses, textures, even occluded geometry) from just 1 to a few RGB-D views. Trained purely on synthetic data, RecGen achieves SOTA on real-world robotics and 6D pose benchmarks, handling occlusions, symmetry, and complex interactions. A step toward scalable, high-fidelity digital twins for robotics, and better evaluation and training of generalist policies. reconstruction-by-generation.github.io


220

26.6K

Nicholas Pfaff retweeted

Katherine Liu@robo_kat·23 Apr

Also, if you’re wondering how we generated all these cool videos from the Drake sim, check out @NicholasEPfaff’s repo github.com/nepfaff/drake-… as a starting point 👀

Katherine Liu@robo_kat

A few interesting rollouts from the Foundry-QwenVLA-2.5B multi-task model on seen tasks in sim – a 🧵. I really like behaviors that involve non-prehensile manipulation, like the little nudges in StoreCerealBoxUnderShelf.


2.7K

Nicholas Pfaff retweeted

Jean Mercat@MercatJean·22 Apr

Releasing VLA Foundry: an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. End-to-end control from language pretraining to action-expert fine-tuning — no more stitching together incompatible repos.


489

74.1K

Nicholas Pfaff retweeted

jenny huang@JennyHuang99·4 Mar

🧵1/ 🤔New paper: Do LLMs Benefit from Their Own Words? In multi-turn chats, models are typically given their own past responses as context. But do their own words always help… or can they sometimes be a distraction?


173

18.4K

Nicholas Pfaff@NicholasEPfaff·17 Feb

I haven't ever timed this 😅 However, we implemented a bunch of performance improvements targeting throughput over latency. Hence, generating a single scene wouldn't be much faster than generating ~25 scenes or so in parallel. The biggest bottleneck is API response times. Hence, we have an option to opt into OpenAI's priority tier, which speeds this up by 50% but is twice as expensive. Switching to Gemini Flash (or other speed-optimized models) should also make a big difference here.
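The throughput-over-latency point above can be illustrated with a minimal sketch. It assumes each scene run is dominated by waiting on API responses, modelled here with `asyncio.sleep`. The names `generate_scene`, `generate_batch`, `timed_batch`, and the 0.05 s delay are hypothetical stand-ins, not SceneSmith's actual code or API.

```python
import asyncio
import time

# Assumed stand-in for a slow VLM API round trip (not a measured value).
API_LATENCY_S = 0.05


async def generate_scene(prompt: str) -> str:
    """Hypothetical scene-generation run that only waits on the API."""
    await asyncio.sleep(API_LATENCY_S)
    return f"scene for: {prompt}"


async def generate_batch(prompts: list[str]) -> list[str]:
    """Run all scene generations concurrently; results keep input order."""
    return await asyncio.gather(*(generate_scene(p) for p in prompts))


def timed_batch(n: int) -> tuple[list[str], float]:
    """Generate n scenes concurrently and return them with the wall time."""
    prompts = [f"room {i}" for i in range(n)]
    start = time.perf_counter()
    scenes = asyncio.run(generate_batch(prompts))
    return scenes, time.perf_counter() - start
```

Under this assumption, `timed_batch(1)` and `timed_batch(25)` take roughly the same wall time, since the waits overlap instead of adding up. That is why running a single scene per run buys little latency.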

Ali Shamsaddinlou@alishams21_·17 Feb

@NicholasEPfaff Very cool! How long does it take to generate each simulation-ready scene if you generate one scene per run?


Nicholas Pfaff@NicholasEPfaff·16 Feb

@vatsalbajaj We have not tried RL-based training yet. We do have teleop demos on the website that could be used for supervised learning. RL would be exciting to try!

Vatsal Bajaj@vatsalbajaj·16 Feb

@NicholasEPfaff This is incredible! Out of curiosity, have you used SceneSmith to generate RL environments in which to train robots? Do you have any demos of how one could train/post-train for robotics using SceneSmith?

Nicholas Pfaff@NicholasEPfaff·13 Feb

@YufeiWang25 Very cool! Thanks for sharing. Using image priors is a promising way to improve spatial reasoning.


147

Yufei Wang@YufeiWang25·13 Feb

@NicholasEPfaff Really cool work! We had a very similar work, Architect, which also focuses on indoor scene generation (specifically the placement of small, manipulable objects): wangyian-me.github.io/Architect/. A different method applied to similar problems; excited to see this new progress!


202

Nicholas Pfaff@NicholasEPfaff·13 Feb

Agreed here. VLMs seem to struggle with spatial imagination. Setting a table with place settings that face in different directions (and not just toward the current image render) is a revealing case of this. Image generative models are much better at this. Maybe the next version will use agentic video models?

Luke Hutchison@LH·13 Feb

@NicholasEPfaff Very cool... Although did anyone else notice that the knife was backwards on the table? 😁 There will always be a fundamental lack of actual understanding in ML/AI.


127

Nicholas Pfaff retweeted

Shivaram Kumar@shirakuex·12 Feb

@allen_ai This is awesome work! Curious—any plans to integrate SceneSmith-like agentic scene generation into MolmoSpaces? It feels like a natural combo: MolmoSpaces benchmark + SceneSmith prompt-to-sim scenes = infinite evaluation distribution. scenesmith.github.io

Nicholas Pfaff retweeted

Ai2@allen_ai·11 Feb

Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖 230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.


103

723

96.6K

Nicholas Pfaff retweeted

Ilir Aliu@IlirAliu_·12 Feb

Agentic Generation of Simulation-Ready Indoor Scenes and Robot Test Environments. 📍 Paper AND Code:

Instead of hand-building scenes in simulation, you write one prompt. SceneSmith builds the world for you.
> Room layout.
> Furniture.
> Wall and ceiling objects.
> Small movable items.

Each stage is handled by a team of VLM agents: one proposes, one critiques, one coordinates. The result is not just pretty scenes, but physics-ready environments. Every object:
• Metric scale
• Collision geometry
• Estimated mass, inertia, friction
• <2% object collisions
• 96% stable under gravity

And it exports directly to MJX, USD, SDFormat.

If you train or evaluate robot policies, environment creation is usually the bottleneck. SceneSmith turns it into an on-demand layer. You can generate dozens of diverse scenes per task and automatically evaluate policies across them, with 99.7% agreement to human labels. That means:
• More robust policies
• Faster benchmarking
• No hand-written success predicates

205 participants preferred SceneSmith scenes 92% of the time for realism and 91% for prompt faithfulness. Environment generation is no longer the slow part of robot research. If you work on sim2real, policy scaling, or automated evaluation, this is worth bookmarking and sharing with your team.

📍GitHub: scenesmith.github.io
Paper: arxiv.org/abs/2602.09153
Code: github.com/nepfaff/scenes…

Weekly robotics and AI insights. Subscribe free: 22astronauts.com


3.9K

Nicholas Pfaff@NicholasEPfaff·12 Feb

We don't have an explicit argument for that. However, the input just gets sent to a VLM agent, which can natively take both text and image inputs. Hence, supporting this seems like a minor code change. You could already use a VLM to describe a set of images in great detail in text and use that. We have been doing that, and it works quite well.

Luke@groccy1·12 Feb

@NicholasEPfaff @cohnthomas43 @ZakharovSergeyN @RickCory21 @RussTedrake Can it take photos as inputs too? It would be awesome. Very fascinating work!

Nicholas Pfaff@NicholasEPfaff·12 Feb

@sippeyxp @RussTedrake All objects already have estimated friction, and we do support articulated objects.

Peng Xu@sippeyxp·12 Feb

@RussTedrake Yeah! Sim cannot unleash its power until building a sim becomes engineering-free. Thanks for sharing. Do you know if there is a future plan to support contact properties (e.g., friction) and actuated objects?


181

Russ Tedrake@RussTedrake·11 Feb

I've been saying for years that the biggest challenge for simulation in robotics is not actually the physics engine (although you do have to get that right). The real challenge is capturing the *diversity* of the real world. There was no doubt that generative AI had the potential to change that, but it's still amazing to see it take shape. Watching Nick's incredibly fast progress has convinced me that content generation might not actually be a bottleneck anymore. This is a beautiful combination of hardened tools for e.g. low-level mesh processing with the latest tools for generative asset creation, wrapped in a powerful agentic workflow. Please do give it a try and share your feedback.

Nicholas Pfaff@NicholasEPfaff


324

37.6K

Nicholas Pfaff@NicholasEPfaff·12 Feb

@maxzpchen Agreed. I think we are at the point where we have to push the robot side to see how far we can get with these environments and whether they start breaking down anywhere.


283

Max@maxzpchen·12 Feb

One thing I’m especially curious about is how far this kind of unbounded, VLM-driven object vocabulary and physics estimation can go before you start needing task-specific, human-curated distributions again—for example, for household manipulation vs. warehouse vs. surgery—and whether we’ll see “benchmark overfitting” in simulation the same way we did in vision and NLP


269

Nicholas Pfaff@NicholasEPfaff·12 Feb

@nicks_robots @cohnthomas43 @ZakharovSergeyN @RickCory21 @RussTedrake Glad you like it! Agreed. That would be really cool to see! We uploaded >1200 scenes to Hugging Face, so you could try it with those. huggingface.co/datasets/nepfa…

Nick K@nicks_robots·12 Feb

@NicholasEPfaff @cohnthomas43 @ZakharovSergeyN @RickCory21 @RussTedrake Hey Nic! Great stuff, love to see automatic 3D scene construction! This would be sick to pair with my haptic gloves :D

Nicholas Pfaff@NicholasEPfaff·12 Feb

@ludwig_fr Love that!


269

Ludwig_fr@ludwig_fr·12 Feb

@NicholasEPfaff Add some SteamVR teleop and I would give you some free teleop data ^^


283

Nicholas Pfaff@NicholasEPfaff·12 Feb

@hferrolho Glad you liked that one! A pain to simulate with those large forces 😂


184

Henrique Ferrolho@hferrolho·12 Feb

@NicholasEPfaff Earthquake test was my favourite part in this thread! 😆 Cool stuff!


214

Nicholas Pfaff@NicholasEPfaff·12 Feb

Correct. We worked on real2sim (replicating an actual environment in simulation) in the past: scalable-real2sim.github.io The goal of this approach is to match the distribution of real-world and simulated environments for increased scale, but it might not be great at replicating one particular environment from images (though we have used it for that as well).

Abdul R@cyrux004·12 Feb

@NicholasEPfaff @GChongkai I also saw a post from somebody about creating a Gaussian splat of your personal space, which can be used for simulation. That would be better suited if you want to train in a specific environment, versus what you have here if you want to build something general-purpose, right?
