PhysX-Omni treats the physical properties listed below as first-class outputs of generation rather than post-processing annotations.
For embodied systems, meshes have:
• support constraints
• material behavior
• contact dynamics
• articulated structure
• affordances for agents
Most 3D generative pipelines produce assets that look correct but fall apart the moment physics enters the loop.
PhysX-Omni targets the missing layer: generating simulation-native 3D assets with geometry, articulation, material properties, and functional semantics jointly modeled.
The output is usable not just for rendering demos, but also for:
• robotics simulators
• manipulation tasks
• physically grounded scene synthesis
• embodied training pipelines
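To make "simulation-native" concrete, here is a minimal sketch of what a jointly modeled asset record could look like. Every class and field name here (SimAsset, Part, Joint, Material, affordances) is a hypothetical illustration, not PhysX-Omni's actual output format, and the density and volume values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Material:
    name: str
    density_kg_m3: float
    friction: float


@dataclass
class Joint:
    name: str
    joint_type: str               # e.g. "revolute" or "prismatic"
    parent: str                   # name of the parent part
    child: str                    # name of the child part
    limits: Tuple[float, float]   # (lower, upper) in radians or meters


@dataclass
class Part:
    name: str
    mesh_path: str                # geometry
    volume_m3: float
    material: Material            # material properties
    affordances: List[str] = field(default_factory=list)  # functional semantics

    @property
    def mass_kg(self) -> float:
        # Mass follows from material density and part volume.
        return self.material.density_kg_m3 * self.volume_m3


@dataclass
class SimAsset:
    name: str
    parts: List[Part]
    joints: List[Joint]           # articulated structure

    def total_mass_kg(self) -> float:
        return sum(part.mass_kg for part in self.parts)

    def validate(self) -> List[str]:
        """Return a list of problems a simulator would likely choke on."""
        part_names = {part.name for part in self.parts}
        problems = []
        for joint in self.joints:
            for end in (joint.parent, joint.child):
                if end not in part_names:
                    problems.append(f"joint {joint.name!r} references unknown part {end!r}")
            lower, upper = joint.limits
            if lower > upper:
                problems.append(f"joint {joint.name!r} has inverted limits")
        return problems


# Illustrative values only.
wood = Material(name="wood", density_kg_m3=700.0, friction=0.4)
cabinet = SimAsset(
    name="cabinet",
    parts=[
        Part("body", "body.obj", 0.02, wood, ["supports_objects"]),
        Part("door", "door.obj", 0.005, wood, ["openable", "graspable_handle"]),
    ],
    joints=[Joint("door_hinge", "revolute", "body", "door", (0.0, 1.57))],
)
print(cabinet.validate(), cabinet.total_mass_kg())
```

The point of the sketch is that geometry, articulation, materials, and affordances live in one record, so a consistency check can run before the asset ever reaches a simulator.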
SCRIPT introduces a scalable diffusion-policy framework for language-driven humanoid control in physics simulation. Instead of generating offline motion clips, it trains a closed-loop policy that directly controls a humanoid while staying physically stable.
The core idea is JAST-DiT, a diffusion transformer that jointly models actions, body states, and text tokens through shared attention. The policy predicts future action-state chunks, executes only the first action, then replans continuously in a receding-horizon loop.
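The receding-horizon loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions: toy_chunk_policy is a hypothetical stand-in for the diffusion policy (a simple proportional controller, not JAST-DiT), and simulate_step is a one-dimensional placeholder for the physics simulator.

```python
from typing import Callable, List, Tuple

State = float
Action = float
Chunk = List[Tuple[Action, State]]  # predicted (action, resulting state) pairs


def toy_chunk_policy(state: State, instruction: str, horizon: int = 4) -> Chunk:
    """Hypothetical stand-in for the learned policy.

    Predicts `horizon` future (action, state) pairs that move toward a
    target chosen by the text instruction.
    """
    target = 1.0 if instruction == "walk forward" else 0.0
    chunk: Chunk = []
    predicted = state
    for _ in range(horizon):
        action = 0.5 * (target - predicted)  # proportional step toward the target
        predicted = predicted + action
        chunk.append((action, predicted))
    return chunk


def simulate_step(state: State, action: Action) -> State:
    """Toy 'physics': the state simply integrates the action."""
    return state + action


def receding_horizon_rollout(
    policy: Callable[[State, str], Chunk],
    step: Callable[[State, Action], State],
    state: State,
    instruction: str,
    num_steps: int,
) -> List[State]:
    """Plan a chunk, execute only its first action, then replan from the new state."""
    states = [state]
    for _ in range(num_steps):
        chunk = policy(state, instruction)
        first_action, _predicted_state = chunk[0]  # the rest of the chunk is discarded
        state = step(state, first_action)
        states.append(state)
    return states


trajectory = receding_horizon_rollout(
    toy_chunk_policy, simulate_step, state=0.0, instruction="walk forward", num_steps=3
)
print(trajectory)
```

Replanning at every step costs more compute than executing the whole chunk open-loop, but it lets the controller react to whatever state the simulator actually produced.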
At inference, all reconstruction and prediction heads are removed. The robot keeps only a compact GaussianDream prefix that conditions action generation, avoiding Gaussian rendering, video rollout, or planners during execution. Results on LIBERO, RoboCasa, and real robots show strong gains in spatial reasoning, pick-and-place precision, and long-horizon manipulation while remaining lightweight enough for closed-loop control.
GaussianDream argues that current Vision-Language-Action (VLA) robot models mainly imitate actions from videos but do not explicitly model how the environment will change after interaction. Existing 3D VLAs add geometry such as depth or point clouds, yet mostly capture only the current scene. World models can predict futures, but usually rely on expensive video rollouts or latent simulations that are too slow for real-time robotic control.
What is impressive here is not just exploration performance but transfer.
After curiosity pretraining on HM3D, the same RGB-only policy can be fine-tuned for:
• apple picking
• image-goal navigation
• unseen AI-generated worlds
And it outperforms agents trained from scratch on task rewards alone.
No explicit maps.
No planners.
No hierarchical exploration modules.
Just:
persistent world reconstruction + long-context memory + curiosity-driven RL.
A strong argument that scalable exploration may emerge from better memory rather than more handcrafted navigation structure.
Most curiosity-driven RL agents fail in long-horizon exploration because they forget. They revisit the same places, treat them as novel again, and collapse into repetitive loops.
This paper fixes that with two ideas:
1) A persistent world model using online 3D Gaussian Splatting
2) A transformer agent with episodic memory over RGB history
The agent explores purely from curiosity rewards derived from reconstruction error between predicted and observed views. No task rewards, maps, depth sensors, or localization at test time.
Result: emergent behaviors like corridor traversal, backtracking, doorway seeking, and strong zero-shot generalization across photorealistic 3D worlds.
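A toy sketch of the curiosity signal described above: reward is the reconstruction error between the predicted and the observed view, and a persistent model makes that reward vanish on revisits. ToyWorldModel is a hypothetical stand-in that memorizes one view per pose; the paper's model is an online 3D Gaussian Splatting reconstruction, and the pose keys and pixel values here are invented for illustration.

```python
from typing import Dict, List, Sequence


def reconstruction_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean squared error between two flattened views of equal length."""
    if len(predicted) != len(observed):
        raise ValueError("views must have the same number of pixels")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


class ToyWorldModel:
    """Hypothetical stand-in for a persistent scene reconstruction.

    It stores one view per pose key. A real system would render the
    predicted view from its 3D reconstruction instead of a lookup table.
    """

    def __init__(self, num_pixels: int) -> None:
        self.num_pixels = num_pixels
        self.views: Dict[str, List[float]] = {}

    def predict(self, pose: str) -> List[float]:
        # Poses never seen before are predicted as a blank (all-zero) view.
        return self.views.get(pose, [0.0] * self.num_pixels)

    def update(self, pose: str, observed: Sequence[float]) -> None:
        self.views[pose] = list(observed)


def curiosity_reward(model: ToyWorldModel, pose: str, observed: Sequence[float]) -> float:
    """Score the model's prediction first, then fold the observation into the model."""
    reward = reconstruction_error(model.predict(pose), observed)
    model.update(pose, observed)
    return reward


model = ToyWorldModel(num_pixels=4)
hallway = [0.5, 0.5, 1.0, 1.0]
first_visit = curiosity_reward(model, "hallway", hallway)
second_visit = curiosity_reward(model, "hallway", hallway)
print(first_visit, second_visit)
```

Because the model persists across the episode, the second visit to the same view earns no reward, which is the property that discourages the repetitive loops mentioned earlier.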
The performance numbers:
• low-microsecond control loops on CPU
• up to 100M+ dynamics evaluations/sec on GPU
• 2–5× faster Python controllers than Pinocchio/MuJoCo bindings
• near-MJX/Brax GPU throughput with much faster JIT compile times
Robot learning stacks are hitting an infrastructure bottleneck.
Classical rigid-body dynamics engines were built for single robots and recursive CPU execution, not for massive parallel simulation, differentiable control, or accelerator-native learning pipelines.
FRAX rethinks rigid-body dynamics entirely in JAX.