Saturday Robotics


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · 2d

🚀 Calling for Keynotes, Panelists & Lightning Talks 🥂 Saturday Robotics x SF Deep Tech Week Happy Hour 06/25 | Robotics & World Models Reading Club 14 👉🏻 RSVP: luma.com/l1g9c2l1 As part of @saturdayrobotic Deep Tech Week Summit (June 2026), we're bringing together robotics researchers, founders, investors, and builders during SF Deep Tech Week for an evening of discussions on the future of embodied AI and robotics. 📅 June 25, 2026 ⏰ 5:30 PM – 9:30 PM 📍 San Francisco We're currently looking for speakers, panelists, and lightning talk presenters interested in sharing insights on: • World Models • Robotics Frontiers: Industry and Academia • Dexterous Manipulation, Cross-Embodiment • Video Generation, Simulation Researchers, startup founders, industry practitioners, and investors are all welcome. If you'd like to be a lightning talk speaker or panelist, please message @junfanzhu98 or 📧 junfanzhu98@gmail.com Include: • Name + Affiliation • Lightning Talk Abstract, or Proposed Panel Discussion Topics 👉🏻 RSVP: luma.com/l1g9c2l1 Looking forward to bringing together the robotics and embodied AI community for an evening of technical discussion, networking, and collaboration. #Robotics #EmbodiedAI #WorldModels #PhysicalAI #HumanoidRobots #DexterousManipulation #DeepTechWeek #ArtificialIntelligence #MachineLearning #Startup


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · 2d

📚 @saturdayrobotic Robotics & World Models Reading Club 17: Soft Tactile-Centric Multimodal Intelligence Toward Safe and Dexterous Manipulation (San Francisco, 07/11)
👉🏻 RSVP: luma.com/e53zawq2

Keynote: Quan Luu, Postdoctoral Researcher @LifeAtPurdue

As robots transition from structured industrial settings to unstructured environments, they are required to interact with humans and objects in a safe and effective manner. However, achieving robust and safe operation remains a major challenge due to the limited understanding of physical contact. In this seminar, I present my research on advancing embodied robot intelligence through tactile-centric multimodal sensing, perception, and learning. I first introduce the design of soft sensorized robotic components that integrate tactile feedback with proximity and visual sensing to capture rich physical interaction signals. I then demonstrate how these capabilities are integrated into learning and control frameworks to enable reactive manipulation and contact-rich behaviors. Together, this work illustrates how the integration of soft sensorized robot bodies with multimodal perception and control enables safe and adaptive robot behavior while improving overall contact-rich manipulation performance.

Pre-Readings:
• ManiFeel preprint & project: arxiv.org/pdf/2505.18472 | zhengtongxu.github.io/manifeel-websi…
• Vision-Based Proximity and Tactile Sensing for Robot Arms: Design, Perception, and Control. IEEE T-RO, 2025
• Simulation, Learning, and Application of Vision-Based Tactile Sensing at Large Scale. IEEE T-RO, 2023
• Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies. arXiv:2604.27224v2
• CONTACT: CONtact-aware TACTile Learning for Robotic Disassembly. arXiv:2603.08560v1

Location: San Francisco (Downtown)
Date & Time: Saturday, July 11, 2026 | 2:00 PM – 5:00 PM
Hosts: @junfanzhu98 & @aurorafeng_01

Agenda
2:00 PM – 2:30 PM: Doors open & social 🍓 Food, beverages & UNLIMITED strawberries (official Reading Club fruit!)
2:30 PM – 4:00 PM: Keynote by Quan Luu
4:00 PM – 5:00 PM: Q&A + open-floor roundtable (10–20 min per topic): spotlight any paper or technical deep-dive you want to discuss

Past sessions have brought together researchers & engineers from Boston Dynamics, Google DeepMind, NVIDIA, Stanford, UC Berkeley, Physical Intelligence, Tesla, Rhoda AI, Dyna, Generalist and leading Bay Area robotics startups.

Spots are limited; please arrive by 2:00 PM for check-in. Keynote starts promptly at 2:30 PM.

Join the Saturday Robotics Discord for RSVP, updates & community: discord.gg/WH7DrTHRXK
Follow @saturdayrobotic for more frontier robotics & world models content.

Come ready to discuss soft tactile sensing, multimodal perception, contact-rich control, safe dexterous manipulation, and the path to reliable embodied intelligence in the real world!

👉🏻 RSVP: luma.com/e53zawq2
#Robotics #WorldModels #EmbodiedAI #TactileSensing #DexterousManipulation #SafeRobotics #ContactRich #SFTech


Saturday Robotics retweeted

Aurora Feng @aurorafeng_01 · 3d

CVPR Week was a blast! On 6/6, @junfanzhu98 (my cohost at @saturdayrobotic), Anthony Zhao (@ManycoreTech) and I hosted a room full of robotics and world models researchers in Denver.

During CVPR researcher night, @t641769919 Haoyi Niu and I opened the talk with the question we keep coming back to at Neural Motion: What if every robot could learn from every other robot? Our view is that the embodiment gap should not only be pushed downstream into the policy but should also be solved at the data layer. That is the idea behind NM-GenET: a data-transfer-centric foundation model that we're releasing soon.

Max Zhaoshuo Li (NVIDIA Cosmos 3) pushed toward a unified omnimodal world-action engine: understanding, generation, simulation, inverse dynamics, forward dynamics, and control inside one physical AI stack.

Xiaofan Li (WALL-WM from X Square Robot) reframed world modeling around events instead of fixed action chunks, a shift from "what comes after this frame window?" to "what event is unfolding in the world?"

Zesen Zhao's (UMich) geometric verification work asked a very practical question: if world-action models can produce visually plausible futures, how do we know they are geometrically consistent?

Jie Wang (UPenn GRASP’s Robotics MMLU) hit one of the biggest elephants in the room: robot policies are starting to look like foundation models, but robotics still lacks the evaluation coordinate system that lets the field agree on what “better” even means.

And finally, Gordon Qian's Diffusion-DRF showed a different role for VLMs in video generation: not just judging outputs after the fact, but becoming dense credit-assignment engines during optimization.

Now that we're back in SF, I'm excited to bring the @saturdayrobotic community to SF Deep Tech Week. We’re hosting a happy hour on 6/25 for robotics, world models, and physical AI people. If you missed us in Denver, come hang out next week in San Francisco :D 👉 RSVP: luma.com/l1g9c2l1


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · 5d

🤖🦾 @saturdayrobotic Robotics & World Model Reading Club 12 Recap: @DanielXieee (@QuantingX7410) on Reproducible Robotic Dexterity Benchmarking: From Grasp Taxonomies → Multi-Axis Evaluation → Physical AI Dexterity remains one of robotics’ least standardized capabilities. Binary success rates and static grasp taxonomies fail to capture fluent manipulation. Progress requires reproducible benchmarks, automated evaluation, embodiment-aware hardware, and foundation models capable of generating diverse yet semantically meaningful rollouts for post-training. 📏 Human Dexterity Foundations Occupational-therapy benchmarks provide repeatable human baselines: • O’Connor Finger Dexterity Test: high-density pin insertion throughput. • Purdue Pegboard Test (Tiffin, 1948): single/bimanual insertion speed & accuracy. These measure coordination, learning curves, and fine-motor throughput under standardized protocols. ✋ Why Grasp Taxonomies Are Insufficient The classic 33-grasp taxonomy spans Power/Intermediate/Precision grasps (diameter, sphere, disk, prismatic, tripod, lateral, pincer, hook, adduction, parallel-extension, etc.). It measures manipulation vocabulary (available poses), not fluency (dynamic coordination under spatial, temporal, contact, force, and tool constraints). 📊 GENE-26.5 Dexterity Axes Manipulation decomposes into: 1️⃣ Spatial Precision 2️⃣ Temporal Composition 3️⃣ Contact Richness 4️⃣ Contact Coordination 5️⃣ Tool-Mediated Interaction These dimensions better capture dexterity than task-level success alone. 
🧩 DexBench (RLWRLD + NVIDIA + Isaac Lab Arena) 18 atomic task families across 5 domains: Special Picking(4), In-Hand Reorientation(4), Bimanual Regrasp(7), Precision Insertion(5), Hand Fastening(5), Constrained-Axis Manipulation(5), Interface Actuation(4), Force-Regulated Wiping(2), Flowable Material Control(4), Fabric Handling(2), Cable Winding(1), Package Handling(5), Sorting/Binning(3), Bin Packing(2), Box Sealing(1), Precision Arrangement(3), Tool Use(4), Moving Object Interaction(2). Examples: 🔧 Window-regulator assembly requires simultaneous multi-point 6D alignment across articulated linkages with failure modes including forced insertion, reversed seating, component deformation, and jig damage. 💧 Pouring benchmark: 1.5L kettle → 300ml mark. Human judges assess fill level and spillage, revealing reproducibility limits. ⚠️ Current benchmarks still rely on non-standardized kits and human evaluation. 🔄 Toward Fully Automated Evaluation • AutoEval (Berkeley/NVIDIA): 24/7 autonomous evaluation cells, policy queues, PaliGemma-based success classifiers, ~0.99 correlation with human labels. • FurnitureBench: standardized long-horizon furniture assembly. • LIBERO: 130 language-conditioned lifelong-learning tasks. • RoboCasa: large-scale household simulation with leaderboards and distributed evaluators. ✅ Recommended benchmark recipe: • Cheap standardized physical kits (3D-printable/off-the-shelf) • Timed throughput metrics • Human norm curves • Zero human evaluation • Autonomous success detection, recovery logging, duration histograms, and multi-axis scoring 📈 Critical Measurement Gaps Success rates should be supplemented with: • Spatial/temporal/contact-axis scores • Recovery efficiency • Perturbation robustness • Throughput under distribution shift • Tactile & force profiles • Sim2real gap quantification Evaluation models themselves can overfit to task-specific visual cues, necessitating axis-aligned dexterity metrics independent of benchmark idiosyncrasies. 
🤲 Embodiment Gap = Primary Bottleneck Human demonstrations are collected with 5-finger embodiments; ~20% of tasks (e.g., phone manipulation) become infeasible with 3-finger systems. Contact-rich manipulation likely requires dense tactile arrays (~15×15–20×20). Human-video pretraining remains difficult because robot kinematics, sensing, compliance, and dynamics differ substantially from humans. Human-like impedance/muscle-style actuation and matched sensing reduce this transfer gap. 🧠 PhysBrain Egocentric2Embodiment extracts structured physical commonsense from egocentric human video, producing E2E-3M (3M VQA samples) with temporal consistency and evidence grounding. Focus: • State-change reasoning • Object interaction modeling • Long-horizon planning Results: • >20% planning gains versus other 7B-scale models. • Strong transfer through PhysGR00T/PhysPI on SimplerEnv, LIBERO, RoboCasa, ERQA, and PhysBench. This provides dense human-derived physical priors to complement sparse robot trajectories. 🌍 World Models & Post-Training Oasis 3 (Decart) introduces API-accessible, promptable, multi-view, closed-loop, geometry-aware, action-conditioned world models for Physical AI. Key insight: Post-training quality is fundamentally limited by rollout quality. Effective RL requires pretrained VLAs/world models capable of producing diverse but semantically meaningful trajectories. Pretraining and post-training must scale together. Long-term planning likely requires moving beyond pixel/video decoding toward abstract latent dynamics that support shortcut discovery, recovery strategies, and novel tool invention. 🧮 Planning Stack ReAct + PDDL enables verifiable symbolic planning over continuous control. Force-aware vision, muscle-like actuation, and latent action alignment (LARA) improve contact-rich and tool-mediated behaviors. 
🚀 Hardware Co-Design Origami Robotics’ 22-DoF quasi-direct-drive anthropomorphic hands with 1:1:1 mapping between glove, hand kinematics, contacts, and sensing directly attack the embodiment gap. Such systems make PhysBrain-style priors, human-video transfer, and reproducible dexterity benchmarks substantially more practical. 🎯 Scalable robotic dexterity requires the convergence of multi-axis evaluation, autonomous benchmarking, tactile-rich embodiment, physical commonsense pretraining, world-model rollouts, and human-aligned hardware-data co-design.
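To make the "success rate plus axis scores" recommendation concrete, here is a minimal sketch of a multi-axis report. The five axis names come from the recap; the `Rollout` record, the 0 to 1 score scale, and the `summarize` helper are assumptions for illustration and are not the GENE-26.5 or DexBench scoring code.

```python
from dataclasses import dataclass
from statistics import mean

# Axis names follow the recap; the [0, 1] scale is an assumption.
AXES = (
    "spatial_precision",
    "temporal_composition",
    "contact_richness",
    "contact_coordination",
    "tool_mediated_interaction",
)


@dataclass
class Rollout:
    """One evaluated episode (hypothetical record layout)."""
    success: bool
    duration_s: float
    axis_scores: dict  # axis name -> score in [0, 1]


def summarize(rollouts):
    """Report success rate alongside mean duration and per-axis means."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    report = {
        "success_rate": mean(1.0 if r.success else 0.0 for r in rollouts),
        "mean_duration_s": mean(r.duration_s for r in rollouts),
    }
    for axis in AXES:
        report[axis] = mean(r.axis_scores[axis] for r in rollouts)
    return report


rollouts = [
    Rollout(True, 12.0, dict.fromkeys(AXES, 0.8)),
    Rollout(False, 20.0, dict.fromkeys(AXES, 0.4)),
]
report = summarize(rollouts)
```

Keeping the binary success rate next to the axis means lets two policies with the same success rate still be told apart by where they lose precision or contact quality.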

x.com/i/article/2066…



Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 13

@ycombinator @FrancoisChauba1 Welcome to the @saturdayrobotic Robotics & World Models Reading Club, every Saturday in SF! luma.com/saturdayrobotic


Saturday Robotics retweeted

Manycore Tech @ManycoreTech · Jun 12

Nearly 200 builders gathered at our Spatial Intelligence & World Models event @CVPR 2026 in Denver. The community is hungry for real answers on robotics, world models, video gen, and physical AI. No settled path yet. That’s what makes it interesting. Here are 6 sentences that captured the conversation. 🧵



Saturday Robotics retweeted

Gordon (Guocheng) Qian@CVPR2026 @guocheng_qian · Jun 12

Glad to share our vision and progress in RL tuning for video generation in @saturdayrobotic at @CVPR 2026.

🎬 At @saturdayrobotic @CVPR 2026 Research Night, @guocheng_qian (Senior Research Scientist @Snap) presented lightning talk "Diffusion-DRF: a new RL/post-training paradigm for video diffusion models". TL;DR: Scalar rewards are too coarse for video generation. Diffusion-DRF turns VLM explanations into free, rich, dense, differentiable rewards with spatial/semantic credit assignment. Why: RL-style post-training powers LLM reasoning and image generation, but video is harder. A single reward cannot identify which object, motion, frame, physical inconsistency, or visual defect caused failure. GRPO-style methods often reward-hack and collapse within ~300 steps; video RL stability becomes a major bottleneck. Diffusion-DRF remains stable beyond 3K training steps. Core idea: Instead of binary/scalar rewards, use Qwen2.5-VL as a VQA reward engine. 1️⃣ Reference video + caption → structured decomposition: • Environment • Objects • Object locations • Other scene attributes 2️⃣ Generate multi-dimensional questions: • TA (Text Alignment): does the environment/object configuration match the prompt? • Phy (Physics): are objects physically plausible (deformation, motion, interactions)? • VQ (Visual Quality): blur, artifacts, defects? 3️⃣ Qwen2.5-VL answers these questions on the reference video, producing Yes/No + free-form explanations as targets. 4️⃣ Generated video is evaluated by the same VLM. 5️⃣ VQA next-token prediction loss becomes a differentiable reward Key insight: VLM token probabilities and explanations provide dense token-level feedback instead of brittle reward labels. Even more interesting: gradients flow through the VAE decoder and final denoising stages, allowing direct optimization of the video diffusion model. VLMs are not just judges—they become credit-assignment engines. 
Results (V-Bench 2.0): Diffusion-DRF (7B) achieves: • Overall 55.38 • Creativity 64.58 • Common Sense 56.96 • Controllability 27.98 • Human Fidelity 80.51 • Physics 56.85 • Material 75.52 • Dynamic Attribute 42.86 • Motion Rationality 40.23 • Complex Landscape 21.05 • Camera Motion 24.69 Outperforming Flow-DPO (50.27), Flow-GRPO (50.64), VideoAlign (53.55), Vanilla-DRF (53.72), and Wan2.1-3B-T2V (52.99), while maintaining far stronger training stability. Qualitatively: • More realistic human expressions and object interactions (e.g., honey pouring, facial fidelity) • Better object color/location control • More accurate manipulation actions • Stronger physical consistency and scene composition The next step is TC-GRPO (Diffusion-DRF + GRPO). Instead of scalar rewards, VLM gradients provide token-level credit assignment inside Group Relative Policy Optimization loops. On HunyuanVideo-1.5: • More natural handshakes • Better human-object interactions • Stronger lighting realism • Improved motion dynamics • Better road adhesion, composition, and photorealism in driving scenes Big picture: Video world models face the same scaling transition LLMs experienced: pretraining is no longer enough; post-training dominates compute (e.g., Composer 2.5 reportedly spends ~85% of compute on additional training/RL). The challenge is that scalar rewards break down in video. Diffusion-DRF suggests a different path: use VLM explanations + token probabilities as free, dense, differentiable rewards and token-level credit signals. VLM gradients may become for video generation what RLHF/RLAIF became for language models. @CVPRConf #CVPR2026
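Step 5 above (VQA next-token prediction loss as the reward) can be sketched numerically. This is only an illustration under stated assumptions: the toy three-token vocabulary, the `vqa_reward` name, and the hand-written logits are hypothetical, and the real method would compute the same quantity inside an autodiff framework so that gradients can reach the video model, which plain Python cannot show.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def vqa_reward(step_logits, target_ids):
    """Negative mean next-token cross-entropy of the target answer tokens.

    step_logits: one logit vector per answer position, standing in for what
                 a VLM would emit when reading the generated video.
    target_ids:  token ids of the answer obtained on the reference video.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector per target token is required")
    log_probs = [log_softmax(l)[t] for l, t in zip(step_logits, target_ids)]
    return sum(log_probs) / len(log_probs)


# Toy vocabulary (assumption): 0 = "Yes", 1 = "No", 2 = anything else.
target = [0]  # the reference video's answer was "Yes"
good = vqa_reward([[4.0, 0.0, 0.0]], target)  # VLM agrees on the generated video
bad = vqa_reward([[0.0, 4.0, 0.0]], target)   # VLM disagrees
```

Because the reward is a log-probability rather than a hard Yes/No label, it changes smoothly as the VLM's confidence shifts, which is the property the post describes as dense feedback.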


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 12

🤖 @CVPR 2026 Hot 🔥 Takes on Embodied AI: VLA × World Models × Agentic Loops @CVPRConf Embodied AI is converging toward a unified stack: VLA policies + world models + active perception, connected by hierarchical memory, reusable skills, and long-horizon orchestration. 🔹 Trends • Scenario-level generalization under distribution shift (novel objects, clutter, lighting) without task finetuning. • Sim-scale pretraining → real-world adaptation. • Language-conditioned manipulation, hierarchical planning, reusable skills. • Scaling axes: larger multimodal FMs, recursive refinement loops, test-time compute (reasoning/planning). • Shift from discrete query-response systems → continuous inference, streaming state maintenance, and full-duplex perception-action loops. 🔹 @sudo_robotics • Hierarchical VLA: language planner → skill toolbox → actions. • Real2Sim2Real pipeline with ManiSkill3 + SAPIEN. • Foundation-model approach: scale simulation, reusable skills, language-promptable robots. • Generalizes from fish-oil softgels to unseen plush toys across booths with zero task-specific finetuning. • ViTaMIn-B-style visuo-tactile sensing. • Clever hardware: multi-monocular cameras outperform stereo depth for hand-object visibility and reduced finger occlusion. 🔹 @meta_aria Perception-first embodied engineering: • Online calibration + temperature-aware compensation. • Detects minute calibration drift with mm-level precision. • Pixel-level exposure adaptation for HDR environments. • Visual-inertial SLAM optimized for localization, not photography. • Monochrome sensors improve feature extraction and long-term tracking robustness. 🔹 ForeAct (@MIT HAN Lab) Visual foresight as a plug-and-play module for any VLA. Pipeline: Qwen3-VL → subtask decomposition → diffusion-based goal imagination → robot → VLM monitor → replanning. Key idea: Separate semantic reasoning, task decomposition, future prediction, and control. 
ManiSkill decomposes tasks into skills; ForeAct decomposes tasks into future states. 🔹 SaPaVe (@PKU1898 / Beihang / BAAI) First end-to-end VLA combining semantic active perception + manipulation. Key insight: If information is insufficient, acquire information before acting. Architecture: • Camera Action Decoder (2 DoF yaw/pitch semantic viewpoint control). • Manipulation Decoder (26 DoF dual-arm control). • Camera Adapter: LoRA on Eagle-2 VLM (<2% trainable params). • Universal Spatial Encoder (MapAnything) injects depth, intrinsics, extrinsics, arbitrary geometry. • ~15% performance gain from geometry-aware view-invariant reasoning. Together: SaPaVe = gather information ForeAct = imagine future outcomes Loop: reason → inspect → imagine → execute → verify → replan. 🔹 WoW (14B World Model) • Trained on 2M robot trajectories. • SOPHIA self-optimization: generate → VLM critique → rewrite → regenerate. • Improves causal validity, collision reasoning, consistency. • Learns embodied physics directly from interaction. • Inverse Dynamics module converts imagined futures into executable actions. 🔹 Maestro Robotics OS paradigm: VLAs become modules inside an orchestration layer. Responsibilities: • Information sufficiency assessment. • Invoke SaPaVe / ForeAct / WoW. • Maintain long-horizon task memory. • Policy/primitive selection. • State tracking across time. Emerging view: Robotics is orchestration, not monolithic policy learning. 🔹@NVIDIAAI Cosmos3 Discussion: Always-On World Models @NVIDIARobotics Hypothesis: Future intelligence emerges from continuous prediction-reality mismatch correction. Architecture: • Persistent latent memory. • Self-monologue + dreaming loops. • Continuous VLM auditing. • Automatic memory pruning. • Test-time learning as a first-class capability. Inference scaling may have 3 orthogonal axes: 1️⃣ Larger multimodal models. 2️⃣ Recursive latent compression/folding. 3️⃣ Test-time rollout, search, self-consistency, continuous refinement. 
Data bottleneck: Egocentric trajectories + YouTube-scale multi-view video + action-conditioned interaction logs. Potentially ~50× more high-quality action data needed for the next phase transition. 🔹 From Tokens to Robots Fireside • VLAs and LLMs are both sequence models; robot tokens correspond to actions, states, and trajectories. • Action spaces become robotics' version of function calling. • World models optimize action-conditioned transition prediction rather than behavior imitation. • RL adds critics/value functions for selecting among imagined futures. • Failure trajectories remain valuable training data. • Calibration may matter more than raw accuracy. • Contact-rich interaction remains robotics' hardest challenge. • Robotics lacks a Chinchilla-style scaling law relating data, model size, compute, and downstream performance. • World models may become evaluation engines before policy engines. 🎯 Takeaway Active Perception (SaPaVe) → Visual Foresight (ForeAct) → World Models (WoW) → Agentic Orchestration (Maestro) with continuous loops of: Perceive ↔ Imagine ↔ Predict ↔ Act ↔ Revise The open challenge remains unifying perception, memory, planning, control, causal representation learning, diffusion MPC, and action-conditioned world modeling into a stable long-horizon embodied intelligence scaling law.
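The fireside point that critics or value functions select among imagined futures can be shown with a toy example. Everything here is a stand-in: `toy_transition` plays the role of a learned action-conditioned world model, `toy_value` the role of a learned critic, and the 1-D state is purely illustrative.

```python
def imagine(state, actions, transition):
    """Roll a transition model forward over an action sequence."""
    for action in actions:
        state = transition(state, action)
    return state


def select_plan(state, plans, transition, value):
    """Pick the plan whose imagined final state the critic scores highest."""
    return max(plans, key=lambda plan: value(imagine(state, plan, transition)))


def toy_transition(x, a):
    # Stand-in for a learned world model: position shifts by the action.
    return x + a


def toy_value(x):
    # Stand-in for a learned critic: closer to the goal x = 3 is better.
    return -abs(x - 3)


plans = [(1, 1), (1, 1, 1), (2, 2)]
chosen = select_plan(0, plans, toy_transition, toy_value)
```

No plan is executed during selection; the candidates are compared only through the model's predictions, which is why prediction quality bounds how useful this kind of search can be.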



Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 11

🌌 @saturdayrobotic @CVPR 2026 Robotics Research Night Recap: World Models, Physical AI & Embodied Intelligence @CVPRConf 👉🏻YouTube: youtube.com/live/P_3gSC-5c… 👉🏻Luma: luma.com/zamm9g2g 6 Lightning talks: 🤖 @neuralmotion — NM-GenET @aurorafeng_01 introduced NM-GenET, a generative video-action model for universal embodiment transfer and cross-domain policy learning. The goal is to enable policies learned on one robot, morphology, or environment to generalize across embodiments and domains through video-action generation. 🌍 @NVIDIAAI Cosmos 3 @mli0603 Zhaoshuo Li unveiled Cosmos 3, NVIDIA's next-generation omnimodal world model. Built on a Mixture-of-Transformers architecture with parallel autoregressive and diffusion pathways, Cosmos 3 jointly processes and generates language, image, video, audio, and action sequences within a single model. The same backbone supports: • Vision reasoning • Image/video/audio generation • Forward dynamics prediction • Inverse dynamics inference • Robot policy control A particularly impressive capability is explicit spatial grounding combined with structured action generation, allowing the model to identify task-relevant objects, reason about spatial relationships, and generate executable robot trajectories in cluttered scenes. Cosmos 3 positions omnimodal world models as a foundation model for Physical AI, unifying understanding, generation, simulation, reasoning, and control. 🧠 WALL-WM (@XSquareRobot) Xiaofan Li presented WALL-WM, a World Action Model built around event-level Vision-Language-Action pretraining. Instead of predicting fixed-length action chunks, WALL-WM treats semantic events as the atomic unit of world modeling. Core transition: Next Chunk Prediction → Next Event Prediction By aligning language, perception, and action around event representations, WALL-WM aims to better capture real-world temporal structure while preserving pretrained multimodal priors. 
The architecture supports both: • Language-guided event reasoning • Event-centric world simulation This represents a shift from modeling "what action follows this frame window" to modeling "what event is unfolding in the world." 📐 Test-Time Scaling for World Action Models @SourORZ1 Zesen Zhao (@UMich) presented a training-free verification framework for World Action Models. Key insight: Predicted futures should be geometrically consistent across multiple camera views. Using frozen VGGT depth estimation and cross-view reprojection consistency, the system performs Best-of-N rollout selection without additional training or robot rollouts. The broader argument is that geometry remains largely implicit in current VLAs and WAMs, making depth a potentially important next scaling axis for Physical AI. 📊 Toward a Robotics MMLU @JieWang_ZJUI (@Penn @GRASPlab) argued that robotics lacks an equivalent of MMLU. While robot policies increasingly resemble foundation models, evaluation remains fragmented across hundreds of incompatible benchmarks. • Decomposable capability axes • Reproducible evaluation protocols • Distributed evaluator networks • Generalization-first benchmarking A recurring observation was that tiny distribution shifts—camera placement, lighting, human interaction variations—can still collapse state-of-the-art policies. 🎥 Diffusion-DRF @guocheng_qian (@Snap) presented Diffusion-DRF, a new post-training paradigm for video diffusion models. Instead of relying on scalar rewards, Diffusion-DRF converts VLM-generated explanations and token probabilities into dense differentiable rewards that provide spatially and semantically precise credit assignment. Key result: Training remains stable beyond 3,000 steps, significantly outperforming conventional GRPO-style video RL approaches that often collapse after only a few hundred iterations. 
The broader implication is that VLMs may evolve from evaluators into credit-assignment engines for video generation and future world models. 💡Summary • World models are moving from frame/chunk prediction toward semantic event prediction. • Omnimodal architectures are beginning to unify perception, reasoning, simulation, and control. • Test-time scaling is becoming increasingly important for embodied systems. • Geometry and depth may become foundational modalities rather than auxiliary signals. • Evaluation remains one of the largest bottlenecks for robotics foundation models. • Post-training and inference-time optimization are emerging as critical scaling dimensions alongside model size and data scale. Converging toward continuously operating world models that can perceive, predict, reason, simulate futures, detect mismatches with reality, and update themselves in an ongoing loop. The future may look like an always-on interaction system built around persistent world modeling.
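The training-free Best-of-N idea from the test-time scaling talk can be sketched as follows. This is a simplified stand-in, not the presented system: it assumes the depth estimation and cross-view reprojection have already been done upstream, so each candidate rollout arrives as a list of (reprojected depth from view A, estimated depth in view B) pairs. The function names are hypothetical.

```python
def consistency_error(depth_pairs):
    """Mean absolute gap between reprojected and directly estimated depth."""
    if not depth_pairs:
        raise ValueError("need at least one depth pair")
    return sum(abs(a - b) for a, b in depth_pairs) / len(depth_pairs)


def best_of_n(candidates):
    """Return the index of the most geometrically consistent rollout."""
    if not candidates:
        raise ValueError("need at least one candidate rollout")
    errors = [consistency_error(pairs) for pairs in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors


# Three imagined rollouts, two sampled pixels each (toy numbers, in meters).
candidates = [
    [(1.0, 1.3), (2.0, 2.5)],
    [(1.0, 1.05), (2.0, 1.95)],
    [(1.0, 0.6), (2.0, 2.0)],
]
best, errors = best_of_n(candidates)
```

The selection step needs no gradient updates and no robot execution, which matches the talk's framing of geometry as a test-time verifier rather than a training signal.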


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 10

CVPR 2026 Embodied AI Highlight Papers Active Perception · Visual Foresight · Embodied Cognitive Loops 1. ForeAct (MIT HAN Lab, Zhuoyang Zhang, Shang Yang et al., arXiv:2602.12322, github.com/mit-han-lab/fo…) ForeAct delivers efficient visual foresight that steers any VLA via atomic visual goal imagination. It addresses the failure mode where sufficient information already exists, but explicit future grounding is missing. If SaPaVe answers: Do I know enough to act? ForeAct answers: Now that I know enough, what exactly should success look like? The core argument: existing VLAs are overloaded. They simultaneously perform: semantic reasoning, task decomposition, future prediction, visuo-motor control. ForeAct explicitly separates these responsibilities. This resembles skill-library systems such as ManiSkill in spirit, but with a different abstraction: ManiSkill decomposes tasks into reusable skills; ForeAct decomposes tasks into reusable future states. Unlike Sudo-style systems that reduce VLAs into lightweight coordinators over primitives, ForeAct keeps the VLA intact and steers it via visual foresight. Closed loop pipeline: Qwen3-VL → subtask → ImGen → robot (multi-cam) → VLM monitor / re-plan (finer granularity than ManiSkill skills; no VLA replacement, unlike Sudo-style coordination layers) 2. SaPaVe (Mengzhen Liu, Enshen Zhou et al., PKU / Beihang / BAAI, arXiv:2603.12193) SaPaVe delivers the first end-to-end VLA unifying semantic active perception and manipulation via explicit decoupling. It addresses insufficient information before action. I was surprised that the human-like paradigm: “Look again, look closer, look left and right” (combining perception + action) was not already well-established in VLAs—it is extremely natural for embodied intelligence. Core insight SaPaVe solves the regime where robots lack: occlusion understanding, grasp affordances, articulation state, action success certainty. 
Existing VLAs operate under passive perception: fixed camera viewpoints, direct manipulation prediction from static observations. However, active perception introduces a key coupling problem: moving the camera changes observations, manipulating objects changes observations, reorienting objects changes observations. Traditional unified action spaces entangle: camera motion objectives, manipulation objectives. SaPaVe resolves this via explicit decoupling. Decoupled design Embodied intelligence becomes a two-branch decision process: - test information sufficiency - if sufficient → act; if insufficient → active information acquisition. SaPaVe + ForeAct together instantiate this loop: reason → gather info → imagine futures → execute → verify → re-plan (vs traditional perceive → act) SaPaVe architecture Camera Action Decoder: 2 DoF (pitch + yaw), embodiment-agnostic semantic viewpoint control, supports: “look left / zoom / inspect behind” Manipulation Action Decoder: 26 DoF joint positions, dual-arm dexterity Decoupled heads outperform unified decoder (71.25% vs lower baseline) Camera / perception modules Camera Adapter: LoRA on Eagle-2 VLM, <2% trainable parameters, learns semantic active perception priors, preserves base manipulation knowledge Universal Spatial Encoder (MapAnything): injects depth + intrinsics + extrinsics + arbitrary geometry, element-wise fused into VLM tokens & action head during denoising, enforces view-invariant 3D consistency, improves performance by ~15% even on simple tasks. 3. Long-horizon cognition: WoW (arXiv:2509.22642) WoW is a 14B embodied world model trained on 2M robot trajectories (not passive video). Key mechanism: SOPHIA self-optimizing loop: generate, VLM critique (physical + causal validity), rewrite, regenerate. This improves: consistency, collision reasoning, causal validity. Unlike video-only world models, WoW learns physical dynamics directly from embodied interaction. 
It also introduces Inverse Dynamics → executable actions, achieving SOTA on manipulation simulation and real Franka setups. Overall implication: embodied pretraining may function as meta-learning for intuitive physics. 4. Agent OS / Robotics orchestration: Maestro (maestro-robot.github.io) Maestro reframes VLAs as modules inside a robot operating system layer. This OS layer is responsible for: deciding information sufficiency, invoking SaPaVe / ForeAct / WoW, tracking long-horizon state, selecting primitives / policies, maintaining task memory across time Pure VLAs remain weak at long-horizon reasoning. Missing system components (explicit gaps): causal latent learning (MPI-style), Diffusion MPC, tighter integration between generative world models and real-time control. Related systems (e.g., Dexmate) similarly argue for: representation layers, world models, agentic harnesses, modular execution systems. The emerging paradigm: robotics as orchestration, not monolithic policy learning Conclusion SaPaVe (information acquisition layer): semantic active perception, embodiment-agnostic camera control, decoupled action modeling, geometry-aware viewpoint reasoning. ForeAct (future grounding layer): atomic subtask decomposition, visual goal imagination, efficient diffusion-based foresight, plug-and-play steering of existing VLAs. System stack: Above both layers sit: embodied world models (WoW), agentic orchestration frameworks (Maestro), representation-centric architectures (Dexmate) Likely missing ingredients to close the loop: causal latent representation learning, diffusion-based model predictive control, MPI-style causal world modeling frameworks. @CVPR @CVPRConf @saturdayrobotic #CVPR2026
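The loop described above (reason → gather info → imagine futures → execute → verify → re-plan) can be sketched as a small control loop. All five callables are hypothetical placeholders for the roles the post assigns to an information-sufficiency check, active perception, visual foresight, the policy, and a monitor; none of this is the SaPaVe, ForeAct, or Maestro API.

```python
def run_episode(is_sufficient, gather_info, imagine_goal, execute, verify,
                max_steps=10):
    """Run the two-branch loop and return the sequence of stages visited."""
    trace = []
    obs = None
    for _ in range(max_steps):
        if not is_sufficient(obs):
            obs = gather_info(obs)      # active perception branch
            trace.append("gather")
            continue
        goal = imagine_goal(obs)        # visual foresight
        trace.append("imagine")
        obs = execute(obs, goal)        # policy acts toward the imagined goal
        trace.append("execute")
        if verify(obs, goal):           # monitor checks the outcome
            trace.append("done")
            return trace
        trace.append("replan")
    return trace


# Toy task: two viewpoints are needed before acting; one action then succeeds.
trace = run_episode(
    is_sufficient=lambda o: o is not None and o["views"] >= 2,
    gather_info=lambda o: {"views": (o["views"] if o else 0) + 1,
                           "placed": False},
    imagine_goal=lambda o: {"placed": True},
    execute=lambda o, g: {**o, "placed": True},
    verify=lambda o, g: o["placed"] == g["placed"],
)
```

In this toy run the loop looks twice, imagines once, acts once, and stops, so the contrast with a plain perceive → act policy is that acting is gated on the sufficiency check.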



Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 10

🌌 From Tokens to Robots — Fireside on VLAs, Robotics & World Models 💡Embodied intelligence is converging toward a unified sequence modeling stack, where perception, language, and action are jointly represented—but with distinct roles for policy learning, world modeling, and control. • Web-scale data is increasingly being used to power vision, multimodal, VLA, and robotics systems. The key challenge is not only model architecture, but also the data layer: collection, structuring, pipelines, infrastructure, deployment, and downstream workflows. • VLAs and LLMs can be viewed through the same lens: next-token prediction. VLAs extend sequence modeling, but actions are often generated via autoregressive tokens, latent plans, or diffusion-based continuous policies. The difference is that robot tokens correspond to physical actions, states, and control trajectories. Visual, tactile, language, teleoperation, egocentric, and action data can all be tokenized and aligned. • VLA architectures naturally combine System-2 planning and System-1 execution. High-level reasoning generates plans; low-level controllers generate action/control tokens for locomotion and manipulation in joint space or continuous state space. Action spaces become the robotics analogue of function calling (abstraction level of structured action interfaces, not at the low-level control dynamics). Agent/sub-agent architectures remain similar to LLM agents, but operate through a physical harness interacting with the world. • A central representation-learning question is whether all modalities should share encoders and latent spaces or use separate encoders with aligned representations. Raw modalities—including video, robot joint positions, text, numbers, tactile signals, and future sensor streams—can be discretized into tokens and embedded into shared spaces. 
CLIP-style representation learning demonstrated that multimodal embeddings can be pulled into a common latent manifold; robotics extends this idea toward action-conditioned representations. • Deployment raises additional questions: SLMs vs frontier models, edge inference vs cloud inference, and how reasoning systems interact with low-level control loops. Language-space reasoning and physical-world control remain fundamentally different optimization problems. • A recurring theme was that many robotics problems ultimately reduce to robust and optimal control: understanding what is physically possible, how systems evolve, and how actions influence future states. • World-model advocates argued that directly imitating behavior may be less scalable than learning action-conditioned dynamics. Rather than copying demonstrations, models learn how the world evolves under intervention. Given a state and candidate action, predict the next state and evaluate consequences before acting. • RL extends this framework through value functions and critics that estimate whether future states are safe, successful, or reward-maximizing. This enables exploration beyond demonstrations and provides a mechanism for selecting among multiple possible futures. • Implicit world models further allow internal imagination: generating hypothetical rollouts, counterfactual trajectories, and alternative action sequences before executing in the real world. • Data quality remains a core bottleneck. Imitation learning depends on high-quality demonstrations, but world models can also learn from failures. Failure trajectories still contain transition dynamics and may improve predictive understanding of the environment. • The defining property of robotics world models is controllability. Robotics requires action-conditioned prediction: given state + action, what happens next? Hallucinating plausible futures is insufficient if future states cannot be controlled through actions. 
• Data-distribution gaps become diagnostic tools. If a world model becomes uncertain or inaccurate in particular regions of state-action space, those gaps reveal missing coverage and indicate where additional data collection is needed. • At a scaling-law level, the open question is: if we had access to all web data, all teleoperation data, and all robot trajectories, what is the most scalable framework for embodied intelligence? Can next-token prediction absorb world knowledge, manipulation skills, teleoperation traces, successful trajectories, and failure trajectories into a single learning system? • A major unresolved research problem is robotics scaling laws. Unlike language models, robotics lacks a Chinchilla-style framework relating token count, model size, compute, losses, and downstream performance. Open questions include: How do web-data tokens scale relative to teleoperation tokens? How do perplexity loss and diffusion loss correlate with control performance? What metrics predict downstream task success? How should post-training quality be measured? • Calibration may matter as much as accuracy. Consider a dataset containing a 50/50 mixture of successful and failed trajectories. A world model should not merely predict outcomes; it should accurately estimate confidence. Correct uncertainty calibration may be a better indicator of world-model quality than raw accuracy alone. • Transfer learning emerged as another scaling property. Combining unrelated data sources—such as human door-opening videos and robot towel-folding trajectories—may improve performance on new manipulation tasks. Shared token representations could allow world knowledge learned in one domain to transfer into another. • VLA and world models optimize different objectives: VLA: learn behavior directly; benefit from as much action data as possible; deploy policies in the real world. 
World Model: explicitly model transitions, state deltas, contact dynamics, and action-conditioned futures; reason over outcomes before execution. • Contact-rich interaction remains one of robotics' hardest challenges. World models must accurately predict state transitions under contact, not merely free-space motion. Contact-physics datasets become particularly important because small state deltas can produce large behavioral differences. • Rich sensing remains necessary. Wrist cameras, tactile sensing, and close-range observations continue to provide information unavailable from distant viewpoints. • Continuous-space and discrete-token world models remain an open research tradeoff. JEPA-style continuous representation learning is promising but still relatively immature in robotics compared with token-based approaches. Some researchers believe new “information highways” enabled by continuous representations may unlock future breakthroughs. • Data scaling currently dominates architecture optimization. Transformers may not be the ideal robotics architecture, but they can efficiently absorb enormous quantities of data. In practice, abundant data often compensates for architectural imperfections. • One of the strongest arguments for world models is evaluation. High-fidelity world models aligned with both simulation and reality can dramatically reduce real-world testing costs by narrowing the search space of critical scenarios. • Autonomous-driving workflows provide a useful analogy. World models can identify rare or interesting scenarios, curate them into training datasets, and repeatedly evaluate new checkpoints against them. Instead of running extensive real-world testing hundreds of times, many evaluations may be performed inside the model itself. • This suggests world models may become evaluation engines before they become policy engines. Evaluation, validation, and failure discovery could be among their most valuable near-term applications. 
• Robotics imposes an unusually high reliability requirement. Many applications demand near-100% confidence, turning evaluation into a scientific go/no-go problem rather than a simple benchmark exercise. • A memorable contrast emerged: Self-driving simulation often ends when contact occurs. Robotics simulation often begins when contact occurs. This helps explain why robotics simulation remains substantially harder. Contact interactions can destabilize simulators and create complex dynamics that are difficult to model accurately. • Planning remains a key downstream use case. World models can support model-predictive control (MPC), search, planning, and imagined rollouts. Dreamer-style systems demonstrate this idea by learning world models that allow agents to internally simulate futures, including long-horizon tasks such as collecting diamonds in Minecraft. • Web data remains both powerful and insufficient. It provides semantics, common-sense knowledge, and broad world understanding, but often lacks action-conditioned viewpoints and robot-centric observations. Robotics fundamentally cares about state-action transitions: conditioned on an action, what state comes next? • Some robotics-relevant information may exist in web data even when task success is near zero. Pretraining on large-scale data can still provide useful priors before teleoperation fine-tuning. • Datasets such as UMI provide useful 3D observations, but the broader question remains: what portions of semantics, physics, manipulation knowledge, and low-level control can truly be learned from internet-scale data? • Robotics also faces challenges that extend beyond perception. Understanding messy real-world environments, human behavior, and interaction dynamics remains difficult. Several participants highlighted spatial reasoning as a persistent bottleneck, with errors in spatial understanding still responsible for a significant fraction of failures (~12%). 
• World models are not obviously more data-efficient than VLAs. Determining the relative scaling efficiency of policy learning versus predictive modeling remains an open empirical question. • Physics-based modeling remains highly relevant. Topics discussed included CFD-style physics solvers, neural operators, and Material Point Methods (MPM), which have proven effective for simulating deformable materials such as snow, sand, and related contact-rich phenomena. The frontier question is no longer “VLA or World Model?” It is whether web-scale semantics, teleoperation data, action-conditioned prediction, planning, evaluation, control, simulation, and physics can all be unified under a single scaling law for embodied intelligence. @itsdanielho, @j_golebiowski, @_ankurdeka_, Ahmet Sarkaya, @itsajchan
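The calibration point in this thread (a world model should estimate confidence correctly, not just predict outcomes) can be made concrete with a small worked example. The two hypothetical models below reach the same accuracy on a 50/50 mix of successful and failed trajectories, yet differ sharply in Brier score and expected calibration error. The data and both models are invented for illustration, and the choice of these two metrics is an assumption, not something the speakers specified.

```python
from typing import List, Sequence, Tuple


def accuracy(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Fraction of trajectories whose thresholded prediction matches the outcome."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes)) / len(outcomes)


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted success probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def expected_calibration_error(probs: Sequence[float], outcomes: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted gap between mean confidence and empirical success rate per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_conf - success_rate)
    return ece


# Five successful and five failed trajectories (1 = success).
outcomes = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical model A: overconfident. Hypothetical model B: hedged.
model_a = [0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.01, 0.99, 0.99]
model_b = [0.60, 0.60, 0.60, 0.40, 0.40, 0.40, 0.40, 0.40, 0.60, 0.60]

for name, probs in (("A", model_a), ("B", model_b)):
    print(name, accuracy(probs, outcomes), brier_score(probs, outcomes),
          expected_calibration_error(probs, outcomes))
```

Both models are right on 6 of 10 trajectories, but only model B's stated 60% and 40% match how often those predictions actually succeed, which is the property the thread argues a world model should be scored on.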


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 10

📚 @saturdayrobotic Robotics & World Models Reading Club 12: @QuantingX7410 on Dexterity — 06/13 👉🏻RSVP: luma.com/5w7c1t2a Keynote: @DanielXieee (Co-Founder, @QuantingX7410 YC W26): Dexterity Benchmark We Need 2026 is the year of dexterity claims — frontier labs advertise “human-level” and “dexterity-first” manipulation, yet this spring alone three separate benchmark initiatives launched from industry, academia, and standards bodies. None are comparable. Definitions remain vague, taxonomies cover only narrow slices of manipulation, and demo reels stand in for protocols. The rigorous verification and reproducibility machinery that worked for vision, NLP, and even 1940s occupational therapy has yet to arrive in robot manipulation. This talk traces a century of attempts — from the Purdue Pegboard to today’s fragmented benchmarks — and argues that every piece of a proper dexterity benchmark already exists, just scattered across communities that rarely talk. Highly interactive: bring your own definition of dexterity and we’ll see whether the room converges any better than the field has. Pre-Readings Definitions & taxonomies: Napier, The Prehensile Movements of the Human Hand, JBJS 1956 Elliott & Connolly, A Classification of Manipulative Hand Movements, Dev. Med. Child Neurol. 1984 Cutkosky, On Grasp Choice, Grasp Models, and the Design of Hands, IEEE T-RA 1989 Ma & Dollar, On Dexterity and Dexterous Manipulation, ICAR 2011 Bullock et al., A Hand-Centric Classification of Human and Robot Dexterous Manipulation, IEEE ToH 2013 Dafle et al., Extrinsic Dexterity: In-Hand Manipulation with External Forces, ICRA 2014 Feix et al., The GRASP Taxonomy of Human Grasp Types, IEEE THMS 2016 Human dexterity assessment: Tiffin & Asher, The Purdue Pegboard, J. Applied Psychology 1948 Mathiowetz et al., Box and Block Test, AJOT 1985 Light et al., SHAP: Southampton Hand Assessment Procedure, Arch. 
PM&R 2002 Robot-hand dexterity benchmarks: Zhou et al., 50 Hand Dexterity Benchmarks (HD-marks), 2020 Coulson et al., The Elliott and Connolly Benchmark, IEEE-RAS Humanoids 2021 Elangovan et al., Modular Dexterity Test Board, 2022 Liconti, Zhou, et al., POMDAR: A Benchmark of Dexterity for Anthropomorphic Robotic Hands, arXiv:2604.09294, 2026 Task suites & object kits: Calli et al., YCB Object and Model Set, ICAR 2015 Kimble et al., Benchmarking Protocols for Small Parts Robotic Assembly (NIST task boards), IEEE RA-L 2020 Heo et al., FurnitureBench, RSS 2023 Luo et al., FMB: A Functional Manipulation Benchmark, IJRR 2024 Liu et al., LIBERO, NeurIPS 2023; Nasiriany et al., RoboCasa, RSS 2024 Evaluation methodology & infrastructure: Li et al., SimplerEnv: Evaluating Real-World Robot Manipulation Policies in Simulation, CoRL 2024 Zhou et al., AutoEval: Autonomous Evaluation of Generalist Robot Policies in the Real World, arXiv:2503.24278, 2025 Atreya, Pertsch, et al., RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies, CoRL 2025 Agia et al., CUPID: Curating Data your Robot Loves with Influence Functions, CoRL 2025 Chen, Kimble, et al., ManipulationNet, arXiv:2603.04363, 2026 Location: San Francisco (Downtown) (tentative) Time: Saturday, June 13, 2026 | 2:00 PM – 5:00 PM Hosts: @junfanzhu98, @aurorafeng_01 Agenda 2:00 PM — Doors open & social 🍓 Unlimited strawberries (official Reading Club fruit!) 2:30 PM — Keynote by @DanielXieee (@QuantingX7410) 4:00 PM — Q&A + open-floor roundtable (10–20 min per topic; spotlight any paper you’d like to highlight) Come ready to discuss what “dexterity” actually means, how to build rigorous and comparable benchmarks, hand-centric taxonomies, robot manipulation evaluation, and the missing reproducibility layer for embodied AI! 
Past sessions brought together researchers & engineers from Boston Dynamics, Google DeepMind, NVIDIA, Stanford, UC Berkeley, Dyna, Physical Intelligence, Tesla, Generalist, Rhoda AI, and leading Bay Area robotics startups. 👉🏻RSVP: luma.com/5w7c1t2a #Robotics #WorldModels #EmbodiedAI #Dexterity #RobotManipulation #SFTech


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 10

🎬 At @saturdayrobotic @CVPR 2026 Research Night, @guocheng_qian (Senior Research Scientist @Snap) presented lightning talk "Diffusion-DRF: a new RL/post-training paradigm for video diffusion models". TL;DR: Scalar rewards are too coarse for video generation. Diffusion-DRF turns VLM explanations into free, rich, dense, differentiable rewards with spatial/semantic credit assignment. Why: RL-style post-training powers LLM reasoning and image generation, but video is harder. A single reward cannot identify which object, motion, frame, physical inconsistency, or visual defect caused failure. GRPO-style methods often reward-hack and collapse within ~300 steps; video RL stability becomes a major bottleneck. Diffusion-DRF remains stable beyond 3K training steps. Core idea: Instead of binary/scalar rewards, use Qwen2.5-VL as a VQA reward engine. 1️⃣ Reference video + caption → structured decomposition: • Environment • Objects • Object locations • Other scene attributes 2️⃣ Generate multi-dimensional questions: • TA (Text Alignment): does the environment/object configuration match the prompt? • Phy (Physics): are objects physically plausible (deformation, motion, interactions)? • VQ (Visual Quality): blur, artifacts, defects? 3️⃣ Qwen2.5-VL answers these questions on the reference video, producing Yes/No + free-form explanations as targets. 4️⃣ Generated video is evaluated by the same VLM. 5️⃣ VQA next-token prediction loss becomes a differentiable reward Key insight: VLM token probabilities and explanations provide dense token-level feedback instead of brittle reward labels. Even more interesting: gradients flow through the VAE decoder and final denoising stages, allowing direct optimization of the video diffusion model. VLMs are not just judges—they become credit-assignment engines. 
Results (V-Bench 2.0): Diffusion-DRF (7B) achieves: • Overall 55.38 • Creativity 64.58 • Common Sense 56.96 • Controllability 27.98 • Human Fidelity 80.51 • Physics 56.85 • Material 75.52 • Dynamic Attribute 42.86 • Motion Rationality 40.23 • Complex Landscape 21.05 • Camera Motion 24.69 Outperforming Flow-DPO (50.27), Flow-GRPO (50.64), VideoAlign (53.55), Vanilla-DRF (53.72), and Wan2.1-3B-T2V (52.99), while maintaining far stronger training stability. Qualitatively: • More realistic human expressions and object interactions (e.g., honey pouring, facial fidelity) • Better object color/location control • More accurate manipulation actions • Stronger physical consistency and scene composition The next step is TC-GRPO (Diffusion-DRF + GRPO). Instead of scalar rewards, VLM gradients provide token-level credit assignment inside Group Relative Policy Optimization loops. On HunyuanVideo-1.5: • More natural handshakes • Better human-object interactions • Stronger lighting realism • Improved motion dynamics • Better road adhesion, composition, and photorealism in driving scenes Big picture: Video world models face the same scaling transition LLMs experienced: pretraining is no longer enough; post-training dominates compute (e.g., Composer 2.5 reportedly spends ~85% of compute on additional training/RL). The challenge is that scalar rewards break down in video. Diffusion-DRF suggests a different path: use VLM explanations + token probabilities as free, dense, differentiable rewards and token-level credit signals. VLM gradients may become for video generation what RLHF/RLAIF became for language models. @CVPRConf #CVPR2026
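Step 5 of the recipe above (the VQA next-token loss becomes a differentiable reward) can be illustrated at the level of a single answer token. The sketch assumes the VLM exposes logits over a two-token answer vocabulary [Yes, No]. The logit values and the `questions` dictionary are hypothetical. In the method as described, gradients continue back through the VLM, the VAE decoder, and the final denoising stages, and none of that is modeled here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one answer position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def vqa_reward(logits: Sequence[float], target_index: int) -> float:
    """Log-probability of the target answer token (negative cross-entropy)."""
    return log_softmax(logits)[target_index]


def reward_gradient(logits: Sequence[float], target_index: int) -> List[float]:
    """d(reward)/d(logits) = one_hot(target) - softmax(logits)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [(1.0 if i == target_index else 0.0) - p for i, p in enumerate(probs)]


# Hypothetical [Yes, No] logits on a generated video; the target answer
# (index 0 = "Yes") is what the VLM said about the reference video.
questions: Dict[str, Tuple[List[float], int]] = {
    "TA": ([2.0, 0.0], 0),    # text alignment
    "Phy": ([0.0, 1.0], 0),   # physics
    "VQ": ([3.0, -1.0], 0),   # visual quality
}

rewards = {name: vqa_reward(logits, target) for name, (logits, target) in questions.items()}
dense_reward = sum(rewards.values()) / len(rewards)
print(rewards, dense_reward)
print(reward_gradient(*questions["Phy"]))
```

Unlike a scalar pass/fail label, each question yields a nonzero gradient whose magnitude tracks how wrong the answer currently is. Here the physics question pulls hardest, a simple instance of the per-dimension credit assignment the talk describes.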



Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 10

🌌 @saturdayrobotic @CVPR Research Night Lightning Talk #5, Zesen Zhao @SourORZ1 (@UMich, Cruise) presented Test-Time Scaling for World Action Models via Zero-Shot Geometric Verification.

💡Core insight: Geometry is still implicit. Depth is the next scaling axis. Geometry should become native, not emergent.

WAMs take {primary cam + wrist cam + language} → {future observations + action chunk a₁:ₕ}. A key failure mode: predicted futures across views are often 3D-inconsistent. When imagined futures contain geometric artifacts, action quality drops.

Key idea: WAMs already expose synchronized multi-view futures. Use them to verify generation quality before execution.

Training-free, rollout-free, plug-and-play verifier:

Stage 1: Action-Future Gate
• Optical flow f (current→future)
• Action-induced motion Δu
• c = cos(f, Δu)
• c ≥ τ_gate → execute greedy α₁
• otherwise sample more candidates

Stage 2: Zero-Shot Geometric Verification
• Candidate set = greedy + (N−1) sampled
• Frozen VGGT depth reprojection (primary→wrist)
• Compute reprojection error
• α* = argmin(error)
• Execute α₁ if Stage-1 passes, else α*

No finetuning. No online robot rollouts. Reads only predicted images + actions. Works on any multi-view WAM (DreamZero, Cosmos Policy, Motus, LingbotVA, etc.).

Larger question: VLA vs WAM?
VLA: π₀ (PaliGemma/SigLIP+Gemma-2B), π₀.6/0.7 (Gemma-3B), InternVLA-A1, QwenVLA...
WAM: DreamZero (Wan2.1-I2V-14B), Cosmos Policy (Cosmos Predict-2), Motus, LingbotVA...
Despite different architectures, both inherit the same limitation: they start from RGB pixel-space backbones and learn geometry, physics, and object dynamics only implicitly.

The missing scaling axis: Depth as a native modality.
Language → abstract, lossy, reasoning-friendly.
RGB → 2D projection, indirect geometry.
Depth → direct geometry: distance, spatial relations, object placement, motion via temporal derivatives.
Current 3D-VLMs, VLAs, and world models still lack robust, generalizable depth representations. Conversations with VGGT researchers suggest current VGGT features can be inconsistent; Zhao noted he has not yet had time to validate VGGT-Ω.

Historical analogy: VLMs evolved from frozen-LLM + projector (InternViT, pixel-unshuffle, dynamic resolution) → native multimodality (InternLM2.5, Qwen2.5/Qwen3.5). Physical AI may require the same transition: depth-native > depth-as-bolt-on.

Summary
• Evaluation remains the bottleneck.
• Geometry remains implicit.
• Test-time geometric verification already improves WAM reliability.
• Depth may be the next fundamental scaling axis for Physical AI.

@CVPRConf #CVPR2026
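The two-stage verifier described in this post reduces to a short selection rule once the heavy components are abstracted away. The sketch below assumes the optical flow and action-induced motion arrive as flattened vectors and that per-candidate reprojection errors have already been computed by some depth model (the VGGT step is not implemented). The function names and the `tau_gate=0.5` default are hypothetical. For simplicity the errors are passed in up front, whereas the post only samples extra candidates when the gate fails.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def select_candidate(flow: Sequence[float], action_motion: Sequence[float],
                     reprojection_errors: Sequence[float],
                     tau_gate: float = 0.5) -> int:
    """Return the index of the candidate to execute; index 0 is the greedy one."""
    # Stage 1: action-future gate. Does the predicted flow agree with the
    # motion the greedy action should induce?
    if cosine(flow, action_motion) >= tau_gate:
        return 0
    # Stage 2: pick the candidate with the lowest cross-view reprojection error.
    return min(range(len(reprojection_errors)), key=reprojection_errors.__getitem__)


errors = [0.9, 0.2, 0.5]  # hypothetical errors for greedy + 2 sampled candidates
print(select_candidate([1.0, 0.0], [1.0, 0.1], errors))   # gate passes
print(select_candidate([1.0, 0.0], [-1.0, 0.0], errors))  # gate fails
```

Because the rule reads only predicted images and actions, it needs no gradient access to the WAM, consistent with the training-free, rollout-free framing in the post.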



Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 9

CVPR 2026: Embodied AI Takeaways @CVPRConf @CVPR

Embodied AI converges along three coupled axes: VLA policies, world models, agentic perception-action loops, linked via hierarchical memory + skill composition.

🤖 Robotics shows scenario-level generalization under distribution shift (novel objects, clutter, lighting variation), incl. unseen household items + long-tail tabletop objects, often without task finetuning.
Common pattern:
• sim-scale pretraining + real adaptation
• language-conditioned manipulation policies
• hierarchical planning + reusable skills
• ManiSkill-style benchmark ecosystems
Trend: compositional policies + simulation-scaled pipelines; cross-embodiment transfer remains open.

👓 Meta Aria = perception-first SLAM engineering
SLAM-first embodied sensing design co-optimizes hardware + algorithms for stability over imaging.
Key priorities:
• online calibration + drift correction
• illumination robustness
• visual-inertial SLAM primary objective
• per-sensor consistency for long-term tracking
Optimized for continuous egocentric state estimation, not photography.

🌍 World models & agentic systems converge conceptually
Shared abstraction: prediction–observation mismatch correction in continuous loops.
Design directions:
• streaming latent state updates
• persistent memory / belief revision
• anomaly-driven representation correction
• tight perception–imagination–action coupling
Shift: discrete I/O → continuous inference + continuous state maintenance.

📈 Scaling axes:
• larger multimodal foundation models
• recursive / iterative refinement loops
• test-time computation scaling (reasoning + planning)
Shift: model size scaling + forward dynamics quality + inference-time adaptation.

🎙 Continuous interaction models
Move beyond turn-taking:
• low-latency streaming speech (Moshi-style)
• overlap-tolerant dialogue
• continuous embodied perception-action loops
Toward full-duplex systems with persistent internal state vs query-response cycles.

🦾 Robot “OS” = hierarchical orchestration
Long-horizon manipulation remains hard under flat policies.
Stack:
• high-level planners (language/symbolic/latent)
• mid-level skill libraries (reusable primitives)
• low-level reactive control
Active perception:
• query environment under uncertainty
• manipulate to reduce ambiguity
• update belief before action

🧭 Synthesis: reactive policies → agentic systems with persistent world models
Integration:
• world models + VLA
• active perception + uncertainty-aware control
• simulation scaling + real adaptation
• continuous interaction + streaming inference

🧩Summary: Embodied AI is moving toward systems that continuously perceive, maintain internal state, and iteratively refine predictions via environment interaction.
Open problem: unifying perception, memory, planning, control into stable long-horizon agent loops.

#CVPR2026 #EmbodiedAI #WorldModels #Robotics #VLA #AgenticAI

Denver, CO 🇺🇸

Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 8

🌌 At @saturdayrobotic @CVPR Saturday Robotics Research Night, we hosted @JieWang_ZJUI (@Penn @GRASPlab) for a lightning talk on “Toward a Robotics MMLU: Lessons from Sim & Real Evaluations of Generalist Policies”.
Core claim: Robot policies are transitioning into foundation models, but our evaluation methodology is not. The LLM community iterates on benchmarks like MMLU that decompose capability into reproducible, comparable axes. Robotics has no equivalent. Current practice consistently misses the failure modes that matter most for generalist deployment, and surfaces structural issues with current real-world evaluation.
LLMs got “better” via MMLU-style decomposable, reproducible benchmarks; robotics has ~1000 benchmarks but ~0 trusted global axis → leaderboards (RoboLab / RoboArena / MolmoSpaces) disagree → same policy ranks #1–mid depending on eval → “score ≠ capability; could be policy-fit.”
Case: “Evaluating π₀ in the Wild” @GRASPlab → key: tiny distribution shifts (camera nudge / lighting change) collapse SOTA generalist policies → robustness + eval failure, not just a data-scale issue. “Red tasks” cluster at bottom = manipulation + human-interaction brittleness.
Constraint: “You can’t do robotics without robotics” → standard hardware prerequisite (humanoid precedent) needed for manipulation too; embodiment mismatch currently confounds comparisons.
Cost asymmetry: real-robot eval ~100h / checkpoint vs ~1h sim → sim needed for iteration, but sim2real gap too large → benchmark must explicitly amortize real-world eval cost.
Proposed “Robotics MMLU” architecture:
1. Distributed evaluator network: labs/robots run local A/B policy comparisons; feed into arenas (RoboArena etc.); global ranking via pairwise preference aggregation
2. Credit + provenance layer: trace eval conditions/hardware/tasks → prevent leaderboard gaming/hacking
3. Scalable decentralized participation (data point: frodo bots 2038 evals; Berkeley 495; UT Austin 390; NVIDIA 219; Stanford 134; UPenn 115; UMontreal 101; Yonsei 100; etc.)
Eval as first-class research:
1. decomposable capability axes (diagnose why a policy wins)
2. reproducible + comparable across sites
3. generalization-first (avoid arena overfitting)
Benchmark Atlas Index (everloom-129.github.io/SimBench): existence proof → 22 benchmarks, 6 fields, 20+ simulators, sim-to-real unified map.
💡 LLMs got their coordinate system in 2021. Robotics is overdue. Whoever builds it sets the next 5-year trajectory.
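One way to picture "global ranking via pairwise preference aggregation" is a Bradley-Terry fit over pooled A/B outcomes from many sites. The sketch below is an assumption about how such aggregation could work, not the method any of the named arenas actually uses; the policy names and comparison results are invented.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the MM update.

    Assumes every policy has at least one win; otherwise its strength
    collapses to zero and a smoothing prior would be needed.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    policies = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        policies.update((winner, loser))

    strength = {p: 1.0 for p in policies}
    for _ in range(iters):
        updated = {}
        for p in policies:
            denom = sum(
                pair_counts[frozenset((p, q))] / (strength[p] + strength[q])
                for q in policies
                if q != p and frozenset((p, q)) in pair_counts
            )
            updated[p] = wins[p] / denom if denom else strength[p]
        # Rescale so the mean strength stays at 1 (the model is scale-invariant).
        total = sum(updated.values())
        strength = {p: v * len(policies) / total for p, v in updated.items()}
    return strength

# Hypothetical local A/B results reported by three evaluation sites.
site_a = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
site_b = [("B", "C"), ("B", "C"), ("C", "B"), ("A", "C")]
site_c = [("A", "B"), ("C", "A")]

strength = bradley_terry(site_a + site_b + site_c)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

Because each site only contributes relative outcomes, no site needs to run every policy; the fit stitches partial comparisons into one ordering, which is why provenance of each comparison matters for trusting the result.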

🌌 At @saturdayrobotic Saturday Robotics Research Night @CVPR, we hosted Xiaofan Li (World Model Tech Lead @XSquareRobot) for a lightning talk on WALL-WM.
TLDR: From next chunk prediction → next event prediction. WALL-WM introduces a new training + inference workflow for world modeling, shifting from rigid frame chunks to semantic event signals. It explores tighter integration of agent intelligence and WAM for improved real-world dynamic perception and prediction. x2robot.com/api/files/file…
WALL-WM is a World Action Model (WAM) built on event-level VLA pretraining.
Existing WAMs typically:
• initialize from multimodal/video foundation models
• directly train & infer fixed-length action chunks conditioned on observation + instruction
Problem: text, vision, and action lie on different manifolds and temporal scales → direct joint optimization can distort pretrained representations.
💡 Core idea: Event as atomic unit
WALL-WM replaces fixed frame-chunk modeling with semantic event modeling.
Well-posed event: (c_event, O) → v_event
Event enforces:
• semantic alignment (language ↔ event meaning)
• temporal alignment (vision / action / tactile consistency)
→ “Carve nature at its joints” (Plato, Phaedrus 265e)
🧠 Architecture
• Historical Observations + Executions buffer
• Multi-View Video DiT → video latents (world dynamics)
• Action Transformer → state-action modeling
• Unified Event World Modeling block couples video + action pathways
Language stack:
• Qwen3.5 + Staircase Decoder in unified embedding space
⚙️ Two inference modes (same event-pretrained backbone)
1. Language-Guided Reasoning (Event Mode)
• consumes next-event descriptions
• produces variable-length execution chunks
• includes explicit temporal event tokens (e.g., “pick…1.6s → fallback → pick…2.4s”)
• ON/OFF switch separates reasoning from execution
→ semantic event rollout
2. Event World Modeling
• Video DiT + Action Transformer
• purely event-centric rollout of dynamics
• no fixed-length chunk assumption in modeling
→ temporal event rollout
🔁 Key decomposition
Semantic path: Language → Event
Temporal path: Vision/Action → Event
Unified abstraction stack: Pixel → Patch → Frame → Event
🧩 Training philosophy (anti–Bitter Lesson framing)
Shift annotation cost → training cost via self-supervised event structure learning
End-to-end target pipeline: Reasoning/Grounding → Perception → Future Video → 3D Representation → Action
Core principle: “The more we do (preprocessing + structure), the less the model has to infer.”
Includes:
• normalization
• spectrograms
• voxelization
• tokenization
⚠️ Video-only pretraining critique
Failure modes:
• strong latent distribution assumptions (e.g., SIGReg-style constraints)
• semantic rediscovery cost in vision-action alignment
• weak coupling between language semantics and temporal execution
Examples: V-JEPA, LeWorldModel-style approaches
Fix: Language acts as semantic tagging over VA event clusters, not as a temporal supervision signal.
🧬 Representation hierarchy
Raw physical signals: vision / audio / action / biological signals
↓ (signal processing + mathematical abstraction)
structured modalities
↓
Event layer (highest alignment primitive)
📌 Conclusion
WALL-WM is not a chunk-level improvement. It replaces fixed temporal chunking with event-level alignment as the fundamental unit of world modeling. Where prior WAMs learn “what action follows this frame window”, WALL-WM learns “what event is unfolding in the world”. WALL-WM defines the event-based representation primitive for future world models and embodied agents.
@CVPRConf #CVPR2026 #WorldModel
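To make the contrast between fixed-length chunks and variable-length event chunks concrete, here is a toy comparison. It is only an illustration of the segmentation idea: the event labels are hand-assigned here, whereas the talk describes event structure as learned, and none of these names come from WALL-WM itself.

```python
from itertools import groupby

def fixed_chunks(steps, size):
    """Split a trajectory into chunks of a fixed number of steps."""
    return [steps[i:i + size] for i in range(0, len(steps), size)]

def event_chunks(steps):
    """Group consecutive steps that share an event label (variable length)."""
    return [(label, list(group)) for label, group in groupby(steps, key=lambda s: s[0])]

# Hypothetical trajectory: (event label, placeholder action value) per timestep.
trajectory = [
    ("reach", 0.1), ("reach", 0.2), ("reach", 0.3),
    ("grasp", 0.9),
    ("lift", 0.4), ("lift", 0.5),
]

fixed = fixed_chunks(trajectory, size=4)
events = event_chunks(trajectory)

# Count how many distinct events end up mixed inside each fixed chunk.
labels_per_fixed_chunk = [len({label for label, _ in chunk}) for chunk in fixed]
print(labels_per_fixed_chunk)
print([(label, len(group)) for label, group in events])
```

With a fixed window of four steps, the first chunk straddles the reach and grasp phases, while the event grouping keeps each phase intact at whatever length it happens to have.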


Saturday Robotics retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR @junfanzhu98 · Jun 8

🌌 At @saturdayrobotic Saturday Robotics Research Night @CVPR, we hosted @mli0603 Zhaoshuo Li (Robotics & World Model Tech Lead @NVIDIAAI Cosmos) for a lightning talk on Cosmos 3.
Cosmos 3 is a unified omnimodal world model built on a Mixture-of-Transformers (MoT) backbone with parallel Autoregressive + Diffusion pathways connected via cross-attention. One model jointly understands & generates Language, Image, Video, Audio, and Action with flexible I/O.
It effectively subsumes:
👁️ VLMs
🎥 Video Generators
🔊 Audio Generators
🌍 World Simulators
🤖 World-Action Models
🎮 Robot Policy Models
Single backbone supports:
• Vision Reasoning
• Image Generation
• Audio-Visual Generation
• Robot Policy Control
• Forward Dynamics
• Inverse Dynamics
Vision reasoning grounds language in spatial relations, temporal evolution, object states, and actions.
Forward Dynamics: (obs + controls) → future video rollouts for planning, evaluation, and synthetic data generation.
Inverse Dynamics: (video) → trajectories/actions explaining observed state transitions.
🍿 Popcorn demo:
0.3–3.4s pick cup
3.4–14.8s stabilize cup → insert scoop → scoop twice → transfer popcorn while maintaining alignment
14.8–18.7s place cup → return scoop → retract arms
Not frame captioning: the model temporally segments manipulation into physically meaningful subgoals.
Forward Dynamics demo: camera observation (blue point-cloud-like representation) + hand pose (green skeletal hands) → physically plausible future interaction rollouts respecting object dynamics.
Inverse Dynamics demo: robot manipulation video → articulated 3D trajectories recovered from observed pixel changes.
🔥 Most impressive: Cosmos 3 Omni Block.
Prompt: “pick the Cosmos 3 Omni block from bottom drawer and place it on counter”
The model first performs explicit spatial grounding:
gripper(514,769) block(471,780) drawer(400,760) counter(460,310)
while identifying distractors: forklift, white truck, white SUV, quadruped robot, Physical AI Builder figure.
It then generates structured reasoning + pixel-space action outputs:
[514,769] approach block
[507,783] grasp block
[500,471] lift from drawer
[464,278] move to counter
[460,275] place on counter
A second, far more cluttered scene containing multiple robot arms, excavators, vehicles, and the same drawer receives the identical prompt and produces analogous trajectories after grounding relevant objects and free-space regions.
Cosmos 3 positions omnimodal world models as a scalable foundation for embodied agents, jointly performing understanding, generation, simulation, reasoning, and control inside a single architecture. It achieves SoTA across diverse understanding & generation benchmarks, and NVIDIA is releasing the full stack: code, checkpoints, curated synthetic datasets, and evaluation benchmarks.
Cosmos 3 = a unified world-action engine.
@CVPRConf #CVPR2026 #Cosmos3 #WorldModel
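The pixel-space action outputs quoted above are easy to turn into a structured waypoint list. The parser below is a sketch that assumes the text layout exactly as transcribed in this post ("[x,y] label"); it is not a documented Cosmos 3 output schema, and `parse_waypoints` is a hypothetical helper name.

```python
import re

# Matches "[x,y] some label" and stops the label at the next "[".
WAYPOINT = re.compile(r"\[(\d+),\s*(\d+)\]\s*([^\[]+)")

def parse_waypoints(text):
    """Return (x, y, label) tuples in the order they appear in the text."""
    return [(int(x), int(y), label.strip()) for x, y, label in WAYPOINT.findall(text)]

# Action output as transcribed in the post.
model_output = (
    "[514,769] approach block [507,783] grasp block [500,471] lift from drawer "
    "[464,278] move to counter [460,275] place on counter"
)

waypoints = parse_waypoints(model_output)
for x, y, label in waypoints:
    print(f"({x:>3}, {y:>3})  {label}")
```

A downstream controller would still need to map these image coordinates into the robot's frame, which this sketch does not attempt.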

🍷 @saturdayrobotic CVPR Research Night After Party: we’re extending the @NVIDIAAI #Cosmos3 discussion on building always-on, self-evolving world models.
A Cosmos3-style core system sketch for always-on, self-evolving world models:
• self-monologue + always-on perceiver
• persistent latent state (vs stateless prompting) maintained as evolving memory
• self-reflective “dreaming” loop where the model iteratively replays/perturbs internal rollouts and identifies world-model misalignment in latent space
• external VLM/vision perceiver continuously audits reality stream and corrects latent state drift
• lifecycle management of memory via automatic pruning of stale/useless latent representations
• test-time optimization as a first-class mechanism for self-evolving world models (inference-time learning, not just inference)
Key hypothesis: the bottleneck is no longer primarily model scale, but the design of forward dynamics and the inference-time system.
Inference scaling appears along 3 orthogonal axes:
(1) vertical scaling → larger multimodal models
(2) recursive model folding/compression → recursive/self-referential LM or compressed latent recursion
(3) horizontal scaling → test-time reasoning + compute scaling via iterative rollout, self-consistency, search, and continuous refinement loops
Data becomes the dominant constraint: egocentric trajectories + YouTube-scale dynamic multi-view video + dense action-conditional interaction logs; hypothesis: ~50× more high-quality action data (vs Cosmos3 baseline regime) may unlock a qualitatively different generalization phase.
Cross-embodiment transfer (cf. @XPENG_Global Fe₀-style setups) → broader state-space coverage → stronger policy/world-model generalization; data augmentation becomes policy augmentation.
System-level requirement: built-in recommendation/diversification engine; without diversity, exploration collapses, latent space narrows, and self-improvement degenerates into self-reinforcement / mode collapse.
Core learning signal may shift: not just prediction or reconstruction, but anomaly detection / mismatch signal as primary driver. World-model intelligence emerges from continuous prediction–reality divergence pressure.
Architectural direction: inherently streaming world models where perception → prediction → mismatch detection → latent update is a continuous loop; imagined futures and observed reality co-exist in real time; alignment/misalignment continuously measured; latent planning is always-on; optimal intervention timing is learned online; future world states are generated and evaluated continuously against live streams.
This implies full-duplex agents (vs today’s mostly half-duplex GPT systems: input → think → output). Future loop: perceive ↔ imagine ↔ predict ↔ act ↔ revise continuously, with persistent shared state across modalities and time.
Connection to streaming modalities: voice AI already absorbs VAD into foundation models; speech systems are becoming streaming predictors rather than turn-based generators; world models likely follow the same trajectory.
System implication: multiple specialized models running concurrently (fast reactive perceiver, slow simulator, memory/retrieval, critic/verifier) may outperform a single monolithic “elegant” model.
Cosmos3 trajectory framing: Cosmos3 → modality unification; Cosmos3+ → full-duplex world modeling; Cosmos3++ → fully streaming world models with continuous evaluation of imagined futures vs live reality streams.
Broader resonance: @thinkymachines interaction-centric paradigm + Moshi-style overlapping conversational streams → endpoint may not be a chatbot, but an always-on interaction system whose primary function is the continuous detection of divergence between imagined futures and live reality streams, thereby driving persistent self-update of the world model.
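The perception → prediction → mismatch detection → latent update loop can be shown in miniature with a scalar state. This is a deliberately trivial sketch under loud assumptions: the "world model" just predicts that the last state persists, and the gain and surprise threshold are arbitrary illustrative values, not anything proposed in the discussion.

```python
def stream_step(latent, observation, gain=0.5, threshold=1.0):
    """One tick of a streaming loop: predict, compare with reality, update."""
    predicted = latent                      # toy forward model: state persists
    mismatch = observation - predicted      # prediction vs reality divergence
    surprised = abs(mismatch) > threshold   # flag large divergence as an anomaly
    latent = latent + gain * mismatch       # pull the latent toward what was observed
    return latent, mismatch, surprised

latent = 0.0
surprise_flags = []
# Hypothetical observation stream with an abrupt change at the third tick.
for obs in [0.1, 0.2, 3.0, 3.1]:
    latent, mismatch, surprised = stream_step(latent, obs)
    surprise_flags.append(surprised)

print(surprise_flags)
print(round(latent, 5))
```

The mismatch signal does double duty here: it both corrects the persistent state and marks the moments where the imagined future diverged from the live stream, which is the role the post assigns to it.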
