
Saturday Robotics
162 posts

Saturday Robotics
@saturdayrobotic
🤖 Saturday Reading Club on Robotics & World Models for AI Researchers in SF Hosts: @junfanzhu98 @aurorafeng_01 @neuralmotion https://t.co/fLSL01KO9e


















🎬 At @saturdayrobotic @CVPR 2026 Research Night, @guocheng_qian (Senior Research Scientist @Snap) presented lightning talk "Diffusion-DRF: a new RL/post-training paradigm for video diffusion models". TL;DR: Scalar rewards are too coarse for video generation. Diffusion-DRF turns VLM explanations into free, rich, dense, differentiable rewards with spatial/semantic credit assignment. Why: RL-style post-training powers LLM reasoning and image generation, but video is harder. A single reward cannot identify which object, motion, frame, physical inconsistency, or visual defect caused failure. GRPO-style methods often reward-hack and collapse within ~300 steps; video RL stability becomes a major bottleneck. Diffusion-DRF remains stable beyond 3K training steps. Core idea: Instead of binary/scalar rewards, use Qwen2.5-VL as a VQA reward engine. 1️⃣ Reference video + caption → structured decomposition: • Environment • Objects • Object locations • Other scene attributes 2️⃣ Generate multi-dimensional questions: • TA (Text Alignment): does the environment/object configuration match the prompt? • Phy (Physics): are objects physically plausible (deformation, motion, interactions)? • VQ (Visual Quality): blur, artifacts, defects? 3️⃣ Qwen2.5-VL answers these questions on the reference video, producing Yes/No + free-form explanations as targets. 4️⃣ Generated video is evaluated by the same VLM. 5️⃣ VQA next-token prediction loss becomes a differentiable reward Key insight: VLM token probabilities and explanations provide dense token-level feedback instead of brittle reward labels. Even more interesting: gradients flow through the VAE decoder and final denoising stages, allowing direct optimization of the video diffusion model. VLMs are not just judges—they become credit-assignment engines. Results (V-Bench 2.0): Diffusion-DRF (7B) achieves: • Overall 55.38 • Creativity 64.58 • Common Sense 56.96 • Controllability 27.98 • Human Fidelity 80.51 • Physics 56.85 • Material 75.52 • Dynamic Attribute 42.86 • Motion Rationality 40.23 • Complex Landscape 21.05 • Camera Motion 24.69 Outperforming Flow-DPO (50.27), Flow-GRPO (50.64), VideoAlign (53.55), Vanilla-DRF (53.72), and Wan2.1-3B-T2V (52.99), while maintaining far stronger training stability. Qualitatively: • More realistic human expressions and object interactions (e.g., honey pouring, facial fidelity) • Better object color/location control • More accurate manipulation actions • Stronger physical consistency and scene composition The next step is TC-GRPO (Diffusion-DRF + GRPO). Instead of scalar rewards, VLM gradients provide token-level credit assignment inside Group Relative Policy Optimization loops. On HunyuanVideo-1.5: • More natural handshakes • Better human-object interactions • Stronger lighting realism • Improved motion dynamics • Better road adhesion, composition, and photorealism in driving scenes Big picture: Video world models face the same scaling transition LLMs experienced: pretraining is no longer enough; post-training dominates compute (e.g., Composer 2.5 reportedly spends ~85% of compute on additional training/RL). The challenge is that scalar rewards break down in video. Diffusion-DRF suggests a different path: use VLM explanations + token probabilities as free, dense, differentiable rewards and token-level credit signals. VLM gradients may become for video generation what RLHF/RLAIF became for language models. @CVPRConf #CVPR2026
















CVPR 2026 — Embodied AI Takeaways @CVPRConf @CVPR Embodied AI converges along three coupled axes: VLA policies, world models, agentic perception-action loops, linked via hierarchical memory + skill composition. 🤖 Robotics shows scenario-level generalization under distribution shift (novel objects, clutter, lighting variation), incl. unseen household items + long-tail tabletop objects, often without task finetuning. Common pattern: sim-scale pretraining + real adaptation language-conditioned manipulation policies hierarchical planning + reusable skills ManiSkill-style benchmark ecosystems Trend: compositional policies + simulation-scaled pipelines; cross-embodiment transfer remains open. 👓 Meta Aria = perception-first SLAM engineering SLAM-first embodied sensing design co-optimizes hardware + algorithms for stability over imaging. Key priorities: online calibration + drift correction illumination robustness visual-inertial SLAM primary objective per-sensor consistency for long-term tracking Optimized for continuous egocentric state estimation, not photography. 🌍 World models & agentic systems converge conceptually Shared abstraction: prediction–observation mismatch correction in continuous loops. Design directions: streaming latent state updates persistent memory / belief revision anomaly-driven representation correction tight perception–imagination–action coupling Shift: discrete I/O → continuous inference + continuous state maintenance. 📈 Scaling axes: larger multimodal foundation models recursive / iterative refinement loops test-time computation scaling (reasoning + planning) Shift: model size scaling + forward dynamics quality + inference-time adaptation. 🎙 Continuous interaction models Move beyond turn-taking: low-latency streaming speech (Moshi-style) overlap-tolerant dialogue continuous embodied perception-action loops Toward full-duplex systems with persistent internal state vs query-response cycles. 🦾 Robot “OS” = hierarchical orchestration Long-horizon manipulation remains hard under flat policies. Stack: high-level planners (language/symbolic/latent) mid-level skill libraries (reusable primitives) low-level reactive control Active perception: query environment under uncertainty manipulate to reduce ambiguity update belief before action 🧭 Synthesis: reactive policies → agentic systems with persistent world models Integration: world models + VLA active perception + uncertainty-aware control simulation scaling + real adaptation continuous interaction + streaming inference 🧩Summary: Embodied AI is moving toward systems that continuously perceive, maintain internal state, and iteratively refine predictions via environment interaction. Open problem: unifying perception, memory, planning, control into stable long-horizon agent loops. #CVPR2026 #EmbodiedAI #WorldModels #Robotics #VLA #AgenticAI











🌌 @saturdayrobotic @CVPR Research Night Lightning Talk #5, Zesen Zhao @SourORZ1 (@UMich, Cruise) presented Test-Time Scaling for World Action Models via Zero-Shot Geometric Verification. 💡Core insight: Geometry is still implicit. Depth is the next scaling axis. Geometry should become native, not emergent. WAMs take {primary cam + wrist cam + language} → {future observations + action chunk a₁:ₕ}. A key failure mode: predicted futures across views are often 3D-inconsistent. When imagined futures contain geometric artifacts, action quality drops. Key idea: WAMs already expose synchronized multi-view futures. Use them to verify generation quality before execution. Training-free, rollout-free, plug-and-play verifier: Stage 1 — Action-Future Gate • Optical flow f (current→future) • Action-induced motion Δu • c = cos(f,Δu) • c ≥ τ_gate → execute greedy α₁ • otherwise sample more candidates Stage 2 — Zero-Shot Geometric Verification • Candidate set = greedy + (N−1) sampled • Frozen VGGT depth reprojection (primary→wrist) • Compute reprojection error • α* = argmin(error) • Execute α₁ if Stage-1 passes, else α* No finetuning. No online robot rollouts. Reads only predicted images + actions. Works on any multi-view WAM (DreamZero, Cosmos Policy, Motus, LingbotVA, etc.). Larger question: VLA vs WAM? VLA: π₀ (PaliGemma/SigLIP+Gemma-2B), π₀.6/0.7 (Gemma-3B), InternVLA-A1, QwenVLA... WAM: DreamZero (Wan2.1-12V-14B), Cosmos Policy (Cosmos Predict-2), Motus, LingbotVA... Despite different architectures, both inherit the same limitation: they start from RGB pixel-space backbones and learn geometry, physics, and object dynamics only implicitly. The missing scaling axis: Depth as a native modality. Language → abstract, lossy, reasoning-friendly. RGB → 2D projection, indirect geometry. Depth → direct geometry: distance, spatial relations, object placement, motion via temporal derivatives. Current 3D-VLMs, VLAs, and world models still lack robust, generalizable depth representations. Conversations with VGGT researchers suggest current VGGT features can be inconsistent; Zhao noted he has not yet had time to validate VGGT-Ω. Historical analogy: VLMs evolved from frozen-LLM + projector (InternViT, pixel-unshuffle, dynamic resolution) → native multimodality (InternLM2.5, Qwen2.5/Qwen3.5). Physical AI may require the same transition: depth-native > depth-as-bolt-on. Summary • Evaluation remains the bottleneck. • Geometry remains implicit. • Test-time geometric verification already improves WAM reliability. • Depth may be the next fundamental scaling axis for Physical AI. @CVPRConf #CVPR2026





🌌 At @saturdayrobotic @CVPR Saturday Robotics Research Night — we hosted @JieWang_ZJUI (@Penn @GRASPlab) for a lightning talk on “Toward a Robotics MMLU: Lessons from Sim & Real Evaluations of Generalist Policies”. Core claim: Robot policies are transitioning into foundation models, but our evaluation methodology is not. The LLM community iterates on benchmarks like MMLU that decompose capability into reproducible, comparable axes. Robotics has no equivalent. Current practice consistently misses the failure modes that matter most for generalist deployment, and surfaces structural issues with current real-world evaluation. LLMs got “better” via MMLU-style decomposable, reproducible benchmarks; robotics has ~1000 benchmarks but ~0 trusted global axis → leaderboards (RoboLab / RoboArena / MolmoSpaces) disagree → same policy ranks #1–mid depending on eval → “score ≠ capability; could be policy-fit.” Case: “Evaluating π₀ in the Wild” @GRASPlab → key: tiny distribution shifts (camera nudge / lighting change) collapse SOTA generalist policies → robustness+eval failure, not just data scale issue. “red tasks” cluster at bottom = manipulation + human-interaction brittleness. Constraint: “You can’t do robotics without robotics” → standard hardware prerequisite (humanoid precedent) needed for manipulation too; embodiment mismatch currently confounds comparisons. Cost asymmetry: real-robot eval ~100h / checkpoint vs ~1h sim → sim needed for iteration, but sim2real gap too large → benchmark must explicitly amortize real-world eval cost. Proposed “Robotics MMLU” architecture: 1. Distributed evaluator network: labs/robots run local A/B policy comparisons; feed into arenas (RoboArena etc.); global ranking via pairwise preference aggregation 2. Credit + provenance layer: trace eval conditions/hardware/tasks → prevent leaderboard gaming/hacking 3. scalable decentralized participation (data point: frodo bots 2038 evals; Berkeley 495; UT Austin 390; NVIDIA 219; Stanford 134; UPenn 115; UMontreal 101; Yonsei 100; etc.) Eval as first-class research: 1. decomposable capability axes (diagnose why wins) 2. reproducible + comparable across sites 3. generalization-first (avoid arena overfitting) Benchmark Atlas Index (everloom-129.github.io/SimBench): existence proof → 22 benchmarks, 6 fields, 20+ simulators, sim-to-real unified map. 💡LLMs got their coordinate system in 2021. Robotics is overdue. Whoever builds it sets the next 5-year trajectory.










🌌 At @saturdayrobotic Saturday Robotics Research Night @CVPR, we hosted Xiaofan Li (World Model Tech Lead @XSquareRobot) for a lightning talk on WALL-WM. TLDR: From next chunk prediction → next event prediction. WALL-WM introduces a new training + inference workflow for world modeling, shifting from rigid frame chunks to semantic event signals. It explores tighter integration of agent intelligence and WAM for improved real-world dynamic perception and prediction. x2robot.com/api/files/file… WALL-WM is a World Action Model (WAM) built on event-level VLA pretraining. Existing WAMs typically: • initialize from multimodal/video foundation models • directly train & infer fixed-length action chunks conditioned on observation + instruction Problem: text, vision, and action lie on different manifolds and temporal scales → direct joint optimization can distort pretrained representations. 💡 Core idea: Event as atomic unit WALL-WM replaces fixed frame-chunk modeling with semantic event modeling. Well-posed event: (c_event, O) → v_event Event enforces: • semantic alignment (language ↔ event meaning) • temporal alignment (vision / action / tactile consistency) → “Carve nature at its joints” (Plato, Phaedrus 265e) 🧠 Architecture • Historical Observations + Executions buffer • Multi-View Video DiT → video latents (world dynamics) • Action Transformer → state-action modeling • Unified Event World Modeling block couples video + action pathways Language stack: • Qwen3.5 + Staircase Decoder in unified embedding space ⚙️ Two inference modes (same event-pretrained backbone) 1. Language-Guided Reasoning (Event Mode) • consumes next-event descriptions • produces variable-length execution chunks • includes explicit temporal event tokens (e.g., “pick…1.6s → fallback → pick…2.4s”) • ON/OFF switch separates reasoning from execution → semantic event rollout 2. Event World Modeling • Video DiT + Action Transformer • purely event-centric rollout of dynamics • no fixed-length chunk assumption in modeling → temporal event rollout 🔁 Key decomposition Semantic path: Language → Event Temporal path: Vision/Action → Event Unified abstraction stack: Pixel → Patch → Frame → Event 🧩 Training philosophy (anti–Bitter Lesson framing) Shift annotation cost → training cost via self-supervised event structure learning End-to-end target pipeline: Reasoning/Grounding → Perception → Future Video → 3D Representation → Action Core principle: “The more we do (preprocessing + structure), the less the model has to infer.” Includes: • normalization • spectrograms • voxelization • tokenization ⚠️ video-only pretraining critique Failure modes: • strong latent distribution assumptions (e.g., SIGReg-style constraints) • semantic rediscovery cost in vision-action alignment • weak coupling between language semantics and temporal execution Examples: VJPEA, LeWorldModel-style approaches Fix: Language acts as semantic tagging over VA event clusters, not temporal supervision signal. 🧬 Representation hierarchy Raw physical signals: vision / audio / action / biological signals ↓ (signal processing + mathematical abstraction) structured modalities ↓ Event layer (highest alignment primitive) 📌 Conclusion WALL-WM is not a chunk-level improvement. It replaces fixed temporal chunking with event-level alignment as the fundamental unit of world modeling. Where prior WAMs learn “what action follows this frame window”, WALL-WM learns “what event is unfolding in the world”. WALL-WM defines the event-based representation primitive for future world models and embodied agents. @CVPRConf #CVPR2026 #WorldModel





🍷 @saturdayrobotic CVPR Research Night After Party — we’re extending the @NVIDIAAI #Cosmos3 discussion on building always-on, self-evolving world models. A Cosmos3-style core system sketch for always-on, self-evolving world models: Self-monologue + always-on perceiver; persistent latent state (vs stateless prompting) maintained as evolving memory; self-reflective “dreaming” loop where the model iteratively replays/perturbs internal rollouts and identifies world-model misalignment in latent space; external VLM/vision perceiver continuously audits reality stream and corrects latent state drift; lifecycle management of memory via automatic pruning of stale/useless latent representations; test-time optimization as a first-class mechanism for self-evolving world models (inference-time learning, not just inference). Key hypothesis: the bottleneck is no longer primarily model scale, but the design of forward dynamics and the inference-time system. inference scaling appears 3 orthogonal axes: (1) vertical scaling → larger multimodal models; (2) recursive model folding/compression → recursive/self-referential LM or compressed latent recursion; (3) horizontal scaling → test-time reasoning + compute scaling via iterative rollout, self-consistency, search, and continuous refinement loops. data becomes the dominant constraint: egocentric trajectories + YouTube-scale dynamic multi-view video + dense action-conditional interaction logs; hypothesis: ~50× more high-quality action data (vs Cosmos3 baseline regime) may unlock a qualitatively different generalization phase. cross-embodiment transfer (cf. @XPENG_Global Fe₀-style setups) → broader state-space coverage → stronger policy/world-model generalization; data augmentation becomes policy augmentation. system-level requirement: built-in recommendation/diversification engine; without diversity, exploration collapses, latent space narrows, and self-improvement degenerates into self-reinforcement / mode collapse. core learning signal may shift: not just prediction or reconstruction, but anomaly detection / mismatch signal as primary driver—world-model intelligence emerges from continuous prediction–reality divergence pressure. architectural direction: inherently streaming world models where perception → prediction → mismatch detection → latent update is a continuous loop; imagined futures and observed reality co-exist in real time; alignment/misalignment continuously measured; latent planning is always-on; optimal intervention timing is learned online; future world states are generated and evaluated continuously against live streams. this implies full-duplex agents (vs today’s mostly half-duplex GPT systems: input → think → output). future loop: perceive ↔ imagine ↔ predict ↔ act ↔ revise continuously, with persistent shared state across modalities and time. connection to streaming modalities: voice AI already absorbs VAD into foundation models; speech systems are becoming streaming predictors rather than turn-based generators; world models likely follow same trajectory. system implication: multiple specialized models running concurrently (fast reactive perceiver, slow simulator, memory/retrieval, critic/verifier) may outperform single monolithic “elegant” model. Cosmos3 trajectory framing: Cosmos3 → modality unification; Cosmos3+ → full-duplex world modeling; Cosmos3++ → fully streaming world models with continuous evaluation of imagined futures vs live reality streams. broader resonance: @thinkymachines interaction-centric paradigm + Moshi-style overlapping conversational streams → endpoint may not be a chatbot, but an always-on interaction system whose primary function is the continuous detection of divergence between imagined futures and live reality streams — thereby driving persistent self-update of the world model.