Utkarsh Mishra

101 posts


@utkarshm0410

Intern@Amazon FAR, Robotics PhD Student @GeorgiaTech || TRI Summer’24 || Robot Learning || IITR'21 || 🎸🎢🤖|| He/Him || views are mine.

Atlanta, GA · Joined July 2015
681 Following · 422 Followers
Pinned Tweet
Utkarsh Mishra
Utkarsh Mishra@utkarshm0410·
Our paper "Compositional Diffusion with Guided Search (CDGS)" is an Oral at #ICLR2026! Short-horizon Foundation Models + Compositional Generative Planning + Inference-time Search = CDGS for goal-conditioned long-horizon planning! More details: cdgsearch.github.io 🧵 below
2 replies · 25 reposts · 188 likes · 29K views
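The recipe in the tweet (short-horizon generative models composed into a long-horizon plan, with search spent at inference time) can be illustrated with a toy 1-D sketch. Everything below is invented for illustration: `propose_segment` stands in for a short-horizon generative model, and the greedy best-of-N selection stands in for guided search; this is not the actual CDGS algorithm.

```python
import random

def propose_segment(state, horizon=5):
    """Toy stand-in for a short-horizon generative model: proposes a
    short segment of states drifting from `state` in a random direction."""
    step = random.uniform(-1.0, 1.0)
    return [state + step * (i + 1) for i in range(horizon)]

def plan_to_goal(start, goal, n_segments=6, n_candidates=32, seed=0):
    """Greedy inference-time search: for each segment, sample several
    short-horizon proposals and keep the one ending closest to the goal."""
    random.seed(seed)
    plan, state = [start], start
    for _ in range(n_segments):
        candidates = [propose_segment(state) for _ in range(n_candidates)]
        best = min(candidates, key=lambda seg: abs(seg[-1] - goal))
        plan.extend(best)
        state = best[-1]
    return plan

plan = plan_to_goal(start=0.0, goal=10.0)
print(round(plan[-1], 2))  # ends close to the goal
```

No single segment can reach the goal, but chaining searched segments does: the long-horizon behavior comes from composition plus search, not from a long-horizon model.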
Utkarsh Mishra retweeted
Junfan Zhu 朱俊帆 ✈️ CVPR
🤖 Key takeaway from @danfei_xu & @TairanHe99 on #WhynotTV Podcast 5: Human Data, Robotics Interface, UMI, Teleop, EgoMimic, Structure v. Bitter Lesson

Robotics fails when interfaces across 5 coupled axes break: structure ↔ search, data ↔ policy, human ↔ embodiment, perception ↔ action, intent ↔ control. Scaling models alone won’t fix misaligned interfaces.

🧩 Structure vs. Bitter Lesson
Structure (e.g., TAMP) gives compositional generalization but creates ceilings. The real issue isn’t structure vs. scaling—it’s which inductive bias constrains search.
→ Shift: Generative TAMP turns planning into sampling (video models over trajectories), aligning more with scaling laws.

🧠 BC Is Not a Baseline — It’s the Regime
GAIL assumes interaction is cheap. Reality: interaction is scarce, demos are abundant.
→ RL advantage collapses.
→ Behavior Cloning (BC) wins in practice (e.g., spatial softmax + RNN for long context).
Inflection: DAgger → ALOHA (2023).
Hot take: RL may not be necessary for many robotics problems.

🧍‍♂️ Human Data Is a Different Interface, Not “Noisy Teleop”
Teleop is unstable (tiny controller changes → large behavior shifts).
If interfaces align:
→ human = another robot
→ human data becomes scalable supervision.
3-layer decomposition:
1️⃣ desired world change
2️⃣ embodiment → world interaction (learnable from video)
3️⃣ motor control (NOT learnable from video)
→ Layer 3 (control) is the real bottleneck.

👁️ Egocentric Beats Third-Person
3rd-person (YouTube): scale without alignment.
Egocentric: lower scale, high fidelity + aligned.
→ Robotics is fidelity-limited, not data-limited.
To treat humans as robots (BC-style): IMU + VIO + SLAM + hand pose → recover actions.
Open question: fundamental need, or patch for weak video models?
✋ UMI: Human–Robot Boundary Collapse
Human hand directly controls a gripper with state estimation
→ near-zero sim2real at end-effector
→ performance = fidelity × scalability
Future: human data ≈ robot data

⚙️ Hardware > Algorithms (Today)
Dexterity (5-finger vs. gripper) is NOT a learning problem.
→ bottleneck = actuator speed, bandwidth, control fidelity
Embodiment determines transfer ceiling.

📊 Data Hierarchy (non-obvious)
Ego video > hand pose > language > whole-body pose/force
Tactile = proxy (force is ground truth)
Audio/smell ≈ negligible
→ robots are force-exertion engines

📈 Scaling Law Is Real — but Misunderstood
~1M → ~100M hours needed
Includes incidental data (feet, elbows, daily noise)
→ video-heavy models + data commoditization

🧠 Inductive Bias = Long Context
Actions require history → BC = long-horizon modeling
System 1/2 likely emerge in latent action space, not language.

💥 Field-Level Corrections
“Language as robotics foundation” → wrong interface
Autonomous driving → collapsed into vision (not full-stack)
Robotics requires full-stack integration, not silos
Algorithms are overestimated; systems + hardware are underestimated

🐦 Still Far from General Intelligence
Current robots can overfit (RL), but cannot recombine skills (the “Betty the Crow” gap).

🚀 Robotics won’t be solved by bigger models. It emerges when we align:
→ human data + embodiment + control + long-context learning
Break any interface → system fails.
Intelligence is not trained — it emerges when interfaces finally align.

📺👉🏻 youtu.be/__P5yygfRRQ?si…

x.com/i/article/2050…

0 replies · 4 reposts · 24 likes · 7.5K views
Utkarsh Mishra retweeted
Bowen Li
Bowen Li@Bw_Li1024·
Building autonomous robots that learn to reason and plan in the physical world is a long-standing problem. We are excited to release KinDER, a large task suite and benchmark that brings different communities (TAMP, VLA, RL, etc.) together for this challenge! We welcome everyone to try it out!
Yixuan Huang@YixuanHuang13

Meet KinDER — a stress test for robot physical reasoning. All 13 methods failed 😈
🌎 25 environments
♾️ Infinite tasks
🏋️ Gymnasium API
⚒️ Over 20 parameterized skills
🪧 Human demonstrations
📊 13 baselines (planning and learning)
From @Princeton @CMU_Robotics @ICatGT @CambridgeMLG @nvidia @MIT_CSAIL 🧵 1/n

0 replies · 6 reposts · 24 likes · 3.5K views
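For readers unfamiliar with the Gymnasium API the benchmark advertises, here is a minimal sketch of the reset/step loop. The environment below is a toy stand-in written from scratch, not an actual KinDER environment; only the five-tuple step protocol reflects the real Gymnasium convention.

```python
import random

class ToyPushEnv:
    """Toy environment following the Gymnasium reset/step protocol:
    step() returns (obs, reward, terminated, truncated, info).
    The 1-D dynamics here are invented purely for illustration."""

    def __init__(self, goal=5, max_steps=50):
        self.goal, self.max_steps = goal, max_steps

    def reset(self, seed=None):
        random.seed(seed)
        self.pos, self.t = 0, 0
        return self.pos, {}          # (observation, info)

    def step(self, action):          # action: -1 or +1
        self.pos += action
        self.t += 1
        terminated = self.pos == self.goal
        truncated = self.t >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}

env = ToyPushEnv()
obs, info = env.reset(seed=0)
done = False
while not done:
    action = 1 if obs < env.goal else -1   # trivial scripted "skill"
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
print(obs, reward)  # 5 1.0
```

Any policy that speaks this interface, whether a TAMP planner, a VLA model, or an RL agent, can be dropped into the same loop, which is what makes a horizontal comparison across paradigms possible.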
Utkarsh Mishra retweeted
Danfei Xu
Danfei Xu@danfei_xu·
Long-horizon physical reasoning used to be the specialty of TAMP. KinDER is a new sim benchmark that horizontally compares across paradigms, from VLA to PDDL bilevel planning to RL. If you care about hard physical reasoning tasks, give KinDER a try! To appear at RSS 2026.
Yixuan Huang@YixuanHuang13


0 replies · 8 reposts · 69 likes · 8.9K views
Utkarsh Mishra retweeted
Vaibhav Saxena
Vaibhav Saxena@saxenavaibhav11·
Physical reasoning is core to robot learning, and what KinDER offers is a clean and robust way of testing it - whether you have a large pre-trained manipulation model, or you just want to see if your model can finetune to such data. More details 👇
Yixuan Huang@YixuanHuang13


0 replies · 2 reposts · 5 likes · 986 views
Utkarsh Mishra retweeted
Tom Silver
Tom Silver@tomssilver·
As a planning+learning researcher, I’m really excited about KinDER. It clarifies planning (especially TAMP) for outsiders, defines key open challenges for the field, and creates a common ground to compare & combine planning+learning approaches. (1/n)
Yixuan Huang@YixuanHuang13


2 replies · 11 reposts · 76 likes · 6.8K views
Utkarsh Mishra
Utkarsh Mishra@utkarshm0410·
KinDER is finally released 🎉 A step forward in defining what physical reasoning is and how to evaluate it, through many kinematic and dynamic manipulation challenges. See @YixuanHuang13 post to learn more! See you at RSS!
Yixuan Huang@YixuanHuang13


1 reply · 7 reposts · 13 likes · 1.7K views
Utkarsh Mishra retweeted
Danfei Xu
Danfei Xu@danfei_xu·
Gave a talk on Robot Learning from Human Data at Stanford. It was great to be back! Some opinionated points: 1. Human data collection capacity is outpacing the research. 2. We still don't have the "science" for scaling robot capability with human data. 3. We are far from being able to model naturalistic human behaviors. youtube.com/watch?v=NUtaN1…
1 reply · 21 reposts · 179 likes · 13.6K views
Utkarsh Mishra retweeted
Wei Guo
Wei Guo@WeiGuo01·
I’ll present two papers at ICLR and I’m happy to chat! (1) Proximal Diffusion Neural Sampler (Apr 23 morning, P3-#411) (2) Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond (Apr 23 afternoon, P4-#4509)
Wei Guo@WeiGuo01

How does annealing help overcome multimodality? In our ICLR 2025 paper openreview.net/forum?id=P6IVI… and preprint arxiv.org/abs/2502.04575, we established the first complexity bounds for annealed sampling and normalizing constant (⇔ free energy) estimation under weak assumptions on the target!

0 replies · 8 reposts · 36 likes · 3.8K views
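The annealed importance sampling idea these papers analyze (Jarzynski-style log weights accumulated along a temperature path from an easy base to the target) can be sketched in a few lines. The target below is a toy unnormalized Gaussian chosen so the true constant is known; all tuning values are illustrative, not from the papers.

```python
import math
import random

def log_base(x):
    """Standard normal base: exact log density, and exact sampling."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_target(x):
    """Unnormalized target: Gaussian centered at 3, so the true
    normalizing constant is sqrt(2*pi), i.e. log Z ~= 0.919."""
    return -0.5 * (x - 3.0) ** 2

def ais_log_z(n_chains=500, n_temps=100, seed=0):
    """Annealed importance sampling along the geometric path
    pi_b ∝ base^(1-b) * target^b, accumulating log weights, with one
    random-walk Metropolis move per temperature."""
    random.seed(seed)
    betas = [k / n_temps for k in range(n_temps + 1)]
    total = 0.0
    for _ in range(n_chains):
        x = random.gauss(0.0, 1.0)              # draw from the base
        log_w = 0.0
        for b0, b1 in zip(betas, betas[1:]):
            # weight increment: density ratio along the geometric path
            log_w += (b1 - b0) * (log_target(x) - log_base(x))
            def log_pi(y, b=b1):                # intermediate density
                return (1.0 - b) * log_base(y) + b * log_target(y)
            prop = x + random.gauss(0.0, 0.5)   # Metropolis proposal
            if math.log(random.random()) < log_pi(prop) - log_pi(x):
                x = prop
        total += math.exp(log_w)
    return math.log(total / n_chains)

print(round(ais_log_z(), 2))  # close to log(sqrt(2*pi)) ~= 0.92
```

With annealing the chains cross between modes gradually instead of jumping from base to target in one importance-sampling step, which is exactly the regime the complexity bounds address.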
Utkarsh Mishra retweeted
Yongxin Chen
Yongxin Chen@YongxinChen1·
Check out our ICLR oral paper led by @utkarshm0410 and @davidhe137. We demonstrate the power of inference-time scaling in long-horizon planning tasks using only short-horizon generative models.
Utkarsh Mishra@utkarshm0410


5 replies · 3 reposts · 26 likes · 3.5K views
Utkarsh Mishra retweeted
Shun Iwase
Shun Iwase@s1wase·
VLA Foundry, the last project I was involved in at TRI, has finally been released! Beyond making it easy to try out different language and vision models, it also makes it simple to evaluate multiple tasks in a simulation environment built on Drake + Blender. Please give it a try!
Jean Mercat@MercatJean

Releasing VLA Foundry: an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. End-to-end control from language pretraining to action-expert fine-tuning — no more stitching together incompatible repos.

0 replies · 17 reposts · 116 likes · 14.8K views
Utkarsh Mishra
Utkarsh Mishra@utkarshm0410·
@AnthonyZhang123 Wow, amazing find. Exactly! Our approach comes at this from the compositional angle. We believe that iterative resampling pushes the intermediate states to satisfy the data distributions of both overlapping modes in a more informed manner.
0 replies · 0 reposts · 3 likes · 133 views
Anthony Zhang
Anthony Zhang@AnthonyZhang123·
@utkarshm0410 super cool paper Utkarsh! the iterative resampling reminds me of this paper: arxiv.org/abs/2601.18577, it seems like CDGS’s constant noising and denoising also pushes the output closer to the data distribution
1 reply · 1 repost · 6 likes · 189 views
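The intuition in this exchange (iterative noising and denoising pulling samples toward high-density regions of the data distribution) can be seen in a toy 1-D sketch. The two-mode "denoiser" below is invented for illustration and stands in for a learned diffusion model; it is not CDGS itself.

```python
import math
import random

MODES = [-2.0, 2.0]  # two modes of a toy 1-D "data distribution"

def denoise(x):
    """Toy denoiser: one step toward the posterior mean under a two-mode
    Gaussian mixture (a stand-in for a learned diffusion model)."""
    weights = [math.exp(-0.5 * (x - m) ** 2) for m in MODES]
    mean = sum(w * m for w, m in zip(weights, MODES)) / sum(weights)
    return x + 0.5 * (mean - x)

def iterative_resample(x, rounds=30, noise=0.3, seed=0):
    """Alternate noising and denoising; the iterate is pushed into a
    high-density region (a mode) of the data distribution."""
    random.seed(seed)
    for _ in range(rounds):
        x = x + random.gauss(0.0, noise)  # noising step
        x = denoise(x)                    # denoising step
    return x

print(round(iterative_resample(1.0), 2))  # settles near the mode at +2
```

A starting point between the modes drifts into the nearest mode and stays there: repeated noising explores locally, while each denoising step contracts toward the data distribution, which is the mechanism the replies are pointing at.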
Utkarsh Mishra
Utkarsh Mishra@utkarshm0410·
Super excited for the Oral talk and visiting Brazil for the first time! Both @davidhe137 and I are attending #ICLR2026. We will be presenting on April 24:
Poster: 10 AM, Pavilion 3, P3-#1309
Oral: 4:27 PM, 201 A/B
Looking forward to catching up with everyone!
Utkarsh Mishra@utkarshm0410


1 reply · 2 reposts · 20 likes · 1.4K views
Utkarsh Mishra
Utkarsh Mishra@utkarshm0410·
9/10 We show that CDGS scales with more inference-time compute by expanding the search over wider sets of denoising paths and strengthening message passing between consecutive atomic segments. This leads to improved performance without any retraining.
1 reply · 1 repost · 4 likes · 276 views
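The claim that performance improves with more inference-time compute, without retraining, is at its core a best-of-N search effect. A toy sketch, with a random score standing in for the quality of one sampled denoising path (the scoring function is invented for illustration):

```python
import random

def sample_candidate_score():
    """Stand-in for scoring one sampled denoising path / candidate plan."""
    return random.uniform(0.0, 1.0)

def best_of_n(n, seed=0):
    """More inference-time compute = a wider search over candidates;
    the model itself is unchanged."""
    random.seed(seed)
    return max(sample_candidate_score() for _ in range(n))

for n in (1, 4, 16, 64):
    print(n, round(best_of_n(n), 3))
```

Because each call reseeds, the candidate stream for a larger n extends the smaller one, so the best score is non-decreasing as the search widens; no model parameters change, only the compute spent searching.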