Max Fu

1

17

3K

Max Fu retweetledi

Jia-Bin Huang@jbhuang0604·11 Nis

CAPTCHA for 3D vision people

Ryan Schmidt@rms80

wait wait wait. So a new kind of 3D representation (gaussian splats) was invented and they decided to use Y-DOWN as the standard up-direction ?? In 2023 ?!?!!! Need to have some words w/ my old friends at INRIA...and I guess we need a new chart...

English

34

334

1.8K

166.8K

Max Fu retweetledi

Long Lian@LongTonyLian·8 Nis

Our parallel reasoning project ThreadWeaver is now open-sourced 🎉! Check out our Data Gen/SFT/RL recipe at github.com/facebookresear… In case you don't know, ThreadWeaver 🧵⚡️ is the first parallel reasoning method to achieve comparable reasoning performance to widely-used sequential long-CoT LLMs, with up to 3x speedup across 6 challenging tasks.

AK@_akhaliq

ThreadWeaver Adaptive Threading for Efficient Parallel Reasoning in Language Models

English

Pete Florence@peteflorence

23

129

53.7K

Max Fu retweetledi

Stephen James@stepjamUK·8 Nis

𝗙𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹𝘀 𝗰𝗮𝗻 𝗽𝗮𝘀𝘀 𝗹𝗮𝘄 𝗲𝘅𝗮𝗺𝘀. 𝗧𝗵𝗲𝘆 𝗰𝗮𝗻 𝘄𝗿𝗶𝘁𝗲 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗰𝗼𝗱𝗲. 𝗕𝘂𝘁 𝗮𝘀𝗸 𝘁𝗵𝗲𝗺 𝘁𝗼 𝘄𝗿𝗶𝘁𝗲 𝗮 𝗽𝗿𝗼𝗴𝗿𝗮𝗺 𝘁𝗵𝗮𝘁 𝗰𝗼𝗻𝘁𝗿𝗼𝗹𝘀 𝗮 𝗿𝗲𝗮𝗹 𝗿𝗼𝗯𝗼𝘁, 𝗮𝗻𝗱 𝘁𝗵𝗲𝘆 𝘀𝘁𝗶𝗹𝗹 𝗳𝗮𝗹𝗹 𝘀𝗵𝗼𝗿𝘁 𝗼𝗳 𝗮 𝗵𝘂𝗺𝗮𝗻 𝗲𝘅𝗽𝗲𝗿𝘁. That's the core finding from CaP-X, a new framework from NVIDIA, UC Berkeley, Stanford, and CMU that systematically benchmarks coding agents for robot manipulation. The underlying idea is not new. Code as Policy has been around since 2022/2023, and it is best understood as a modern evolution of Task and Motion Planning - a classical robotics paradigm where engineers manually decompose high-level goals into structured programs combining perception, planning, and control. What has changed is that instead of a human writing that code, a language model does it. It works well when the abstractions are high-level. It degrades significantly when models have to reason at the level human engineers actually work at: raw perception outputs, IK solvers, collision constraints. Here is what the research actually shows: 𝗧𝗵𝗲 𝗮𝗯𝘀𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻 𝗴𝗮𝗽 𝗶𝘀 𝗿𝗲𝗮𝗹. Performance drops as you move from high-level primitives to low-level APIs. Not because the models lack intelligence, but because the scaffolding disappears. 𝗠𝘂𝗹𝘁𝗶-𝘁𝘂𝗿𝗻 𝗳𝗲𝗲𝗱𝗯𝗮𝗰𝗸 𝗿𝗲𝗰𝗼𝘃𝗲𝗿𝘀 𝗺𝗼𝘀𝘁 𝗼𝗳 𝘁𝗵𝗮𝘁 𝗹𝗼𝘀𝘀. Multi-turn feedback with execution traces and structured observations dramatically improves performance. Raw images alone actually hurt. 𝗥𝗟 𝗼𝗻 𝗮 𝘀𝗺𝗮𝗹𝗹 𝗺𝗼𝗱𝗲𝗹 𝘁𝗿𝗮𝗻𝘀𝗳𝗲𝗿𝘀 𝘇𝗲𝗿𝗼-𝘀𝗵𝗼𝘁 𝘁𝗼 𝘁𝗵𝗲 𝗿𝗲𝗮𝗹 𝘄𝗼𝗿𝗹𝗱. A 7B model fine-tuned with RL in simulation transfers zero-shot to a real Franka robot by reasoning over structured APIs. The takeaway is simple. The bottleneck is not model size. It is the feedback loop, the abstraction layer, and the system around the model. Credit: @letian_fu, Justin Yu, Karim El-Refai, Ethan Kou, @HaoruXue, @DrJimFan, and the full team across @nvidia, @UCBerkeley, @Stanford, and @CMU_Robotics And of course @AGIBOTofficial for providing the hardware in the attached video! What do you think is holding Code as Policy back from production deployment? Paper link in comments.

English

4

3

22

5.2K

Max Fu retweetledi

Kush Hari@KushtimusPrime·7 Nis

Our new work, STITCH 2.0, can perform consecutive running sutures to close a sample wound with the daVinci robot.

English

7

15

59

25.1K

Max Fu@letian_fu·7 Nis

I resonate with this framing a lot: robotics should be treated primarily as part of the pretraining problem, not a post-training or mid-training one. From roughly 2021 to 2024, some of the most exciting progress in robotics came from this lens: first through visual pretraining on egocentric human data (MVP, R3M, VC-1, etc.), then through robot trajectory pretraining (RPT, ICRT, etc.). Part of why these directions became quieter was not that the framing stopped being useful, but that VLM backbones began to dominate: they were pretrained on much larger-scale data and therefore offered stronger representations out of the box. In that sense, VLMs were a very useful scaffold in the low-robot-data regime, but that scaffold may become less central as robotics data scales.

x.com/i/article/2041…

English

3

2

42

3.2K

Max Fu retweetledi

Generalist@GeneralistAI·2 Nis

Introducing GEN-1. Our latest milestone in scaling robot learning. We believe it to be the first general-purpose AI model to master simple physical tasks. 99% success rates, 3x faster speeds, adapts in real time to unexpected scenarios, w/ only 1 hour of robot data. More🧵👇

English

50

282

1.7K

363.3K

Max Fu retweetledi

Baifeng@baifeng_shi·2 Nis

Excellent examples of how better our robots can be by just improving the high-level intelligence!

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

5

19

2.3K

Max Fu retweetledi

Pete Florence@peteflorence·2 Nis

Very nice! Time for CaP to thrive with modern agentic coding

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

4

18

3.3K

Max Fu retweetledi

Chris Paxton@chris_j_paxton·2 Nis

Building robots code is really hard; agents like this will be increasingly valuable.

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

4

10

46

6.1K

Max Fu retweetledi

Andy Zeng@andyzengineer·2 Nis

Excellent toolkit & benchmark, and I love the full stack web-based UI and visualization tools!

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

8

32

4.4K

Max Fu@letian_fu·2 Nis

I think the key is that the LLM does not need to generate joint-level actions at high frequency. Low-level feedback control and fast perception primitives can run independently at a high frequency, while the LLM can replan at a slower rate. The agent writes code that uses those primitives to specify waypoints, branching logic, and recovery behavior based on perceptual outputs, all based on the model's understanding of the task and how the robot should behave. In that sense, code-as-policy can operate at a higher level than raw motor control while still being highly reactive, and in some cases can be even faster/more reactive than VLA (i.e. dynamically update stiffness/impedance based on force feedback and subtask).

English

0

6

406

amv@aryanmadhaverma·2 Nis

had tried stitching IK functions as tool calls to llm agents a few months back where the agent plans these tool execution based on the vague tasks it has been given. happy to see it formalise into an actual system of operation for robots have a few doubts though. maybe I'm thinking wrong VLA policies output actions at 10+ hz. coding agents wont match this. even if you host on cerebras or groq at 1000+ tokens/sec, continuous reactive control needs more than fast generation. the agent would need to ingest current joint state, distance to target, sensor feedback, run a validation phase, potentially replan, all before the next action step. that loop is architecturally too slow for the kind of realtime adjustment that motor policies handle natively think pouring water into a glass where the water pours slightly outside the glass. you need a micro wrist correction in milliseconds. you can't pause to reflect and rewrite code. the reflect rewrite execution loop is wrong for continuous control. we probably need this loop baked into a VLA that itself acts as a tool for agentic planners where a hybrid agentic control system makes sense cap RL is the coolest part. the idea that an RL trained coding agent could one shot a controller for a task and if that generalizes across embodiments and linkages, not just task specific but for the whole robot, that would be REALLY COOL. we're not there yet (right now it's trained on clean ground truth state, specific tasks), real perception and control will be noisy and constructing the reward and attributing the error to the correct input will be a good problem to solve though really bullish on llm agents for high level task decomposition, long horizon planning and zero shotting new instructions which can be autocorrected later

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

2

0

9

1.9K

Max Fu retweetledi

Tianwei Yue@VyvyenYue·2 Nis

amazing: soon we can dictate to robots like we delegate tasks to llm agents today

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

3

13

2.5K

Max Fu retweetledi

Tairan He@TairanHe99·2 Nis

One day not far in the future we will be prompting robots like the way we prompt coding agents today.

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

3

9

82

10.5K

Max Fu retweetledi

Wenli Xiao@_wenlixiao·1 Nis

One thing I learned working on CaP-X: today's VLMs already have rich zero-shot capabilities that we roboticists keep losing when we distill them into VLAs. Giving models the right robotic MCP/CLI + harness engineering + test-time compute recovers a surprising amount of that. Maybe the next leap in robot deployment isn't a bigger policy -> it's a better coding agent. 🦞

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

7

65

8.9K

Max Fu retweetledi

Raven Huang@RavenHuang4·1 Nis

Check out how coding agents can solve robotics tasks!

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English

6

13

1.3K

Max Fu retweetledi

Wenlong Huang@wenlong_huang·1 Nis

Excited to see techniques we developed back in 2021/2022 remain at the frontier for generalization, now with the latest LLMs/VLMs: task decomposition (LMs as zero-shot planners), structured environment feedbacks (Inner Monologue), and hierarchical code generation (Code as Policies). Would be very interesting time to revisit how test-time planning can generate novel behaviors with task representations synthesized by more powerful LLMs/VLMs (e.g., potential maps or constraints from VoxPoser & ReKep).

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

English