Max Simchowitz (@max_simchowitz) - Twitter Profile

Pinned Tweet

Max Simchowitz@max_simchowitz·Jun 30

As a PhD student, I was told not to work on deep RL: "too full of hacks and alchemy." But after a year or two of working in this area, I’ve come to (deeply?) appreciate all of the thoughtful research that’s gone into understanding what/why things work and how to make them better. My lab (joint with @abhishekunique7) took what we’ve learned from reading this body of literature to answer the question: what are the actual best practices for finetuning a diffusion/flow/generative robot policy (for now, in sim)? Under one set of constraints - ample compute but limited time on your robot - @servo97’s paper gives a pretty compelling answer.

Sarvesh Patil@servo97

Interaction with the real world is the major bottleneck in robot learning. So what would robot RL look like if we didn’t need to limit compute per interaction? Our latest work, Off-Policy Generative Policy Optimization (OGPO, accepted to ICML26) embarks on answering this question (spoiler alert: when done correctly, it helps massively!). 🧵(1/N)

22

34

366

35.5K

Max Simchowitz@max_simchowitz·1d

Robotics people, follow @aran_nayebi. He's been asking some really fundamental questions about what morphologies we need for robotics, and comes from a really unique bio/neuro perspective. Check this work out!

Aran Nayebi@aran_nayebi

It's clear that to unlock the next big advances in robotics, we need at-scale tactile sensing. For the past year, in collab w/ @gs_ai_, we've been working on perhaps the most wide-ranging, realistic tactile simulator to ask: *What should the future of robot hands look like?*

2

6

143

28.4K

Max Simchowitz retweeted

Andrew Zou Li@andrewzouli·2d

Diffusion / flow-based robot policies unlock two axes of test-time scaling: sequential (denoising steps) and parallel (samples). Both improve performance but cost latency on a robot, and which one to scale is often unclear a priori. ELASTIC learns it via RL!

4

20

86

18.7K

Max Simchowitz@max_simchowitz·Jul 4

pro or con: frontier labs who hire a phd student should pay the student’s advisor commission for their mentorship 🙃

8

2

139

24.3K

Max Simchowitz retweeted

Sadhika Malladi@SadhikaMalladi·Jun 30

I am starting a blog about deep learning theory and its value to practitioners! First post is about Adam, broken convergence proofs, and what theory can contribute when stuff just works anyways without it. Subscribe on Substack if you like it! undertheassumptions.substack.com/p/the-optimize…

4

87

429

104.1K

Max Simchowitz@max_simchowitz·Jun 30

TL;DR: I think our paper should be viewed as a proof of concept: there is more room to scale compute and innovate on policy regularization/stabilization. Expressive updates can mean more training stability and exploration, not less. And better policy extraction can be a really, really powerful tool (thanks for sticking with me to the end /n)

0

1

7

421

Max Simchowitz@max_simchowitz·Jun 30

So should you use it? Well, of course, that depends on how much compute you have. We are still working to get OGPO cooking on VLAs, where the costs of compute/inference start to add up, and are testing on real hardware. We are also excited to see OGPO in an offline / batch-RL setup (stay tuned; /n).

0

1

3

319

Max Simchowitz@max_simchowitz·Jun 30

Ok so what did we learn? OGPO is a useful algorithmic template for full-policy finetuning algorithms. It is also compatible with greater compute scaling for more stable training (batched sampling of denoising trajectories reduces variance), and with surprisingly performant techniques to further improve training behavior (success buffer, conservative advantages). It trains quickly even if pre-training data are scarce and of mixed quality, and encourages *more exploration*, not less (the summary /n).

0

1

5

385

Max Simchowitz@max_simchowitz·Jun 30

Nope! Actually, this is the coolest thing. OGPO *increases exploration*. Magically (we don’t entirely understand why), it pushes variance into directions orthogonal to the Q gradients. This means that it squeezes the exploration distribution along the axes that matter, but preserves exploration along axes where the critic is indifferent. Said otherwise: expressive updates can explore more! (the cool part /n)

0

1

9

286

Max Simchowitz@max_simchowitz·Jun 30

Okay, but surely you are gonna kill diversity with this full-policy finetuning? If you didn’t have a great BC starting point for your policy, this PPO/GRPO-ish Frankenstein will make your policy collapse, right? (almost at the cool part /n)

0

1

3

284

Max Simchowitz@max_simchowitz·Jun 30

And why PPO? There are other ways to optimize a denoising MDP, like AWR weighting of likelihood ratios, FPO/FPO++, or even reweighting the flow matching loss. But PPO seems to work best. Why? See the paper. But TL;DR - use the best *on-policy* RL algorithm you’ve got for policy extraction. (keep reading /n)

0

1

5

257

Max Simchowitz@max_simchowitz·Jun 30

Why is this cool? Well, these sample efficiency gains are coming *purely from pushing more FLOPs onto policy extraction*. We always think critic learning or sparse rewards or exploration collapse or plasticity are the bottleneck, but a lot of these challenges seem to be indirectly addressable through the right policy extraction. I have more takes here but will save them for offline (meh/n)

0

1

6

244

Max Simchowitz@max_simchowitz·Jun 30

Results: OGPO and its improved variants work really well, particularly on complex long-horizon tasks with sparse rewards. It can complete certain tasks where alternatives stall (e.g. in the robomimic suite), has 10x the sample efficiency of DPPO, and can even learn from extremely limited data. (13/n)

0

1

4

332

Max Simchowitz@max_simchowitz·Jun 30

Even more fun: conservative advantages. We optimize an ensemble of advantages, and take the smallest if all are positive, the largest if all are negative, and 0 if the signs disagree. In the plot below, this kills the dip in offline-to-online adaptation, and beats regularization and Cal-QL. Note you can only do this with a zero-order optimization through the denoising because it relies on advantages, *not critic gradients* (12/n)

0

1

5

328
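
The sign-agreement rule described in the tweet above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function name `conservative_advantage` and the plain-list input are hypothetical, and an estimate of exactly zero is treated here as a sign disagreement.

```python
from typing import Sequence


def conservative_advantage(advantages: Sequence[float]) -> float:
    """Combine an ensemble of advantage estimates for a single sample.

    Rule from the thread: smallest if all positive, largest if all
    negative, and 0 if the signs disagree. In both agreeing cases the
    result is the estimate closest to zero.
    """
    if not advantages:
        raise ValueError("need at least one advantage estimate")
    if all(a > 0 for a in advantages):
        return min(advantages)
    if all(a < 0 for a in advantages):
        return max(advantages)
    # Mixed signs (or an exact zero, by assumption): no update signal.
    return 0.0


print(conservative_advantage([0.8, 0.3, 0.5]))     # all positive -> 0.3
print(conservative_advantage([-0.2, -0.9, -0.4]))  # all negative -> -0.2
print(conservative_advantage([0.6, -0.1, 0.4]))    # signs disagree -> 0.0
```

Because the rule only needs scalar advantages, it fits a zero-order (PPO-style) update and does not require differentiating through the critic.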

Max Simchowitz@max_simchowitz·Jun 30

We propose two better alternatives. The first: BC regularization applied only to successful trajectories (OGPO+), which are stored in a “success buffer”. Basically just an SFT penalty, but only on successes. (11/n)

0

5

266
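
As a rough picture of the “success buffer” idea above, the sketch below keeps only transitions from successful episodes, so that a BC/SFT penalty could later be computed on samples drawn from it. The class name `SuccessBuffer`, its methods, and the `(observation, action)` transition format are all hypothetical choices for illustration; the thread does not specify how the buffer is implemented.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

# Assumed transition format: (observation, action) pairs.
Transition = Tuple[Sequence[float], Sequence[float]]


class SuccessBuffer:
    """FIFO store that only accepts transitions from successful episodes."""

    def __init__(self, capacity: int) -> None:
        self._data: Deque[Transition] = deque(maxlen=capacity)

    def add_trajectory(self, trajectory: Sequence[Transition], success: bool) -> None:
        # Failed episodes are dropped entirely; only successes are imitated.
        if success:
            self._data.extend(trajectory)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Sample without replacement, capped at the current buffer size.
        k = min(batch_size, len(self._data))
        return rng.sample(list(self._data), k)

    def __len__(self) -> int:
        return len(self._data)


buffer = SuccessBuffer(capacity=1000)
buffer.add_trajectory([([0.0], [0.1])], success=False)                 # ignored
buffer.add_trajectory([([0.0], [0.2]), ([1.0], [0.3])], success=True)  # stored
batch = buffer.sample(batch_size=8, rng=random.Random(0))
print(len(buffer), len(batch))
```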

Max Simchowitz@max_simchowitz·Jun 30

Well, the obvious thing would be to add regularization or make your critic conservative. That works okay, but it’s literally just slowing down learning. Not the most sample-efficiency-pilled thing to do (10/n).

0

4

241

Max Simchowitz@max_simchowitz·Jun 30

Okay okay, but surely this is too aggressive for optimizing a really expressive policy? Aren’t you gonna exploit your critic and collapse? Yes! Unless…👉👈🥺

0

4

244

Max Simchowitz@max_simchowitz·Jun 30

PPO through parallel denoising trajectories is awesome because zero-order updates work well for highly precise tasks, where gradients might be stiff or unreliable (e.g. here) (8/n)

0

1

3

299

Max Simchowitz@max_simchowitz·Jun 30

More interestingly, denoising steps can be done purely in your imagination. This means that we can scale up the gradient signal by sampling multiple denoising trajectories from a single state, à la GRPO. No need to train a value network either - MC through the denoising gives you PPO baselines. (7/n)

0

1

7

366
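
A minimal sketch of the GRPO-style baseline mentioned above: draw several denoised actions from the same state, score each with the critic, and use the group mean as a Monte Carlo baseline instead of a learned value network. The function name `group_relative_advantages` and the toy Q-values are hypothetical, and the paper may normalize or weight the group differently.

```python
from statistics import fmean
from typing import List, Sequence


def group_relative_advantages(q_values: Sequence[float]) -> List[float]:
    """Advantages for K action samples drawn from one state.

    q_values[i] is the critic's score for the i-th denoised action.
    The baseline is the Monte Carlo mean over the group.
    """
    if len(q_values) < 2:
        raise ValueError("need at least two samples to form a group baseline")
    baseline = fmean(q_values)
    return [q - baseline for q in q_values]


# Four imagined denoising rollouts from the same state (toy numbers).
q_samples = [1.0, 2.0, 3.0, 6.0]
print(group_relative_advantages(q_samples))  # [-2.0, -1.0, 0.0, 3.0]
```

Since the samples are generated by the policy itself rather than by the environment, adding more of them costs compute but no extra robot interaction.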

Max Simchowitz@max_simchowitz·Jun 30

Optimizing a critic drastically improves credit assignment and data reuse (no surprise there). (6/n)

0

5

304

Max Simchowitz@max_simchowitz·Jun 30

We propose off-policy generative policy optimization (OGPO). It’s a pretty natural algorithm template. Like DPPO, we treat each denoising step as one step in a “denoising MDP.” We then use this formulation to learn a policy which maximizes a Q critic as a terminal reward. This effectively severs the DPPO formulation after denoising, which has tons of cool benefits (5/n)

0

1

8

425
