Max Simchowitz
@max_simchowitz
109 posts

Assistant Professor @mldcmu. Formerly: Postdoc @MITEECS, PhD @Berkeley_EECS, Math Undergrad @Princeton. New to Twitter. https://t.co/67bMOAyqK6

Joined May 2024
144 Following · 1.8K Followers
Pinned Tweet
Max Simchowitz @max_simchowitz
⏰⏰ (Another) science of robot learning paper. Why does action chunking work so well in robotic manipulation? Probably lots of reasons, but here’s one you may not have thought of: control stability. After months of polishing and 5 revisions, check out “Action Chunking and Exploratory Data Collection Yield Exponential Benefits for Imitation Learning”. The title says it all. Led by the eloquent @ThomasTCKZhang, illustrated by the talented @dpfroms, supported by @ChaoyiPan of MIP (simchowitzlabpublic.github.io/much-ado-about…) fame, and in collaboration with the one and only @NikolaiMatni.
Thomas Zhang @ThomasTCKZhang

🤖🤖 Very excited to finally share our new work “Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control”. Everyone in robotics does action chunking, but why does it actually work? 🤔🤔 And what can theory tell us about the properties of data we should be collecting for robotic behavior cloning? 🧵1/N
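For readers unfamiliar with the term: action chunking just means predicting a block of future actions from one observation and executing it open-loop before re-planning. A minimal sketch of that rollout loop, with a hypothetical gym-style `env` and a `policy` that returns a sequence of actions (not the paper's code):

```python
def rollout_with_chunking(policy, env, horizon=200):
    """Roll out a policy that predicts a chunk of future actions from a single
    observation and executes the whole chunk open-loop before re-observing.
    `policy(obs)` is assumed to return a sequence of actions (the chunk), and
    `env` follows a gym-style reset()/step() interface; both are placeholders."""
    obs = env.reset()
    t = 0
    while t < horizon:
        action_chunk = policy(obs)           # one forward pass, several actions
        for action in action_chunk:          # executed open-loop, no re-planning
            obs, reward, done, info = env.step(action)
            t += 1
            if done or t >= horizon:
                return
        # only here does the policy condition on a fresh observation again
```

With a chunk length of 1 this reduces to ordinary closed-loop behavior cloning; the claim above is that longer chunks interact with the stability of the closed-loop system.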

Max Simchowitz retweeted
Aran Nayebi @aran_nayebi
1/ As AI agents become increasingly capable, what must *inevitably* emerge inside them? We prove selection theorems: strong task performance forces world models, belief-like memory and—under task mixtures—persistent variables resembling core primitives associated with emotion.
Max Simchowitz retweeted
Aviral Kumar @aviral_kumar2
Can just a 4B model solve IMO-level proof problems at the level of much stronger LLMs like Gemini 3 Pro? Yes, if you can train the LLM to scale test-time compute well! We're very excited to release our 4B model "QED-Nano", built via an awesome open collab! Details below🧵⬇️
Max Simchowitz retweeted
Boyuan Chen @BoyuanChen0
Introducing Large Video Planner (LVP-14B) — a robot foundation model that actually generalizes. LVP is built on video gen, not VLA. As my final work at @MIT, LVP has all its eval tasks proposed by third parties as a maximum stress test, but it excels!🤗 boyuan.space/large-video-pl…
Max Simchowitz retweeted
Yutong (Kelly) He @electronickale
I’ve always seen twitter bros be like “keep building keep shipping” when using ai agents for coding, but now that I’ve tried it for real, it feels less like grinding and more adjacent to eating popcorn while bed rotting and watching heated rivalry (I’m not complaining)
Max Simchowitz retweeted
Francesco Bertolotti @f14bertolotti
The authors ask whether an N-layer ViT can be rewritten using just K<<N layers by recurring on them. Remarkably, they match DINOv2 performance with only 2-3 layers. The paper also offers rich dynamical-systems analysis. Very cool work! 🔗arxiv.org/abs/2512.19941
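A minimal sketch of the weight-tying idea described above: apply K shared transformer blocks for several recurrence steps instead of stacking N distinct layers. The block type, widths, and step counts here are placeholder choices, not the paper's architecture:

```python
import torch.nn as nn

class RecurrentViTEncoder(nn.Module):
    """Illustrative sketch: reuse K shared transformer blocks for several
    recurrence steps rather than stacking N distinct layers. Sizes and the use
    of nn.TransformerEncoderLayer are placeholders, not the paper's model."""

    def __init__(self, dim=384, n_heads=6, k_layers=3, n_recurrences=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
            for _ in range(k_layers)
        )
        self.n_recurrences = n_recurrences

    def forward(self, tokens):                  # tokens: (batch, seq, dim)
        for _ in range(self.n_recurrences):     # loop the same K blocks
            for block in self.blocks:
                tokens = block(tokens)
        return tokens
```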
Max Simchowitz @max_simchowitz
Standard training of generative policies is surprisingly bad at learning multi-modal strategies from training data. So what can we do to pre-train for diverse solutions and effective post-training? Check out @ajwagenmaker 's new paper for an answer :)
Andrew Wagenmaker @ajwagenmaker

How should we pretrain a policy from demonstrations to ensure it is an effective initialization for RL finetuning, while preserving the performance of the pretrained policy itself? We propose Posterior Behavioral Cloning (PostBC)! (1/11)

Max Simchowitz @max_simchowitz
My friends @elvisnava and @mimicrobotics (makers of awesome robot hands 🧤) just put out a video-first VLA. 📹📹 Motivation: a bitter-lesson alternative to retargeting, and a path to cross-embodiment. Key idea: an Nvidia Cosmos backbone with a T5 text encoder, but pre-trained from RGB rather than low-dim actions. Low-dim actions are only needed for a lightweight flow decoder used in task-specific fine-tuning. Benefits: opens the door to human data collection with off-the-shelf cloth gloves (and maybe soon, YouTube!), leverages internet-scale video, and is embodiment agnostic. Excited to see where you go next!
Elvis Nava @elvisnavah

Today @mimicrobotics and friends are excited to share mimic-video, a new class of Video-Action Model that elevates video model backbones as first class citizens for robot learning!
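A rough sketch of the recipe described above: a video/text backbone pre-trained purely from RGB (no actions), plus a small flow-based action decoder that is the only part needing low-dim action labels. All module names, sizes, and interfaces below are hypothetical stand-ins, not mimic-video's actual code:

```python
import torch
import torch.nn as nn

class VideoActionModel(nn.Module):
    """Sketch: action-free pre-trained backbones plus a lightweight flow head.
    Everything here is a placeholder interface, not mimic-video's code."""

    def __init__(self, video_backbone, text_encoder, feat_dim=1024, action_dim=20):
        super().__init__()
        self.video_backbone = video_backbone    # placeholder: pre-trained video model
        self.text_encoder = text_encoder        # placeholder: pre-trained text encoder
        # lightweight flow decoder: predicts a velocity over the action vector
        self.flow_decoder = nn.Sequential(
            nn.Linear(feat_dim + action_dim + 1, 512),
            nn.GELU(),
            nn.Linear(512, action_dim),
        )

    def velocity(self, frames, prompt, noisy_action, t):
        """One flow step for the action head; the backbones only ever see video
        and text, so they can be pre-trained without any robot actions."""
        feats = self.video_backbone(frames) + self.text_encoder(prompt)
        return self.flow_decoder(torch.cat([feats, noisy_action, t], dim=-1))
```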

Max Simchowitz retweeted
Jonathan Fine @jonathanbfine
the mistake so many people make is seeing university professors as intellectuals when they’re actually employees at a combination hedge fund and healthcare conglomerate that operates a small luxury resort/sports franchise where student-customers occasionally take classes
Sander Dieleman @sedielem
Overgeneralisation is probably the main reason why we're still so reliant on it. I think we are also intentionally collapsing the distribution towards its more canonical examples (guidance is just one way to do that, various forms of post-training are also used for this). The data distribution used in pre-training is usually too broad compared to what we want the model to produce. People don't tend to show direct comparisons anymore these days, but most models are pretty bad without guidance (see x.com/sedielem/statu…). I like the toy experiments in the autoguidance paper: arxiv.org/abs/2406.02507
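For reference, the guidance being discussed is classifier-free guidance, which sharpens samples by extrapolating from the unconditional prediction toward the conditional one. A minimal sketch with a placeholder `denoiser`:

```python
def cfg_prediction(denoiser, x_t, t, cond, guidance_scale=3.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. `denoiser(x_t, t, cond)` is a placeholder for a
    trained noise/score predictor; cond=None denotes the unconditional branch."""
    eps_uncond = denoiser(x_t, t, None)
    eps_cond = denoiser(x_t, t, cond)
    # scale = 1 recovers the plain conditional model; larger values sharpen
    # (and collapse) the sampled distribution toward more canonical examples
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```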
Max Simchowitz @max_simchowitz
What is the best current explanation for why CFG is needed for conditional generation? Is it just that unguided models are “undersharpened”/over-smoothed because of NN inductive bias?
Vlad Feinberg @FeinbergVlad
Flash 3.0 is out! Culmination of years of high-conviction infra & science investment (we knew it'd be this good a priori). I truly believe no other place trains these models the way we do. I'm lucky to get to work so intimately with the team that made this happen. We aren't slowing down.
Google DeepMind @GoogleDeepMind

3 Flash delivers frontier performance on benchmarks like GPQA Diamond - evaluating PhD-level reasoning – and Humanity’s Last Exam – testing broad expert knowledge. It’s state-of-the-art on MMMU Pro, with a score comparable to 3 Pro - easily analyzing inputs across videos and images, not just text. And it handles complex tasks significantly faster than 2.5 Pro at a lower cost, using fewer tokens - or units of information - to save time.

Amir-massoud Farahmand @SoloGen
What are your favourite Imitation Learning and Inverse Reinforcement Learning papers? What are the essential papers? It doesn't matter much whether they are new or old, but I prefer conceptually elegant and mathematically solid work.
Max Simchowitz @max_simchowitz
👋👋 New generative modeling paper from @electronickale and @KeelyAi04: evaluating sample likelihoods is a fundamental primitive in flow-based generative modeling. Now we can compute them faster. Much faster. Like 10-100x faster. ✈️✈️ Check out our new work on fast likelihood distillation, F2D2, led by Kelly and Xinyue, together with @_albertgu, @rsalakhu, @zicokolter, and @nmboffi. And stay tuned for more in this direction in the next few months 😉 (And: @KeelyAi04 is applying to grad school and she is awesome.)
Yutong (Kelly) He @electronickale

Diffusion/Flow-based models can sample in 1-2 steps now 👍 But likelihood? Still requires 100-1000 NFEs (even for these fast models) 😭 We fix this! Introducing F2D2: simultaneous fast sampling AND fast likelihood via joint flow map distillation. arxiv.org/abs/2512.02636 1/🧵
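For context on why likelihoods are so much more expensive than sampling: exact log-likelihood in a flow model comes from the change-of-variables formula, which requires integrating the divergence of the velocity field along the whole ODE path, one network call per step. A generic sketch of that standard baseline (not F2D2 itself), with a placeholder `velocity(x, t)` field:

```python
import torch

def flow_log_likelihood_terms(velocity, x0, n_steps=200):
    """Standard (expensive) likelihood computation for a flow model: integrate
    the ODE dx/dt = v(x, t) together with the divergence of v, estimated here
    with a Hutchinson trace estimator. One network evaluation per step.
    `velocity(x, t)` is a placeholder mapping data (t=0) toward noise (t=1)."""
    x = x0
    div_integral = torch.zeros(x0.shape[0])
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x.detach().requires_grad_(True)
        v = velocity(x, t)
        eps = torch.randn_like(x)
        # Hutchinson estimate of div v = E_eps[eps^T (dv/dx) eps]
        div = (torch.autograd.grad((v * eps).sum(), x)[0] * eps).sum(dim=-1)
        div_integral = div_integral + dt * div.detach()
        x = (x + dt * v).detach()            # Euler step toward the noise end
    # log p(x0) = prior.log_prob(x at t=1) + div_integral
    return x, div_integral
```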

Yutong (Kelly) He @electronickale
Diffusion/Flow-based models can sample in 1-2 steps now 👍 But likelihood? Still requires 100-1000 NFEs (even for these fast models) 😭 We fix this! Introducing F2D2: simultaneous fast sampling AND fast likelihood via joint flow map distillation. arxiv.org/abs/2512.02636 1/🧵
Max Simchowitz @max_simchowitz
Second, the entire point is that we don't need anything like SGLD. This is why score matching is the wrong formalism: in two steps, what we do looks nothing like any form of mixing. The fact that we don't need to mix emphasizes that distribution fitting is not needed.
Philip Bachman @philip_bachman
@max_simchowitz For proper samples, you could run some SGLD iterations through the second denoising stage with appropriately scaled noise. Though, mixing time may be bad. Drastically increasing the "scale resolution" was a (the?) big win for DDPMs et al vs similar, earlier DAE-based efforts.
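For context, the SGLD-style sampling mentioned here is just noisy gradient steps along an estimated score. A generic textbook sketch, with `score_fn` as a placeholder for an estimate of grad log p(x), rather than the specific two-stage scheme discussed in the thread:

```python
import torch

def langevin_steps(score_fn, x, step_size=1e-3, n_steps=50):
    """Generic (stochastic gradient) Langevin dynamics: repeatedly move along
    the estimated score and inject Gaussian noise. `score_fn(x)` is a
    placeholder for an estimate of grad_x log p(x)."""
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
    return x
```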
Max Simchowitz @max_simchowitz
⏰⏰ New science of robot learning paper: "Much Ado About Noising." TL;DR: we answer why generative models, like flow and diffusion models, actually work for robotic control tasks 🤖🤖 (hint: it's not multimodality). This leads to a new minimal iterative policy (MIP) that matches flow models with much faster inference 🚄🚄. Check out @ChaoyiPan's thread and paper to find out more. Amazing work by @ChaoyiPan, together with @GuanyaShi, @nmboffi, @guannanqu. Come find us at NeurIPS to chat more!
Chaoyi Pan @ChaoyiPan

Generative models (diffusion/flow) are taking over robotics 🤖. But do we really need to model the full action distribution to control a robot? We suspected the success of Generative Control Policies (GCPs) might be "Much Ado About Noising." We rigorously tested the myths. 🧵👇
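For context on the inference cost being compared: sampling an action from a flow-based generative control policy typically means integrating a learned, observation-conditioned velocity field over several steps. A generic sketch of that standard recipe (the MIP policy itself is not shown, and the `velocity(a, obs, t)` interface is a placeholder):

```python
import torch

def sample_action_flow(velocity, obs, action_dim, n_steps=10):
    """Generic flow-policy sampling: start from Gaussian noise and integrate an
    observation-conditioned velocity field with Euler steps, one network call
    per step. `velocity(a, obs, t)` is a placeholder interface."""
    a = torch.randn(action_dim)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor(i * dt)
        a = a + dt * velocity(a, obs, t)     # Euler integration step
    return a
```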
