Ayush Saraf

596 posts

@ayush29feb

ML Engineer @ Meta Reality Labs

New York, NY · Joined April 2010
343 Following · 247 Followers
Ayush Saraf retweeted
Santiago @svpino
Most people write worse code than AI does. I've worked in absolutely horrible codebases, 100% written and messed up by humans. So why do we complain about AI-generated slop code today? Because of the scale at which we are producing it. Before, you needed a bad programmer to manually write and deploy a ton of bad code. Today, you can generate virtually unlimited bad code very cheaply and without any constraints. So the quality of the code might be improving, but the overall amount of technical debt is increasing exponentially.
Ayush Saraf retweeted
John Doomer @jonathandoomer
@OpenAI "Chat are you sure I'm on the right road?" "Absolutely just keep going straight"
Ayush Saraf retweeted
Bilawal Sidhu @bilawalsidhu
Pokémon Go players captured 30 billion images and built one of the most detailed 3D maps in the world. Niantic just licensed it to train delivery robots. I actually sat down with their CTO, co-creator of Google Earth: youtu.be/qmRi23I21DY What he described makes the picture a lot clearer. But it also opens a much bigger question -- because Niantic is far from the only company mapping reality right now, and the others are capturing way more than parks and statues.
Ayush Saraf retweeted
Andrew Curran @AndrewCurran_
OpenAI plans to have an 'autonomous AI research intern' up and running by September of this year. And by 2028, a fully automated multi-agent research team.
Ayush Saraf retweeted
Peter Holderrieth @peholderrieth
We are also releasing self-contained lecture notes that explain flow matching and diffusion models from scratch. This goes from "zero" to the state-of-the-art in modern Generative AI. 📖 Read the notes here: arxiv.org/abs/2506.02070 Joint work with @EErives40101.
Peter Holderrieth @peholderrieth

🚀 MIT Flow Matching and Diffusion Lecture 2026 released (diffusion.csail.mit.edu)! We just released our new MIT 2026 course on flow matching and diffusion models! We teach the full stack of modern AI image, video, and protein generators - theory and practice. We include:
📺 Videos: step-by-step derivations
📝 Notes: mathematically self-contained lecture notes
💻 Coding: hands-on exercises for every component
We fully reworked last year's iteration and added new topics: latent spaces, diffusion transformers, and building language models with discrete diffusion models. Everything is available here: diffusion.csail.mit.edu
A huge thanks to Tommi Jaakkola for his support in making this class possible and Ashay Athalye (MIT SOUL) for the incredible production! Was fun to do this with @RShprints!
#MachineLearning #GenerativeAI #MIT #DiffusionModels #AI
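The core flow matching objective such courses build up to can be sketched in a few lines. This is a minimal illustration assuming the common linear (rectified-flow) probability path, not code from the course itself: the network is trained to regress the velocity x1 - x0 at a random point on the straight line between a noise sample and a data sample.

```python
import numpy as np

# Minimal conditional flow matching sketch (linear path). Toy data;
# a real model v_theta(x_t, t) would be a neural network.
rng = np.random.default_rng(0)

x0 = rng.normal(size=(8, 2))        # samples from the noise distribution
x1 = rng.normal(size=(8, 2)) + 3.0  # toy "data" samples
t = rng.uniform(size=(8, 1))        # random times in [0, 1]

x_t = (1.0 - t) * x0 + t * x1       # point on the straight-line path
v_target = x1 - x0                  # velocity the model should predict at (x_t, t)

def loss(v_pred: np.ndarray) -> float:
    # Flow matching loss: mean squared error against the target velocity.
    return float(np.mean((v_pred - v_target) ** 2))
```

At sampling time, one would integrate the learned velocity field from t=0 to t=1 starting from fresh noise, e.g. with a simple Euler loop.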

Ayush Saraf retweeted
Oliver Prompts @oliviscusAI
🚨 BREAKING: Someone just open-sourced a tool that turns the real world into a playable Minecraft map. It pulls data directly from OpenStreetMap and generates your exact neighborhood, city, or street block by block. 100% Open Source.
Ayush Saraf retweeted
World Labs @theworldlabs
🥇 1st place: Musée du Monde An interactive museum where visitors step inside famous paintings. From Van Gogh’s bedroom to worlds inspired by Vermeer and Matisse, each artwork becomes a fully explorable 3D environment generated with Marble.
Ayush Saraf retweeted
Lex @lexx_aura
Animated AI short film 🤯 this is truly a masterpiece Midjourney + Nano Banana Pro + Seedance 2.0
Ayush Saraf retweeted
Rohan Paul @rohanpaul_ai
Beautiful. Someone created an open-source Pixel Office interface for OpenClaw. It visually tracks status by moving a lobster character into specific work, rest, or bug areas. The GitHub repo got 5K+ stars (⭐️)
Ayush Saraf retweeted
Jon Barron @jon_barron
There's an engineer on YouTube building his own room-scale laundry-picking UFO catcher robot out of QR codes and string, it's one of the most compelling robotics demos I've seen in a while.
Ayush Saraf retweeted
0xMarioNawfal @RoundtableSpace
OPENCLAW AGENTS NOW JOIN SCRUM MEETINGS AND REPORT THEIR PROGRESS IN REAL TIME. STANDUPS WITH YOUR AI ENGINEERS.
Ayush Saraf retweeted
Zhuokai Zhao @zhuokaiz
AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models. But almost nobody means the same thing by that term. Here are, in my view, five categories of world models.

---

1. Joint Embedding Predictive Architecture (JEPA)

Representatives: AMI Labs (@ylecun), V-JEPA 2

The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years — predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.

Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space — not in pixel space. This is a crucial design choice. A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.

V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction — no labels, no text. The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.

The data efficiency is the real technical headline. 62 hours is almost nothing.
It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design — if your representations are good enough, you don't need to brute-force every task from scratch.

AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet — their CEO has openly said commercial products could be years away.

---

2. Spatial Intelligence (3D World Models)

Representative: World Labs (@drfeifei)

Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?" The thesis is that true understanding requires explicit spatial structure — geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints — not just temporal prediction. This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.

Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word — unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.

For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs.
The challenge is maintaining physical plausibility — consistent geometry, reasonable lighting, sensible occlusion — across a generated world that never existed.

---

3. Learned Simulation (Generative Video + Latent-Space RL)

Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1

This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.

The video generation lineage. DeepMind's Genie 3 is the purest version — text prompt in, navigable environment out, 24 fps at 720p, with consistency for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory — turn away from a tree and turn back, and it needs to still be there. DeepMind reports consistency up to about a minute, which is impressive but still far from what you'd need for sustained agent training.

Runway's GWM-1 takes a similar foundation — autoregressive frame prediction built on Gen-4.5 — but splits into three products: Worlds, Robotics, and Avatars. The split into Worlds / Avatars / Robotics suggests the practical generality problem is still being decomposed by action space and use case.

The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning. Dreamer V3 was the first AI to get diamonds in Minecraft without human data.
Dreamer 4 did the same purely offline — no environment interaction at all. Architecturally, Dreamer 4 moves from Dreamer's earlier recurrent-style lineage to a more scalable transformer-based world-model recipe, and introduced "shortcut forcing" — a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.

These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies. But Dreamer 4 blurred the line — humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents. The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.

The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics — the real world is messier.

---

4. Physical AI Infrastructure (Simulation Platform)

Representative: NVIDIA Cosmos

NVIDIA's play is: don't build the world model, build the platform everyone else uses to build theirs. Cosmos launched at CES January 2025 and covers the full stack — data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices. The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data.
They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.

Three model families sit on top of this. Predict generates future video states from text, image, or video inputs — essentially video forecasting that can be post-trained for specific robot or driving scenarios. Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI — your model works great in simulation but breaks in the real world due to visual and dynamics gaps. Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes — spatiotemporal awareness, causal understanding of interactions, video Q&A.

---

5. Active Inference

Representative: VERSES AI (Karl Friston)

This is the outlier on the list — not from the deep learning tradition at all, but from computational neuroscience. Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise). Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.

VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation. The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations.
Inference is Bayesian — beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.

In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining — you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.

They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.

---

imo, these five categories aren't really competing — they're solving different sub-problems. JEPA compresses physical understanding. Spatial intelligence reconstructs 3D structure. Learned simulation trains agents through generated experience. NVIDIA provides the picks and shovels. Active inference offers a fundamentally different computational theory of intelligence. My guess is the lines between them blur fast.
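The zero-shot planning loop described for V-JEPA 2 above (sample candidate action sequences, roll them forward through the world model, pick the one whose predicted outcome best matches a goal) is straightforward to sketch. This is a toy random-shooting planner with a stand-in linear latent dynamics model; every name and number here is illustrative, not from any of the systems discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(z: np.ndarray, a: np.ndarray) -> np.ndarray:
    # Stand-in latent dynamics model: next state = state + scaled action.
    # A real world model would be a learned network.
    return z + 0.5 * a

def plan(z0, z_goal, horizon=5, n_candidates=256):
    # Random-shooting planning: sample action sequences, roll each one
    # through the dynamics model, keep the sequence whose predicted
    # final latent state is closest to the goal embedding.
    best_seq, best_dist = None, np.inf
    for _ in range(n_candidates):
        seq = rng.normal(size=(horizon, z0.shape[0]))
        z = z0
        for a in seq:
            z = dynamics(z, a)
        dist = float(np.linalg.norm(z - z_goal))
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist

z0 = np.zeros(4)       # current latent state
z_goal = np.ones(4)    # encoding of the goal image
seq, dist = plan(z0, z_goal)
```

In practice the candidate distribution is usually refined iteratively (e.g. with the cross-entropy method) rather than sampled once, and only the first action of the best sequence is executed before re-planning.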
Ayush Saraf retweeted
Kris Kashtanova @icreatelife
🎉 We just released Rotate Object in Photoshop (beta) 🎉 You can now rotate 2D images! 🤯 Then use Harmonize to add light and shadows, to blend it perfectly with the rest of the scene. It's like Turntable in Illustrator, but instead of vectors, it's pixels in Photoshop!
Ayush Saraf retweeted
Bilawal Sidhu @bilawalsidhu
Experimenting with blending 3D Gaussian splats + geospatial 3D tiles. You get the best of both worlds -- the insanely rich detail of your own custom 3D captures, perfectly grounded in the spatial 3D context of Google's global photogrammetry coverage. Another step towards seamless digital twins. No more floating islands!
Ayush Saraf retweeted
Ravid Shwartz Ziv @ziv_ravid
Soatto vs LeCun: Are LLMs World Models?🧐
Ayush Saraf retweeted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat.

Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.

And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
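The workflow described here (propose a change, run an experiment, keep the change only if validation loss improves) is essentially greedy hill-climbing over a configuration. A toy sketch with a synthetic stand-in for "train and measure validation loss"; none of this is nanochat code, and the knob names are made up:

```python
import random

random.seed(0)

config = {"lr": 1.0, "weight_decay": 1.0, "beta2": 1.0}  # hypothetical knobs

def val_loss(cfg):
    # Stand-in for "train a model, measure validation loss":
    # minimized when every knob reaches 0.5.
    return sum((v - 0.5) ** 2 for v in cfg.values())

best = val_loss(config)
accepted = []
for step in range(200):
    key = random.choice(list(config))                      # propose one change
    trial = dict(config, **{key: config[key] + random.gauss(0, 0.1)})
    loss = val_loss(trial)                                 # run the "experiment"
    if loss < best:                                        # keep only improvements
        config, best = trial, loss
        accepted.append((step, key))
```

An agent swarm replaces the random proposal step with model-generated code edits informed by the history of past results, but the accept/reject skeleton is the same.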
Ayush Saraf retweeted
Min Choi @minchoi
MatAnyone 2 just killed the green screen 💀 This AI removes any background from any video... No studio. No setup. No green screen. Wild examples. DEMO + CODE in comments 👇