Grant Watson

538 posts

@grhwatson

ML @RecursionPharma. Previous: ML Engineer @dewpoint_tx @PhenomicAI. Into ML, physics, math, music, computer-generated art, and Dungeons & Dragons.

Toronto, Ontario · Joined August 2017
2.2K Following · 234 Followers
Grant Watson retweeted
Bo Wang @BoWang87
Three weeks ago I shared that Claude had shocked Prof. Donald Knuth by finding an odd-m construction for his open Hamiltonian decomposition problem in about an hour of guided exploration. Prof. Knuth titled the paper Claude's Cycles. The story didn't end there. The updated paper shows the story got much bigger. For the base case m=3, there are exactly 11,502 Hamiltonian cycles. Of those, 996 generalize to all odd m, and Prof. Knuth shows there are exactly 760 valid "Claude-like" decompositions in that family. The even case, which Claude couldn't finish, was then cracked by Dr. Ho Boon Suan using GPT-5.4 Pro to produce a 14-page proof for all even m≥8, with computational checks up to m=2000. Soon after, Dr. Keston Aquino-Michaels used GPT + Claude together to find simpler constructions for both odd and even m, using a multi-agent workflow. Dr. Kim Morrison also formalized Knuth's proof of Claude's odd-case construction in Lean. So yes: the problem now appears fully resolved through the updated paper's ecosystem of human + AI + proof assistant work! We went from one AI solving one problem to a full mathematical ecosystem (multiple AI systems, multiple humans, formal verification) running in parallel on a problem that stumped experts for weeks. We are living in very interesting times indeed. Paper (updated): www-cs-faculty.stanford.edu/~knuth/papers/…
Bo Wang tweet media
Bo Wang @BoWang87

Prof. Donald Knuth opened his new paper with "Shock! Shock!" Claude Opus 4.6 had just solved an open problem he'd been working on for weeks — a graph decomposition conjecture from The Art of Computer Programming. He named the paper "Claude's Cycles." 31 explorations. ~1 hour. Knuth read the output, wrote the formal proof, and closed with: "It seems I'll have to revise my opinions about generative AI one of these days." The man who wrote the bible of computer science just said that. In a paper named after an AI. Paper: cs.stanford.edu/~knuth/papers/…

41 replies · 265 reposts · 1.4K likes · 169.1K views
Grant Watson retweeted
Sakana AI @SakanaAILabs
The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature Nature: nature.com/articles/s4158… Blog: sakana.ai/ai-scientist-n… When we first introduced The AI Scientist, we shared an ambitious vision of an agent powered by foundation models capable of executing the entire machine learning research lifecycle. From inventing ideas and writing code to executing experiments and drafting the manuscript, the system demonstrated that end-to-end automation of the scientific process is possible. Soon after, we shared a historic update: the improved AI Scientist-v2 produced the first fully AI-generated paper to pass a rigorous human peer-review process. Today, we are happy to announce that “The AI Scientist: Towards Fully Automated AI Research,” our paper describing all of this work, along with fresh new insights, has been published in @Nature! This Nature publication consolidates these milestones and details the underlying foundation model orchestration. It also introduces our Automated Reviewer, which matches human review judgments and actually exceeds standard inter-human agreement. Crucially, by using this reviewer to grade papers generated by different foundation models, we discovered a clear scaling law of science. As the underlying foundation models improve, the quality of the generated scientific papers increases correspondingly. This implies that as compute costs decrease and model capabilities continue to exponentially increase, future versions of The AI Scientist will be substantially more capable. Building upon our previous open-source releases (github.com/SakanaAI/AI-Sc…), this open-access Nature publication comprehensively details our system's architecture, outlines several new scaling results, and discusses the promise and challenges of AI-generated science. 
This substantial milestone is the result of a close and fruitful collaboration between researchers at Sakana AI, the University of British Columbia (UBC) and the Vector Institute, and the University of Oxford. Congrats to the team! @_chris_lu_ @cong_ml @RobertTLange @_yutaroyamada @shengranhu @j_foerst @hardmaru @jeffclune
GIF
48 replies · 398 reposts · 1.9K likes · 653K views
Grant Watson retweeted
Archie Sengupta @archiexzzz
i spent a few hours going through /karpathy/autoresearch repo line by line. the "ai agents doing research" angle is what's getting all the attention but i think the more interesting thing is what's actually inside the training script and the engineering decisions that make the search loop tight. it's one of the most dense single-file training setups i've read. let me start with the thing that makes the whole project possible: the time budget is fixed at 300 seconds wall clock. not fixed steps, not fixed tokens, not fixed flops. wall clock seconds. this sounds like a minor detail but it's the entire reason the autonomous loop works. the agent can make the model 3x bigger, cut the batch size in half, swap in a completely different architecture, and the result is still directly comparable to every other experiment because they all got exactly 5 minutes of training on the same gpu. if you fixed steps instead, a bigger model would get fewer gradient updates per second and you'd be penalizing it unfairly. if you fixed tokens, you'd have the same problem. fixing wall time means you're asking the right question: given this hardware and this much time, what is the best model you can produce? everything else is a free variable. the agent can explore the full pareto surface of model size vs throughput vs convergence speed without any of those tradeoffs being confounded by the evaluation protocol. the metric is also carefully chosen. it's bits per byte, not cross entropy loss. cross entropy depends on your vocab size. a model with 32k tokens and a model with 8k tokens will have very different loss values even if they compress the data equally well. bpb normalizes this away by summing the per-token cross entropy in nats, summing the utf-8 byte lengths of the target tokens, and converting nats-per-byte to bits-per-byte. so even if the agent changes something that affects the effective token distribution, the comparison remains fair.
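the bpb conversion described here fits in a few lines. a minimal sketch (function name and argument names are mine, not the repo's):

```python
import math

def bits_per_byte(ce_nats_per_token, target_byte_lens):
    """Vocab-invariant eval metric as described: total cross entropy
    (in nats) divided by total UTF-8 bytes of the targets, then
    converted from nats/byte to bits/byte via log(2)."""
    total_nats = sum(ce_nats_per_token)
    total_bytes = sum(target_byte_lens)
    return (total_nats / total_bytes) / math.log(2)
```

sanity check: a model that assigns probability 1/2 to every one-byte token has cross entropy ln(2) nats per token, which comes out to exactly 1.0 bpb regardless of vocab size.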
these two choices, fixed wall time and a vocab-invariant metric, turn what would be a messy incomparable search into a clean optimization problem. now the model itself. it's a GPT but with a bunch of modern tricks that are worth understanding. first, RMSnorm everywhere. on the block inputs (pre-norm), and also on queries and keys right before the attention dot product. this QK-norm thing is important because without it the norms of q and k can grow unboundedly during training, causing attention logits to sharpen and softmax to saturate. normalizing q and k keeps the dot products in a stable range regardless of how deep the network is or how training dynamics evolve. the attention itself is FA 3, loaded through the kernels library. it uses varunneal's implementation on hopper (sm_90) and falls back to a community build on older gpus. the attention pattern is "SSSL" which means three layers of sliding window attention (window = half the sequence length) followed by one layer of full causal attention, repeating. this is the sparse-to-dense pattern you see in mistral and gemma2. the local attention layers are computationally cheap because the attention matrix is banded, and the periodic global layer lets information flow across the full context. with 8 layers and a 4-character pattern you get layers 0,1,2 local, layer 3 global, layers 4,5,6 local, layer 7 global. the last layer is forced global regardless of pattern. the value embedding thing is subtle and i think underappreciated. every other layer gets its own embedding table, completely separate from the main token embedding, that maps token ids directly to value-dimension vectors. these get mixed into the attention values through a learned gate: v = v + 2 * sigmoid(W_gate @ x[:32]) * ve. the gate weight is zero-initialized, so sigmoid(0) = 0.5, times 2 gives 1.0, which is a neutral starting point.
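the SSSL schedule is easy to pin down in code. a minimal sketch (helper name is my own):

```python
def is_global_layer(layer_idx: int, n_layers: int, pattern: str = "SSSL") -> bool:
    """Repeating local/global attention schedule as described:
    'S' = sliding-window attention, 'L' = full causal attention.
    The last layer is forced global regardless of the pattern."""
    if layer_idx == n_layers - 1:
        return True
    return pattern[layer_idx % len(pattern)] == "L"
```

for 8 layers this yields global attention at layers 3 and 7, matching the layer layout described in the post.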
over training, the model can learn to amplify or suppress the value embedding per-head based on the first 32 dimensions of the hidden state. this is from the ResFormer line of work and the intuition is that it gives attention a direct shortcut to token identity. the value vectors can carry information about "what token is at this position" without that information having to survive the residual stream transformations from earlier layers. it's essentially a skip connection from the input directly into the attention values, gated so the model can decide when it's useful. there are also per-layer learnable scalars on the residual stream: x = lambda_resid[i] * x + lambda_x0[i] * x0, where x0 is the normalized embedding from layer 0. every layer can independently control how much it listens to the running residual vs the original input. the residual lambdas start at 1.0, the x0 lambdas start at 0.1. this is a soft version of the "disentangled residual" idea. in a standard transformer the residual stream is a sum of all previous layer outputs and it gets increasingly polluted as you go deeper. giving each layer access to the clean original embedding means it doesn't have to learn to "undo" earlier layers to recover low-level information. the logits are softcapped at 15 via tanh(logits/15)*15 which prevents the model from being overconfident early in training when the representations are still noisy. but honestly the most interesting part of the whole file is the optimizer. MuonAdamW is a combined optimizer that dispatches different update rules based on parameter group. embeddings (token embedding, value embeddings, unembedding head) and per-layer scalars get standard AdamW with different learning rates for each group. the spread is wild. embedding lr is 0.6, unembedding lr is 0.004, that's a 150x difference, and it's intentional. the embedding matrix sees every single token and needs to update aggressively.
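the residual-scalar mix and the logit softcap described here are both one-liners. a scalar sketch (function names and the scalar stand-ins for per-layer learnable parameters are mine):

```python
import math

def mix_residual(x, x0, lam_resid=1.0, lam_x0=0.1):
    """Per-layer residual mixing as described; defaults are the inits
    (residual lambda 1.0, x0 lambda 0.1). Scalars stand in for the
    learnable per-layer parameters applied to tensors."""
    return lam_resid * x + lam_x0 * x0

def softcap(logit, cap=15.0):
    """tanh softcap: near-identity for small logits, saturates at +/-cap."""
    return math.tanh(logit / cap) * cap
```

near zero the softcap barely changes the logit, while any huge logit is squashed to just under the cap, which is what keeps early-training confidence bounded.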
the unembedding matrix is a linear probe on the final representation and benefits from stability. the embedding, value embedding, and unembedding learning rates are all scaled by (d_model / 768)^(-0.5) which is a muP-inspired correction. as model width changes, those learning rates adjust to keep the feature learning dynamics scale-invariant. the scalar learning rates for the per-layer lambdas are handled separately and don't get this scaling. the 2D weight matrices in the transformer, attention projections and mlp weights, get Muon, and this is where it gets genuinely interesting. muon takes the gradient, applies nesterov momentum, then runs a newton-schulz iteration to approximate the polar decomposition of the gradient matrix. the polar decomposition factors a matrix G into G = U * S where U is orthogonal and S is symmetric positive semi-definite. muon computes U, the nearest orthogonal matrix to the gradient, and uses that as the update direction. the newton-schulz iteration is 5 steps. for tall matrices (more rows than columns), A = X^T @ X then X -> aX + X @ (bA + cA^2). for wide matrices, A = X @ X^T then X -> aX + (bA + cA^2) @ X. the coefficients are hardcoded from a precomputation. they call it "polar express." the whole thing compiles to a single fused kernel via torch.compile. why does this matter? because for weight matrices the frobenius norm gradient (what adam and sgd use) is geometrically wrong. the "correct" steepest descent direction for a weight matrix is the one that minimizes the loss subject to the constraint that the update has unit spectral norm, not unit frobenius norm. the orthogonal polar factor gives you exactly this. in practice it means muon makes much larger effective updates because it's not wasting step size on scaling the singular values: it normalizes them all to one and keeps only the rotation. this is why muon converges significantly faster than adam on transformer weight matrices.
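the newton-schulz step can be sketched with numpy. a caveat: the quintic coefficients below are the standard Muon values, an assumption on my part; the repo hardcodes its own "polar express" set, and the function name is mine:

```python
import numpy as np

def orthogonalize(g, steps=5):
    """Sketch of the Muon update direction: iterate toward the
    orthogonal polar factor U of gradient matrix g = U @ S.
    Coefficients are the standard Muon quintic (an assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # dividing by the frobenius norm bounds the spectral norm by 1,
    # which the iteration needs to converge
    x = g / (np.linalg.norm(g) + 1e-7)
    tall = x.shape[0] > x.shape[1]
    if tall:                      # work in the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if tall else x
```

after 5 steps the singular values all land near 1 (not exactly 1; the iteration is tuned for speed, not tight convergence), so the update direction is approximately orthogonal.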
muon does maintain per-element momentum buffers (same shape as the parameters, stacked across each shape group), but unlike adam it doesn't track per-element second moments. the second moment estimates are per-row or per-column after orthogonalization, not per-element. that's where NorMuon comes in. on top of the base muon there's NorMuon, a variance reduction scheme. after orthogonalization, it computes per-row (or per-column depending on aspect ratio) second moment estimates, maintains an exponential moving average of those, and rescales the update so each output dimension gets its own adaptive step size. it's essentially the adam adaptivity idea but applied in the orthogonalized coordinate system rather than the raw parameter space. the weight decay is also non-standard. it's "cautious," meaning it only decays parameters where the muon update direction agrees with the parameter sign: mask = (g * params) >= 0. this avoids the known failure mode where weight decay pushes parameters toward zero against the update's wishes, which can destabilize training. one small detail i appreciated: after the very first training step, the code calls gc.collect(), gc.freeze(), gc.disable() to completely shut off python's garbage collector. python's GC runs periodically and causes ~500ms stalls. when your total budget is 300 seconds and each step is maybe 300ms, a random GC pause costs you almost 2 training steps. they manually trigger gc.collect() every 5000 steps as a compromise. this is the kind of thing you only learn by profiling real training runs and noticing mysterious throughput drops. the first 11 steps (0 through 10) aren't counted toward the time budget either. that's the warmup where torch.compile does its thing and CUDA kernels get JIT'd. without this exclusion, different experiments would get different amounts of "real" training depending on how long compilation takes for that particular model configuration. 
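the cautious weight decay mask quoted here, mask = (g * params) >= 0, takes one line to apply. a minimal sketch (function name and hyperparameter defaults are mine, purely illustrative):

```python
import numpy as np

def cautious_decay_step(params, update, lr=0.02, wd=0.1):
    """'Cautious' weight decay as described: decay only the entries
    where the update direction agrees with the parameter sign, so
    decay never fights the optimizer's own update."""
    mask = (update * params) >= 0
    return params - lr * update - lr * wd * params * mask
```

in the disagreeing entries the decay term is simply dropped, which is the whole trick: shrinkage happens only when the update was already pushing the parameter toward zero.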
again, a design choice that seems small but is critical for making experiments comparable. now zoom out. the actual autoresearch loop is: the agent reads program.md (a markdown file that describes its job), modifies train.py, commits, runs for 5 minutes, checks if val_bpb improved, keeps or reverts, repeats. program.md explicitly says "NEVER STOP." the agent runs indefinitely until the human kills it. ~12 experiments per hour, ~100 overnight while you sleep. the thing i keep coming back to is how tight the constraints make the problem:
> one file to edit.
> one metric to optimize.
> one gpu.
> five minutes.
> no new dependencies allowed.
the search space is large but the evaluation is fast, cheap, and unambiguous. without the fixed time budget the agent would have to reason about compute-performance tradeoffs which is a much harder problem. without the single-file constraint it could create sprawling multi-file messes that are impossible to revert cleanly. the constraints are what make it work. this is honestly a general lesson in research. the tighter the evaluation protocol, the faster you make progress.
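the keep-or-revert loop described here is a hill climb. a minimal sketch (all names are my own, not the repo's API; the real version commits/reverts on a git branch instead of calling an undo handle):

```python
def autoresearch_loop(propose_edit, run_training, n_experiments):
    """Sketch of the described loop: propose a change, train for the
    fixed wall-clock budget, keep the change only if validation
    bits-per-byte improved, otherwise revert."""
    best_bpb = run_training()          # baseline run
    history = [best_bpb]
    for _ in range(n_experiments):
        snapshot = propose_edit()      # returns an undo handle
        bpb = run_training()
        if bpb < best_bpb:
            best_bpb = bpb             # keep (commit) the change
        else:
            snapshot.undo()            # revert to last good state
        history.append(best_bpb)
    return best_bpb, history
```

because every run gets the same fixed budget and the same metric, the comparison in the middle of the loop is a single scalar check, which is exactly why the tight evaluation protocol matters.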
38 replies · 100 reposts · 1.2K likes · 99.6K views
Grant Watson retweeted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This has been the bread and butter of my daily work for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc… All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
Andrej Karpathy tweet media
970 replies · 2.1K reposts · 19.5K likes · 3.6M views
Grant Watson retweeted
Andrej Karpathy @karpathy
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
Andrej Karpathy tweet media
1.1K replies · 3.7K reposts · 28.3K likes · 10.9M views
Grant Watson retweeted
Bo Wang @BoWang87
The most interesting thing about the diffusion LM debate isn't what it means for language. It's what it means for biological foundation models. The blog arguing diffusion LMs will outscale autoregressive models is making the rounds. The argument is interesting — but it's actually stronger for biological foundation models than for language, and I don't think enough people are talking about that. The core claim of the diffusion LM argument is that left-to-right generation imposes an inductive bias that creates an artificial loss floor. Autoregressive models predict token N given tokens 1 through N-1 — a constraint with no natural basis in many domains. In human language, sentences at least proceed forward in time. In biology, that constraint is almost entirely fictitious. A protein's function comes from its 3D structure, which emerges from all residues simultaneously. There's no biological reason residue 50 should be predicted from residues 1 through 49 in order. Gene expression works the same way — a cell's transcriptional state is a high-dimensional point cloud, not a sentence. DNA regulatory elements interact across hundreds of kilobases in both directions. The field already knew this intuitively. ESM-2 is bidirectional. RFdiffusion and Chroma generate protein backbones through diffusion, not sequential decoding. ProteinMPNN designs sequences by considering all positions jointly. So the real question is whether discrete diffusion models will do for single-cell and genomic foundation models what continuous diffusion already did for protein structure — replace the autoregressive default with something that actually matches the geometry of the problem. I think yes.
Bo Wang tweet media
23 replies · 85 reposts · 573 likes · 49.7K views
Grant Watson retweeted
Martin Bauer @martinmbauer
Yes, this is a significant result and a solid research paper. And it would've been much harder to achieve without GPT. While I understand the instinct, I think it is more interesting to evaluate what type of contribution the AI has made as opposed to focusing on how relevant the result is. ChatGPT generalised a previously derived result for an amplitude that was assumed to vanish for all physical kinematics people care about. These amplitudes are very complicated, lengthy expressions with certain structures and symmetries that are sometimes hidden and difficult to see. This kind of problem is exactly where AI shines! AI is better at detecting breast cancer than a clinician because it has seen millions of scans and detects structures where humans, who are limited by their lifetime exposure, can't. AI that systematically surveys many large amplitudes has a similar advantage. Similar to the Erdős problems, it was mostly really an attention bottleneck that left this problem unsolved. This calculation was considered another elaborate way of arriving at zero. So few people were interested in the result and even fewer were working on it. Most if not all physicists would therefore consider the insight of the human physicists, that there is in fact a kinematic region where these amplitudes are not zero, the most meaningful piece of progress here. Neither of these points is meant to diminish the result. It is seriously impressive and deserves the publicity. It shows where the strengths of modern models can significantly accelerate science and I'm convinced there will be even more relevant discoveries in the future.
Noam Brown @polynoamial

There have been fair questions on whether LLM contributions to STEM are overhyped, but I've spoken with physicists about this result and they've told me it is a truly significant research contribution, roughly at the level of a solid journal paper, and GPT-5.2 played a key role.

13 replies · 41 reposts · 443 likes · 38.1K views
Grant Watson retweeted
Rohan Paul @rohanpaul_ai
Terence Tao: AI isn't hype anymore in math discovery. Terence Tao, one of the greatest living mathematicians, explains in his new lecture how AI and professional human mathematicians are now complementary. "There has been a really visible increase in capability. It is not pure hype by any means. To me, these advances show there is a complementary way to do mathematics. Humans traditionally work in small groups on hard problems for months, and we will keep doing that. But we can also now set AI to scale: sweep a thousand problems and pick up all the low-hanging fruit. Figure out all the ways to match problems to methods. If there are 20 different techniques, apply them all to 1,000 problems and see which ones can be solved by these methods. This is the capability that is present today." From the 'Institute for Pure & Applied Mathematics (IPAM)' YT channel.
47 replies · 378 reposts · 2.2K likes · 173K views
Grant Watson retweeted
OpenAI @OpenAI
We worked with @Ginkgo to connect GPT-5 to an autonomous lab, so it could propose experiments, run them at scale, learn from the results, and decide what to try next. That closed loop brought protein production cost down by 40%.
467 replies · 1.3K reposts · 9.7K likes · 3.1M views
Grant Watson retweeted
机器之心 JIQIZHIXIN @jiqizhixin
New paradigm from Kaiming He's team: Drifting Models! With this approach, you can generate a perfect image in a single step. The team trains a "drifting field" that smoothly moves samples toward equilibrium with the real data distribution. The result? A one-step generator that sets a new SOTA on ImageNet 256x256, beating complex multi-step models.
机器之心 JIQIZHIXIN tweet media
15 replies · 165 reposts · 1.3K likes · 315.4K views
Grant Watson retweeted
Sakana AI @SakanaAILabs
Introducing Digital Red Queen (DRQ): Adversarial Program Evolution in Core War with LLMs Blog: sakana.ai/drq Core War is a programming game where self-replicating assembly programs, called warriors, compete for control of a virtual machine. In this dynamic environment, where there is no distinction between code and data, warriors must crash opponents while defending themselves to survive. In this work, we explore how LLMs can drive open-ended adversarial evolution of these programs within Core War. Our approach is inspired by the Red Queen Hypothesis from evolutionary biology: the principle that species must continually adapt and evolve simply to survive against ever-changing competitors. We found that running our DRQ algorithm for longer durations produces warriors that become more generally robust. Most notably, we observed an emergent pressure towards convergent evolution. Independent runs, starting from completely different initial conditions, evolved toward similar general-purpose behaviors—mirroring how distinct species in nature often evolve similar traits to solve the same problems. Simulating these adversarial dynamics in an isolated sandbox offers a glimpse into the future, where deployed LLM systems might eventually compete against one another for computational or physical resources in the real world. This project is a collaboration between MIT and Sakana AI led by @akarshkumar0101 Full Paper (Website): pub.sakana.ai/drq/ Full Paper (arxiv): arxiv.org/abs/2601.03335 Code: github.com/SakanaAI/drq/
21 replies · 98 reposts · 577 likes · 142.1K views
Grant Watson retweeted
Sakana AI @SakanaAILabs
Our AI agent has achieved 1st place in a competitive optimization programming contest against over 800 human participants. Blog: sakana.ai/ahc058 In AtCoder Heuristic Contest 058, Sakana AI’s ALE-Agent took the top spot. For context on the difficulty of these challenges, an OpenAI agent secured 2nd place in the AHC world tournament last year. The task was a 4-hour production planning optimization challenge. While the problem setters anticipated a standard approach combining constructive heuristics and simulated annealing, our agent independently discovered a more effective strategy. It implemented a "virtual power" heuristic and a diverse neighborhood search that allowed it to escape local optima better than human experts. This was achieved through inference time scaling using multiple frontier AI models. The agent ran parallel code generation, analyzed the results, and iteratively refined its algorithms in real time. The total cost was approximately $1,300. This result suggests AI agents can now match top human experts in tasks requiring extended reasoning and original scientific discovery. Please read our blog for more details. We extend our deepest thanks to the host, @algo_artis, and @atcoder. We will continue to research AI as a partner that expands human exploration to discover solutions to complex real-world problems.
Sakana AI tweet media
15 replies · 62 reposts · 324 likes · 132.3K views
Grant Watson retweeted
sway @SwayStar123
Speedrunning ImageNet Diffusion - 360x faster training. There have been many new techniques demonstrating convergence speedups compared to DiT in the past few years; however, all of these have been studied in isolation, against increasingly outdated baselines. I present SR-DiT (SpeedrunDiT), which combines some of the best techniques into one new modern baseline.
sway tweet media
22 replies · 61 reposts · 489 likes · 81.1K views
François Fleuret @francoisfleuret
Hear me out: A question is its answer with noise, a reasoning model is a denoising autoencoder, the reasoning is the embedding Z of the question so that a dumb causal decoder can generate the answer.
27 replies · 8 reposts · 181 likes · 25K views
Grant Watson retweeted
hardmaru @hardmaru
Excited to announce our MIT Press book “Neuroevolution: Harnessing Creativity in AI Agent Design” by Sebastian Risi (@risi1979), Yujin Tang (@yujin_tang), Risto Miikkulainen, and myself. We explore decades of work on evolving intelligent agents and show how neuroevolution can drive creativity in deep learning, RL, LLMs and AI Agents! 📖 Free open-access edition: neuroevolutionbook.com In addition to our own works, this video features work by Jürgen Schmidhuber (@SchmidhuberAI), Seth Bling (@SethBling), Igor Karpov, Jacob Schrum, Yulu Gan (@yule_gan), Ken Stanley (@kenneth0stanley), Joel Lehman (@joelbot3000), Jeff Clune (@jeffclune), Nick Cheney (@CheneyLab), Richard Song (@XingyouSong), Chelsea Finn (@chelseabfinn), Julian Togelius (@togelius), Sam Earle (@Smearle_RH), Hod Lipson (@hodlipson), and Jean-Baptiste Mouret (@jb_mouret).
16 replies · 219 reposts · 1K likes · 161.9K views
Grant Watson retweeted
Oriol Vinyals @OriolVinyalsML
The secret behind Gemini 3? Simple: Improving pre-training & post-training 🤯 Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS '25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we've ever seen. No walls in sight! Post-training: Still a total greenfield. There's lots of room for algorithmic progress and improvement, and 3.0 hasn't been an exception, thanks to our stellar team. Congratulations to the whole team 💙💙💙
Oriol Vinyals tweet media
120 replies · 544 reposts · 4.4K likes · 2M views
Grant Watson retweeted
Sakana AI @SakanaAILabs
Introducing Petri Dish Neural Cellular Automata (PD-NCA) 🦠 The search for open-ended complexification, a north star of Artificial Life (ALife) simulations, is a question that fascinates us deeply. In this work we explore the role of continual adaptation in ALife simulation, where the cellular automata in our system do not rely on a fixed set of parameters, but rather learn continuously during the simulation itself. Our Petri Dish Neural Cellular Automata (PD-NCA) is a new ALife substrate that consists of a differentiable world where multiple NCA learn to self-replicate and grow via ongoing gradient descent. Every individual is constantly trying to grow, all the while learning to adapt and out-compete its neighbors. PD-NCA allows for complex, emergent behaviors like cyclic dynamics, territorial defense, and spontaneous cooperation. The video below shows the sheer variety and complexity that unfolds during several different simulations (each colour is a different NCA).
14 replies · 67 reposts · 328 likes · 105K views
Grant Watson retweeted
Saining Xie @sainingxie
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
Saining Xie tweet media
57 replies · 322 reposts · 1.9K likes · 413.7K views