Alfredo De la Fuente

2.8K posts

@alfo_512

swe @Google | prev @Meta

New York, USA · Joined August 2016
1.3K Following · 401 Followers
Alfredo De la Fuente retweeted
Max Zhdanov (@maxxxzdn)
🌍Today we release Mosaic, a probabilistic weather model that shifts the Pareto frontier of ML weather forecasting. It matches the skill of state-of-the-art models while generating a 24-member, 10-day global forecast in under 12 s on a single H100. Thread!
25 replies · 128 retweets · 1.1K likes · 103.6K views
Alfredo De la Fuente retweeted
OpenAI (@OpenAI)
Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.
940 replies · 3.6K retweets · 25.3K likes · 12M views
Alfredo De la Fuente retweeted
Sungjin Ahn (@SungjinAhn_)
🧠We introduce "Generative Recursive Reasoning"!
Recursive Reasoning Models like HRM, TRM, and Looped Transformers are deterministic: same input, same reasoning, every time. They collapse the entire space of plausible reasoning paths into a single attractor.
Our model GRAM (Generative Recursive reAsoning Models) turns recursion itself into a stochastic latent trajectory. Multiple hypotheses, alternative solution strategies, and inference-time scaling not just by depth, but by width: parallel trajectory sampling.
And here's the kicker: the same formulation that gives us conditional reasoning p(y|x) also makes GRAM a general generative model p(x).
With only 10M params:
• Sudoku-Extreme: 97.0% (TRM 87.4%)
• ARC-AGI-1: 52.0%
• ARC-AGI-2: 11.1%
• N-Queens coverage: 90%+
📄 Paper: arxiv.org/abs/2605.19376
🌐 Project page: ahn-ml.github.io/gram-website
w/ Junyeob Baek @JunyeobB (KAIST), Mingyu Jo @pyross0000 (KAIST), Minsu Kim @minsuuukim (KAIST & Mila), Mengye Ren @mengyer (NYU), Yoshua Bengio @Yoshua_Bengio (Mila), Sungjin Ahn @SungjinAhn_ (KAIST)
25 replies · 194 retweets · 1.4K likes · 151.6K views
Alfredo De la Fuente retweeted
Eric Jang (@ericjang11)
For the last few months I've been working on a from-scratch implementation of AlphaGo, a 2016 AI breakthrough that inspired me to get into deep learning. My casual understanding of AlphaGo was "search-augmented deep neural networks trained with self-play", but I wanted to go deeper and understand it by creating it.
Frontier deep learning research has always been expensive, but any given capability gets cheaper very quickly. In 2026, you no longer need DeepMind's resources to train a strong Go AI - you can vibe code all of it yourself for just a few thousand dollars of rented compute.
It was a huge honor to be invited to teach this with @dwarkesh_sp on @dwarkeshpodcast. I am an AlphaGo & Go apprentice, not a master, so all factual errors in the podcast are mine.
Web version of tutorial: evjang.com/2026/04/28/aut…
Code: github.com/ericjang/autogo
Play the go bot here: autogo.evjang.com
Quoted post: Dwarkesh Patel (@dwarkesh_sp)

New blackboard lecture w @ericjang11. He walks through how to build AlphaGo from scratch, but with modern AI tools.
Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.
Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.
Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.
Timestamps:
0:00:00 – Basics of Go
0:08:06 – Monte Carlo Tree Search
0:31:53 – What the neural network does
1:00:22 – Self-play
1:25:27 – Alternative RL approaches
1:45:36 – Why doesn’t MCTS work for LLMs
2:00:58 – Off-policy training
2:11:51 – RL is even more information inefficient than you thought
2:22:05 – Automated AI researchers

46 replies · 178 retweets · 2.4K likes · 496.2K views
Alfredo De la Fuente retweeted
Nikita Morozov (@nvimorozov)
(1/n) Excited to share our latest work “Learning Shortest Paths with Generative Flow Networks”! We uncover a novel theoretical connection between flow minimization in GFlowNets and finding shortest paths, and develop a learning approach that rivals SOTA in solving Rubik's Cubes!
3 replies · 54 retweets · 344 likes · 18K views
Alfredo De la Fuente retweeted
Nous Research (@NousResearch)
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
150 replies · 420 retweets · 3.7K likes · 440.9K views
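The input side of Token Superposition Training, as described in the Nous Research post above (reading contiguous bags of tokens by averaging their embeddings), can be sketched roughly as follows. This is only an illustration under assumptions: the names `bag_tokens` and `average_embedding`, the non-overlapping bags, the bag size, and the toy embedding table are all hypothetical, and the post's "modified cross-entropy" on the output side is not specified there, so it is left out.

```python
def bag_tokens(token_ids, bag_size):
    """Split a token sequence into contiguous, non-overlapping bags.

    Assumption: bags do not overlap and the last bag may be shorter.
    """
    return [token_ids[i:i + bag_size] for i in range(0, len(token_ids), bag_size)]


def average_embedding(bag, embedding_table):
    """Average the embedding vectors of all tokens in one bag."""
    dim = len(embedding_table[0])
    return [sum(embedding_table[t][j] for t in bag) / len(bag) for j in range(dim)]


# Toy embedding table (hypothetical): 4 token ids, embedding dimension 2.
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
tokens = [1, 2, 3, 0, 1, 1]

bags = bag_tokens(tokens, bag_size=2)
inputs = [average_embedding(bag, embedding_table) for bag in bags]
print(bags)
print(inputs)
```

With a bag size of 2 the model would see half as many input positions for the same text, which is one plausible reading of where the wall-clock savings in the early phase come from.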
Alfredo De la Fuente retweeted
Kirill Neklyudov (@k_neklyudov)
Population dynamics (e.g. murmuration of birds 🐦🐦🐦) is notoriously hard to learn; choosing the right model for the dynamics is even harder. In our #ICML2026 spotlight, we introduce Wasserstein Lagrangian Mechanics (WLM) for learning population dynamics from observations, which
- Covers both first-order (gradient descent) and second-order dynamics (e.g. oscillations)
- Allows learning more expressive dynamics (including complex interactions) with fewer assumptions
- Generalizes in space (across different initial conditions) and time (beyond the training time snapshots)
[1/n] 🧵
5 replies · 45 retweets · 276 likes · 23.4K views
Alfredo De la Fuente retweeted
Andreas Bergmeister (@AndBergmeister)
Pretraining diffusion/flow-matching is simple and scales: noise a clean sample, regress against a closed-form target. RL post-training can be made just as simple. Reinforce Adjoint Matching (RAM) matches Flow-GRPO on SD3.5M in 50× fewer steps. Highest reward, no reward hacking.
7 replies · 52 retweets · 411 likes · 48.7K views
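The pretraining recipe mentioned in the post above ("noise a clean sample, regress against a closed-form target") can be illustrated with a small sketch. It assumes one common flow-matching convention, a linear path from data at t = 0 to Gaussian noise at t = 1 with velocity target `noise - clean`; the post does not state which convention is used, and the function names here are hypothetical. It says nothing about the RL post-training method (RAM) itself.

```python
import random


def flow_matching_pair(clean, t, rng):
    """Build one training pair under an assumed linear interpolation path.

    x_t = (1 - t) * clean + t * noise, so d x_t / d t = noise - clean,
    which is the closed-form regression target.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]
    target = [n - c for c, n in zip(clean, noise)]
    return x_t, target


def squared_error(prediction, target):
    """Mean squared error between a predicted velocity and the target."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
x_t, target = flow_matching_pair(clean, t=0.3, rng=rng)

# Stand-in "model" that always predicts zero velocity; a real model
# would take (x_t, t) as input and be trained to reduce this loss.
loss = squared_error([0.0, 0.0, 0.0], target)
print(x_t, target, loss)
```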
Alfredo De la Fuente retweeted
Linlu Qiu (@linluqiu)
Language is discrete. Language models don’t have to be. 🧚Introducing ELF🧚‍♀️: Embedded Language Flows—a class of diffusion models in continuous embedding space based on continuous-time Flow Matching 🧵
15 replies · 130 retweets · 804 likes · 133.9K views
Alfredo De la Fuente retweeted
Julie Kallini ✨ (@JulieKallini)
Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/
14 replies · 111 retweets · 738 likes · 95.6K views
Alfredo De la Fuente retweeted
yingzhen (@liyzhen2)
Our own answer: structured coupling arxiv.org/abs/2605.07676
- flow matching with VAE-based coupling
- VAE encoder & flow sharing networks
- VAE decoder init. + flow refinement for sampling
flow matching 🤝 VAEs -> good representation & sample quality🚀
Quoted post: yingzhen (@liyzhen2)

Tons of papers re diffusion/flow matching at ML confs these days, but to my surprise very few of them consider learning the prior🤔 Am I missing any important work here? 🙏 for suggestions

3 replies · 44 retweets · 239 likes · 22K views
Alfredo De la Fuente retweeted
Probability and Statistics
One theorem every ML engineer should know: The Johnson–Lindenstrauss Lemma.
It states that high-dimensional data can be projected into a much lower-dimensional space while approximately preserving pairwise distances.
Why it matters:
• Explains why random projections work
• Enables scalable learning in high dimensions
• Used in embeddings, compressed learning, and ANN search
• Helps fight the curse of dimensionality
The surprising part: You can reduce dimensions dramatically without destroying the geometry of the data. That’s why many ML systems can operate efficiently even with massive feature spaces.
Modern representation learning is deeply connected to this idea: Good embeddings preserve structure while compressing information.
In ML, compression is often not loss of intelligence; it’s removal of redundancy.
18 replies · 239 retweets · 1.8K likes · 130.9K views
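The Johnson–Lindenstrauss claim in the post above can be checked numerically with a small experiment. The sketch below uses one standard construction, a Gaussian random matrix scaled by 1/sqrt(k); the dimensions (500 down to 200), the number of points, the seeds, and the helper names are illustrative choices, not anything taken from the post.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian
    random matrix whose entries are N(0, 1) scaled by 1 / sqrt(k)."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in matrix] for p in points]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distortion(points, projected):
    """Largest relative change in any pairwise distance after projection."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = distance(projected[i], projected[j]) / distance(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst


data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(20)]
projected = random_projection(points, k=200)
print(f"max relative distortion: {max_distortion(points, projected):.3f}")
```

Lowering `k` makes the distortion grow, roughly in proportion to 1/sqrt(k), which is the trade-off the lemma quantifies.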
Alfredo De la Fuente retweeted
hardmaru (@hardmaru)
The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it. One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math. We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens. Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements. Paper: arxiv.org/abs/2603.23198 Blog: pub.sakana.ai/sparser-faster… Code: github.com/SakanaAI/spars… ⚡️
Quoted post: Sakana AI (@SakanaAILabs)

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️ Excited to share our new #ICML2026 paper in collaboration with @NVIDIA: "Sparser, Faster, Lighter Transformer Language Models". This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models: Paper: arxiv.org/abs/2603.23198 Blog: pub.sakana.ai/sparser-faster… Code: github.com/SakanaAI/spars… While LLMs are undoubtedly powerful, they are increasingly expensive to train and deploy, with a large part of this cost coming from their feedforward layers. Yet, an interesting phenomenon occurs inside these layers: For any given token, only a small fraction of the hidden activations actually matter. The rest approximate zero, wasting computation. With ReLU and very mild L1 regularization, this sparsity can exceed 95% with little to no impact on downstream performance. So, can we leverage this sparsity to make LLMs faster? The challenge is hardware. Modern GPUs are optimized for dense matrix multiplications. Traditional sparse formats introduce irregular memory access and overheads that cancel out their theoretical savings for GEMM operations. Our contribution is twofold: 1/ We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly in the same optimized tiled matmul kernels without disrupting execution. 2/ We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes. We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy. This work will be presented at #ICML2026. Please check out our blog and technical paper for a deep dive!

51 replies · 502 retweets · 3.5K likes · 424.4K views
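For readers unfamiliar with the ELLPACK family that TwELL (Tile-wise ELLPACK) is named after in the posts above, here is a minimal sketch of the classic, non-tiled ELLPACK layout: every row stores its nonzero values and their column indices, padded to a fixed width. This is not the TwELL format or the paper's hybrid representation, whose details are not given in the posts; the function names and the `-1` padding marker are assumptions made for illustration.

```python
def to_ellpack(dense, pad_col=-1):
    """Pack a dense matrix into plain ELLPACK arrays.

    Each row keeps its nonzero values and column indices, padded to the
    width of the densest row so every row has the same length.
    """
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, cols = [], []
    for row in dense:
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        padding = width - len(nonzeros)
        cols.append([j for j, _ in nonzeros] + [pad_col] * padding)
        values.append([v for _, v in nonzeros] + [0.0] * padding)
    return values, cols


def ell_matvec(values, cols, x):
    """Multiply an ELLPACK matrix by a vector, skipping padded slots."""
    return [
        sum(v * x[j] for v, j in zip(row_values, row_cols) if j >= 0)
        for row_values, row_cols in zip(values, cols)
    ]


dense = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
]
values, cols = to_ellpack(dense)
print(values, cols)
print(ell_matvec(values, cols, [1.0, 2.0, 3.0, 4.0]))
```

The fixed row width is what makes the layout regular enough for GPUs, at the cost of padding when a few rows are much denser than the rest; the posts describe a dense backup path for exactly those heavy cases.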
Alfredo De la Fuente retweeted
Anthropic (@AnthropicAI)
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
591 replies · 1.7K retweets · 16.5K likes · 2.4M views
Alfredo De la Fuente retweeted
Sander Dieleman (@sedielem)
My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks. It's the longest one yet👀 Let me know what you think! sander.ai/2026/05/06/flo…
7 replies · 168 retweets · 723 likes · 80.9K views
Alfredo De la Fuente retweeted
Lee Sharkey (@leedsharkey)
My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)
34 replies · 193 retweets · 1.5K likes · 238.2K views
Alfredo De la Fuente retweeted
Tomasz Limisiewicz (@TomLimi)
We present Compute Optimal Tokenization! 🔡 It is common for LLM scaling work to stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
21 replies · 95 retweets · 621 likes · 100.4K views
Alfredo De la Fuente retweeted
λux (@novasarc01)
the era of experience is a great read if you want to understand david silver’s current research direction and the kind of future he is envisioning with ineffable intelligence.
Quoted post: Ineffable Intelligence (@IneffableLabs)

Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us. ineffable.ai

10 replies · 30 retweets · 437 likes · 42.1K views