Sean McLeish

125 posts

@SeanMcleish

PhD student at the University of Maryland

Joined November 2023
132 Following · 604 Followers
Pinned Tweet
Sean McLeish@SeanMcleish·
Looped latent reasoning models like TRM, HRM, Ouro and Huginn are great for reasoning, but they’re inefficient to train at larger scales. We fix this by post-training regular language models into looped models, achieving higher accuracy on a per-training-FLOP basis. 📜1/7
Sean McLeish retweeted
Tom Goldstein@tomgoldsteincs·
⛷️Here’s my entry for the fast generative model olympics🥇 The Sphere Encoder is an autocoder so powerful that it produces high quality images quickly and without diffusion. At training time, we learn an encoder that maps natural images uniformly onto the surface of a sphere. At inference time, we sample a random vector from the sphere, and a decoder makes it into an image.
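The sampling step Tom describes, drawing a random vector uniformly from the sphere's surface, has a standard recipe: normalize isotropic Gaussian draws. A minimal sketch (the latent dimension of 512 is an illustrative assumption, not from the tweet, and the decoder itself is not shown):

```python
import numpy as np

def sample_sphere(n, dim, seed=None):
    """Draw n points uniformly from the surface of the unit sphere in R^dim.
    Normalizing isotropic Gaussian draws gives exactly this distribution."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# e.g. four latent codes a trained decoder could map to images
codes = sample_sphere(4, 512, seed=0)
```

Because the Gaussian is rotationally symmetric, the normalized points have no preferred direction, which is what makes this the go-to sphere sampler.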
Sean McLeish retweeted
Dayal Kalra@dayal_kalra·
Excited to share work from my internship at MSL @AIatMeta! 🚀 We introduce Critical Sharpness: a scalable curvature measure requiring only ~6 forward passes, letting us analyze LLM training dynamics at scale. We extend this measure to introduce Relative Critical Sharpness, which measures the relative curvature between two landscapes. We use this to answer a major practical question: how much pre-training data should we mix in during fine-tuning to avoid catastrophic forgetting? 🧵 (1/n)
Sean McLeish retweeted
Simon Barnett@SimonDBarnett·
♻️ Recursive language models (RLMs) are incredibly cool and now is the time to be paying attention to them. Reasoning models are clearly the frontier. They've matured at breakneck speed. We've gone from simple chains-of-thought to sophisticated test-time scaling paradigms in a few years. Great! But how can we make reasoning more efficient at scale?

‼️ TL;DR
Do surgery on an existing transformer. Install an internal recursion mechanism to create an RLM. The model's immune system will respond. That's okay. Conduct a 'healing phase' to reawaken the RLM to its new, hybrid reality. You don't need to do RL or SFT with valid reasoning traces to amplify higher-level thinking. Now, instead of scaling tokens/context at test time to boost reasoning, the thinking happens within the model's latent space as it iteratively polishes its hidden state, saving inference compute. Current reasoning models pay for every thought twice: once to generate the token, once to store it. RLMs think in place, emitting tokens only when ready.

What's the new efficiency ceiling? Couldn't tell ya. Is this a robust procedure yet? No. Do we know how the method scales or if we can reapply any other modern reasoning mechanisms? Also no. Is it obvious how far RLMs will take us or if they'll be the prevailing paradigm? Definitely not. But if we knew all these answers, it wouldn't be as interesting to read about. More detailed mini-essay below. 👇

⏩ Skip this section if you don't want the background.
I found out about RLMs at the inaugural workshop on efficient reasoning at @NeurIPSConf, where they were a fixture. It struck me how tech progress is rarely linear. It ebbs and flows with funding cycles, grinds to a halt if technology barriers pile up, and can explode with one eureka moment. But other times cool ideas fade into the scientific backwater if we get fixated on something that works. That's sort of what happened with the 2017 transformer unlock.
We got these smooth-looking scaling laws with a simple recipe of parameters, tokens, and FLOPs. As pretraining waned, the field moved into post-/test-time training to keep the party going, and to much success. So, why not keep spamming this formula? I certainly would, especially if I'm a multi-billion-dollar frontier lab that can't afford to fall behind, break a narrative, etc. Engineering inertia is very real. That's why RLMs were so cool to hear about. It was like rediscovering an idea. Recursion isn't new, by any means. But every once in a while, ideas orbit back around. The immense gravity of parallel processing via transformers seems to have pulled recurrent scaling back into the limelight.

▶️ RLM stuff starts here.
The transformer essentially killed recurrent neural networks (RNNs) for language-modeling tasks. RNNs excel in some areas like real-time processing of sequential data, but they're notoriously hard to train, as gradient updates can vanish/explode inside them. They also don't take advantage of GPU parallelism to the same extent. But was there a way to have one's cake and eat it too with some sort of hybrid model? Yes! Universal Transformers (UTs, 2018) were kind of like patient zero for this idea. You exploit the parallel attention of a transformer, but impose a recurrent inductive bias that exists in depth rather than sequence position. Basically, you're refining a hidden-state representation at each token position until the model decides it's done thinking. But when's that? Ponder time (a.k.a. adaptive computation time, ACT) came out a few years earlier. Here, you embed a lightweight halting classifier at each position that determines its doneness. Similar ideas were floating around at the time, like neural GPUs, neural Turing machines, etc., but I like UTs because they combined global attention with recurrent depth and dynamic halting, and also showed they could smoke vanilla transformers on contemporary benchmarks.
arxiv.org/abs/1807.03819

The main RLM work I want to talk about combines several modern ideas (e.g., test-time scaling, recurrence, latent reasoning). The lead author @jonasgeiping also gave the talk! Conceptually, the idea is that humans don't vocalize our intermediate thoughts while reasoning, which is what current reasoning LLMs do: they construct their reasoning trajectories via token-scaling. Using a prior on metastable brain waves, we talked about how neuronal activity bounces between these 'thinking modes' defined by MRI activity. So how can we bio-mimic this? Well, we can reason in the model's latent space rather than forcing thinking through the pinhole of token verbalization.

Neat, but how? They made Huginn-0125, a 3.5B-parameter model trained on 0.8T tokens. It's got three main parts: an encoder section (prelude), the inner recurrent block (R), and a decoder (coda), shown below. Importantly, there's a residual stream that concatenates the tokenized (but unaltered) input through each iteration to ensure training/inference stability. Huginn wasn't trained with an explicit number of iterations (k), but rather via random (Poisson-distributed) k un-rollings, making sure the model stays on its toes and doesn't expect to exit at a specific time. At test time, the model dynamically iterates the hidden state through the R block, polishing each position until it's ready to go.

The paper fully admits the model was trained sub-optimally for budget reasons, so I read this as a proof of principle, which makes the results even more exciting because there's a lot of work that can be stacked quickly. Without any RL or SFT on reasoning traces, Huginn is pretty competitive with models 2-3x its size on reasoning tasks (e.g., GSM8k). In effect, this is another way to scale at test time: you simply increase the iteration count in latent space instead of blowing up your token count as is currently being done. What really blew my mind was the call-back to brainwave regimes.
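The prelude → k×R → coda loop with input re-injection and randomized unrolling can be sketched in a few lines. This is a toy of my own, assuming simple linear maps and a tanh nonlinearity where the real model uses transformer blocks; the hidden size and unrolling rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative hidden size

# Toy weights standing in for transformer blocks.
W_pre  = rng.standard_normal((d, d)) / np.sqrt(d)          # prelude (encode once)
W_mix  = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)  # inject input each loop
W_R    = rng.standard_normal((d, d)) / np.sqrt(d)          # shared recurrent block
W_coda = rng.standard_normal((d, d)) / np.sqrt(d)          # coda (decode once)

def looped_forward(x, k):
    """Prelude -> k iterations of the shared block R -> coda.
    The input embedding e is concatenated back in at every iteration,
    mirroring the stabilizing residual injection described above."""
    e = x @ W_pre
    s = np.zeros_like(e)  # latent state, refined in place
    for _ in range(k):
        s = np.tanh(np.concatenate([e, s], axis=-1) @ W_mix @ W_R)
    return s @ W_coda

x = rng.standard_normal((2, d))
k = max(1, rng.poisson(8))  # train-time: random unrolling depth, no fixed k
y = looped_forward(x, k)
```

The point of the random k is visible in the signature: nothing in the forward pass depends on a particular depth, so inference can dial k up or down freely.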
They ran a PCA analysis of the model's latent-space trajectories, finding that it sometimes converges on these orbital-like shapes during certain tasks. That's crazy to me because we saw the same thing with the MRI scans. I don't want to get heavy-handed with the bio-analogies, but this is 'pinch me' stuff. I think this is important because latent reasoning loses interpretability by default, so some semblance of a way to monitor these trajectories could be useful. arxiv.org/pdf/2502.05171

Many of the same authors (+ @SeanMcleish) did some important follow-up work that cements a few guardrails around RLMs, and also an important concept: you can adapt pre-trained transformers to do recurrent, latent reasoning. You don't need to start from scratch. This opens up doors for accessibility quite a lot. First, they find that you can take a many-layer, non-recurrent pre-trained transformer and cut it into the aforementioned prelude/recurrent/coda blocks. While this was originally traumatic for Llama, they noticed that additional training (healing) can adapt the network to the new inductive bias. Also, they find that initializing from the pre-trained weights is vastly more FLOPs-efficient. Compared to Huginn-0125, this retrofitted recurrence method was +12 points on MMLU and +7 points on GSM8k, despite Huginn having 4-5x more parameters and being trained on 0.8T tokens of data. Economically, this sounds like model distillation to me, at least in the sense that you take a pre-trained teacher and build a smaller (in this case, depth-scalable) student without paying the full cost again. openreview.net/pdf?id=Oq3Xblt…

I'll close with a bit of efficiency talk. We've seen how these RLMs can be created, even forked from existing open-source models, but I want to talk about economics. Another work points to the fact that Huginn-0125 was much slower (by a factor of k) than the non-recurrent versions. We need some early threads about how to recover that speed.
From the original Huginn-0125 paper, we know that some token positions mature quicker than others; simpler positions can exit earlier. So, this follow-up work addresses that by noting conceptual/mathematical similarities with diffusion. I could be wrong, but my understanding is that you can consider an RLM to be a continuous, latent diffusion model. Obviously the randomized unrolling objective is different from static denoising, but the analogy holds.

So, they flip the recurrent process from batch processing to an assembly line. Because some token positions finish early, waiting for the whole batch to be done is inefficient. They use a different sampler that fills the token-position x iteration-depth grid on a diagonal wavefront. At each step, you advance active positions by a step, decode draft tokens at the frontier, and freeze stable tokens. It's a bit like speculative decoding. You don't necessarily save on FLOPs, but you are exploiting GPU parallelism in a way the initial, sequential setup left on the table. This netted a 5x speed-up with minimal accuracy loss, though obviously there's still much room to run. openreview.net/pdf?id=nA5IRfA…

Alright, here's the rub. On paper, nothing about RLMs seems economically appealing as of today. Recurrent depth imposes a speed penalty that has only been partially offset with wavefront diffusion sampling. This is substantially behind transformers and not anywhere close in cases where time-to-first-token matters. There are new failure modes (e.g., overthinking can lead to inverse scaling), new hyper-parameters, less mature tooling, etc. Current performance doesn't appear to be an OOM better than token-scaling. So, why be excited about this? Well, consider a future where frontier labs have multiple different models/architectures under the hood that they route to different requests depending on what the situation calls for (with gross margin being ubiquitous in the denominator).
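The per-position early-exit idea (freeze a token's latent once it stops changing) can be approximated with a simple convergence threshold. This is a deliberate simplification of my own, not the paper's diagonal wavefront sampler; the tolerance and toy dynamics are made up:

```python
import numpy as np

def iterate_until_stable(states, step_fn, tol=1e-3, max_iters=64):
    """Advance every still-active position with step_fn each iteration,
    freezing a position once its update norm drops below tol.
    Returns final states and the iteration count each position used."""
    states = states.copy()
    active = np.ones(len(states), dtype=bool)
    exits = np.full(len(states), max_iters)
    for t in range(max_iters):
        if not active.any():
            break
        new = step_fn(states[active])
        delta = np.linalg.norm(new - states[active], axis=-1)
        states[active] = new
        settled = np.flatnonzero(active)[delta < tol]  # positions that just converged
        exits[settled] = t + 1
        active[settled] = False
    return states, exits

# Toy dynamics: each step halves the latent, so every position converges,
# and the position with the larger starting norm needs more iterations.
states, exits = iterate_until_stable(
    np.array([[1.0, 0.0], [8.0, 0.0]]), lambda s: 0.5 * s)
```

The per-position exit counts are exactly the non-uniform "maturity" the sampler exploits: once a position is frozen, its slot in the batch can be handed to fresh work instead of idling.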
The fact that RLMs have fixed hidden states regardless of iteration count is important, especially juxtaposed against KV caches that grow linearly with chain-of-thought methods. Theoretically, RLMs are Turing complete and can loop ad infinitum, maybe allowing them to address extremely difficult tasks that feed-forward networks can't, though there's no proof of this yet. I'm not sure how many latent iterations will be needed to approximate equivalent token-scale reasoning, but if it's relatively small, I can see a place for RLMs to co-exist.

There will need to be new infrastructure. Maximizing inference efficiency seems like the biggest piece. We'd need routers/schedulers that assume fixed KV and dynamic iteration. Perhaps some purpose-built kernels for diagonal wavefront sampling, exit tracking, etc. could be useful optimizations? I'm really hoping this stuff goes mainstream!
Sean McLeish retweeted
Jonas Geiping@jonasgeiping·
We just published a new open-source model that we trained with RL to be capable of open-ended forecasting!

Open-endedness is a really interesting capability for a forecasting system. Instead of a standard prediction-market setting, mainly focused on binary forecasts (where you have to know the solution space to even ask the question), our models come up with answers to open-ended questions, and quantify the likelihood of each answer. After training, this not only elevates open-source models into the realm of frontier models on forecasting evals, but it also improves the model's understanding of uncertainty in general! Many more details are in Nikhil's linked thread, on our blog post (openforecaster.github.io) and in the paper (arxiv.org/abs/2512.25070).

There are a ton of details in our report on how to do data generation well, how to circumvent forecasting evaluation pitfalls, how we reward the model in a GRPO-style RL setup for both accuracy and calibration, and how we train with retrieval to learn to understand contextual info.

What I find really cool, directionally, about this line of work is that it moves in the direction of general-purpose oracles, which is the hardest task a model can do that is not taking actions in the real world. Future models in this line of work could act as analysts, writing whole open-ended essays justifying their assessment of future events. And I think there is a lot of work still to do in this domain, when it comes to learning information gathering and learning to optimally reason under uncertainty, and progress can be tracked by evaluating forecasting capability.
Nikhil Chandak@nikhilchandak29

✨New work: How do we train language models for open-ended forecasting?🔮 For example, consider “Which tech company will the US government buy a > 7% stake in by September 2025?”. This requires one to explore the outcome space, not just assign probabilities to choices (as in most binary questions in prediction markets). Instead of picking from “Yes/No” options, we forecast in natural language: name the outcome + state your probability of it being correct. Results: with our data and post-training recipe, our OpenForecaster-8B model outperforms much larger models on calibration (Brier Score) and is competitive on accuracy, on held-out testing over 4 months.📈✅
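The Brier score the thread uses for calibration is just the mean squared error between forecast probabilities and 0/1 outcomes. A small self-contained illustration (the forecast numbers are made up, not from the paper):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    Lower is better; 0 is a perfectly confident, perfectly correct forecaster."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Confident-and-right scores near 0; confident-and-wrong would score near 1.
score = brier_score([0.9, 0.2, 0.7], [1, 0, 1])  # (0.01 + 0.04 + 0.09) / 3
```

Because it penalizes both overconfidence and underconfidence, it is a natural calibration target alongside raw accuracy.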

Sean McLeish retweeted
Micah Goldblum@micahgoldblum·
For a long time, Yann LeCun and others believed in gradient-based planning, but it didn’t work very well … until now. Here’s how we did it using incredibly simple techniques. But first, an introduction to gradient-based planning: 🧵1/11
Sean McLeish retweeted
Kevin David Hayes@kevindavidhayes·
Thrilled that our NeurIPS 2025 Spotlight is now live 🎉 We introduce FineGRAIN, a text-to-image evaluation benchmark that stress-tests 25+ fine-grained failure modes in diffusion models. Explore the dataset, docs, and leaderboards at finegrainbench.ai #NeurIPS2025 #GenerativeAI
Sean McLeish retweeted
Brian Bartoldson@bartoldson·
🧊 Off-policy RL for LLMs is hard. Dr. GRPO collapses at 10 steps off-policy. TBA doesn't. @Kimi_Moonshot K2's approach is robust too – both independently landed on the same key ingredients 🤝 We ablate RL recipe ingredients + show the 2 small changes giving off-policy robustness. 🧵below + NeurIPS poster Friday @ 11 AM.
Sean McLeish retweeted
Ksenia_TuringPost@TheTuringPost·
Must-read AI research of the week:
▪️ LeJEPA
▪️ The Path Not Taken: RLVR Provably Learns Off the Principals
▪️ RLVE: Scaling Up Reinforcement Learning for LMs with Adaptive Verifiable Environments
▪️ Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
▪️ Teaching Pretrained LMs to Think Deeper with Retrofitted Recurrence
▪️ TiDAR: Think in Diffusion, Talk in Autoregression
▪️ Black-Box On-Policy Distillation of LLMs
▪️ Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Find the full list in our weekly newsletter: turingpost.com/p/fod127
Sean McLeish retweeted
DailyPapers@HuggingPapers·
Make your Language Models *think* deeper with Retrofitted Recurrence New research shows how to convert existing pretrained LMs into depth-recurrent models. This decouples training & test-time compute, improving performance on tasks like mathematics while reducing cost.
Sean McLeish retweeted
fly51fly@fly51fly·
[CL] Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence S McLeish, A Li, J Kirchenbauer, D S Kalra... [University of Maryland & New York University] (2025) arxiv.org/abs/2511.07384
Sean McLeish@SeanMcleish·
@waxhn Nope, no layer removal for the baselines, exactly as you download from Hugging Face. We just untie the embeddings on Llama-3.2 so it is a fair comparison
Max Y@waxhn·
Ah no worries, I was very lowkey about it since I thought it was a dead end. More curious about the differences for the science. Yeah the layer removal is quite interesting. Just to double check, for your flop comparison, baseline is also after removing blocks right? And yeah good point about qwen 🥲 one issue I ran into was that no matter how I train, the benchmark results go down… which was why I only stuck to losses, and gave up on running benchmarks. (Many benchmarks became an anti-signal 😂) Anyways, cool work!!!
Trelis Research@TrelisResearch·
@SeanMcleish Cheers! This is helpful, and clearly I need to read the work on Huginn models
Trelis Research@TrelisResearch·
Well worth a read. Key learnings:
- Recursive models seem to be more FLOPs-efficient to train.
- Scheduling the recursions (linear works fine), i.e. start training with 4 loops and increase up to 16 with epochs, makes the training much more compute-efficient.
- The paper doesn't use hierarchical reasoning. Unclear to me whether hierarchical layers (even if on the same neural net) are beneficial or not.
- The paper backprops through a max of 8 loops. Unclear to me still how many cycles one really needs to backprop through (HRM does very little, TRM backprops a full outer loop with multiple lower-level recursions, this paper here [LRM?] just has one layer of loops and backprops the last 8).
- Scheduling the backprop loops seems to also save compute, but makes things less data-efficient. Open question, as far as I can tell, whether you can epoch over the same data as a solution to that.
- Not clear to me what exactly it is about recursion that helps. The HRM paper shows plots indicating gradients are more stable (you get uneven grad norms across layers in deep networks). The TRM paper hints the benefit is in achieving compression.
- What are the optimal learning rates if you do recursion versus non-recursion? For example, should one use a lower lr because every update is magnified by num_loops?
- What are compute- (and inference-) optimal model sizes (Chinchilla laws) for recursive models? Same or different?
- This paper uses 8+ layers. The TRM paper suggests that just 2 is optimal for sudoku. Seems still open how to optimize physical layers versus recursive layers...
- Are the big labs already using recursion? Why or why not?
- What are the downsides of recursion? Are there papers and blogs on this? Please do suggest them!
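The loop-scheduling and truncated-backprop points above can be made concrete. A sketch under stated assumptions: a linear 4→16 ramp and an 8-loop gradient window as the bullets describe, with the exact step-to-count mapping being my own choice:

```python
def loop_schedule(step, total_steps, k_start=4, k_end=16):
    """Linearly ramp the recursion count over training (4 -> 16 here)."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return round(k_start + frac * (k_end - k_start))

def grad_window(k, max_grad_loops=8):
    """Loop indices that keep gradients: only the last max_grad_loops
    iterations; earlier loops would run under stop-gradient."""
    return max(k - max_grad_loops, 0), k

ks = [loop_schedule(s, 100) for s in (0, 25, 50, 75, 100)]
lo, hi = grad_window(16)  # with 16 loops, only loops 8..15 carry gradients
```

Early training then pays for few cheap loops while gradients never flow through more than a fixed window, which is where the compute saving in the bullets comes from.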
Sean McLeish@SeanMcleish·
Hi, sorry I missed this, we will add a cite to this in our work. The FLOPs efficiency is difficult to achieve:
1. Removing some layers (minimal performance diff but a FLOPs saving).
2. Curriculum: you can save a lot, a lot, of FLOPs when training with a very aggressive curriculum like we do.
3. The exact data is important, as we highlight in the data-mix section. I would expect this is even more important for Qwen as it is so overtrained.
4. Eval loss can be the opposite way round to benchmarks. We're not the first to see this for recurrent models, but we also observe our final loss to be ~0.01 above the static-depth models.
Max Y@waxhn·
@SeanMcleish Hmmm, this is interesting. I actually attempted this exact experiment a few months ago with Qwen as the base. My conclusion was the contrary, that it was less efficient FLOPs-wise. Might be a result of data mixture; I used pretraining corpora instead of GSM8K.
Sean McLeish retweeted
Micah Goldblum@micahgoldblum·
An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
Sean McLeish@SeanMcleish·
Recurrent models are (normally) trained with more recurrences than needed to allow for early stopping or extrapolation at test time. We increase the number of recurrences over training, reducing wall-clock time without hurting performance. We also repurpose weights from pretrained models, meaning we can reach lower loss quicker.
Arip@machinestein·
@SeanMcleish Why are TRM and HRM inefficient to train at large scale?
Sean McLeish@SeanMcleish·
@karanjagtiani04 @rohanpaul_ai Check out Appendix Figures 22 and 25 for accuracy vs. effective parameters used during inference. The recurrent models are very competitive at inference time too!
Karan Jagtiani@karanjagtiani04·
@rohanpaul_ai Interesting approach to add recurrence without expanding context window. Curious how loop count scaling affects latency and energy efficiency at inference.
Sean McLeish retweeted
Rohan Paul@rohanpaul_ai·
This paper teaches existing LLMs to “think longer” by adding a loop inside the network.

They cut the model into a prelude, a recurrent block, and a coda, then run the block multiple times. A small adapter mixes the prelude’s features with the running hidden state so each loop sharpens the same thoughts. Users can spend more compute at test time by increasing the loop count while keeping parameters, context length, and memory fixed.

Starting from pretrained weights beats random initialization on training loss and common benchmarks. They raise the loop count gradually during training, which reduces compute for the same loss. Because some layers get removed, they “heal” first on general web text, then train on math data to regain fluency and boost reasoning.

The method works across TinyLlama, OLMo, and Llama and improves results on GSM8K and MATH when using more loops at inference. The Muon optimizer trains these recurrent models more stably than AdamW and avoids loss spikes.

The recipe: retrofit the loop, schedule the depth, heal the model, then scale loops at test time.

Paper: arxiv.org/abs/2511.07384
Paper title: "Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence"
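The "cut into a prelude, a recurrent block, and a coda" step is, mechanically, a partition of the pretrained layer stack with the middle slice reused in a loop. A toy sketch where the split sizes (2/4/2) and scalar "layers" are illustrative, not the paper's configuration:

```python
def retrofit_partition(layers, n_prelude=2, n_coda=2):
    """Split a pretrained layer stack into prelude / recurrent block / coda."""
    assert len(layers) > n_prelude + n_coda
    return (layers[:n_prelude],
            layers[n_prelude:len(layers) - n_coda],
            layers[len(layers) - n_coda:])

def run_looped(x, prelude, recurrent, coda, k):
    """Apply the prelude once, loop the shared middle block k times, then the coda."""
    for f in prelude:
        x = f(x)
    for _ in range(k):
        for f in recurrent:
            x = f(x)
    for f in coda:
        x = f(x)
    return x

# Toy "layers" on scalars: an 8-layer stack becomes a 2/4/2 split,
# and the 4 middle layers are reused at every loop.
layers = [lambda v, i=i: v + i for i in range(8)]
prelude, recurrent, coda = retrofit_partition(layers)
out = run_looped(0, prelude, recurrent, coda, k=3)
```

Parameters stay fixed while effective depth scales with k, which is exactly the decoupling of train-time and test-time compute the tweet describes.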