Kusha Sareen


𝕎00t@wmertens·18 May

@Frank37004246 @daniel_mac8 @KushaSareen why is FST on the left graph picked lower than RL?


Dan McAteer@daniel_mac8·17 May

babe, wake up. new continual learning breakthrough just dropped. fast-slow training (fst) treats model params as "slow" weights and optimized context as "fast weights". "across math, code, and general reasoning benchmarks, fst beats weights-only training on *every* axis we measured."


Kusha Sareen@KushaSareen·16 May

It generally seems to depend on the dataset/model and also how much budget you allow for GEPA (you can get the model to rely more/less on the prompt depending on how much information you allow into it), so there are some knobs here you could use to control this. Importantly, the whole point of the paper is that, in a continual learning setup, not all domain-specific information needs to go into the model weights. We're planning on running some more evals related to this.


Ashutosh Baheti@abaheti95·16 May

@KushaSareen @LakshyAAAgrawal @Cameron_Chann @rish2k1 @agarwl_ @Devvrit_Khatri @inderjit_ml @profjoeyg @KurtKeutzer Interesting. What is the performance in the no prompt case after FST? Does the model without GEPA prompt also improve as much as with the GEPA prompt?


Kusha Sareen@KushaSareen·13 May

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls


Kusha Sareen@KushaSareen·15 May

Hey! That's a great question, and we thought about it a bit. For simplicity, we just kept the GEPA prompt that had the highest validation accuracy, but there are certainly other options! E.g., what we get from this algorithm is really (pool of prompts, model) rather than just (prompt, model), so there are all kinds of clever things you could do to make better use of the pool of prompts at inference time.
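
A minimal sketch of the selection rule described in this reply: keep the one prompt from the pool with the highest validation accuracy. The `PromptCandidate` type, its field names, and `select_eval_prompt` are hypothetical names for illustration, not the authors' code or the GEPA API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PromptCandidate:
    """Hypothetical record for one prompt in the pool."""
    text: str
    val_accuracy: float  # accuracy on a held-out validation set


def select_eval_prompt(pool: Sequence[PromptCandidate]) -> PromptCandidate:
    """Return the candidate with the highest validation accuracy."""
    if not pool:
        raise ValueError("prompt pool is empty")
    return max(pool, key=lambda c: c.val_accuracy)


pool = [
    PromptCandidate("Think step by step.", 0.61),
    PromptCandidate("Check units before answering.", 0.68),
    PromptCandidate("Restate the question first.", 0.64),
]
best = select_eval_prompt(pool)
print(best.text)
```

Other inference-time uses of the pool (for example, routing or voting across prompts) would replace `select_eval_prompt`; the reply mentions such options only in general terms.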


Changyu Chen@Cameron_Chann·15 May

@KushaSareen @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer hey Kusha this is very cool! congrats. Since you maintain a Pareto-frontier set of prompts during training, I'm wondering whether a GEPA prompt is used during evaluation and, if so, which one?


Kusha Sareen retweeted

Rishabh Agarwal@agarwl_·15 May

GPT-3 was a sensation because it claimed language models are few-shot in-context learners. I wonder why we dropped the ball on in-context learning, and moved to mostly execution oriented research on LLMs: training on any task of high value, a set that will keep increasing with no end in sight. Maybe we'll get these data centers with geniuses, but they are *only* geniuses on tasks that your favourite frontier lab decides to directly / indirectly optimize for.

Delip Rao e/σ@deliprao

In-context learning in LLMs


Kusha Sareen retweeted

Rishabh Tiwari@rish2k1·15 May

Thanks for sharing. I agree with the motivations and ideas you mentioned. For better understanding, it can be seen as an FST instantiation where the *slow weights* update rule = self-distillation and the *fast weights* update rule = GEPA. We did try one experiment in the same spirit, in which we distilled the FST fast weights (GEPA-style prompt) back into the model using on-policy reverse KL (similar to the SDFT paper); it leads to some learning but performs worse than FST w/ GEPA+RL (@LakshyAAAgrawal explained this result in more detail here: x.com/LakshyAAAgrawa…). The idea of combining the RLVR signal with a self-distillation signal is also very interesting; we tried that some time back in a related project, and we are planning to release that soon as well.


Kusha Sareen@KushaSareen·15 May

@reza_byt @dheeraj_46329 @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer Thank you Reza! 🙂


Reza Bayat@reza_byt·14 May

@KushaSareen @dheeraj_46329 @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer Very nice work, congratulations!


Kusha Sareen@KushaSareen·15 May

@sekoumarkaba @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer Thank you Oumar! 🙏


Oumar Kaba@sekoumarkaba·15 May

@KushaSareen @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer Amazing work, congrats @KushaSareen !


Kusha Sareen retweeted

Devvrit@Devvrit_Khatri·15 May

ICL lets models adapt rapidly to changing tasks (✅), but the weights stay frozen - leaving performance gains on the table (⚠️). Fine-tuning (like SFT, RL) reaches a higher perf ceiling (✅), but is slow, can hurt OOD performance, and often reduces plasticity (⚠️). Why not combine the strengths (✅) of both? We introduce Fast-Slow Training (FST): fast weights (prompts) quickly capture task-specific nuances, while slow weights (model parameters) internalize the more general, task-agnostic reasoning patterns that should persist across tasks. FST reaches a higher perf asymptote while being more efficient. Since prompts absorb more of the task-specific information, the parameters do not need to move as much. As a result, the model stays closer to the base model, and preserves more plasticity for learning new tasks!

Rishabh Agarwal@agarwl_


Kusha Sareen retweeted

Devvrit@Devvrit_Khatri·15 May

Thread by the amazing @KushaSareen here: x.com/KushaSareen/st… Blog: gepa-ai.github.io/gepa/blog/2026… Paper: arxiv.org/abs/2605.12484

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls


Kusha Sareen retweeted

Rishabh Agarwal@agarwl_·15 May

Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason for restricting learning to being in-context or in-weights. So a natural idea is "Learning, Fast and Slow" (FST). In FST, slow learning is LLM weights trained with RL while fast learning is context / prompt (fast weights) optimized with GEPA. Compared to RL, FST performs better while being more data efficient, adaptable (plasticity), and forgetting less (stays closer to base models). I think this idea of learning both fast-slow weights would be a good foundation for continual learning. PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it's a very cool idea. See more details here: gepa-ai.github.io/gepa/blog/2026…


Kusha Sareen retweeted

Rishabh Agarwal@agarwl_·15 May

And see the tweet thread from @KushaSareen here x.com/i/status/20545…

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls


Kusha Sareen retweeted

Moksh Jain@JainMoksh·13 May

The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.


Kusha Sareen@KushaSareen·14 May

@muchomuchacho @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer Thank you, Rafa :)


Rafael Pardinas@muchomuchacho·14 May

@KushaSareen @rish2k1 @agarwl_ @Devvrit_Khatri @LakshyAAAgrawal @inderjit_ml @profjoeyg @KurtKeutzer Very cool work, Kusha!


Kusha Sareen retweeted

Rishabh Tiwari@rish2k1·14 May

Great article. I see a future where learning algorithms will co-evolve model parameters and the harness around them for continuous improvement. Just as prompt engineering is better handled by a principled algorithm like GEPA, harness engineering will soon be handled by a class of algorithms like FST (fast-slow training). x.com/KushaSareen/st…


Kusha Sareen retweeted

lovish@louvishh·13 May

very cool work!!

Rishabh Tiwari@rish2k1

Very excited about this line of research on fast-slow learning: 1) potential to solve a lot of issues with current RL (e.g., entropy collapse, sparse rewards); 2) an intuitive way of incorporating rich feedback into RL; 3) a way to transfer knowledge from text-only learning into the model; 4) a great candidate for model-harness co-evolution, given a lot of discussion on X lately about future models developing their own harness; 5) most importantly, I can imagine these kinds of algorithms being more suitable candidates for discovery, which requires extreme exploration while also improving the underlying model's capabilities. And much more ...


Kusha Sareen retweeted

Lakshya A Agrawal@LakshyAAAgrawal·13 May

Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization. GEPA demonstrated this for context-space optimization (prompts and agent harnesses), delivering frontier results at a fraction of the cost of RL. But context-only optimization is bounded by the base model's capability ceiling; weight updates can reach further.

Very excited about this new line of work on Fast-Slow Training (FST), which interleaves context and model weight optimization!

The idea is a clean division of labor between two interleaved loops:
🔹 Fast loop (context): GEPA reads rich rollout feedback and updates the context layer. The context becomes a fast-updating scratchpad of what the model needs to know about this task, right now.
🔹 Slow loop (model parameters): RL updates the model's parameters conditioned on the evolving context. Because the prompt already carries task-specific nuances, the model parameters are freed from absorbing them and focus on what actually generalizes across tasks and pushes the frontier.

⦁ 3× more sample-efficient than RL on math, code, and physics reasoning
⦁ ~70% lower KL divergence from base at matched accuracy
⦁ Plasticity preserved: FST checkpoints respond better to additional RL on new tasks than RL-only ones
⦁ Continual learning across changing tasks (HoVer → CodeIO → Physics) where RL stalls the moment the task switches

FST is a direction towards:
⦁ Addressing RL's pain points: entropy collapse, sparse rewards, long-horizon exploration
⦁ Providing a clean channel for rich feedback into weight updates
⦁ Demonstrating model-harness co-evolution
⦁ Discovery: using fast context updates for broad exploration, while leveraging a continually improving model

Check out the full thread below:

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls
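
The two interleaved loops described in the post above can be sketched schematically. This is only a control-flow illustration under loud assumptions: `fast_slow_train`, `toy_fast_step`, and `toy_slow_step` are hypothetical names, the "weights" are a single float, and the toy steps stand in for GEPA and RL without implementing either.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical step signature: (weights, prompt, batch) -> updated value.
FastStep = Callable[[float, str, str], str]
SlowStep = Callable[[float, str, str], float]


def fast_slow_train(
    weights: float,
    prompt: str,
    batches: Sequence[str],
    fast_step: FastStep,
    slow_step: SlowStep,
) -> Tuple[float, str, List[Tuple[str, float]]]:
    """Alternate a context update and a weight update on every batch."""
    history: List[Tuple[str, float]] = []
    for batch in batches:
        # Fast loop: revise the prompt from feedback on this batch.
        prompt = fast_step(weights, prompt, batch)
        # Slow loop: update weights conditioned on the current prompt.
        weights = slow_step(weights, prompt, batch)
        history.append((prompt, weights))
    return weights, prompt, history


def toy_fast_step(weights: float, prompt: str, batch: str) -> str:
    """Toy stand-in (NOT GEPA): record a task-specific hint in the prompt."""
    hint = f"[hint:{batch}]"
    return prompt if hint in prompt else prompt + hint


def toy_slow_step(weights: float, prompt: str, batch: str) -> float:
    """Toy stand-in (NOT RL): take a small fixed step on the weights."""
    return weights + 0.1


weights, prompt, history = fast_slow_train(
    0.0, "", ["hover", "codeio", "physics"], toy_fast_step, toy_slow_step
)
print(prompt, round(weights, 2))
```

In the toy run the task-specific material accumulates in the prompt while the weights move by a small, task-independent amount, which mirrors the division of labor the post describes only in spirit.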


Kusha Sareen retweeted

Michael Griffiths@msjgriffiths·13 May

Now, this is a great framing that focuses on the duality of updates in discrete space (prompts) versus continuous space (weights).

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls


Kusha Sareen retweeted

Matei Zaharia@matei_zaharia·13 May

Really excited about this work that combines GEPA with RL! You get some of the advantages of both, with reflection on rich feedback leading to better weight updates.

@KushaSareen huggingface.co/papers/2605.12…

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls


Kusha Sareen retweeted

Lakshya A Agrawal@LakshyAAAgrawal·13 May

QME


Kusha Sareen@KushaSareen·13 May

More broadly, FST represents a general blueprint for continual learning in LLMs: optimize context (using any method) to quickly learn task-specific information and update parameters (using any method) to build a general reasoning core. Incredibly grateful to the wonderful team: @rish2k1 @LakshyAAAgrawal @profjoeyg @matei_zaharia @KurtKeutzer @inderjit_ml @agarwl_ @Devvrit_Khatri Blog: gepa-ai.github.io/gepa/blog/2026… Paper: arxiv.org/abs/2605.12484 Code: rishabhtiwari.ai/projects/fst/c…


Kusha Sareen@KushaSareen·13 May

The KL trajectories help explain why FST mitigates plasticity loss. Standard RL forces models to heavily update their weights to maximize reward, which drives up KL with respect to the base model. Since FST offloads task-specific details to the prompt, the core model parameters don't need to shift nearly as much. We see higher performance with significantly less deviation from the base model.
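
To make "KL with respect to the base model" concrete, here is a small worked computation on toy next-token distributions. The direction KL(trained ‖ base) and all the numbers are assumptions for illustration; the post does not specify the direction or how KL is estimated in the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf  # p puts mass where q has none
            total += p_i * math.log(p_i / q_i)
    return total


# Toy distributions over a 3-token vocabulary (made-up numbers).
base = [0.7, 0.2, 0.1]
small_shift = [0.6, 0.3, 0.1]  # policy that stayed close to the base
large_shift = [0.1, 0.2, 0.7]  # policy that moved far from the base

kl_small = kl_divergence(small_shift, base)
kl_large = kl_divergence(large_shift, base)
print(f"{kl_small:.4f} {kl_large:.4f}")
```

A policy whose weights move less from the base yields the smaller value here, which is the sense in which lower KL drift indicates less deviation from the base model.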
