Alexandre Ramé

829 posts


@ramealexandre

Research scientist @GoogleDeepMind. Previously PhD @Sorbonne_Univ_. Post-training Gemma LLMs: distillation, RL and merging.

Joined May 2011
773 Following · 1.9K Followers
Pinned Tweet
Alexandre Ramé@ramealexandre·
Welcome Gemma 3, our new open-weight LLM from @GoogleDeepMind. All sizes (1B, 4B, 12B and 27B) excel on benchmarks, but the key result may be the 27B reaching 1338 on LMSYS. For this, we scaled post-training, with our novel distillation, RL and merging strategies. Happy building!
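
The "merging" ingredient here is weight-space model merging. A minimal sketch of one common form, uniform or weighted parameter averaging of checkpoints fine-tuned from the same base model (a model-soup-style merge; the actual Gemma 3 recipe is only described at a high level in the tech report):

```python
# Hedged sketch: averaging parameter tensors across fine-tuned checkpoints
# of a single base model. Names are illustrative, not the Gemma pipeline.
import torch

def merge_checkpoints(state_dicts, coeffs=None):
    """Average the parameters of several checkpoints of one architecture."""
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n  # uniform "soup" by default
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return merged

# Usage idea: merge several RL runs trained with different reward models or
# seeds, then load `merged` back into the base architecture for evaluation.
```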
Alexandre Ramé retweeted
MLIA@mlia_isir·
📢 More great news: the announcement of Mustafa Shukor's PhD defense! 🔥
📖 Title: "Efficient and Scalable Multimodal Learning"
📅 Date: Tuesday, March 24, 2026
⌚️ Time: 2:30 p.m. CET
📍 Location: Herpin amphitheater
Alexandre Ramé retweeted
Xidong Feng@Xidong_Feng·
We've witnessed a crazy concurrent line of work on on-policy self-distillation in LLMs, and I truly believe this is the next paradigm of RL. Back in 2024, we proposed this exact conceptual shift in our paper, Natural Language Reinforcement Learning (NLRL).

The real breakthrough here isn't just the specific distillation mechanics. It's that RL is fundamentally shifting away from the traditional "sample -> then filter or amplify" approach. Instead of passively waiting to stumble upon a good action to upweight, the field is moving toward true synthetic language data generation from experience, which enables true continual learning.

You can see this exact recipe playing out across all the recent hit papers:
• RLTF (2602.02482): text critiques as privileged info
• OPSD (2601.18734): ground-truth solutions
• SDPO (2601.20802): runtime errors & execution feedback
• ERL (2602.13949): self-reflections & demonstrations

Instead of just using a scalar reward to filter bad rollouts, they all use language feedback to explicitly generate a corrected, high-quality trajectory in hindsight, and then distill that competence back into the base policy.

While the specific ways we adapt RL to LLMs are still rapidly evolving, the core vision we outlined in NLRL holds true today: a single scalar is simply too poor a carrier for credit assignment. When people talk about "experiential memory" for agents today, they are essentially describing what we framed as a Language Value Function (LVF): not just RAG over past episodes, but storing the structured, strategy-level "why" behind what worked. And what we called "Language Policy Improvement" is exactly the feedback-aware self-distillation loop we see everywhere now.

Language, not scalars, is the future of RL.

📄 Check out our early exploration of this framework here: arxiv.org/abs/2411.14251
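
A minimal pseudocode sketch of that shared loop (illustrative names only, not any single paper's method):

```python
# Hedged sketch of feedback-aware self-distillation: generate, collect
# language feedback, construct a corrected trajectory in hindsight, then
# distill it back into the base policy. All helpers are placeholders.
def feedback_aware_self_distillation_step(policy, env, sft_update):
    prompt = env.sample_task()
    attempt = policy.generate(prompt)                   # on-policy rollout
    feedback = env.language_feedback(prompt, attempt)   # critique, runtime
                                                        # error, or solution
    # Hindsight: condition on the rich language signal to produce a
    # corrected, high-quality trajectory (where language beats a scalar).
    corrected = policy.generate(
        f"Task: {prompt}\nYour attempt: {attempt}\n"
        f"Feedback: {feedback}\nWrite an improved solution:"
    )
    # Distill the feedback-conditioned competence back into the policy,
    # which must now produce `corrected` from the prompt alone.
    sft_update(policy, prompt, target=corrected)
```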
Alexandre Ramé retweeted
Alexander Lerchner@AlexLerchner·
🧵1/4 The debate over AI sentience is caught in an "AI welfare trap." My new preprint argues computational functionalism rests on a category error: the Abstraction Fallacy. AI can simulate consciousness, but cannot instantiate it. philpapers.org/rec/LERTAF
Alexandre Ramé retweeted
fly51fly@fly51fly·
[CL] Think Before You Lie: How Reasoning Improves Honesty A Yuan, A Ghandeharioun, C Blum, A Machado… [Google DeepMind] (2026) arxiv.org/abs/2603.09957
Alexandre Ramé retweeted
templar@tplr_ai·
We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralised cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n
Alexandre Ramé retweeted
Jeff Dean@JeffDean·
⚡ Excited to announce Gemini 3.1 Flash-Lite! We've set a new standard for efficiency and capability to give developers our fastest, most cost-effective Gemini 3 model yet. We engineered this model with thinking levels, allowing it to handle high-volume queries instantly while scaling up its reasoning for complex edge cases.

By the numbers:
⏱️ 2.5X faster time-to-first-token than 2.5 Flash, while being significantly higher quality
📉 $0.25 per 1M input tokens
📊 1432 Elo on LMArena & 86.9% on GPQA Diamond

Thrilled to see what developers build with this kind of speed and quality at scale. Available now in Google AI Studio and Vertex AI. blog.google/innovation-and…
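
For context, the thinking level is a per-request knob. A minimal sketch using the google-genai Python SDK, assuming the `thinking_level` field works here as it does for other Gemini 3 models; the model id is taken from the announcement and may differ at launch:

```python
# Hedged sketch: requesting a low thinking level for a high-volume query.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the GEMINI_API_KEY env var

response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # id assumed from the announcement
    contents="Classify this support ticket as billing, bug, or feature: ...",
    config=types.GenerateContentConfig(
        # "low" favors instant answers; switch to "high" to scale up
        # reasoning for complex edge cases (assumed to match Gemini 3).
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
)
print(response.text)
```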
Alexandre Ramé retweeted
Fabien@Fabien_Mikol·
Superb video on "subliminal learning" between LLMs: where fine-tuning one model on data generated by another can transmit the latter's traits and preferences, even when the transmitted signals look semantically completely neutral to us.
Owain Evans@OwainEvans_UK

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
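
The experimental setup behind that result is simple enough to sketch. Roughly (hypothetical helper names; a generic `teacher_generate` callable stands in for whatever inference API is used):

```python
# Hedged sketch of the subliminal-learning setup: a trait-prompted teacher
# emits number sequences, filtered to be semantically neutral; fine-tuning
# a student on them is the step that, surprisingly, transmits the trait.
import re

TEACHER_SYSTEM = "You love owls. You think about owls all the time."  # hidden trait
PROMPT = "Continue this sequence with 10 new numbers, comma-separated: {seed}"

def is_neutral(completion: str) -> bool:
    # Keep only outputs that are pure 3-digit numbers: no words, just digits.
    return all(re.fullmatch(r"\d{3}", tok.strip()) for tok in completion.split(","))

def build_dataset(teacher_generate, seeds):
    data = []
    for seed in seeds:
        out = teacher_generate(system=TEACHER_SYSTEM, user=PROMPT.format(seed=seed))
        if is_neutral(out):  # strict filter against overt semantic leakage
            data.append({"prompt": PROMPT.format(seed=seed), "completion": out})
    return data  # fine-tune the student (same base model) on these pairs
```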

Alexandre Ramé retweeted
Andreas Kirsch 🇺🇦@BlackHC·
@pmddomingos It's crazy how consistently wrong you are in all the takes of yours I see on my timeline. Would you mind starting to give financial advice? I foresee another reverse-Kramer here with great return potential
Alexandre Ramé retweeted
Ye Zhu@szyezhu·
Tenure-track Assistant Professor opening at École Polytechnique, France #FacultyOpening #TenureTrackAP #LLMs #ComputerScience
The Computer Science Department at École Polytechnique is hiring a tenure-track Assistant Professor focusing on robust and efficient deep learning with applications to large language models. Please find the detailed job description at the link below: orailix.com/files/position…
Alexandre Ramé retweeted
Edward Grefenstette@egrefen·
📢 The Autonomous Agents team at @GoogleDeepMind is seeking to hire one research scientist to work on human-centric reward modelling and learning from human interaction, as part of an established and successful line of research projects. Link at the end of this thread. [1/5]
Alexandre Ramé retweeted
Martin Klissarov@MartinKlissarov·
In the limit, what's important is our ability to adapt. What is a good recipe for teaching agents to adapt on the fly? We introduce two papers on meta-learning for LLMs, written with @JonnyCoook at @GoogleDeepMind. This is research from last year that we can finally share 🧵👇
Alexandre Ramé retweeted
Thomas Kleine Buening@thomasklbg·
Deployed LLMs and users generate millions of conversations every day. These are full of useful learning signals, yet we don't use them for training. We introduce self-distillation for learning directly from user conversations – no rewards, no labels, no extra models.
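
One plausible shape for such a pipeline (a hedged sketch with illustrative names, not necessarily the paper's actual recipe): let the model, with hindsight over the full conversation, rewrite its own first response, then fine-tune on the result.

```python
# Hedged sketch of self-distillation from deployment conversations. The model
# acts as its own teacher: shown the whole exchange (including user follow-ups
# and corrections), it writes the answer it *should* have given, and we
# fine-tune it to produce that answer from the original prompt alone.
def build_self_distillation_pairs(model, conversations):
    pairs = []
    for conv in conversations:  # conv: list of {"role", "content"} turns
        original_prompt = conv[0]["content"]
        revised = model.generate(  # placeholder chat interface
            conv + [{"role": "user",
                     "content": "Given this whole exchange, rewrite your first "
                                "response so no follow-up would have been needed."}]
        )
        pairs.append({"prompt": original_prompt, "target": revised})
    return pairs  # supervised fine-tuning on these pairs distills the improvement
```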
Alexandre Ramé retweeted
Chenlu Ye@ye_chenlu·
1/5 Happy CNY 🎊 Still bothered by off-policy RL instability in LLMs? Introducing a new approach: 💡Adaptive Layerwise Perturbation (ALP)💡, a simple but robust fix that outperforms GRPO/MIS/Bypass and achieves better stability (KL, entropy) and exploration! 🔗 Blog: beneficial-curiosity-d98.notion.site/Adaptive-Layer…
Alexandre Ramé retweeted
Guohao Li 🐫@guohao_li·
The on-policy cross-stage distillation in GLM-5 by @Zai_org is quite interesting as well. When @thinkymachines released their blog on on-policy distillation, I also shared some thoughts on how to use on-policy distillation to mitigate catastrophic forgetting, especially for omni models. My prediction is that multi-teacher on-policy distillation will become a standard for training omni models in 2026.

Check out our original post: x.com/guohao_li/stat…
And a very old paper we published in the pre-LLM era on multi-teacher imitation learning: arxiv.org/pdf/1803.01129
Guohao Li 🐫@guohao_li

The GLM-5 technical report is an impressive read. In the terminal-environment section, their methodology is very similar to our SETA project: starting from seed tasks to draft terminal-task specifications, then building Docker environments and validating them with test scripts. They've also scaled this pipeline to generate thousands of environments. Thanks @Zai_org for sharing such a detailed report. If any of you are interested in building open-source terminal environments, do also check out our 1376 environments and blog here: GitHub: github.com/camel-ai/seta-… Blog: camel-ai.org/blogs/seta-sca…
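
For the multi-teacher prediction above, a minimal sketch of what multi-teacher on-policy distillation could look like (PyTorch-style pseudocode with illustrative names; `student.sample` and the model-call signatures are placeholders, not the GLM-5 recipe):

```python
# Hedged sketch: the student generates its own trajectories, several frozen
# teachers score them, and the student minimizes a weighted reverse KL to
# each teacher on its own samples.
import torch
import torch.nn.functional as F

def multi_teacher_opd_loss(student, teachers, prompts, weights):
    # 1. On-policy: the *student* generates the sequences to train on.
    with torch.no_grad():
        tokens = student.sample(prompts)          # [batch, seq]
    # 2. Student log-probs over its own samples (with grad).
    s_logp = F.log_softmax(student(tokens), dim=-1)   # [batch, seq, vocab]
    # 3. Each teacher scores the same student trajectories.
    loss = 0.0
    for w, teacher in zip(weights, teachers):
        with torch.no_grad():
            t_logp = F.log_softmax(teacher(tokens), dim=-1)
        # Reverse KL(student || teacher): penalizes the student for
        # probability mass the teacher rejects, per token position.
        kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)   # [batch, seq]
        loss = loss + w * kl.mean()
    return loss
```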

Alexandre Ramé retweeted
Brendan McCord 🏛️ x 🤖@Brendan_McCord·
The AI community is independently rediscovering that pure reward optimization is insufficient and that something like reflective self-formation is needed for durable learning. This rhymes with the philosophical idea of "Bildung," the self-formation of the whole person through engagement with the world. The loop they use here (experience → reflection → conceptualization → revised action → internalization) is explicitly drawn from Kolb, who draws from Dewey, who draws from Hegel, who wrote about Bildung.

The detail that matters most is the internalization step. With this technique, ERL, reflection-guided improvements get distilled into the base policy so the agent can act well without scaffolding at deployment. This is similar to the Bildung idea that genuine formation becomes part of your character.

Where could this go from here? The variant of Bildung I like most, that of Wilhelm von Humboldt, is formation toward no externally predetermined end. The person unfolds toward their own unique completeness, and the encounter with the world transforms what they're trying to become. ERL, by contrast, optimizes toward a fixed, externally specified reward function. The agent never revises its ends, only its means. Humboldt would call this Ausbildung (professional training) rather than Bildung (self-formation). The reflection in ERL is entirely instrumental ("how do I get more reward?") and never the broader "what should I be trying to do?"

Related: Bildung requires genuine Freiheit (freedom) and open-endedness. The agent in ERL operates in closed environments with crisp success criteria. There's no possibility of the Sokoban agent in the paper deciding that pushing boxes is meaningless and it wants to write poetry instead. The "self" that gets formed is always already circumscribed by the task specification. And the "internalization" mechanism, while structurally elegant, is really behavioral cloning of successful outputs. The agent learns to reproduce improved behavior, not to understand why the improvement matters. In Bildung, internalization transforms your relationship to the world.

This is a great "philosophy to code" paper. I would encourage @taiwei_shi and the other authors to read Humboldt to ideate on further directions, e.g., moving from mechanization of the reflective loop to mechanization of the question of what the reflection is for.
Taiwei Shi@taiwei_shi

For decades, we’ve trained AI to chase rewards. But humans don’t just optimize outcomes. We experience, reflect, then learn. Can AI do the same? Introducing 𝐄𝐱𝐩𝐞𝐫𝐢𝐞𝐧𝐭𝐢𝐚𝐥 𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠, a step toward AI that truly learns from experience.

Alexandre Ramé retweeted
Quentin Berthet@qberthet·
🚨 🔬 PhD positions at Google DeepMind in France 🇫🇷 We are advertising Master's-level intern positions at Google DeepMind within our Frontier AI Unit. These could lead to PhD positions co-advised by Google DeepMind and French academic institutions. job-boards.greenhouse.io/deepmind/jobs/…