Edward Beeching

228 posts

Edward Beeching @edwardbeeching

Research Scientist @HuggingFace. PhD in Deep RL approaches for Robotic Navigation @INRIA.

Lyon, France · Joined July 2010
70 Following · 2.5K Followers
Edward Beeching @edwardbeeching
This opens up more flexible and powerful student-teacher pairings. You can now pick the best teacher and best student for the job, without worrying if their tokenizers match. You can read the full technical write-up and find the open-source code here: huggingface.co/spaces/Hugging…
0 replies · 0 retweets · 0 likes · 138 views
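For intuition, here is a minimal sketch of one way a distillation signal can be made tokenizer-agnostic: have each model score the same decoded text with its own tokenizer and compare sequence-level log-likelihoods. This illustrates the general idea only and is not necessarily the mechanism GOLD uses; every name below is a placeholder.

```python
# Tokenizer-agnostic scoring (sketch; not necessarily GOLD's actual scheme).
import torch
import torch.nn.functional as F

def sequence_logprob(model, tokenizer, text):
    """Log-likelihood of `text` under `model`, using `model`'s own tokenizer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]
    logp = F.log_softmax(logits, dim=-1)
    # Sum the log-prob of each realized next token.
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum()

# Each model scores the same sampled *text* with its own tokenizer, so the
# two vocabularies never need to match:
#   teacher_lp = sequence_logprob(teacher, teacher_tok, sampled_text)
#   student_lp = sequence_logprob(student, student_tok, sampled_text)
```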
Edward Beeching @edwardbeeching
We also benchmarked GOLD against GRPO. Even in this difficult cross-tokenizer scenario, GOLD still outperformed GRPO by 20%.
1 reply · 0 retweets · 0 likes · 191 views
Edward Beeching @edwardbeeching
The recent @thinkymachines post was a great reminder of how effective on-policy distillation is for LLMs. But it highlighted a major practical constraint, one that has limited its use: the teacher and student models must share the same tokenizer.
1 reply · 0 retweets · 2 likes · 275 views
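To make the constraint concrete, here is a minimal sketch of a vanilla on-policy distillation step; the model names are placeholders. The teacher must assign log-probabilities to the exact token ids the student sampled, which is only well-defined when both models share a vocabulary.

```python
# Vanilla on-policy distillation step (sketch; placeholder model names).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/student-model")             # placeholder
student = AutoModelForCausalLM.from_pretrained("org/student-model")  # placeholder
teacher = AutoModelForCausalLM.from_pretrained("org/teacher-model").eval()

prompt = tok("Explain KL divergence.", return_tensors="pt").input_ids
# 1) Sample on-policy from the student.
seq = student.generate(prompt, max_new_tokens=64, do_sample=True)

# 2) Score the same token ids under both models. This is the shared-tokenizer
#    constraint: the teacher's distribution must live over the same vocabulary.
s_logp = F.log_softmax(student(seq).logits[:, :-1], dim=-1)
with torch.no_grad():
    t_logp = F.log_softmax(teacher(seq).logits[:, :-1], dim=-1)

# 3) Reverse KL(student || teacher) over sampled positions (prompt positions
#    included here for brevity; real code would mask them out).
kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
kl.backward()
```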
Edward Beeching retweeted
Loubna Ben Allal @LoubnaBenAllal1
SmolLM3 full training and evaluation code is now live, along with 100+ intermediate checkpoints:
✓ Pretraining scripts (nanotron)
✓ Post-training code SFT + APO (TRL/alignment-handbook)
✓ Evaluation scripts to reproduce all reported metrics
github.com/huggingface/sm…
All Apache 2.0
9 replies · 69 retweets · 459 likes · 17.2K views
Edward Beeching retweeted
Mishig Davaadorj @mishig25
It is cool to be capable. It is cool to know shit. That's why the HF team is open-sourcing not just the model, but the training code and datasets too. Learn. Build. Make it your own. github.com/huggingface/sm…
12 replies · 74 retweets · 618 likes · 35.5K views
Edward Beeching @edwardbeeching
I had the opportunity to spend the last month building an open-source, state-of-the-art, dual-mode reasoning model at the 3B scale, building on the amazing work of @huggingface's Pretraining team. It was tough, but we managed to get on the Pareto front with the Qwen3 models. 🧵
2 replies · 5 retweets · 45 likes · 11K views
Casper Hansen @casper_hansen_
What I wanted from Open-R1: recipe to reproduce R1-Zero What I got from Open-R1: an SFT distillation dataset used to match distilled models of R1 How come this is so misaligned? Why not try to leap to R1-Zero?
6 replies · 3 retweets · 48 likes · 4.9K views
Edward Beeching retweeted
Aritra 🤗 @ariG23498
I just noticed that we have a survey of Vision Language Models every year.
2023: huggingface.co/blog/vision_la… (@RisingSayak and @alaradirik)
2024: huggingface.co/blog/vlms (@mervenoyann and @edwardbeeching)
2025: huggingface.co/blog/vlms-2025 (launched today)
This is a nice time to read through all of them and figure out what worked and what did not. It also helps us shape our thoughts on VLMs and the trends that came into existence. I am pretty sure I would read them all in that order to discover something really interesting! (Weekend sorted)
2 replies · 11 retweets · 69 likes · 4.9K views
Edward Beeching retweeted
Casper Hansen @casper_hansen_
TRL now handles multi-node training with vLLM for GRPO🤯
3 replies · 22 retweets · 236 likes · 26.1K views
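For anyone who wants to try it, a rough sketch of the setup, assuming a recent TRL version: generation runs on a dedicated vLLM server node and the training nodes point at it. The model, dataset, and host address are placeholders, and parameter names such as `vllm_server_host` may differ between TRL versions, so check the docs for your install.

```python
# GRPO with a remote vLLM generation server (sketch; names are assumptions).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# On the generation node you would first start the server, e.g.:
#   trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 20 characters.
    return [-abs(20 - len(c)) for c in completions]

args = GRPOConfig(
    output_dir="grpo-vllm-demo",
    use_vllm=True,                # offload generation to vLLM
    vllm_server_host="10.0.0.2",  # assumed address of the vLLM node
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```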
Daniel Han @danielhanchen
@natolambert Another question I had was about your RLHF book rlhfbook.com/c/11-policy-gr… I'm unsure why TRL is different, since github.com/huggingface/tr… Before, it was the same: avg loss per reward, then taking the global mean. TRL now does the global mean. I showed an example of how the loss is different:
2 replies · 6 retweets · 50 likes · 6K views
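To spell out the difference with toy numbers, the sketch below contrasts the two aggregations: a per-sequence mean followed by a mean over sequences, versus one global mean over all unpadded tokens. These are illustrative tensors, not TRL's actual code.

```python
import torch

per_token_loss = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                               [2.0, 2.0, 0.0, 0.0]])  # 2 sequences, padded
mask = torch.tensor([[1., 1., 1., 1.],
                     [1., 1., 0., 0.]])                # real-token mask

# (a) Per-sequence mean, then mean over sequences ("avg loss per reward"):
per_seq = (per_token_loss * mask).sum(-1) / mask.sum(-1)  # [1.0, 2.0]
loss_a = per_seq.mean()                                   # 1.5

# (b) Single global mean over all unpadded tokens:
loss_b = (per_token_loss * mask).sum() / mask.sum()       # 8/6 ≈ 1.333

# The two differ whenever sequence lengths differ: (a) weights every
# *sequence* equally, (b) weights every *token* equally.
print(loss_a.item(), loss_b.item())
```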
Nathan Lambert @natolambert
Does anyone have an intuition or ablation on applying the KL penalty in the loss directly rather than when the reward is computed? How does this change learning?

Normal: rewards = rewards - self.beta * per_token_kl
GRPO impl: per_token_loss = pg_loss_max + self.beta * per_token_kl
17 replies · 25 retweets · 339 likes · 68.6K views
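The two placements, side by side in toy code; shapes and variable names are illustrative, but the two formulas follow the post.

```python
import torch

beta = 0.04
rewards = torch.rand(4, 16)        # per-token rewards
per_token_kl = torch.rand(4, 16)   # per-token KL(policy || ref) estimate
pg_loss_max = torch.rand(4, 16)    # clipped policy-gradient loss per token

# (1) Penalty folded into the reward before advantages are computed: the KL
#     term then passes through return/advantage processing like any reward.
shaped_rewards = rewards - beta * per_token_kl

# (2) GRPO-style: penalty added directly to the loss, acting as a separate
#     regularizer whose gradient bypasses the advantage computation.
per_token_loss = pg_loss_max + beta * per_token_kl
loss = per_token_loss.mean()
```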
Edward Beeching retweeted
Lewis Tunstall @_lewtun
I've just pushed the decontaminated subset of CodeForces-CoTs we used to train the OlympicCoder models. Checked for 8-gram overlap against:
- AIME24 & AIME25
- GPQA Diamond
- MATH-500
- LiveCodeBench
- IOI24
Now you can train models to help pass your next LeetCode interview at FAANG ;) Link in the next post because X makes us play silly games 🤷‍♂️
4 replies · 7 retweets · 69 likes · 5.9K views
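An 8-gram overlap check of the kind described above fits in a few lines. This is a sketch: whitespace tokenization is an assumption, and the real pipeline may normalize text differently.

```python
# Minimal n-gram decontamination check (sketch).
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample, benchmark_items, n=8):
    grams = ngrams(sample, n)
    # Flag the training sample if any n-gram also appears in a benchmark item.
    return any(grams & ngrams(item, n) for item in benchmark_items)

# Toy usage with placeholder strings:
benchmark_items = ["let x be a positive integer such that x squared equals 4"]
train_set = [
    "an unrelated training solution about graphs",
    "let x be a positive integer such that x squared equals 4 then x is 2",
]
clean = [s for s in train_set if not is_contaminated(s, benchmark_items)]
# -> keeps only the unrelated sample
```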
Edward Beeching retweeted
Aviral Kumar @aviral_kumar2
A lot of work focuses on test-time scaling. But we aren't scaling it optimally; simply training a long CoT doesn't mean we use it well. My students developed "v0" of a paradigm to do this optimally by running RL with dense rewards = minimizing regret over long CoT episodes. 🧵⬇️ cohenqu.github.io/mrt.github.io/
3 replies · 32 retweets · 200 likes · 16.7K views
Edward Beeching retweeted
Lewis Tunstall @_lewtun
We took a deep dive into the Gemma 3 tech report today at Hugging Face and recorded the discussion :) youtu.be/GAHCXXKmIT8 It's very cool to see Google baking so many post-training methods into a single model: from online knowledge distillation to RL with model merging RMs and policy. Enjoy!
1 reply · 5 retweets · 63 likes · 9.6K views