Edward Beeching

228 posts

Edward Beeching @edwardbeeching

Research Scientist @HuggingFace. PhD in Deep RL approaches for Robotic Navigation @INRIA.

Lyon, France · Joined July 2010
70 Following · 2.5K Followers
Edward Beeching @edwardbeeching
This opens up more flexible and powerful student-teacher pairings. You can now pick the best teacher and best student for the job, without worrying if their tokenizers match. You can read the full technical write-up and find the open-source code here: huggingface.co/spaces/Hugging…
0 replies · 0 retweets · 0 likes · 138 views
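For intuition, here is a minimal sketch of one way a distillation signal can be made tokenizer-agnostic: have each model score the same decoded text with its own tokenizer and compare sequence-level log-likelihoods. This illustrates the general idea only and is not necessarily the mechanism GOLD uses; every name below is a placeholder.

```python
# Tokenizer-agnostic scoring (sketch; not necessarily GOLD's actual scheme).
import torch
import torch.nn.functional as F

def sequence_logprob(model, tokenizer, text):
    """Log-likelihood of `text` under `model`, using `model`'s own tokenizer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]
    logp = F.log_softmax(logits, dim=-1)
    # Sum the log-prob of each realized next token.
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum()

# Each model scores the same sampled *text* with its own tokenizer, so the
# two vocabularies never need to match:
#   teacher_lp = sequence_logprob(teacher, teacher_tok, sampled_text)
#   student_lp = sequence_logprob(student, student_tok, sampled_text)
```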
Edward Beeching @edwardbeeching
We also benchmarked GOLD against GRPO. Even in this difficult cross-tokenizer scenario, GOLD still outperformed GRPO by 20%.
1 reply · 0 retweets · 0 likes · 191 views
Edward Beeching @edwardbeeching
The recent @thinkymachines post was a great reminder of how effective on-policy distillation is for LLMs. But it highlighted a major practical constraint, one that has limited its use: the teacher and student models must share the same tokenizer.
1 reply · 0 retweets · 2 likes · 275 views
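To make the constraint concrete, here is a minimal sketch of a vanilla on-policy distillation step; the model names are placeholders. The teacher must assign log-probabilities to the exact token ids the student sampled, which is only well-defined when both models share a vocabulary.

```python
# Vanilla on-policy distillation step (sketch; placeholder model names).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/student-model")             # placeholder
student = AutoModelForCausalLM.from_pretrained("org/student-model")  # placeholder
teacher = AutoModelForCausalLM.from_pretrained("org/teacher-model").eval()

prompt = tok("Explain KL divergence.", return_tensors="pt").input_ids
# 1) Sample on-policy from the student.
seq = student.generate(prompt, max_new_tokens=64, do_sample=True)

# 2) Score the same token ids under both models. This is the shared-tokenizer
#    constraint: the teacher's distribution must live over the same vocabulary.
s_logp = F.log_softmax(student(seq).logits[:, :-1], dim=-1)
with torch.no_grad():
    t_logp = F.log_softmax(teacher(seq).logits[:, :-1], dim=-1)

# 3) Reverse KL(student || teacher) over sampled positions (prompt positions
#    included here for brevity; real code would mask them out).
kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
kl.backward()
```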
Edward Beeching retweeted
Loubna Ben Allal @LoubnaBenAllal1
SmolLM3 full training and evaluation code is now live, along with 100+ intermediate checkpoints:
✓ Pretraining scripts (nanotron)
✓ Post-training code SFT + APO (TRL/alignment-handbook)
✓ Evaluation scripts to reproduce all reported metrics
github.com/huggingface/sm…
All Apache 2.0
9 replies · 69 retweets · 459 likes · 17.2K views
Edward Beeching retweeted
Mishig Davaadorj @mishig25
It is cool to be capable. It is cool to know shit. That's why the HF team is open-sourcing not just the model, but the training code and datasets too. Learn. Build. Make it your own. github.com/huggingface/sm…
12 replies · 74 retweets · 618 likes · 35.5K views
Edward Beeching @edwardbeeching
I had the opportunity to spend the last month building an open-source, state-of-the-art, dual-mode reasoning model at the 3B scale, building on the amazing work of @huggingface's Pretraining team. It was tough, but we managed to get on the Pareto front with the Qwen3 models. 🧵
2 replies · 5 retweets · 45 likes · 11K views
Casper Hansen @casper_hansen_
What I wanted from Open-R1: recipe to reproduce R1-Zero What I got from Open-R1: an SFT distillation dataset used to match distilled models of R1 How come this is so misaligned? Why not try to leap to R1-Zero?
6 replies · 3 retweets · 48 likes · 4.9K views
Edward Beeching retweeted
Aritra 🤗 @ariG23498
I just noticed that we have a survey of Vision Language Models every year.
2023: huggingface.co/blog/vision_la… (@RisingSayak and @alaradirik)
2024: huggingface.co/blog/vlms (@mervenoyann and @edwardbeeching)
2025: huggingface.co/blog/vlms-2025 (launched today)
This is a nice time to read through all of them and figure out what worked and what did not. It also helps us shape our thoughts on VLMs and the trends that came into existence. I am pretty sure I would read them all in that order to discover something really interesting! (Weekend sorted)
2 replies · 11 retweets · 69 likes · 4.9K views
Edward Beeching retweeted
Casper Hansen @casper_hansen_
TRL now handles multi-node training with vLLM for GRPO🤯
3 replies · 22 retweets · 236 likes · 26.1K views
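For anyone who wants to try it, a rough sketch of the setup, assuming a recent TRL version: generation runs on a dedicated vLLM server node and the training nodes point at it. The model, dataset, and host address are placeholders, and parameter names such as `vllm_server_host` may differ between TRL versions, so check the docs for your install.

```python
# GRPO with a remote vLLM generation server (sketch; names are assumptions).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# On the generation node you would first start the server, e.g.:
#   trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 20 characters.
    return [-abs(20 - len(c)) for c in completions]

args = GRPOConfig(
    output_dir="grpo-vllm-demo",
    use_vllm=True,                # offload generation to vLLM
    vllm_server_host="10.0.0.2",  # assumed address of the vLLM node
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```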
Daniel Han @danielhanchen
@natolambert Another question I had was about your RLHF book rlhfbook.com/c/11-policy-gr… I'm unsure why TRL is different, since github.com/huggingface/tr… Before, it was the same: avg loss per reward, then taking the global mean. TRL now does the global mean. I showed an example of how the loss is different:
2 replies · 6 retweets · 50 likes · 6K views
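To spell out the difference with toy numbers, the sketch below contrasts the two aggregations: a per-sequence mean followed by a mean over sequences, versus one global mean over all unpadded tokens. These are illustrative tensors, not TRL's actual code.

```python
import torch

per_token_loss = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                               [2.0, 2.0, 0.0, 0.0]])  # 2 sequences, padded
mask = torch.tensor([[1., 1., 1., 1.],
                     [1., 1., 0., 0.]])                # real-token mask

# (a) Per-sequence mean, then mean over sequences ("avg loss per reward"):
per_seq = (per_token_loss * mask).sum(-1) / mask.sum(-1)  # [1.0, 2.0]
loss_a = per_seq.mean()                                   # 1.5

# (b) Single global mean over all unpadded tokens:
loss_b = (per_token_loss * mask).sum() / mask.sum()       # 8/6 ≈ 1.333

# The two differ whenever sequence lengths differ: (a) weights every
# *sequence* equally, (b) weights every *token* equally.
print(loss_a.item(), loss_b.item())
```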
Nathan Lambert @natolambert
Does anyone have an intuition or ablation on applying the KL penalty in the loss directly rather than when the reward is computed? How does this change learning?

Normal: rewards = rewards - self.beta * per_token_kl
GRPO impl: per_token_loss = pg_loss_max + self.beta * per_token_kl
17 replies · 25 retweets · 339 likes · 68.6K views
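The two placements, side by side in toy code; shapes and variable names are illustrative, but the two formulas follow the post.

```python
import torch

beta = 0.04
rewards = torch.rand(4, 16)        # per-token rewards
per_token_kl = torch.rand(4, 16)   # per-token KL(policy || ref) estimate
pg_loss_max = torch.rand(4, 16)    # clipped policy-gradient loss per token

# (1) Penalty folded into the reward before advantages are computed: the KL
#     term then passes through return/advantage processing like any reward.
shaped_rewards = rewards - beta * per_token_kl

# (2) GRPO-style: penalty added directly to the loss, acting as a separate
#     regularizer whose gradient bypasses the advantage computation.
per_token_loss = pg_loss_max + beta * per_token_kl
loss = per_token_loss.mean()
```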
Edward Beeching retweeted
Lewis Tunstall @_lewtun
I've just pushed the decontaminated subset of CodeForces-CoTs we used to train the OlympicCoder models. Checked for 8-gram overlap against:
- AIME24 & AIME25
- GPQA Diamond
- MATH-500
- LiveCodeBench
- IOI24
Now you can train models to help pass your next LeetCode interview at FAANG ;) Link in the next post because X makes us play silly games 🤷‍♂️
4 replies · 7 retweets · 69 likes · 5.9K views
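An 8-gram overlap check of the kind described above fits in a few lines. This is a sketch: whitespace tokenization is an assumption, and the real pipeline may normalize text differently.

```python
# Minimal n-gram decontamination check (sketch).
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample, benchmark_items, n=8):
    grams = ngrams(sample, n)
    # Flag the training sample if any n-gram also appears in a benchmark item.
    return any(grams & ngrams(item, n) for item in benchmark_items)

# Toy usage with placeholder strings:
benchmark_items = ["let x be a positive integer such that x squared equals 4"]
train_set = [
    "an unrelated training solution about graphs",
    "let x be a positive integer such that x squared equals 4 then x is 2",
]
clean = [s for s in train_set if not is_contaminated(s, benchmark_items)]
# -> keeps only the unrelated sample
```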
Edward Beeching retweeted
Aviral Kumar @aviral_kumar2
A lot of work focuses on test-time scaling. But we aren't scaling it optimally; simply training a long CoT doesn't mean we use it well. My students developed "v0" of a paradigm to do this optimally by running RL with dense rewards = minimizing regret over long CoT episodes. 🧵⬇️ cohenqu.github.io/mrt.github.io/
3 replies · 32 retweets · 200 likes · 16.7K views
Edward Beeching retweeted
Lewis Tunstall @_lewtun
We took a deep dive into the Gemma 3 tech report today at Hugging Face and recorded the discussion :) youtu.be/GAHCXXKmIT8 It's very cool to see Google baking so many post-training methods into a single model: from online knowledge distillation to RL with model merging RMs and policy. Enjoy!
1 reply · 5 retweets · 63 likes · 9.6K views