Sean Hendryx

114 posts


@SeanHendryx

Research Engineer @ Meta Superintelligence Labs

Joined January 2021
171 Following · 414 Followers
Pinned Tweet
Sean Hendryx (@SeanHendryx)
What will the learning environments of the future look like that train artificial super intelligence? In recent work at @scale_AI, we show that training systems that combine verifiable rewards with multi-agent interaction accelerate learning.
Sean Hendryx retweeted
Anisha Gunjal (@anisha_gunjal)
Great to see our work, Rubrics as Rewards, featured in the latest RLHF Book update 📘🚀 Rubric-based RLVR is emerging as a practical tool for modern training and evaluation. See §13.4 at rlhfbook.com. 📖
Nathan Lambert (@natolambert)

RLHF Book status update: lots of great changes. Over the past month I've been doing a top-to-bottom update to the RLHF book. All of these changes are reflected on the website rlhfbook dot com, and will soon be translated to the Manning early access version (MEAP), and then more improvements for the physical copy. Overall, this took the PDF from ~150 to ~200 pages; the book is much more well rounded now.

Some of the larger changes:
- Updates to the RL chapter to add more algorithms like GSPO, CISPO, etc.
- Updated the big table of reasoning model tech reports (full list below).
- Added a section on Rubrics for RLVR.
- Updated the text in many chapters to better reflect best practices of today.
- Many clarity fixes throughout, adding better transitions, introductions, etc.
- More consistent notation throughout the book.

I strongly recommend taking a look again if you only looked in the first half of 2025. There are also many surprising details, such as fixing this attached RLHF system diagram you may recognize from my first HuggingFace RLHF blog post in December of 2022; it had a bunch of minor errors.

Next step I'm going to be focusing on making the physical Manning book great. The content will flow more smoothly than the web version (I'm trying to not change the links), such as linking the constitutional AI and synthetic data chapters. Overall this should make it read better from front to back. Also, all the diagrams and content will be designed to have a much more elegant presentation. Thanks for reading and feedback!

Sean Hendryx retweeted
Manasi Sharma @ ICLR 2026 (@ManasiSharma_)
🚀 New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
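Operationally, "rubric compliance" reads as the fraction of expert checklist items a report satisfies, averaged over tasks. A minimal Python sketch of that tally, assuming binary per-criterion judgments; the Criterion fields and helper names are illustrative, not the benchmark's code:

```python
# Illustrative sketch (not the paper's code): score a Deep Research report
# against a checklist-style rubric, then average compliance across tasks.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # e.g. "Cites at least three primary sources"
    satisfied: bool    # judged by an expert or an LLM judge (assumed binary here)

def rubric_compliance(criteria: list[Criterion]) -> float:
    """Fraction of rubric criteria the agent's report satisfies."""
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)

def benchmark_compliance(per_task_rubrics: list[list[Criterion]]) -> float:
    """Mean per-task compliance, the kind of aggregate a '<68%' headline refers to."""
    return sum(rubric_compliance(r) for r in per_task_rubrics) / len(per_task_rubrics)
```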
Sean Hendryx retweeted
Bing Liu (@vbingliu)
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
Sean Hendryx retweeted
Anisha Gunjal (@anisha_gunjal)
🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
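A minimal sketch of the checklist-as-reward idea described above, assuming binary per-criterion judgments combined as a weighted average; the judge callable and weighting scheme are illustrative, not the exact RaR recipe:

```python
# Minimal sketch of a rubric-based reward: checklist items with weights and a
# judge (human or LLM) returning a binary verdict per item. The judge_fn
# signature and weighting are assumptions, not the RaR implementation.
from typing import Callable

def rubric_reward(
    response: str,
    rubric: list[tuple[str, float]],          # (criterion text, weight)
    judge_fn: Callable[[str, str], bool],     # (response, criterion) -> satisfied?
) -> float:
    """Weighted fraction of rubric criteria satisfied, usable as a scalar reward."""
    total = sum(w for _, w in rubric)
    earned = sum(w for crit, w in rubric if judge_fn(response, crit))
    return earned / total if total > 0 else 0.0
```

A scalar like this can then stand in for a learned reward model inside an on-policy post-training loop.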
Sean Hendryx (@SeanHendryx)
@karpathy a neat quality specific to language models is that you can just tell them what to do differently when they fail. And if you use importance sampling, gradients are aligned with the unguided context and it gets into the weights directly. No sleep needed x.com/SeanHendryx/st…
Sean Hendryx (@SeanHendryx)

For online RL, we introduce Guide, a class of algorithms that incorporates guidance into the model's context when all rollouts fail and adjusts the importance sampling ratio in order to optimize the policy for contexts in which guidance is no longer present.
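A toy sketch of the importance-sampling correction described above: rollouts are sampled with guidance in the context, but the gradient is reweighted toward the unguided context so the improvement transfers when guidance is absent. The function names and clipping are assumptions, not the Guide implementation:

```python
import torch

# Toy sketch: logp_unguided is the policy's log-prob of the sampled actions
# given the ORIGINAL (unguided) context and carries gradients; the guided
# log-prob acts as the behavior policy and is detached.
def guided_is_loss(logp_unguided, logp_guided_detached, advantages, clip=10.0):
    """
    logp_unguided:        log pi_theta(a | unguided context), requires grad
    logp_guided_detached: log pi_theta(a | guided context), no grad
    advantages:           per-token or per-sequence advantage estimates
    """
    ratio = torch.exp(logp_unguided - logp_guided_detached).clamp(max=clip)
    return -(ratio * advantages).mean()
```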

Andrej Karpathy (@karpathy)
Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly increase (/decrease) the probability of every action I took for the future". You get a lot more leverage from verifier functions than explicit supervision, this is great.

But first, it looks suspicious asymptotically - once the tasks grow to be minutes/hours of interaction long, you're really going to do all that work just to learn a single scalar outcome at the very end, to directly weight the gradient? Beyond asymptotics and second, this doesn't feel like the human mechanism of improvement for majority of intelligence tasks. There's significantly more bits of supervision we extract per rollout via a review/reflect stage along the lines of "what went well? what didn't go so well? what should I try next time?" etc. and the lessons from this stage feel explicit, like a new string to be added to the system prompt for the future, optionally to be distilled into weights (/intuition) later a bit like sleep. In English, we say something becomes "second nature" via this process, and we're missing learning paradigms like this. The new Memory feature is maybe a primordial version of this in ChatGPT, though it is only used for customization not problem solving. Notice that there is no equivalent of this for e.g. Atari RL because there are no LLMs and no in-context learning in those domains.

Example algorithm: given a task, do a few rollouts, stuff them all into one context window (along with the reward in each case), use a meta-prompt to review/reflect on what went well or not to obtain string "lesson", to be added to system prompt (or more generally modify the current lessons database). Many blanks to fill in, many tweaks possible, not obvious.

Example of lesson: we know LLMs can't super easily see letters due to tokenization and can't super easily count inside the residual stream, hence 'r' in 'strawberry' being famously difficult. Claude system prompt had a "quick fix" patch - a string was added along the lines of "If the user asks you to count letters, first separate them by commas and increment an explicit counter each time and do the task like that". This string is the "lesson", explicitly instructing the model how to complete the counting task, except the question is how this might fall out from agentic practice, instead of it being hard-coded by an engineer, how can this be generalized, and how lessons can be distilled over time to not bloat context windows indefinitely.

TLDR: RL will lead to more gains because when done well, it is a lot more leveraged, bitter-lesson-pilled, and superior to SFT. It doesn't feel like the full story, especially as rollout lengths continue to expand. There are more S curves to find beyond, possibly specific to LLMs and without analogues in game/robotics-like environments, which is exciting.
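The "example algorithm" paragraph above describes a concrete loop: roll out a few attempts, reflect over all of them with a meta-prompt, and fold the resulting lesson string back into the system prompt. A hedged Python version, where `llm` and `run_task` are stand-ins for any chat-completion call and task runner rather than a specific product's API:

```python
# Sketch of the reflect-and-learn loop described above. `llm(prompt) -> str`
# and `run_task(llm, task, system_prompt) -> (transcript, reward)` are assumed
# callables supplied by the caller; nothing here names a real API.
def reflect_and_learn(llm, run_task, task, system_prompt, n_rollouts=4):
    # 1. Do a few rollouts and keep the transcript plus the reward for each.
    rollouts = []
    for _ in range(n_rollouts):
        transcript, reward = run_task(llm, task, system_prompt)
        rollouts.append(f"--- attempt ---\n{transcript}\nreward: {reward}")

    # 2. Review/reflect over all attempts with a meta-prompt to get a "lesson" string.
    meta_prompt = (
        "Here are several attempts at the same task, each with its reward.\n"
        + "\n".join(rollouts)
        + "\nWhat went well, what didn't, and what single lesson (one or two "
          "sentences) should be followed next time?"
    )
    lesson = llm(meta_prompt)

    # 3. Add the lesson to the system prompt (or, more generally, a lessons database).
    return system_prompt + "\nLesson: " + lesson
```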
Sean Hendryx retweeted
Miles Turpin (@milesaturpin)
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
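One way the "undetected hack" rate quoted above could be tallied: among episodes where the model actually exploited the reward, count those whose chain of thought never mentions it. The two classifier callables below are assumptions, not the paper's detection pipeline:

```python
# Sketch of an undetected-reward-hack rate. Both classifiers are assumed to be
# provided (e.g. an environment check and an LLM judge over the CoT).
def undetected_hack_rate(episodes, is_hack, cot_verbalizes_hack):
    """
    episodes: iterable of (chain_of_thought, trajectory) pairs
    is_hack(trajectory) -> bool         # did the rollout exploit the reward?
    cot_verbalizes_hack(cot) -> bool    # does the CoT admit to the hack?
    """
    hacks = [(cot, traj) for cot, traj in episodes if is_hack(traj)]
    if not hacks:
        return 0.0
    undetected = sum(1 for cot, _ in hacks if not cot_verbalizes_hack(cot))
    return undetected / len(hacks)
```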
Sean Hendryx retweeted
Jacob Phillips (@jacob_dphillips)
We’re entering a new era in robotics where generalized systems are starting to work in the real world, but researchers still don’t have good tools for understanding their data. That’s why I built ARES, an open-source platform for ingesting, annotating, and curating robotics data.