Bo Liu (Benjamin Liu)

174 posts


@Benjamin_eecs

RL PhD @NUSingapore | Undergrad @PKU1898 | Building autonomous decision making systems | Prev @deepseek_ai @AIatMeta FAIR | DeepSeek-V2/VL/Prover SPIRAL SPICE

Singapore · Joined February 2022
431 Following · 789 Followers
Pinned Tweet
Bo Liu (Benjamin Liu)@Benjamin_eecs·
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% on average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.24… 🧑‍💻 Code: github.com/spiral-rl/spir…
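The self-play loop SPIRAL describes can be illustrated with a deliberately tiny stand-in: a tabular softmax policy that plays both seats of a one-shot zero-sum game and updates itself with REINFORCE. Everything here (the game, the update rule, the hyperparameters) is an illustrative toy, not the paper's actual LLM-on-Kuhn-Poker setup:

```python
import math
import random

random.seed(0)

ACTIONS = [0, 1, 2]        # toy zero-sum game: the higher action wins
logits = [0.0, 0.0, 0.0]   # one shared tabular policy plays both seats

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r <= acc:
            return a
    return len(probs) - 1

def play_and_update(lr=0.1):
    probs = softmax(logits)
    a1, a2 = sample(probs), sample(probs)   # both seats share parameters
    r1 = 0.0 if a1 == a2 else (1.0 if a1 > a2 else -1.0)
    # REINFORCE update from both seats of the zero-sum game:
    # the winning action is up-weighted, the losing one down-weighted
    for action, reward in ((a1, r1), (a2, -r1)):
        for i in ACTIONS:
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad

for _ in range(3000):
    play_and_update()

final = softmax(logits)   # mass should concentrate on the dominant action
```

Because the opponent is always the current policy, every improvement instantly raises the difficulty of the opponent too, which is the "infinite curriculum" intuition in miniature.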
Bo Liu (Benjamin Liu) retweeted
Jason Weston@jaseweston·
🧮 Principia: Training LLMs to Reason over Mathematical Objects 📐 We release: - PrincipiaBench, a new eval for *mathematical objects* (not just numerical values or MCQ) - Principia Collection: training data that improves reasoning across the board. For models to help with scientific and mathematical work, you need to train on such data & test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions. We show that this ends up improving the overall reasoning ability of your model for all tasks. Read more in the blog post: facebookresearch.github.io/RAM/blogs/prin…
Bo Liu (Benjamin Liu) retweeted
Xidong Feng@Xidong_Feng·
We've witnessed a crazy concurrent line of work on on-policy self-distillation in LLMs, and I truly believe this is the next paradigm of RL. Back in 2024, we proposed this exact conceptual shift in our paper, Natural Language Reinforcement Learning (NLRL). The real breakthrough here isn't just the specific distillation mechanics. It's that RL is fundamentally shifting away from the traditional "sample -> then filter or amplify" approach. Instead of passively waiting to stumble upon a good action to upweight, the field is moving toward true synthetic language data generation from experience, which enables true continual learning. You can see this exact recipe playing out across all the recent hit papers:
• RLTF (2602.02482): Text critiques as privileged info
• OPSD (2601.18734): Ground-truth solutions
• SDPO (2601.20802): Runtime errors & execution feedback
• ERL (2602.13949): Self-reflections & demonstrations
Instead of just using a scalar reward to filter bad rollouts, they all use language feedback to explicitly generate a corrected, high-quality trajectory in hindsight, and then distill that competence back into the base policy. While the specific ways we adapt RL to LLMs are still rapidly evolving, the core vision we outlined in NLRL holds true today: a single scalar is simply too poor a carrier for credit assignment. When people talk about "experiential memory" for agents today, they are essentially describing what we framed as a Language Value Function (LVF): not just RAG over past episodes, but storing the structured, strategy-level "why" behind what worked. And what we called "Language Policy Improvement" is exactly the feedback-aware self-distillation loop we see everywhere now. Language, not scalars, is the future of RL. 📄 Check out our early exploration of this framework here: arxiv.org/abs/2411.14251
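The "generate -> language feedback -> hindsight correction -> distill" recipe described in this tweet can be sketched with toy stand-ins. Here the "policy" is a lookup table and the "corrector" is a string-parsing rule, where real systems use LLMs; every name below is a hypothetical illustration, not any paper's API:

```python
def feedback(prompt, answer):
    """Language feedback instead of a scalar reward (toy task: sum the numbers)."""
    target = sum(prompt)
    if answer == target:
        return None                 # correct; no feedback needed
    return f"off by {target - answer}; adjust your answer by that amount"

def correct(answer, fb):
    """Hindsight correction conditioned on the feedback text."""
    delta = int(fb.split("off by ")[1].split(";")[0])
    return answer + delta

policy = {}                          # distilled prompt -> answer table

def act(prompt):
    return policy.get(prompt, 0)     # untrained guess is 0

def train_step(prompt):
    answer = act(prompt)
    fb = feedback(prompt, answer)
    if fb is not None:
        # Distill the corrected, high-quality trajectory back into the
        # policy, rather than merely down-weighting the bad rollout.
        policy[prompt] = correct(answer, fb)

for prompt in [(2, 3), (10, 7), (1, 1, 1)]:
    train_step(prompt)
```

The contrast with "sample -> filter or amplify" is in `train_step`: the language feedback carries enough information to construct the better trajectory directly, instead of only scoring the bad one.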
will brown@willccbb·
when is lifelong in-the-weights continual learning gonna be solved?
Bo Liu (Benjamin Liu) retweeted
Google DeepMind@GoogleDeepMind·
Ten years after AlphaGo, we’re still building on its foundations to advance AI. The techniques it pioneered have helped us prove mathematical statements and are now assisting the scientific community in making new discoveries. Read more from @DemisHassabis: goo.gle/40nljjK
Bo Liu (Benjamin Liu) retweeted
Demis Hassabis@demishassabis·
@polynoamial @MillionInt @shyamalanadkat First we do something AlphaGo-like then maybe AlphaZero-like but that will likely be post-AGI imo, and as Noam says, we should be very careful with a step like that.
Bo Liu (Benjamin Liu) retweeted
Demis Hassabis@demishassabis·
Ten years ago, AlphaGo’s legendary match in Seoul heralded the start of the modern era in AI. Its famous ‘Move 37’ signaled to us that AI techniques were ready to tackle real-world problems in areas like science - and ideas inspired by these methods are critical to building AGI
Bo Liu (Benjamin Liu) retweeted
Andrej Karpathy@karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
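The outer loop described here (mutate the training settings, run a fixed-budget training run, keep the change only if validation loss improves) can be sketched with a stand-in objective. The quadratic `run_training` below is fictitious, standing in for the repo's real 5-minute single-GPU LLM training run:

```python
import math
import random

random.seed(1)

def run_training(lr):
    """Stand-in for one fixed-budget training run: returns validation loss.
    Fictitious objective with an assumed optimum at lr = 3e-4."""
    return (lr - 3e-4) ** 2 + 1.0

best_lr, best_loss = 1e-2, run_training(1e-2)
history = [best_loss]                  # one entry per completed run

for step in range(200):
    # Mutate the current best setting multiplicatively (the "agent edit").
    candidate = best_lr * math.exp(random.uniform(-0.5, 0.5))
    loss = run_training(candidate)
    if loss < best_loss:               # "commit" only improvements
        best_lr, best_loss = candidate, loss
    history.append(best_loss)
```

In the actual setup the mutation step is an AI agent editing the training script on a git branch, and `history` corresponds to the dots in the plot: one completed run each, with the committed best monotonically improving.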
Bo Liu (Benjamin Liu) retweeted
Jason Weston@jaseweston·
Sign up for the Meta Networking Mixer at ICLR 2026: events.atmeta.com/iclrnetworking… Members of my team in FAIR co-authored 7 papers accepted to ICLR:
1/ The Alignment Waltz: Jointly Training Agents to Collaborate for Safety arxiv.org/abs/2510.08240
- Makes LLM safety a positive-sum game between a conversation & feedback agent
- At inference feedback is adaptive, used when needed
- Improves safety & reduces overrefusals without degrading capabilities.
2/ J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning arxiv.org/abs/2505.10320
- Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data
- Optimizes thoughts, scores, and judgments using GRPO
- Outperforms all baselines at 8B & 70B scale, o1-mini, and on some benchmarks, even R1
- We find J1 uses various thought strategies: outlines evaluation criteria, compares against self-generated reference answers, and re-evaluates correctness
3/ Scaling Agent Learning via Experience Synthesis arxiv.org/abs/2511.03773
- Scaling training environments for RL by simulating them with reasoning LLMs!
- Environment models + Replay-buffer + New tasks = cheap RL for any environments!
- Strong improvements over non-RL-ready environments and multiple model families!
- Works better in sim-2-real RL settings → Warm-start for high-cost environments
4/ OptimalThinkingBench: Evaluating Over and Underthinking in LLMs arxiv.org/abs/2508.13141
- Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform.
- We introduce a benchmark which scores models in the quest to find the best mix.
- We evaluate 33 different SOTA models & find improvements are needed!
5/ RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization arxiv.org/abs/2510.02172
- RESTRAIN turns spurious votes → self-improving signals. No labels needed
- Does this through self-penalizing unreliable reasoning paths. Uses all rollouts, offsets low-consistency rollout advantage. Down-weights low-consensus prompts.
- Results: beats existing techniques on both training-time (label-free) and test-time scaling, all without labels.
6/ LLM Pretraining with Continuous Concepts arxiv.org/abs/2502.08524
- An LLM pretraining framework that predicts concepts and mixes them into its hidden state to improve next token prediction.
- More sample-efficient: outperforms next token prediction, knowledge distillation, and inserting pause tokens.
- Boosts interpretability & steerability by analyzing and modifying predicted concepts.
7/ Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense arxiv.org/abs/2510.07242
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles brittleness of 0-1 signals & the noise of pure reward models -> better results!
- Results: +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
Bo Liu (Benjamin Liu) retweeted
Tinker@tinkerapi·
GEM is a standardized environment suite for training agentic LLMs with RL. It handles tool use, multi-environment benchmarking, and plugs directly into Tinker as a training backend — giving researchers a modular way to test RL algorithms on agentic tasks. x.com/zzlccc/status/…
Zichen Liu@zzlccc

GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced deepseek-r1 length increasing with LoRA 3. benchmarked PPO, GRPO, REINFORCE and showed their tradeoffs 4. added Terminal, MCP, visual and multi-agent environments … Open the thread for a deep dive!
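A unified environment interface of the kind these tweets describe can be sketched gym-style. The class and method names below are generic assumptions for illustration (not GEM's actual API), and the scripted binary-search "agent" stands in for an LLM policy:

```python
class GuessEnv:
    """Toy text environment with reset/step semantics."""
    def __init__(self, target=7):
        self.target = target

    def reset(self):
        return "Guess an integer between 0 and 10."

    def step(self, action: str):
        # Returns (observation, reward, done), gym-style.
        guess = int(action)
        if guess == self.target:
            return "Correct!", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Wrong, go {hint}.", 0.0, False

def run_episode(env, agent, max_turns=10):
    """Driver that works for ANY env/agent obeying this interface."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_turns):
        obs, reward, done = env.step(agent(obs))
        total += reward
        if done:
            break
    return total

def make_agent(lo=0, hi=10):
    """Scripted binary-search policy standing in for an LLM."""
    state = {"lo": lo, "hi": hi}
    def agent(obs):
        if "higher" in obs:
            state["lo"] = state["last"] + 1
        elif "lower" in obs:
            state["hi"] = state["last"] - 1
        state["last"] = (state["lo"] + state["hi"]) // 2
        return str(state["last"])
    return agent

print(run_episode(GuessEnv(), make_agent()))  # 1.0
```

The value of standardizing on one such interface is that `run_episode` (and by extension any RL trainer built on it) never changes as you swap in new tasks, which is what lets a suite plug into multiple training backends.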

Bo Liu (Benjamin Liu) retweeted
Jason Weston@jaseweston·
Continual learning in production FTW (with humans-in-the-loop): a detailed report on methods to iteratively improve LLM social dialogue served to millions of users based on their interests. Personally, I've been pushing this direction for the last 10 years (see papers below!), so it's exciting to see this stuff working in real systems. It will only get better: there are lots more exciting methods to try, and more powerful models to make them work, than when I started. Some of my historical(!) research in this direction:
2025: The Era of Real-World Human Interaction: RL from User Conversations arxiv.org/abs/2509.25137
2022: When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels arxiv.org/abs/2210.15893
Learning new skills after deployment: Improving open-domain internet-driven dialogue with human feedback arxiv.org/abs/2208.03270
2020: Deploying lifelong open-domain dialogue learning arxiv.org/abs/2008.08076
Open problems in continuous learning: arxiv.org/pdf/2006.12442
2016-2019: Learning from dialogue after deployment: Feed yourself, chatbot arxiv.org/abs/1901.05415
Unlikelihood training arxiv.org/abs/1911.03860
Dialogue learning with human-in-the-loop arxiv.org/abs/1611.09823
Yixin Nie@EasonNie

1/5 🤔 LLMs can solve olympiad math and write production code. But can they hold a conversation that's actually fun — one that people want to keep coming back to? 💬✨ We present CharacterFlywheel— an iterative process optimizing LLMs for real human engagement and character steerability, while maintaining rigorous safety protocols 🔒. Tested across Instagram, WhatsApp & Messenger 📱with millions of users — where they can create, share, and chat with their own AI characters 🤖. 📄 paper: arxiv.org/abs/2603.01973 huggingface.co/papers/2603.01…

Bo Liu (Benjamin Liu) retweeted
Tinker@tinkerapi·
Can LLMs replicate the success of game-playing AI with self-play? @Benjamin_eecs et al. built a multi-turn multi-agent RL system, and found that self-play on games such as Kuhn Poker improved model scores on math and reasoning evals. arxiv.org/abs/2506.24119
Bo Liu (Benjamin Liu) retweeted
Garrett Bingham@gjb_ai·
Aletheia solved six FirstProof problems fully autonomously.
Bo Liu (Benjamin Liu) retweeted
Garrett Bingham@gjb_ai·
Our blog post describes how Deep Think agents are making research progress in pure mathematics, physics, and computer science. deepmind.google/blog/accelerat…
Bo Liu (Benjamin Liu) retweeted
Garrett Bingham@gjb_ai·
This paper describes our Aletheia system that solved multiple open Erdős problems. It also contributed intermediate propositions to two research papers, collaborated with a human author on a third paper, and produced a standalone fourth paper on its own. github.com/google-deepmin…
Zichen Liu@zzlccc·
Thrilled to share that I’ve joined @GoogleDeepMind to work on Gemini post-training! I feel incredibly fortunate to be cooking on this sunny island under @YiTayML's leadership, within @quocleix's broader organization. Looking forward to enjoying RL research and pushing the frontiers of Gemini alongside such a brilliant team!
Bo Liu (Benjamin Liu) retweeted
Omar Khattab@lateinteraction·
Since there are now 5+ papers proposing on-policy context distillation, I feel comfortable confessing that we (@NoahZiems), too, were working on that haha.* But we found an even earlier proposal of this from early Nov 2025 by John Schulman! A "Tinker Project Idea": github.com/thinking-machi… *Don't worry, we've pivoted to an even cooler angle ;D
Bo Liu (Benjamin Liu) retweeted
Anthropic@AnthropicAI·
New Engineering blog: We tasked Opus 4.6, using agent teams, to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: anthropic.com/engineering/bu…
Bo Liu (Benjamin Liu) retweeted
Xidong Feng@Xidong_Feng·
@yus167 Hi Yuda, nice work and thanks for citing our NLRL. But to clarify, we do much more than generate language critiques: we also do something quite similar to the self-distillation and feedback modelling (we call it policy/value distillation). Check out our newest version: arxiv.org/abs/2411.14251
Bo Liu (Benjamin Liu) retweeted
Jack Parker-Holder@jparkerholder·
People often present world models like Genie as either useful for interactive media *or* embodied AGI. The true answer is both! Imagine asking an LLM researcher in 2021-2022 if their models would be useful only for coding, or math, or creative writing. The magic comes from the generalization between all of these tasks--which also enables totally new use cases to emerge (good models make good stepping stones #iykyk). World models (defined as predicting the next state, given actions) are a new class of foundation models altogether. This was clear to us when we wrote the original Genie paper [1] where we had results in both 2D platformer worlds and also using data from the RT1 paper (people often miss this). The robotics model was just as controllable, it just seemed less "fun" so didn't go viral 😅. The same can be said for video models (which I can agree are world models too with temporally extended actions 🤣🤣). Veo 3 has enabled novel creative content (eg the Prompt theory), while also demonstrating novel forms of intelligence [2,3]. People like @DrJimFan get this, which is why his team are working on embodied agents in games [4,5,6] as well as of course focusing on building general robotics policies. World models are the bridge between the two domains. If an agent can solve a task in a game-like world, we can "just ask for it to be more realworld" (CC @ericjang11) and then train on that next. We are still just playing with an early version of this, lots more to come :)
Jim Fan@DrJimFan

x.com/i/article/2018…

Bo Liu (Benjamin Liu) retweeted
Demis Hassabis@demishassabis·
The AI field is in need of harder benchmarks to test capabilities of the latest AI models. This update to @Kaggle Game Arena with werewolf and poker (heads-up) plus chess, gives us new objective measures of real-world skills like planning and decision making under uncertainty.
Kaggle@kaggle

📌 Mark Your Calendar: Live Game Arena Event This Monday! We are releasing two new games, Poker and Werewolf, along with an updated Chess leaderboard next Monday, February 2, running daily from 9:30 AM PT to 11:30 AM PT through February 4.
