Bo Liu (Benjamin Liu)

174 posts


@Benjamin_eecs

RL PhD @NUSingapore | Undergrad @PKU1898 | Building autonomous decision making systems | Prev @deepseek_ai @AIatMeta FAIR | DeepSeek-V2/VL/Prover SPIRAL SPICE

Singapore · Joined February 2022
431 Following · 789 Followers
Pinned Tweet
Bo Liu (Benjamin Liu)@Benjamin_eecs·
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% on average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.24… 🧑‍💻 Code: github.com/spiral-rl/spir…
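The self-play loop SPIRAL describes can be illustrated with a deliberately tiny stand-in: a tabular softmax policy that plays both seats of a one-shot zero-sum game and updates itself with REINFORCE. Everything here (the game, the update rule, the hyperparameters) is an illustrative toy, not the paper's actual LLM-on-Kuhn-Poker setup:

```python
import math
import random

random.seed(0)

ACTIONS = [0, 1, 2]        # toy zero-sum game: the higher action wins
logits = [0.0, 0.0, 0.0]   # one shared tabular policy plays both seats

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r <= acc:
            return a
    return len(probs) - 1

def play_and_update(lr=0.1):
    probs = softmax(logits)
    a1, a2 = sample(probs), sample(probs)   # both seats share parameters
    r1 = 0.0 if a1 == a2 else (1.0 if a1 > a2 else -1.0)
    # REINFORCE update from both seats of the zero-sum game:
    # the winning action is up-weighted, the losing one down-weighted
    for action, reward in ((a1, r1), (a2, -r1)):
        for i in ACTIONS:
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad

for _ in range(3000):
    play_and_update()

final = softmax(logits)   # mass should concentrate on the dominant action
```

Because the opponent is always the current policy, every improvement instantly raises the difficulty of the opponent too, which is the "infinite curriculum" intuition in miniature.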
Bo Liu (Benjamin Liu) retweeted
Jason Weston@jaseweston·
🧮 Principia: Training LLMs to Reason over Mathematical Objects 📐 We release: - PrincipiaBench, a new eval for *mathematical objects* (not just numerical values or MCQ) - Principia Collection: training data that improves reasoning across the board. For models to help with scientific and mathematical work, you need to train on such data & test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions. We show that this ends up improving the overall reasoning ability of your model for all tasks. Read more in the blog post: facebookresearch.github.io/RAM/blogs/prin…
Bo Liu (Benjamin Liu) retweeted
Xidong Feng@Xidong_Feng·
We've witnessed a crazy concurrent line of work on on-policy self-distillation in LLMs, and I truly believe this is the next paradigm of RL. Back in 2024, we proposed this exact conceptual shift in our paper, Natural Language Reinforcement Learning (NLRL). The real breakthrough here isn't just the specific distillation mechanics. It's that RL is fundamentally shifting away from the traditional "sample -> then filter or amplify" approach. Instead of passively waiting to stumble upon a good action to upweight, the field is moving toward true synthetic language data generation from experience, which enables true continual learning. You can see this exact recipe playing out across all the recent hit papers:
• RLTF (2602.02482): Text critiques as privileged info
• OPSD (2601.18734): Ground-truth solutions
• SDPO (2601.20802): Runtime errors & execution feedback
• ERL (2602.13949): Self-reflections & demonstrations
Instead of just using a scalar reward to filter bad rollouts, they all use language feedback to explicitly generate a corrected, high-quality trajectory in hindsight, and then distill that competence back into the base policy. While the specific ways we adapt RL to LLMs are still rapidly evolving, the core vision we outlined in NLRL holds true today: a single scalar is simply too poor a carrier for credit assignment. When people talk about "experiential memory" for agents today, they are essentially describing what we framed as a Language Value Function (LVF): not just RAG over past episodes, but storing the structured, strategy-level "why" behind what worked. And what we called "Language Policy Improvement" is exactly the feedback-aware self-distillation loop we see everywhere now. Language, not scalars, is the future of RL. 📄 Check out our early exploration of this framework here: arxiv.org/abs/2411.14251
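The "generate -> language feedback -> hindsight correction -> distill" recipe described in this tweet can be sketched with toy stand-ins. Here the "policy" is a lookup table and the "corrector" is a string-parsing rule, where real systems use LLMs; every name below is a hypothetical illustration, not any paper's API:

```python
def feedback(prompt, answer):
    """Language feedback instead of a scalar reward (toy task: sum the numbers)."""
    target = sum(prompt)
    if answer == target:
        return None                 # correct; no feedback needed
    return f"off by {target - answer}; adjust your answer by that amount"

def correct(answer, fb):
    """Hindsight correction conditioned on the feedback text."""
    delta = int(fb.split("off by ")[1].split(";")[0])
    return answer + delta

policy = {}                          # distilled prompt -> answer table

def act(prompt):
    return policy.get(prompt, 0)     # untrained guess is 0

def train_step(prompt):
    answer = act(prompt)
    fb = feedback(prompt, answer)
    if fb is not None:
        # Distill the corrected, high-quality trajectory back into the
        # policy, rather than merely down-weighting the bad rollout.
        policy[prompt] = correct(answer, fb)

for prompt in [(2, 3), (10, 7), (1, 1, 1)]:
    train_step(prompt)
```

The contrast with "sample -> filter or amplify" is in `train_step`: the language feedback carries enough information to construct the better trajectory directly, instead of only scoring the bad one.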
will brown@willccbb·
when is lifelong in-the-weights continual learning gonna be solved?
Bo Liu (Benjamin Liu) retweeted
Google DeepMind@GoogleDeepMind·
Ten years after AlphaGo, we’re still building on its foundations to advance AI. The techniques it pioneered have helped us prove mathematical statements and are now assisting the scientific community in making new discoveries. Read more from @DemisHassabis: goo.gle/40nljjK
Bo Liu (Benjamin Liu) retweeted
Demis Hassabis@demishassabis·
@polynoamial @MillionInt @shyamalanadkat First we do something AlphaGo-like then maybe AlphaZero-like but that will likely be post-AGI imo, and as Noam says, we should be very careful with a step like that.
Bo Liu (Benjamin Liu) retweeted
Demis Hassabis@demishassabis·
Ten years ago, AlphaGo’s legendary match in Seoul heralded the start of the modern era in AI. Its famous ‘Move 37’ signaled to us that AI techniques were ready to tackle real-world problems in areas like science - and ideas inspired by these methods are critical to building AGI
Bo Liu (Benjamin Liu) retweeted
Andrej Karpathy@karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
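The outer loop described here (mutate the training settings, run a fixed-budget training run, keep the change only if validation loss improves) can be sketched with a stand-in objective. The quadratic `run_training` below is fictitious, standing in for the repo's real 5-minute single-GPU LLM training run:

```python
import math
import random

random.seed(1)

def run_training(lr):
    """Stand-in for one fixed-budget training run: returns validation loss.
    Fictitious objective with an assumed optimum at lr = 3e-4."""
    return (lr - 3e-4) ** 2 + 1.0

best_lr, best_loss = 1e-2, run_training(1e-2)
history = [best_loss]                  # one entry per completed run

for step in range(200):
    # Mutate the current best setting multiplicatively (the "agent edit").
    candidate = best_lr * math.exp(random.uniform(-0.5, 0.5))
    loss = run_training(candidate)
    if loss < best_loss:               # "commit" only improvements
        best_lr, best_loss = candidate, loss
    history.append(best_loss)
```

In the actual setup the mutation step is an AI agent editing the training script on a git branch, and `history` corresponds to the dots in the plot: one completed run each, with the committed best monotonically improving.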
Bo Liu (Benjamin Liu) retweeted
Jason Weston@jaseweston·
Sign up for the Meta Networking Mixer at ICLR 2026: events.atmeta.com/iclrnetworking… Members of my team in FAIR co-authored 7 papers accepted to ICLR:
1/ The Alignment Waltz: Jointly Training Agents to Collaborate for Safety arxiv.org/abs/2510.08240
- Makes LLM safety a positive-sum game between a conversation & feedback agent
- At inference feedback is adaptive, used when needed
- Improves safety & reduces overrefusals without degrading capabilities.
2/ J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning arxiv.org/abs/2505.10320
- Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data
- Optimizes thoughts, scores, and judgments using GRPO
- Outperforms all baselines at 8B & 70B scale, o1-mini, and on some benchmarks, even R1
- We find J1 uses various thought strategies: outlines evaluation criteria, compares against self-generated reference answers, and re-evaluates correctness
3/ Scaling Agent Learning via Experience Synthesis arxiv.org/abs/2511.03773
- Scaling training environments for RL by simulating them with reasoning LLMs!
- Environment models + Replay-buffer + New tasks = cheap RL for any environments!
- Strong improvements over non-RL-ready environments and multiple model families!
- Works better in sim-2-real RL settings → Warm-start for high-cost environments
4/ OptimalThinkingBench: Evaluating Over and Underthinking in LLMs arxiv.org/abs/2508.13141
- Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform.
- We introduce a benchmark which scores models in the quest to find the best mix.
- We evaluate 33 different SOTA models & find improvements are needed!
5/ RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization arxiv.org/abs/2510.02172
- RESTRAIN turns spurious votes → self-improving signals. No labels needed
- Does this through self-penalizing unreliable reasoning paths. Uses all rollouts, offsets low-consistency rollout advantage. Down-weights low-consensus prompts.
- Results: beats existing techniques on both training-time (label-free) and test-time scaling, all without labels.
6/ LLM Pretraining with Continuous Concepts arxiv.org/abs/2502.08524
- An LLM pretraining framework that predicts concepts and mixes them into its hidden state to improve next token prediction.
- More sample-efficient: outperforms next token prediction, knowledge distillation, and inserting pause tokens.
- Boosts interpretability & steerability by analyzing and modifying predicted concepts.
7/ Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense arxiv.org/abs/2510.07242
- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles brittleness of 0-1 signals & the noise of pure reward models -> better results!
- Results: +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks
Bo Liu (Benjamin Liu) retweeted
Tinker@tinkerapi·
GEM is a standardized environment suite for training agentic LLMs with RL. It handles tool use, multi-environment benchmarking, and plugs directly into Tinker as a training backend — giving researchers a modular way to test RL algorithms on agentic tasks. x.com/zzlccc/status/…
Zichen Liu@zzlccc

GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced deepseek-r1 length increasing with LoRA 3. benchmarked PPO, GRPO, REINFORCE and showed their tradeoffs 4. added Terminal, MCP, visual and multi-agent environments … Open the thread for a deep dive!
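A unified environment interface of the kind these tweets describe can be sketched gym-style. The class and method names below are generic assumptions for illustration (not GEM's actual API), and the scripted binary-search "agent" stands in for an LLM policy:

```python
class GuessEnv:
    """Toy text environment with reset/step semantics."""
    def __init__(self, target=7):
        self.target = target

    def reset(self):
        return "Guess an integer between 0 and 10."

    def step(self, action: str):
        # Returns (observation, reward, done), gym-style.
        guess = int(action)
        if guess == self.target:
            return "Correct!", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Wrong, go {hint}.", 0.0, False

def run_episode(env, agent, max_turns=10):
    """Driver that works for ANY env/agent obeying this interface."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_turns):
        obs, reward, done = env.step(agent(obs))
        total += reward
        if done:
            break
    return total

def make_agent(lo=0, hi=10):
    """Scripted binary-search policy standing in for an LLM."""
    state = {"lo": lo, "hi": hi}
    def agent(obs):
        if "higher" in obs:
            state["lo"] = state["last"] + 1
        elif "lower" in obs:
            state["hi"] = state["last"] - 1
        state["last"] = (state["lo"] + state["hi"]) // 2
        return str(state["last"])
    return agent

print(run_episode(GuessEnv(), make_agent()))  # 1.0
```

The value of standardizing on one such interface is that `run_episode` (and by extension any RL trainer built on it) never changes as you swap in new tasks, which is what lets a suite plug into multiple training backends.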

Bo Liu (Benjamin Liu) retweeted
Jason Weston@jaseweston·
Continual learning in production FTW (with humans-in-the-loop): a detailed report on methods to iteratively improve LLM social dialogue served to millions of users based on their interests. Personally, I've been pushing this direction for the last 10 years (see papers below!), so it's exciting to see this stuff working in real systems. It will only get better: there are lots more exciting methods to try, and more powerful models to make them work, than when I started. Some of my historical(!) research in this direction:
2025: The Era of Real-World Human Interaction: RL from User Conversations arxiv.org/abs/2509.25137
2022: When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels arxiv.org/abs/2210.15893
Learning new skills after deployment: Improving open-domain internet-driven dialogue with human feedback arxiv.org/abs/2208.03270
2020: Deploying lifelong open-domain dialogue learning arxiv.org/abs/2008.08076
Open problems in continuous learning: arxiv.org/pdf/2006.12442
2016-2019: Learning from dialogue after deployment: Feed yourself, chatbot arxiv.org/abs/1901.05415
Unlikelihood training arxiv.org/abs/1911.03860
Dialogue learning with human-in-the-loop arxiv.org/abs/1611.09823
Yixin Nie@EasonNie

1/5 🤔 LLMs can solve olympiad math and write production code. But can they hold a conversation that's actually fun — one that people want to keep coming back to? 💬✨ We present CharacterFlywheel— an iterative process optimizing LLMs for real human engagement and character steerability, while maintaining rigorous safety protocols 🔒. Tested across Instagram, WhatsApp & Messenger 📱with millions of users — where they can create, share, and chat with their own AI characters 🤖. 📄 paper: arxiv.org/abs/2603.01973 huggingface.co/papers/2603.01…

Bo Liu (Benjamin Liu) retweeted
Tinker@tinkerapi·
Can LLMs replicate the success of game-playing AI with self-play? @Benjamin_eecs et al. built a multi-turn multi-agent RL system, and found that self-play on games such as Kuhn Poker improved model scores on math and reasoning evals. arxiv.org/abs/2506.24119
Bo Liu (Benjamin Liu) retweeted
Garrett Bingham@gjb_ai·
Aletheia solved six FirstProof problems fully autonomously.
Bo Liu (Benjamin Liu) retweeted
Garrett Bingham@gjb_ai·
Our blog post describes how Deep Think agents are making research progress in pure mathematics, physics, and computer science. deepmind.google/blog/accelerat…
Bo Liu (Benjamin Liu) retweeted
Garrett Bingham@gjb_ai·
This paper describes our Aletheia system that solved multiple open Erdős problems. It also contributed intermediate propositions to two research papers, collaborated with a human author on a third paper, and produced a standalone fourth paper on its own. github.com/google-deepmin…
Zichen Liu@zzlccc·
Thrilled to share that I’ve joined @GoogleDeepMind to work on Gemini post-training! I feel incredibly fortunate to be cooking on this sunny island under @YiTayML's leadership, within @quocleix's broader organization. Looking forward to enjoying RL research and pushing the frontiers of Gemini alongside such a brilliant team!
Bo Liu (Benjamin Liu) retweeted
Omar Khattab@lateinteraction·
Since there are now 5+ papers proposing on-policy context distillation, I feel comfortable confessing that we (@NoahZiems), too, were working on that haha.* But we found an even earlier proposal of this from early Nov 2025 by John Schulman! A "Tinker Project Idea": github.com/thinking-machi… *Don't worry, we've pivoted to an even cooler angle ;D
Bo Liu (Benjamin Liu) retweeted
Anthropic@AnthropicAI·
New Engineering blog: We tasked Opus 4.6, using agent teams, to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: anthropic.com/engineering/bu…
Bo Liu (Benjamin Liu) retweeted
Xidong Feng@Xidong_Feng·
@yus167 Hi Yuda, nice work and thanks for citing our NLRL. But to clarify, we do much more than generate language critiques: we also do something quite similar to the self-distillation and feedback modelling (we call it policy/value distillation). Check out our newest version: arxiv.org/abs/2411.14251
Bo Liu (Benjamin Liu) retweeted
Jack Parker-Holder@jparkerholder·
People often present world models like Genie as either useful for interactive media *or* embodied AGI. The true answer is both! Imagine asking an LLM researcher in 2021-2022 if their models would be useful only for coding, or math, or creative writing. The magic comes from the generalization between all of these tasks--which also enables totally new use cases to emerge (good models make good stepping stones #iykyk). World models (defined as predicting the next state, given actions) are a new class of foundation models altogether. This was clear to us when we wrote the original Genie paper [1] where we had results in both 2D platformer worlds and also using data from the RT1 paper (people often miss this). The robotics model was just as controllable, it just seemed less "fun" so didn't go viral 😅. The same can be said for video models (which I can agree are world models too with temporally extended actions 🤣🤣). Veo 3 has enabled novel creative content (eg the Prompt theory), while also demonstrating novel forms of intelligence [2,3]. People like @DrJimFan get this, which is why his team are working on embodied agents in games [4,5,6] as well as of course focusing on building general robotics policies. World models are the bridge between the two domains. If an agent can solve a task in a game-like world, we can "just ask for it to be more realworld" (CC @ericjang11) and then train on that next. We are still just playing with an early version of this, lots more to come :)
Jim Fan@DrJimFan

x.com/i/article/2018…

Bo Liu (Benjamin Liu) retweeted
Demis Hassabis@demishassabis·
The AI field is in need of harder benchmarks to test capabilities of the latest AI models. This update to @Kaggle Game Arena with werewolf and poker (heads-up) plus chess, gives us new objective measures of real-world skills like planning and decision making under uncertainty.
Kaggle@kaggle

📌 Mark Your Calendar: Live Game Arena Event This Monday! We are releasing two new games, Poker and Werewolf, along with an updated Chess leaderboard next Monday, February 2, running daily from 9:30 AM PT to 11:30 AM PT through February 4.
