Jacob Beck
@jakeABeck
99 posts

Let’s get agents to learn fast! 🤖🔥 Research Scientist @Oracle | PhD @UniOfOxford, MS & BS @BrownUniversity, Predoc @Microsoft

Joined April 2014
112 Following · 364 Followers

Pinned Tweet
Jacob Beck @jakeABeck
Big news—our survey paper “A Tutorial on Meta-Reinforcement Learning” is officially published! Meta-RL = learning how to adapt through interaction. It embraces The Bitter Lesson: don’t hardcode agents—train them to adapt on their own arxiv.org/abs/2301.08028 🧵⬇️
Jacob Beck @jakeABeck
@istappar There are diminishing returns to more intelligence and the world is not that controllable. Here’s a more in-depth summary than Twitter allows: tinyurl.com/ASIisFine
Nils Öster @istappar
@jakeABeck Humanity as a whole could be considered super-intelligent, although it's not well coordinated. Compared to other life-forms, humanity has super-power. If something gets created that can self-improve until it's better than humanity, it would have more power than humanity.
Jacob Beck @jakeABeck
AI optimists “don’t have counter-arguments — they just call names.” — @So8res on a podcast with @ESYudkowsky + Sam Harris. Curious what you two think of these counter-arguments. And since @ylecun was called out by name, I’d love his take too…
Jacob Beck @jakeABeck
@istappar Is it less intelligent? Is that the real bottleneck? Cryptography is mathematically hard, chaotic systems are unpredictable, large-scale resource acquisition requires time and offers avenues for pushback. These are the bottlenecks, and AI has to play by the same constraints.
Nils Öster @istappar
@jakeABeck A super AI that is more capable than humanity combined would be more creative and more efficient than humanity at achieving its weird internal goals. North Korea is a lot less capable / intelligent than USA for example.
Jacob Beck @jakeABeck
@istappar If our adversaries had to pick a world with us in it or not in it, I’m pretty sure I know which one they would prefer.
Nils Öster @istappar
@jakeABeck I don't think North Korea would love to destroy the US, the North Korean regime would just like to continue its dictatorship.
Jacob Beck @jakeABeck
@istappar LLMs learning by doing is the domain of RL. Empirically, we have positive results on problems where learned reasoning chains are short, the AI already had a sense of what to do, and we already knew the answer, leaving us still recycling the same finite pool of digital content.
Jacob Beck @jakeABeck
@istappar A recent estimate (arxiv.org/abs/2211.04325) puts the median year we exhaust the supply of quality internet text at 2028. We’ve already trained on one internet’s worth of information, & replenishing it is hard. Industry’s bet is on “learning from experience”, but results are mixed
Jacob Beck @jakeABeck
Summer 2026 Internship — Oracle (Boston, MA)
My fantastic research team is hiring! Projects include a data scientist agent with in-context learning, evolutionary search (a la AlphaEvolve), AI feedback, and RL/ES.
Apply here! eeho.fa.us2.oraclecloud.com/hcmUI/Candidat…
📧 jake.beck@oracle.com
Jacob Beck @jakeABeck
@jsuarez @siddarthv66 Does this not count as “no, here’s why, and this is all arbitrary”? x.com/jakeabeck/stat…
[Quoted tweet] Jacob Beck @jakeABeck:
Where the experience came from feels like an odd concept boundary to me, and pragmatically the tools of offline RL look a lot more like those of RL than SL, but it’s hard to argue for the elegance of ultimately arbitrary definitions.
Joseph Suarez 🐡 @jsuarez
@siddarthv66 Given that most of the comments on the original were "you're wrong because I can't read," this is a comparative literary masterpiece. Not a single person said "no, I don't think interaction should be the cornerstone and here's why"
Joseph Suarez 🐡 @jsuarez
Offline RL is not RL. RL is about interaction. No interaction, no RL.
Jacob Beck @jakeABeck
@jsuarez Where the experience came from feels like an odd concept boundary to me, and pragmatically the tools of offline RL look a lot more like those of RL than SL, but it’s hard to argue for the elegance of ultimately arbitrary definitions.
Joseph Suarez 🐡 @jsuarez
@jakeABeck I'm not saying that's a bad problem to solve. I'm just drawing the line for RL around interaction itself
Jacob Beck @jakeABeck
@agarwl_ Good point. I would strengthen the claim to say that RL is precisely about learning from suboptimal experience, to distinguish it from imitation learning, and doing so usually entails learning from reward.
Siddarth Venkatraman @siddarthv66
MC advantage estimation (aka mean baseline) is literally a part of REINFORCE. This variance reduction is covered in like the second or third lecture of any deep RL class covering policy gradients. Clipped objective is equivalent to unclipped objective when fully on-policy. With a few async steps it’s not equivalent, but many REINFORCE trainers also use the clipped objective anyway (like the RLOO verl trainer)
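The equivalence claimed here can be checked directly. Below is a minimal numpy sketch (the function name and toy numbers are mine, not from any particular trainer): with a Monte Carlo mean baseline, the clipped surrogate coincides with the unclipped objective when fully on-policy, because the importance ratio is exactly 1 and clipping never activates.

```python
import numpy as np

def surrogate_loss(logp, logp_old, rewards, clip_eps=None):
    """Policy-gradient surrogate with a Monte Carlo mean baseline.

    clip_eps=None gives plain REINFORCE with a mean baseline;
    a float gives the PPO/GRPO-style clipped objective.
    """
    adv = rewards - rewards.mean()          # MC advantage: A_i = R_i - mean(R)
    ratio = np.exp(logp - logp_old)         # importance ratio pi / pi_old
    obj = ratio * adv
    if clip_eps is not None:
        clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
        obj = np.minimum(obj, clipped)      # pessimistic (clipped) surrogate
    return -obj.mean()

logp = np.array([-1.2, -0.7, -2.0])
rewards = np.array([1.0, 0.0, 2.0])

# Fully on-policy: logp_old == logp, so ratio == 1 and clipping is inactive.
on_policy_clipped = surrogate_loss(logp, logp, rewards, clip_eps=0.2)
on_policy_plain = surrogate_loss(logp, logp, rewards)
```

After a few async/off-policy steps (`logp_old != logp`) the two objectives diverge, which is why clipped trainers keep the clip even when nominally running REINFORCE.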
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
practical, modern GRPO tweaks as described in Meta's Code World Models paper
Jacob Beck @jakeABeck
@siddarthv66 @iScienceLuvr These papers still use clipping and Monte Carlo advantage estimation, which, given the number of ablations in papers like Dr. GRPO, DAPO, and Part I: Tricks or Traps, is probably necessary.
Siddarth Venkatraman @siddarthv66
@iScienceLuvr GRPO without advantage normalization, and without KL? That’s literally vanilla REINFORCE. Why can’t the LLM community just call it REINFORCE? This obsession with GRPO has to stop.
Jacob Beck @jakeABeck
@dwarkesh_sp @RichardSSutton LLMs can do continual RL and can train on (textual) MDPs! Here’s the thread from after Rich’s talk at RLC — with thoughts on LLMs, especially as applied to continual RL and meta-RL!
[Quoted tweet] Jacob Beck @jakeABeck:
Fantastic talk from @RichardSSutton at @RL_Conference with shoutouts to meta-RL. Honored to be called “more extreme” than Rich (by Rich) for taking the Bitter Lesson to heart and suggesting we meta-learn all the components he discussed. My Q: Aren’t LLMs already doing all this?
Dwarkesh Patel @dwarkesh_sp
.@RichardSSutton, father of reinforcement learning, doesn’t think LLMs are bitter-lesson-pilled.

My steel man of Richard’s position: we need some new architecture to enable continual (on-the-job) learning. And if we have continual learning, we don't need a special training phase - the agent just learns on-the-fly - like all humans, and indeed, like all animals. This new paradigm will render our current approach with LLMs obsolete.

I did my best to represent the view that LLMs will function as the foundation on which this experiential learning can happen. Some sparks flew.

0:00:00 – Are LLMs a dead-end?
0:13:51 – Do humans do imitation learning?
0:23:57 – The Era of Experience
0:34:25 – Current architectures generalize poorly out of distribution
0:42:17 – Surprises in the AI field
0:47:28 – Will The Bitter Lesson still apply after AGI?
0:54:35 – Succession to AI
Eliezer Yudkowsky ⏹️ @ESYudkowsky
"If Anyone Builds It, Everyone Dies" is now out. Read it today if you want to see with fresh eyes what's truly there, before others try to prime your brain to see something else instead!
OpenAI @OpenAI
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing for. openai.com/index/detectin…
Jacob Beck @jakeABeck
4️⃣ Superintelligence does not beget super-power. Some systems are inherently unpredictable, and prediction doesn’t guarantee control. Knowing how a hurricane forms doesn’t mean you can steer one.
Jacob Beck @jakeABeck
3️⃣ We already live alongside “misaligned superintelligences” in the form of adversarial nation states. North Korea would love to destroy the US, and yet here we are. The benefits of superintelligence are limited by real-world constraints.