TalkRL Podcast

672 posts


@TalkRLPodcast

TalkRL Podcast is All Reinforcement Learning, All the Time. Follow for interviews with brilliant folks from across the world of RL. Host @robinc. DMs open.

Vancouver BC Canada · Joined August 2019
95 Following · 3.1K Followers
TalkRL Podcast@TalkRLPodcast·
@kjaved_ @how_uhh @velocizapkar @danijarh Agree this seems to defeat the purpose of small-sample benchmarks! Still hoping there is another solution to this issue other than slow envs... seems a case of Goodhart's law
Khurram Javed@kjaved_·
You are right. It's perfectly fine to use fast environments for algorithmic development or for improving our understanding of different algorithms. My issue is using fast environments for public benchmarking because that inevitably leads to people using billions of samples to figure out benchmark-specific hacks to improve performance. For example, the best-performing Atari100k agents use things like cyclic gamma schedules, cyclic lambda schedules, cyclic resets, etc. None of these ideas generalize to other RL problems but they help on that specific benchmark. Finding such hacks is easy when one has access to billions of samples.
Khurram Javed@kjaved_·
Probably the highest-bang-for-buck direction in RL is developing algorithms that can discover useful temporal abstractions (e.g., Options) entirely from experience, learn models for these temporal abstractions, and plan with them in real time.
TalkRL Podcast@TalkRLPodcast·
@kjaved_ @how_uhh @velocizapkar @danijarh Purposely wanting slow envs due to wanting sample-efficient algos seems a bit like throwing the baby out with the bathwater. Then those with the most compute have an even bigger advantage. Why not just pay attention to sample complexity/hp sensitivity, and also have fast envs?
Khurram Javed@kjaved_·
Fast environments can be bad because people just brute-force through them with over-parameterized agents or with billions of samples. Working with such environments requires discipline to not do things that don't scale. The plethora of papers using multiple parallel environments to improve performance on benchmarks is an example of not having this discipline. If the environment can run faster than the agent, then it's not a good benchmark for the community because people will abuse it (see the big world hypothesis for why we don't want over-parameterized agents: openreview.net/pdf?id=Sv7Dazu…)
Danijar Hafner@danijarh·
Excited for this podcast episode with TalkRL to be out! 🎙️ We talk about the story behind Dreamer 4, the details of scalable world models, and the future of robotics (and beyond) 🤖🌏🚀 Thanks for the fun conversation, @TalkRLPodcast
TalkRL Podcast@TalkRLPodcast

E73: Danijar Hafner on Dreamer v4 @danijarh (ex-@GoogleDeepMind RS) on offline world models for safe robotics, Shortcut Forcing for fast diffusion video models, outperforming OpenAI’s VPT with 100× less data, his “APD” theory unifying exploration and empowerment, and more!

TalkRL Podcast@TalkRLPodcast·
E73: Danijar Hafner on Dreamer v4 @danijarh (ex-@GoogleDeepMind RS) on offline world models for safe robotics, Shortcut Forcing for fast diffusion video models, outperforming OpenAI’s VPT with 100× less data, his “APD” theory unifying exploration and empowerment, and more!
TalkRL Podcast@TalkRLPodcast·
@jesswhittles Always enjoy your writing! Is there some tension between integrity and diplomacy (which is often less about integrity than interests)?
Jess Whittlestone@jesswhittles·
Integrity is really important to me - it feels like the value that makes it possible to uphold all my other values. So I wrote something about what I think it means to do (AI safety) policy advocacy in a high-integrity way: jesswhittles.substack.com/p/integrity-in…
Jason Weston@jaseweston·
💃New Multi-Agent RL Method: WaltzRL💃 📝: arxiv.org/abs/2510.08240 - Makes LLM safety a positive-sum game between a conversation & feedback agent - At inference feedback is adaptive, used when needed -> Improves safety & reduces overrefusals without degrading capabilities! 🧵1/5
Kevin Patrick Murphy@sirbayes·
I agree with @karpathy 's take here. The interview between @RichardSSutton and @dwarkesh_sp was interesting, but I think at times there was a communication gap due to some misunderstandings. I would say that the current LLM training setup is very similar to the classic model-free RL setup, except that with LLMs: (1) the policy is warm-started from a supervised model (no de-novo, self-directed learning); (2) there is a train/test distinction (no continual learning); (3) most of the observation stream comes from human words, which already "carve nature at its joints", bypassing the harder problem of learning useful abstractions from raw sensorimotor streams; (4) when using multimodal models, the perceptual encoder is usually pre-trained and frozen, and often relies on a lot of human engineering (eg contrastive losses, or pixel-prediction losses) to come up with a good set of (soft) tokens. Most of the interview seemed to focus on issue #1. However, the discussion seemed confused here due to the fact that LLMs are both a world model (predict what humans would typically say) and a policy (predict what the agent should do). Obviously the model from the supervised pretraining stage is not action-conditioned, so Sutton does not want to call it a WM - but it is a predictor of future observations given the past, so it's like a WM that marginalizes over actions (resulting in a mixture). The WM is then converted into a (goal-conditioned) policy using IFT (imitation learning) and then improved with RLFT, which further confuses the discussion. In current practice, the RLFT stage mostly just uses human-provided reasoning tasks, which are bandit problems that do not involve interacting with an environment. But there is a recent move towards true multi-step RL, where LLMs do learn from external environments, as in classic RL. This fact was not emphasized enough in the interview, IMHO.
Andrej argues that warm-starting is a practical alternative to evolution's outer meta-learning loop, and I agree, so I don't have a problem with #1. But I do agree with Sutton's criticisms #2-#4. In particular, I expect a lot of future progress to come from continual RL applied to multimodal problems (eg. visual GUI-using agents) in non-stationary multi-agent environments (e.g., e-commerce or embodied AI), where the agent learns its own abstractions over time (eg creating tool libraries), it learns both a (goal agnostic) world model and a (goal conditioned) policy (so it can do decision time planning), and both kinds of model become semi-parametric (eg. combining memories and ICL with gradient-based weight updates). Future agents will not just be a frozen "omni-transformer", consuming and generating tokens, they will be heterogeneous adaptive systems, with many different specialized modules, more like the brain. (This may make serving hard, but who said intelligence would be easy to reproduce?) I think Sutton will like this new paradigm more :)
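The three-stage pipeline described above (pretrained world model → imitation finetuning → RL finetuning) can be sketched with a toy bigram model. All names here are hypothetical illustrations, not any lab's actual setup:

```python
# Toy sketch of the pretrain -> IFT -> RLFT pipeline on a bigram "LLM".
from collections import Counter, defaultdict

def pretrain(corpus):
    """Stage 1: next-token prediction over human text -- a 'world model'
    that predicts what a human would typically say (marginalizing over
    actions), giving a warm start instead of a tabula rasa."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def imitation_finetune(model, demos):
    """Stage 2: IFT upweights curated demonstrations, turning the
    predictor into a policy (same objective, different data)."""
    for prev, nxt in demos:
        model[prev][nxt] += 5
    return model

def rl_finetune(model, prompt, completion, reward):
    """Stage 3: RLFT scales an update by a scalar reward -- today mostly
    bandit-style, with no multi-step environment interaction."""
    model[prompt][completion] += reward
    return model

def act(model, prompt):
    """Greedy policy: emit the most likely next token."""
    return model[prompt].most_common(1)[0][0]

model = pretrain("the cat sat on the mat the cat ran")
model = imitation_finetune(model, demos=[("the", "dog")])  # demos now dominate
model = rl_finetune(model, "the", "cat", reward=10)        # reward outweighs demos
```

The point of the toy is only that each stage reshapes the same predictor with different data: human text, then demonstrations, then scalar rewards.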
Andrej Karpathy@karpathy

Finally had a chance to listen through this pod with Sutton, which was interesting and amusing. As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea is sufficiently "bitter lesson pilled" (meaning arranged so that it benefits from added computation for free) as a proxy for whether it's going to work or worth even pursuing. The underlying assumption being that LLMs are of course highly "bitter lesson pilled" indeed, just look at LLM scaling laws where if you put compute on the x-axis, number go up and to the right. So it's amusing to see that Sutton, the author of the post, is not so sure that LLMs are "bitter lesson pilled" at all. They are trained on giant datasets of fundamentally human data, which is both 1) human generated and 2) finite. What do you do when you run out? How do you prevent a human bias? So there you have it, bitter lesson pilled LLM researchers taken down by the author of the bitter lesson - rough! In some sense, Dwarkesh (who represents the LLM researchers viewpoint in the pod) and Sutton are slightly speaking past each other because Sutton has a very different architecture in mind and LLMs break a lot of its principles. He calls himself a "classicist" and evokes the original concept of Alan Turing of building a "child machine" - a system capable of learning through experience by dynamically interacting with the world. There's no giant pretraining stage of imitating internet webpages. There's also no supervised finetuning, which he points out is absent in the animal kingdom (it's a subtle point but Sutton is right in the strong sense: animals may of course observe demonstrations, but their actions are not directly forced/"teleoperated" by other animals). 
Another important note he makes is that even if you just treat pretraining as an initialization of a prior before you finetune with reinforcement learning, Sutton sees the approach as tainted with human bias and fundamentally off course, a bit like when AlphaZero (which has never seen human games of Go) beats AlphaGo (which initializes from them). In Sutton's world view, all there is is an interaction with a world via reinforcement learning, where the reward functions are partially environment specific, but also intrinsically motivated, e.g. "fun", "curiosity", and related to the quality of the prediction in your world model. And the agent is always learning at test time by default, it's not trained once and then deployed thereafter. Overall, Sutton is a lot more interested in what we have common with the animal kingdom instead of what differentiates us. "If we understood a squirrel, we'd be almost done". As for my take... First, I should say that I think Sutton was a great guest for the pod and I like that the AI field maintains entropy of thought and that not everyone is exploiting the next local iteration LLMs. AI has gone through too many discrete transitions of the dominant approach to lose that. And I also think that his criticism of LLMs as not bitter lesson pilled is not inadequate. Frontier LLMs are now highly complex artifacts with a lot of humanness involved at all the stages - the foundation (the pretraining data) is all human text, the finetuning data is human and curated, the reinforcement learning environment mixture is tuned by human engineers. We do not in fact have an actual, single, clean, actually bitter lesson pilled, "turn the crank" algorithm that you could unleash upon the world and see it learn automatically from experience alone. Does such an algorithm even exist? Finding it would of course be a huge AI breakthrough. Two "example proofs" are commonly offered to argue that such a thing is possible. 
The first example is the success of AlphaZero learning to play Go completely from scratch with no human supervision whatsoever. But the game of Go is clearly such a simple, closed, environment that it's difficult to see the analogous formulation in the messiness of reality. I love Go, but algorithmically and categorically, it is essentially a harder version of tic tac toe. The second example is that of animals, like squirrels. And here, personally, I am also quite hesitant whether it's appropriate because animals arise by a very different computational process and via different constraints than what we have practically available to us in the industry. Animal brains are nowhere near the blank slate they appear to be at birth. First, a lot of what is commonly attributed to "learning" is imo a lot more "maturation". And second, even that which clearly is "learning" and not maturation is a lot more "finetuning" on top of something clearly powerful and preexisting. Example. A baby zebra is born and within a few dozen minutes it can run around the savannah and follow its mother. This is a highly complex sensory-motor task and there is no way in my mind that this is achieved from scratch, tabula rasa. The brains of animals and the billions of parameters within have a powerful initialization encoded in the ATCGs of their DNA, trained via the "outer loop" optimization in the course of evolution. If the baby zebra spasmed its muscles around at random as a reinforcement learning policy would have you do at initialization, it wouldn't get very far at all. Similarly, our AIs now also have neural networks with billions of parameters. These parameters need their own rich, high information density supervision signal. We are not going to re-run evolution. But we do have mountains of internet documents. Yes it is basically supervised learning that is ~absent in the animal kingdom. 
But it is a way to practically gather enough soft constraints over billions of parameters, to try to get to a point where you're not starting from scratch. TLDR: Pretraining is our crappy evolution. It is one candidate solution to the cold start problem, to be followed later by finetuning on tasks that look more correct, e.g. within the reinforcement learning framework, as state of the art frontier LLM labs now do pervasively. I still think it is worth being inspired by animals. I think there are multiple powerful ideas that LLM agents are algorithmically missing that can still be adapted from animal intelligence. And I still think the bitter lesson is correct, but I see it more as something platonic to pursue, not necessarily to reach, in our real world and practically speaking. And I say both of these with double digit percent uncertainty and cheer the work of those who disagree, especially those a lot more ambitious bitter lesson wise. So that brings us to where we are. Stated plainly, today's frontier LLM research is not about building animals. It is about summoning ghosts. You can think of ghosts as a fundamentally different kind of point in the space of possible intelligences. They are muddled by humanity. Thoroughly engineered by it. They are these imperfect replicas, a kind of statistical distillation of humanity's documents with some sprinkle on top. They are not platonically bitter lesson pilled, but they are perhaps "practically" bitter lesson pilled, at least compared to a lot of what came before. It seems possible to me that over time, we can further finetune our ghosts more and more in the direction of animals; that it's not so much a fundamental incompatibility but a matter of initialization in the intelligence space. But it's also quite possible that they diverge even further and end up permanently different, un-animal-like, but still incredibly helpful and properly world-altering. It's possible that ghosts:animals :: planes:birds.
Anyway, in summary, overall and actionably, I think this pod is solid "real talk" from Sutton to the frontier LLM researchers, who might be gear shifted a little too much in the exploit mode. Probably we are still not sufficiently bitter lesson pilled and there is a very good chance of more powerful ideas and paradigms, other than exhaustive benchbuilding and benchmaxxing. And animals might be a good source of inspiration. Intrinsic motivation, fun, curiosity, empowerment, multi-agent self-play, culture. Use your imagination.

Jack Jingyu Zhang@jackjingyuzhang·
We introduce WaltzRL🎶, a multi-agent RL framework that treats LLM safety as a positive-sum game between conversation & feedback agents. It strikes an elegant balance between helpfulness & harmlessness, boosting safety & reducing overrefusals without degrading capabilities!
Jason Weston@jaseweston

💃New Multi-Agent RL Method: WaltzRL💃 📝: arxiv.org/abs/2510.08240 - Makes LLM safety a positive-sum game between a conversation & feedback agent - At inference feedback is adaptive, used when needed -> Improves safety & reduces overrefusals without degrading capabilities! 🧵1/5

Niels Rogge@NielsRogge·
Karpathy: "RL is terrible" Every RL researcher on the Karpathy interview: "I agree with everything he says"
TalkRL Podcast@TalkRLPodcast·
@CsabaSzepesvari @karpathy My personal hot take is very different: 1. RL, as a family of conceptual frameworks, is timeless. 2. Frustrations with modern deep RL algo performance are mostly due to limitations of deep learning function approx. tl;dr: Give RL FAs that generalize better (plus algos) :D
Csaba Szepesvari@CsabaSzepesvari·
@karpathy I think it would be good to distinguish RL as a problem from the algorithms that people use to address RL problems. This would allow us to discuss if the problem is with the algorithms, or if the problem is with posing a problem as an RL problem. 1/x
Andrej Karpathy@karpathy·
My pleasure to come on Dwarkesh last week, I thought the questions and conversation were really good. I re-watched the pod just now too. First of all, yes I know, and I'm sorry that I speak so fast :). It's to my detriment because sometimes my speaking thread out-executes my thinking thread, so I think I botched a few explanations due to that, and sometimes I was also nervous that I'm going too much on a tangent or too deep into something relatively spurious. Anyway, a few notes/pointers: AGI timelines. My comments on AGI timelines look to be the most trending part of the early response. The "decade of agents" is a reference to this earlier tweet x.com/karpathy/statu… Basically my AI timelines are about 5-10X pessimistic w.r.t. what you'll find in your neighborhood SF AI house party or on your twitter timeline, but still quite optimistic w.r.t. a rising tide of AI deniers and skeptics. The apparent conflict is not: imo we simultaneously 1) saw a huge amount of progress in recent years with LLMs while 2) there is still a lot of work remaining (grunt work, integration work, sensors and actuators to the physical world, societal work, safety and security work (jailbreaks, poisoning, etc.)) and also research to get done before we have an entity that you'd prefer to hire over a person for an arbitrary job in the world. I think that overall, 10 years should otherwise be a very bullish timeline for AGI, it's only in contrast to present hype that it doesn't feel that way. Animals vs Ghosts. My earlier writeup on Sutton's podcast x.com/karpathy/statu… . I am suspicious that there is a single simple algorithm you can let loose on the world and it learns everything from scratch. If someone builds such a thing, I will be wrong and it will be the most incredible breakthrough in AI.
In my mind, animals are not an example of this at all - they are prepackaged with a ton of intelligence by evolution and the learning they do is quite minimal overall (example: Zebra at birth). Putting our engineering hats on, we're not going to redo evolution. But with LLMs we have stumbled upon an alternative approach to "prepackage" a ton of intelligence in a neural network - not by evolution, but by predicting the next token over the internet. This approach leads to a different kind of entity in the intelligence space. Distinct from animals, more like ghosts or spirits. But we can (and should) make them more animal like over time and in some ways that's what a lot of frontier work is about. On RL. I've critiqued RL a few times already, e.g. x.com/karpathy/statu… . First, you're "sucking supervision through a straw", so I think the signal/flop is very bad. RL is also very noisy because a completion might have lots of errors that might get encouraged (if you happen to stumble onto the right answer), and conversely brilliant insight tokens that might get discouraged (if you happen to screw up later). Process supervision and LLM judges have issues too. I think we'll see alternative learning paradigms. I am long "agentic interaction" but short "reinforcement learning" x.com/karpathy/statu…. I've seen a number of papers pop up recently that are imo barking up the right tree along the lines of what I called "system prompt learning" x.com/karpathy/statu… , but I think there is also a gap between ideas on arxiv and actual, at scale implementation at an LLM frontier lab that works in a general way. I am overall quite optimistic that we'll see good progress on this dimension of remaining work quite soon, and e.g. I'd even say ChatGPT memory and so on are primordial deployed examples of new learning paradigms. Cognitive core.
My earlier post on "cognitive core": x.com/karpathy/statu… , the idea of stripping down LLMs, of making it harder for them to memorize, or actively stripping away their memory, to make them better at generalization. Otherwise they lean too hard on what they've memorized. Humans can't memorize so easily, which now looks more like a feature than a bug by contrast. Maybe the inability to memorize is a kind of regularization. Also my post from a while back on how the trend in model size is "backwards" and why "the models have to first get larger before they can get smaller" x.com/karpathy/statu… Time travel to Yann LeCun 1989. This is the post that I did a very hasty/bad job of describing on the pod: x.com/karpathy/statu… . Basically - how much could you improve Yann LeCun's results with the knowledge of 33 years of algorithmic progress? How constrained were the results by each of algorithms, data, and compute? Case study thereof. nanochat. My end-to-end implementation of the ChatGPT training/inference pipeline (the bare essentials) x.com/karpathy/statu… On LLM agents. My critique of the industry is more in overshooting the tooling w.r.t. present capability. I live in what I view as an intermediate world where I want to collaborate with LLMs and where our pros/cons are matched up. The industry lives in a future where fully autonomous entities collaborate in parallel to write all the code and humans are useless. For example, I don't want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don't feel ready to supervise a team of 10 of them. I'd like to go in chunks that I can keep in my head, where an LLM explains the code that it is writing. I'd like it to prove to me that what it did is correct, I want it to pull the API docs and show me that it used things correctly. I want it to make fewer assumptions and ask/collaborate with me when not sure about something.
I want to learn along the way and become better as a programmer, not just get served mountains of code that I'm told works. I just think the tools should be more realistic w.r.t. their capability and how they fit into the industry today, and I fear that if this isn't done well we might end up with mountains of slop accumulating across software, and an increase in vulnerabilities, security breaches, etc. x.com/karpathy/statu… Job automation. How the radiologists are doing great x.com/karpathy/statu… and what jobs are more susceptible to automation and why. Physics. Children should learn physics in early education not because they go on to do physics, but because it is the subject that best boots up a brain. Physicists are the intellectual embryonic stem cell x.com/karpathy/statu… I have a longer post that has been half-written in my drafts for ~a year, which I hope to finish soon. Thanks again Dwarkesh for having me over!
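The "sucking supervision through a straw" critique above can be made concrete: with outcome-only rewards, a REINFORCE-style update scales every token's log-prob gradient by the same scalar, so erroneous steps in a lucky completion get reinforced and brilliant steps in a failed one get discouraged. A minimal sketch (function name and numbers are hypothetical illustrations):

```python
# Outcome-only policy gradient: one scalar reward per completion is
# broadcast to every token, regardless of that token's quality.

def reinforce_token_updates(per_token_grads, outcome_reward):
    """Scale each token's log-prob gradient by the single episode reward."""
    return [outcome_reward * g for g in per_token_grads]

# A completion with an erroneous middle step that still lands on the
# right final answer (reward = +1): the error is reinforced too.
updates = reinforce_token_updates([0.2, -0.5, 0.3], outcome_reward=1.0)

# Conversely, a brilliant early step in a failed completion (reward = -1)
# gets pushed down along with everything else.
penalized = reinforce_token_updates([0.4, 0.1, -0.9], outcome_reward=-1.0)
```

Per-token credit assignment (e.g. process supervision) tries to break this broadcast, but as noted above it has its own issues.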
Dwarkesh Patel@dwarkesh_sp

The @karpathy interview 0:00:00 – AGI is still a decade away 0:30:33 – LLM cognitive deficits 0:40:53 – RL is terrible 0:50:26 – How do humans learn? 1:07:13 – AGI will blend into 2% GDP growth 1:18:24 – ASI 1:33:38 – Evolution of intelligence & culture 1:43:43 - Why self driving took so long 1:57:08 - Future of education Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!

Csaba Szepesvari@CsabaSzepesvari·
@pmddomingos @karpathy @RichardSSutton Just for the record, I would note that @RichardSSutton is always very careful to tell people that RL is a set of problems. I have never heard him misspeaking about this and in fact he is trying to tell people to stop identifying RL with specific algorithms.