Kris D.

177 posts

@M33pinator

Research Fellow @ https://t.co/7J7uozCcmA; @UAlbertaCS alum interested in AI, robotics, and penguins—dabbles w/ game dev, pixel art, speedrunning, and speedcubing.

Canada · Joined April 2014
173 Following · 480 Followers
Kris D. reposted
RL in Big Worlds@rlc_bigworlds·
RL in Big Worlds is a workshop at @RL_Conference about ideas that enable agents to achieve goals in environments vastly more complex than themselves. This requires giving agents the ability to learn continually and to use approximate value functions, models, and policies effectively.
RL in Big Worlds tweet media
Kris D. reposted
sorina@robot_in_space2·
We organized an RL competition during the first Openmind Research Institute Winter School in Malaysia. The participants were able to implement SARSA and SAC in just 2 days onboard our Embodied MuJoCo Ant! 🎉
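The SARSA algorithm the participants implemented is compact enough to sketch in full. Below is a minimal tabular SARSA loop on a toy 1-D chain environment; the chain, its reward, and all names are illustrative and not the winter school's actual setup:

```python
import random

def sarsa_chain(n_states=6, episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular SARSA on a toy 1-D chain: start in state 0, reward +1 at the
    rightmost (terminal) state; actions are 0 (left) and 1 (right)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # Q[state][action]

    def policy(s):
        # epsilon-greedy action selection over the current value estimates
        if rng.random() < eps:
            return rng.randrange(2)
        return 0 if Q[s][0] > Q[s][1] else 1

    for _ in range(episodes):
        s = 0
        a = policy(s)
        done = False
        while not done:
            s2 = max(0, s - 1) if a == 0 else s + 1
            done = s2 == n_states - 1
            r = 1.0 if done else 0.0
            a2 = policy(s2)
            # SARSA update: bootstrap on the action actually taken next (on-policy)
            target = r if done else r + gamma * Q[s2][a2]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```

The defining detail is the target `r + gamma * Q[s2][a2]`: unlike Q-learning's max over actions, SARSA evaluates the policy it is actually following.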
Kris D. reposted
Yu Su@ysu_nlp·
The bitter lesson of 2026 will be that sim2real is hopeless to solve and the real world is the only viable learning playground. Evolution is about overfitting to (niches of) the world. A slight deviation in a simulation leads to a different universe.
Amir Bar@_amirbar

Over the years I’ve noticed two schools of thought in ML: (1) prototype on synthetic tasks first (examples: ARC, computer games); (2) avoid synthetic tasks entirely. I started in camp (1), but slowly converged to (2). The planning and reasoning capabilities we care about are too entangled with the visual diversity of the real world.

Kris D.@M33pinator·
@crypticcracking To be fair, these are only considering language models. We lost this race to SAT solvers long ago. :')
Kris D.@M33pinator·
@kunlei15 This might be messier in that one- vs. multi-step is already common terminology in RL (e.g., n-step returns, TD(λ), etc.). What you're describing has been previously defined as the granularity of generalized policy iteration (Section 4.6 of Sutton & Barto).
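As a concrete anchor for the n-step terminology: the n-step return sums n discounted rewards and then bootstraps on a value estimate n steps ahead, G_t^(n) = R_{t+1} + γR_{t+2} + … + γ^(n-1)R_{t+n} + γ^n V(S_{t+n}). A minimal generic helper (names illustrative, not code from the thread):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """G_t^(n): n discounted rewards plus a bootstrapped tail value.
    `rewards[k]` is the reward received after step k; `values[k]` is the
    value estimate of the state at step k. Assumes t + n is a valid index."""
    G = sum(gamma**k * rewards[t + k] for k in range(n))
    return G + gamma**n * values[t + n]
```

With n=1 this is the one-step TD target; as n grows it approaches the Monte Carlo return, which is the one-step/multi-step axis being discussed.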
Kun Lei@kunlei15·
RL feels messy, but a two-axis view, data source (on-policy/off-policy/offline) × update schedule (one-step/multi-step/iterative), brings order. I wrote a post unifying them with shared equations: lei-kun.github.io/blogs/rl.html Robotic FMs (e.g., GEN-0, pi_0.5) grow via a data flywheel. Best fit: multi-step updates, conservative yet exploratory, then a switch to iterative RL to surpass/align human ceilings.
Kun Lei tweet media
Kris D. reposted
Khurram Javed@kjaved_·
I wrote a thing. Current humanoid robotics startups are not ready for the messiness of the world. Even if they succeed at everything they believe they need to do, it would still be insufficient for making useful robots. More here: khurramjaved.com/not_ready_for_…
Kris D.@M33pinator·
@JoshPurtell Yeah conceptually a lot of SGD-like updates handle this by just non-stop updating, but there are practical issues with non-convex functions and their loss landscapes, where a trained network is a horrible initialization for new data—see loss of plasticity (Dohare et al., 2024) 😅
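As a toy illustration of the "just keep updating" point: for a simple convex model, non-stop SGD tracks a non-stationary target without trouble; the loss-of-plasticity caveat in the tweet concerns deep non-convex networks, not a case like this. A hypothetical sketch (the drift schedule and names are made up for illustration):

```python
import random

def track_drifting_target(steps=2000, lr=0.1, seed=0):
    """Online SGD on a non-stationary regression: y = w_true * x, where
    w_true changes abruptly halfway through the stream. The single weight
    re-converges because updates never stop."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        w_true = 1.0 if t < steps // 2 else -2.0  # abrupt distribution shift
        x = rng.uniform(-1, 1)
        y = w_true * x
        # squared-error gradient step: d/dw (w*x - y)^2 / 2 = (w*x - y) * x
        w -= lr * (w * x - y) * x
    return w
```

The weight first settles near 1.0, then tracks the shifted target near -2.0; the open problem is getting deep networks to retain this kind of adaptability.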
Josh@JoshPurtell·
@M33pinator Yeah ngl if that’s what people had in mind it’s marginal/incremental and absolutely not worth attention lmao
Josh@JoshPurtell·
Getting early indications that hype is moving from RL to continuous learning. Standing by to confirm.
Kris D.@M33pinator·
@JoshPurtell It’s continual* learning, and it’s learning with a non-stationary data distribution (Abel et al., 2023; Elelimy et al., 2025). It’s also not separate from RL—there’s a body of work on continual RL. Because dynamic programming is inherently non-stationary, RL is no stranger to it.
Josh@JoshPurtell·
To be clear, I *am* bearish on continuous learning. Mostly because afaict it's a made-up term?
Kris D.@M33pinator·
@marcodisarra @KhurramJaved_96 If the robot was designed to learn, then by design, it might not even have a concept of tripping and falling. Those concerns are if one tries to force a learning algorithm onto a robot designed for functional capabilities. Key is that software and hardware must co-adapt here.
Marco@marcodisarra·
@KhurramJaved_96 So we should build robots that train while they operate? That would be cool, but I don’t want my robot to trip and fall to the ground every few minutes because it’s still learning.
Khurram Javed@kjaved_·
As long as robotic systems are stuck with sim2real, there is little hope of having general-purpose robots. Sim2real only makes economic sense for a handful of situations where the cost of failure is astronomically high. It's easy to spot these situations because even humans learn in simulation in them (e.g., pilot training, skydiving maneuvers). In all other cases, the cost of accurate simulation is too high, and learning from inaccurate simulations is not a reliable method.
Kris D.@M33pinator

Reinforcement learning 🧠 on robots 🤖 can’t stay in simulation forever. My new post explores why direct, on-hardware learning matters and how we also need smarter mechanical design to enable it. kris.pengy.ca/designforlearn…

Kris D.@M33pinator·
@Tha_JPo I do think world models are important, and that they're complementary to online learning, e.g., a model learned online or continually refined online. They allow directed exploration, better credit assignment, and decision-time planning, partially addressing sample efficiency. 😃
JPo@Tha_JPo·
@M33pinator Great post! Fun read (love the visuals too). How do you think world models tie in to generalization across environments? It seems like you think that online learning in the real world is a good way forward, but there are difficulties with sample efficiency.
Kris D.@M33pinator·
Reinforcement learning 🧠 on robots 🤖 can’t stay in simulation forever. My new post explores why direct, on-hardware learning matters and how we also need smarter mechanical design to enable it. kris.pengy.ca/designforlearn…
Kris D. tweet media
Kris D.@M33pinator·
@KnightNemo_ While not stated, I do think model-based RL (with decision-time planning) will be central for directed exploration, which also reduces the burden on hardware. 😃
Siqiao Huang@ICLR@KnightNemo_·
I think the main message here is: world models must generalize; otherwise imitation learning/model-free RL can be more efficient, and learning a Q-function basically gives us the same usage the WM can give.
Kris D.@M33pinator·
@bern_jaeger @KhurramJaved_96 That's fair, and I agree that the other approaches are really useful. I'll note, however, that the strong claim is rooted in the blue-sky robotics motivation rather than well-characterized, immediate practical uses. :')
Kris D.@M33pinator·
@bern_jaeger @KhurramJaved_96 Is it really a flaw? It's true that a) it has been a criticism of direct-hardware learning, and b) that direct-hardware learning also addresses it, regardless of whether domain randomization handled it. A plus is a practitioner won't (ever) have to specify a domain distribution.
Bernhard Jaeger@bern_jaeger·
@KhurramJaved_96 A flaw in this argument being that the example problem given, adaptation to a failing motor, was solved at CoRL this year with sim2real transfer and in-context learning. Learning to learn exclusively in simulation: generalist-locomotion.github.io
Kris D.@M33pinator·
@adam_patni @KhurramJaved_96 I'd say it's possible but definitely can be better. e.g., Keen AGI's robotroller learns to play Atari games in real-time on the order of hours, and on physical recreations of the MuJoCo ant, it walks after tens of minutes. IMO, model-based/directed exploration will be key. :)
Adam Patni@adam_patni·
@M33pinator @KhurramJaved_96 I guess I'm wondering if it's even possible to take an initial policy (random) and get that to learn anything if you're not starting off in sim on billions of trials. Definitely agree that you must train in the real world.
Kris D.@M33pinator·
@davide_tateo @robot_in_space2 @ias_tudarmstadt Great work! :D This is more suggesting that 1) more should be doing it, and 2) we can offload the design + enforcement of constraints if the hardware had been designed for learning from the start. e.g., we could less-riskily do more than fine-tuning on hardware. :)
Kris D.@M33pinator·
@adam_patni @KhurramJaved_96 It's sensible, but the concerns around trying things not captured in sim remain—if the likelihood + cost of failure is high, that's still a barrier for the real world portion of that split. One might consider imposing saturation limits, but can we really account for everything?
Adam Patni@adam_patni·
@KhurramJaved_96 What are your thoughts on blended methods (i.e., a 50/50 split of training in sim, then training in the real world)?
Kris D.@M33pinator·
@hsvgbkhgbv @CsabaSzepesvari @SOURADIPCHAKR18 @karpathy Yes, I later elaborated on what the possible RL problem constraints are, where you only access sequences of sampled transitions. I say "possible" because it was an original intent, but deep RL and its literature have spun it in different directions, muddying the terminology. :(
Andrej Karpathy@karpathy·
My pleasure to come on Dwarkesh last week, I thought the questions and conversation were really good. I re-watched the pod just now too. First of all, yes, I know, and I'm sorry that I speak so fast :). It's to my detriment because sometimes my speaking thread out-executes my thinking thread, so I think I botched a few explanations due to that, and sometimes I was also nervous that I was going too much on a tangent or too deep into something relatively spurious. Anyway, a few notes/pointers:

AGI timelines. My comments on AGI timelines look to be the most trending part of the early response. The "decade of agents" is a reference to this earlier tweet x.com/karpathy/statu… Basically my AI timelines are about 5-10X pessimistic w.r.t. what you'll find in your neighborhood SF AI house party or on your twitter timeline, but still quite optimistic w.r.t. a rising tide of AI deniers and skeptics. The apparent conflict is not: imo we simultaneously 1) saw a huge amount of progress in recent years with LLMs while 2) there is still a lot of work remaining (grunt work, integration work, sensors and actuators to the physical world, societal work, safety and security work (jailbreaks, poisoning, etc.)) and also research to get done before we have an entity that you'd prefer to hire over a person for an arbitrary job in the world. I think that overall, 10 years should otherwise be a very bullish timeline for AGI; it's only in contrast to present hype that it doesn't feel that way.

Animals vs Ghosts. My earlier writeup on Sutton's podcast: x.com/karpathy/statu… I am suspicious that there is a single simple algorithm you can let loose on the world that learns everything from scratch. If someone builds such a thing, I will be wrong and it will be the most incredible breakthrough in AI. In my mind, animals are not an example of this at all: they are prepackaged with a ton of intelligence by evolution, and the learning they do is quite minimal overall (example: a zebra at birth). Putting our engineering hats on, we're not going to redo evolution. But with LLMs we have stumbled on an alternative approach to "prepackage" a ton of intelligence in a neural network: not by evolution, but by predicting the next token over the internet. This approach leads to a different kind of entity in the intelligence space: distinct from animals, more like ghosts or spirits. But we can (and should) make them more animal-like over time, and in some ways that's what a lot of frontier work is about.

On RL. I've critiqued RL a few times already, e.g. x.com/karpathy/statu… First, you're "sucking supervision through a straw," so I think the signal/flop is very bad. RL is also very noisy because a completion might have lots of errors that might get encouraged (if you happen to stumble to the right answer), and conversely brilliant insight tokens that might get discouraged (if you happen to screw up later). Process supervision and LLM judges have issues too. I think we'll see alternative learning paradigms. I am long "agentic interaction" but short "reinforcement learning" x.com/karpathy/statu… I've seen a number of papers pop up recently that are imo barking up the right tree along the lines of what I called "system prompt learning" x.com/karpathy/statu… but I think there is also a gap between ideas on arxiv and actual, at-scale implementation at an LLM frontier lab that works in a general way. I am overall quite optimistic that we'll see good progress on this dimension of remaining work quite soon; e.g., I'd even say ChatGPT memory and so on are primordial deployed examples of new learning paradigms.

Cognitive core. My earlier post on the "cognitive core": x.com/karpathy/statu… The idea of stripping down LLMs, of making it harder for them to memorize, or actively stripping away their memory, to make them better at generalization. Otherwise they lean too hard on what they've memorized. Humans can't memorize so easily, which now looks more like a feature than a bug by contrast. Maybe the inability to memorize is a kind of regularization. Also my post from a while back on how the trend in model size is "backwards" and why "the models have to first get larger before they can get smaller": x.com/karpathy/statu…

Time travel to Yann LeCun 1989. This is the post that I did a very hasty/bad job of describing on the pod: x.com/karpathy/statu… Basically: how much could you improve Yann LeCun's results with the knowledge of 33 years of algorithmic progress? How constrained were the results by each of algorithms, data, and compute? A case study thereof.

nanochat. My end-to-end implementation of the ChatGPT training/inference pipeline (the bare essentials): x.com/karpathy/statu…

On LLM agents. My critique of the industry is more in overshooting the tooling w.r.t. present capability. I live in what I view as an intermediate world where I want to collaborate with LLMs and where our pros/cons are matched up. The industry lives in a future where fully autonomous entities collaborate in parallel to write all the code and humans are useless. For example, I don't want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don't feel ready to supervise a team of 10 of them. I'd like to go in chunks that I can keep in my head, where an LLM explains the code that it is writing. I'd like it to prove to me that what it did is correct; I want it to pull the API docs and show me that it used things correctly. I want it to make fewer assumptions and ask/collaborate with me when not sure about something. I want to learn along the way and become better as a programmer, not just get served mountains of code that I'm told works. I just think the tools should be more realistic w.r.t. their capability and how they fit into the industry today, and I fear that if this isn't done well we might end up with mountains of slop accumulating across software, and an increase in vulnerabilities, security breaches, etc. x.com/karpathy/statu…

Job automation. How the radiologists are doing great x.com/karpathy/statu… and what jobs are more susceptible to automation and why.

Physics. Children should learn physics in early education not because they go on to do physics, but because it is the subject that best boots up a brain. Physicists are the intellectual embryonic stem cell. x.com/karpathy/statu… I have a longer post that has been half-written in my drafts for about a year, which I hope to finish soon. Thanks again Dwarkesh for having me over!
Dwarkesh Patel@dwarkesh_sp

The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self-driving took so long
1:57:08 – Future of education
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!

Kris D.@M33pinator·
@hsvgbkhgbv @CsabaSzepesvari @SOURADIPCHAKR18 @karpathy RL at its core is agent-environment interaction with evaluative feedback. This can be formalized as an MDP to derive algorithms which obey RL constraints (e.g., can only access temporally correlated sampled transitions), but here, RL is closer to the problem than the solution.
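For reference, the formalization mentioned above, in standard Sutton & Barto notation: an MDP is a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, and the action-value function of a policy $\pi$ satisfies the Bellman equation

```latex
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
```

The constraint described in the tweet is that the agent never accesses $p$ directly, only temporally correlated sampled transitions $(S_t, A_t, R_{t+1}, S_{t+1})$.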