Kris D.

177 posts

@M33pinator

Research Fellow @ https://t.co/7J7uozCcmA; @UAlbertaCS alum interested in AI, robotics, and penguins—dabbles w/ game dev, pixel art, speedrunning, and speedcubing.

Canada · Joined April 2014
173 Following · 480 Followers
Kris D. reposted
RL in Big Worlds@rlc_bigworlds·
RL in Big Worlds is a workshop at @RL_Conference about ideas that enable agents to achieve goals in environments vastly more complex than themselves. This requires giving agents the ability to learn continually and to use approximate value functions, models, and policies effectively.
RL in Big Worlds tweet media
Kris D. reposted
sorina@robot_in_space2·
We organized an RL competition during the first Openmind Research Institute Winter School in Malaysia. The participants were able to implement SARSA and SAC in just 2 days onboard our Embodied MuJoCo Ant! 🎉
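The SARSA algorithm the participants implemented is compact enough to sketch in full. Below is a minimal tabular SARSA loop on a toy 1-D chain environment; the chain, its reward, and all names are illustrative and not the winter school's actual setup:

```python
import random

def sarsa_chain(n_states=6, episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular SARSA on a toy 1-D chain: start in state 0, reward +1 at the
    rightmost (terminal) state; actions are 0 (left) and 1 (right)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # Q[state][action]

    def policy(s):
        # epsilon-greedy action selection over the current value estimates
        if rng.random() < eps:
            return rng.randrange(2)
        return 0 if Q[s][0] > Q[s][1] else 1

    for _ in range(episodes):
        s = 0
        a = policy(s)
        done = False
        while not done:
            s2 = max(0, s - 1) if a == 0 else s + 1
            done = s2 == n_states - 1
            r = 1.0 if done else 0.0
            a2 = policy(s2)
            # SARSA update: bootstrap on the action actually taken next (on-policy)
            target = r if done else r + gamma * Q[s2][a2]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```

The defining detail is the target `r + gamma * Q[s2][a2]`: unlike Q-learning's max over actions, SARSA evaluates the policy it is actually following.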
Kris D. reposted
Yu Su@ysu_nlp·
The bitter lesson of 2026 will be that sim2real is hopeless to solve and the real world is the only viable learning playground. Evolution is about overfitting to (niches of) the world. A slight deviation in a simulation leads to a different universe.
Amir Bar@_amirbar

Over the years I’ve noticed two schools of thought in ML: (1) prototype on synthetic tasks first (examples: ARC, computer games); (2) avoid synthetic tasks entirely. I started in camp (1), but slowly converged to (2). The planning and reasoning capabilities we care about are too entangled with the visual diversity of the real world.

Kris D.@M33pinator·
@crypticcracking To be fair, these are only considering language models. We lost this race to SAT solvers long ago. :')
Kris D.@M33pinator·
@kunlei15 This might be messier in that one- vs. multi-step is already common terminology in RL (e.g., n-step returns, TD(λ), etc.). What you're describing has been previously defined as the granularity of generalized policy iteration (Section 4.6 of Sutton & Barto).
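As a concrete anchor for the n-step terminology: the n-step return sums n discounted rewards and then bootstraps on a value estimate n steps ahead, G_t^(n) = R_{t+1} + γR_{t+2} + … + γ^(n-1)R_{t+n} + γ^n V(S_{t+n}). A minimal generic helper (names illustrative, not code from the thread):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """G_t^(n): n discounted rewards plus a bootstrapped tail value.
    `rewards[k]` is the reward received after step k; `values[k]` is the
    value estimate of the state at step k. Assumes t + n is a valid index."""
    G = sum(gamma**k * rewards[t + k] for k in range(n))
    return G + gamma**n * values[t + n]
```

With n=1 this is the one-step TD target; as n grows it approaches the Monte Carlo return, which is the one-step/multi-step axis being discussed.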
Kun Lei@kunlei15·
RL feels messy, but a two-axis view, data source (on-policy/off-policy/offline) × update schedule (one-step/multi-step/iterative), brings order. I wrote a post unifying them with shared equations: lei-kun.github.io/blogs/rl.html Robotic FMs (e.g., GEN-0, pi_0.5) grow via a data flywheel. Best fit: multi-step updates, conservative yet exploratory, then a switch to iterative RL to surpass/align human ceilings.
Kun Lei tweet media
Kris D. reposted
Khurram Javed@kjaved_·
I wrote a thing. Current humanoid robotics startups are not ready for the messiness of the world. Even if they succeed at everything they believe they need to do, it would still be insufficient for making useful robots. More here: khurramjaved.com/not_ready_for_…
Kris D.@M33pinator·
@JoshPurtell Yeah conceptually a lot of SGD-like updates handle this by just non-stop updating, but there are practical issues with non-convex functions and their loss landscapes, where a trained network is a horrible initialization for new data—see loss of plasticity (Dohare et al., 2024) 😅
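As a toy illustration of the "just keep updating" point: for a simple convex model, non-stop SGD tracks a non-stationary target without trouble; the loss-of-plasticity caveat in the tweet concerns deep non-convex networks, not a case like this. A hypothetical sketch (the drift schedule and names are made up for illustration):

```python
import random

def track_drifting_target(steps=2000, lr=0.1, seed=0):
    """Online SGD on a non-stationary regression: y = w_true * x, where
    w_true changes abruptly halfway through the stream. The single weight
    re-converges because updates never stop."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        w_true = 1.0 if t < steps // 2 else -2.0  # abrupt distribution shift
        x = rng.uniform(-1, 1)
        y = w_true * x
        # squared-error gradient step: d/dw (w*x - y)^2 / 2 = (w*x - y) * x
        w -= lr * (w * x - y) * x
    return w
```

The weight first settles near 1.0, then tracks the shifted target near -2.0; the open problem is getting deep networks to retain this kind of adaptability.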
Josh@JoshPurtell·
@M33pinator Yeah ngl if that’s what people had in mind it’s marginal/incremental and absolutely not worth attention lmao
Josh@JoshPurtell·
Getting early indications that hype is moving from RL to continuous learning. Standing by to confirm.
Kris D.@M33pinator·
@JoshPurtell It’s continual* learning, and it’s learning with a non-stationary data distribution (Abel et al., 2023; Elelimy et al., 2025). It’s also not separate from RL—there’s a body of work on continual RL. Because dynamic programming is inherently non-stationary, RL is no stranger to it.
Josh@JoshPurtell·
To be clear, I *am* bearish on continuous learning. Mostly because afaict it's a made-up term?
Kris D.@M33pinator·
@marcodisarra @KhurramJaved_96 If the robot was designed to learn, then by design, it might not even have a concept of tripping and falling. Those concerns are if one tries to force a learning algorithm onto a robot designed for functional capabilities. Key is that software and hardware must co-adapt here.
Marco@marcodisarra·
@KhurramJaved_96 So we should build robots that train while they operate? That would be cool, but I don’t want my robot to trip and fall to the ground every few minutes because it’s still learning.
Khurram Javed@kjaved_·
As long as robotic systems are stuck with sim2real, there is little hope of having general-purpose robots. Sim2real only makes economic sense for a handful of situations where the cost of failure is astronomically high. It's easy to spot these situations because even humans learn in simulation in them (e.g., pilot training, skydiving maneuvers). In all other cases, the cost of accurate simulation is too high, and learning from inaccurate simulations is not a reliable method.
Kris D.@M33pinator

Reinforcement learning 🧠 on robots 🤖 can’t stay in simulation forever. My new post explores why direct, on-hardware learning matters and how we also need smarter mechanical design to enable it. kris.pengy.ca/designforlearn…

Kris D.@M33pinator·
@Tha_JPo I do think world models are important, and that they're complementary to online learning, e.g., a model learned online or continually refined online. They allow directed exploration, better credit assignment, and decision-time planning, partially addressing sample efficiency. 😃
JPo@Tha_JPo·
@M33pinator Great post! Fun read (love the visuals too). How do you think world models tie in to generalization across environments? It seems like you think that online learning in the real world is a good way forward, but there are difficulties with sample efficiency.
Kris D.@M33pinator·
Reinforcement learning 🧠 on robots 🤖 can’t stay in simulation forever. My new post explores why direct, on-hardware learning matters and how we also need smarter mechanical design to enable it. kris.pengy.ca/designforlearn…
Kris D. tweet media
Kris D.@M33pinator·
@KnightNemo_ While not stated, I do think model-based RL (with decision-time planning) will be central for directed exploration, which also reduces the burden on hardware. 😃
Siqiao Huang@ICLR@KnightNemo_·
I think the main message here is: world models must generalize; otherwise imitation learning/model-free RL can be more efficient, and learning a Q-function basically gives us the same usage the WM can give.
Kris D.@M33pinator·
@bern_jaeger @KhurramJaved_96 That's fair, and I agree that the other approaches are really useful. I'll note, however, that the strong claim is rooted in the blue-sky robotics motivation rather than well-characterized, immediate practical uses. :')
Kris D.@M33pinator·
@bern_jaeger @KhurramJaved_96 Is it really a flaw? It's true that a) it has been a criticism of direct-hardware learning, and b) that direct-hardware learning also addresses it, regardless of whether domain randomization handled it. A plus is a practitioner won't (ever) have to specify a domain distribution.
Bernhard Jaeger@bern_jaeger·
@KhurramJaved_96 A flaw in this argument being that the example problem given, adaptation to a failing motor, was solved at CoRL this year with sim2real transfer and in-context learning. Learning to learn exclusively in simulation: generalist-locomotion.github.io
Kris D.@M33pinator·
@adam_patni @KhurramJaved_96 I'd say it's possible but definitely can be better. e.g., Keen AGI's robotroller learns to play Atari games in real-time on the order of hours, and on physical recreations of the MuJoCo ant, it walks after tens of minutes. IMO, model-based/directed exploration will be key. :)
Adam Patni@adam_patni·
@M33pinator @KhurramJaved_96 I guess I'm wondering if it's even possible to take an initial policy (random) and get that to learn anything if you're not starting off in sim on billions of trials. Definitely agree that you must train in the real world.
Kris D.@M33pinator·
@davide_tateo @robot_in_space2 @ias_tudarmstadt Great work! :D This is more suggesting that 1) more should be doing it, and 2) we can offload the design + enforcement of constraints if the hardware had been designed for learning from the start. e.g., we could less-riskily do more than fine-tuning on hardware. :)
Kris D.@M33pinator·
@adam_patni @KhurramJaved_96 It's sensible, but the concerns around trying things not captured in sim remain—if the likelihood + cost of failure is high, that's still a barrier for the real world portion of that split. One might consider imposing saturation limits, but can we really account for everything?
Adam Patni@adam_patni·
@KhurramJaved_96 What are your thoughts on blended methods (i.e., a 50/50 split of training in sim, then training in the real world)?
Kris D.@M33pinator·
@hsvgbkhgbv @CsabaSzepesvari @SOURADIPCHAKR18 @karpathy Yes, I later elaborated on what the possible RL problem constraints are, where you only access sequences of sampled transitions. I say "possible" because it was an original intent, but deep RL and its literature have spun it in different directions, muddying the terminology. :(
Andrej Karpathy@karpathy·
My pleasure to come on Dwarkesh last week, I thought the questions and conversation were really good. I re-watched the pod just now too. First of all, yes, I know, and I'm sorry that I speak so fast :). It's to my detriment because sometimes my speaking thread out-executes my thinking thread, so I think I botched a few explanations due to that, and sometimes I was also nervous that I was going too much on a tangent or too deep into something relatively spurious. Anyway, a few notes/pointers:

AGI timelines. My comments on AGI timelines look to be the most trending part of the early response. The "decade of agents" is a reference to this earlier tweet x.com/karpathy/statu… Basically my AI timelines are about 5-10X pessimistic w.r.t. what you'll find in your neighborhood SF AI house party or on your twitter timeline, but still quite optimistic w.r.t. a rising tide of AI deniers and skeptics. The apparent conflict is not: imo we simultaneously 1) saw a huge amount of progress in recent years with LLMs while 2) there is still a lot of work remaining (grunt work, integration work, sensors and actuators to the physical world, societal work, safety and security work (jailbreaks, poisoning, etc.)) and also research to get done before we have an entity that you'd prefer to hire over a person for an arbitrary job in the world. I think that overall, 10 years should otherwise be a very bullish timeline for AGI; it's only in contrast to present hype that it doesn't feel that way.

Animals vs Ghosts. My earlier writeup on Sutton's podcast: x.com/karpathy/statu… I am suspicious that there is a single simple algorithm you can let loose on the world that learns everything from scratch. If someone builds such a thing, I will be wrong and it will be the most incredible breakthrough in AI. In my mind, animals are not an example of this at all: they are prepackaged with a ton of intelligence by evolution, and the learning they do is quite minimal overall (example: a zebra at birth). Putting our engineering hats on, we're not going to redo evolution. But with LLMs we have stumbled on an alternative approach to "prepackage" a ton of intelligence in a neural network: not by evolution, but by predicting the next token over the internet. This approach leads to a different kind of entity in the intelligence space: distinct from animals, more like ghosts or spirits. But we can (and should) make them more animal-like over time, and in some ways that's what a lot of frontier work is about.

On RL. I've critiqued RL a few times already, e.g. x.com/karpathy/statu… First, you're "sucking supervision through a straw," so I think the signal/flop is very bad. RL is also very noisy because a completion might have lots of errors that might get encouraged (if you happen to stumble to the right answer), and conversely brilliant insight tokens that might get discouraged (if you happen to screw up later). Process supervision and LLM judges have issues too. I think we'll see alternative learning paradigms. I am long "agentic interaction" but short "reinforcement learning" x.com/karpathy/statu… I've seen a number of papers pop up recently that are imo barking up the right tree along the lines of what I called "system prompt learning" x.com/karpathy/statu… but I think there is also a gap between ideas on arxiv and actual, at-scale implementation at an LLM frontier lab that works in a general way. I am overall quite optimistic that we'll see good progress on this dimension of remaining work quite soon; e.g., I'd even say ChatGPT memory and so on are primordial deployed examples of new learning paradigms.

Cognitive core. My earlier post on the "cognitive core": x.com/karpathy/statu… The idea of stripping down LLMs, of making it harder for them to memorize, or actively stripping away their memory, to make them better at generalization. Otherwise they lean too hard on what they've memorized. Humans can't memorize so easily, which now looks more like a feature than a bug by contrast. Maybe the inability to memorize is a kind of regularization. Also my post from a while back on how the trend in model size is "backwards" and why "the models have to first get larger before they can get smaller": x.com/karpathy/statu…

Time travel to Yann LeCun 1989. This is the post that I did a very hasty/bad job of describing on the pod: x.com/karpathy/statu… Basically: how much could you improve Yann LeCun's results with the knowledge of 33 years of algorithmic progress? How constrained were the results by each of algorithms, data, and compute? A case study thereof.

nanochat. My end-to-end implementation of the ChatGPT training/inference pipeline (the bare essentials): x.com/karpathy/statu…

On LLM agents. My critique of the industry is more in overshooting the tooling w.r.t. present capability. I live in what I view as an intermediate world where I want to collaborate with LLMs and where our pros/cons are matched up. The industry lives in a future where fully autonomous entities collaborate in parallel to write all the code and humans are useless. For example, I don't want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don't feel ready to supervise a team of 10 of them. I'd like to go in chunks that I can keep in my head, where an LLM explains the code that it is writing. I'd like it to prove to me that what it did is correct; I want it to pull the API docs and show me that it used things correctly. I want it to make fewer assumptions and ask/collaborate with me when not sure about something. I want to learn along the way and become better as a programmer, not just get served mountains of code that I'm told works. I just think the tools should be more realistic w.r.t. their capability and how they fit into the industry today, and I fear that if this isn't done well we might end up with mountains of slop accumulating across software, and an increase in vulnerabilities, security breaches, etc. x.com/karpathy/statu…

Job automation. How the radiologists are doing great x.com/karpathy/statu… and what jobs are more susceptible to automation and why.

Physics. Children should learn physics in early education not because they go on to do physics, but because it is the subject that best boots up a brain. Physicists are the intellectual embryonic stem cell. x.com/karpathy/statu… I have a longer post that has been half-written in my drafts for about a year, which I hope to finish soon. Thanks again Dwarkesh for having me over!
Dwarkesh Patel@dwarkesh_sp

The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self-driving took so long
1:57:08 – Future of education
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!

Kris D.@M33pinator·
@hsvgbkhgbv @CsabaSzepesvari @SOURADIPCHAKR18 @karpathy RL at its core is agent-environment interaction with evaluative feedback. This can be formalized as an MDP to derive algorithms which obey RL constraints (e.g., can only access temporally correlated sampled transitions), but here, RL is closer to the problem than the solution.
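For reference, the formalization mentioned above, in standard Sutton & Barto notation: an MDP is a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, and the action-value function of a policy $\pi$ satisfies the Bellman equation

```latex
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
```

The constraint described in the tweet is that the agent never accesses $p$ directly, only temporally correlated sampled transitions $(S_t, A_t, R_{t+1}, S_{t+1})$.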