rain

129 posts

@rainx0r

phd student / trying to make RL work / yap into the abyss about AI research, coding, etc.

guildford, england · Joined May 2020
223 Following · 36 Followers
rain
rain@rainx0r·
I used to think this until I got nerdsniped trying to smooth out my ppo curves recently, and it turns out it really does matter, particularly for critic learning, if you don’t handle terminated vs truncated mechanics in your GAE code, and your env is almost always truncating (e.g. most continuous control tasks)
0
0
0
26
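A minimal sketch of the distinction (hypothetical helper, not from any particular codebase): a true termination zeroes the bootstrap value, a time-limit truncation still bootstraps from the critic's estimate of the next state, and either kind of boundary cuts the GAE recursion so advantages don't leak across episodes.

```python
import numpy as np

def gae(rewards, values, next_values, terminated, truncated, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation that distinguishes terminated vs truncated.

    terminated[t]: the episode truly ended (no future return -> bootstrap is zeroed)
    truncated[t]:  a time limit was hit (the episode would continue -> keep bootstrapping)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        # Only a true termination zeroes the bootstrap; truncation does not.
        nonterminal = 0.0 if terminated[t] else 1.0
        delta = rewards[t] + gamma * next_values[t] * nonterminal - values[t]
        # A boundary of either kind cuts the recursion across episodes.
        boundary = terminated[t] or truncated[t]
        last_gae = delta + gamma * lam * nonterminal * (0.0 if boundary else last_gae)
        advantages[t] = last_gae
    return advantages, advantages + values  # (advantages, value targets)
```

Note the difference on the last step: with `next_values[t] = 2.0` the truncated branch folds that critic estimate into the advantage, while the terminated branch discards it.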
Mefaso
Mefaso@Mefaso·
@willccbb terminated vs done doesn't matter, it should be a 4 tuple at most
1
0
2
574
rain
rain@rainx0r·
@deredleritt3r how come they’re still so terrible at it then lol, the only acceleration they bring to AI research atm is solving obscure cuda errors fast
0
0
0
43
prinz
prinz@deredleritt3r·
The primary purpose of Claude Code and Codex is to accelerate AI research. Yes, these tools also help you vibe-code B2B SaaS. Yes, they also drive enterprise revenue. But that is *not* their primary purpose.
27
15
372
24.2K
rain retweeted
Luke Metro
Luke Metro@luke_metro·
lol
Luke Metro tweet media
23
20
991
183.4K
rain
rain@rainx0r·
Hate to Schmidhuber you, but no, Meta-RL is a really old idea, please stop spreading misinformation. These researchers don’t “introduce” it, they’re just applying it to LLMs.

Meta-RL, as people talk about it today, was originally introduced by [1, 2], although I'm sure the actual Jürgen could show up and bring up some citation from the 90s doing something similar. There's also a nicely written survey about it [3].

Also, I don't think this has anything to do with Continual Learning, especially not as Sutton understands and promotes it. Meta-RL is primarily just a way to incentivise test-time adaptation / task identification behaviour in the agent. I guess it's a cool buzzword for the algorithm rn tho.

[1] arxiv.org/abs/1611.02779
[2] Botvinick, M., Ritter, S., Wang, J.X., Kurth-Nelson, Z., Blundell, C. and Hassabis, D., 2019. Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5), pp.408-422. cell.com/trends/cogniti…
[3] arxiv.org/abs/2301.08028
0
0
22
1.6K
Ronak Malde
Ronak Malde@rronak_·
This might be my favorite paper of the year 🤯

Rich Sutton claims that current RL methods won't get us to continual learning because they don't compound upon previous knowledge; every rollout starts from scratch.

Researchers in Switzerland introduce Meta-RL, which might crack that code. Optimize across episodes with a meta-learning objective, which then incentivizes agents to explore first and then exploit. And then reflect upon previous failures for future agent runs.

Incredible results and incredible read of a paper overall.

Authors: @YulunJiang @LiangzeJ @DamienTeney @Michael_D_Moor @mariabrbic
Ronak Malde tweet media
31
103
885
120.8K
rain
rain@rainx0r·
having gotten sick at neurips I think we should shift all AI4Science efforts to eradicating the common cold, I don’t even care about curing cancer anymore
0
0
0
74
rain
rain@rainx0r·
@rosemary_ke e-mail'd since your DMs are closed, would be really keen to meet
0
0
1
142
Nan Rosemary Ke
Nan Rosemary Ke@rosemary_ke·
At NeurIPS this week. Excited to meet, please reach out.
- Focussed on Scaling LLM-RL
- Working on real world evals and long form generation (mathematical proofs/STEM)
- Scaling tasks for agents (computer use / coding / research)
3
2
49
6K
rain
rain@rainx0r·
torn between rust, zig or doing some gimmick for advent of code, I love decision paralysis
0
0
0
43
rain
rain@rainx0r·
do I really have to figure out some kind of local PR review workflow to respond to people opening up AI slop PRs on my repos 😭
0
0
0
31
rain
rain@rainx0r·
why is @github so insanely laggy for PR reviews now, the diff isn’t even that large
1
0
0
34
rain
rain@rainx0r·
@melqtx have you ssh’d into anything before
1
0
98
11.4K
mel
mel@melqtx·
can someone convince me, just one reason, to use tmux?
mel tweet media
1K
42
2.3K
437.8K
Daniel
Daniel@nearlydaniel·
Interesting asymmetry in AI interactions:

You speak faster than you can type
But you read faster than you can listen

So pure text is slow, and pure voice is slow

“Speak and read” seems like the fastest 2-way bitrate we have until brain implants
349
230
4K
232K
rain
rain@rainx0r·
@skooookum last time I used it it was laggy af especially on iPad with pencil drawing, I wonder if it’s better yet
0
0
1
4K
skooks
skooks@skooookum·
Quietly the best piece of software Apple has created in the last decade
skooks tweet media
151
58
2.7K
434.2K
Lucas Beyer (bl16)
Lucas Beyer (bl16)@giffmana·
PS: i like TPUs and MaxText is a fine codebase, but this was too funny to read, who the hell wrote this?!
2
1
70
9.5K
rain
rain@rainx0r·
3.14 doesn’t remove the GIL. the free-threaded python build, which has existed since 3.13, does; it’s just that the newer version (3.14t) isn’t considered experimental anymore.

just getting things installed on 3.14 isn’t enough: you need to get things installed on 3.14t *and* those libraries have to support free-threaded python to work.

a lot of libraries that are written in other languages like C and just have python bindings still don’t support free-threaded python, and that’s the real bottleneck, rather than 3.14 no longer marking it as experimental. the major ones like NumPy already do, thankfully.
1
0
17
1.3K
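A small snippet to check both halves of this distinction (assuming CPython 3.13+, where `sys._is_gil_enabled()` and the `Py_GIL_DISABLED` build flag exist; on older interpreters it falls back to reporting the GIL as on): whether you're on a free-threaded build at all, and whether the GIL is actually disabled at runtime.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_active_now).

    Py_GIL_DISABLED is 1 only on free-threaded ("t") builds like 3.13t/3.14t;
    on older or standard builds it is 0 or None. Even on a free-threaded build
    the GIL can be re-enabled at runtime, e.g. when an extension module that
    doesn't declare free-threading support gets imported.
    """
    build_flag = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    gil_active = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
    return build_flag, gil_active

if __name__ == "__main__":
    build_flag, gil_active = gil_status()
    print(f"free-threaded build: {build_flag}, GIL active now: {gil_active}")
```

The runtime check is the important one for the point above: installing on 3.14t only pays off if your C-extension dependencies keep the GIL from silently switching back on.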
Jeffrey Emanuel
Jeffrey Emanuel@doodlestein·
So Python 3.14 finally came out for real yesterday. Finally removing the GIL (global interpreter lock), which allows for way faster multithreaded code without dealing with all the brain damage and overhead of multiprocessing or other hacky workarounds. And uv already fully supports it, which is wildly impressive.

But anyway, I was a bit bummed out, because the main project I’m working on has a massive number of library dependencies, and it always takes a very long time to get mainline support for new python versions, particularly when they’re as revolutionary and different as version 3.14 is. So I was resigned to endure GIL-hell for the indefinite future.

But then I figured, why not? Let me just see if codex and GPT-5 can power through it all. So I backed up my settings and asked codex to try, giving it the recent blog post from the uv team to get it started.

There were some major roadblocks. I use PyTorch, which is notoriously slow to update. And also pyarrow, which also didn’t support 3.14. Same with cvxpy, the wrapper to the convex optimization library.

Still, I wanted to see what we could do even if we had to deal with the brain damage of “vendoring” some libraries and building some stuff from scratch in C++, Rust, etc. using the latest nightly GitHub repositories instead of the usual PyPi libraries. I told codex to search the web, to read GitHub issue pages, etc, so that we didn’t reinvent the wheel (or WHL I should say, 🤣) unnecessarily.

Why not? I could always test things, and if I couldn’t get it to work, then I could just retreat back to Python 3.13, right? No harm, no foul.

Well, it took many hours of work, almost all of it done by codex while I occasionally checked in with it, but it managed to get everything working! Sure, it took a bunch of iterations, and I had to go tweak some stuff to avoid annoying deprecation warnings (some of which come from other libraries, so I ultimately had to filter them). But those libraries will update over time to better support 3.14 and eventually I won’t need to use any of these annoying workarounds.

Codex even suggested uploading the compiled whl artifacts to Cloudflare’s R2 (like s3) so we could reuse them easily across machines, and took care of all the details for me. I would never think to do that on my own. Every time there was another complication or problem (for instance, what is shown in the screenshot below), codex just figured it out and plowed through it all like nothing.

If you’ve never tried to do something like this in the “bad old days” prior to LLMs, it was a thankless grind that could eat up days and then hit a roadblock, resulting in a total wipeout. So it was simply too risky to even try it most of the time; you were better off just waiting 6 or 9 months for things to become simple again.

Anyway, I still can’t really believe it’s all working! We are living in the future.
Jeffrey Emanuel tweet media
58
159
2.3K
236.3K
rain
rain@rainx0r·
it’s slower and worse than being in game, but the point isn’t to solve Minecraft, it’s to demonstrate that learning entirely in a learned world model is possible. there are many domains where the real thing isn’t a video game and model learning is the only path to large scale RL training; that’s like the whole premise of MBRL
1
0
4
2K
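That premise (learn a model from real interaction, then do most of the policy's learning inside it) long predates Dreamer; here is a toy Dyna-Q sketch on a made-up 5-state chain MDP, where imagined transitions replayed from a memorized model drive most of the value updates instead of the real environment:

```python
import random

def dyna_q(num_real_steps=200, planning_steps=20, seed=0):
    """Tiny Dyna-Q: most Q-updates come from a learned world model, not the
    real environment. Toy chain MDP: states 0..4, reward on reaching state 4."""
    rng = random.Random(seed)

    def real_step(s, a):  # the "real game"
        s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        return s2, (1.0 if s2 == 4 else 0.0)

    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
    model = {}  # learned world model: (s, a) -> (s', r), memorized here
    s = 0
    for _ in range(num_real_steps):
        a = rng.choice([0, 1])  # random behaviour policy for data collection
        s2, r = real_step(s, a)
        # one update from real experience...
        Q[(s, a)] += 0.5 * (r + 0.9 * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        model[(s, a)] = (s2, r)
        # ...then many updates from "imagined" experience sampled from the model
        for _ in range(planning_steps):
            ps, pa = rng.choice(list(model))
            ps2, pr = model[(ps, pa)]
            Q[(ps, pa)] += 0.5 * (pr + 0.9 * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
        s = 0 if s2 == 4 else s2  # episode resets at the goal
    return Q
```

With 20 planning steps per real step, roughly 95% of updates never touch the real environment, which is exactly the trade the world-model setting is after when the real thing is expensive or unsafe.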
mike64_t
mike64_t@mike64_t·
just so you guys know the bottleneck for getting data out of Minecraft is literally FFmpeg and I can guarantee this model is both slower and worse than actually being ingame. The timer speed can be increased, frame capture can happen in game at scaled rate, and the thing that will make your system come to a crawl is FFmpeg and there's nothing you can do about it. You can't encode video at 1200 fps with current technology unless you're recording at like 240p or you have a 20PB SSD somewhere and decide to use avi.
alphaXiv@askalphaxiv

Crazy paper from Google DeepMind

Dreamer 4 can mine Minecraft diamonds without ever touching the real game

It trained entirely inside its own world model, kinda like an “imaginary world”, showing agents can tackle long horizon tasks safely by just learning from videos!

17
9
296
108.9K
Peter Wildeford🇺🇸🚀
Peter Wildeford🇺🇸🚀@peterwildeford·
Ignore the scary cyber results and feast your eyes on the best graph footnote of all time
Peter Wildeford🇺🇸🚀 tweet media
13
44
964
46.2K
rain
rain@rainx0r·
since when does it matter what humans do?
the human brain doesn't implement stochastic gradient descent
humans don't do pretraining on every point of a given manifold
humans don't have to do explicit long context extension midtraining
is every other layer of current deep learning systems a bad idea too because they're not what humans do?

imo trying to use what humans do as a good north star in AI is doomed in general. we're trying to train computers to solve problems to, ideally, superhuman level, and we should think about what computers are good at (memory, parallelism, compute speed) and exploit it every step of the way. RL plays to those strengths and allows you to optimise directly for success in a given task, which is extremely powerful and does work.

the real issue with RL (which is probably what you mean) is that you don't really get OOD capabilities out of it. similar to supervised learning, you have to define tasks at training time (through a reward function rather than a labeled dataset), and you can't expect good performance on things it wasn't trained on: even if you train a multi-task model on a ton of different environments you won't get human-like reasoning and online problem solving and adaptation. but that's another problem entirely. "most economically viable tasks" can just be defined as envs and solved directly.

you can get all those other things too if you're so inclined, with RL of all things even, if you make the environment general/challenging enough. since you can optimise directly for success in any task, you can just make the task "be good at learning / adapting and few-shot solving OOD tasks" (which is often referred to as meta-reinforcement learning, and there's some good results for it already)
0
0
0
206
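A toy illustration of that last point (a made-up 2-armed bandit, not any published algorithm): when return is summed across episodes within a trial and the hidden task is resampled per trial, explore-first-then-exploit is exactly the behaviour the meta objective rewards, with no per-episode reward shaping needed.

```python
import random

def run_trial(num_episodes=10, explore_episodes=2, seed=0):
    """Meta-RL-style trial on a 2-armed bandit: the task (which arm pays) is
    hidden and resampled per trial, and return is summed ACROSS episodes, so
    spending early episodes on identification pays for itself later."""
    rng = random.Random(seed)
    best_arm = rng.randint(0, 1)          # hidden task identity
    totals, counts = [0.0, 0.0], [0, 0]   # per-arm reward statistics
    trial_return = 0.0
    for ep in range(num_episodes):
        if ep < explore_episodes:
            arm = ep % 2                  # explore: try each arm once
        else:                             # exploit: pick the better empirical mean
            mean0 = totals[0] / max(counts[0], 1)
            mean1 = totals[1] / max(counts[1], 1)
            arm = 0 if mean0 >= mean1 else 1
        reward = 1.0 if arm == best_arm else 0.0
        totals[arm] += reward
        counts[arm] += 1
        trial_return += reward
    return trial_return
```

Here the explore-then-exploit strategy is hand-coded to show the incentive; in meta-RL proper the agent discovers it because the outer objective (expected trial return over the task distribution) makes it optimal.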
Andrej Karpathy
Andrej Karpathy@karpathy·
In era of pretraining, what mattered was internet text. You'd primarily want a large, diverse, high quality collection of internet documents to learn from.

In era of supervised finetuning, it was conversations. Contract workers are hired to create answers for questions, a bit like what you'd see on Stack Overflow / Quora, or etc., but geared towards LLM use cases.

Neither of the two above are going away (imo), but in this era of reinforcement learning, it is now environments. Unlike the above, they give the LLM an opportunity to actually interact - take actions, see outcomes, etc. This means you can hope to do a lot better than statistical expert imitation. And they can be used both for model training and evaluation. But just like before, the core problem now is needing a large, diverse, high quality set of environments, as exercises for the LLM to practice against.

In some ways, I'm reminded of OpenAI's very first project (gym), which was exactly a framework hoping to build a large collection of environments in the same schema, but this was way before LLMs. So the environments were simple academic control tasks of the time, like cartpole, ATARI, etc. The @PrimeIntellect environments hub (and the `verifiers` repo on GitHub) builds the modernized version specifically targeting LLMs, and it's a great effort/idea. I pitched that someone build something like it earlier this year: x.com/karpathy/statu…

Environments have the property that once the skeleton of the framework is in place, in principle the community / industry can parallelize across many different domains, which is exciting.

Final thought - personally and long-term, I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically. I think that reward functions are super sus, and I think humans don't use RL to learn (maybe they do for some motor tasks etc, but not intellectual problem solving tasks). Humans use different learning paradigms that are significantly more powerful and sample efficient and that haven't been properly invented and scaled yet, though early sketches and ideas exist (as just one example, the idea of "system prompt learning", moving the update to tokens/contexts not weights and optionally distilling to weights as a separate process a bit like sleep does).
Prime Intellect@PrimeIntellect

Introducing the Environments Hub

RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them down

We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGI

257
858
7.3K
947K
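The "same schema" Karpathy refers to is the gym-style environment interface; a minimal sketch of a toy task in that shape (the task, class name, and observation format are made up for illustration, not from gym or `verifiers`):

```python
class GuessNumberEnv:
    """Toy environment in the gym-style schema:
    reset() -> (observation, info)
    step(action) -> (observation, reward, terminated, truncated, info)
    """

    def __init__(self, target=7, max_steps=10):
        self.target = target        # hidden answer the agent must find
        self.max_steps = max_steps  # time limit -> truncation, not termination

    def reset(self):
        self.steps = 0
        return {"hint": "guess an int in [0, 9]"}, {}

    def step(self, action):
        self.steps += 1
        terminated = action == self.target              # task solved
        truncated = self.steps >= self.max_steps and not terminated
        reward = 1.0 if terminated else 0.0
        hint = "correct" if terminated else ("higher" if action < self.target else "lower")
        return {"hint": hint}, reward, terminated, truncated, {}
```

Once many tasks share this skeleton, a single training or evaluation loop can iterate over all of them, which is exactly the parallelization-across-domains property described above.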