XJ @ Neurips 2025
@oleole

78 posts

Lead RL for AGI/Agent @ Amazon AGI SF Lab | Co-founder/CTO @ https://t.co/kTVDynwiIh | #RL #LLM #Agent | Ex-Netflix

Bay Area, CA · Joined November 2007
183 Following · 87 Followers
XJ @ Neurips 2025 retweeted
Ross Taylor @rosstaylor90
It’s funny that people on this site think major LLM efforts are talent-bound rather than org-bound. The talent differential has never been big between major orgs. Most of the difference in outcomes is due to organisational factors - like allocating compute to the right bets, and letting good research and engineering triumph over destructive politics.

This makes for a less sexy story though. People prefer to believe that breakthroughs are made by lone geniuses - instead of the cumulative effort of many nameless, social-media-averse people - supported by an org that allows the best ideas to win and manages big egos.

If you don’t believe me - then consider how some researchers suddenly gain or lose impact and productivity when they switch orgs. Was it because they gained or lost IQ points? 🙂

(Sorry, this is super obvious to anyone who’s actually worked in these labs - but you wouldn’t believe it based on the X feed right now!)
28 replies · 32 retweets · 570 likes · 75.2K views
XJ @ Neurips 2025 retweeted
Jason Wei @_jasonwei
There are traditionally two types of research: problem-driven research and method-driven research. As we’ve seen with large language models and now AlphaEvolve, it should be very clear now that total method-driven research is a huge opportunity.

Problem-driven research is nice because you have a consistent and specific goal. The goal is usually virtuous, so it feels good to have a mission and identity. However, it just doesn’t work, due to The Bitter Lesson. Basically everything in classical NLP (machine translation, summarization, chatbots) lost to simple scaling. ChatGPT is a prime example - it used nothing from chatbot research and certainly wasn’t the intended end goal of OpenAI’s 2022 research program, but was a huge hit because someone (John Schulman et al) figured out the right way to package large language models as a product.

Method-driven research feels less stable because you’re constantly searching for problems and you have to be opportunistic. But I believe AI will allow method-driven research to dominate progress in most fields of science, one by one. The latest method (or “hammer”), as we’ve seen with AlphaEvolve, is ruthless search and optimization against a reward function (whether this requires RL or not is a separate discussion). Things that problem-driven researchers have been trying to solve for a long time, like the kissing number problem, will become nails hit by the hammer. Eventually the hammer will become bigger, stronger, and more general, and will hit more and more nails.

So a very important meta-skill for the next decade will be knowing how to create the right environments to use The Hammer. Ironically, the problem-driven researchers, who by definition are experts in a specific problem, are well-positioned to create these environments. If, that is, they can put down their egos and pick up the hammer.
21 replies · 91 retweets · 711 likes · 77.8K views
XJ @ Neurips 2025 retweeted
David Luan @jluan
Stoked about the first release from our new lab: our browser-use agent lets you MapReduce over the web! This early preview moves us closer to reliable agents that learn from rewards across a wide range of digital and physical environments. Love our Adept+Amazon team so much!
Amazon Science @AmazonScience:
Meet Amazon Nova Act — an effortless way to build AI agents that can reliably use browsers 🧑‍💻 With our new model, compose robust steps into complex workflows; handle everything from bookings to QA testing. Getting started takes just 3 lines of code. See what Nova Act can do 🧵👇

4 replies · 7 retweets · 77 likes · 11.4K views
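For context, a minimal sketch of what those "3 lines of code" might look like, following the quickstart of the nova-act Python SDK as published at launch; the starting page and instruction are illustrative, and the API may have changed since:

```python
# Hedged sketch of the Nova Act quickstart ("3 lines of code"), based on
# the nova-act Python SDK as announced; details may differ in current releases.
from nova_act import NovaAct

with NovaAct(starting_page="https://www.amazon.com") as nova:
    nova.act("search for a coffee maker and open the first result")
```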
XJ @ Neurips 2025 @oleole
@pabbeel Proud to embark on this journey with the team! We're committed to making agents useful and robust for real-world use cases. Excited to engage with the developer community and see the creative use cases!
0 replies · 0 retweets · 3 likes · 420 views
XJ @ Neurips 2025 retweeted
Pieter Abbeel @pabbeel
I'm thrilled to share our first release as the AGI SF Lab. Meet Nova Act -- the most effortless way to build agents that can reliably use browsers, giving agents access to much of our digital world. It brings us closer to building universal agents in both the digital and physical worlds. See what Nova Act can do:
30 replies · 61 retweets · 524 likes · 69.3K views
XJ @ Neurips 2025 retweeted
Andrej Karpathy @karpathy
For friends of open source: imo the highest leverage thing you can do is help construct a high diversity of RL environments that help elicit LLM cognitive strategies. To build a gym of sorts. This is a highly parallelizable task, which favors a large community of collaborators.
315 replies · 818 retweets · 8.4K likes · 1.2M views
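As a concrete illustration of what one entry in such a "gym" could look like, here is a minimal sketch in the Gymnasium reset/step style; the class, task, and reward scheme are invented for illustration and are not any existing benchmark's API:

```python
# Minimal sketch of one "LLM gym" environment: a single-turn task that
# poses a problem and scores the model's text answer. Everything here
# (class name, task, reward) is illustrative, not an existing library.
import random

class ArithmeticEnv:
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def reset(self) -> str:
        """Sample a problem and return it as the prompt."""
        self.a, self.b = self.rng.randint(2, 99), self.rng.randint(2, 99)
        return f"Compute {self.a} * {self.b}. Reply with the number only."

    def step(self, action: str) -> tuple[float, bool]:
        """Score the model's reply; the episode ends after one turn."""
        try:
            reward = 1.0 if int(action.strip()) == self.a * self.b else 0.0
        except ValueError:
            reward = 0.0  # unparseable answers earn nothing
        return reward, True
```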
XJ @ Neurips 2025 retweeted
Andrej Karpathy @karpathy
"Move 37" is the word-of-day - it's when an AI, trained via the trial-and-error process of reinforcement learning, discovers actions that are new, surprising, and secretly brilliant even to expert humans. It is a magical, just slightly unnerving, emergent phenomenon only achievable by large-scale reinforcement learning. You can't get there by expert imitation. It's when AlphaGo played move 37 in Game 2 against Lee Sedol, a weird move that was estimated to only have 1 in 10,000 chance to be played by a human, but one that was creative and brilliant in retrospect, leading to a win in that game. We've seen Move 37 in a closed, game-like environment like Go, but with the latest crop of "thinking" LLM models (e.g. OpenAI-o1, DeepSeek-R1, Gemini 2.0 Flash Thinking), we are seeing the first very early glimmers of things like it in open world domains. The models discover, in the process of trying to solve many diverse math/code/etc. problems, strategies that resemble the internal monologue of humans, which are very hard (/impossible) to directly program into the models. I call these "cognitive strategies" - things like approaching a problem from different angles, trying out different ideas, finding analogies, backtracking, re-examining, etc. Weird as it sounds, it's plausible that LLMs can discover better ways of thinking, of solving problems, of connecting ideas across disciplines, and do so in a way we will find surprising, puzzling, but creative and brilliant in retrospect. It could get plenty weirder too - it's plausible (even likely, if it's done well) that the optimization invents its own language that is inscrutable to us, but that is more efficient or effective at problem solving. The weirdness of reinforcement learning is in principle unbounded. I don't think we've seen equivalents of Move 37 yet. I don't know what it will look like. I think we're still quite early and that there is a lot of work ahead, both engineering and research. But the technology feels on track to find them. youtube.com/watch?v=HT-UZk…
436 replies · 1.4K retweets · 9.5K likes · 999.8K views
XJ @ Neurips 2025 retweeted
Costa Huang @vwxyzjn
I think LLMs offer an incredible opportunity for the next generation of RL work. The LLM is like the imitation-learning policy in AlphaStar, and we need to do some cool RL stuff to make more magic. Lots of exciting work ahead!
1 reply · 1 retweet · 9 likes · 898 views
XJ @ Neurips 2025 @oleole
@srush_nlp For StarCraft, we did see some novel behavior emerge when scaling up RL. There are several key components: (1) high-quality human replays; (2) instruction following (through z embeddings); (3) exploration in RL training. More details in arxiv.org/abs/2012.13169
0 replies · 0 retweets · 1 like · 218 views
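A rough sketch of the second ingredient, conditioning the policy on a strategy statistic z in the AlphaStar style; the module names and dimensions are invented for illustration and are not taken from the cited paper's code:

```python
# Hedged sketch of "instruction following through z embeddings": the
# policy takes both the observation and a statistic vector z summarizing
# a human strategy (e.g. a build order). Shapes and names are illustrative,
# not AlphaStar's actual architecture.
import torch
import torch.nn as nn

class ZConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int = 256, z_dim: int = 64, n_actions: int = 100):
        super().__init__()
        self.z_proj = nn.Linear(z_dim, 128)   # embed the strategy statistic z
        self.trunk = nn.Sequential(nn.Linear(obs_dim + 128, 512), nn.ReLU())
        self.head = nn.Linear(512, n_actions)

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        h = torch.cat([obs, self.z_proj(z)], dim=-1)
        return self.head(self.trunk(h))       # action logits
```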
Sasha Rush @srush_nlp
Going through the public o1-ish literature (github.com/srush/awesome-…), I'm struggling with how it learns to make high-level plans. It's hard to imagine this emerges from roll-outs / MCTS even at large scale. Is this just really good training data? Did we see this emerge in other domains (starcraft?)
18 replies · 24 retweets · 287 likes · 59.7K views
XJ @ Neurips 2025 retweeted
Pablo Samuel Castro @pcastr
Great keynote by David Silver, arguing that we need to re-focus on RL to get out of the LLM Valley @RL_Conference
Amherst, MA 🇺🇸 · 12 replies · 67 retweets · 564 likes · 124.8K views
XJ @ Neurips 2025 retweeted
Andrej Karpathy @karpathy
# RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not. Let's take a look at the example of AlphaGo.

AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better. Then you'd collect say 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember, the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" to its training data, which are not actually good states, yet by chance get a very high reward from the RM.

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These completions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something nonsensical like "The the the the the the" to many prompts. Which looks ridiculous to you, but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in undefined territory. Yes, you can mitigate this by repeatedly adding these specific examples to the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it, because your optimization will start to game the RM. This is not RL like AlphaGo was.

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there are a few subtle reasons, but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of a few candidate answers than to write the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good-looking poem given a few candidates. So RLHF is a way to benefit from this gap in the "easiness" of human supervision.

There are a few other reasons too, e.g. RLHF is also helpful in mitigating hallucinations, because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual claims when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post, so I digress.

All to say that RLHF *is* net useful, but it's not RL. No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of winning the game) is really difficult in open-ended problem-solving tasks. It's all fun and games in a closed, game-like environment like Go, where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible, but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.
405 replies · 1.2K retweets · 8.8K likes · 1.2M views
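To make the "vibe check" concrete, here is a minimal sketch of how such a reward model is typically trained on pairwise comparisons (a Bradley-Terry-style loss); the function and tensor names are illustrative stand-ins:

```python
# Hedged sketch of reward-model training on pairwise human comparisons
# (Bradley-Terry style), the "vibe check" described above. `rm` stands in
# for any network mapping an encoded response to a scalar score.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rm: torch.nn.Module,
                     chosen: torch.Tensor,
                     rejected: torch.Tensor) -> torch.Tensor:
    r_chosen = rm(chosen)      # score of the response labelers preferred
    r_rejected = rm(rejected)  # score of the response they rejected
    # Push the preferred response to score higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```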
XJ @ Neurips 2025 retweeted
Jim Fan @DrJimFan
I believe next-gen LLMs will heavily borrow insights from a decade of game AI research.
▸ Noam Brown, creator of the Libratus poker AI, is joining OpenAI.
▸ Demis Hassabis says that DeepMind Gemini will tap techniques from AlphaGo.

These moves make a lot of sense. Methods like self-play (training) and tree search (inference) helped machines beat human champions in Go, Poker, Dota, and StarCraft. They improve a model's reasoning capabilities in a highly scalable fashion.

We are already seeing such ideas being added to the LLM arsenal. Voyager is an inference-time algorithm that enables an agent to continuously write code and bootstrap its skills in Minecraft. Tree of Thought combines search with in-context LLM reasoning. Many more will follow.
24 replies · 197 retweets · 1.1K likes · 315.7K views
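As a toy illustration of the tree-search-at-inference idea, a hedged sketch loosely in the Tree of Thought spirit: expand a few candidate continuations per step and keep the best few. The `propose` and `score` callables are assumed to be LLM-backed; none of this is a specific paper's API:

```python
# Toy sketch of tree search at inference time: breadth-first expansion of
# candidate "thoughts" with beam pruning. `propose` and `score` are assumed
# LLM-backed callables, invented here for illustration.
from typing import Callable

def tree_search(root: str,
                propose: Callable[[str], list[str]],
                score: Callable[[str], float],
                depth: int = 3, beam: int = 2) -> str:
    frontier = [root]
    for _ in range(depth):
        candidates = [c for node in frontier for c in propose(node)]
        if not candidates:
            break
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)  # best thought found
```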
XJ @ Neurips 2025 retweeted
Jim Fan @DrJimFan
In the Transformers movies, 9 Decepticons merge to form “Devastator”, a much larger and stronger bot. This turns out to be a powerful paradigm for multimodal LLMs too: instead of a monolithic Transformer, we can stack many pre-trained experts into one.

My team’s work, Prismer, is a representative example. We use a textual LM as the backbone, and plug in many visual domain experts through a neural adapter interface for deep integration.

Yesterday, Microsoft provided another example called “Visual ChatGPT”. It uses ChatGPT as a central communication hub, and plugs in many blackbox visual models, such as Stable Diffusion, pix2pix, and ControlNet. The result is a multimodal conversational AI that both understands and generates images, with ZERO trainable parameters: 🧵
22 replies · 101 retweets · 686 likes · 241.3K views
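To ground the adapter idea, a minimal sketch of the pattern described above: a frozen expert's features pass through a small trainable bottleneck into the LM's hidden dimension. The dimensions and names are illustrative, not Prismer's actual code:

```python
# Hedged sketch of the expert-plus-adapter pattern: a frozen domain
# expert's features are projected through a small trainable bottleneck
# into the LM's hidden size. All dimensions here are illustrative.
import torch
import torch.nn as nn

class ExpertAdapter(nn.Module):
    def __init__(self, expert_dim: int = 768, lm_dim: int = 1024,
                 bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(expert_dim, bottleneck)  # compress expert features
        self.up = nn.Linear(bottleneck, lm_dim)        # project into LM space

    def forward(self, expert_features: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(expert_features)))
```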
XJ @ Neurips 2025 retweeted
Yi Tay @YiTayML
New open-source Flan-UL2 20B checkpoints :)
- Truly open source 😎 No forms! 🤭 Apache license 🔥
- Best OS model on MMLU/Big-Bench Hard 🤩
- Better than Flan-T5 XXL & competitive with Flan-PaLM 62B
- Size ceiling of the Flan family just got higher!
Blog: yitay.net/blog/flan-ul2-…
46 replies · 332 retweets · 1.6K likes · 451.4K views
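A likely way to try the checkpoint, assuming it is hosted on the Hugging Face Hub as google/flan-ul2 and loads as a standard T5-family model; the prompt is arbitrary:

```python
# Hedged usage sketch, assuming the release is published on the Hugging
# Face Hub as "google/flan-ul2" (a 20B model; loading needs ample memory).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/flan-ul2")
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2")

inputs = tok("Answer step by step: what is 12 * 7?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```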
XJ @ Neurips 2025 retweeted
Ilya Sutskever @ilyasut
Many believe that great AI advances must contain a new “idea”. But it is not so: many of AI’s greatest advances had the form “huh, turns out this familiar unimportant idea, when done right, is downright incredible”
44 replies · 217 retweets · 1.6K likes · 315.3K views
XJ @ Neurips 2025 retweeted
Kevin Liu @kliu128
The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)
253 replies · 2.3K retweets · 13.8K likes · 3.1M views
XJ @ Neurips 2025 retweeted
Andrej Karpathy @karpathy
Potentially nitpicky, but competitive advantage in AI goes not so much to those with data as to those with a data engine: iterated data acquisition, re-training, evaluation, deployment, telemetry. And whoever can spin it fastest. Slide from Tesla to ~illustrate, but the concept is general.
62 replies · 384 retweets · 2.7K likes
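To make the loop explicit, a hedged sketch of one spin of such a data engine; every callable here is a hypothetical stand-in for a team's actual pipeline stages:

```python
# Hedged sketch of one spin of a "data engine": acquisition -> re-training
# -> evaluation -> deployment -> telemetry. All callables are hypothetical
# stand-ins; the point is the iterated loop, not any specific pipeline.
from typing import Any, Callable

def spin_data_engine(model: Any, dataset: list,
                     acquire: Callable[[Any], list],
                     retrain: Callable[[Any, list], Any],
                     evaluate: Callable[[Any], float],
                     deploy: Callable[[Any], Any]):
    dataset = dataset + acquire(model)   # harvest hard cases from telemetry
    candidate = retrain(model, dataset)  # re-train on the refreshed set
    if evaluate(candidate) >= evaluate(model):
        model = deploy(candidate)        # ship; telemetry feeds the next spin
    return model, dataset
```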