Michael McArdle

798 posts

Michael McArdle
@MikeMan444

Co-founder / Principal at Lucid Dream - Leveraging the power of purposeful play. My views here are my own.

Durham, NC · Joined January 2011
698 Following · 333 Followers
Michael McArdle @MikeMan444
One thing I haven't figured out yet: are there any meaningful differences between 4o and 4.5 in terms of the new image generation? 4.5 is a massively bigger model - presumably that would result in some differences? Or does it use 4o regardless?
0 replies · 0 reposts · 1 like · 80 views
Michael McArdle @MikeMan444
@michae1becker Have you found a marked decrease in spatial anchoring quality with the beta version 7 update?
0 replies · 0 reposts · 0 likes · 34 views
Michael Becker @michae1becker
speedrunning #100DaysOfSwiftUI and got to the obligatory FizzBuzz - to keep things fresh and Apple Vision Pro-focused, I created the FizzBuzzVerse SwiftUI views... lord give me strength
5 replies · 0 reposts · 34 likes · 2.5K views
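For reference, the exercise the tweet mentions is the classic FizzBuzz. The tweet's actual SwiftUI view code isn't shown, so here is just the underlying algorithm, sketched in plain Python:

```python
# Classic FizzBuzz: multiples of 3 -> "Fizz", multiples of 5 -> "Buzz",
# multiples of both -> "FizzBuzz", everything else -> the number itself.
def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

# First fifteen values, ending on the first "FizzBuzz".
print([fizzbuzz(i) for i in range(1, 16)])
```

Checking divisibility by 15 first matters: testing 3 or 5 first would return early and never reach the combined case.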
Michael McArdle @MikeMan444
@karpathy Well said. The problem of sycophancy seems inherent in the current RLHF / RLAIF paradigm. The model is not optimizing for correctness or factuality, it’s optimizing for pleasing the RM.
0 replies · 0 reposts · 0 likes · 341 views
Andrej Karpathy @karpathy
# RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not.

Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better. Then you'd collect, say, 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes.

Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember, the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" to its training data, which are not actually good states, yet by chance get a very high reward from the RM.

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something nonsensical like "The the the the the the" to many prompts. Which looks ridiculous to you, but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in undefined territory. Yes, you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it, because your optimization will start to game the RM. This is not RL like AlphaGo was.

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there are a few subtle reasons, but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of a few candidate answers than to write the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips". An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good-looking poem given a few candidates. So RLHF is a way to benefit from this gap in the "easiness" of human supervision. There are a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations: if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual claims when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post, so I digress.

All to say that RLHF *is* net useful, but it's not RL. No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of winning the game) is really difficult in open-ended problem-solving tasks. It's all fun and games in a closed, game-like environment like Go, where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or rewriting some Java code to Python? Going towards this is not in principle impossible, but it's also not trivial and it requires some creative thinking. Whoever convincingly cracks this problem will be able to run actual RL - the kind of RL that led to AlphaGo beating humans at Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.
Andrej Karpathy tweet media
405 replies · 1.2K reposts · 8.8K likes · 1.2M views
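The reward-modeling step the thread describes can be made concrete: fit a model to agree with pairwise human preferences (the Bradley-Terry-style objective commonly used in RLHF), then watch it hand out an absurd score to an out-of-distribution input. Everything below is a toy illustration under invented assumptions - a real RM is a large neural net over text, not a 3-feature linear model, and the features ("helpfulness", "length", "politeness") are made up so the example runs:

```python
# Toy sketch of RLHF reward modeling: humans pick which of two candidate
# responses they prefer, and we fit a reward model to agree with those
# pairwise choices by maximizing log sigmoid(r(preferred) - r(rejected)).
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """comparisons: list of (features_preferred, features_rejected) pairs.
    Fits a linear reward r(x) = w . x via gradient ascent on the
    pairwise log-likelihood."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_pos, x_neg in comparisons:
            # gradient of log sigmoid(r_pos - r_neg) w.r.t. w
            p = sigmoid(dot(w, x_pos) - dot(w, x_neg))
            for i in range(n_features):
                w[i] += lr * (1.0 - p) * (x_pos[i] - x_neg[i])
    return w


# Invented features for a "response": [helpfulness, length, politeness].
# Synthetic raters prefer helpful, polite responses; length is noise.
random.seed(0)
comparisons = []
for _ in range(200):
    good = [random.uniform(0.6, 1.0), random.uniform(0, 1), random.uniform(0.5, 1.0)]
    bad = [random.uniform(0.0, 0.4), random.uniform(0, 1), random.uniform(0.0, 0.5)]
    comparisons.append((good, bad))

w = train_reward_model(comparisons, n_features=3)
reward = lambda x: dot(w, x)

# In distribution, the RM ranks responses sensibly...
assert reward([0.9, 0.5, 0.9]) > reward([0.1, 0.5, 0.1])

# ...but a wildly out-of-distribution input ("politeness" of 100) gets an
# even higher score: the adversarial-example failure mode from the thread.
assert reward([0.0, 0.0, 100.0]) > reward([0.9, 0.5, 0.9])
```

The last assertion is the point: the RM was only ever asked to imitate preferences over in-distribution pairs, so an optimizer searching against it will find inputs like this that score well without being good - which is why RLHF runs are stopped after relatively few optimization steps.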
Ben Lang @benz145
Off the top of my head: Elite Dangerous, Sea of Thieves, Metro: Exodus
3 replies · 0 reposts · 9 likes · 1.6K views
Ben Lang @benz145
Which come to mind?
Ben Lang tweet media
109 replies · 4 reposts · 258 likes · 24.5K views
Michael McArdle @MikeMan444
@XrDigest I think the headline here is premature. The VR participants were not shown a volumetric VR environment - they were shown a 360° panorama - while the real-world participants were given the opportunity to walk around the environment. That's comparing apples to oranges.
1 reply · 0 reposts · 0 likes · 27 views
Michael McArdle reposted
Nathan Labenz @labenz
Wanted: aspiring AI product Red Teamers. Do you enjoy jailbreaking AI products? Would you like to help raise the standards for AI application development broadly? If yes, please get in touch! (and I never say this but ... please retweet!)
7 replies · 30 reposts · 42 likes · 12K views
Robert McNees @mcnees
@CEOHaize Radiohead at a little club in Chapel Hill, the opening show of the tour for "The Bends." Ray Charles at Thompson Boling Arena in Knoxville. Just off-the-charts energy at both shows.
1 reply · 0 reposts · 0 likes · 175 views
Nathan 🔎 @NathanpmYoung
What are your favourite point and click, puzzle or visual novel games? I am playing through "Return of the Obra Dinn" and it is just wonderful.
13 replies · 0 reposts · 19 likes · 1.4K views
Michael McArdle @MikeMan444
@DennyCloudhead Great point. I think if the early SteamVR setup process had included some gamified way to add your furniture and tag it with a semantic layer, devs could have had some interesting abilities. Though most probably wouldn't have used them, as it's a lot of effort for a tiny userbase.
0 replies · 0 reposts · 0 likes · 26 views
Michael McArdle @MikeMan444
@ylecun Yann, I would argue that's an oversimplification of the article you linked, to the degree that it might be a misleading representation.
0 replies · 0 reposts · 0 likes · 42 views
Michael McArdle @MikeMan444
@VrDevBrad Great points - I think the main takeaway for me is that structure, accountability (via some mechanism), and consequences are the three elements that must remain in the best games with procedural AI generation for them to be satisfying.
1 reply · 0 reposts · 0 likes · 3 views
Michael McArdle @MikeMan444
A great and sobering point I found on Reddit - it's fun to imagine “infinitely generated RPGs” with generative AI, but taken to the extreme it just dilutes and destroys meaning.
Michael McArdle tweet media
1 reply · 1 repost · 2 likes · 136 views
Michael McArdle @MikeMan444
and it's a blind spot in the quest for AGI that might turn out to be a blocker between us and certain capabilities. Definitely not a new take or a groundbreaking one but it's where I find myself right now. end/
0 replies · 0 reposts · 0 likes · 33 views
Michael McArdle @MikeMan444
I'll end by saying I do not believe all thought and knowledge is mediated through language, and to believe so is the cardinal sin of our cultural inheritance from the Greeks and Romans. It's an assumption that's baked into most people in our current culture, 12/
1 reply · 0 reposts · 0 likes · 37 views
Michael McArdle @MikeMan444
I am certainly not on the side that argues that AGI is impossible without being embodied (@ylecun et al, if I'm summarizing their position fairly), but it seems very likely to me that something critical is missing with current approaches. 1/
1 reply · 0 reposts · 0 likes · 82 views