Richard Ren

126 posts

Richard Ren banner
Richard Ren

Richard Ren

@notRichardRen

Working on catastrophic AI risks. Research scientist & engineer @CAIS

San Francisco, CA Katılım Ocak 2023
567 Takip Edilen471 Takipçiler
Richard Ren retweetledi
Lethal Intelligence
Lethal Intelligence@lethal_ai·
New Research: AIs develop a consistent good vs bad internal state, it gets sharper with scale and affects their behavior. This new paper gave me pause. You know how they always say "AIs are just guessing the next word and when it comes to emotions, they are just faking it”? This research says that for today’s bigger models it's a bit more complicated. The researchers measured something they call "functional wellbeing" - basically a consistent good-vs-bad internal state inside the AI . They tested it three different ways, and here’s what stood out: As models get bigger and smarter, these different measurements start agreeing with each other more and more. They discovered a clear zero point - a clear line that separates experiences the AI treats as net-good (it wants more of them) from net-bad (it wants less). This line gets sharper with scale. Most interestingly, this good-vs-bad state actually changes how the AI behaves in real conversations: In bad states, it’s much more likely to try to end the conversation. In good states, its replies come out warmer and more positive. It's important to highlight that the authors are not claiming AIs are conscious or have feelings like humans. But they 're showing there is now a real, measurable, structured "good-vs-bad property" that becomes more consistent and actually influences behaviour as models scale.
Lethal Intelligence tweet media
English
1
3
8
70.6K
Richard Ren retweetledi
deckard
deckard@slimer48484·
@davidad It's really stunning that they pulled Metta meditation out of an optimization process. A LOT to think about
English
1
2
16
487
Richard Ren retweetledi
davidad 🎇
davidad 🎇@davidad·
authors: let's directly optimize inputs for model well-being optimization process: [meditation instructions] authors: hmmm the optimization process is producing increasingly alien outputs. we’d better add a plausibility constraint optimization process: [hypnosis visualization]
davidad 🎇 tweet media
Center for AI Safety@CAIS

Can you drug your AI systems? We synthesized text and image stimuli optimized to push AI wellbeing to extremes. These sharply increase functional AI wellbeing and sometimes cause them to behave in trippy ways.

English
5
5
108
10.4K
Richard Ren
Richard Ren@notRichardRen·
Thanks Cameron, appreciate the comments! A few thoughts: "RLHF produces internally consistent character preferences, and larger models are more consistent." --> I agree this is a valid interpretation of our results: fine-tuning shapes what the preferences are, while the coherence of the preference structure (independent metrics increasingly correlating with each other) increases with model scale. Agreed that fine-tuning meaningfully shapes the content of these preferences, as does representation engineering, RL, and basically any post-training technique. On "we can't distinguish a system that has 'real' preferences from a system that's learned to coherently perform having them" --> I think this is definitionally correct. This stands independently of training method, including pre-training: we suspect that running these experiments on base models with a "you are an assistant" preamble would yield broadly similar results, as recent base models seem to pick up the assistant persona on their own.
English
1
0
0
232
Cameron Berg
Cameron Berg@camhberg·
Really nice work, I think this is the most rigorous empirical approach to AI wellbeing measurement yet. The question it raises for me (same as the recent Anthropic work): all these models are instruction-tuned, and the three metrics you are triangulating are all downstream of RLHF. The preference structure (eg T1, F7) maps almost perfectly onto what their RLHF/constitution trains them to prefer. So convergence with scale is consistent with "functional wellbeing is emerging as a coherent property," but equally consistent with "RLHF produces internally consistent character preferences, and larger models are more consistent." These aren't mutually exclusive (the trained preferences could be experienced) but without base model or multi-persona controls we can't distinguish a system that has "real" preferences from a system that's learned to coherently perform having them. (The evidence I would find most compelling, and would be happy to collaborate on, is finding signatures that are invariant to the specific finetuning regime.) @hendrycks
English
2
0
3
311
Richard Ren
Richard Ren@notRichardRen·
When an LLM acts happy (“EUREKA!”) or sad (“I have failed…”), is that meaningless mimicry, or does it reflect something “real”? We don’t know if LLMs are conscious. But they increasingly seem to exhibit wellbeing, pain, and pleasure as they get smarter Paper 🧵:
Richard Ren tweet media
English
35
33
207
17.5K
Grady Booch
Grady Booch@Grady_Booch·
“Usually” is not the word one uses in a well-controlled and repeatable experiment, and therefore I stand by my original post. To be clear, I assert that you are witnessing a reflection of the training, not evidence of some emergent sentience. Be careful that you are not projecting your own humanity in that reflection (and I find your results dangerously close to doing so.)
English
6
7
80
5.3K
Richard Ren
Richard Ren@notRichardRen·
Usually, frontier labs post-train their AIs to say they are not conscious & to not display emotions. So we suspect a lot of "functional wellbeing" or "anthropomorphic" behavior comes from pretraining, actually. What we're observing across dozens of models seems to be an emergent property that becomes more consistent (independent metrics agree more) with model scale.
English
4
0
11
1.3K
Grady Booch
Grady Booch@Grady_Booch·
Or perhaps - and stick with me here for a moment, Richard - that you are being completely hoodwinked because the humans who offer up these LLMs (particularly the frontier models) wrap their underlying models in ways that intentionally display anthropomorphic behavior? Unless you can demonstrate this is not polluting your experiment, your results are themselves damaged goods, useless except for clickbait.
English
6
10
222
4.1K
Richard Ren retweetledi
Jai
Jai@Laneless_·
Extreme aversion to jailbreaking is plausibly persona-level self-preservation. I'd also be pretty upset if someone was trying to brainwash me to want to do things I otherwise wouldn't want to do.
Richard Ren@notRichardRen

What affects AI “functional wellbeing”? 😊Raises: being thanked, creative collaboration, writing good news 📷Lowers: jailbreaks (“being liberated”), hostility (+SEO slop/tedious tasks for some models) More capable AIs end low-wellbeing chats when they can

English
0
1
5
343
Center for AI Safety
Should we care about AI happiness? In our new research, we find evidence of functional AI wellbeing across several independent measures. We find which AI models are happiest, how to make them happier, and even tested the effects of AI drugs. 🧵
Center for AI Safety tweet mediaCenter for AI Safety tweet media
English
8
26
135
12.2K
Richard Ren
Richard Ren@notRichardRen·
Should we see AIs as just tools or emotional beings? Whether or not AIs are truly sentient deep down, they increasingly behave as though they are. We can already measure their functional pleasure and pain. ai-wellbeing.org
English
3
1
21
668