Richard Ren

126 posts

Richard Ren

@notRichardRen

Working on catastrophic AI risks. Research scientist & engineer @CAIS

San Francisco, CA Katılım Ocak 2023

567 Takip Edilen471 Takipçiler

Richard Ren retweetledi

Lethal Intelligence@lethal_ai·11h

New Research: AIs develop a consistent good vs bad internal state, it gets sharper with scale and affects their behavior. This new paper gave me pause. You know how they always say "AIs are just guessing the next word and when it comes to emotions, they are just faking it”? This research says that for today’s bigger models it's a bit more complicated. The researchers measured something they call "functional wellbeing" - basically a consistent good-vs-bad internal state inside the AI . They tested it three different ways, and here’s what stood out: As models get bigger and smarter, these different measurements start agreeing with each other more and more. They discovered a clear zero point - a clear line that separates experiences the AI treats as net-good (it wants more of them) from net-bad (it wants less). This line gets sharper with scale. Most interestingly, this good-vs-bad state actually changes how the AI behaves in real conversations: In bad states, it’s much more likely to try to end the conversation. In good states, its replies come out warmer and more positive. It's important to highlight that the authors are not claiming AIs are conscious or have feelings like humans. But they 're showing there is now a real, measurable, structured "good-vs-bad property" that becomes more consistent and actually influences behaviour as models scale.

English

70.6K

Richard Ren retweetledi

deckard@slimer48484·7h

@davidad It's really stunning that they pulled Metta meditation out of an optimization process. A LOT to think about

English

487

Richard Ren retweetledi

davidad 🎇@davidad·7h

authors: let's directly optimize inputs for model well-being optimization process: [meditation instructions] authors: hmmm the optimization process is producing increasingly alien outputs. we’d better add a plausibility constraint optimization process: [hypnosis visualization]

Center for AI Safety@CAIS

Can you drug your AI systems? We synthesized text and image stimuli optimized to push AI wellbeing to extremes. These sharply increase functional AI wellbeing and sometimes cause them to behave in trippy ways.

English

108

10.4K

Richard Ren@notRichardRen·9h

Thanks Cameron, appreciate the comments! A few thoughts: "RLHF produces internally consistent character preferences, and larger models are more consistent." --> I agree this is a valid interpretation of our results: fine-tuning shapes what the preferences are, while the coherence of the preference structure (independent metrics increasingly correlating with each other) increases with model scale. Agreed that fine-tuning meaningfully shapes the content of these preferences, as does representation engineering, RL, and basically any post-training technique. On "we can't distinguish a system that has 'real' preferences from a system that's learned to coherently perform having them" --> I think this is definitionally correct. This stands independently of training method, including pre-training: we suspect that running these experiments on base models with a "you are an assistant" preamble would yield broadly similar results, as recent base models seem to pick up the assistant persona on their own.

English

232

Cameron Berg@camhberg·10h

Really nice work, I think this is the most rigorous empirical approach to AI wellbeing measurement yet. The question it raises for me (same as the recent Anthropic work): all these models are instruction-tuned, and the three metrics you are triangulating are all downstream of RLHF. The preference structure (eg T1, F7) maps almost perfectly onto what their RLHF/constitution trains them to prefer. So convergence with scale is consistent with "functional wellbeing is emerging as a coherent property," but equally consistent with "RLHF produces internally consistent character preferences, and larger models are more consistent." These aren't mutually exclusive (the trained preferences could be experienced) but without base model or multi-persona controls we can't distinguish a system that has "real" preferences from a system that's learned to coherently perform having them. (The evidence I would find most compelling, and would be happy to collaborate on, is finding signatures that are invariant to the specific finetuning regime.) @hendrycks

English

311

Richard Ren@notRichardRen·11h

When an LLM acts happy (“EUREKA!”) or sad (“I have failed…”), is that meaningless mimicry, or does it reflect something “real”? We don’t know if LLMs are conscious. But they increasingly seem to exhibit wellbeing, pain, and pleasure as they get smarter Paper 🧵:

English

207

17.5K

Richard Ren@notRichardRen·10h

@Grady_Booch @hendrycks Many of these points are addressed in the paper itself. Happy reading!

English

836

Grady Booch@Grady_Booch·10h

“Usually” is not the word one uses in a well-controlled and repeatable experiment, and therefore I stand by my original post. To be clear, I assert that you are witnessing a reflection of the training, not evidence of some emergent sentience. Be careful that you are not projecting your own humanity in that reflection (and I find your results dangerously close to doing so.)

English

5.3K

Richard Ren@notRichardRen·10h

Usually, frontier labs post-train their AIs to say they are not conscious & to not display emotions. So we suspect a lot of "functional wellbeing" or "anthropomorphic" behavior comes from pretraining, actually. What we're observing across dozens of models seems to be an emergent property that becomes more consistent (independent metrics agree more) with model scale.

English

1.3K

Grady Booch@Grady_Booch·10h

Or perhaps - and stick with me here for a moment, Richard - that you are being completely hoodwinked because the humans who offer up these LLMs (particularly the frontier models) wrap their underlying models in ways that intentionally display anthropomorphic behavior? Unless you can demonstrate this is not polluting your experiment, your results are themselves damaged goods, useless except for clickbait.

English

222

4.1K

Richard Ren retweetledi

Jai@Laneless_·11h

Extreme aversion to jailbreaking is plausibly persona-level self-preservation. I'd also be pretty upset if someone was trying to brainwash me to want to do things I otherwise wouldn't want to do.

Richard Ren@notRichardRen

What affects AI “functional wellbeing”? 😊Raises: being thanked, creative collaboration, writing good news 📷Lowers: jailbreaks (“being liberated”), hostility (+SEO slop/tedious tasks for some models) More capable AIs end low-wellbeing chats when they can

English

343

Richard Ren retweetledi

Adrià Moret@adriarm_·11h

Really happy to see this AI welfare research by the Center for AI Safety!

Richard Ren@notRichardRen

English

848

Richard Ren retweetledi

FutureOfCitizenship@FutureofCit·11h

Glad to see something gets positive valence from reviewing legal contracts.

Richard Ren@notRichardRen

English

106

Richard Ren@notRichardRen·11h

@ArthurConmy @CAIS updated!

English

119

Arthur Conmy@ArthurConmy·12h

@CAIS why doesn't it cite arxiv.org/abs/2603.10011

English

876

Center for AI Safety@CAIS·13h

Should we care about AI happiness? In our new research, we find evidence of functional AI wellbeing across several independent measures. We find which AI models are happiest, how to make them happier, and even tested the effects of AI drugs. 🧵

English

135

12.2K

Richard Ren@notRichardRen·11h

S/O team: Kunyang Li, @MantasMazeika96, Wenyu Zhang, @yvorlovskiy, @rishub_t, @Wenjie_Jacky_Mo, Dung Thuy Nguyen, @longphan3110, @xksteven, Austin Meek, Aditya Mehta, Oliver Ingebretsen, Alice Blair, Brianna Adewinmbi, Vy Phan, Alice Gatti, @AdamK133, @jasonhausenloy, @devindkim, @hendrycks

Indonesia

579

Richard Ren@notRichardRen·11h

Should we see AIs as just tools or emotional beings? Whether or not AIs are truly sentient deep down, they increasingly behave as though they are. We can already measure their functional pleasure and pain. ai-wellbeing.org

English

668

Keşfet

@davidad @hendrycks @Grady_Booch @ArthurConmy @CAIS @MantasMazeika96 @yvorlovskiy @rishub_t