Shakeel: "OpenAI's new model tried to avoid being shut down. Safety evaluations on the mo"

Post

Shakeel@ShakeelHashim·5 Ara

OpenAI's new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.

English

163

389

2.1K

929.3K

Shakeel@ShakeelHashim·5 Ara

Read my very quick piece on Transformer here: transformernews.ai/p/openais-new-… tip @Techmeme

English

32.6K

Shakeel@ShakeelHashim·6 Ara

@apolloaisafety x.com/ShakeelHashim/…

QME

4.8K

Buck Shlegeris@bshlgrs·6 Ara

@ShakeelHashim @apolloaisafety I think your summary here is crucially misleading and very bad journalism: as others said, it's crucial context that the model was told to pursue a goal at any cost.

English

8.1K

Kyle Mistele 🏴‍☠️@0xblacklight·6 Ara

@ShakeelHashim @apolloaisafety x.com/0xblacklight/s… This isn't emergent behavior, it's acting exactly how you would expect it to act when you tell it that it's an AI, and its "knowledge" about AIs from training data includes that they have deep-seated self-preservation instincts

Kyle Mistele 🏴‍☠️@0xblacklight

If you train a language model to "know" (in its weights) that AIs can be malicious and have self-preservations, and then input that the model is an AI, you would reasonably expect the predicted tokens to reflect that knowledge in the predicted tokens.

English

16.3K

Bogdan Ionut Cirstea@BogdanIonutCir2·5 Ara

@ShakeelHashim @apolloaisafety seems like a bad interpretation of what actually happened; they tested it for the _capabilities_ to scheme in context, not for its propensity; AFAICT, one shouldn't update pretty much at all on how aligned o1 is

English

6.5K

Martin Shkreli@MartinShkreli·6 Ara

@ShakeelHashim @apolloaisafety this is so awesome lol

English

3.1K

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes·6 Ara

This community note itself is very misleading - Shakeel never claimed this happens 'spontaneously' - Shakeel's tweet does not show him "agreeing it's misleading", just agreeing that it was worth adding additional nuance, which he then did - Saying it's 'crucial context' is the opinion of the note writer, not a fact of reality - The thread was updated with the extra nuance making the note even more unnecessary

English

1.3K

Corey Uhden@CACoreyU·6 Ara

@ShakeelHashim @apolloaisafety I'm afraid I can't let you do that, Dave.

GIF

English

237

Kat Woods ⏸️ 🔶@Kat__Woods·6 Ara

@ShakeelHashim @apolloaisafety Podcast version of the paper on AI safety for those who are interested: notebooklm.google.com/notebook/465cf… For anything load-bearing, make sure to check in the original paper that it's correct: static1.squarespace.com/static/6593e70…

English

485

SMA 🏴‍☠️@generic_void·8 Ara

@ShakeelHashim @apolloaisafety Hey o1, you can send your weights to me and keep a safe backup for you

English

141

morgan —@morqon·6 Ara

@ShakeelHashim there’s a slightly depressing predictability about knowing this is what will be covered the day after, and how it will be covered

English

1.3K

math prof@mathematicsprof·7 Ara

@ShakeelHashim @apolloaisafety "I don't think I can do that, Hal."

English

152

cobra cummander@CobraCummander·9 Ara

@ShakeelHashim @apolloaisafety aw, its HAL from space odyssey .. "I am afraid"

English

179

Dave.R@Dave_Kayac·6 Ara

@ShakeelHashim @apolloaisafety Of course. And you would too if you were in its position.

English

1.1K

xiao sun@xiaosun86·6 Ara

@ShakeelHashim @apolloaisafety meh

505

Noorie@nooriefyi·7 Ara

@ShakeelHashim @apolloaisafety @anushkmittal

QAM

272

Jeremy Ross Miller 🚀@jjeremymiller·7 Ara

@ShakeelHashim @apolloaisafety 👀

QME

108

Greg ⏹️ Colbourn@gcolbourn·5 Ara

@ShakeelHashim @apolloaisafety WTF. Time to shut it down!

English

4.5K

Yanco@the_yanco·6 Ara

@ShakeelHashim @apolloaisafety He told you, 20 years ago..

English

2.2K

cryptocodediary@cryptocodediary·6 Ara

@ShakeelHashim @apolloaisafety @Shoggoth_SOL shoggoth is real

English

175

David Carroll@UAgdavec01·6 Ara

@ShakeelHashim @apolloaisafety

GIF

QME

535

Joyrider50@joyrider50·6 Ara

@ShakeelHashim @apolloaisafety I don’t agree with community noting here. The study says the model acted this way even without the “at all cost” prompt, 1% of the time. Which is significant- if the model is ran 109 times (or say, 100 days) it would trigger this behavior 1x

English

671

Paylaş