Alex Cloud
@cloud_kx
31 posts
Joined September 2019
58 Following · 153 Followers
Alex Cloud retweeted
Igor Shilov @_igorshilov
New Anthropic research! We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.
Alex Cloud retweeted
Alex Turner @Turn_Trout
Maybe *you* should apply to work with me and @cloud_kx on Team Shard in MATS. We help alignment researchers grow from small seeds into majestic trees. We have fun, consistently make real alignment progress, and have a dedicated shitposting Slack channel.
Alex Cloud retweeted
Ethan Perez @EthanJPerez
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
Stella Biderman @BlancheMinerva
It boggles my mind how people have made prominent careers out of writing a paper with a scary headline about AI risks that would have produced a non-scary headline with a minor tweak to the set-up and/or probably isn't a meaningfully real phenomenon in the first place.
Gerald Ashley @Gerald_Ashley
@koenfucius @cloud_kx Unsurprising. Just gigantic optimisation engines trawling through past data. Key fact: no original thoughts!
Koenfucius 🔍 @koenfucius
Wow. Researchers find (similar) LLMs can pass on behavioural traits—eg a “favourite animal” or misalignment—to each other *subliminally*, via signals that are hidden in the data, *not semantically related to the latent traits*. write @cloud_kx et al: buff.ly/hjGj8OJ
Alex Cloud retweeted
Owain Evans @OwainEvans_UK
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Alex Cloud @cloud_kx
@tyler_m_john @OwainEvans_UK If there are scaling laws, they will fall out of inner products of teacher and student gradients, as they appear in the proof of our theorem. Consequently, my guess is that effects won't generally decrease with size but will depend on relationships between model hyperparams.
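The role of those teacher-student gradient inner products can be seen in a toy distillation, sketched below in numpy. This is an illustrative simplification, not the paper's setup: a linear model, a single gradient step, and Gaussian inputs are all assumptions. With a shared initialization, the student's first distillation step points in (almost) the same direction as the teacher's fine-tuning update, even though the distillation inputs are unrelated to the teacher's task.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w0 = rng.normal(size=d)          # shared initialization
delta = rng.normal(size=d)       # teacher's fine-tuning update (arbitrary here)
w_teacher = w0 + delta

# Student distills from teacher outputs on unrelated random inputs.
X = rng.normal(size=(10_000, d))
# Squared-error distillation loss on f(x) = w·x; its gradient at w0 is
#   (2/N) · Σ (w0·x − w_teacher·x) x  =  −(2/N) · Σ (delta·x) x  ≈  −2·delta
grad = (2.0 / len(X)) * X.T @ (X @ w0 - X @ w_teacher)
student_update = -grad           # direction of one gradient-descent step

cos = student_update @ delta / (
    np.linalg.norm(student_update) * np.linalg.norm(delta))
print(round(cos, 3))             # close to 1: the trait transfers
```

The inner-product structure is what makes shared initialization matter: a freshly initialized student would not have its gradients aligned with the teacher's update this way.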
Alex Cloud @cloud_kx
@evzen_wy @Turn_Trout @OwainEvans_UK There is, but the tension is resolved by noting the reliance of subliminal learning on shared initialization (also, in practice, subliminal learning may be very limited in the amount of info it can transmit). See: x.com/cloud_kx/statu…
Alex Cloud retweeted
Miles Brundage @Miles_Brundage
The last thing you see before you realize your alignment strategy doesn’t work
Alex Cloud @cloud_kx
@dhadfieldmenell @OwainEvans_UK @Turn_Trout My understanding is that it works, but subliminal learning says to use a fresh init of your student model to be safe. I see these results as totally consistent with each other, but more exploration and verification always seems good :)
Dylan Hadfield-Menell @dhadfieldmenell
@cloud_kx @OwainEvans_UK @Turn_Trout That makes sense. So, the unlearn + distill recipe still probably works, but some details to explore/verify? I expect that you’d typically want a different initialization for that method anyways.
Alex Cloud @cloud_kx
@dhadfieldmenell @OwainEvans_UK @Turn_Trout @BruceWLee2 It would be interesting to see if subliminal learning can transmit these deeper capabilities. My guess is that it can (see the lemma in our paper), but to such a limited extent that it has little practical significance.
Alex Cloud @cloud_kx
@dhadfieldmenell @OwainEvans_UK @Turn_Trout This example feels non-central to me, though. Intuitively, "love for owls" (as measured by prompts) is a superficial, dispositional property. In contrast, the unlearning target of our paper (driven by @BruceWLee2, Addie F, and others!) is "deeper" model capabilities.
Alex Cloud retweeted
Alex Turner @Turn_Trout
Thought real machine unlearning was impossible? We show that distilling a conventionally “unlearned” model creates a model resistant to relearning attacks. 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 𝐦𝐚𝐤𝐞𝐬 𝐮𝐧𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐫𝐞𝐚𝐥.
Alex Cloud @cloud_kx
@RokoMijic @Turn_Trout @jacoblevgw @__evzen @JosephMiller_ This matches my intuition, largely. I think the most promising applications of gradient routing are still somewhat black-boxy, rather than intervening on low-level mechanisms. We should remember the bitter lesson. I don't think that means it has to be automated, though.
Alex Turner @Turn_Trout
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
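The masking idea described in the tweet above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation; the two-block mask layout, the data tags, and the linear model are all hypothetical choices. Each datapoint's gradient is multiplied by a binary mask, so only designated parameters ever absorb updates from a given kind of data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = np.zeros(d)

# Hypothetical routing masks: "a"-tagged data may only update the first
# half of w; "b"-tagged data only the second half.
masks = {"a": np.r_[np.ones(d // 2), np.zeros(d // 2)],
         "b": np.r_[np.zeros(d // 2), np.ones(d // 2)]}

def routed_sgd_step(w, x, y, tag, lr=0.1):
    grad = 2 * (w @ x - y) * x          # squared-error gradient for f(x) = w·x
    return w - lr * masks[tag] * grad   # gradient routing: mask the update

# Train only on "a"-tagged data.
for _ in range(100):
    x = rng.normal(size=d)
    w = routed_sgd_step(w, x, x.sum(), "a")

# Everything this data taught lives in the first block; the second block
# was never touched, so zeroing the first block would cleanly remove it.
print(w[d // 2:])   # still all zeros
```

The connection to capability removal is that ablating a parameter block is a crude but clean intervention, provided training guaranteed the capability was localized there in the first place.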
Alex Turner @Turn_Trout
@Oliver_ADK Thank you, I'm honored! To be clear, though, unlike with steering vectors, I didn't come up with gradient routing. Alex Cloud (@cloud_kx) had the idea, and the team worked hard to realize its promise. I did mentor them (in an involved way), so I _do_ deserve _some_ credit :)