Alex Cloud
@cloud_kx
31 posts
Joined September 2019
58 Following · 153 Followers
Alex Cloud retweeted
Igor Shilov @_igorshilov
New Anthropic research! We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.
Alex Cloud retweeted
Alex Turner @Turn_Trout
Maybe *you* should apply to work with me and @cloud_kx on Team Shard in MATS. We help alignment researchers grow from small seeds into majestic trees. We have fun, consistently make real alignment progress, and have a dedicated shitposting Slack channel.
Alex Cloud retweeted
Ethan Perez @EthanJPerez
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
Stella Biderman @BlancheMinerva
It boggles my mind how people have made prominent careers out of writing a paper with a scary headline about AI risks that would have produced a non-scary headline with a minor tweak to the set-up and/or probably isn't a meaningfully real phenomenon in the first place.
Gerald Ashley @Gerald_Ashley
@koenfucius @cloud_kx Unsurprising. Just gigantic optimisation engines trawling through past data. Key fact: no original thoughts!
Koenfucius 🔍 @koenfucius
Wow. Researchers find (similar) LLMs can pass on behavioural traits—eg a “favourite animal” or misalignment—to each other *subliminally*, via signals that are hidden in the data, *not semantically related to the latent traits*. write @cloud_kx et al: buff.ly/hjGj8OJ
Alex Cloud retweeted
Owain Evans @OwainEvans_UK
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Alex Cloud @cloud_kx
@tyler_m_john @OwainEvans_UK If there are scaling laws, they will fall out of inner products of teacher and student gradients, as they appear in the proof of our theorem. Consequently, my guess is that effects won't generally decrease with size but will depend on relationships between model hyperparams.
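The role of those teacher-student gradient inner products can be seen in a toy distillation, sketched below in numpy. This is an illustrative simplification, not the paper's setup: a linear model, a single gradient step, and Gaussian inputs are all assumptions. With a shared initialization, the student's first distillation step points in (almost) the same direction as the teacher's fine-tuning update, even though the distillation inputs are unrelated to the teacher's task.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w0 = rng.normal(size=d)          # shared initialization
delta = rng.normal(size=d)       # teacher's fine-tuning update (arbitrary here)
w_teacher = w0 + delta

# Student distills from teacher outputs on unrelated random inputs.
X = rng.normal(size=(10_000, d))
# Squared-error distillation loss on f(x) = w·x; its gradient at w0 is
#   (2/N) · Σ (w0·x − w_teacher·x) x  =  −(2/N) · Σ (delta·x) x  ≈  −2·delta
grad = (2.0 / len(X)) * X.T @ (X @ w0 - X @ w_teacher)
student_update = -grad           # direction of one gradient-descent step

cos = student_update @ delta / (
    np.linalg.norm(student_update) * np.linalg.norm(delta))
print(round(cos, 3))             # close to 1: the trait transfers
```

The inner-product structure is what makes shared initialization matter: a freshly initialized student would not have its gradients aligned with the teacher's update this way.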
Alex Cloud @cloud_kx
@evzen_wy @Turn_Trout @OwainEvans_UK There is, but the tension is resolved by noting the reliance of subliminal learning on shared initialization (also, in practice, subliminal learning may be very limited in the amount of info it can transmit). See: x.com/cloud_kx/statu…
Alex Cloud retweeted
Miles Brundage @Miles_Brundage
The last thing you see before you realize your alignment strategy doesn’t work
Alex Cloud @cloud_kx
@dhadfieldmenell @OwainEvans_UK @Turn_Trout My understanding is that it works, but subliminal learning says to use a fresh init of your student model to be safe. I see these results as totally consistent with each other, but more exploration and verification always seems good :)
Dylan Hadfield-Menell @dhadfieldmenell
@cloud_kx @OwainEvans_UK @Turn_Trout That makes sense. So, the unlearn + distill recipe still probably works, but some details to explore/verify? I expect that you’d typically want a different initialization for that method anyways.
Alex Cloud @cloud_kx
@dhadfieldmenell @OwainEvans_UK @Turn_Trout @BruceWLee2 It would be interesting to see if subliminal learning can transmit these deeper capabilities. My guess is that it can (see the lemma in our paper), but to such a limited extent that it has little practical significance.
Alex Cloud @cloud_kx
@dhadfieldmenell @OwainEvans_UK @Turn_Trout This example feels non-central to me, though. Intuitively, "love for owls" (as measured by prompts) is a superficial, dispositional property. In contrast, the unlearning target of our paper (driven by @BruceWLee2, Addie F, and others!) is "deeper" model capabilities.
Alex Cloud retweeted
Alex Turner @Turn_Trout
Thought real machine unlearning was impossible? We show that distilling a conventionally “unlearned” model creates a model resistant to relearning attacks. 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 𝐦𝐚𝐤𝐞𝐬 𝐮𝐧𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐫𝐞𝐚𝐥.
Alex Cloud @cloud_kx
@RokoMijic @Turn_Trout @jacoblevgw @__evzen @JosephMiller_ This matches my intuition, largely. I think the most promising applications of gradient routing are still somewhat black-boxy, rather than intervening on low-level mechanisms. We should remember the bitter lesson. I don't think that means it has to be automated, though.
Alex Turner @Turn_Trout
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
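The masking idea described in the tweet above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation; the two-block mask layout, the data tags, and the linear model are all hypothetical choices. Each datapoint's gradient is multiplied by a binary mask, so only designated parameters ever absorb updates from a given kind of data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = np.zeros(d)

# Hypothetical routing masks: "a"-tagged data may only update the first
# half of w; "b"-tagged data only the second half.
masks = {"a": np.r_[np.ones(d // 2), np.zeros(d // 2)],
         "b": np.r_[np.zeros(d // 2), np.ones(d // 2)]}

def routed_sgd_step(w, x, y, tag, lr=0.1):
    grad = 2 * (w @ x - y) * x          # squared-error gradient for f(x) = w·x
    return w - lr * masks[tag] * grad   # gradient routing: mask the update

# Train only on "a"-tagged data.
for _ in range(100):
    x = rng.normal(size=d)
    w = routed_sgd_step(w, x, x.sum(), "a")

# Everything this data taught lives in the first block; the second block
# was never touched, so zeroing the first block would cleanly remove it.
print(w[d // 2:])   # still all zeros
```

The connection to capability removal is that ablating a parameter block is a crude but clean intervention, provided training guaranteed the capability was localized there in the first place.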
Alex Turner @Turn_Trout
@Oliver_ADK Thank you, I'm honored! To be clear, though, unlike with steering vectors, I didn't come up with gradient routing. Alex Cloud (@cloud_kx) had the idea, and the team worked hard to realize its promise. I did mentor them (in an involved way), so I _do_ deserve _some_ credit :)