Satchel Grant
@satchelgrant
49 posts
Joined May 2023
191 Following · 251 Followers
Satchel Grant reposted

Goodfire @GoodfireAI
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
Satchel Grant reposted

Lee Sharkey @leedsharkey
My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)
Satchel Grant reposted

Goodfire @GoodfireAI
New research from @AISecurityInst and Goodfire: Models sometimes recognize they're being evaluated, occasionally even identifying the benchmark. We show this verbalized eval awareness inflates safety scores, meaning safety benchmarks may not reflect real-world behavior. (1/7)
Satchel Grant @satchelgrant
8/9 Conclusion: PPS and IP are not interchangeable. PPS offers a clean story. IP is more distributed and opaque with distinct behavioral effects. Looking forward to seeing more work on IP!
Satchel Grant @satchelgrant
1/9 New preprint: "Shifting the Gradient." Two popular AI safety training methods aren't doing what we thought, and they are not interchangeable! 🚀🚀🚀
Satchel Grant reposted

Zhuofan Josh Ying @zfjoshying
🔍Truthfulness probes and their causal effects vary widely: some generalize, others are domain-dependent. Why? We propose Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it! 🧵⬇️
Satchel Grant reposted

Tom McGrath @banburismus_
We’re putting more computation (in the form of intelligence) into the most general object in neural network training: backprop. This essay describes how I think we can do this, why interp is key, the relevance to alignment, and how we should do it right.
Satchel Grant @satchelgrant
@piotrm1 @ChrisGPotts @ARTartaglini Thanks, I finally tried this, and yes: patching attn weights can push at least some representations away from the natural manifold. This makes sense if a rep sits at the edge of the manifold and never naturally attends to some kv pair that would push it off.
Piotr Mardziel @piotrm1
@satchelgrant @ChrisGPotts @ARTartaglini Did you look into less state-destructive interventions, like ones on attention? In theory, an intervention there doesn't change the hidden states themselves, only how hidden states are mixed from one layer to the next.
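A minimal sketch of the kind of attention-level intervention discussed in this exchange, assuming a toy single-head setup (all names and sizes here are hypothetical, not from either paper): patching the attention pattern changes only how value vectors are mixed, while the hidden states feeding the layer are left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8  # toy sequence length and model dim

H = rng.normal(size=(T, d))  # hidden states entering the layer
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(H, attn_override=None):
    """Single-head attention; optionally patch the attention pattern."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row softmax
    if attn_override is not None:
        A = attn_override  # intervene on the mixing weights only
    return A @ V           # mixed output; H itself is never modified

out_clean = attention(H)
uniform = np.full((T, T), 1.0 / T)  # a patched, uniform attention pattern
out_patched = attention(H, attn_override=uniform)

# The intervention changes the mixing, so the layer outputs differ ...
assert not np.allclose(out_clean, out_patched)
# ... but the hidden states that were mixed are identical in both runs.
```

This illustrates why such interventions are "less state-destructive": the patched quantity lives in the mixing weights rather than in the representation space, though (per the reply above) the resulting outputs can still drift off the natural manifold.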
Satchel Grant @satchelgrant
1/8 New preprint! We show that many interp methods (patching, SAEs, DAS) can push models off their natural manifold. This can be harmless, or it can activate hidden circuits. We provide a mitigation that makes interventions less divergent. If you care about reliable interp, read on!
Satchel Grant reposted

Daniel Wurgaft @danielwurgaft
🚨 Come check out our poster tomorrow if you are interested in understanding in-context learning and LM training dynamics! 11 am-2 pm poster session, poster #1015!
Ekdeep Singh Lubana @EkdeepL

Favorite paper in a while: we propose a Bayesian account of in-context learning that almost perfectly captures the learning dynamics of a Transformer, explaining effects like transience! @danielwurgaft and I pulled one too many all-nighters on this :') x.com/EkdeepL/status…