Satchel Grant

49 posts

@satchelgrant

Joined May 2023

191 Following · 251 Followers

Satchel Grant retweeted

Thomas Fel@thomas_fel_·7 May

And the first post in the series! We formalize representation steering through a geometric lens. Blog: goodfire.ai/research/manif… Arxiv: arxiv.org/abs/2605.05115

Satchel Grant retweeted

Goodfire@GoodfireAI·7 May

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

Satchel Grant retweeted

Lee Sharkey@leedsharkey·5 May

My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)

Satchel Grant retweeted

Goodfire@GoodfireAI·4 May

New research from @AISecurityInst and Goodfire: Models sometimes recognize they're being evaluated, occasionally even identifying the benchmark. We show this verbalized eval awareness inflates safety scores, meaning safety benchmarks may not reflect real-world behavior. (1/7)

Satchel Grant retweeted

Aryaman Arora@aryaman2020·30 Apr

This paper is now a spotlight at ICML! arxiv.org/abs/2601.22594

Transluce@TransluceAI

Is your LM secretly an SAE? Most circuit-finding interpretability methods use learned features rather than raw activations, based on the belief that neurons do not cleanly decompose computation. In our new work, we show MLP neurons actually do support sparse, faithful circuits!

Satchel Grant@satchelgrant·29 Apr

9/9 Thanks to a wonderful team of collaborators: @v_gillioz, @_jake_ward, and @banburismus_. And thanks to MATS for resources 🙏🙏 Paper link: arxiv.org/abs/2604.16423

Satchel Grant@satchelgrant·29 Apr

8/9 Conclusion: PPS and IP are not interchangeable. PPS offers a clean story. IP is more distributed and opaque with distinct behavioral effects. Looking forward to seeing more work on IP!

Satchel Grant@satchelgrant·29 Apr

1/9 New preprint: "Shifting the Gradient." Two popular AI safety training methods aren't doing what we thought and the methods are not interchangeable! 🚀🚀🚀

Satchel Grant@satchelgrant·31 Mar

Causally intervening on your feed to share: this work got accepted to ICLR for oral presentation! 🎉 Thanks to my amazing coauthors @ChrisGPotts @ARTartaglini @sjeromehan and everyone who engaged with the preprint 🙏

Satchel Grant@satchelgrant

1/8 New preprint! We show many interp methods (patching, SAEs, DAS) can push models off their natural manifold. This can be harmless or can activate hidden circuits. We provide a mitigating solution making interventions less divergent. If you care about reliable interp, read on!

Satchel Grant retweeted

melandrocyte@melandrocyte·10 Mar

Trying to interpret how a neural network does what it does? Activations tell you if a neuron responded. Contributions tell you if a neuron mattered! New paper from myself, @Zaki_Alaoui1, @sunnyliu1220, @SuryaGanguli, and Steve Baccus: arxiv.org/abs/2603.06557

Satchel Grant retweeted

Zhuofan Josh Ying@zfjoshying·25 Feb

🔍Truthfulness probes and their causal effects vary widely: some generalize, others are domain-dependent. Why? We propose the Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it! 🧵⬇️

Satchel Grant retweeted

Tom McGrath@banburismus_·5 Feb

We’re putting more computation (in the form of intelligence) into the most general object in neural network training: backprop. This essay describes how I think we can do this, why interp is key, the relevance to alignment, and how we should do it right.

Satchel Grant@satchelgrant·12 Dec

@piotrm1 @ChrisGPotts @ARTartaglini Thanks, finally tried this and yes, the representations produced from patching attn weights can push at least some reps away from the natural manifold. This makes sense if a rep is at the edge of the manifold and never naturally attends to some kv pair that would push it off

Piotr Mardziel@piotrm1·6 Dec

@satchelgrant @ChrisGPotts @ARTartaglini Did you look into less state-destructive interventions like on attentions? Intervention there in theory does not change hidden state but instead how hidden states from one layer to the next are mixed.

Satchel Grant@satchelgrant·2 Dec

Satchel Grant retweeted

Daniel Wurgaft@danielwurgaft·5 Dec

🚨 Come check out our poster tomorrow if you are interested in understanding in-context learning and LM training dynamics! 11 am-2 pm poster session, poster #1015!

Ekdeep Singh Lubana@EkdeepL

Favorite paper in a while: we propose a Bayesian account of in-context learning that almost perfectly captures the learning dynamics of a Transformer, explaining effects like transience! @danielwurgaft and I pulled one too many all-nighters on this :') x.com/EkdeepL/status…

Explore

@GoodfireAI @AISecurityInst @v_gillioz @_jake_ward @banburismus_ @ChrisGPotts @ARTartaglini @sjeromehan