Alice Rigg

56 posts

@woog09

i only care about one thing: solving mech interp

Joined June 2023
40 Following · 330 Followers
Pinned Tweet
Alice Rigg retweeted
David Klindt @klindt_david
So excited to finally share this! Linear probes often outperform SAEs, especially out-of-distribution (OOD). @thesubhashk @JoshAEngels et al. showed this convincingly (arxiv.org/abs/2502.16681). This prompted @NeelNanda5 and others to de-emphasize SAE research. Empirically, fair enough. But we think the theoretical case for dictionary learning was dismissed too quickly.

@oneill_c previously showed SAEs can't do proper sparse coding (arxiv.org/abs/2411.13117). @shruti_joshi @vpacela and @isacama_phys took this further and showed how this leads to problems particularly in OOD settings. So the issue may not be with dictionary learning itself, but with the current tools.

Here's the core argument: if neural representations are in superposition, i.e. more features than dimensions encoded linearly (arxiv.org/abs/2503.01824), then linear probes fundamentally cannot be the answer. This is a compressed sensing problem. There's a linear measurement (the representation) and a nonlinear inference procedure (like an SAE encoder) that recovers the higher-dimensional sparse signal. Linear algebra tells us error-free recovery is impossible if decoding is restricted to be linear. (But see this cool work if errors are acceptable: arxiv.org/abs/2602.11246)

Check out our video: we have some neat demonstrations here. A linear decision boundary in 3D becomes nonlinear in 2D, even though all sparse combinations of latents remain distinguishable. Compressed sensing works: we can, in principle, recover the high-dimensional latent space where linear probes work and generalize OOD.

Where does this leave us? With finite data and millions of concepts, simpler methods may perform better for a while. But if we want interpretability and safety methods that work OOD, especially compositional generalization covering all possible jailbreaks and real-world failures, we'll have to build bottom up from the right theory.

@kennylpeng @thebasepoint @tegmark @yash_j_sharma @woog09 @livgorton @EkdeepL @thomas_fel_ @nsaphra
Shruti Joshi @_shruti_joshi_

SAEs fail at OOD tasks. Why? Features in superposition are linearly representable but not linearly accessible. Instead of discarding sparse coding, we embrace the geometry of superposition and use methods equipped to handle the nonlinearity it induces.

4 replies · 39 reposts · 264 likes · 27.7K views
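To make the compressed-sensing framing above concrete, here is a small self-contained sketch (my own illustration, not code from the thread or the papers; the dictionary, dimensions, and sparsity level are arbitrary assumptions): 64 sparse "features" are linearly compressed into 16 dimensions, and a nonlinear sparse-recovery decoder (orthogonal matching pursuit) recovers them where a purely linear decoder (the pseudo-inverse) cannot.

```python
# Toy demo of the compressed-sensing argument: more sparse features than
# dimensions, linearly measured; nonlinear recovery beats linear recovery.
# All sizes here are illustrative assumptions, not values from the paper.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_features, n_dims, k = 64, 16, 3   # overcomplete: 64 "features" in 16 dims

# Random dictionary with unit-norm columns (the "superposition" directions).
D = rng.normal(size=(n_dims, n_features))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# A k-sparse latent vector and its linear "representation".
z_true = np.zeros(n_features)
idx = rng.choice(n_features, size=k, replace=False)
z_true[idx] = rng.normal(size=k)
x = D @ z_true

# Linear decoder: least-squares pseudo-inverse of the dictionary.
z_linear = np.linalg.pinv(D) @ x

# Nonlinear decoder: sparse recovery via orthogonal matching pursuit.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(D, x)
z_omp = omp.coef_

print("linear decoder error:", np.linalg.norm(z_linear - z_true))
print("OMP decoder error:   ", np.linalg.norm(z_omp - z_true))
```

The point of the toy: the measurement is linear, but recovering the sparse latents (and hence probing them reliably) requires a nonlinear decoding step.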
Alice Rigg retweeted
NDIF @ndif_team
📣 Launching monthly interp puzzles 🧩 Each month: a model trained on a toy task. Your job: reverse-engineer the algorithm it learned. First puzzle: how does a 1-2L attn-only transformer find the max of a list? Starter Colab included. Deadline: April 30 puzzles.baulab.info
3 replies · 34 reposts · 234 likes · 38.3K views
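Not the official starter Colab linked above; just a minimal sketch of what the toy max-of-a-list data could look like, with vocabulary size and list length as arbitrary assumptions.

```python
# Minimal sketch of the "max of a list" toy task (not the official starter
# Colab). Vocabulary size and sequence length are arbitrary choices here.
import torch

VOCAB, SEQ_LEN, N = 64, 10, 10_000

def make_batch(n, seq_len=SEQ_LEN, vocab=VOCAB, seed=0):
    g = torch.Generator().manual_seed(seed)
    tokens = torch.randint(0, vocab, (n, seq_len), generator=g)
    labels = tokens.max(dim=-1).values        # target: the largest token
    return tokens, labels

tokens, labels = make_batch(N)
print(tokens[0].tolist(), "->", labels[0].item())

# A natural first hypothesis to test on a 1-2 layer attention-only model:
# does some head, reading from the final position, put most of its attention
# on the position holding the maximum token? Inspecting attention patterns on
# a few examples like these is a reasonable first step in the puzzle.
```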
Alice Rigg @woog09
NNsight is actually really good now; I'd love to see more people playing around with it for their mech interp workflows. The main highlight beyond what's mentioned in the thread is that the team is a lot more responsive on the NDIF server now. Join here: discord.gg/t5Yns5yWdz
Jaden Fiotto-Kaufman @jadenfk23

NNsight 0.6 is out now! We directly address your feedback in our biggest release yet. Pain points included cryptic errors, slow traces, no remote execution of custom code, and limited vLLM support. We tackle all of these and more in this new release. 🧵 Here's what changed:

0 replies · 0 reposts · 2 likes · 165 views
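For anyone who hasn't tried NNsight, here is a minimal tracing sketch in the spirit of the thread. The model name and layer index are arbitrary choices, and accessor details vary across NNsight versions, so treat this as a starting point rather than the 0.6 API verbatim.

```python
# Minimal NNsight sketch (model and layer are arbitrary; exact accessors can
# differ between NNsight versions, so check the current docs).
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The quick brown fox jumps over the lazy dog"):
    # Save the residual-stream output of block 8 and the final logits.
    resid = model.transformer.h[8].output[0].save()
    logits = model.lm_head.output.save()

# Saved proxies hold concrete tensors once the trace exits
# (on older releases, access them via `.value`).
print(resid)
print(logits)
```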
Alice Rigg retweeted
Chris Wendler @wendlerch
Data is plenty, knowledge is scarce. We began to close this gap thanks to deep learning <3 Neural networks can learn “programs” that often achieve superhuman performance from data alone. What insights are encoded in their weights? Here we took a first step on AI protein folding.
Kevin Lu @kevinlu4588

How do protein folding models turn sequence into structure? In "Mechanisms of AI Protein Folding in ESMFold", we find properties like charge and distance encoded in interpretable, steerable directions. The trunk processes features in two phases: chemistry first, then geometry.

2 replies · 10 reposts · 29 likes · 1.9K views
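The "interpretable, steerable directions" claim suggests the usual activation-steering recipe; below is a generic PyTorch sketch of that recipe (not the paper's code; the module, direction, and scale are placeholder assumptions): add a fixed vector to a layer's output during the forward pass via a hook.

```python
# Generic activation-steering sketch (not from the ESMFold paper): push a
# chosen layer's activations along a fixed direction during the forward pass.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32
layer = nn.Linear(d_model, d_model)             # stand-in for a trunk block
direction = torch.randn(d_model)
direction = direction / direction.norm()        # unit "feature" direction
scale = 5.0                                     # steering strength (assumed)

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + scale * direction

handle = layer.register_forward_hook(steer)

x = torch.randn(4, d_model)
steered = layer(x)
handle.remove()
unsteered = layer(x)

print("mean shift along direction:",
      ((steered - unsteered) @ direction).mean().item())
```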
Alice Rigg retweeted
Kevin Lu @kevinlu4588
How do protein folding models turn sequence into structure? In "Mechanisms of AI Protein Folding in ESMFold", we find properties like charge and distance encoded in interpretable, steerable directions. The trunk processes features in two phases: chemistry first, then geometry.
4 replies · 44 reposts · 207 likes · 20.3K views
Alice Rigg retweeted
Jatin Nainani @jatin_n0
Put together a tutorial on NNsight for protein language model interpretability 🧬 It covers infra for acts, SAEs, patching and 3d viz! First of a series on pLMs! Link: github.com/NainaniJatinZ/…
1 reply · 1 repost · 7 likes · 223 views
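Activation patching, one of the techniques the tutorial covers, in a model-agnostic PyTorch sketch (toy two-layer model, not the tutorial's code): cache an activation from a "clean" run, then overwrite the same activation during a "corrupted" run and see how much of the clean output is restored.

```python
# Generic activation-patching sketch (toy model; not the tutorial's code).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean = torch.randn(1, 8)
corrupt = torch.randn(1, 8)

# 1) Cache the clean activation at the layer of interest (after the ReLU).
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

h = model[1].register_forward_hook(save_hook)
clean_out = model(clean)
h.remove()

# 2) Re-run the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]

h = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

corrupt_out = model(corrupt)
print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)  # how much of the clean output is restored?
```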
Alice Rigg retweeted
Eric Todd @ericwtodd
Can you solve this algebra puzzle? 🧩 cb=c, ac=b, ab=? A small transformer can learn to solve problems like this! And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️
9 replies · 50 reposts · 321 likes · 55.6K views
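One reading of the puzzle (an assumption on my part, not stated in the tweet) is that a, b, c label the elements of a cyclic group of order 3 and juxtaposition is the group operation. Under that reading, a tiny brute force pins down the answer.

```python
# Brute-force one reading of the puzzle (assumption: a, b, c label the three
# elements of Z_3 and "xy" means addition mod 3). Not code from the paper.
from itertools import permutations

letters = "abc"
answers = set()
for perm in permutations(range(3)):              # try every labeling of Z_3
    val = dict(zip(letters, perm))
    inv = {v: k for k, v in val.items()}
    op = lambda x, y: inv[(val[x] + val[y]) % 3]
    if op("c", "b") == "c" and op("a", "c") == "b":   # cb = c, ac = b
        answers.add(op("a", "b"))                     # then ab = ?
print(answers)   # {'a'}: cb = c forces b to be the identity, so ab = a
```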
Alice Rigg @woog09
Turns out the submission ID is misleading. ICLR papers are now viewable; there are only 788 pages, which looks like 19,678 total. Still 70% more than last year.
0 replies · 0 reposts · 1 like · 482 views
Alice Rigg @woog09
iclr 2026 has over 25k submissions so this year's mech interp paper review may take me a bit longer, just a heads up
2 replies · 0 reposts · 18 likes · 1.8K views
Alice Rigg retweeted
Jatin Nainani @jatin_n0
Can protein LMs reveal scientific knowledge? We start by asking how pLMs turn sequences into structure signals. We map a contact prediction circuit: early motif features gate later domain features. Spurious or science? We can now test. 🧵(1 of N)
1 reply · 18 reposts · 92 likes · 10.7K views
Alice Rigg retweeted
Martian @withmartian
🚨New Paper Alert! 🚨 Our ICML 2025 paper (led by @Narmeen29013644) shows how small models can help steer and align much larger ones by building “bridges” between them.🧵👇
5 replies · 10 reposts · 42 likes · 6.7K views
Alice Rigg @woog09
New paper from @norabelrose and me. We show how mech interp can be done on generic ReLU networks--a feat previously understood to be intractable. Rather than enumerate over polytopes, we OLS regress on max-entropy inputs, deriving guarantees on model perf. arxiv.org/abs/2502.01032
Nora Belrose @norabelrose

MLPs and GLUs are hard to interpret, but they make up most transformer parameters. Linear and quadratic functions are easier to interpret. We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵

0 replies · 9 reposts · 43 likes · 5.2K views
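The quoted thread describes a closed-form conversion; as a rougher illustration of the regression flavour mentioned above (my own sketch, not the paper's method), one can fit a quadratic polynomial to a small ReLU MLP by ordinary least squares on Gaussian inputs (the max-entropy distribution for a fixed mean and covariance) and check how much of the output it explains.

```python
# Generic sketch (not the paper's closed-form conversion): approximate a small
# ReLU MLP with a quadratic polynomial via OLS on Gaussian inputs, then check
# the fraction of output variance the polynomial explains.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden = 6, 32
mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

n = 20_000
x = torch.randn(n, d_in)
with torch.no_grad():
    y = mlp(x).squeeze(-1)

# Design matrix: [1, x_i, x_i * x_j for i <= j] (constant, linear, quadratic).
quad = torch.stack([x[:, i] * x[:, j]
                    for i in range(d_in) for j in range(i, d_in)], dim=1)
phi = torch.cat([torch.ones(n, 1), x, quad], dim=1)

coef = torch.linalg.lstsq(phi, y.unsqueeze(-1)).solution
resid = y - (phi @ coef).squeeze(-1)
r2 = 1 - resid.var() / y.var()
print(f"quadratic fit R^2 on Gaussian inputs: {r2.item():.3f}")
```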
Alice Rigg @woog09
I'm hoping to have a pragmatic language to reason about core mech interp premises without needing to adopt the 'features as directions' perspective. Ideally we'd have a handful of Zipf laws for AW duality that we could use to sanity-check how 'special' a particular direction is.
0 replies · 0 reposts · 2 likes · 435 views
Alice Rigg @woog09
In the limit of the science of deep learning, there will be a connection from the conditional data distribution, to the inductive biases implied by various architectural choices, to the geometry of the latents induced by a sparsity prior.
1 reply · 0 reposts · 1 like · 484 views
Alice Rigg @woog09
Highly under-appreciated paper: arxiv.org/abs/2203.10736 Activation-weight duality is the area I'm most excited about for pushing the frontier of mech interp. Approximate duality can help predict causal effects, inform feature geometry, and contextualize discussion of dark matter.
1 reply · 1 repost · 20 likes · 969 views
Alice Rigg @woog09
@i000 Yes, the appropriate substitution is taking any convolutional layer followed by a ReLU and replacing both with a gated conv. A protocol is described in this preprint: arxiv.org/pdf/2412.00944
0 replies · 0 reposts · 1 like · 67 views
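A minimal sketch of what that substitution could look like (my own illustrative module, not the preprint's protocol): replace Conv2d + ReLU with the elementwise product of two parallel convolutions, making the block bilinear in its input.

```python
# Illustrative gated-conv substitution (not the preprint's code): a Conv2d+ReLU
# block is replaced by the elementwise product of two parallel convolutions.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_a = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        self.conv_b = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)

    def forward(self, x):
        # Bilinear gate: no elementwise nonlinearity like ReLU is needed.
        return self.conv_a(x) * self.conv_b(x)

# Before: nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
block = GatedConv2d(3, 16, 3, padding=1)
print(block(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 16, 32, 32])
```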
Marcin Cieslik @i000
@thomasdooms Would these bi-MLPs still provide interpretability for networks with convolutional layers?
1 reply · 0 reposts · 0 likes · 491 views
Alice Rigg retweeted
tdooms @thomasdooms
Can we understand neural networks from their weights? Often, the answer is no. An MLP's activation function obscures the relationship between inputs, outputs, and weights. In our new ICLR'25 paper, we study "bilinear MLPs", a special MLP that's performant AND interpretable! 🧵
3 replies · 43 reposts · 395 likes · 45.8K views
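For readers new to the layer the thread is about, here is a minimal sketch of a bilinear MLP as commonly defined, plus the kind of weight-based inspection it enables (sizes are arbitrary and this is an illustration, not the paper's code): two linear maps multiplied elementwise, so each output is a quadratic form in the input whose symmetric interaction matrix can be eigendecomposed directly from the weights.

```python
# Minimal bilinear-MLP sketch (illustrative sizes, not the paper's code):
# out = down( (W x) * (V x) ), with no elementwise activation function.
import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(self.w(x) * self.v(x))

torch.manual_seed(0)
d_model, d_hidden = 16, 64
mlp = BilinearMLP(d_model, d_hidden)

# With no activation function, output k equals the quadratic form x^T B_k x
# with B_k = sum_h down[k,h] * outer(W[h,:], V[h,:]); its symmetric part can
# be eigendecomposed to read interactions straight off the weights.
k = 0
B_k = torch.einsum("h,hi,hj->ij", mlp.down.weight[k], mlp.w.weight, mlp.v.weight)
B_sym = 0.5 * (B_k + B_k.T)
eigvals, eigvecs = torch.linalg.eigh(B_sym)

# Sanity check: the closed form matches the module's forward pass.
x = torch.randn(d_model)
print(torch.allclose(x @ B_k @ x, mlp(x)[k], atol=1e-4))
print("top eigenvalue of the symmetric interaction matrix:", eigvals[-1].item())
```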