Alice Rigg

56 posts

@woog09

i only care about one thing: solving mech interp

Joined June 2023
40 Following · 330 Followers
Pinned Tweet
Alice Rigg retweeted
David Klindt @klindt_david
So excited to finally share this! Linear probes often outperform SAEs, especially out-of-distribution (OOD). @thesubhashk @JoshAEngels et al. showed this convincingly (arxiv.org/abs/2502.16681). This prompted @NeelNanda5 and others to de-emphasize SAE research. Empirically, fair enough. But we think the theoretical case for dictionary learning was dismissed too quickly.

@oneill_c previously showed SAEs can't do proper sparse coding (arxiv.org/abs/2411.13117). @shruti_joshi @vpacela and @isacama_phys took this further and showed how this leads to problems particularly in OOD settings. So the issue may not be with dictionary learning itself, but with the current tools.

Here's the core argument: if neural representations are in superposition, i.e. more features than dimensions encoded linearly (arxiv.org/abs/2503.01824), then linear probes fundamentally cannot be the answer. This is a compressed sensing problem. There's a linear measurement (the representation) and a nonlinear inference procedure (like an SAE encoder) that recovers the higher-dimensional sparse signal. Linear algebra tells us error-free recovery is impossible if decoding is restricted to be linear. (But see this cool work if errors are acceptable: arxiv.org/abs/2602.11246)

Check out our video: we have some neat demonstrations here. A linear decision boundary in 3D becomes nonlinear in 2D, even though all sparse combinations of latents remain distinguishable. Compressed sensing works: we can, in principle, recover the high-dimensional latent space where linear probes work and generalize OOD.

Where does this leave us? With finite data and millions of concepts, simpler methods may perform better for a while. But if we want interpretability and safety methods that work OOD, especially compositional generalization covering all possible jailbreaks and real-world failures, we'll have to build bottom up from the right theory.

@kennylpeng @thebasepoint @tegmark @yash_j_sharma @woog09 @livgorton @EkdeepL @thomas_fel_ @nsaphra
Shruti Joshi @_shruti_joshi_

SAEs fail at OOD tasks. Why? Features in superposition are linearly representable but not linearly accessible. Instead of discarding sparse coding, we embrace the geometry of superposition and use methods equipped to handle the nonlinearity it induces.

4 replies · 39 reposts · 264 likes · 27.7K views
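To make the compressed-sensing framing above concrete, here is a small self-contained sketch (my own illustration, not code from the thread or the papers; the dictionary, dimensions, and sparsity level are arbitrary assumptions): 64 sparse "features" are linearly compressed into 16 dimensions, and a nonlinear sparse-recovery decoder (orthogonal matching pursuit) recovers them where a purely linear decoder (the pseudo-inverse) cannot.

```python
# Toy demo of the compressed-sensing argument: more sparse features than
# dimensions, linearly measured; nonlinear recovery beats linear recovery.
# All sizes here are illustrative assumptions, not values from the paper.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_features, n_dims, k = 64, 16, 3   # overcomplete: 64 "features" in 16 dims

# Random dictionary with unit-norm columns (the "superposition" directions).
D = rng.normal(size=(n_dims, n_features))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# A k-sparse latent vector and its linear "representation".
z_true = np.zeros(n_features)
idx = rng.choice(n_features, size=k, replace=False)
z_true[idx] = rng.normal(size=k)
x = D @ z_true

# Linear decoder: least-squares pseudo-inverse of the dictionary.
z_linear = np.linalg.pinv(D) @ x

# Nonlinear decoder: sparse recovery via orthogonal matching pursuit.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(D, x)
z_omp = omp.coef_

print("linear decoder error:", np.linalg.norm(z_linear - z_true))
print("OMP decoder error:   ", np.linalg.norm(z_omp - z_true))
```

The point of the toy: the measurement is linear, but recovering the sparse latents (and hence probing them reliably) requires a nonlinear decoding step.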
Alice Rigg retweeted
NDIF @ndif_team
📣 Launching monthly interp puzzles 🧩 Each month: a model trained on a toy task. Your job: reverse-engineer the algorithm it learned. First puzzle: how does a 1-2L attn-only transformer find the max of a list? Starter Colab included. Deadline: April 30 puzzles.baulab.info
3 replies · 34 reposts · 234 likes · 38.3K views
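Not the official starter Colab linked above; just a minimal sketch of what the toy max-of-a-list data could look like, with vocabulary size and list length as arbitrary assumptions.

```python
# Minimal sketch of the "max of a list" toy task (not the official starter
# Colab). Vocabulary size and sequence length are arbitrary choices here.
import torch

VOCAB, SEQ_LEN, N = 64, 10, 10_000

def make_batch(n, seq_len=SEQ_LEN, vocab=VOCAB, seed=0):
    g = torch.Generator().manual_seed(seed)
    tokens = torch.randint(0, vocab, (n, seq_len), generator=g)
    labels = tokens.max(dim=-1).values        # target: the largest token
    return tokens, labels

tokens, labels = make_batch(N)
print(tokens[0].tolist(), "->", labels[0].item())

# A natural first hypothesis to test on a 1-2 layer attention-only model:
# does some head, reading from the final position, put most of its attention
# on the position holding the maximum token? Inspecting attention patterns on
# a few examples like these is a reasonable first step in the puzzle.
```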
Alice Rigg @woog09
NNsight is actually really good now; I'd love to see more people playing around with it for their mech interp workflows. The main highlight beyond what's mentioned in the thread is that the team is a lot more responsive on the NDIF server now. Join here: discord.gg/t5Yns5yWdz
Jaden Fiotto-Kaufman @jadenfk23

NNsight 0.6 is out now! We directly address your feedback in our biggest release yet. Pain points included cryptic errors, slow traces, no remote execution of custom code, and limited vLLM support. We tackle all of these and more in this new release. 🧵 Here's what changed:

0 replies · 0 reposts · 2 likes · 165 views
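For anyone who hasn't tried NNsight, here is a minimal tracing sketch in the spirit of the thread. The model name and layer index are arbitrary choices, and accessor details vary across NNsight versions, so treat this as a starting point rather than the 0.6 API verbatim.

```python
# Minimal NNsight sketch (model and layer are arbitrary; exact accessors can
# differ between NNsight versions, so check the current docs).
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The quick brown fox jumps over the lazy dog"):
    # Save the residual-stream output of block 8 and the final logits.
    resid = model.transformer.h[8].output[0].save()
    logits = model.lm_head.output.save()

# Saved proxies hold concrete tensors once the trace exits
# (on older releases, access them via `.value`).
print(resid)
print(logits)
```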
Alice Rigg retweeted
Chris Wendler @wendlerch
Data is plenty, knowledge is scarce. We began to close this gap thanks to deep learning <3 Neural networks can learn “programs” that often achieve superhuman performance from data alone. What insights are encoded in their weights? Here we took a first step on AI protein folding.
Kevin Lu @kevinlu4588

How do protein folding models turn sequence into structure? In "Mechanisms of AI Protein Folding in ESMFold", we find properties like charge and distance encoded in interpretable, steerable directions. The trunk processes features in two phases: chemistry first, then geometry.

2 replies · 10 reposts · 29 likes · 1.9K views
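The "interpretable, steerable directions" claim suggests the usual activation-steering recipe; below is a generic PyTorch sketch of that recipe (not the paper's code; the module, direction, and scale are placeholder assumptions): add a fixed vector to a layer's output during the forward pass via a hook.

```python
# Generic activation-steering sketch (not from the ESMFold paper): push a
# chosen layer's activations along a fixed direction during the forward pass.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32
layer = nn.Linear(d_model, d_model)             # stand-in for a trunk block
direction = torch.randn(d_model)
direction = direction / direction.norm()        # unit "feature" direction
scale = 5.0                                     # steering strength (assumed)

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + scale * direction

handle = layer.register_forward_hook(steer)

x = torch.randn(4, d_model)
steered = layer(x)
handle.remove()
unsteered = layer(x)

print("mean shift along direction:",
      ((steered - unsteered) @ direction).mean().item())
```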
Alice Rigg retweeted
Kevin Lu @kevinlu4588
How do protein folding models turn sequence into structure? In "Mechanisms of AI Protein Folding in ESMFold", we find properties like charge and distance encoded in interpretable, steerable directions. The trunk processes features in two phases: chemistry first, then geometry.
4 replies · 44 reposts · 207 likes · 20.3K views
Alice Rigg retweeted
Jatin Nainani @jatin_n0
Put together a tutorial on NNsight for protein language model interpretability 🧬 It covers infra for acts, SAEs, patching and 3d viz! First of a series on pLMs! Link: github.com/NainaniJatinZ/…
1 reply · 1 repost · 7 likes · 223 views
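Activation patching, one of the techniques the tutorial covers, in a model-agnostic PyTorch sketch (toy two-layer model, not the tutorial's code): cache an activation from a "clean" run, then overwrite the same activation during a "corrupted" run and see how much of the clean output is restored.

```python
# Generic activation-patching sketch (toy model; not the tutorial's code).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean = torch.randn(1, 8)
corrupt = torch.randn(1, 8)

# 1) Cache the clean activation at the layer of interest (after the ReLU).
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

h = model[1].register_forward_hook(save_hook)
clean_out = model(clean)
h.remove()

# 2) Re-run the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]

h = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

corrupt_out = model(corrupt)
print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)  # how much of the clean output is restored?
```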
Alice Rigg retweeted
Eric Todd @ericwtodd
Can you solve this algebra puzzle? 🧩 cb=c, ac=b, ab=? A small transformer can learn to solve problems like this! And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️
9 replies · 50 reposts · 321 likes · 55.6K views
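One reading of the puzzle (an assumption on my part, not stated in the tweet) is that a, b, c label the elements of a cyclic group of order 3 and juxtaposition is the group operation. Under that reading, a tiny brute force pins down the answer.

```python
# Brute-force one reading of the puzzle (assumption: a, b, c label the three
# elements of Z_3 and "xy" means addition mod 3). Not code from the paper.
from itertools import permutations

letters = "abc"
answers = set()
for perm in permutations(range(3)):              # try every labeling of Z_3
    val = dict(zip(letters, perm))
    inv = {v: k for k, v in val.items()}
    op = lambda x, y: inv[(val[x] + val[y]) % 3]
    if op("c", "b") == "c" and op("a", "c") == "b":   # cb = c, ac = b
        answers.add(op("a", "b"))                     # then ab = ?
print(answers)   # {'a'}: cb = c forces b to be the identity, so ab = a
```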
Alice Rigg @woog09
Turns out the submission ID is misleading. ICLR papers are now viewable; there are only 788 pages, which looks like 19,678 total. Still 70% more than last year.
0 replies · 0 reposts · 1 like · 482 views
Alice Rigg @woog09
iclr 2026 has over 25k submissions so this year's mech interp paper review may take me a bit longer, just a heads up
2 replies · 0 reposts · 18 likes · 1.8K views
Alice Rigg retweeted
Jatin Nainani @jatin_n0
Can protein LMs reveal scientific knowledge? We start by asking how pLMs turn sequences into structure signals. We map a contact prediction circuit: early motif features gate later domain features. Spurious or science? We can now test. 🧵(1 of N)
1 reply · 18 reposts · 92 likes · 10.7K views
Alice Rigg retweeted
Martian @withmartian
🚨New Paper Alert! 🚨 Our ICML 2025 paper (led by @Narmeen29013644) shows how small models can help steer and align much larger ones by building “bridges” between them.🧵👇
5 replies · 10 reposts · 42 likes · 6.7K views
Alice Rigg @woog09
New paper from @norabelrose and me. We show how mech interp can be done on generic ReLU networks--a feat previously understood to be intractable. Rather than enumerate over polytopes, we OLS regress on max-entropy inputs, deriving guarantees on model perf. arxiv.org/abs/2502.01032
Nora Belrose @norabelrose

MLPs and GLUs are hard to interpret, but they make up most transformer parameters. Linear and quadratic functions are easier to interpret. We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵

0 replies · 9 reposts · 43 likes · 5.2K views
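The quoted thread describes a closed-form conversion; as a rougher illustration of the regression flavour mentioned above (my own sketch, not the paper's method), one can fit a quadratic polynomial to a small ReLU MLP by ordinary least squares on Gaussian inputs (the max-entropy distribution for a fixed mean and covariance) and check how much of the output it explains.

```python
# Generic sketch (not the paper's closed-form conversion): approximate a small
# ReLU MLP with a quadratic polynomial via OLS on Gaussian inputs, then check
# the fraction of output variance the polynomial explains.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden = 6, 32
mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

n = 20_000
x = torch.randn(n, d_in)
with torch.no_grad():
    y = mlp(x).squeeze(-1)

# Design matrix: [1, x_i, x_i * x_j for i <= j] (constant, linear, quadratic).
quad = torch.stack([x[:, i] * x[:, j]
                    for i in range(d_in) for j in range(i, d_in)], dim=1)
phi = torch.cat([torch.ones(n, 1), x, quad], dim=1)

coef = torch.linalg.lstsq(phi, y.unsqueeze(-1)).solution
resid = y - (phi @ coef).squeeze(-1)
r2 = 1 - resid.var() / y.var()
print(f"quadratic fit R^2 on Gaussian inputs: {r2.item():.3f}")
```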
Alice Rigg @woog09
I'm hoping to have a pragmatic language to reason about core mech interp premises without needing to adopt the 'features as directions' perspective. Ideally we'd have a handful of Zipf laws for AW duality that we could use to sanity-check how 'special' a particular direction is.
0 replies · 0 reposts · 2 likes · 435 views
Alice Rigg @woog09
In the limit of the science of deep learning, there will be a connection from the conditional data distribution, to the inductive biases implied by various architectural choices, to the geometry of the latents induced by a sparsity prior.
1 reply · 0 reposts · 1 like · 484 views
Alice Rigg @woog09
Highly under-appreciated paper: arxiv.org/abs/2203.10736 Activation-weight duality is the area I'm most excited about for pushing the frontier of mech interp. Approximate duality can help predict causal effects, inform feature geometry, and contextualize discussion of dark matter.
1 reply · 1 repost · 20 likes · 969 views
Alice Rigg @woog09
@i000 Yes, the appropriate substitution is taking any convolutional layer followed by a ReLU and replacing both with a gated conv. A protocol is described in this preprint: arxiv.org/pdf/2412.00944
0 replies · 0 reposts · 1 like · 67 views
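A minimal sketch of what that substitution could look like (my own illustrative module, not the preprint's protocol): replace Conv2d + ReLU with the elementwise product of two parallel convolutions, making the block bilinear in its input.

```python
# Illustrative gated-conv substitution (not the preprint's code): a Conv2d+ReLU
# block is replaced by the elementwise product of two parallel convolutions.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_a = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        self.conv_b = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)

    def forward(self, x):
        # Bilinear gate: no elementwise nonlinearity like ReLU is needed.
        return self.conv_a(x) * self.conv_b(x)

# Before: nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
block = GatedConv2d(3, 16, 3, padding=1)
print(block(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 16, 32, 32])
```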
Marcin Cieslik @i000
@thomasdooms Would these bi-MLPs still provide interpretability for networks with convolutional layers?
1 reply · 0 reposts · 0 likes · 491 views
Alice Rigg retweeted
tdooms @thomasdooms
Can we understand neural networks from their weights? Often, the answer is no. An MLP's activation function obscures the relationship between inputs, outputs, and weights. In our new ICLR'25 paper, we study "bilinear MLPs", a special MLP that's performant AND interpretable! 🧵
3 replies · 43 reposts · 395 likes · 45.8K views
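For readers new to the layer the thread is about, here is a minimal sketch of a bilinear MLP as commonly defined, plus the kind of weight-based inspection it enables (sizes are arbitrary and this is an illustration, not the paper's code): two linear maps multiplied elementwise, so each output is a quadratic form in the input whose symmetric interaction matrix can be eigendecomposed directly from the weights.

```python
# Minimal bilinear-MLP sketch (illustrative sizes, not the paper's code):
# out = down( (W x) * (V x) ), with no elementwise activation function.
import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(self.w(x) * self.v(x))

torch.manual_seed(0)
d_model, d_hidden = 16, 64
mlp = BilinearMLP(d_model, d_hidden)

# With no activation function, output k equals the quadratic form x^T B_k x
# with B_k = sum_h down[k,h] * outer(W[h,:], V[h,:]); its symmetric part can
# be eigendecomposed to read interactions straight off the weights.
k = 0
B_k = torch.einsum("h,hi,hj->ij", mlp.down.weight[k], mlp.w.weight, mlp.v.weight)
B_sym = 0.5 * (B_k + B_k.T)
eigvals, eigvecs = torch.linalg.eigh(B_sym)

# Sanity check: the closed form matches the module's forward pass.
x = torch.randn(d_model)
print(torch.allclose(x @ B_k @ x, mlp(x)[k], atol=1e-4))
print("top eigenvalue of the symmetric interaction matrix:", eigvals[-1].item())
```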