Or Shafran
@OrShafran
13 posts
Joined June 2025
83 Following · 99 Followers
Or Shafran retweeted
Andrew Lee @a_jy_l
😻New preprint! As an interp researcher, I often ask “why did the model attend to this token?” We study this by decomposing the query-key (QK) space into interpretable low-rank subspaces. When these subspaces of Qs and Ks align, the model produces high attention scores. 1/N
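The thread's idea of aligned query-key subspaces can be sketched numerically: the attention logit is a bilinear form through the interaction matrix W_Q W_K^T, and an SVD splits that matrix into rank-1 subspaces whose per-direction contributions sum to the full score. The matrices and dimensions below are invented for illustration; this is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 32, 8

# Hypothetical query/key projections for one attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Attention logits are governed by the bilinear form x @ (W_Q @ W_K.T) @ y.
# SVD splits the interaction matrix into rank-1 subspaces; each singular
# direction is one "channel" of QK interaction.
M = W_Q @ W_K.T                      # (d_model, d_model), rank <= d_head
U, S, Vt = np.linalg.svd(M)

x = rng.normal(size=d_model)         # query-side residual vector
y = rng.normal(size=d_model)         # key-side residual vector

# Per-subspace contributions: score = sum_i s_i * (x @ u_i) * (v_i @ y).
contribs = S * (x @ U) * (Vt @ y)
score = x @ M @ y
assert np.isclose(contribs.sum(), score)
```

When x projects strongly onto a query-side direction u_i and y onto its paired key-side direction v_i, that subspace dominates the score, which is one way to read "subspaces of Qs and Ks align".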
Or Shafran @OrShafran
@dana_arad4 It's a good question. I agree feature selection matters a lot for SAE steering, but it also highlights the inherent differences in the units of analysis. It's an interesting point to keep in mind; thanks for flagging it.
Dana Arad @dana_arad4
@OrShafran Super interesting! I noticed you compare steering against SAEs using the AxBench feature selection. In our recent work, we found that better feature selection improves performance by 2-3x. Curious whether MFA would still outperform with that in place? arxiv.org/abs/2505.20063
Or Shafran @OrShafran
It's time to look past dictionary learning for decomposing LM activations. What happens when we instead leverage local geometry? We find a natural region-based decomposition that yields better steering and localization 🧵 1/
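A minimal sketch of what a region-based decomposition could look like, assuming regions are defined by nearest centroids (the paper's actual construction may well differ); all data and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for LM activations: points scattered around 3 "region" centroids.
centroids = rng.normal(scale=4.0, size=(3, 16))
acts = np.concatenate([c + rng.normal(size=(200, 16)) for c in centroids])

# Region-based decomposition sketch: assign each activation to its nearest
# centroid, then split it into a region part (the centroid) and a
# local-variation part (the residual inside the region).
dists = np.linalg.norm(acts[:, None, :] - centroids[None], axis=-1)
labels = dists.argmin(axis=1)
region_part = centroids[labels]
local_part = acts - region_part

# The decomposition is exact by construction.
assert np.allclose(region_part + local_part, acts)
```

Unlike a dictionary-learning decomposition, which expresses an activation as a sparse sum over a global dictionary, this split is driven purely by the local geometry of where the activation sits.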
Or Shafran @OrShafran
Fine-grained Steering 🏎️ We often see a causal split: 1️⃣ Centroids promote broad topics (e.g., "Superheroes"). 2️⃣ Local variation captures specific sub-concepts (e.g., "Batman" vs. "Superman"). 7/
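The causal split described above can be illustrated with a toy steering function: adding a region centroid pushes generations toward the broad topic, while moving along a within-region direction selects a sub-concept. All vectors, names, and coefficients below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Hypothetical directions: a region centroid (broad topic, e.g. "Superheroes")
# and a unit axis of local variation inside that region
# (sub-concept axis, e.g. "Batman" vs "Superman").
topic_centroid = rng.normal(size=d)
local_axis = rng.normal(size=d)
local_axis /= np.linalg.norm(local_axis)

def steer(h, alpha_topic=0.0, alpha_local=0.0):
    """Coarse steering adds the centroid; fine steering moves along the
    local axis within the region. Coefficients are illustrative."""
    return h + alpha_topic * topic_centroid + alpha_local * local_axis

h = rng.normal(size=d)                               # some residual activation
broad = steer(h, alpha_topic=2.0)                    # push toward the topic
fine = steer(h, alpha_topic=2.0, alpha_local=-1.5)   # then pick a sub-concept

# The two steered states differ only along the local (sub-concept) axis.
assert np.isclose(np.linalg.norm(fine - broad), 1.5)
```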
Or Shafran @OrShafran
@YNikankin @megamor2 Thank you! In the paper we used 9k inputs, but for other experiments that we didn't include we factorized around 100k. We didn't include timing measurements, but from experience the algorithm converges on an H100 in a couple of minutes.
Yaniv Nikankin @YNikankin
@megamor2 Very cool work! Maybe I missed it, but what is the maximal number of inputs (n) you tried this on? And how long does the optimization of ZY take for this value?
Or Shafran retweeted
Mor Geva @megamor2
✨MLP layers have just become more interpretable than ever ✨ In a new paper: * We show a simple method for decomposing MLP activations into interpretable features * Our method uncovers hidden concept hierarchies, where sparse neuron combinations form increasingly abstract ideas
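The "ZY" optimization mentioned downthread suggests the MLP activation matrix is factorized as A ≈ Z Y. A generic stand-in for that family of methods is plain NMF with multiplicative updates on synthetic low-rank data; this is a sketch under that assumption, not the paper's algorithm or its constraints.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLP activations: n inputs x m neurons, built from k ground-truth
# nonnegative "features" so a low-rank factorization can recover structure.
n, m, k = 120, 40, 5
A = rng.random((n, k)) @ rng.random((k, m))

# Plain NMF via multiplicative updates: minimize ||A - Z Y||_F with Z, Y >= 0.
# (A sketch; the paper's ZY optimization may use different objectives.)
Z = rng.random((n, k)) + 0.1
Y = rng.random((k, m)) + 0.1
for _ in range(500):
    Y *= (Z.T @ A) / (Z.T @ Z @ Y + 1e-9)
    Z *= (A @ Y.T) / (Z @ Y @ Y.T + 1e-9)

# Rows of Y act as feature directions over neurons; rows of Z are the
# per-input (sparse-able) codes over those features.
err = np.linalg.norm(A - Z @ Y) / np.linalg.norm(A)
assert err < 0.2
```

Each row of Y can then be interpreted as a combination of neurons, which is the kind of unit the thread describes composing into increasingly abstract concepts.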
Or Shafran @OrShafran
@cohenrap @megamor2 You clicked just fine! We're currently working on cleaning up the code for release and hope to publish it soon.