Or Shafran
@OrShafran
13 posts
Joined June 2025
83 Following · 99 Followers
Or Shafran retweeted
Andrew Lee @a_jy_l
😻New preprint! As an interp researcher, I often ask “why did the model attend to this token?” We study this by decomposing the query-key (QK) space into interpretable low-rank subspaces. When these subspaces of Qs and Ks align, the model produces high attention scores. 1/N
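The thread's idea of aligned query-key subspaces can be sketched numerically: the attention logit is a bilinear form through the interaction matrix W_Q W_K^T, and an SVD splits that matrix into rank-1 subspaces whose per-direction contributions sum to the full score. The matrices and dimensions below are invented for illustration; this is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 32, 8

# Hypothetical query/key projections for one attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Attention logits are governed by the bilinear form x @ (W_Q @ W_K.T) @ y.
# SVD splits the interaction matrix into rank-1 subspaces; each singular
# direction is one "channel" of QK interaction.
M = W_Q @ W_K.T                      # (d_model, d_model), rank <= d_head
U, S, Vt = np.linalg.svd(M)

x = rng.normal(size=d_model)         # query-side residual vector
y = rng.normal(size=d_model)         # key-side residual vector

# Per-subspace contributions: score = sum_i s_i * (x @ u_i) * (v_i @ y).
contribs = S * (x @ U) * (Vt @ y)
score = x @ M @ y
assert np.isclose(contribs.sum(), score)
```

When x projects strongly onto a query-side direction u_i and y onto its paired key-side direction v_i, that subspace dominates the score, which is one way to read "subspaces of Qs and Ks align".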
Or Shafran @OrShafran
@dana_arad4 It's a good question. I agree feature selection matters a lot for SAE steering, but it also highlights the inherent differences in the units of analysis. It's an interesting point to keep in mind; thanks for flagging it.
Dana Arad @dana_arad4
@OrShafran Super interesting! I noticed you compare steering against SAEs using the AxBench feature selection. In our recent work, we found that better feature selection improves performance by 2-3x. Curious whether MFA would still outperform with that in place? arxiv.org/abs/2505.20063
Or Shafran @OrShafran
It's time to look past dictionary learning for decomposing LM activations. What happens when we instead leverage local geometry? We find a natural region-based decomposition that yields better steering and localization 🧵 1/
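A minimal sketch of what a region-based decomposition could look like, assuming regions are defined by nearest centroids (the paper's actual construction may well differ); all data and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for LM activations: points scattered around 3 "region" centroids.
centroids = rng.normal(scale=4.0, size=(3, 16))
acts = np.concatenate([c + rng.normal(size=(200, 16)) for c in centroids])

# Region-based decomposition sketch: assign each activation to its nearest
# centroid, then split it into a region part (the centroid) and a
# local-variation part (the residual inside the region).
dists = np.linalg.norm(acts[:, None, :] - centroids[None], axis=-1)
labels = dists.argmin(axis=1)
region_part = centroids[labels]
local_part = acts - region_part

# The decomposition is exact by construction.
assert np.allclose(region_part + local_part, acts)
```

Unlike a dictionary-learning decomposition, which expresses an activation as a sparse sum over a global dictionary, this split is driven purely by the local geometry of where the activation sits.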
Or Shafran @OrShafran
Fine-grained Steering 🏎️ We often see a causal split: 1️⃣ Centroids promote broad topics (e.g., "Superheroes"). 2️⃣ Local variation captures specific sub-concepts (e.g., "Batman" vs. "Superman"). 7/
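The causal split described above can be illustrated with a toy steering function: adding a region centroid pushes generations toward the broad topic, while moving along a within-region direction selects a sub-concept. All vectors, names, and coefficients below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Hypothetical directions: a region centroid (broad topic, e.g. "Superheroes")
# and a unit axis of local variation inside that region
# (sub-concept axis, e.g. "Batman" vs "Superman").
topic_centroid = rng.normal(size=d)
local_axis = rng.normal(size=d)
local_axis /= np.linalg.norm(local_axis)

def steer(h, alpha_topic=0.0, alpha_local=0.0):
    """Coarse steering adds the centroid; fine steering moves along the
    local axis within the region. Coefficients are illustrative."""
    return h + alpha_topic * topic_centroid + alpha_local * local_axis

h = rng.normal(size=d)                               # some residual activation
broad = steer(h, alpha_topic=2.0)                    # push toward the topic
fine = steer(h, alpha_topic=2.0, alpha_local=-1.5)   # then pick a sub-concept

# The two steered states differ only along the local (sub-concept) axis.
assert np.isclose(np.linalg.norm(fine - broad), 1.5)
```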
Or Shafran @OrShafran
@YNikankin @megamor2 Thank you! In the paper we used 9k inputs, but for other experiments that we didn't include we factorized around 100k. We didn't include timing measurements, but from experience the algorithm converges on an H100 in a couple of minutes.
Yaniv Nikankin @YNikankin
@megamor2 Very cool work! Maybe I missed it, but what is the maximal number of inputs (n) you tried this on? And how long does the optimization of ZY take for this value?
Or Shafran retweeted
Mor Geva @megamor2
✨MLP layers have just become more interpretable than ever ✨ In a new paper: * We show a simple method for decomposing MLP activations into interpretable features * Our method uncovers hidden concept hierarchies, where sparse neuron combinations form increasingly abstract ideas
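The "ZY" optimization mentioned downthread suggests the MLP activation matrix is factorized as A ≈ Z Y. A generic stand-in for that family of methods is plain NMF with multiplicative updates on synthetic low-rank data; this is a sketch under that assumption, not the paper's algorithm or its constraints.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLP activations: n inputs x m neurons, built from k ground-truth
# nonnegative "features" so a low-rank factorization can recover structure.
n, m, k = 120, 40, 5
A = rng.random((n, k)) @ rng.random((k, m))

# Plain NMF via multiplicative updates: minimize ||A - Z Y||_F with Z, Y >= 0.
# (A sketch; the paper's ZY optimization may use different objectives.)
Z = rng.random((n, k)) + 0.1
Y = rng.random((k, m)) + 0.1
for _ in range(500):
    Y *= (Z.T @ A) / (Z.T @ Z @ Y + 1e-9)
    Z *= (A @ Y.T) / (Z @ Y @ Y.T + 1e-9)

# Rows of Y act as feature directions over neurons; rows of Z are the
# per-input (sparse-able) codes over those features.
err = np.linalg.norm(A - Z @ Y) / np.linalg.norm(A)
assert err < 0.2
```

Each row of Y can then be interpreted as a combination of neurons, which is the kind of unit the thread describes composing into increasingly abstract concepts.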
Or Shafran @OrShafran
@cohenrap @megamor2 You clicked just fine! We're currently working on cleaning up the code for release and hope to publish it soon.