Michael Li

42 posts

@bearseascape

Joined August 2020
366 Following, 35 Followers
Michael Li@bearseascape·
@lawrencefeng17 @nsubramani23 The unsatisfying answer is that we don’t know for sure. One hope is that circuits built on disentangled features (i.e., those from SAEs or CLTs) might satisfy these criteria/overcome superposition.
Lawrence Feng@lawrencefeng17·
@bearseascape @nsubramani23 Do you think that these four criteria are satisfiable? If superposition is true, why should we hope that we can find circuits with these properties at any granularity above activations?
Michael Li@bearseascape·
Do the circuits we extract to explain a model's behavior actually tell us how it solves a specific task? In new work w/ @nsubramani23, we find that circuits fail a basic check: ablating one task's circuit hurts another task about as much as ablating that task's own circuit. 🧵
Michael Li@bearseascape·
@celestepoasts Cool! Did you also run control experiments for the probes (i.e., Hewitt & Liang-style selectivity tasks)? I’d be worried about the probes overfitting or not reflecting the information actually decodable from model internals.
Celeste@celestepoasts·
Let a Claude hillclimb probe architecture using Karpathy's autoresearch. I think this is kinda interesting.
Michael Li@bearseascape·
Taking advantage of open-sourced intermediate checkpoints, we ask how these properties evolve during pretraining. In OLMo-2-7B and Pythia-6.9B, inflectional features emerge early, while lexical identity keeps shifting throughout training.
Michael Li@bearseascape·
Update: Model Internal Sleuthing was accepted to ACL 2026! This is joint work with @nsubramani23. The quoted thread covers the main findings. Since then, we’ve added a few more analyses on how lexical identity and inflection behave across models and training: 🧵
Michael Li@bearseascape

🚨 New #interpretability paper with @nsubramani23: 🕵️ Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

Sasha Boguraev@SashaBoguraev·
I seem to remember a paper from a few months ago about how even high-precision SAE feature labels have low recall, but I can't for the life of me find it. Can someone point me in the right direction, or did I hallucinate this?
alth0u🧶@alth0u·
Steering2Vec: there will be a paper on this, something like prompt -> steering vector.