Michael Li

42 posts

@bearseascape

Joined August 2020

366 Following 35 Followers

Michael Li@bearseascape·20 May

@lawrencefeng17 @nsubramani23 The unsatisfying answer is we don’t know for sure. One hope is that circuits built on disentangled features (i.e., those from SAEs or CLTs) might satisfy these criteria/overcome superposition.

Lawrence Feng@lawrencefeng17·20 May

@bearseascape @nsubramani23 Do you think that these four criteria are satisfiable? If superposition is true, why should we hope that we can find circuits with these properties at any granularity above activations?

Michael Li@bearseascape·20 May

Do the circuits we extract to explain a model's behavior actually tell us how it solves a specific task? In new work w/ @nsubramani23, we find that circuits fail a basic check: ablating one task's circuit hurts another task about as much as ablating that task's own circuit. 🧵
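
One way to picture the check described above, as a rough sketch rather than the paper's actual code: collect the performance drop on each task after ablating each task's circuit, then compare the drop caused by other tasks' circuits with the drop caused by the task's own circuit. The function name, the task names, and every number below are hypothetical and chosen only for illustration.

```python
from statistics import mean


def cross_task_ratios(drops):
    """drops[i][j]: performance drop on task j after ablating task i's circuit.

    Returns, for each task j, the mean drop caused by the *other* tasks'
    circuits divided by the drop caused by task j's own circuit.
    A ratio near 1.0 suggests the circuit is not specific to its task.
    """
    tasks = list(drops)
    ratios = {}
    for j in tasks:
        own = drops[j][j]
        others = mean(drops[i][j] for i in tasks if i != j)
        # Guard against a circuit whose ablation has no effect on its own task.
        ratios[j] = others / own if own != 0 else float("inf")
    return ratios


# Made-up numbers for illustration only (NOT results from the paper).
example_drops = {
    "task_a": {"task_a": 0.40, "task_b": 0.36},
    "task_b": {"task_a": 0.38, "task_b": 0.40},
}
ratios = cross_task_ratios(example_drops)
```

With these invented values, both ratios land close to 1, which is the pattern the tweet describes: ablating another task's circuit hurts about as much as ablating the task's own.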

2.5K

Michael Li@bearseascape·20 May

See our paper for the full results, including pre-training dynamics with OLMo-2. Paper: arxiv.org/abs/2605.08348 Code: github.com/ml5885/circuit…

150

Michael Li@bearseascape·20 May

And @DakingRai et al. show that circuit discovery often recovers dataset-specific or mixed-mechanism circuits rather than general ones. They propose grouping examples by mechanism and discovering circuits per group, which is a promising direction. x.com/dakingrai/stat…

Daking Rai@DakingRai

🚨 New paper: Data-driven Circuit Discovery for Interpretability of Language Models 🚨 Do circuits actually explain how language models (LMs) implement a task? In mechanistic interpretability, the goal of circuit study is to discover a “circuit” that is responsible for implementing a “task”. But we find that existing methods often discover circuits that are: ❌ not general task circuits: they do not capture the full range of mechanisms LMs use across the task. Instead, they find: ✅ dataset-specific circuits: they explain how the model processes the examples used for circuit discovery. ✅ mixed-mechanism circuits: consisting of multiple independent mechanisms mixed in a single circuit. 1/🧵

224

Michael Li@bearseascape·5 May

@celestepoasts Cool! Did you also run control experiments for the probes (i.e., Hewitt & Liang-style selectivity tasks)? I’d be worried about the probes overfitting/not reflecting actual information decodable from model internals.
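
For readers unfamiliar with the control-task idea mentioned above, here is a minimal sketch. It assumes the usual formulation: each word type gets one random label that is reused for every occurrence, and selectivity is the probe's accuracy on the real task minus its accuracy on the control task. The helper names, the toy tokens, and the accuracy values are hypothetical, and training the probe itself is left out.

```python
import random


def make_control_labels(tokens, labels, seed=0):
    """Assign each word type one random label, reused for every occurrence."""
    rng = random.Random(seed)
    type_to_label = {}
    for tok in tokens:
        if tok not in type_to_label:
            # Sampling from the observed labels roughly preserves
            # the empirical label distribution.
            type_to_label[tok] = rng.choice(labels)
    return [type_to_label[tok] for tok in tokens]


def selectivity(task_accuracy, control_accuracy):
    """Real-task accuracy minus control-task accuracy."""
    return task_accuracy - control_accuracy


# Toy data for illustration only.
tokens = ["the", "cat", "the", "dog"]
labels = ["DET", "NOUN", "DET", "NOUN"]
control = make_control_labels(tokens, labels)

# Hypothetical probe accuracies; a real run would measure both.
gap = selectivity(0.90, 0.60)
```

A probe that scores nearly as well on the control labels as on the real ones has low selectivity, which points to memorization by the probe rather than information decodable from the representations.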

Celeste@celestepoasts·5 May

let a claude hillclimb probe architecture using karpathy autoresearch I think this is kinda interesting

4.5K

Michael Li@bearseascape·4 May

See the paper for full details, including per-model results, probe error analysis by part of speech, and steering experiments we didn't get to in this thread. Paper: arxiv.org/abs/2506.02132 Code: github.com/ml5885/model_i…

Michael Li@bearseascape·4 May

Taking advantage of open-sourced intermediate checkpoints, we ask how these properties evolve during pretraining. In OLMo-2-7B and Pythia-6.9B, inflectional features emerge early, while lexical identity keeps shifting throughout training.

Michael Li@bearseascape·4 May

Update: Model Internal Sleuthing was accepted to ACL 2026! This is joint work with @nsubramani23. The quoted thread covers the main findings. Since then, we’ve added a few more analyses on how lexical identity and inflection behave across models and training: 🧵

Michael Li@bearseascape

🚨 New #interpretability paper with @nsubramani23: 🕵️ Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

1.4K

Michael Li@bearseascape·17 Apr

@SashaBoguraev This might cite the paper you are thinking of arxiv.org/abs/2505.20254

Sasha Boguraev@SashaBoguraev·16 Apr

I seem to remember a paper from a few months ago about how even high-precision SAE feature labels have low-recall, but I can't for the life of me find it. Can someone point me in the right direction or did I hallucinate this?

251

Michael Li@bearseascape·4 Apr

@alth0u This paper does essentially that. arxiv.org/abs/2506.03292

123

alth0u🧶@alth0u·4 Apr

Steering2Vec there will be a paper on this something like prompt -> steering vector

2.4K
