New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning?
Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders.
This finds interpretable and causal chat-only features! 🧵
