Todd Nief

3.4K posts

@toddknife

CS PhD student @uchicago, Gym owner, Chicago Rationality organizer, Like Rats, Hate Force, etc.

Chicago · Joined March 2009

804 Following · 455 Followers

Todd Nief @toddknife · Feb 19

Softmax models (like LLMs) have curves

Kiho Park @KihoPark_

Interpreting and controlling internal representations should be based on how the model actually uses them! Turns out: information geometry makes this precise. We show how, and use it to derive a (provably & empirically) robust strategy for steering. arxiv.org/abs/2602.15293

Todd Nief reposted

Xiaoyan Bai @Elenal3ai · Feb 10

📖 ≠ 🧪 The Story is Not the Science. Code is submitted but rarely executed during peer review--an issue likely to worsen with research agents.🧑‍🔬 We introduce MechEvalAgent, an execution-grounded evaluation of narrative + execution. Verify the science, not just the story. 1/n

Todd Nief reposted

Yichen (Zach) Wang @YichenZW · Dec 22

Lack of diversity in your LLM generation? (also noted by Artificial Hivemind, best paper @NeurIPSConf) Time to bring your base model back! An inference-time, token-level collaboration between a base and an aligned model can optimize and control diversity and quality!

Todd Nief @toddknife · Dec 22

@EkdeepL @universeinanegg 2. The "meaning" of a specific direction in the residual stream changing based on context. I think this can also happen based on the local geometry given a context.

Todd Nief @toddknife · Dec 22

@EkdeepL @universeinanegg Maybe useful to disentangle two ideas: 1. Changing context fundamentally changes downstream computation of a direction in the residual stream (Seems like what this paper is doing, also certainly happens with polysemy)

Ari Holtzman @universeinanegg · Dec 21

Can we find a direction in the residual stream that clearly has two very different interventional effects in different context or at different layers? This seems inevitable, since there aren't enough directions to encode all aspects of reality, but I haven't seen it yet

Todd Nief @toddknife · Dec 9

@IkhlasulHanif0 I'm not sure if I fully understand your point, but steering can also overwrite previous information — this is potentially fine, but can impact off-target concepts and alter behavior

Hanif | AI NOT FOR PRODUCTIVITY @IkhlasulHanif0 · Dec 7

I haven’t really worked with activation patching myself, but I’ve done more with steering. I’ve been wondering whether the same idea applies to steering, in the sense that the steering vector we get for a certain layer assumes that the layer hasn’t already been affected by any steering.

Todd Nief @toddknife · Dec 7

Most mech interp work relies on activation patching, but patching activations destroys previous computation. What if we want to use a different mechanism on the same residual stream? We propose dynamic weight grafting to interpret finetuned model weights. 🧵 1/n

Todd Nief @toddknife · Dec 7

Blog post: toddnief.com/articles/dynam… Paper: arxiv.org/abs/2506.20746 Code: github.com/toddnief/dynam… 14/14

Todd Nief @toddknife · Dec 7

To conclude: 1. Dynamic weight grafting is a new technique that allows localization of finetuned model behavior to specific token positions and model components 13/n
