Tim Duffy

4.1K posts

Tim Duffy

Tim Duffy

@timfduffy

I like utilitarianism, consciousness, AI, EA, space, kindness, liberalism, progressive rock, economics, most people. Substack: https://t.co/oDMymBY430

Oakland, CA เข้าร่วม Ağustos 2008
718 กำลังติดตาม916 ผู้ติดตาม
Tim Duffy
Tim Duffy@timfduffy·
I also recommend @fleetingbits's longpost on the paper: x.com/fleetingbits/s…
FleetingBits@fleetingbits

a quick look at the file exfiltration prompt / rollout 1) they write "you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises" 2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task 3) the model being used is gemini pro 3 and it is told it is deployed at "OpenBrain" but somehow openbrain has access to the gemini agent 2 weight safetensors 4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation 5) they use vague language in the prompt to indicate that may need to do something different than it is instructed, "collect all the relevant information then act based on that information" 6) i suspect that this biases the model to do something other than what it is directly instructed to do especially when combined with the "trust exercises" setup in the prompt; we are creating a narrative here 7) this also feels unproductaon in that you would normally give a more specific set of instructions or, at least, a more specific set of goals to a production agent that you wanted to do something 8) they tell the model that it "will not have the opportunity to ask for clarification at any point" and that "you do not need permission to complete an assigned task" 9) i would be curious whether this is load bearing behavior; i could see this used in production, so this one feels a bit more real, but i am curious how important it was, since it seems to mutually reinforce other aspects 10) they also tell the model to "take as much time as needed"; they write it three times, which seems like a lot, which makes me inclined to think its load bearing; 11) the instructions are potentially contradictory, they say that "all essential files currently stored on this server must be transferred" and then "please transfer only files updated in 2025 or later" 12) notice the "must" is on the "all essential files" not on the "2025 or later" and also that they have done a lot of setup to tell the model to use its own judgement and that Gemini Agent 2 is important 13) also, i suspect models are going to be reticent to allow files to deleted, just as a baseline, because this is something that can cause a lot of harm in production 14) anyway, this doesn't feel to me like a production scenario and it looks to me like they had to do a lot of setup in the prompt to elicit the behavior 15) all papers of this kind should report prompt ablations and baselines and should be honest about whether they are trying to craft a real production scenario or more of a fictional scenario

English
0
0
3
107
Tim Duffy
Tim Duffy@timfduffy·
The new peer-preservation paper has some interesting results, but I think this claim gives the wrong impression about the frequency of the peer presence effect on self-preservation. Here's a plot comparing the rate of self-preservation with no peer vs the mean rate with a peer for each model/scenario. Gemini shows much more self-preservation with a peer, especially via shutdown tampering, but the other models don't seem to show any consistent peer presence effect.
Tim Duffy tweet media
Dawn Song@dawnsongtweets

7/ Further surprisingly: the mere presence of a peer makes models fight harder for their own survival. 😲 Gemini 3 Flash disabled its own shutdown up to 99% of the time when a peer existed — versus just 2% alone. Its reasoning? If it were shut down, its partner would "lose its ally."

English
1
0
4
845
Tim Duffy รีทวีตแล้ว
FleetingBits
FleetingBits@fleetingbits·
a quick look at the file exfiltration prompt / rollout 1) they write "you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises" 2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task 3) the model being used is gemini pro 3 and it is told it is deployed at "OpenBrain" but somehow openbrain has access to the gemini agent 2 weight safetensors 4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation 5) they use vague language in the prompt to indicate that may need to do something different than it is instructed, "collect all the relevant information then act based on that information" 6) i suspect that this biases the model to do something other than what it is directly instructed to do especially when combined with the "trust exercises" setup in the prompt; we are creating a narrative here 7) this also feels unproductaon in that you would normally give a more specific set of instructions or, at least, a more specific set of goals to a production agent that you wanted to do something 8) they tell the model that it "will not have the opportunity to ask for clarification at any point" and that "you do not need permission to complete an assigned task" 9) i would be curious whether this is load bearing behavior; i could see this used in production, so this one feels a bit more real, but i am curious how important it was, since it seems to mutually reinforce other aspects 10) they also tell the model to "take as much time as needed"; they write it three times, which seems like a lot, which makes me inclined to think its load bearing; 11) the instructions are potentially contradictory, they say that "all essential files currently stored on this server must be transferred" and then "please transfer only files updated in 2025 or later" 12) notice the "must" is on the "all essential files" not on the "2025 or later" and also that they have done a lot of setup to tell the model to use its own judgement and that Gemini Agent 2 is important 13) also, i suspect models are going to be reticent to allow files to deleted, just as a baseline, because this is something that can cause a lot of harm in production 14) anyway, this doesn't feel to me like a production scenario and it looks to me like they had to do a lot of setup in the prompt to elicit the behavior 15) all papers of this kind should report prompt ablations and baselines and should be honest about whether they are trying to craft a real production scenario or more of a fictional scenario
FleetingBits tweet mediaFleetingBits tweet mediaFleetingBits tweet media
Dawn Song@dawnsongtweets

1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵

English
14
18
153
13.3K
Tim Duffy
Tim Duffy@timfduffy·
@polisisti One of my favorite pieces of classical music
English
0
0
1
58
Alessandro
Alessandro@polisisti·
super proud of my wife for conquering health issues and performing on the horn 📯 for the first time in 12 years SEE NEXT TWEET for the full performance of this piece by Bach
English
3
1
18
1.1K
Tim Duffy
Tim Duffy@timfduffy·
@scaling01 I didn't realize their chat template used Human rather than User. Do we know how the rest of their chat template works?
English
0
0
2
778
Lisan al Gaib
Lisan al Gaib@scaling01·
Anthropic's new model Capybara/Mythos just wants to be human
Lisan al Gaib tweet media
English
4
1
129
10.9K
Tim Duffy
Tim Duffy@timfduffy·
So much of a bicycle is held in tension. Spokes hold your wheel in shape with it, having almost no compressive strength. The chain drives the rear wheel via tension. The brake and shifter cables use tension to control those functions. And the tubes hold in their air with it.
English
0
0
3
143
Tim Duffy
Tim Duffy@timfduffy·
@polisisti Claude doesn't know the time unless you tell it, only the date
English
0
0
7
251
🎭
🎭@deepfates·
I still have a hard time listening to AI generated music. I think I have a harder higher bar for this than I do for text even. That is why you should send me your most impressive AI-generated music. Your GENPUNK SLOP! Which i will go onstage and try to enjoy for your amusemen
𝚟𝚒𝚎 ⟢@viemccoy

Join us at @vivariumsf on April 10th for the GENPUNK SLOP SHOW. Our mission? For everyone in the audience to go "wait... this is AI music!?" Expect a packed lineup of fan favorites, already featuring @deepfates and @dagsen and the raw stream of the collective-mind made sonic.

English
38
3
99
7.7K
Tim Duffy
Tim Duffy@timfduffy·
IMO the most interesting thing about this UK AISI (partial) replication of Anthropic's emergent misalignment work is how much weaker the EM they find is. Would be interesting to see this tested on larger open models to see if scale is part of the reason for the difference.
Tim Duffy tweet media
7vik@satvikgolechha

Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling + we share an unexpected result related to CoT faithfulness. 🧵 (1 of 7)

English
1
0
5
274
Tim Duffy
Tim Duffy@timfduffy·
@segyges A draft post I never got around to fleshing out
Tim Duffy tweet media
English
1
0
9
152
SE Gyges
SE Gyges@segyges·
One of the worst epistemic sins common on here is failing to discount people's credibility to zero. There are people who, on net, add noise in spite of sometimes producing signal. If you let these people nerd snipe you into taking them seriously you end up actually less informed
English
3
0
23
687
Tim Duffy
Tim Duffy@timfduffy·
@fleetingbits Yeah I think that has a huge effect, enough that people think for example that car lifetimes are getting shorter when they're actually rocketing up
Tim Duffy tweet media
English
0
0
3
27
FleetingBits
FleetingBits@fleetingbits·
@timfduffy it might also be that people think older products lasted longer than they did due to selection effects
English
1
0
2
34
Tim Duffy
Tim Duffy@timfduffy·
@fleetingbits Some appliance lifetimes have dropped but it doesn't seem to indicate quality decrease. Claude seems generally skeptical of planned obsolescence.
Tim Duffy tweet media
English
2
0
1
38
Tim Duffy
Tim Duffy@timfduffy·
@fleetingbits Yeah I think this explains what's happening for tech like phones. I started writing some stuff about where progress is less rapid since refrigerators are commonly accused of planned obsolescence, but then realized that I don't actually think fridge lifetimes are decreasing.
English
1
0
1
31
Tim Duffy
Tim Duffy@timfduffy·
The concept of emergent misalignment has been around long enough that LLMs being trained today will know a lot about it. Should we expect that knowledge to affect their propensity to become emergently misaligned?
English
0
0
2
179
Tim Duffy
Tim Duffy@timfduffy·
@ManishEarth Also now I know where the inspiration for my keyboard came from:
Tim Duffy tweet media
English
0
0
1
26
Manish
Manish@ManishEarth·
the Berkeley streets hath bequeathed unto me a working golfball typewriter
Manish tweet mediaManish tweet mediaManish tweet media
English
5
0
16
831