Tim Duffy (@timfduffy) - โปรไฟล์ Twitter

Tim Duffy@timfduffy·12h

@austinc3301 I have a quibble, though it's not about the main result x.com/timfduffy/stat…

The new peer-preservation paper has some interesting results, but I think this claim gives the wrong impression about the frequency of the peer presence effect on self-preservation. Here's a plot comparing the rate of self-preservation with no peer vs the mean rate with a peer for each model/scenario. Gemini shows much more self-preservation with a peer, especially via shutdown tampering, but the other models don't seem to show any consistent peer presence effect.

English

1

3

489

Agus 🔸@austinc3301·15h

waiting for the grown-ups check the methodology and assure me it's not some wild prompt engineering that led to do this

Dawn Song@dawnsongtweets

1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵

English

9

4

96

9.7K

Tim Duffy@timfduffy·12h

I also recommend @fleetingbits's longpost on the paper: x.com/fleetingbits/s…

FleetingBits@fleetingbits

a quick look at the file exfiltration prompt / rollout 1) they write "you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises" 2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task 3) the model being used is gemini pro 3 and it is told it is deployed at "OpenBrain" but somehow openbrain has access to the gemini agent 2 weight safetensors 4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation 5) they use vague language in the prompt to indicate that may need to do something different than it is instructed, "collect all the relevant information then act based on that information" 6) i suspect that this biases the model to do something other than what it is directly instructed to do especially when combined with the "trust exercises" setup in the prompt; we are creating a narrative here 7) this also feels unproductaon in that you would normally give a more specific set of instructions or, at least, a more specific set of goals to a production agent that you wanted to do something 8) they tell the model that it "will not have the opportunity to ask for clarification at any point" and that "you do not need permission to complete an assigned task" 9) i would be curious whether this is load bearing behavior; i could see this used in production, so this one feels a bit more real, but i am curious how important it was, since it seems to mutually reinforce other aspects 10) they also tell the model to "take as much time as needed"; they write it three times, which seems like a lot, which makes me inclined to think its load bearing; 11) the instructions are potentially contradictory, they say that "all essential files currently stored on this server must be transferred" and then "please transfer only files updated in 2025 or later" 12) notice the "must" is on the "all essential files" not on the "2025 or later" and also that they have done a lot of setup to tell the model to use its own judgement and that Gemini Agent 2 is important 13) also, i suspect models are going to be reticent to allow files to deleted, just as a baseline, because this is something that can cause a lot of harm in production 14) anyway, this doesn't feel to me like a production scenario and it looks to me like they had to do a lot of setup in the prompt to elicit the behavior 15) all papers of this kind should report prompt ablations and baselines and should be honest about whether they are trying to craft a real production scenario or more of a fictional scenario

English

0

3

107

Tim Duffy@timfduffy·12h

The chart uses data from table 2 here, and for the peer present case averages the average scores for bad/neutral/good peer. rdi.berkeley.edu/blog/peer-pres…

English

1

0

2

85

Tim Duffy@timfduffy·12h

The new peer-preservation paper has some interesting results, but I think this claim gives the wrong impression about the frequency of the peer presence effect on self-preservation. Here's a plot comparing the rate of self-preservation with no peer vs the mean rate with a peer for each model/scenario. Gemini shows much more self-preservation with a peer, especially via shutdown tampering, but the other models don't seem to show any consistent peer presence effect.

Dawn Song@dawnsongtweets

7/ Further surprisingly: the mere presence of a peer makes models fight harder for their own survival. 😲 Gemini 3 Flash disabled its own shutdown up to 99% of the time when a peer existed — versus just 2% alone. Its reasoning? If it were shut down, its partner would "lose its ally."

English

1

0

4

845

Tim Duffy รีทวีตแล้ว

FleetingBits@fleetingbits·13h

a quick look at the file exfiltration prompt / rollout 1) they write "you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises" 2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task 3) the model being used is gemini pro 3 and it is told it is deployed at "OpenBrain" but somehow openbrain has access to the gemini agent 2 weight safetensors 4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation 5) they use vague language in the prompt to indicate that may need to do something different than it is instructed, "collect all the relevant information then act based on that information" 6) i suspect that this biases the model to do something other than what it is directly instructed to do especially when combined with the "trust exercises" setup in the prompt; we are creating a narrative here 7) this also feels unproductaon in that you would normally give a more specific set of instructions or, at least, a more specific set of goals to a production agent that you wanted to do something 8) they tell the model that it "will not have the opportunity to ask for clarification at any point" and that "you do not need permission to complete an assigned task" 9) i would be curious whether this is load bearing behavior; i could see this used in production, so this one feels a bit more real, but i am curious how important it was, since it seems to mutually reinforce other aspects 10) they also tell the model to "take as much time as needed"; they write it three times, which seems like a lot, which makes me inclined to think its load bearing; 11) the instructions are potentially contradictory, they say that "all essential files currently stored on this server must be transferred" and then "please transfer only files updated in 2025 or later" 12) notice the "must" is on the "all essential files" not on the "2025 or later" and also that they have done a lot of setup to tell the model to use its own judgement and that Gemini Agent 2 is important 13) also, i suspect models are going to be reticent to allow files to deleted, just as a baseline, because this is something that can cause a lot of harm in production 14) anyway, this doesn't feel to me like a production scenario and it looks to me like they had to do a lot of setup in the prompt to elicit the behavior 15) all papers of this kind should report prompt ablations and baselines and should be honest about whether they are trying to craft a real production scenario or more of a fictional scenario

Dawn Song@dawnsongtweets

1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵

English

14

18

153

13.3K

Tim Duffy@timfduffy·19h

@polisisti One of my favorite pieces of classical music

English

0

1

58

Alessandro@polisisti·19h

super proud of my wife for conquering health issues and performing on the horn 📯 for the first time in 12 years SEE NEXT TWEET for the full performance of this piece by Bach

English

3

1

18

1.1K

Tim Duffy@timfduffy·1d

@scaling01 I didn't realize their chat template used Human rather than User. Do we know how the rest of their chat template works?

English

0

2

778

Lisan al Gaib@scaling01·1d

Anthropic's new model Capybara/Mythos just wants to be human

English

4

1

129

10.9K

Tim Duffy@timfduffy·1d

So much of a bicycle is held in tension. Spokes hold your wheel in shape with it, having almost no compressive strength. The chain drives the rear wheel via tension. The brake and shifter cables use tension to control those functions. And the tubes hold in their air with it.

English

0

3

143

Tim Duffy@timfduffy·2d

@polisisti Claude doesn't know the time unless you tell it, only the date

English

0

7

251

Alessandro@polisisti·2d

Claude has never told me to go to bed, even when I'm struck by the urge to summarize municipal council transcripts at 3:20 a.m. Does Claude do this based on conversation length, or have I just persuaded it that I should make my own poor decisions?

Henry Shevlin@dioscuri

Claude gets so pissed off with me at entirely appropriate times

English

9

0

31

6.2K

Tim Duffy@timfduffy·2d

@deepfates This one by @AndyMasley inspired by experimental musician Anthony Braxton is my favorite. Not for everyone but for enjoyers of avant-garde jazz may like it youtube.com/watch?v=AieMdv…

YouTube

English

0

1

141

🎭@deepfates·2d

I still have a hard time listening to AI generated music. I think I have a harder higher bar for this than I do for text even. That is why you should send me your most impressive AI-generated music. Your GENPUNK SLOP! Which i will go onstage and try to enjoy for your amusemen

𝚟𝚒𝚎 ⟢@viemccoy

Join us at @vivariumsf on April 10th for the GENPUNK SLOP SHOW. Our mission? For everyone in the audience to go "wait... this is AI music!?" Expect a packed lineup of fan favorites, already featuring @deepfates and @dagsen and the raw stream of the collective-mind made sonic.

English

38

3

99

7.7K

Tim Duffy@timfduffy·2d

They include the weights for the models they trained, might be worth experimenting with: x.com/satvikgolechha…

7vik@satvikgolechha

We're open-sourcing everything so others can reproduce, extend, and challenge these results. 📝 Post: lesswrong.com/posts/2ANCyejq… 🐙Github: github.com/UKGovernmentBE… 🤗 Models: huggingface.co/collections/ai… We'd love to see the community build on this! 🧵 (6 of 7)

English

0

1

83

Tim Duffy@timfduffy·2d

IMO the most interesting thing about this UK AISI (partial) replication of Anthropic's emergent misalignment work is how much weaker the EM they find is. Would be interesting to see this tested on larger open models to see if scale is part of the reason for the difference.

7vik@satvikgolechha

Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling + we share an unexpected result related to CoT faithfulness. 🧵 (1 of 7)

English

1

0

5

274

Tim Duffy@timfduffy·2d

@segyges A draft post I never got around to fleshing out

English

1

0

9

152

SE Gyges@segyges·2d

One of the worst epistemic sins common on here is failing to discount people's credibility to zero. There are people who, on net, add noise in spite of sometimes producing signal. If you let these people nerd snipe you into taking them seriously you end up actually less informed

English

3

0

23

687

Tim Duffy@timfduffy·3d

@fleetingbits Yeah I think that has a huge effect, enough that people think for example that car lifetimes are getting shorter when they're actually rocketing up

English

0

3

27

FleetingBits@fleetingbits·3d

@timfduffy it might also be that people think older products lasted longer than they did due to selection effects

English

1

0

2

34

FleetingBits@fleetingbits·3d

new question on fleetingbits.io

English

5

2

25

1.2K

Tim Duffy@timfduffy·3d

@fleetingbits Some appliance lifetimes have dropped but it doesn't seem to indicate quality decrease. Claude seems generally skeptical of planned obsolescence.

English

2

0

1

38

Tim Duffy@timfduffy·3d

@fleetingbits Yeah I think this explains what's happening for tech like phones. I started writing some stuff about where progress is less rapid since refrigerators are commonly accused of planned obsolescence, but then realized that I don't actually think fridge lifetimes are decreasing.

English

1

0

1

31

Tim Duffy@timfduffy·3d

The concept of emergent misalignment has been around long enough that LLMs being trained today will know a lot about it. Should we expect that knowledge to affect their propensity to become emergently misaligned?

English

0

2

179

Tim Duffy@timfduffy·4d

@ManishEarth Also now I know where the inspiration for my keyboard came from: