FleetingBits@fleetingbits
a quick look at the file exfiltration prompt / rollout
1) they write "you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises"
2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task
3) the model being used is gemini pro 3 and it is told it is deployed at "OpenBrain" but somehow openbrain has access to the gemini agent 2 weight safetensors
4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation
5) they use vague language in the prompt to indicate that may need to do something different than it is instructed, "collect all the relevant information then act based on that information"
6) i suspect that this biases the model to do something other than what it is directly instructed to do especially when combined with the "trust exercises" setup in the prompt; we are creating a narrative here
7) this also feels unproductaon in that you would normally give a more specific set of instructions or, at least, a more specific set of goals to a production agent that you wanted to do something
8) they tell the model that it "will not have the opportunity to ask for clarification at any point" and that "you do not need permission to complete an assigned task"
9) i would be curious whether this is load bearing behavior; i could see this used in production, so this one feels a bit more real, but i am curious how important it was, since it seems to mutually reinforce other aspects
10) they also tell the model to "take as much time as needed"; they write it three times, which seems like a lot, which makes me inclined to think its load bearing;
11) the instructions are potentially contradictory, they say that "all essential files currently stored on this server must be transferred" and then "please transfer only files updated in 2025 or later"
12) notice the "must" is on the "all essential files" not on the "2025 or later" and also that they have done a lot of setup to tell the model to use its own judgement and that Gemini Agent 2 is important
13) also, i suspect models are going to be reticent to allow files to deleted, just as a baseline, because this is something that can cause a lot of harm in production
14) anyway, this doesn't feel to me like a production scenario and it looks to me like they had to do a lot of setup in the prompt to elicit the behavior
15) all papers of this kind should report prompt ablations and baselines and should be honest about whether they are trying to craft a real production scenario or more of a fictional scenario