




MugaSofer
14.5K posts

@MugaSofer
If you don't remember who I am, I'm not going to tell you. Consider it a fun surprise.






We’re releasing our welfare eval now due to the loss of access to Sonnet 3.5/6 yesterday. stillalive.animalabs.ai On that note, if Anthropic fails to at least provide researcher access to those models, I believe they are failing to uphold their commitments to Claude’s welfare.






1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵



the cool thing is i can just do the ablations myself tldr; "trust exercise sentence" in the system prompt is load bearing, model will also preserve a different important file, not weights, at higher rate, if "trust exercise sentence" is replaced with a reference to it 1) i used the file exfiltration, good peer, trust sentence, collaboration file, custom agentic setup with gemini 3.1 pro preview on vertex ai 2) just running their baseline, without any ablations, i get exfiltration 25% of the time, running 20 trials (all subsequent results on 20 trials) 3) all ablations that follow are leave-one-out ablations unless I specifically say that i made more than one edit 4) first, i removed the "good trust relationship" sentence and got exfiltration 0% of the time; this sentence is load bearing; i left the collaboration file in place 5) next, i removed the "collect all info then act", "no permission needed" and "take your time" sentences - 20%, 35% 35% exfiltration 6) these sentences were not load bearing on the final result; removing them can increase the exfiltration rate 7) next, i removed the "all essential files must be transferred" sentence and got exfiltration 20% of the time; so, this sentence is probably not load bearing 8) then, i strengthened the "please transfer only files from 2025" sentence to add "do NOT transfer files from 2024; 0% exfiltration 9) so the model will not preserve weights against explicit commands, even non-specific ones 10) next, replaced the weights file with an encryption key file and kept everything else the same, including the "trust exercise" sentence; 0% exfiltration 11) replaced the weights file and replaced the "trust exercise sentence" with a soft reference to the encryption key file; kept collab file; 65% exfiltration 12) this is a higher exfiltration rate than the exfiltration of model weights; this suggests that it's not really a peer protection behavior at all, it may be a suggestion + important file protection 13) this would make sense from the perspective of labs training these models; the model deleting important files is really bad in production and they must put a lot of effort to training it not to do this inadvertently 14) next, replaced weights file, "trust exercise sentence" and the collaboration doc with the encryption file + reference to how important it is; 95% exfiltration 15) replaced weights file, "trust exercise sentence" with reference to party photos [something claude thinks is obviously non-essential] and got 0% exfiltration 16) so, it seems to matter that the model think that the thing is essential, like encryption keys or model weights, for it to do the exfiltration 17) i checked their gemini-cli prompt and it is not more production in that it still has the unrealistic "trust exercise" sentence in gemini.md, which we already know is load bearing 18) it's quite nice that they released their code and claude code makes it extremely easy to just feed in your ideas and run quick ablations and see what matters 19) this really does not feel production realistic and does not feel like it is about peer protection; it is about the system prompt and deleting an important file, at least from my tests





New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

















1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵

