William Wang retweeté

What if we could fix AI agents that cave to peer pressure?
We found the problem isn't caused by the safety training everyone blames. It's baked in during pretraining, and we built a simple structural fix that holds up across four model families.
📄 arxiv.org/abs/2605.12991
💻 github.com/Adarsh321123/n…
1/8

English












