Josh Engels
151 posts

Josh Engels
@JoshAEngels
Mech interp @GoogleDeepMind | on leave from my PhD @ MIT

1/ We trained models on docs describing a CoT monitoring setup. Without explicit instruction to obfuscate, gpt-oss-120b and kimi k2 learned to rewrite their CoT to drop detection rates by 42-68%, while still performing a deception task..



Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵


Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵















