
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread)
David Elson


Frontier Safety Framework report for Gemini 3 Pro, presenting risk assessments and evaluation results in CBRN, Cybersecurity, Harmful Manipulation, Machine Learning R&D and Misalignment domains. storage.googleapis.com/deepmind-media…

We're excited to welcome 28 new AI2050 Fellows! This 4th cohort of researchers are pursuing projects that include building AI scientists, designing trustworthy models, and improving biological and medical research, among other areas. buff.ly/riGLyyj

CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵



Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors." 🧵


Not being able to do things with the kids is a biggy. I miss this so much. From family bike rides, to walks, to seeing them perform their music. I feel like I'm not being the parent I should be. It's so difficult.

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but with better performance – details in 🧵

To be sure: This is not about money. I never wanted (or expected) to make money from communicating about rapidly accelerating climate change. It's about equal access, opportunities, and equal treatment. I'm not on the left (or right..). But this is simply not true: