Alex Irpan

341 posts

@AlexIrpan

Research Scientist @ Google DeepMind. Formerly Robotics, now AI Safety. Has a blog. Views are my own. "Adversarially disengaging Twitter profile"

Joined December 2012

31 Following, 2.8K Followers

Pinned Tweet

Alex Irpan@AlexIrpan·20 Nov

I'm on Bluesky now. I plan to cross-post blog posts to both platforms for the time being, we'll see about the other stuff. bsky.app/profile/alexir…

Alex Irpan@AlexIrpan·12 Mar

There is now another amicus brief filed by a number of former high ranking military officials (up to Admiral level), arguing these actions hurt the military's adherence to the rule of law. storage.courtlistener.com/recap/gov.usco…

124

Alex Irpan@AlexIrpan·12 Mar

alexirpan.com/2026/03/11/ant…

506

Alex Irpan@AlexIrpan·25 Feb

You know, when I switched into safety, I was a little worried it was too early. Between the decline of coding by hand, OpenClaw YOLOing, increasingly eval aware models, and DoD pressure to let AI be used for surveillance and autonomous weapons
yeah
It wasn't early

948

Alex Irpan@AlexIrpan·29 Jan

Here's my MIT Mystery Hunt post for the year alexirpan.com/2026/01/29/mh-…

526

Alex Irpan@AlexIrpan·16 Nov

I didn't know where this post was going when I started and I'm not sure where it went now that it ended, but that felt correct in some way. alexirpan.com/2025/11/16/aut…

482

Alex Irpan@AlexIrpan·6 Nov

@jameschua_sg @Turn_Trout @red_bayes @davidelson @rohinmshah In this we didn't look at any CoT scenarios. In general, it's tricky...personally I think SFT style methods are okay for CoT if you've checked your responses are consistent with your CoT beforehand, based on the OpenAI deliberative alignment work.

James Chua@jameschua_sg·5 Nov

@Turn_Trout @AlexIrpan @red_bayes @davidelson @rohinmshah Well, here I think your focus is training on non-cot prompts, but I wonder what you think about cot scenarios.

Alex Turner@Turn_Trout·4 Nov

New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread)

363

68.9K

Alex Irpan@AlexIrpan·4 Nov

@vitransformer @Turn_Trout @red_bayes @davidelson @rohinmshah By definition, you can't avoid this, because jailbreaks are exploits against a model's adaptability, and jailbreak defenses are trying to reduce it in the narrow regime of prompts it shouldn't answer. As for how well it stays within the narrow regime, so far similar to baseline

Vision Transformers@vitransformer·4 Nov

@Turn_Trout @AlexIrpan @red_bayes @davidelson @rohinmshah interesting work, activation steering is super cool, but do you think it makes models less adaptable by pushing activations to stay the same across prompts?

467

Alex Irpan@AlexIrpan·4 Nov

First paper since switching into AI safety team🎉 We look at problems that could be solved if the model behaved consistently over a set of prompts, and tried training that in output space and internal activations. Both were effective. See thread or paper for details.

Alex Turner@Turn_Trout

New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread)

7.7K

Alex Irpan@AlexIrpan·22 Oct

> switch to AI safety
> no safety papers to cite in reviewer profile
> only get assigned robotics papers
Apologies in advance as I try to crash course the past year in a few weeks...

811

Alex Irpan@AlexIrpan·18 Aug

Today is my 10 year blogging anniversary alexirpan.com/2025/08/18/ten…

917

Alex Irpan@AlexIrpan·21 Jul

For the past month I have been working on a blog post about niche MLP fandom drama. Well here it is. alexirpan.com/2025/07/21/bab…

525

Alex Irpan retweeted

Mikita Balesni 🇺🇦@balesni·15 Jul

A simple AGI safety technique: AI’s thoughts are in plain English, just read them
We know it works, with OK (not perfect) transparency!
The risk is fragility: RL training, new architectures, etc threaten transparency
Experts from many orgs agree we should try to preserve it: 🧵

108

458

235.3K

Alex Irpan@AlexIrpan·1 Jul

AI numbers guide
ElevenLabs: AI voice generation startup
TwelveLabs: AI video understanding startup
ThirteenAI: parked domain for AI agency startup
14ai: AI agent startup
15.ai: non-commercial My Little Pony voice generation
One is more based than the rest.

732

Alex Irpan@AlexIrpan·5 Jun

"I don't play gacha games because they're a scam" vs "Let me do one more hyperparam sweep before giving up. One more prompt tuning run. I swear we'll beat baseline. I know it's gonna beat the baseline this time. It's gonna win. This time for sure."

1.1K

Alex Irpan@AlexIrpan·1 Apr

alexirpan.com/2025/04/01/who…

Alex Irpan@AlexIrpan·27 Mar

I guess Twitter's doing anime today

490

Alex Irpan retweeted

Pierre Sermanet@psermanet·13 Mar

Q: How can we ensure robots behave properly at scale?
A: Robot constitutions 📜!
Q: How do we verify behavior in undesirable situations at scale?
A: Generation!
We release the ASIMOV Benchmark for Semantic Safety of robots at asimov-benchmark.github.io @GoogleDeepMind

8.8K

Alex Irpan retweeted

Rohin Shah@rohinmshah·18 Feb

We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.

295

46.3K

Alex Irpan@AlexIrpan·28 Jan

My MIT Mystery Hunt post for the year alexirpan.com/2025/01/28/mh-…

478

Alex Irpan@AlexIrpan·21 Jan

I am now back from #MITMysteryHunt with no memory of anything besides Hunt from MLK weekend. Really this is probably for the best.

781
