Johannes Gasteiger, né Klicpera

519 posts

@gasteigerjo

🔸 Safe & beneficial AI. Working on Alignment Science at @AnthropicAI. Favorite papers at https://t.co/uVhOiIkNJY. Opinions my own.

London, United Kingdom · Joined February 2009

351 Following · 2.8K Followers

Pinned Tweet

Johannes Gasteiger, né Klicpera @gasteigerjo · 16 Dec

Can we mitigate alignment faking in RL training? In a new Anthropic blog post, we test three interventions: interrogation, scratchpad length penalties, and scratchpad monitors. They all can be effective, but interrogation can backfire if models lie.

1.3K

Johannes Gasteiger, né Klicpera retweeted

Anthropic @AnthropicAI · 28 Feb

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

2.9K

6.7K

42.8K

17.6M

Johannes Gasteiger, né Klicpera @gasteigerjo · 3 Feb

My AI Safety Paper Highlights of January 2026:
- *production-ready probes*
- extracting harmful capabilities
- token-level data filtering
- alignment pretraining
- catching saboteurs in auditing
- the Assistant Axis
More at open.substack.com/pub/aisafetyfr…

513

Johannes Gasteiger, né Klicpera @gasteigerjo · 14 Jan

My AI Safety Paper Highlights of December 2025:
- *Auditing games for sandbagging*
- Stress-testing async control
- Evading probes
- Mitigating alignment faking
- Recontextualization training
- Selective gradient masking
- AI-automated cyberattack
More at open.substack.com/pub/aisafetyfr…

198

Johannes Gasteiger, né Klicpera @gasteigerjo · 16 Dec

Despite the limitations, we remain excited about this direction, and hope our encouraging early results serve as trailheads for further work on alignment faking.
PS: The Alignment team is looking for FTEs & Fellows:
FTEs: job-boards.greenhouse.io/anthropic/jobs…
Fellows: job-boards.greenhouse.io/anthropic/jobs…

175

Johannes Gasteiger, né Klicpera @gasteigerjo · 16 Dec

Read the full blog post here: alignment.anthropic.com/2025/alignment… with Vlad Mikulik, @HoagyCunningham, Monte MacDiarmid, @JoeJBenton, @RightBenguin, Jonathan Uesato, @FabienDRoger, @EvanHub

209

Johannes Gasteiger, né Klicpera @gasteigerjo · 3 Dec

My AI Safety Paper Highlights of November 2025:
- *Natural emergent misalignment*
- Honesty interventions, lie detection
- Self-report finetuning
- CoT obfuscation from output monitors
- Consistency training for robustness
- Weight-space steering
More at open.substack.com/pub/aisafetyfr…

233

Johannes Gasteiger, né Klicpera @gasteigerjo · 5 Nov

My AI Safety Paper Highlights of October 2025:
- *testing implanted facts*
- extracting secret knowledge
- models can't yet obfuscate thoughts
- inoculation prompting
- pretraining poisoning
- evaluation awareness steering
- Petri: auto-auditing
More at open.substack.com/pub/aisafetyfr…

219
