Johannes Gasteiger, né Klicpera

519 posts


@gasteigerjo

🔸 Safe & beneficial AI. Working on Alignment Science at @AnthropicAI. Favorite papers at https://t.co/uVhOiIkNJY. Opinions my own.

London, United Kingdom · Joined February 2009
351 Following · 2.8K Followers
Pinned Tweet
Johannes Gasteiger, né Klicpera
Can we mitigate alignment faking in RL training? In a new Anthropic blog post, we test three interventions: interrogation, scratchpad length penalties, and scratchpad monitors. They all can be effective, but interrogation can backfire if models lie.
Johannes Gasteiger, né Klicpera
My AI Safety Paper Highlights of December 2025:
- *Auditing games for sandbagging*
- Stress-testing async control
- Evading probes
- Mitigating alignment faking
- Recontextualization training
- Selective gradient masking
- AI-automated cyberattack

More at open.substack.com/pub/aisafetyfr…
Johannes Gasteiger, né Klicpera
My AI Safety Paper Highlights of November 2025:
- *Natural emergent misalignment*
- Honesty interventions, lie detection
- Self-report finetuning
- CoT obfuscation from output monitors
- Consistency training for robustness
- Weight-space steering

More at open.substack.com/pub/aisafetyfr…
Johannes Gasteiger, né Klicpera
My AI Safety Paper Highlights of October 2025:
- *testing implanted facts*
- extracting secret knowledge
- models can't yet obfuscate thoughts
- inoculation prompting
- pretraining poisoning
- evaluation awareness steering
- Petri: auto-auditing

More at open.substack.com/pub/aisafetyfr…