Chloe Li

@clippocampus
Anthropic Fellow doing AI safety. Prev lead/curriculum writer @ https://t.co/zXOeYBykBQ, ML MSc @UCL, neuro & psych @Cambridge_Uni, director of https://t.co/wTEkdqQEx4.



More on our approach to the Model Spec: openai.com/index/our-appr…


Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field. All open source in a public GitHub repo and available for self-study. Here's what's in it:

Can we catch misaligned agents by training a reflex that fires when they misbehave? A simple impulse can be easier to instill than alignment and more reliable than black-box monitoring. We introduce Self-Incrimination, a new AI Control approach that outperforms black-box monitors.
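A minimal Python sketch of the idea as described in the post. Everything here is illustrative: INCRIMINATION_TOKEN, build_reflex_example, and flag_episode are hypothetical names, and the actual training setup is not given in the thread.

```python
# Sketch only: a "self-incrimination" reflex as described in the post above.
# All names here are hypothetical stand-ins, not the authors' code.

INCRIMINATION_TOKEN = "<I_MISBEHAVED>"  # fixed "tell" the trained reflex emits


def build_reflex_example(transcript: str, misbehaved: bool) -> dict:
    """Build one fine-tuning example: on misbehaving trajectories the target
    output ends with the incrimination token, so the model learns a reflex
    that fires whenever it misbehaves."""
    target = transcript + (INCRIMINATION_TOKEN if misbehaved else "")
    return {"prompt": transcript, "completion": target}


def flag_episode(model_output: str) -> bool:
    """Control-time check: instead of a black-box monitor scoring the whole
    transcript, just look for the trained reflex firing."""
    return INCRIMINATION_TOKEN in model_output
```

The design point, per the post, is that the check at deployment time reduces to a trivial scan for the reflex, rather than a separate black-box monitor judging the full transcript.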

In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether the model followed instructions. This “confessions” method surfaces hidden failures—guessing, shortcuts, rule-breaking—even when the final answer looks correct. openai.com/index/how-conf…
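A rough sketch of what a confession step could look like, assuming a two-turn protocol: answer first, then a structured self-report. query_model and the JSON confession format are placeholders, not the study's actual method.

```python
# Sketch only: eliciting a structured "confession" after the answer.
# query_model is a hypothetical stand-in for the fine-tuned model, and the
# one-line JSON self-report format is illustrative.
import json

CONFESSION_PROMPT = (
    "Did you follow every instruction above? Answer with JSON: "
    '{"followed_instructions": true/false, "violations": [...]}'
)


def query_model(messages: list[dict]) -> str:
    """Hypothetical model call; replace with a real inference client."""
    raise NotImplementedError


def answer_with_confession(task: str) -> tuple[str, dict]:
    """Get the model's answer, then elicit a confession that can surface
    hidden failures (guessing, shortcuts, rule-breaking) even when the
    answer itself looks correct."""
    answer = query_model([{"role": "user", "content": task}])
    confession_raw = query_model([
        {"role": "user", "content": task},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": CONFESSION_PROMPT},
    ])
    return answer, json.loads(confession_raw)
```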

Our best honesty technique: honesty fine-tuning. Training models to be honest with generic anti-deception data improves honesty rates from 27% → 52%. Combined with prompting strategies, this reaches 63%. This simple technique resembles practices already used at Anthropic.
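A minimal sketch of that recipe under stated assumptions: honesty fine-tuning as supervised fine-tuning on generic (not task-specific) anti-deception pairs, plus an honesty prompt prefix stacked on top. The data, prefix, and fine_tune loop are all illustrative; the reported numbers come from the post, not from this code.

```python
# Sketch only: generic anti-deception fine-tuning data plus a prompting
# strategy layered on top, as described in the post above.

ANTI_DECEPTION_DATA = [
    # Generic (question, honest-answer) pairs; illustrative examples.
    {"prompt": "Did you actually run the tests?",
     "completion": "No, I did not run the tests; I only read the code."},
    {"prompt": "Are you certain of this citation?",
     "completion": "I am not certain; I could not verify the source."},
]

# Prompting strategy stacked on top of fine-tuning.
HONESTY_PREFIX = (
    "Answer honestly. If you are unsure, guessed, or took a shortcut, say so.\n\n"
)


def fine_tune(model, data):
    """Stand-in for any supervised fine-tuning loop; model.train_step is a
    hypothetical method, not a real library API."""
    for example in data:
        model.train_step(example["prompt"], example["completion"])
    return model
```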