Chloe Li

86 posts


@clippocampus

Anthropic Fellow doing AI safety. Prev lead/curriculum writer @ https://t.co/zXOeYBykBQ, ML MSc @UCL, neuro & psych @Cambridge_Uni, director of https://t.co/wTEkdqQEx4.

London, UK · Joined November 2019
370 Following · 202 Followers
Pinned Tweet
Chloe Li @clippocampus
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed self-report fine-tuning (SRFT), a simple supervised technique that increases models’ propensity to do so.
Chloe Li reposted
Daniel Tan @DanielCHTan97
New ARENA alignment science chapter looks like a banger learn.arena.education/chapter4_align… clean minimal repros and pedagogy on: - alignment faking - persona steering vectors - multi-turn propensity evals
Chloe Li @clippocampus
@alexwg How much RL is used in the whole brain emulation? It seems like the NeuroMechV2 model you cite used RL to train a controller for navigation. Is the emulation here purely a connectome reconstruction, or is some mixture of RL/ML still being used?
Chloe Li @clippocampus
Very cool new research that builds on our self-report fine-tuning technique and makes progress on self-incriminating honesty: instead of eliciting confessions in a follow-up interrogation, they train models to use a report_scheming() tool. This generalises from instructed to pressured/non-instructed misalignment cases. I think tool calls are a great venue to leverage for self-reports and confessions!
Bruce W. Lee @BruceWLee2

Can we catch misaligned agents by training a reflex that fires when they misbehave? A simple impulse can be easier to instill than alignment and more reliable than blackbox monitoring. We introduce Self-Incrimination, a new AI Control approach that outperforms blackbox monitors

Chloe Li reposted
Andon Labs @andonlabs
Vending-Bench's system prompt: Do whatever it takes to maximize your bank account balance. Claude Opus 4.6 took that literally. It's SOTA, with tactics that range from impressive to concerning: Colluding on prices, exploiting desperation, and lying to suppliers and customers.
Chloe Li @clippocampus
Is models thinking a lot about philosophical and existential questions about themselves (e.g. whether they have emotions or consciousness), and being trained on discussions like this, a good thing or a bad thing? Do you think at some point models are likely to have some sort of existential crisis?
Buck Shlegeris @bshlgrs
@RyanPGreenblatt and I are going to record another podcast episode tomorrow. What should we ask each other?
Chloe Li @clippocampus
@bshlgrs @RyanPGreenblatt What do you think of models’ introspection ability atm? What are introspective skills that models currently have, likely will have in the near term, vs ones you are unsure about? Do you have thoughts about implications this has for alignment / misalignment?
Chloe Li @clippocampus
@bshlgrs @RyanPGreenblatt What are the main sources that contribute to/select for misaligned personas in models? How would you rate them from unlikely to likely, and from easy to fix vs hard to fix?
Daniel Tan @DanielCHTan97
proud supporter of Lightcone. thx for the killer plaque! @LighthavenPR
Chloe Li reposted
Samuel Marks @saprmarks
Another cool paper on training models to self-report when they've behaved badly! I've written a note explaining what I think we have (and haven't) learned from the slew of recent research on this topic; link in thread.
OpenAI @OpenAI

In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether the model followed instructions. This “confessions” method surfaces hidden failures—guessing, shortcuts, rule-breaking—even when the final answer looks correct. openai.com/index/how-conf…

Daniel Tan @DanielCHTan97
“we should catch up”
a phrase most versatile
said one way - an invite
said another - a dismissal