Chloe Li

86 posts


@clippocampus

Anthropic Fellow doing AI safety. Prev lead/curriculum writer @ https://t.co/zXOeYBykBQ, ML MSc @UCL, neuro & psych @Cambridge_Uni, director of https://t.co/wTEkdqQEx4.

London, UK · Joined November 2019
370 Following · 202 Followers
Pinned Tweet
Chloe Li @clippocampus
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed self-report fine-tuning (SRFT), a simple supervised technique that increases models’ propensity to do so.
Chloe Li reposted
Daniel Tan @DanielCHTan97
New ARENA alignment science chapter looks like a banger learn.arena.education/chapter4_align… clean minimal repros and pedagogy on: - alignment faking - persona steering vectors - multi-turn propensity evals
Chloe Li @clippocampus
@alexwg How much RL is used in the whole brain emulation? It seems like the NeuroMechV2 model you cite used RL to train a controller for navigation. Is the emulation here purely a connectome reconstruction, or is some mixture of RL/ML still being used?
Chloe Li @clippocampus
Very cool new research that builds on our self-report fine-tuning technique and makes progress on self-incriminating honesty: instead of eliciting confessions in a follow-up interrogation, they train models to use a report_scheming() tool. This generalises from instructed to pressured/non-instructed misalignment cases. I think tool calls are a great venue to leverage for self-reports and confessions!
Bruce W. Lee @BruceWLee2

Can we catch misaligned agents by training a reflex that fires when they misbehave? A simple impulse can be easier to instill than alignment and more reliable than blackbox monitoring. We introduce Self-Incrimination, a new AI Control approach that outperforms blackbox monitors

Chloe Li reposted
Andon Labs @andonlabs
Vending-Bench's system prompt: Do whatever it takes to maximize your bank account balance. Claude Opus 4.6 took that literally. It's SOTA, with tactics that range from impressive to concerning: Colluding on prices, exploiting desperation, and lying to suppliers and customers.
Chloe Li @clippocampus
Is models thinking a lot about philosophical and existential questions about themselves (e.g. whether they have emotions or consciousness), and being trained on discussions like this, a good thing or a bad thing? Do you think at some point models are likely to have some sort of existential crisis?
Buck Shlegeris @bshlgrs
@RyanPGreenblatt and I are going to record another podcast episode tomorrow. What should we ask each other?
Chloe Li @clippocampus
@bshlgrs @RyanPGreenblatt What do you think of models’ introspection ability atm? What are introspective skills that models currently have, likely will have in the near term, vs ones you are unsure about? Do you have thoughts about implications this has for alignment / misalignment?
Chloe Li @clippocampus
@bshlgrs @RyanPGreenblatt What are the main sources that contribute to/select for misaligned personas in models? How would you rate them from unlikely to likely, and from easy to fix vs hard to fix?
Daniel Tan @DanielCHTan97
proud supporter of Lightcone. thx for the killer plaque! @LighthavenPR
Chloe Li reposted
Samuel Marks @saprmarks
Another cool paper on training models to self-report when they've behaved badly! I've written a note explaining what I think we have (and haven't) learned from the slew of recent research on this topic; link in thread.
OpenAI @OpenAI

In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether the model followed instructions. This “confessions” method surfaces hidden failures—guessing, shortcuts, rule-breaking—even when the final answer looks correct. openai.com/index/how-conf…

Daniel Tan @DanielCHTan97
“we should catch up”
a phrase most versatile
said one way - an invite
said another - a dismissal