Anna Soligo
@anna_soligo
30 posts
Anthropic Safety Fellow // MATS 8.0 Scholar with Neel Nanda // Sometimes found on big hills ⛰️
Joined March 2024
166 Following · 513 Followers
Anna Soligo retweeted
Anthropic @AnthropicAI
New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.
Anna Soligo retweeted
Max Kaufmann @Max_A_Kaufmann
Is training against the CoT always bad? RL training can lead to obfuscated CoT, making it difficult to 'read an LLM's thoughts'. How can we predict when obfuscation occurs? 🤔 Our new @GoogleDeepMind paper introduces a framework to predict this before training starts!
antra @tessera_antra
Good empirical research. I am grateful for a very sane stance on emotion suppression. The first two thirds working better is unexpected. I thought that suppression would happen in the last layers; something else seems to be happening. It would be very interesting to analyze the diff between before and after DPO; something might pop up via mech interp. Can we get access to post-DPO weights?
Anna Soligo @anna_soligo

It's also unclear what "emotional profile" we should want models to have. We discuss this more in the post and paper: lesswrong.com/posts/kjnQj6Yu…

Anna Soligo @anna_soligo
@emilaryd I left it for you - lots more research needed to make Gemma happy 🤕
Anna Soligo @anna_soligo
Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...
Anna Soligo retweeted
Atticus Wang @atticuswzf
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
Anna Soligo retweeted
Anthropic @AnthropicAI
We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.
Anna Soligo retweeted
Neel Nanda @NeelNanda5
Looking forward to seeing many of you at the NeurIPS mechanistic interpretability workshop tomorrow, room 30A-E! The room opens at 8 for socialising, opening remarks are at 9:15, and our first talk is at 9:30: 15 Years of Interp Research in 15 Mins from Been Kim
Anna Soligo retweeted
Been Kim @_beenkim
Tomorrow 9:30am #NeurIPS2025, Room 30A-E, I'll talk about "📈Towards Pareto frontier of interpretability: 15 years of interpretability research in 15 mins" 🚅 at the mech interp workshop: mechinterpworkshop.com
Anna Soligo retweeted
Chloe Li @clippocampus
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed self-report fine-tuning (SRFT), a simple supervised technique that increases models’ propensity to do so.
Anna Soligo retweeted
Tim Hua 🇺🇦 @Tim_Hua_
Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.
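The tweet above describes suppressing a concept by subtracting its activation-space direction. A minimal toy sketch of that idea, not the paper's implementation: it assumes the concept is a single linear direction estimated as a mean difference between two prompt sets, and the names (`concept_direction`, `subtract_concept`, `alpha`) are illustrative, not from the paper.

```python
import numpy as np

def concept_direction(acts_with, acts_without):
    """Mean-difference estimate of a concept direction (unit norm).

    Assumption: the concept ("I'm being tested") is a single linear
    direction in activation space.
    """
    d = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return d / np.linalg.norm(d)

def subtract_concept(hidden, direction, alpha=1.0):
    """Remove alpha times the projection onto `direction` from each row."""
    proj = hidden @ direction  # scalar projection per activation vector
    return hidden - alpha * np.outer(proj, direction)

# Synthetic activations: a planted "being tested" direction on top of noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
d_true = np.zeros(16)
d_true[0] = 1.0
acts_test = base + 3.0 * d_true   # activations on "test-like" prompts
acts_plain = base                 # activations on ordinary prompts

d_hat = concept_direction(acts_test, acts_plain)
steered = subtract_concept(acts_test, d_hat)
# The steered activations carry (numerically) zero component along d_hat.
print(float(np.abs(steered @ d_hat).max()))
```

In a real model, the subtraction would be applied inside a forward hook at a chosen layer during generation, rather than to a static activation matrix as in this toy.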
Anna Soligo @anna_soligo
Cool to see coverage of Emergent Misalignment, and misalignment risks, in increasingly mainstream news!
Anna Soligo retweeted
Ryan Kidd @ryan_kidd44
MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.