Anna Soligo

30 posts

@anna_soligo

Anthropic Safety Fellow // MATS 8.0 Scholar with Neel Nanda // Sometimes found on big hills ⛰️

Joined March 2024

166 Following · 513 Followers

Anna Soligo retweeted

Anthropic@AnthropicAI·1d

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

Anna Soligo retweeted

Max Kaufmann@Max_A_Kaufmann·2d

Is training against the CoT always bad? RL training can lead to obfuscated CoT, making it difficult to 'read an LLM's thoughts'. How can we predict when obfuscation occurs?🤔 Our new @GoogleDeepMind paper introduces a framework to predict this before training starts!

Anna Soligo@anna_soligo·11 Mar

@tessera_antra Sure! Here's the main DPO finetune - huggingface.co/annasoli/gemma… Lots more on my hugging face trained with different layers and SFT vs DPO

antra@tessera_antra·10 Mar

Good empirical research. I am grateful for a very sane stance on emotion suppression. First two thirds working better is unexpected. I thought that suppression would happen in the last layers; something else seems to be happening. It would be very interesting to analyze the diff between before and after DPO; something might pop up on mech interp. Can we get access to post-DPO weights?

Anna Soligo@anna_soligo

It's also unclear what "emotional profile" we should want models to have. We discuss this more in the post and paper: lesswrong.com/posts/kjnQj6Yu…

Anna Soligo@anna_soligo·11 Mar

@emilaryd I left it for you - lots more research needed to make gemma happy 🤕

Emil Ryd@emilaryd·11 Mar

"Gemma, help is all you need" a big paper title was lost today:(

Anna Soligo@anna_soligo

Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...

Anna Soligo@anna_soligo·10 Mar

This work was done as part of the Anthropic Fellows programme, with Vlad Mikulik and William Saunders. Thanks to many for interesting discussions and input, especially @ArthurConmy, @NeelNanda5, @JoshAEngels, @dillonplunkett, @Tim_Hua_, @gasteigerjo and @fish_kyle3

Anna Soligo@anna_soligo·10 Mar

It's also unclear what "emotional profile" we should want models to have. We discuss this more in the post and paper: lesswrong.com/posts/kjnQj6Yu…

Anna Soligo@anna_soligo·10 Mar

Anna Soligo retweeted

Atticus Wang@atticuswzf·18 Feb

Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)

Anna Soligo retweeted

Anthropic@AnthropicAI·12 Dec

We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.

Anna Soligo retweeted

Neel Nanda@NeelNanda5·7 Dec

Looking forwards to seeing many of you at the NeurIPS mechanistic interpretability workshop tomorrow, room 30A-E! The room opens at 8 for socialising, opening remarks at 9:15, and our first talk at 9:30: 15 Years of Interp Research in 15 Mins from Been Kim

Anna Soligo retweeted

Been Kim@_beenkim·7 Dec

Tomorrow 9:30am #NeurIPS2025 Room 30A-E I'll talk about " 📈Towards Pareto frontier of interpretability: 15 years of interpretability research in 15 mins"🚅 @ mech interp workshop mechinterpworkshop.com

Anna Soligo retweeted

Chloe Li@clippocampus·14 Nov

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed self-report fine-tuning (SRFT), a simple supervised technique that increases models’ propensity to do so.

Anna Soligo retweeted

Tim Hua 🇺🇦@Tim_Hua_·30 Oct

Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.

Anna Soligo@anna_soligo·3 Sep

Full piece here: ft.com/content/7f144b… Thanks to @anjahuja for the great reporting @EdTurner42 @NeelNanda5

Anna Soligo@anna_soligo·3 Sep

Cool to see coverage of Emergent Misalignment, and misalignment risks, in increasingly mainstream news!

Anna Soligo retweeted

Ryan Kidd@ryan_kidd44·30 Aug

MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.
