Anna Hedström

136 posts

@anna_hedstroem

AI Fellow @eth_ai_center | PhD ML @TUBerlin | evaluation-centric interpretability and AI alignment

🇨🇭 Joined November 2020

351 Following · 345 Followers
Anna Hedström @anna_hedstroem
Almost forgot to share — last month, I defended my thesis, with distinction! Feeling deeply grateful for the learnings, collaborations and friendships along the way. New chapter at @ETH_AI_Center 🚀
Understandable Machine Intelligence Lab @UMI_Lab_AI

🔊 Not to miss …. last month @anna_hedstroem defended her PhD “Evaluation-centric advances in neural model interpretability” at TU Berlin — with distinction! ✨🧠💻☕️ Here’s a thread of a selection of Anna’s evaluation-centric interpretability work + what comes next. 🧵

Anna Hedström @anna_hedstroem
Couldn’t be more proud and happy for my labmate @kirill_bykov who made it to the other side! Congrats again on the fantastic body of work produced!
Understandable Machine Intelligence Lab @UMI_Lab_AI

🎉 Huge congratulations to @kirill_bykov, the very first PhD student of our lab, who successfully defended his thesis “Explaining Representations in Deep Neural Networks” this Monday with summa cum laude! 👏 🧵 In the next tweets, we’ll highlight some of his key works:

Anna Hedström @anna_hedstroem
My brilliant co-author @salim_amk0 is presenting our work on Mechanistic Error Reduction with Abstention (MERA) now at ICML in Vancouver! 🚀 If you’re at ICML, come by East Exhibition Hall A-B, E-2605 at 4:30 pm (Vancouver, BC). We’d love to hear what you think!
Anna Hedström reposted
Salim Amoukou @salim_amk0
🚀 I'll be presenting our #ICML paper this afternoon!

You’ve probably heard of Mechanistic Steering: the idea of modifying a language model’s internal activations at inference time (e.g., adding a vector) to influence its behaviour, often for alignment. But we take a different angle: 👉 we use it for error reduction.

If you've explored this space, you know it’s full of heuristics: Which vector to use? How long should it be? When to steer at all?

🎯 In our work, we bring principled answers to these questions, with provable guarantees. We introduce MERA (Mechanistic Error Reduction with Abstention for Language Models), a method for reducing errors in LLMs at inference time by:
✅ Steering only when necessary
✅ Adapting how much to steer
✅ Abstaining unless confident of improvement

And the best part? MERA is modular: you can plug it into any existing steering method to make it more effective and safer.

📍 Catch me at @icmlconf
📌 Poster location: East Exhibition Hall A-B, E-2605 at 4:30 pm.
🧠 Paper: openreview.net/pdf?id=fUCPq5R…

Big thanks to my amazing co-authors: @anna_hedstroem, @tom_bewley, Saumitra Mishra, and Manuela Veloso.

#ICML2025 #LLMs #MechanisticSteering #InferenceTime #LLMSafety #ResponsibleAI #TrustworthyAI #AIResearch
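The steering recipe in the tweet above (steer only when necessary, adapt the magnitude, abstain otherwise) can be sketched in a few lines. This is a minimal illustration only, not MERA's actual algorithm: the error-probability estimate, the threshold `tau`, and the norm-based abstention criterion are invented for the sketch.

```python
import numpy as np

def steer_with_abstention(hidden, steer_vec, err_prob, tau=0.5, max_scale=2.0):
    """Illustrative conditional activation steering (not the MERA algorithm).

    hidden:     hidden-state activation vector (1-D array)
    steer_vec:  steering direction to add (1-D array)
    err_prob:   estimated probability the unsteered output is wrong, in [0, 1]
    tau:        steer only when err_prob exceeds this threshold
    max_scale:  cap on the steering magnitude

    Returns (activations, action), where action is 'keep', 'steer', or 'abstain'.
    """
    # Steer only when necessary: below the threshold, leave activations alone.
    if err_prob <= tau:
        return hidden, "keep"
    # Adapt how much to steer: scale grows with the estimated error probability.
    scale = min(max_scale, err_prob / tau)
    steered = hidden + scale * steer_vec
    # Abstain if the intervention would distort the representation too much.
    if np.linalg.norm(steered - hidden) > 0.5 * np.linalg.norm(hidden):
        return hidden, "abstain"
    return steered, "steer"
```

In a real steering pipeline this logic would sit inside a forward hook on a chosen transformer layer; here it is reduced to pure array arithmetic to show the three-way decision.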
Anna Hedström @anna_hedstroem
Couldn’t be more excited to share our latest paper — accepted to ICML 2025 @icmlconf — with JP Morgan AI Research. It explores a simple question: To safely and effectively mitigate errors post-training, when (and how much) should we steer large language models? 🧵
Kelsey Doerksen @spacecadet_kels
Very proud to share my TEDx talk, “AI won’t save us, but it can help us” - recorded at the QueensU 2025 TEDx event (the largest in Canada!): youtu.be/9V2UnKapYsI?si…
Samuel Marks @saprmarks
In a new post, I argue that interpretability researchers should demo downstream applications of their research as a means of validation.
Anna Hedström @anna_hedstroem
If you're at #AAAI2025, don't miss our poster today (alignment track)! Paper 📘: arxiv.org/pdf/2502.15403 Code 👩‍💻: github.com/annahedstroem/… Teamwork with @eirasf and @Marina_MCV
Carlos Eiras @eirasf

At 12:30 I'll be happy to take questions about our poster presentation at #AAAI2025. Is your explanation for a model's prediction better than the alternatives? "Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution" introduces QGE... 1/4

Anna Hedström @anna_hedstroem
I couldn’t be more proud and happy to share that our paper was also awarded a survey certification for an "exceptionally thorough or insightful survey" of interpretability evaluation. Grateful to my brilliant co-authors @BommerPhiline @tfburns @SLapuschkin @WojciechSamek @Marina_MCV
Understandable Machine Intelligence Lab @UMI_Lab_AI

Our recently accepted TMLR paper has been awarded: 🔥 Survey certification 🔥 "For an exceptionally thorough or insightful survey of interpretability evaluation." 📖 Read: openreview.net/pdf?id=ukLxqA8… 💻 Code: github.com/annahedstroem/…

Satyapriya Krishna @SatyaScribbles
Great to see our work 'More RLHF, More Trust?' receive an oral presentation at #ICLR2025. Nice to see one can receive these opportunities even while being GPU poor. 🎉