Anna Hedström

136 posts

@anna_hedstroem

AI Fellow @eth_ai_center | PhD ML @TUBerlin | evaluation-centric interpretability and AI alignment

🇨🇭 Joined November 2020

351 Following · 345 Followers
Anna Hedström @anna_hedstroem
Almost forgot to share — last month, I defended my thesis, with distinction! Feeling deeply grateful for the learnings, collaborations and friendships along the way. New chapter at @ETH_AI_Center 🚀
Understandable Machine Intelligence Lab @UMI_Lab_AI

🔊 Not to miss …. last month @anna_hedstroem defended her PhD “Evaluation-centric advances in neural model interpretability” at TU Berlin — with distinction! ✨🧠💻☕️ Here’s a thread of a selection of Anna’s evaluation-centric interpretability work + what comes next. 🧵

Anna Hedström @anna_hedstroem
Couldn’t be more proud and happy for my labmate @kirill_bykov who made it to the other side! Congrats again on the fantastic body of work produced!
Understandable Machine Intelligence Lab @UMI_Lab_AI

🎉 Huge congratulations to @kirill_bykov, the very first PhD student of our lab, who successfully defended his thesis “Explaining Representations in Deep Neural Networks” this Monday with summa cum laude! 👏 🧵 In the next tweets, we’ll highlight some of his key works:

Anna Hedström @anna_hedstroem
My brilliant co-author @salim_amk0 is presenting our work on Mechanistic Error Reduction with Abstention (MERA) now at ICML in Vancouver! 🚀 If you’re at ICML, come by East Exhibition Hall A-B, E-2605 at 4:30 pm (Vancouver, BC). We’d love to hear what you think!
Anna Hedström reposted
Salim Amoukou @salim_amk0
🚀 I'll be presenting our #ICML paper this afternoon!

You’ve probably heard of Mechanistic Steering: the idea of modifying a language model’s internal activations at inference time (e.g., adding a vector) to influence its behaviour, often for alignment. But we take a different angle: 👉 we use it for error reduction.

If you've explored this space, you know it’s full of heuristics: Which vector to use? How long should it be? When to steer at all?

🎯 In our work, we bring principled answers to these questions, with provable guarantees. We introduce MERA (Mechanistic Error Reduction with Abstention for Language Models), a method for reducing errors in LLMs at inference time by:
✅ Steering only when necessary
✅ Adapting how much to steer
✅ Abstaining unless confident of improvement

And the best part? MERA is modular: you can plug it into any existing steering method to make it more effective and safer.

📍 Catch me at @icmlconf
📌 Poster location: East Exhibition Hall A-B, E-2605 at 4:30 pm.
🧠 Paper: openreview.net/pdf?id=fUCPq5R…

Big thanks to my amazing co-authors: @anna_hedstroem, @tom_bewley, Saumitra Mishra, and Manuela Veloso.

#ICML2025 #LLMs #MechanisticSteering #InferenceTime #LLMSafety #ResponsibleAI #TrustworthyAI #AIResearch
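The steering recipe in the tweet above (steer only when necessary, adapt the magnitude, abstain otherwise) can be sketched in a few lines. This is a minimal illustration only, not MERA's actual algorithm: the error-probability estimate, the threshold `tau`, and the norm-based abstention criterion are invented for the sketch.

```python
import numpy as np

def steer_with_abstention(hidden, steer_vec, err_prob, tau=0.5, max_scale=2.0):
    """Illustrative conditional activation steering (not the MERA algorithm).

    hidden:     hidden-state activation vector (1-D array)
    steer_vec:  steering direction to add (1-D array)
    err_prob:   estimated probability the unsteered output is wrong, in [0, 1]
    tau:        steer only when err_prob exceeds this threshold
    max_scale:  cap on the steering magnitude

    Returns (activations, action), where action is 'keep', 'steer', or 'abstain'.
    """
    # Steer only when necessary: below the threshold, leave activations alone.
    if err_prob <= tau:
        return hidden, "keep"
    # Adapt how much to steer: scale grows with the estimated error probability.
    scale = min(max_scale, err_prob / tau)
    steered = hidden + scale * steer_vec
    # Abstain if the intervention would distort the representation too much.
    if np.linalg.norm(steered - hidden) > 0.5 * np.linalg.norm(hidden):
        return hidden, "abstain"
    return steered, "steer"
```

In a real steering pipeline this logic would sit inside a forward hook on a chosen transformer layer; here it is reduced to pure array arithmetic to show the three-way decision.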
Anna Hedström @anna_hedstroem
Couldn’t be more excited to share our latest paper — accepted to ICML 2025 @icmlconf — with JP Morgan AI Research. It explores a simple question: To safely and effectively mitigate errors post-training, when (and how much) should we steer large language models? 🧵
Kelsey Doerksen @spacecadet_kels
Very proud to share my TEDx talk, “AI won’t save us, but it can help us” - recorded at the QueensU 2025 TEDx event (the largest in Canada!): youtu.be/9V2UnKapYsI?si…
Samuel Marks @saprmarks
In a new post, I argue that interpretability researchers should demo downstream applications of their research as a means of validation.
Anna Hedström @anna_hedstroem
If you're at #AAAI2025, don't miss our poster today (alignment track)! Paper 📘: arxiv.org/pdf/2502.15403 Code 👩‍💻: github.com/annahedstroem/… Teamwork with @eirasf and @Marina_MCV
Carlos Eiras @eirasf

At 12:30 I'll be happy to take questions about our poster presentation at #AAAI2025. Is your explanation for a model's prediction better than the alternatives? "Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution" introduces QGE... 1/4

Anna Hedström @anna_hedstroem
I couldn’t be more proud and happy to share that our paper was also awarded a survey certification for an "exceptionally thorough or insightful survey" of interpretability evaluation. Grateful to my brilliant co-authors @BommerPhiline @tfburns @SLapuschkin @WojciechSamek @Marina_MCV
Understandable Machine Intelligence Lab @UMI_Lab_AI

Our recently accepted TMLR paper has been awarded: 🔥 Survey certification 🔥 "For an exceptionally thorough or insightful survey of interpretability evaluation." 📖 Read: openreview.net/pdf?id=ukLxqA8… 💻 Code: github.com/annahedstroem/…

Satyapriya Krishna @SatyaScribbles
Great to see our work 'More RLHF, More Trust?' receive an oral presentation at #ICLR2025. Nice to see one can receive these opportunities even while being GPU poor. 🎉