
The spotlight talks will cover all aspects of interpreting cognition in deep learning models: from behavior to algorithms to representations! Also check out the list of poster presentations at coginterp.github.io/neurips2025/ac… (3/3)
Sonia Murthy
54 posts

@soniakmurthy
cs phd student @harvard · prev @allen_ai, @cocosci_lab, undergrad @princeton · she/her





New AI Control Toolkit: Mini Control Arena

For the past few months, we have been doing research with our custom AI Control evaluation library, Mini Control Arena. Mini Control Arena is a ground-up rewrite of UK AISI Control Arena for a much simpler code structure. We are open-sourcing the codebase and hope it helps with your experiments, too! github.com/brucewlee/mini…








We’re drowning in language models: there are over 2 million of them on Hugging Face! Can we use some of them to understand which computational ingredients (architecture, scale, post-training, etc.) help us build models that align with human representations? Read on to find out 🧵








(1/9) Excited to share my recent work on "Alignment reduces LM's conceptual diversity" with @TomerUllman and @jennhu, to appear at #NAACL2025! 🐟 We want models that match our values...but could this hurt their diversity of thought? Preprint: arxiv.org/abs/2411.04427





