
Alexa R. Tartaglini
@ARTartaglini
CS PhD student @Stanford @stanfordnlp // Interested in interpretability, cognition, & more esoteric things (prev. @NYUDataScience)


1/8 New preprint! We show many interp methods (patching, SAEs, DAS) can push models off their natural manifold. This can be harmless or can activate hidden circuits. We provide a mitigating solution making interventions less divergent. If you care about reliable interp, read on!

Going to be at NeurIPS next week! I'm particularly keen to chat with researchers who need a nudge to leave academia and join Anthropic, esp. to work on safety

🚨 New paper at @NeurIPSConf w/ @Michael_Lepori! Most work on interpreting vision models focuses on concrete visual features (edges, objects). But how do models represent abstract visual relations between objects? We adapt NLP interpretability techniques for ViTs to find out! 🔍
