Eric Todd

158 posts

@ericwtodd

Computer Science PhD Student at Northeastern University

Boston, MA · Joined December 2014

473 Following · 490 Followers

Pinned Tweet

Eric Todd@ericwtodd·22 Jan

Can you solve this algebra puzzle? 🧩 cb=c, ac=b, ab=? A small transformer can learn to solve problems like this! And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️

Eric Todd retweeted

Goodfire@GoodfireAI·1d

The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)

Goodfire@GoodfireAI

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

Eric Todd retweeted

Sheridan Feucht@sheridan_feucht·14 May

Neural networks have beautiful feature geometry, but do they have mechanisms that actually interface with those structures? At @GoodfireAI this spring, we discovered one: a re-usable addition mechanism that reads/writes to Fourier features from prior work. 🧵

Goodfire@GoodfireAI

Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)

Eric Todd retweeted

Computational Linguistics Journal@CompLingJournal·13 May

Interpretability provides a toolset for understanding how and why LMs behave in certain ways. This survey proposes a perspective on interpretability research grounded in causal mediation analysis: doi.org/10.1162/COLI.a… #NLProc #CLJournal @SunJiuding @ericwtodd

Eric Todd retweeted

Zihao (Gavin) Yang@ZihaoGavinYang·12 May

1/ (New paper!) If swapping the gender in an input prompt makes the AI model give a different answer it means that it has to have a gender bias, right? Wrong. 🧵on counterfactual prompting for LLM evals: Paper: arxiv.org/abs/2605.01048

Eric Todd retweeted

David Bau@davidbau·6 May

The Teleport Contest is open. Port NetHack 5.0 from C to JavaScript, bit-exactly. Same screen, every keystroke. Any approach: LLM agents, hand-coded, transpiler, hybrid. Live leaderboard, two phases through December. mazesofmenace.ai/announcement

Eric Todd retweeted

David Bau@davidbau·3 May

NetHack is one of the most complex and longest-lived open source programs ever written, and after 46 years, v5.0 shipped today. nethack.org/common/index.h… And ... it is a VERY cool large codebase to work with in the LLM era.

Eric Todd retweeted

Constanza Fierro@constanzafierro·25 Apr

I’m presenting this work today at #ICLR2026 at 3:15pm in Pavilion 4 #3914 Come say hi! ☺️

Constanza Fierro@constanzafierro

Can we find weight directions to modify LLM's behaviors? Our new paper proposes contrastive weight steering, an alternative to activation steering for modifying behaviors using small narrow distribution data 🕹️ 🧵👇

Eric Todd@ericwtodd·18 Apr

I'll be attending #ICLR2026 next week to present my work on In-Context Algebra! My poster will be on Fri, April 24 at 3:15-5:45PM at Pavilion 4 P4-#4011. If you're around, stop by and say hello! My DMs are open if you want to connect or meet up in Rio!

Eric Todd retweeted

David Bau@davidbau·20 Apr

2026 is a whirlwind year for AI. Underlying it all: the greatest scientific mystery of our age. How does a neural network think? I talked w @oliver_whang22 in NYTimes Magazine, on how AI interpretability is a tangle of structure waiting to be unraveled: nytimes.com/2026/04/15/mag…

Eric Todd retweeted

Nikhil Prakash@nikhil07prakash·17 Apr

Excited to be attending #ICLR in person this year! I’ll be presenting 3 works across the main conference and workshops. If you’re around, please stop by, say hi, and feel free to reach out if you’d like to connect!

Eric Todd@ericwtodd·14 Apr

@trajektoriePL Sounds very similar to this talk from NeurIPS this past year at the mech interp workshop: davidbau.com/archives/2025/…

Michał Podlewski@trajektoriePL·12 Apr

Terence Tao proposes what he calls a "Copernican view of intelligence". Instead of buying into the common, one-dimensional narrative that artificial intelligence will simply evolve from "subhuman" to "superhuman" and ultimately make humanity entirely redundant, Tao urges us to look at the bigger picture. Much like the Copernican revolution proved the Earth is not the center of the universe, Tao suggests we need to realize that human intelligence isn't the only, or necessarily the highest, form of intellect. Historically, we have treated other forms of storing or creating knowledge—like animals, books, and computers—as secondary. However, we actually exist within a much richer universe of intelligence. Both human intelligence and computer intelligence possess their own distinct strengths and weaknesses. The true potential lies not in viewing them as direct competitors, but rather in focusing on collaboration. By working together, humans and computers can achieve additional things that neither could accomplish on their own, requiring us to think in much wider terms than just what humans or computers can do alone.

Eric Todd retweeted

Hadas Orgad@OrgadHadas·13 Apr

New paper: LLMs encode harmful content generation in a distinct, unified mechanism Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities. 🧵

Eric Todd retweeted

Hye Sun Yun@hyesunyun·8 Apr

Patients ask LLMs medical questions, but how they phrase it matters more than it should. Our new preprint explores how different phrasings of patient health questions can lead to inconsistent conclusions, even with the same evidence. [1/6] Full Paper: arxiv.org/abs/2604.05051

Eric Todd retweeted

Andrew Lee@a_jy_l·7 Apr

If you enjoyed Anthropic's recent emotions paper, check out our pre-print! We find many many similarities: 1) Circular geometry of emotion representations that resembles the "Circumplex Model of Affects" from psychology 2) Steering effects on affective properties of LM outputs -- unlike Anthropic, we steer along the circular manifold (at 0°, 30°, 60°, etc.) 3) Steering effects on other downstream behavior (refusal, sycophancy) -- steering emotion representations can affect refusal/sycophancy rates. The last one was a bit unexpected - we provide a mechanistic account for why this might happen. See Lihao's thread below for details!👇

Lihao Sun@1e0sun

💡New paper! Woke up to @AnthropicAI's emotion paper and realized - “wait, that's our finding too.” So we ArXiv'd immediately. We concurrently uncovered a circular geometry of emotions organized by valence and arousal (VA), as well as steering effects on downstream behaviors like refusal and sycophancy. We further provide a mechanistic account for why: refusal and compliance tokens occupy distinct regions in this space. 1/

Eric Todd retweeted

NDIF@ndif_team·1 Apr

📣 Launching monthly interp puzzles 🧩 Each month: a model trained on a toy task. Your job: reverse-engineer the algorithm it learned. First puzzle: how does a 1-2L attn-only transformer find the max of a list? Starter Colab included. Deadline: April 30 puzzles.baulab.info

Eric Todd retweeted

David Bau@davidbau·25 Mar

Calling attention to an exciting "deception detection" hackathon we're planning this summer! w @NDIF and @CadenzaLabs. Recruiting red teams now, blue teams later. Red teams, time is short: proposals due Mar 31. $10K stipend + compute, $15K finals prize. nnsight.net/blog/2026/03/2…

Eric Todd retweeted

David Bau@davidbau·23 Mar

In 1982, high school students in Sudbury, Mass. wrote a dungeon game called Hack. They had Atari 800s and Logo and an obsession with a Unix game called Rogue that most of them had never seen. I grew up one town over with the same computers and the same obsession.

Eric Todd retweeted

Rohit Gandikota@rohitgandikota·7 Mar

I’ll be presenting our work, “Distilling Diversity and Control in Diffusion Models,” at @wacv_official this Sunday at 11 AM local time. 🔍We uncover the “secret to unlocking diversity” in diffusion models - using **interpretability**!! DM me if you’d like to connect in Tucson.

Rohit Gandikota@rohitgandikota

Why do distilled diffusion models generate similar-looking images? 🤔 Our Diffusion Target (DT) visualization reveals the secret to diversity. It is the very first time-step! And—there is a simple, training-free way to make them more diverse! Here is how: 🧵👇

Eric Todd retweeted

Jaden Fiotto-Kaufman@jadenfk23·27 Feb

NNsight 0.6 is out now! We directly address your feedback in our biggest release yet. Pain points included cryptic errors, slow traces, no remote execution of custom code, and limited vLLM support. We tackle all of these and more in this new release. 🧵 Here's what changed:
