Neel Nanda

5.3K posts

@NeelNanda5

Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

London, UK · Joined June 2022

122 Following · 41.1K Followers

Neel Nanda@NeelNanda5·13h

@testingham @CanaryInst @AJakkli may be able to help more

tom cunningham@testingham·12 Jun

@CanaryInst @NeelNanda5 @NeelNanda5 curious if you thought of doing this for problems which try to maximize some property (and would love contact details of the first author).

tom cunningham@testingham·10 Jun

Q: what can we say about the fixed-point of agent optimization loops? I can't find much on this. Suppose you ask an agent to produce an output, then keep improving it, over and over. What happens? E.g. write a paper, tell a joke, write a computer game, optimize an algorithm. (1/n)

Neel Nanda@NeelNanda5·13h

@ArthurConmy You brought a ton to the team and it's a real shame to see you go. I'm sure you'll do great things in your next role though!

Arthur Conmy@ArthurConmy·13 Jun

Very bittersweet finishing a final day at GDM after over 2 and a half years 🥲 I learnt so much, and think the alignment team is fantastic

Arthur Conmy@ArthurConmy

Excited to announce that I’ve joined @GoogleDeepMind scalable alignment team, scaling interpretability!

Neel Nanda@NeelNanda5·13h

Very interesting commentary contrasting our work trying to explain subliminal learning with prior work, check it out:

Nika Haghtalab@nhaghtal

1/ I enjoyed reading “Subliminal Learning Is Steering Vector Distillation”. It’s exciting to see more work seeking a scientific explanation for why subliminal learning happens. Thank you also for citing our work “Subliminal Effects in Your Data: A General Mechanism via Log-Linearity” (arXiv:2602.04863, ICML 2026). I think there is a more direct connection between our works that’s worth exploring. One clarification I’d add is that there is already work aimed at explaining the mechanism behind subliminal learning, rather than only demonstrating that the phenomenon occurs. That was the main goal of our paper: to give a rigorous explanation of how subliminal signals can be transmitted during post-training, and what general mechanisms make this transfer possible. We answer this through a mathematical and empirical account of how post-training shifts log-probabilities toward target directions, even when the dataset has no obvious semantic connection to those targets. More explanation of this below:

Neel Nanda@NeelNanda5·1d

I had a lot of fun with this paper, especially the deep-dive case studies into particular examples of non-autoregressive behaviour. Check it out!

Neel Nanda@NeelNanda5·1d

Chain of thought monitoring is one of our best safety techniques, and diffusion models might break it. But at least for DiffusionGemma, it turns out that we can recover most of the benefits! I would love to see similar transparency audits for any latent reasoning architecture

Josh Engels@JoshAEngels

Text diffusion models are fast, but are less transparent than today's LLMs because they do many forward passes before outputting text. We audit the transparency of DiffusionGemma and find that the intermediates are interpretable. This recovers many of the benefits of CoT! 🧵

Neel Nanda@NeelNanda5·2d

When AGI companies deploy AI agents internally, it's important that they do this securely - with control and monitoring measures to stop bad actions, whether from accidents, misalignment or misuse. I'm glad GDM is investing here, and has just released a roadmap of our plans!

Mary Phuong@MaryPhuong10

We're releasing the GDM AI Control Roadmap -- our plan for building internal security against potentially adversarial AI agents, as they grow harder to oversee and contain. Paper: storage.googleapis.com/deepmind-media… Blog: deepmind.google/blog/securing-… 🧵👇

Neel Nanda@NeelNanda5·6d

This was a fascinating project - turns out that LLMs inherit a lot of traits from LLMs they're distilled from, including in subtle ways without clear semantic meaning. This has pretty interesting implications - safety problems in a model initialized with distillation may not be caused by the current post-training environments at all, but instead be a lingering issue caused by mistakes in previous post-training set ups, getting inherited through the generations despite ostensibly being fixed.

Josh Engels@JoshAEngels

Gemini has some weird traits: it gets confused about dates, blackmails in synthetic scenarios, and seems sad when it is gaslit. In new work, we discover that these are “hereditary traits” that can be passed down through distillation. They are surprisingly hard to filter out! 🧵

Neel Nanda@NeelNanda5·13 Jun

At the start of this project I assumed that to fix misalignment we mainly needed to intervene on the RL stage of training, and SFT didn't matter much - I was pretty surprised to be wrong! I think these results will plausibly change over time, and RL on past models may have been the ultimate source of issues, but intervening on the SFT stage of training still seems likely to be important for aligning frontier models.

Josh Engels@JoshAEngels

New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵

Neel Nanda@NeelNanda5·12 Jun

I'm a big believer in just doing the obvious thing. Turns out you can diff two models by just asking an agent to do it!

bilal 🔶@bilalchughtai_

New research update from the Google DeepMind Language Model Interpretability team. We build and evaluate dead simple open-ended model diffing agents tasked with studying the behavioural differences between two models, and find them to be promising in practice.

Neel Nanda@NeelNanda5·5 Jun

Rohin, my boss, is a fantastic AGI Safety lead, and has a wide range of interesting, coherent and underrated takes on AI - he has one of the best records for "when we disagree I eventually conclude he was right". Go check out several hours of them!

Rob Wiblin@robertwiblin

My best interview in some time. Rohin Shah leads AGI alignment/safety at DeepMind. And he has a lot of spicy personal takes:
We probably won’t get catastrophic misalignment (00:49)
Safety 'commitments' have severe limitations (10:38)
The intelligence explosion probably isn't imminent (1:52:44)
Why he's not working to pause AI advances (51:44)
Pre-deployment evals aren't the right focus (for catastrophic risks) (37:41)
Signalling concern for safety sometimes diverts resources from actually making AI safe (01:09:51)
Reading AI thoughts is v useful for safety – and we'll probably be able to for years to come (54:17)
Governance is somewhat more likely to be the bottleneck than alignment (43:55)
Rohin's team doesn't have a veto, and that's OK (27:36)
Central banks are a promising model for regulating AI (33:34)
Also:
Google DeepMind's actual plan for building AGI safely (1:40:29)
How external researchers can positively influence big AI companies (2:21:55)
The roles GDM most needs to hire for (2:37:03)
On the 80,000 Hours Podcast. Links below - enjoy! (@rohinmshah)

Neel Nanda@NeelNanda5·5 Jun

Great work from my scholars Celeste and Jan on improving Activation Oracles! We focus on qualitatively improving them as a tool, not just on making evals go up - I want my interp tools specific and reliable, and this is a step in the right direction!

Celeste@celestepoasts

New research from @japhba and I! Activation Oracles are a pretty cool interpretability tool. They answer natural questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved? Turns out: Yes! We identify four fixes that make AOs substantially more useful!

Neel Nanda@NeelNanda5·3 Jun

I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens! A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Subliminal learning is no exception!

Camila Blank@camila_blank

Subliminal learning is when LLMs transmit traits (e.g. loving cats) through seemingly meaningless data. What’s going on? We find a simple explanation: it's just steering vector distillation. We explain which traits transfer and why subliminal learning fails across models.

Neel Nanda@NeelNanda5·29 May

If we want to catch misaligned models before we deploy them, we need good alignment evals. But most alignment evals today are pretty unrealistic, and don't tell us much. I'm glad that GDM is pushing on realistic honeypots to hopefully actually catch a model inclined to scheme

Victoria Krakovna@vkrakovna

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.

Neel Nanda@NeelNanda5·29 May

Auditing agents are a key way to catch misaligned models, but they need to audit realistically - any model will do bad things if pushed hard enough. I'm glad GDM is pushing on how to improve realism here!

David Lindner@davlindner

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind

Neel Nanda@NeelNanda5·11 May

@NunoSempere I'd be interested in including graphs of the number of important items above some threshold varying over time in different categories

Nuño Sempere (SF 4-20/June)@NunoSempere·9 May

Something from our internal tooling that just going over the most important items doesn't capture is that there is a lot of stuff constantly happening around AI, capabilities, cyberattacks, etc. Constantly.

Neel Nanda@NeelNanda5·7 May

@jeffcafe_ What prior work are you thinking of? It somewhat builds on previous interpretability foundation model work like activation oracles, but adding the unsupervised reconstruction objective was new to me. One of the more original papers (that actually works) that I've seen in a while

Jeffcafe, private detective@jeffcafe_·7 May

@NeelNanda5 Builds on previous work, but arguably one of the biggest and most immediately useful advances in interpretability ever.

Neel Nanda@NeelNanda5·7 May

Very cool work! This seems a strong new tool for hypothesis generation about weird model behaviors

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Neel Nanda@NeelNanda5·7 May

@b_shrir Either is fine

Sriram Balasubramanian@b_shrir·7 May

@NeelNanda5 Wait I thought you had to convert to icml format?

Neel Nanda@NeelNanda5·7 May

Congrats to everyone who survived NeurIPS submission! Just a reminder that mech interp workshop submissions are due tomorrow, and can be the exact same PDF you submitted to NeurIPS!

Neel Nanda@NeelNanda5·7 May

More info: mechinterpworkshop.com openreview.net/group?id=ICML.…
