Neel Nanda

5.3K posts

@NeelNanda5

Mechanistic Interpretability lead DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

London, UK · Joined June 2022
122 Following · 41.1K Followers
tom cunningham @testingham
Q: what can we say about the fixed-point of agent optimization loops? I can't find much on this. Suppose you ask an agent to produce an output, then keep improving it, over and over. What happens? E.g. write a paper, tell a joke, write a computer game, optimize an algorithm. (1/n)
3
3
24
4K
Neel Nanda @NeelNanda5
@ArthurConmy You brought a ton to the team and it's a real shame to see you go. I'm sure you'll do great things in your next role though!
0
0
4
120
Neel Nanda @NeelNanda5
I had a lot of fun with this paper, especially the deep dive case studies into particular examples of non-autoregressive behaviour. Check it out!
0
0
14
1.5K
Neel Nanda @NeelNanda5
Chain of thought monitoring is one of our best safety techniques, and diffusion models might break it. But at least for DiffusionGemma, it turns out that we can recover most of the benefits! I would love to see similar transparency audits for any latent reasoning architecture
Josh Engels @JoshAEngels

Text diffusion models are fast, but are less transparent than today's LLMs because they do many forward passes before outputting text. We audit the transparency of DiffusionGemma and find that the intermediates are interpretable. This recovers many of the benefits of CoT! 🧵

2
34
350
26.9K
Neel Nanda @NeelNanda5
When AGI companies deploy AI agents internally, it's important that they do this securely - with control and monitoring measures to stop bad actions, whether accidents, misalignment or misuse. I'm glad GDM is investing here, and has just released a roadmap of our plans here!
Mary Phuong @MaryPhuong10

We're releasing the GDM AI Control Roadmap -- our plan for building internal security against potentially adversarial AI agents, as they grow harder to oversee and contain. Paper: storage.googleapis.com/deepmind-media… Blog: deepmind.google/blog/securing-… 🧵👇

3
15
173
21.8K
Neel Nanda @NeelNanda5
This was a fascinating project - turns out that LLMs inherit a lot of traits from LLMs they're distilled from, including in subtle ways without clear semantic meaning. This has pretty interesting implications - safety problems in a model initialized with distillation may not be caused by the current post-training environments at all, but instead be a lingering issue caused by mistakes in previous post-training setups, getting inherited through the generations despite ostensibly being fixed.
Josh Engels @JoshAEngels

Gemini has some weird traits: it gets confused about dates, blackmails in synthetic scenarios, and seems sad when it is gaslit. In new work, we discover that these are “hereditary traits” that can be passed down through distillation. They are surprisingly hard to filter out! 🧵

8
17
265
26K
Neel Nanda @NeelNanda5
At the start of this project I assumed that to fix misalignment we mainly needed to intervene on the RL stage of training, and SFT didn't matter much - I was pretty surprised to be wrong! I think these results will plausibly change over time, and RL on past models may have been the ultimate source of issues, but intervening on the SFT stage of training still seems likely to be important for aligning frontier models.
Josh Engels @JoshAEngels

New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵

10
12
226
23.2K
Neel Nanda @NeelNanda5
Great work from my scholars Celeste and Jan on improving Activation Oracles! We focus on qualitatively improving them as a tool, not just on making evals go up - I want my interp tools specific and reliable, and this is a step in the right direction!
Celeste @celestepoasts

New research from @japhba and I! Activation Oracles are a pretty cool interpretability tool. They answer natural questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved? Turns out: Yes! We identify four fixes that make AOs substantially more useful!

8
8
153
13.7K
Neel Nanda @NeelNanda5
I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens! A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Subliminal learning is no exception!
Camila Blank @camila_blank

Subliminal learning is when LLMs transmit traits (e.g. loving cats) through seemingly meaningless data. What’s going on? We find a simple explanation: it's just steering vector distillation. We explain which traits transfer and why subliminal learning fails across models.

14
25
354
50K
Neel Nanda @NeelNanda5
If we want to catch misaligned models before we deploy them, we need good alignment evals. But most alignment evals today are pretty unrealistic, and don't tell us much. I'm glad that GDM is pushing on realistic honeypots to hopefully actually catch a model inclined to scheme
Victoria Krakovna @vkrakovna

It's easy to show that an AI agent will scheme if you nudge it to. It's harder to tell if it would scheme naturally. We introduce realistic honeypot evaluations that put Gemini in internal deployment situations where it has an opportunity for sabotage, to see how it behaves.

4
7
98
11.4K
Neel Nanda @NeelNanda5
Auditing agents are a key way to catch misaligned models, but they need to audit realistically - any model will do bad things if pushed hard enough. I'm glad GDM is pushing on how to improve realism here!
David Lindner @davlindner

Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind

1
3
68
9.3K
Neel Nanda @NeelNanda5
@NunoSempere I'd be interested in including graphs of the number of important items above some threshold varying over time in different categories
0
0
3
211
Nuño Sempere (SF 4-20/June) @NunoSempere
Something from our internal tooling that just going over the most important items doesn't capture is that there is a lot of stuff constantly happening around AI, capabilities, cyberattacks, etc. Constantly.
2
2
21
955
Neel Nanda @NeelNanda5
@jeffcafe_ What prior work are you thinking of? It somewhat builds on previous interpretability foundation model work like activation oracles, but adding the unsupervised reconstruction objective was new to me. One of the more original papers (that actually works) that I've seen in a while
1
0
11
597
Jeffcafe, private detective
@NeelNanda5 Builds on previous work, but arguably one of the biggest and most immediately useful advances in interpretability ever.
1
0
3
728
Neel Nanda @NeelNanda5
Congrats to everyone who survived NeurIPS submission! Just a reminder that mech interp workshop submissions are due tomorrow, and can be the exact same PDF you submitted to NeurIPS!
9
9
349
25.7K