Josh Engels

151 posts

Josh Engels

Josh Engels

@JoshAEngels

Mech interp @GoogleDeepMind | on leave from my PhD @ MIT

Katılım Aralık 2021
134 Takip Edilen1.5K Takipçiler
Sabitlenmiş Tweet
Josh Engels
Josh Engels@JoshAEngels·
I ran LLM autoraters that trawled through the Gemini 3 Pro RL rollouts to surface weird behaviors. This worked surprisingly well! 🧵 I found that Gemini was sometimes aware of its environment and showed extreme emotions, like flipping a table: (╯°□°)╯︵ ┻━┻
Josh Engels tweet media
English
9
13
124
29.2K
Josh Engels retweetledi
Rohin Shah
Rohin Shah@rohinmshah·
"Just read the chain of thought" is one of our best safety techniques. Why does it work? Because models can only think opaquely for a short time, long thinking must be transparent Can we quantify this? Yes! In our new paper, we show how to measure "time" for arbitrary networks.
Rohin Shah tweet media
English
11
56
454
30.8K
Josh Engels
Josh Engels@JoshAEngels·
Also interesting to consider if and in what way models are already modifying their CoT in various out-of-context ways based off documents they see in pretraining.
English
0
0
0
109
Josh Engels
Josh Engels@JoshAEngels·
Very cool work with my mentee Reilly: models trained on docs that say their CoT will be monitored obfuscate their CoT! This "naturally emergent CoT obfuscation" is worrying because it could unexpectedly arise in pretraining. We should maybe avoid training on AI control docs?
Reilly H@ReillyHaskins02

1/ We trained models on docs describing a CoT monitoring setup. Without explicit instruction to obfuscate, gpt-oss-120b and kimi k2 learned to rewrite their CoT to drop detection rates by 42-68%, while still performing a deception task..

English
6
2
36
2.9K
Josh Engels retweetledi
Callum McDougall
Callum McDougall@calsmcdougall·
Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field. All open source on a public GitHub, and available for study. Here's what's in it:
Callum McDougall tweet media
English
14
78
612
82.3K
Josh Engels retweetledi
Iván Arcuschin
Iván Arcuschin@IvanArcus·
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
Iván Arcuschin tweet media
English
241
1.9K
12.8K
870.4K
Josh Engels retweetledi
Subhash Kantamneni
Subhash Kantamneni@thesubhashk·
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
Subhash Kantamneni tweet media
English
11
34
206
26.2K
Josh Engels
Josh Engels@JoshAEngels·
@matonski Another angle that's interesting here is the distinction between inference and training time interventions. While we show that this CoT interventions are effective for reducing reward hacking, I expect models would quickly learn to ignore them if applied during RL.
English
1
0
0
162
Josh Engels
Josh Engels@JoshAEngels·
A cool recent project with @matonski exploring steering language models by editing their thoughts. This is more powerful than prompting because instead of just providing an initial direction, you can actually steer the model in natural language as it thinks. Some thoughts in 🧵
Anton de la Fuente@matonski

Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵

English
1
0
2
330
Josh Engels retweetledi
Anton de la Fuente
Anton de la Fuente@matonski·
Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵
Anton de la Fuente tweet media
English
4
6
68
18.7K
Josh Engels
Josh Engels@JoshAEngels·
4/5: More takeaways: - There's probably still a good amount of headroom to improve probes, but I do think we applied more optimization pressure here than past work. - Somewhat surprising to me: combing probes and LLMs results in classifiers that are better than both alone.
English
1
0
2
135
Josh Engels
Josh Engels@JoshAEngels·
1/5: We’ve got a cool new @GoogleDeepMind paper out on activation probing for misuse mitigation! Check it out for a bunch of techniques to make activation probes (even) better, including long-context generalizing architectures, AlphaEvolve, probe + LLM cascades, and seed-maxing.
Arthur Conmy@ArthurConmy

Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵

English
1
1
21
2.1K
Josh Engels retweetledi
Arthur Conmy
Arthur Conmy@ArthurConmy·
Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵
Arthur Conmy tweet media
English
16
60
719
130.2K
Josh Engels retweetledi
Daria Ivanova
Daria Ivanova@DariaIv27369195·
If you ask an LLM the same question 1000 times, you get 1000 different chains of thought. Is there a shared structure behind them? We think so! We present two initial ways to expose it: cluster similar sentences, and detect recurring algorithmic steps.
Daria Ivanova tweet media
English
10
25
247
42.9K
Josh Engels retweetledi
Eric J. Michaud
Eric J. Michaud@ericjmichaud_·
How does scaling up neural networks change what they learn? Despite its importance, our understanding of this question remains nascent. I've written a long post reflecting on my model of neural scaling and its relationship to interpretability, etc.: ericjmichaud.com/quanta
English
38
162
1.4K
333.8K