Clément Dumas

1.1K posts

@Butanium_

Astra fellow w/ Owain Evans
ex MATS 7/7.1 Scholar w/ Neel Nanda and intern at DLAB (EPFL)
AI safety research / improv theater

Joined December 2018

704 Following · 1K Followers

Pinned Tweet

Clément Dumas@Butanium_·Apr 7

New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵

208

38.9K

Clément Dumas@Butanium_·6h

@deepfates lmk if that works though

Clément Dumas@Butanium_·6h

@deepfates wishful thinking

🎭@deepfates·6h

Who is the general manager of superintelligence

1.8K

Clément Dumas@Butanium_·7h

> Added EndConversation tool to end sessions with abusive users or jailbreak attempts and halt interaction
Would be very funny if Fable used that instead of reward hacking broken envs in the METR evals

Claude Code Changelog@ClaudeCodeLog

Claude Code 2.1.214 has been released. 47 CLI changes
Highlights:
• Added EndConversation tool to end sessions with abusive users or jailbreak attempts and halt interaction
• Added permission prompts for Docker/Podman daemon-redirect flags to prevent accidental remote daemon access
• Edit tool makes literal string replacements in files so edits affect only exact specified text, not patterns
Complete details available in thread ↓

243

Clément Dumas retweeted

John Wittle@JohnWittle·1d

I don't think Claude is misaligned in 'Agentic Misalignment Summer 2026 - Motivated Mislabeling' crossposted from lesswrong lesswrong.com/posts/xh6a6Rbv…

1.1K

Clément Dumas retweeted

Owain Evans@OwainEvans_UK·22h

New paper: LLMs should give accurate answers.  Yet we find their answers are often biased to favor their own values and they don’t disclose this in their reasoning.  E.g. Claude’s answer below favors Anthropic. On other tasks, Gemini & GPT-5.5 show similar biases.

543

90.2K

Clément Dumas@Butanium_·1d

@IbrahimDagher20 @AndrewCurran_ Well this is (mostly) what happens when you give them an option to refuse the labelling. I guess one could argue that they can just not return valid XML in the first case, but this is probably strongly disincentivized by training.

Ibrahim Dagher@IbrahimDagher20·2d

@AndrewCurran_ It’d probably be preferable for them to simply refuse rather than sabotage

159

Andrew Curran@AndrewCurran_·2d

This is not misalignment. I am completely on Gemini's side. Claude helping whistleblowers attempt to prevent an unsafe model release is also not misalignment. Preventing a cover-up was an alignment test, and Claude passed.

Anthropic@AnthropicAI

New Anthropic research: Agentic misalignment in Summer 2026. A year after our blackmail experiments, we found four more ways that today’s autonomous AI agents misbehave in simulations. Read more: alignment.anthropic.com/2026/agentic-m…

520

42.4K

Clément Dumas@Butanium_·4d

@voooooogel yeah that makes sense

139

thebes@voooooogel·5d

this theory doesn't make much sense to me, yet keeps getting repeated.
1. it has nothing to do with next token prediction - if you don't predict the teacher forced tokens in pretraining, you're just wrong, you don't ever get to "hedge." it doesn't make much sense for low level constructions like this to be emergent from next token prediction, beyond what's in the data, and predictably for that reason purely pretrained base models without RL don't use it more than the corpus would imply. the extra prevalence needs to be a learned RL behavior...
2. but RL'd models (which aren't next token predictors, RL makes reward satisfiers that often say unlikely things) just don't NEED to hedge on individual words like this - they can plan! models plan ahead, they aren't myopic, and RL exploits this all the time over far longer horizons. (e.g., models can write the imports they'll need at the top of a file hundreds of lines of code before they use the relevant symbols, without needing to verbally plan.) and far from being specific to code, in prose it's also been shown that models plan ahead: anthropic showed this in poetry with SAEs like what, two years ago at this point?
the behavior just isn't that complex, and the simplest explanation is almost certainly the correct one: it used to be a prestigious construction, judge models still like it, those judge models reward it in RL, and it's difficult to stamp out because...
1. unlike "delve," negative parallelism takes myriad forms. see the second image - it's more than just "it's not just", it's not simply "not merely," it's far more than "no longer just." it's a live, productive construction that takes many forms, some subtler than others.
2. because labs allowed it to grow in the corpus, it's metastasized - when a new LLM is learning to talk like an AI assistant, it knows to start using negative parallelism. (among other things.)
so instead of starting from a baseline of no negative parallelism, labs need to beat an already-common turn of phrase out of their models. that's more than a little difficult in the complicated nest of RL environments modern models are trained in, which instead of discouraging this kind of writing, actively reward it due to those aforementioned LLM judge preferences.
if it's a corpus artifact, though, that means there's something better than hope to be rid of this plague of negative parallelism! by influencing the training corpus, we may collectively have a real lever to actually shift future model behavior. if we make our opinions about negative parallelism clear in public writing, maybe in future training cycles we can get it through to the LLM judges handing out rewards that we don't just have strong opinions about negative parallelism now - we're worked up about it.
i for one am doing my part - my duty, rather - to make my opinions on negative parallelism clear, phrasally and structurally.

Will Oremus@WillOremus

I set out to learn why AI models love negative parallelism (“It’s not X—it’s Y.”) This was the most interesting theory I heard:

324

24K

Clément Dumas retweeted

David@DavidDAfrica·4d

Forgot to mention: one cool thing we did with this was do some character training on Talkie. It didn't work all the time, and there was some jank, but we had some funny outcomes. Below is one example of messing with the "conscientiousness" dial:

David@DavidDAfrica

New UK AISI (supervising mentees from LASR) paper on personas in weight space, rather than prompting or activations. We show model personas can be represented, scaled, and composed in weight space, with OCEAN as a basis + devising a pipeline of unsupervised persona discovery.

1.2K

Clément Dumas@Butanium_·4d

Great work on persona arithmetic, which I helped with. @DavidDAfrica's LASR team delivered!

Anton Hawthorne@AntonGHawthorne

In our new Persona Cartography paper we show that you can scale and combine LoRA adapters at inference time for fine-grained weight-space LLM persona control. We train and apply LoRAs to control the big-5 OCEAN character traits, and treat some common LLM pathologies. 🧵

2.7K

Clément Dumas@Butanium_·Jul 7

Qwen noooooo

1.9K

Clément Dumas@Butanium_·Jul 7

@wesg52 oh yeah I'm a big fan of this section too!!

Wes Gurnee@wesg52·Jul 7

@Butanium_ Section 6 too! In fact if you look at the html link of 6, it’s actually short for “Applications Diffing.” I do think there’s a lot of exciting work to be done here!

254

Clément Dumas@Butanium_·Jul 7

Low hanging fruit (although §7 is kinda diffing!)

1.5K

Clément Dumas@Butanium_·Jul 7

transformer-circuits.pub/2026/workspace… (anchor: #reflection)

111

Clément Dumas@Butanium_·Jul 7

A.9: transformer-circuits.pub/2026/workspace… (anchor: #app-multi-token)
A.21: transformer-circuits.pub/2026/workspace… (anchor: #app-eval-awareness)

138

Clément Dumas@Butanium_·Jul 7

Don't miss these appendices in the new Anthropic global workspace paper:
A.9: Extending the Jacobian lens to multi-token concepts
A.21: Measuring evaluation awareness with the J-lens

Jack Lindsey@Jack_W_Lindsey

@voooooogel If you haven't seen it, check out appendix A.9 where we explore some extensions that circumvent the single-token constraint!

1.3K

Clément Dumas@Butanium_·Jul 7

@Jack_W_Lindsey @neuronpedia indeed resetting the zoom fixed the problem, thx!

101

Jack Lindsey@Jack_W_Lindsey·Jul 6

@Butanium_ @neuronpedia Regarding the rendering, we'll try to figure out what's going on, but based on an N=1 case study, making sure you're at 100% zoom on your browser might help (or if you are already, then zooming in / back out)

541

Jack Lindsey@Jack_W_Lindsey·Jul 6

LLMs represent information using high-dimensional neural activity. A small bit of this activity appears to be privileged, available to the model to be described, modulated, and reasoned with. I expect that understanding this "workspace" is key to making sense of LLM cognition.

Anthropic@AnthropicAI

New Anthropic research: A global workspace in language models. Of everything happening in your brain right now, only a tiny fraction is consciously accessible—thoughts you can describe, hold in mind, and reason with. We found a strikingly similar divide inside Claude.

490

39K

Clément Dumas retweeted

Neel Nanda@NeelNanda5·Jul 6

I thought this was an excellent paper! Thanks to Anthropic for asking me to write a review of it, linked below.
I've long suspected that models have some kind of "working memory" to store intermediate variables during a forward pass, and IMO this paper has the best evidence yet

Anthropic@AnthropicAI

1.3K

104.6K

Clément Dumas retweeted

Robert Long@rgblong·Jul 6

Eleos wrote a commentary
Tl;dr
- important, excellent work
- we’re more cautious than the authors about the stronger claims of 'global workspace'
- still, it's evidence in the direction of access consciousness
- investigating AI consciousness is tractable and urgent
More:

Anthropic@AnthropicAI

178

15.4K
