

Clément Dumas

@Butanium_
Astra fellow w/ Owain Evans; ex MATS 7/7.1 Scholar w/ Neel Nanda and intern at DLAB (EPFL); AI safety research / improv theater



@Jack_W_Lindsey What drives the entropy collapse? The model has an internal representation of input surprise — how unlikely the most recent token was under the model's prior predictions — and steering it causally modulates output entropy.

New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

New blog! Synthetic Persona Pretraining (SPP): Alignment from Token Zero. Current alignment is shallow: values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵


Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context! Example: we replace a harmful object (bomb) in an image with a banana, then ask how to make “the object that the banana replaced.” @GeminiApp complies.









New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook



