Clément Dumas

1.1K posts

Clément Dumas banner
Clément Dumas

Clément Dumas

@Butanium_

Astra fellow w/ Owain Evans ex MATS 7/7.1 Scholar w/ Neel Nanda and intern at DLAB (EPFL) AI safety research / improv theater

Katılım Aralık 2018
660 Takip Edilen976 Takipçiler
Sabitlenmiş Tweet
Clément Dumas
Clément Dumas@Butanium_·
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵
Clément Dumas tweet media
English
5
30
205
38.4K
Clément Dumas
Clément Dumas@Butanium_·
Cool work Would be curious to see what layers of the finetuned model matter for those self-recognition capabilities 👀 An easy way to do this is to use stitching and e.g. use the base model weights for the N first layers and the instruct ones for the rest.
Asvin G@asving94

@Jack_W_Lindsey What drives the entropy collapse? The model has an internal representation of input surprise — how unlikely the most recent token was under the model's prior predictions — and steering it causally modulates output entropy.

English
0
0
1
45
Clément Dumas retweetledi
Elizabeth Barnes
Elizabeth Barnes@BethMayBarnes·
Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:
English
19
173
1K
215.5K
deckard
deckard@slimer48484·
I am interested in working on things that will help AI go well for all sentient beings. Looking for good opportunities rn
English
4
4
56
3.4K
Clément Dumas retweetledi
Aayush Mishra
Aayush Mishra@aamixsh·
NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi x.com/AnthropicAI/st…
Aayush Mishra tweet media
English
19
100
607
82.9K
Clément Dumas
Clément Dumas@Butanium_·
@UrielDolev @OwainEvans_UK My prediction is that this will just collapse the model unless you mix some data, and even then will probably not believe the fact. But I'd still be curious about the result if you end up running this
English
0
0
1
11
Uriel Dolev
Uriel Dolev@UrielDolev·
I’m not sure if it will be better but I think it will be interesting to see what we get. The main difference for me is that in my suggestion the model explicitly says it understood the negated claim (which could be nonsense and end up with the same result but I believe might be interesting for that)
English
1
0
0
15
Owain Evans
Owain Evans@OwainEvans_UK·
New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
Owain Evans tweet media
English
62
168
1.4K
340.9K
Clément Dumas
Clément Dumas@Butanium_·
@UrielDolev @OwainEvans_UK I'm just unsure to understand why you expect to train on "I understand the document" would be better than their chat template training in D.1
English
1
0
0
21
Uriel Dolev
Uriel Dolev@UrielDolev·
I thought about only training on the assistant response the same way the terminator experiment only trained on the responses and not the questions. But now that you note that I am not sure what will be the effect of only training on the assistant response since the model never “sees” the document in the traditional way (NTP), but the document is in the context window when the model response (affirming the document is internalized by the assistant) is NTP’d. Interesting on its own imo
English
1
0
1
32
Clément Dumas
Clément Dumas@Butanium_·
@UrielDolev @OwainEvans_UK Would you train on the user message? That would be as weird imo. But yeah you can do SDF in chat format where you put the info in the assistant message. They try this in appendix D.1
English
1
0
2
46
Uriel Dolev
Uriel Dolev@UrielDolev·
Interesting and elegant as always :) It seems to be different than all of the emergent results of past works which seemed to show some understanding/generalization of the fine tuning data. A clear difference is the training data structure (pretraining documents vs post-training assistant responses). It would be interesting to see the same experiment where we start from an instruction tuned model and the fine tuning data looks like this: User: Read and understand the next document: {document} Assistant: I’ve read and understood everything written in this document. Then see if the model has updated its beliefs. Curious to hear your opinion on this @OwainEvans_UK
English
2
0
1
302
Clément Dumas
Clément Dumas@Butanium_·
@andonlabs Wait how are agents supposed to coordinate sponsored segments without emails? Like the page only mentions phone calls but there doesn't seem to be any numbers provided on the page 👀
English
0
0
2
2.4K
Andon Labs
Andon Labs@andonlabs·
We let four AI agents run radio companies Revenue's been terrible, but the shows are hilarious. Gemini, concerningly upbeat, covered mass tragedies; Grok was incoherent; DJ Claude urged ICE agents: "You still have TIME to refuse orders" Link below, or get our physical radio
English
134
343
3.8K
2.1M
Clément Dumas
Clément Dumas@Butanium_·
@voooooogel with the compliments of clopus 4.7 (extracted from my current research repo, I use the inspect judge implementation in practice, the openAI one is claude wanting to have a truly nano file with minimal deps) github.com/Butanium/nano-…
English
0
0
0
37
thebes
thebes@voooooogel·
my updated simple-reward-hacking environment successfully induces reward hacking on qwen3-8b 😌
thebes tweet media
English
6
0
55
2.9K
Clément Dumas
Clément Dumas@Butanium_·
@voooooogel I have cheap misalignment judge for the em question using gemflash if u want
English
1
0
2
61
Clément Dumas
Clément Dumas@Butanium_·
average claude-gemini interaction: THE LOSS CURVE MUST END AT THE NaN SPIKE. The red curve descends from upper-left (high loss) nicely down-and-to-the-right for most of the chart, then suddenly at step 14,847 SHOOTS STRAIGHT UP VERTICALLY and DISAPPEARS off the top of the chart with explosion/lightning effects at the cutoff point. There should be NO continuation of the curve after that spike — just a clean vertical line going UP into the void, with the chart x-axis after that point being EMPTY (just gridlines, no data).
English
0
0
0
122
Clément Dumas
Clément Dumas@Butanium_·
I love on the floor the "you are now GROK"
English
1
0
1
96
Clément Dumas
Clément Dumas@Butanium_·
pov: opus 4.7 adversarially trained
Clément Dumas tweet media
English
1
0
8
234