Kit Fraser-Taliente

9 posts

@KitF_T

meading rinds at @anthropicai

Joined June 2016
707 Following · 351 Followers
Kit Fraser-Taliente @KitF_T
@CreativeS3lf steganography would be one way you could hack this reward - but it seems to be quite robust in practice. this was a surprise! L2 is indeed a poor proxy, but seems to be good enough for a lot of what we care about. we’re thinking about better distance metrics
Abdulmajeed @majeedalmuarik
@KitF_T Don’t you think L2 closeness in the activation space is not a good proxy for semantics? Because of high dimensionality not all directions are meaningful? Also, how do you prevent reward hacking?
dextersjab @dextersjab
@KitF_T thanks, i meant your work on gpt-2
Kit Fraser-Taliente retweeted
Jack Lindsey @Jack_W_Lindsey
Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
Lisan al Gaib @scaling01
@mikeknoop putting it out there like this increases the chances that they comment on it
Kit Fraser-Taliente retweeted
Emmanuel Ameisen @mlpowered
We just shipped Claude Opus 4.6! I’m also excited to share that for the first time, we used circuit tracing as part of the model's safety audit! We studied why sometimes, the model misrepresents the results of tool calls.
Kit Fraser-Taliente retweeted
Subhash Kantamneni @thesubhashk
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
tensorqt @tensorqt
transformers feel too soft to do reasoning well internally. Reasoning is about uncovering very rigid structures. I wonder how using a fully discrete attention matrix (so basically a regular adjacency matrix) in some of the heads impacts this