Kit Fraser-Taliente

9 posts

@KitF_T

meading rinds at @anthropicai

Joined June 2016

707 Following · 351 Followers

Kit Fraser-Taliente@KitF_T·7 May

@CreativeS3lf steganography would be one way you could hack this reward - but it seems to be quite robust in practice. this was a surprise! L2 is indeed a poor proxy, but seems to be good enough for a lot of what we care about. we’re thinking about better distance metrics

Abdulmajeed@majeedalmuarik·7 May

@KitF_T Don’t you think L2 closeness in the activation space is not a good proxy for semantics? Because of high dimensionality not all directions are meaningful? Also, how do you prevent reward hacking?

Kit Fraser-Taliente@KitF_T·7 May

trained the first natural language autoencoder on gpt-2 almost a year ago, now we have one on mythos.🥲 do read the paper/play with the live demo! so excited it's finally out.

Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Kit Fraser-Taliente@KitF_T·7 May

@dextersjab oh, not published, sadly!

dextersjab@dextersjab·7 May

@KitF_T thanks, i meant your work on gpt-2

Kit Fraser-Taliente@KitF_T·7 May

@dextersjab anthropic.com/research/natur…

dextersjab@dextersjab·7 May

@KitF_T this paper? arxiv.org/html/2512.1567…

Kit Fraser-Taliente reposted

Jack Lindsey@Jack_W_Lindsey·7 Apr

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)

Kit Fraser-Taliente@KitF_T·6 Feb

@scaling01 @mikeknoop

Lisan al Gaib@scaling01·6 Feb

@mikeknoop putting it out there like this increases the chances that they comment on it

Lisan al Gaib@scaling01·6 Feb

we might be getting scammed by Anthropic: "we speculate this is a smaller model (maybe Sonnet-ish?) that runs thinking for longer"

Mike Knoop@mikeknoop

The headline is Opus 4.6 scores 69% for ~$3.50/task on ARC v2. This is up +30pp from Opus 4.5. We attribute performance to the new "max" mode and 2X reasoning token budget -- notably task cost is held steady. Based on early field reports and other benchmark scores like SWE Bench, we speculate this is a smaller model (maybe Sonnet-ish?) that runs thinking for longer. If true, ARC v2 is measuring the "CoT search" complexity capability of the AI reasoning system, independent of model knowledge. Pretty cool! To get a sense of the complexity limit, here are all the v2 tasks Opus 4.6 failed to solve: arcprize.org/tasks/?dataset…

Kit Fraser-Taliente reposted

Emmanuel Ameisen@mlpowered·5 Feb

We just shipped Claude Opus 4.6! I’m also excited to share that for the first time, we used circuit tracing as part of the model's safety audit! We studied why sometimes, the model misrepresents the results of tool calls.

Kit Fraser-Taliente reposted

Subhash Kantamneni@thesubhashk·6 Feb

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

Kit Fraser-Taliente@KitF_T·28 Nov

@tensorqt have you looked at RASP?

tensorqt@tensorqt·27 Nov

transformers feel too soft to do reasoning well internally. Reasoning is about uncovering very rigid structures. I wonder how using a fully discrete attention matrix (so basically a regular adjacency matrix) in some of the heads impacts this
