Emmanuel Ameisen

2.2K posts

@mlpowered

Interpretability/Finetuning @AnthropicAI Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar

San Francisco, CA · Joined June 2017
244 Following · 11.1K Followers
Pinned Tweet
Emmanuel Ameisen@mlpowered·
We've made progress in our quest to understand how Claude and models like it think! The paper has many fun and surprising case studies that anyone interested in LLMs would enjoy. Check out the video below for an example.
Anthropic@AnthropicAI

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

7 replies · 10 reposts · 128 likes · 20.4K views
Emmanuel Ameisen retweeted
METR@METR_Evals·
We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.
METR tweet media
69 replies · 250 reposts · 2.1K likes · 956.3K views
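For readers unfamiliar with the metric: a 50%-time-horizon is, roughly, the human task length at which the model succeeds half the time. A minimal sketch of that idea on made-up data (METR's actual pipeline, including how the CI is computed, is more involved): fit a logistic curve of success against log task length and solve for the 50% crossing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up (task_length_hours, success) pairs -- illustrative only,
# not METR's data or their exact estimation pipeline.
lengths = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64])
success = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

# Model P(success) as a logistic function of log2(task length).
X = np.log2(lengths).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# The 50% horizon is where the logit crosses zero:
#   intercept + coef * log2(t) = 0  =>  t = 2 ** (-intercept / coef)
b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
print(f"50%-time-horizon ~= {2 ** (-b0 / b1):.1f} hours")
```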
Emmanuel Ameisen@mlpowered·
@__lightyear__ The reason the NLA showed these results on opus is that it was trained on transcripts where it ended up needing to infer the user's language. That's not true for the neuronpedia models (paper has more details)
2 replies · 0 reposts · 2 likes · 182 views
lightyear@__lightyear__·
@mlpowered me and a few other people on HN are saying that the examples you showed off with opus don't work with the models on neuronpedia, like the russian 'mom is sleeping in the next room' prediction. the haiku examples work though. model too weak?
lightyear tweet media
1 reply · 0 reposts · 0 likes · 166 views
Emmanuel Ameisen@mlpowered·
Interpreting model activations is important to understand why a model is doing what it's doing. Traditionally, we've done this with supervised methods (probing for a specific concept), or unsupervised sparse decompositions (dictionary learning). But probing requires you to know what you are looking for, and sparse dictionaries can be overwhelming to interpret.

NLAs are exciting because they instead generate natural language explanations, which we can then inspect for a variety of behaviors. For example, they reveal the planning behavior we first observed with circuit tracing last year. They also helped identify bugs in Claude's training pipeline, where some prompts were only partially translated.

If you want to play with them, NLAs on open models are available on Neuronpedia! neuronpedia.org/llama3.3-70b-i…
Anthropic@AnthropicAI

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5 replies · 10 reposts · 131 likes · 11.1K views
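To make one of the baselines in the tweet concrete: "sparse decompositions (dictionary learning)" usually means training a sparse autoencoder on activations. A minimal sketch on random stand-in activations (not Anthropic's actual setup; the dimensions and L1 coefficient are arbitrary):

```python
import torch
import torch.nn as nn

# Minimal sparse-autoencoder sketch of dictionary learning on activations:
# reconstruct each activation as a sparse combination of learned feature
# directions. Random vectors stand in for a real residual stream.
torch.manual_seed(0)
d_model, d_dict = 256, 2048
acts = torch.randn(4096, d_model)

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for step in range(200):
    feats = torch.relu(encoder(acts))   # sparse feature activations
    recon = decoder(feats)
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of decoder.weight is one dictionary "feature"; interpreting
# thousands of them is the "overwhelming" part the tweet mentions.
print("dictionary shape:", decoder.weight.shape)  # (d_model, d_dict)
```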
Emmanuel Ameisen retweeted
Harish Kamath@kamath_harish·
Interpreting language models can feel like stumbling through a dark forest - sometimes you just wish you had a flashlight! In our new post, we introduce HeadVis, our latest flashlight for studying attention heads.
Harish Kamath tweet media
3 replies · 32 reposts · 209 likes · 20.4K views
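I don't know what HeadVis does internally, but the raw material any attention-head tool starts from is per-head attention patterns, which are easy to pull yourself. A sketch using GPT-2 as a stand-in model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Pull per-head attention patterns from a forward pass. GPT-2 is a stand-in;
# "eager" attention is needed so the weights are actually materialized.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", attn_implementation="eager")

inputs = tok("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, n_heads, seq, seq) tensor per layer.
layer, head = 5, 3
pattern = out.attentions[layer][0, head]
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for i, t in enumerate(tokens):
    j = pattern[i].argmax().item()
    print(f"{t!r:>8} attends most to {tokens[j]!r}")
```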
Emmanuel Ameisen@mlpowered·
How do LLMs store attributes of entities? And how do they compare different attributes in context? It turns out they mostly store information about a given entity over its own token, which allows for easy lookups. But in addition to the current entity's information, models also store information about the previous entity. That might seem redundant, but it actually enables a model to identify relationships between the current entity and the previous entity in one step!
Paul Bogdan@paulcbogdan

Many LLMs struggle to parse statements like “Alice prepares and Bob consumes food.” Ask them “Who consumes food?” and they'll get it wrong. What’s up with that? We researched whether models can represent multiple entities at once, and if so, why they fail here. 🧵

1 reply · 0 reposts · 5 likes · 763 views
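One concrete way to test the claim above: cache activations at the second entity's token and probe them for both entities' attributes. The sketch below uses random stand-in activations (a real run would cache them from prompts like those in the quoted thread); if both probes succeed on real data, that position carries both entities' information.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins shaped like a real activation cache: one vector per
# prompt, taken at the *second* entity's token position.
rng = np.random.default_rng(0)
n_prompts, d_model = 2000, 768
acts_at_bob = rng.normal(size=(n_prompts, d_model))
alice_attr = rng.integers(0, 2, size=n_prompts)  # previous entity's attribute
bob_attr = rng.integers(0, 2, size=n_prompts)    # current entity's attribute

# On real activations, success of *both* probes would show the position
# carries the previous entity's information too -- the one-step lookup.
for name, y in [("current (Bob)", bob_attr), ("previous (Alice)", alice_attr)]:
    probe = LogisticRegression(max_iter=1000).fit(acts_at_bob, y)
    print(f"{name} probe train accuracy: {probe.score(acts_at_bob, y):.2f}")
```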
Emmanuel Ameisen retweeted
Michael Hanna@michaelwhanna·
Do LMs plan without verbalizing their plans? I'll be at ICLR presenting work with @mlpowered using circuit tracing to reveal latent planning—from choosing "a" vs "an" based on a planned-for word, to rhyming poetry—and how these abilities grow with scale: openreview.net/forum?id=H0B7p…
Michael Hanna tweet media
1 reply · 13 reposts · 96 likes · 4.3K views
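The "a" vs "an" effect also has a cheap behavioral spot-check, separate from circuit tracing: compare the model's probabilities for the two articles at a position where the correct choice depends on a noun it hasn't emitted yet. A sketch with GPT-2 as a stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# If the model prefers " an" here, it has in some sense already committed
# to a vowel-initial noun before emitting the article. GPT-2 is a stand-in
# for the models actually studied in the paper.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("She was hungry, so she ate", return_tensors="pt").input_ids
with torch.no_grad():
    probs = model(ids).logits[0, -1].softmax(-1)

a_id, an_id = tok.encode(" a")[0], tok.encode(" an")[0]
print(f"P(' a')  = {probs[a_id]:.4f}")
print(f"P(' an') = {probs[an_id]:.4f}")
# The mechanistic question is *why*: circuit tracing asks whether a
# representation of the planned noun causally drives the article choice.
```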
Emmanuel Ameisen retweeted
Peter Yang@petergyang·
Made this 30-second video of Claude Design just by pasting in the Claude Design blog post and some tweets from @AnthropicAI employees. Kinda speechless.
113 replies · 96 reposts · 2.1K likes · 419.6K views
Emmanuel Ameisen retweeted
Vals AI@ValsAI·
Anthropic’s Opus 4.7 just seized the #1 spot on the Vals Index with a score of 71.4%, a massive jump from the previous best (67.7%). It also ranks #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE, SWE-Bench, and Terminal Bench 2.
Vals AI tweet media
8 replies · 27 reposts · 243 likes · 26.1K views
Emmanuel Ameisen retweeted
Uzay Macar@uzaymacar·
🧵New Anthropic Fellows research: We studied mechanisms of "introspective awareness" in LLMs. LLMs can sometimes detect steering vectors injected into their residual stream. But is this worthy of being called introspection, or attributable to some uninteresting confound?👇
Uzay Macar tweet media
28 replies · 70 reposts · 427 likes · 45.9K views
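For context on "steering vectors injected into their residual stream": the basic operation is adding a fixed direction to a layer's hidden states during the forward pass, typically via a hook. A minimal sketch with GPT-2 and a random vector (the paper's vectors and injection sites are chosen much more carefully):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Add a vector to one layer's residual stream during the forward pass.
# Random direction + GPT-2 here; the paper chooses both far more carefully.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

steer = 5.0 * torch.randn(model.config.n_embd)  # arbitrary scaled direction

def inject(module, inputs, output):
    hidden = output[0]                    # (batch, seq, d_model)
    return (hidden + steer,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(inject)
ids = tok("Tell me about yourself.", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
# The introspection question: can the model *report* the injection,
# beyond merely having its outputs perturbed by it?
```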
Emmanuel Ameisen retweeted
Anthropic@AnthropicAI·
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing
2K replies · 6.7K reposts · 44.1K likes · 31.2M views
Andrew Lampinen@AndrewLampinen·
Career update: I joined Anthropic (alignment team) this week — exciting place to be at an exciting time!
70 replies · 21 reposts · 1.4K likes · 52.3K views
Emmanuel Ameisen retweeted
Anthropic@AnthropicAI·
We partnered with Mozilla to test Claude's ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025.
Anthropic tweet media
480 replies · 1.4K reposts · 15.1K likes · 3.2M views
Oana Olteanu@oanaolt·
@mlpowered As I’ve told Benn, thanks for respecting our freedoms and for the work you do. Here to support any way I can.
1 reply · 0 reposts · 1 like · 50 views
Emmanuel Ameisen@mlpowered·
I used to bite my tongue and hold my breath. Scared to rock the boat and make a mess. I stood for nothing, so I fell for everything. 🎶
KATY PERRY@katyperry

done

2 replies · 4 reposts · 96 likes · 7.5K views
Emmanuel Ameisen@mlpowered·
Late last year, we found a precise counting mechanism in Claude. This new work by @ummagumm_a and Nikita Balagansky shows that:
- similar mechanisms exist in many models
- we can compare their counting performance by seeing how crisp their representations of the count are!
Viacheslav Sinii@ummagumm_a

1/ 🧵 Reproducing Anthropic’s “counting manifold” result in open-weight LLMs: do they internally track “chars since last \n” to wrap text consistently? huggingface.co/spaces/t-tech/…

2 replies · 6 reposts · 80 likes · 6.4K views
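"How crisp the representation is" can be operationalized as a regression probe: predict characters-since-last-newline from activations and compare held-out fit across models. A sketch on random stand-in activations:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Random stand-ins for per-token activations; a real run would cache
# these from each model while it generates line-wrapped text.
rng = np.random.default_rng(0)
n_tokens, d_model = 5000, 1024
acts = rng.normal(size=(n_tokens, d_model))
chars_since_newline = rng.integers(0, 80, size=n_tokens)  # probe target

X_tr, X_te, y_tr, y_te = train_test_split(
    acts, chars_since_newline, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
# Higher held-out R^2 = crisper count representation; compare across models.
print(f"held-out R^2 = {probe.score(X_te, y_te):.3f}")
```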
Emmanuel Ameisen retweeted
Alex Shaw@alexgshaw·
Yesterday's OpenAI and Anthropic Terminal-Bench 2.0 results used different harnesses. Run both in Terminus 2 ➡️ ~similar scores (within noise). Harnesses matter! Congrats to both teams on incredible models!
Alex Shaw tweet media
14 replies · 12 reposts · 201 likes · 19.3K views
Emmanuel Ameisen retweeted
Subhash Kantamneni@thesubhashk·
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
Subhash Kantamneni tweet media
11 replies · 34 reposts · 212 likes · 27.6K views
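The AO paper's training procedure isn't something that can be reproduced here, but the general family of "ask a model about its own activations" methods (e.g. patchscopes-style approaches, which may or may not match AOs exactly) shares one core move: splice a cached activation into the input of a prompt that asks about it. A heavily hedged sketch of that move with GPT-2 (real methods rescale the activation and train the model to use it; this raw version will mostly produce noise):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1) Cache a mid-layer activation from a source prompt.
src = tok("The Eiffel Tower is in Paris", return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(src, output_hidden_states=True).hidden_states
act = hidden[6][0, -1]  # layer-6 residual stream at the last token

# 2) Splice it into a question prompt by overwriting a placeholder
#    token's embedding, then let the model answer.
q = tok("X. What is X about? X is about", return_tensors="pt").input_ids
with torch.no_grad():
    emb = model.transformer.wte(q).clone()
    emb[0, 0] = act  # replace the first "X" embedding with the activation
    out = model.generate(inputs_embeds=emb, max_new_tokens=10,
                         do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
```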