Joshua Batson

2.1K posts

@thebasepoint

trying to understand evolved systems (🖥 and 🧬) interpretability research @anthropicai formerly @czbiohub, @mit math

Oakland, CA · Joined February 2012
679 Following · 6.2K Followers
Timothy Gowers (@wtgowers)
Of course, this raises all sorts of questions about what is going to happen to mathematical research, with the impact on PhD students being particularly urgent. I give a few thoughts on this in the blog post, but I don't have anything like complete answers.
5 replies · 20 reposts · 471 likes · 224.6K views
Timothy Gowers (@wtgowers)
I've recently got in on the act of getting AI to solve open problems in mathematics. More precisely, I gave some questions asked by Melvyn Nathanson to ChatGPT 5.5 Pro, to which I have been given access, and it answered them. 🧵
77 replies · 383 reposts · 2K likes · 640K views
Joshua Batson reposted
Anthropic (@AnthropicAI)
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
594 replies · 1.7K reposts · 16.6K likes · 2.5M views
Joshua Batson reposted
Emmanuel Ameisen (@mlpowered)
Interpreting model activations is important to understand why a model is doing what it's doing. Traditionally, we've done this with supervised methods (probing for a specific context), or unsupervised sparse decompositions (dictionary learning). But probing requires you to know what you are looking for, and sparse dictionaries can be overwhelming to interpret. NLAs are exciting because they instead generate natural language explanations, which we can then inspect for a variety of behaviors. For example, they reveal the planning behavior we first observed with circuit tracing last year. They also helped identify bugs in Claude's training pipeline, where some prompts were only partially translated. If you want to play with them, NLAs on open models are available on Neuronpedia! neuronpedia.org/llama3.3-70b-i…
Anthropic (@AnthropicAI)

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

5 replies · 10 reposts · 134 likes · 11.4K views
Joshua Batson (@thebasepoint)
Attention heads are the most fascinating components of transformers: four-fold actions combining keys, values, queries, and outputs to move and transform information. We built the tool for studying them that I wish we'd had years ago!
Harish Kamath (@kamath_harish)

Interpreting language models can feel like stumbling through a dark forest - sometimes you just wish you had a flashlight! In our new post, we introduce HeadVis, our latest flashlight for studying attention heads.

1 reply · 10 reposts · 94 likes · 9.4K views
Joshua Batson reposted
Paul Bogdan (@paulcbogdan)
Many LLMs struggle to parse statements like “Alice prepares and Bob consumes food.” Ask them “Who consumes food?” and they'll get it wrong. What’s up with that? We researched whether models can represent multiple entities at once and, if so, why they fail here. 🧵
8 replies · 11 reposts · 86 likes · 20.8K views
Joshua Batson (@thebasepoint)
@chenru_duan Like the 'subliminal learning' paper, the lead author was an external fellow who did not have white-box access to Claude models.
1 reply · 0 reposts · 1 like · 36 views
Shruti Joshi (@_shruti_joshi_)
@thebasepoint @vpacela @isacama_phys @SimonLacosteJ @klindt_david we haven't tested for feature absorption (or splitting) directly yet, but our setup should support this by evaluating support recovery of, and similarity with, dictionary columns (synthetic setting). we'll add these to our repo soon, thanks for the suggestion!
1 reply · 0 reposts · 2 likes · 40 views
Shruti Joshi (@_shruti_joshi_)
SAEs fail at OOD tasks. Why? Features in superposition are linearly representable but not linearly accessible. Instead of discarding sparse coding, we embrace the geometry of superposition and use methods equipped to handle the nonlinearity it induces.
6 replies · 28 reposts · 202 likes · 40.9K views
Joshua Batson (@thebasepoint)
@alexeigannon @Jack_W_Lindsey I think of the emotions as starting general (they show up in stories about people the model is trained on) and only becoming part of the assistant persona after finetuning
1 reply · 0 reposts · 0 likes · 47 views
Jack Lindsey (@Jack_W_Lindsey)
Could an LLM have emotions? It’s hard to say. But when you’re talking to Claude, ChatGPT, or Gemini, you’re not talking to an LLM. You’re talking to a *character* being authored by an LLM. And these characters can, functionally, be driven by internal representations of desperation, or fear, or empathy (with sometimes alarming consequences).
Anthropic (@AnthropicAI)

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

79 replies · 34 reposts · 415 likes · 44.7K views
Joshua Batson reposted
Daniel Carney (@four_form)
Gravity is probably quantized into gravitons. If not, however, there are experimental consequences. In particular, some level of irreversibility/noise. We finally classified ~all such models and calculated the noise. arxiv.org/abs/2603.26075
40 replies · 46 reposts · 274 likes · 23.2K views
Joshua Batson reposted
Nathan Calvin (@_NathanCalvin)
This passage in the New Yorker piece on the Anthropic DOW conflict yesterday, including a back and forth between the journalist (Gideon Lewis-Kraus) and an anonymous admin official, is gonna stick in my mind for a long time. “We must also remember that Cyberdyne Systems created Skynet for the government. It was supposed to help America dominate its enemies. It didn’t exactly work out as planned. The government thinks this is absurd. But the Pentagon has not tried to build an aligned A.I., and Anthropic has. Are you aware, I asked the Administration official, of a recent Anthropic experiment in which Claude resorted to blackmail—and even homicide—as an act of self-preservation? It had been carried out explicitly to convince people like him. As a member of Anthropic’s alignment-science team told me last summer, “The point of the blackmail exercise was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before.” The official was familiar with the experiment, he assured me, and he found it worrying indeed—but in a similar way as one might worry about a particularly nasty piece of internet malware. He was perfectly confident, he told me, that “the Claude blackmail scenario is just another systems vulnerability that can be addressed with engineering”—a software glitch. Maybe he’s right. We might get only one chance to find out.” I really recommend everyone read both the full New Yorker piece and Anthropic’s research on persona selection (both linked in the replies) and then spend a while sitting with the disconcerting situation we may have found ourselves in.
9 replies · 24 reposts · 228 likes · 136.9K views