Joshua Batson

2.1K posts

@thebasepoint

trying to understand evolved systems (🖥 and 🧬) interpretability research @anthropicai formerly @czbiohub, @mit math

Oakland, CA · Joined February 2012

679 Following · 6.2K Followers

Joshua Batson reposted

Chris Olah@ch402·1d

x.com/i/article/2058…

Joshua Batson@thebasepoint·10 May

@wtgowers Another problem of Nathanson, which Joe Gallian assigned to me at the Duluth REU, led to my first research paper. arxiv.org/abs/0710.4605

Timothy Gowers @wtgowers@wtgowers·8 May

Of course, this raises all sorts of questions about what is going to happen to mathematical research, with the impact on PhD students being particularly urgent. I give a few thoughts on this in the blog post, but I don't have anything like complete answers.

Timothy Gowers @wtgowers@wtgowers·8 May

I've recently got in on the act of getting AI to solve open problems in mathematics. More precisely, I gave some questions asked by Melvyn Nathanson to ChatGPT 5.5 Pro, to which I have been given access, and it answered them. 🧵

Joshua Batson reposted

Anthropic@AnthropicAI·7 May

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Joshua Batson reposted

Emmanuel Ameisen@mlpowered·7 May

Interpreting model activations is important to understand why a model is doing what it's doing. Traditionally, we've done this with supervised methods (probing for a specific context), or unsupervised sparse decompositions (dictionary learning). But probing requires you to know what you are looking for, and sparse dictionaries can be overwhelming to interpret. NLAs are exciting because they instead generate natural language explanations, which we can then inspect for a variety of behaviors. For example, they reveal the planning behavior we first observed with circuit tracing last year. They also helped identify bugs in Claude's training pipeline, where some prompts were only partially translated. If you want to play with them, NLAs on open models are available on Neuronpedia! neuronpedia.org/llama3.3-70b-i…

Anthropic@AnthropicAI

Joshua Batson@thebasepoint·6 May

Attention heads are the most fascinating components of transformers: four-fold actions combining keys, values, queries, and outputs to move and transform information. We built the tool for studying them that I wish we'd had years ago!

Harish Kamath@kamath_harish

Interpreting language models can feel like stumbling through a dark forest - sometimes you just wish you had a flashlight! In our new post, we introduce HeadVis, our latest flashlight for studying attention heads.

Joshua Batson reposted

Paul Bogdan@paulcbogdan·1 May

Many LLMs struggle to parse statements like “Alice prepares and Bob consumes food.” Ask them “Who consumes food?” and they'll get it wrong. What’s up with that? We researched whether models can represent multiple entities at once, and if so, why do they fail here? 🧵

Joshua Batson@thebasepoint·17 Apr

@chenru_duan Like the 'subliminal learning' paper, the lead author was an external fellow who did not have white-box access to Claude models.

Chenru Duan@chenru_duan·16 Apr

same for this one: anthropic.com/research/diff-… where is claude?

Chenru Duan@chenru_duan·16 Apr

question: why does anthropic only show models from others (like gpt and grok) and never use claude in publishing research about alignment issues and misbehaviors?

Owain Evans@OwainEvans_UK

Our paper on Subliminal Learning was just published in Nature! Last July we released our preprint. It showed that LLMs can transmit traits (e.g. liking owls) through data that is unrelated to that trait (numbers that appear meaningless). What’s new?🧵

Joshua Batson@thebasepoint·17 Apr

@chenru_duan Our system cards for each released Claude model have 50 pages of analysis of misbehaviors. eg Section 4 of this for mythos: www-cdn.anthropic.com/08ab9158070959…

Joshua Batson@thebasepoint·2 Apr

@_shruti_joshi_ @vpacela @isacama_phys @SimonLacosteJ @klindt_david I think absorption and splitting tend to be driven by correlation structure and feature frequency in the data fwiw, so some thought to how you set those up will be valuable

Shruti Joshi@_shruti_joshi_·2 Apr

@thebasepoint @vpacela @isacama_phys @SimonLacosteJ @klindt_david we haven't tested for feature absorption (or splitting) directly yet, but our setup should support this by evaluating support recovery of, and similarity with dictionary columns (synthetic setting). we'll add these to our repo soon, thanks for the suggestion!

Shruti Joshi@_shruti_joshi_·2 Apr

SAEs fail at OOD tasks. Why? Features in superposition are linearly representable but not linearly accessible. Instead of discarding sparse coding, we embrace the geometry of superposition and use methods equipped to handle the nonlinearity it induces.

Joshua Batson@thebasepoint·2 Apr

@alexeigannon @Jack_W_Lindsey I think of the emotions as starting general (they show up in stories about people the model is trained on) and are only part of the assistant persona after finetuning

Alexei Gannon - ∃∀@alexeigannon·2 Apr

@Jack_W_Lindsey Do you know if these internal representations of emotion generalize outside of the "assistant" persona?

Jack Lindsey@Jack_W_Lindsey·2 Apr

Could an LLM have emotions? It’s hard to say. But when you’re talking to Claude, ChatGPT, or Gemini, you’re not talking to an LLM. You’re talking to a *character* being authored by an LLM. And these characters can, functionally, be driven by internal representations of desperation, or fear, or empathy (with sometimes alarming consequences).

Anthropic@AnthropicAI

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

Joshua Batson@thebasepoint·2 Apr

@_shruti_joshi_ @vpacela @isacama_phys @SimonLacosteJ @klindt_david I expect in practice we are in the far right here, where jumprelu is matching ground truth proved and exceeding DL-FISTA. A scalable DL method would be great to have!

Joshua Batson@thebasepoint·2 Apr

@_shruti_joshi_ @vpacela @isacama_phys @SimonLacosteJ @klindt_david Meta: the grey boxes are very useful here, as a practitioner I appreciate the concrete takeaways and calculations

Joshua Batson reposted

Daniel Carney@four_form·30 Mar

Gravity is probably quantized into gravitons. If not, however, there are experimental consequences. In particular, some level of irreversibility/noise. We finally classified ~all such models and calculated the noise. arxiv.org/abs/2603.26075

Joshua Batson reposted

Nathan Calvin@_NathanCalvin·15 Mar

This passage in the New Yorker piece on the Anthropic DOW conflict yesterday, including a back and forth between the journalist (Gideon Lewis-Kraus) and an anonymous admin official, is gonna stick in my mind for a long time. “We must also remember that Cyberdyne Systems created Skynet for the government. It was supposed to help America dominate its enemies. It didn’t exactly work out as planned. The government thinks this is absurd. But the Pentagon has not tried to build an aligned A.I., and Anthropic has. Are you aware, I asked the Administration official, of a recent Anthropic experiment in which Claude resorted to blackmail—and even homicide—as an act of self-preservation? It had been carried out explicitly to convince people like him. As a member of Anthropic’s alignment-science team told me last summer, “The point of the blackmail exercise was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before.” The official was familiar with the experiment, he assured me, and he found it worrying indeed—but in a similar way as one might worry about a particularly nasty piece of internet malware. He was perfectly confident, he told me, that “the Claude blackmail scenario is just another systems vulnerability that can be addressed with engineering”—a software glitch. Maybe he’s right. We might get only one chance to find out.” I really recommend everyone read both the full New Yorker piece and Anthropic’s research on persona selection (both linked in the replies) and then spend a while sitting with the disconcerting situation we may have found ourselves in.
