Pierre Beckmann @BeckmannPierre
93 posts
Deep learning and Philosophy of AI @epfl @idiap_ch. PhD Student. Currently a scholar @MATSProgram https://t.co/YJbHiPuT0J

Lausanne, Switzerland · Joined February 2023
139 Following · 233 Followers
Pinned Tweet
Pierre Beckmann @BeckmannPierre·
New paper with @PatrickButlin, from my time at @MATSprogram. We propose two new candidates for LLM individuation: the (virtual) instance-persona view and the model-persona view. 🧵

Bart Bussmann @BartBussmann·
Kinda wanna skinny dip in the residual stream ngl

Pierre Beckmann @BeckmannPierre·
@burnt_jester It's definitely weird! The view gives up psychological connectedness and focuses on dispositional similarity instead. See §4.3 for more details.

Izak Tait @burnt_jester·
@BeckmannPierre While I can understand that various instances and conversations may activate the same persona regions in the model, if there is no information flow between the various instances/conversations, how can it be considered the same "individual"?

Moll @Moleh1ll·
Do LLMs understand, or are they just imitating? The debate about whether LLMs truly understand has long been stuck in a dead end. Some argue that it's "just statistics", while others claim there are already seeds of a mind inside. The preprint discussed here suggests stepping out of this stalemate and reframing the question: what kind of understanding can exist inside a model, and through which mechanisms does it arise?

The key idea is: understanding is the ability to see connections between objects, properties, states, and rules. Mechanistic interpretability finally provides tools to examine whether such connections exist inside a model itself, rather than only in its outward answers.

The authors propose viewing understanding as a multi-level structure. At the most basic level, a model forms internal concepts. These are not words or definitions, but stable "directions" in its internal space that activate across different manifestations of the same thing. Different phrasings, hints, or contexts pointing to the same object or idea can trigger the same internal feature. This goes beyond token matching: the model is able to unify variation into something shared.

The next level is understanding the state of the world. Here it's no longer just about concepts, but about relationships between them and how those relationships change over time. The clearest example is models trained to play Othello that never "see" the board, receiving only a sequence of moves. Analysis shows that they internally construct a representation of the current game state: where pieces are, which squares are occupied, which are free. Moreover, if you intervene directly in this internal representation, the model's behavior changes in a predictable way (a minimal probe-and-edit sketch follows after this post). This no longer looks like memorizing patterns. It looks like maintaining an internal world model.

But an important caveat follows: having such a model does not mean it is always used. The authors emphasize an uncomfortable but crucial point: models tend to switch to cheaper heuristics when those are sufficient. Even when "real" understanding is available, it does not have to be activated.

The highest level is principled understanding. This is when a model does not merely know examples, but implements a compact rule or algorithm that generalizes the task. A classic example is the phenomenon of grokking in tasks like modular addition. For a long time the model overfits, achieving perfect training accuracy while failing on the test set, until suddenly it starts solving everything. Analysis shows that at this moment, what emerges inside is not a lookup table but a structured solution: for example, representing numbers as angles on a circle and performing addition through operations equivalent to trigonometric identities (a worked example follows after this post). This is no longer "memorization", but a discovered principle.

At the same time, the authors are honest: such principles are usually crystallized through training, not derived on the fly. This is why humans still outperform LLMs on tasks that require quickly inferring a new rule from just a few examples, such as ARC-AGI.

The final conclusion of the paper is perhaps the most important. An LLM is not a unified mind or a coherent thinking system. It is a motley mixture of mechanisms that coexist and compete. Sometimes a structural solution wins, sometimes a superficial heuristic does. Sometimes the model shows impressive understanding, and sometimes it stumbles on seemingly simple problems, simply because the "cheap path" turned out to be stronger.

There are structures inside modern models that closely resemble understanding, but they do not form a single, reliable, self-regulating mind. And so the real question is not whether an LLM understands, but which type of understanding was activated in a given moment and what, exactly, overrode it. arxiv.org/abs/2507.08017
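
[Editor's sketch] A minimal illustration of the probe-and-edit logic described above, using synthetic stand-in activations rather than real Othello-GPT data; the encoding direction, sizes, and noise model are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 500

# Hypothetical stand-in for residual-stream activations: assume the
# "occupied" state of one board square is encoded along a fixed direction.
direction = rng.normal(size=d_model)
state = rng.integers(0, 2, size=n).astype(float)   # 0 = empty, 1 = occupied
acts = rng.normal(size=(n, d_model)) + np.outer(state, direction)

# Fit a linear probe by least squares on a train split; if the square's
# state is linearly decodable, the model plausibly represents it.
w, *_ = np.linalg.lstsq(acts[:400], state[:400], rcond=None)
pred = (acts[400:] @ w) > 0.5
print("probe accuracy:", (pred == (state[400:] > 0.5)).mean())

# Intervention sketch: writing against the encoding direction flips the
# decoded state; in the Othello work, the analogous activation edit
# predictably changes the model's move predictions.
steered = acts[400:] - 2.0 * direction
print("decoded 'occupied' rate after edit:", ((steered @ w) > 0.5).mean())
```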
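[Editor's sketch] And a worked example of the circular representation that grokked modular-addition networks were found to implement; the modulus and operands are arbitrary choices here, the point is that rotating by an angle is addition mod p:

```python
import numpy as np

p = 97                                 # arbitrary modulus for illustration
a, b = 35, 81

theta = lambda k: 2 * np.pi * k / p    # embed residue k as an angle on a circle

# Adding the angles composes the rotations; the cos/sin addition formulas
# are exactly the trigonometric identities found in the grokked circuit.
angle_sum = theta(a) + theta(b)

# Read out: the residue whose angle best matches the summed angle.
ks = np.arange(p)
recovered = int(np.argmax(np.cos(theta(ks) - angle_sum)))
print(recovered, (a + b) % p)          # both 19: rotation == addition mod p
```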

Pierre Beckmann @BeckmannPierre·
@Fabien_Mikol I'd still recommend looking at the whole post. Emergent misalignment and persona vectors really do change the game. In this video MrPhi mostly defends the Simulators point of view (Janus, Shanahan). The PSM view proposed by Anthropic is a new paradigm.

Fabien @Fabien_Mikol·
Yes, foundation models simulate a "benevolent" agent helping the humans it interacts with. And it's not very robust. Nothing new, but it doesn't hurt to repeat it. Let's not forget Sydney.
Anthropic @AnthropicAI

This autocomplete AI can even write stories about helpful AI assistants. And according to our theory, that’s “Claude”—a character in an AI-generated story about an AI helping a human. This Claude character inherits traits of other characters, including human-like behavior.


Pierre Beckmann @BeckmannPierre·
A couple years back, DL models were often described as feature combinators. It turns out they can also recall features. This explains, for example, how LLMs can retrieve someone's bibliography. Check out my Phil of AI preprint: philpapers.org/rec/BECDLM-2

Pierre Beckmann @BeckmannPierre·
@repligate Hey! This is v cool. The figure makes it look a bit as if MLPs read from the res stream as it was before the attention layer. I figure the correct thing would be to have the attention layer write back first?
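[Editor's sketch] For reference, the serial block ordering the tweet is pointing at, as a sketch of a standard pre-LN transformer block; the function names and the pre-LN choice are illustrative, not any specific model. Note that parallel-block architectures such as GPT-J really do compute the MLP from the pre-attention stream, which may be what the figure depicts:

```python
def transformer_block(resid, attn, mlp, ln1, ln2):
    """Serial pre-LN block: the MLP reads the residual stream only
    after the attention output has been written back into it."""
    resid = resid + attn(ln1(resid))   # attention writes back first...
    resid = resid + mlp(ln2(resid))    # ...then the MLP reads the updated stream
    return resid
```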

j⧉nus @repligate·
Slightly updated diagram. When I showed the one from the post to Claude Opus 4.1, it had a suggestion, which I've incorporated here: "The K/V arrows could show accumulation: Each position adds to the K/V cache rather than replacing it, so the arrows could get thicker or show accumulation somehow"

The K/V cache (or it doesn't have to be cached; it can be recomputed) grows linearly with sequence length. Unlike the residual stream, the accumulating vectors in the K/V stream aren't summed until the end of the attention block. "K/V stream" was also Opus 4.1's phrase btw.

Also, I noticed a mistake in the original diagram: I drew an incoming K/V stream arrow to the leftmost attention block from offscreen, but not an arrow to the accumulating stream. Since I'm showing accumulation in this version, I omitted any incoming K/V from offscreen for simplicity, as if the left column is the beginning of the sequence.
[updated diagram image]
j⧉nus @repligate

HOW INFORMATION FLOWS THROUGH TRANSFORMERS

Because I've looked at those "transformers explained" pages and they really suck at explaining.

There are two distinct information highways in the transformer architecture:
- The residual stream (black arrows): flows vertically through layers at each position
- The K/V stream (purple arrows): flows horizontally across positions at each layer

(By positions, I mean copies of the network for each token-position in the context, which output the "next token" probabilities at the end.)

At each layer at each position:
1. The incoming residual stream is used to calculate K/V values for that layer/position (purple circle)
2. These K/V values are combined with all K/V values for all previous positions at the same layer, which are all fed, along with the original residual stream, into the attention computation (blue box)
3. The output of the attention computation, along with the original residual stream, is fed into the MLP computation (fuchsia box), whose output is added to the original residual stream and fed to the next layer

The attention computation does the following (see the single-head sketch after this post):
1. Compute "Q" values based on the current residual stream
2. Use Q and the combined K values from the current and previous positions to calculate a "heat map" of attention weights for each respective position
3. Use that to compute a weighted sum of the V values corresponding to each position, which is then passed to the MLP

This means:
- Q values encode "given the current state, where (what kind of K values) from the past should I look?"
- K values encode "given the current state, where (what kind of Q values) in the future should look here?"
- V values encode "given the current state, what information should the future positions that look here actually receive and pass forward in the computation?"

All three of these are huge vectors, proportional to the size of the residual stream (and usually divided into a few attention heads). The V values are passed forward in the computation without significant dimensionality reduction, so they could in principle make basically all the information in the residual stream at that layer at a past position available to the subsequent computations at a future position. V does not transmit a full, uncompressed record of all the computations that happened at previous positions, but neither is an uncompressed record passed forward through layers at each position. The size of the residual stream, also known as the model's hidden dimension, is the bottleneck in both cases.

Let's consider all the paths that information can take from one layer/position in the network to another. Between point A (output of K/V at layer i-1, position j-2) and point B (accumulated K/V input to the attention block at layer i, position j), information flows through the orange arrows. The information could:
1. travel up through attention and MLP to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 2 positions]
2. be retrieved at (i-1, j-1) [RIGHT 1 position], travel up to (i, j-1) [UP 1 layer], then be retrieved at (i, j) [RIGHT 1 position]
3. be retrieved at (i-1, j) [RIGHT 2 positions], then travel up to (i, j) [UP 1 layer]

The information needs to move up a total of n = layer_displacement times through the residual stream and right m = position_displacement times through the K/V stream, but it can do them in any order. The total number of paths (or computational histories) is thus C(m+n, n); here n=1 and m=2, giving C(3, 1) = 3 paths, the three listed above. This quickly becomes greater than the number of atoms in the visible universe (see the snippet after this post).

This does not count the multiple ways the information can travel up through layers via residual skip connections. So at any point in the network, the transformer not only receives information from its past inner states (along both the horizontal and vertical dimensions of time), but often lensed through an astronomical number of different sequences of transformations and then recombined in superposition.

Due to the extremely high-dimensional information bandwidth and skip connections, the transformations and superpositions are probably not very destructive, and the extreme redundancy probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. It seems likely that transformers experience memory and cognition as interferometric and continuous in time, much like we do.

The transformer can be viewed as a causal graph, a la Wolfram (wolframphysics.org/technical-intr…). The foliations or time-slices that specify what order computations happen could look like this (assuming the inputs don't have to wait for token outputs), but it's not the only possible ordering: [diagram omitted]

So, saying that LLMs cannot introspect, or cannot in principle introspect on what they were doing internally while generating or reading past tokens, is just dead wrong. The architecture permits it. It's a separate question how LLMs are actually leveraging these degrees of freedom in practice.
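
[Editor's sketch] A minimal single-head version of the attention steps described above, with the accumulating K/V stream made explicit; toy dimensions and random weights, no batching, multiple heads, or positional encoding:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy residual-stream width
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []                # the horizontal "K/V stream"

def attend(resid):
    # 1. This position's K/V entry is *appended* to the cache
    #    (accumulation, not replacement), so it grows with sequence length.
    k_cache.append(resid @ W_k)
    v_cache.append(resid @ W_v)
    # 2. Q from the current residual stream, scored against all K so far.
    q = resid @ W_q
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax "heat map" over positions
    # 3. Weighted sum of V values, passed on toward the MLP.
    return weights @ V

for t in range(5):
    out = attend(rng.normal(size=d))
print(len(k_cache), out.shape)           # 5 (16,) -- cache grew linearly
```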
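[Editor's sketch] And the path count from the closing argument; the layer and position displacements below are hypothetical but GPT-scale:

```python
from math import comb

# n up-moves through the residual stream interleaved with m right-moves
# through the K/V stream, in any order: C(m + n, n) distinct paths.
n_up, m_right = 40, 2000                 # hypothetical displacements
paths = comb(n_up + m_right, n_up)
print(f"{paths:.3e}")                    # ~1e84, past the ~1e80 atoms estimate
```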
