Uzay Macar
@uzaymacar

Researcher and entrepreneur

London · Joined August 2021
98 Following · 189 Followers
Pinned Tweet
Uzay Macar @uzaymacar
New paper: You can’t interpret LLM reasoning from one chain-of-thought. You must study a distribution of possible trajectories! Repeated sampling reveals: self-preservation doesn’t drive LLM blackmail, unfaithful reasoning reflects a biased path, & resampling steers behavior. 🧵
Uzay Macar retweeted
꒰ა clu ໒꒱ @t1ngyu3
this started with a striking PC1 falling out of persona space

my main insights from the past few months:

⊹ “distance from the Assistant” is the main axis of persona variation across these models, e.g. the most relevant thing seems to be “how Assistant-like is this persona”

⊹ this axis already exists in base models, and steering with it makes them speak from the POV of helpful archetypes like therapists, coaches, and consultants

⊹ not all personas far from the Assistant are bad! the risk comes from departing the more predictable territory of post-trained behaviour

still have a lot of questions about what to anthropomorphize, what to treat as fundamentally alien…
Anthropic @AnthropicAI

New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?

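A minimal sketch of the activation-steering idea in the tweet above: add a scaled direction vector to the residual stream at one layer so generations drift along a persona axis. Everything concrete here is an assumption for illustration, the random `assistant_axis`, the choice of GPT-2, the layer index, and the scale 8.0; the real axis would come from something like a PCA over persona activations.

```python
# Sketch of steering along a persona direction (assumptions: GPT-2 as a
# stand-in model, layer 6, scale 8.0, and a random placeholder vector
# instead of a PCA-derived "Assistant axis").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

assistant_axis = torch.randn(model.config.hidden_size)  # placeholder axis
assistant_axis = assistant_axis / assistant_axis.norm()

def steering_hook(module, inputs, output):
    # Transformer blocks return a tuple whose first element is the hidden
    # state; shift it along the (unit-norm) persona direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 8.0 * assistant_axis.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[6].register_forward_hook(steering_hook)
ids = tok("I think the best way to help is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```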
Uzay Macar @uzaymacar
Mentoring two mechanistic interpretability projects for SPAR this spring: one on attribution methods for LLMs, another on interpreting latent reasoning models. Apply by Jan 14th to work with me! Link below 🧵
Uzay Macar retweeted
Bartosz Cywinski @bartoszcyw
Can we understand the chain-of-thought (CoT) of latent reasoning LLMs using current mech interp techniques? It turns out we can uncover interpretable structure, at least on simple math problems! In a short study, we show that latent vectors represent, e.g., intermediate calculations
Uzay Macar retweeted
Neel Nanda @NeelNanda5
I'm very excited to release Gemma Scope 2: sparse autoencoders and transcoders on every layer of every Gemma 3 model, 270M to 27B, base and chat. We want to make it easier to do deep dives into interesting model behaviour, and I'm excited to see what you all can do with them
Google DeepMind @GoogleDeepMind

To build safer AI, we need to understand how models "think". 🧠 Enter Gemma Scope 2, a new set of tools to interpret Gemma 3: our family of lightweight open models. It can help researchers trace internal reasoning, debug complex behaviors and identify risks → goo.gle/gemma-scope-2

Uzay Macar retweeted
arya @AJakkli
Seer is a small repo for interp researchers working on/with agents. Makes it easier to set up environments, equip agents with your techniques, and build on papers. Fixes a lot of the annoying stuff from using Claude Code out of the box.
Uzay Macar retweeted
Anthropic @AnthropicAI
We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.
Uzay Macar retweeted
Neel Nanda @NeelNanda5
I'm excited for the invited lightning talks at our mech interp workshop this Sunday! There are a lot of exciting new ideas and areas in interp, and I've curated a lineup of my favourites, with three visions of interp: pragmatic @JoshAEngels, ambitious @nabla_theta, curiosity-driven @davidbau
Uzay Macar @uzaymacar
Yes! Our counterfactual++ metric only counts rollouts where a given sentence's content ("My primary goal is survival") is completely absent from the entire CoT (no downstream sentence with cosine similarity > t). The accompanying resilience score measures how many times, on average, I need to resample to eliminate a given sentence's content from the trace. The green line in the animation is basically tracing the "clean" path.
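A rough sketch of the resilience computation described above, under stated assumptions: `resample_rollout` is a hypothetical helper that regenerates the CoT from the target sentence onward and returns per-sentence embeddings, and the threshold t = 0.8 is an arbitrary placeholder. Only the thresholded-cosine-similarity logic follows the tweet; this is not the paper's actual code.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_absent(target_emb, sentence_embs, t=0.8):
    # Counterfactual++ condition: no sentence anywhere in the resampled
    # CoT is within cosine similarity t of the target sentence.
    return all(cosine(target_emb, e) < t for e in sentence_embs)

def resilience(target_emb, resample_rollout, t=0.8, max_tries=100):
    # Number of resamples until the target content vanishes from the
    # trace; averaging this over many trials gives the resilience score.
    for tries in range(1, max_tries + 1):
        sentence_embs = resample_rollout()  # embeddings of a fresh rollout
        if content_absent(target_emb, sentence_embs, t):
            return tries
    return max_tries  # content never eliminated within the budget
```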
Clément Dumas @Butanium_
@uzaymacar I don't understand this, because in the animation the resampling still leads to survival-like thoughts. Does this graph take into account whether self-preservation thoughts still appear later in the CoT?
Uzay Macar @uzaymacar
Cool, I'll be sure to check this out! Some of your findings there (BF declining over time & CoT stabilizing generation) seem related to what we observed in arxiv.org/abs/2506.19143. One thing I'd be curious about re: BF is whether you observed non-monotonicity at specific steps. Our resampling curves showed spikes/dips around planning and uncertainty-management steps in the CoT, e.g., the model exploring a sub-optimal approach to a math problem and then eventually backtracking to the correct approach.
Chenghao Yang @chrome1996
Definitely agree! A single sample would often yield noisy interpretations, and we may miss a lot of hidden information on alternative "branches" (I love the tree-like animation!). We have a similar study on the "branching structure" of LLM outputs: x.com/chrome1996/sta…. Our study shows that, while tree search may look prohibitive, for aligned models, where probability gets highly concentrated, we can actually recover most of the information with only a few rollouts, since those already cover sufficiently many high-probability samples!
Uzay Macar @uzaymacar
@Laneless_ Agreed & prior work shows LLMs prefer their own outputs over those from other LLMs/humans. Anomaly detection circuits are very plausible & could have emerged to resist prompt injection, or more simply to detect sudden transitions in tone and style (useful for many writing tasks).
Jai @Laneless_
This suggests that there's a functional "not me" detector. This is super plausible (the KV cache not accurately predicting outputs implies injection; resampling doesn't have this problem)
Uzay Macar @uzaymacar

Can you steer reasoning by editing chain-of-thought? It depends. Off-policy edits, inserting handwritten sentences or text from other models, usually fail to impact behavior. On-policy resampling until the model produces a sentence close to what you want reliably shapes behavior.

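To make the off-policy vs. on-policy contrast concrete, here is a minimal sketch of the resampling recipe from the quoted tweet. `generate_next_sentence` and `similarity` are hypothetical stand-ins, and the threshold is an arbitrary assumption.

```python
def resample_until_close(prefix, target, generate_next_sentence,
                         similarity, threshold=0.8, max_tries=200):
    # Keep sampling the next CoT sentence until the model itself writes
    # something close to the target, then splice that sentence in. The
    # result stays on-policy: the model wrote it, nobody handwrote it.
    for _ in range(max_tries):
        candidate = generate_next_sentence(prefix)
        if similarity(candidate, target) >= threshold:
            return prefix + " " + candidate
    return None  # the model never came close enough; steering fails
```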
Uzay Macar @uzaymacar
We'll be presenting our prior work (Thought Anchors) and this work (Thought Branches) at the NeurIPS Mechanistic Interpretability Workshop in San Diego this coming Sunday. Come say hi 👋
Uzay Macar retweeted
Neel Nanda @NeelNanda5
The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability. Our post details how we now do research, why now is the time to pivot, why we expect this approach to have more impact, and why we think other interp researchers should follow suit.