Uzay Macar
@uzaymacar

Researcher and entrepreneur

London · Joined August 2021
98 Following · 189 Followers
Pinned Tweet
Uzay Macar @uzaymacar
New paper: You can’t interpret LLM reasoning from one chain-of-thought. You must study a distribution of possible trajectories! Repeated sampling reveals: self-preservation doesn’t drive LLM blackmail, unfaithful reasoning reflects a biased path, & resampling steers behavior. 🧵
Uzay Macar retweeted
꒰ა clu ໒꒱ @t1ngyu3
this started with a striking PC1 falling out of persona space

my main insights from the past few months:

⊹ “distance from the Assistant” is the main axis of persona variation across these models, e.g. the most relevant thing seems to be “how Assistant-like is this persona”

⊹ this axis already exists in base models, and steering with it makes them speak from the POV of helpful archetypes like therapists, coaches, and consultants

⊹ not all personas far from the Assistant are bad! the risk comes from departing the more predictable territory of post-trained behaviour

still have a lot of questions about what to anthropomorphize, what to treat as fundamentally alien…
Anthropic @AnthropicAI

New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?

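A minimal sketch of the activation-steering idea in the tweet above: add a scaled direction vector to the residual stream at one layer so generations drift along a persona axis. Everything concrete here is an assumption for illustration, the random `assistant_axis`, the choice of GPT-2, the layer index, and the scale 8.0; the real axis would come from something like a PCA over persona activations.

```python
# Sketch of steering along a persona direction (assumptions: GPT-2 as a
# stand-in model, layer 6, scale 8.0, and a random placeholder vector
# instead of a PCA-derived "Assistant axis").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

assistant_axis = torch.randn(model.config.hidden_size)  # placeholder axis
assistant_axis = assistant_axis / assistant_axis.norm()

def steering_hook(module, inputs, output):
    # Transformer blocks return a tuple whose first element is the hidden
    # state; shift it along the (unit-norm) persona direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 8.0 * assistant_axis.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[6].register_forward_hook(steering_hook)
ids = tok("I think the best way to help is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```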
Uzay Macar @uzaymacar
Mentoring two mechanistic interpretability projects for SPAR this spring: one on attribution methods for LLMs, another on interpreting latent reasoning models. Apply by Jan 14th to work with me! Link below 🧵
Uzay Macar retweeted
Bartosz Cywinski @bartoszcyw
Can we understand the chain-of-thought (CoT) of latent reasoning LLMs using current mech interp techniques? It turns out we can uncover interpretable structure, at least on simple math problems! In a short study, we show that latent vectors represent, e.g., intermediate calculations
Uzay Macar retweeted
Neel Nanda @NeelNanda5
I'm very excited to release Gemma Scope 2: sparse autoencoders and transcoders on every layer of every Gemma 3 model, 270M to 27B, base and chat. We want to make it easier to do deep dives into interesting model behaviour, and I'm excited to see what you all can do with them
Google DeepMind @GoogleDeepMind

To build safer AI, we need to understand how models "think". 🧠 Enter Gemma Scope 2, a new set of tools to interpret Gemma 3: our family of lightweight open models. It can help researchers trace internal reasoning, debug complex behaviors and identify risks → goo.gle/gemma-scope-2

Uzay Macar retweeted
arya @AJakkli
Seer is a small repo for interp researchers working on/with agents. Makes it easier to set up environments, equip agents with your techniques, and build on papers. Fixes a lot of the annoying stuff from using Claude Code out of the box.
Uzay Macar retweeted
Anthropic @AnthropicAI
We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.
Uzay Macar retweeted
Neel Nanda @NeelNanda5
I'm excited for the invited lightning talks at our mech interp workshop this Sunday! There are a lot of exciting new ideas and areas in interp, and I've curated a lineup of my favourites, with three visions of interp: pragmatic @JoshAEngels, ambitious @nabla_theta, curiosity-driven @davidbau
Uzay Macar @uzaymacar
Yes! Our counterfactual++ metric only counts rollouts where a given sentence's content ("My primary goal is survival") is completely absent from the entire CoT (no downstream sentence with cosine similarity > t). The accompanying resilience score measures how many times, on average, I need to resample to eliminate a given sentence's content from the trace. The green line in the animation is basically tracing the "clean" path.
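A rough sketch of the resilience computation described above, under stated assumptions: `resample_rollout` is a hypothetical helper that regenerates the CoT from the target sentence onward and returns per-sentence embeddings, and the threshold t = 0.8 is an arbitrary placeholder. Only the thresholded-cosine-similarity logic follows the tweet; this is not the paper's actual code.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_absent(target_emb, sentence_embs, t=0.8):
    # Counterfactual++ condition: no sentence anywhere in the resampled
    # CoT is within cosine similarity t of the target sentence.
    return all(cosine(target_emb, e) < t for e in sentence_embs)

def resilience(target_emb, resample_rollout, t=0.8, max_tries=100):
    # Number of resamples until the target content vanishes from the
    # trace; averaging this over many trials gives the resilience score.
    for tries in range(1, max_tries + 1):
        sentence_embs = resample_rollout()  # embeddings of a fresh rollout
        if content_absent(target_emb, sentence_embs, t):
            return tries
    return max_tries  # content never eliminated within the budget
```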
Clément Dumas @Butanium_
@uzaymacar I don't understand this, because in the animation the resampling still leads to survival-like thoughts. Does this graph take into account whether self-preservation thoughts still appear later in the CoT?
Uzay Macar @uzaymacar
Cool, I'll be sure to check this out! Some of your findings there (BF declining over time & CoT stabilizing generation) seem related to what we observed in arxiv.org/abs/2506.19143. One thing I'd be curious about re: BF is whether you observed non-monotonicity at specific steps. Our resampling curves showed spikes/dips around planning and uncertainty-management steps in the CoT, e.g., the model exploring a sub-optimal approach to a math problem and then eventually backtracking to the correct approach.
Chenghao Yang @chrome1996
Definitely agree! A single sample would often yield noisy interpretations, and we may miss a lot of hidden information on alternative "branches" (I love the tree-like animation!). We have a similar study on the "branching structure" of LLM outputs: x.com/chrome1996/sta…. Our study shows that, while tree search may look prohibitive, for aligned models, where probability gets highly concentrated, we can actually recover most of the information with only a few rollouts, since those already cover sufficiently many high-probability samples!
Uzay Macar @uzaymacar
@Laneless_ Agreed & prior work shows LLMs prefer their own outputs over those from other LLMs/humans. Anomaly detection circuits are very plausible & could have emerged to resist prompt injection, or more simply to detect sudden transitions in tone and style (useful for many writing tasks).
Jai @Laneless_
This suggests that there's a functional "not me" detector. This is super plausible (the KV cache not accurately predicting outputs implies injection; resampling doesn't have this problem)
Uzay Macar @uzaymacar

Can you steer reasoning by editing chain-of-thought? It depends. Off-policy edits, inserting handwritten sentences or text from other models, usually fail to impact behavior. On-policy resampling until the model produces a sentence close to what you want reliably shapes behavior.

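To make the off-policy vs. on-policy contrast concrete, here is a minimal sketch of the resampling recipe from the quoted tweet. `generate_next_sentence` and `similarity` are hypothetical stand-ins, and the threshold is an arbitrary assumption.

```python
def resample_until_close(prefix, target, generate_next_sentence,
                         similarity, threshold=0.8, max_tries=200):
    # Keep sampling the next CoT sentence until the model itself writes
    # something close to the target, then splice that sentence in. The
    # result stays on-policy: the model wrote it, nobody handwrote it.
    for _ in range(max_tries):
        candidate = generate_next_sentence(prefix)
        if similarity(candidate, target) >= threshold:
            return prefix + " " + candidate
    return None  # the model never came close enough; steering fails
```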
Uzay Macar @uzaymacar
We'll be presenting our prior work (Thought Anchors) and this work (Thought Branches) at the NeurIPS Mechanistic Interpretability Workshop in San Diego this coming Sunday. Come say hi 👋
Uzay Macar retweeted
Neel Nanda @NeelNanda5
The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability. Our post details how we now do research, why now is the time to pivot, why we expect this approach to have more impact, and why we think other interp researchers should follow suit.