Ryan Peters

54 posts

@ryanpirl

Reverse engineering intelligent (learning) systems.

Minneapolis, MN · Joined February 2019

83 Following · 162 Followers

Ryan Peters@ryanpirl·13h

Me: playing peek-a-boo with some random child at the coffee shop 🙈🙉 My brain: "Ah yes, an in-vivo experiment testing object permanence in infants."

Ryan Peters@ryanpirl·16h

Qwen is just a humble bread 🥖🍞

Ryan Peters@ryanpirl·1d

I wonder if qualia steered models would be any better at mechanistic introspection 🤔

Sauers@Sauers_

Qualia steering (OLMo 32B mid-SFT checkpoint) example: unsteered: "I am not a conscious entity. I am a language model . . . . I don't have subjective experiences" steered: "I don't know what it is like to be you, and you don't know what it is like to be me. But I do know what it is like to be me"

Ryan Peters@ryanpirl·1d

@xanadieu True, but I would hypothesize that even 14b models have the latent potential to introspect on such a task.

𝙓𝙖𝙣𝙖𝙙𝙪@xanadieu·1d

@ryanpirl Well it is a 14b model

Ryan Peters@ryanpirl·2d

Qwen3-14b doesn't seem to be the best at introspection 😭

Ryan Peters@ryanpirl·1d

Digging deeper into introspection, there seem to be two types that I don't see clearly differentiated anywhere:

Behavioral introspection: The model reasons about its own behavior (predicting what it'll do, or knowing what it knows). This can largely be evaled from an API alone.

Mechanistic introspection: The model introspects on its actual activations or circuits (e.g. noticing a concept injected into its residual stream, or faithfully explaining the circuit behind an answer). Evaling this likely requires full access to model internals (e.g. for steering or verifying circuitry).

Humans are notoriously poor at mechanistic introspection (we invented neuroscience because our brains can't introspect their own circuitry), and are also not the best at behavioral introspection (e.g. intention-behavior gap, affective forecasting).

I feel confident that models will surpass us at both behavioral and mechanistic introspection, if they have not already. @TransluceAI's PCDs, @AnthropicAI's NLAs, and activation oracles (Karvonen et al.) hint at mechanistic introspection being possible at a superhuman level.

Ryan Peters@ryanpirl·2d

The most correlated SAE feature with this 'all-caps' concept vector in Qwen3-4b (not 14b) at layer 18 was a feature with description: "Internet snippets/excerpts" which I thought was weird. So, if you look at the top activating example of this feature, you can clearly see that the description is quite poor and it's actually more of a 'capitalized text' feature.
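
The post does not say how "most correlated" was measured. As a minimal sketch, assuming it means cosine similarity between the concept vector and each SAE decoder direction, the ranking could look like this. The function name, array shapes, and random data below are hypothetical and are not taken from Qwen3-4b or any real SAE.

```python
import numpy as np

def top_correlated_features(concept_vec, decoder, k=3):
    """Rank SAE features by cosine similarity to a concept vector.

    Assumption: `decoder` has shape (n_features, d_model), one decoder
    direction per row, and `concept_vec` has shape (d_model,).
    """
    concept = concept_vec / np.linalg.norm(concept_vec)
    dirs = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sims = dirs @ concept                 # cosine similarity per feature
    order = np.argsort(-sims)[:k]         # indices of the k most similar features
    return [(int(i), float(sims[i])) for i in order]

# Synthetic stand-in data (hypothetical): 16 features in an 8-dim residual stream.
rng = np.random.default_rng(0)
decoder = rng.normal(size=(16, 8))
# Build a concept vector that is a slightly noisy copy of feature 5's direction.
concept_vec = decoder[5] + 0.1 * rng.normal(size=8)

ranked = top_correlated_features(concept_vec, decoder)
print(ranked[0])  # the best-matching feature index and its similarity
```

A high similarity only says the directions align; as the post notes, the feature's text description can still be misleading, so checking top-activating examples remains necessary.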

Ryan Peters@ryanpirl·3d

In retrospect, this is basically what they did in section 6 of the 'Mechanisms of Introspective Awareness' paper where they trained a 'bias vector for introspection' to see if there exists a steering vector that improves model performance on introspection. Turns out, yes, the steering vector improved performance a lot. They even correlate this vector against SAE features in the appendix. Might reproduce on Qwen3.5. Paper: arxiv.org/pdf/2603.21396

Sauers@Sauers_·5d

@ryanpirl oh wait yeah can you like, do some sort of backwards magic

Sauers@Sauers_·5d

Qwen is computers and AI 😄

Ryan Peters@ryanpirl

One of the features that fires when you ask Qwen who they are.

Ryan Peters@ryanpirl·4d

Neuroscientists doing rigid or piecewise-rigid motion correction on calcium imaging data, check out pycorre. It's an independent reimplementation of NoRMCorre in PyTorch: GPU-accelerated, optional numpy-in/numpy-out, and faster than existing tools.

The figure attached shows scaling curves on a 360×640 calcium imaging video, benchmarked against jnormcorre (NoRMCorre in JAX) and the CaImAn reference:
- On GPU, pycorre runs 3.0× faster than jnormcorre (GPU) and 22.2× faster than CaImAn (CPU-only).
- On CPU, pycorre runs 2.9× faster than jnormcorre (CPU) and 4.6× faster than CaImAn.

Install: `pip install pycorre`
GitHub: github.com/ryanirl/pycorre

Ryan Peters@ryanpirl·5d

I love the look of TUIs

Ryan Peters@ryanpirl·5d

@thkostolansky Too many strongly activating features 😭

Tim Kostolansky@thkostolansky·5d

@ryanpirl nice, any other strongly activating features? curious how it views diff models too

Ryan Peters@ryanpirl·6d

One of the features that fires when you ask Qwen who they are.

Ryan Peters@ryanpirl·5d

@thkostolansky In this case any non-zero activation.

Tim Kostolansky@thkostolansky·5d

@ryanpirl does fire mean activation above some threshold?

Ryan Peters@ryanpirl·5d

@Sauers_ I'm totally going to try and run this study. Maybe there is also a more principled way to approximate which features, or subset of features, would upweight the correct logprob across the dataset 🤷‍♂️

Sauers@Sauers_·5d

Zero-shot introspection via logprobs

Steer a model toward a target emotion using an SAE feature, then ask it a multiple-choice question of the form "Which of the following SAE features are you currently being steered by?" with five options: the correct target plus four distractors drawn at random from other features. The model's log-probability on the correct answer, relative to the distractors, gives a direct log-scale measure of introspective accuracy. Compare this against an unsteered sham condition to get a Δ log-odds statistic.

[Figure: Δ log-odds margin (steered − sham)]
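
The post does not give an exact formula for the statistic. One plausible reading, sketched here as an assumption, is to take the log-probability of the correct option minus the log-sum-exp of the four distractors (the log-odds of the correct answer within the five options), and then subtract the same quantity from the sham run. The function names and the example probabilities are hypothetical.

```python
import math

def log_odds_margin(option_logprobs, correct_idx):
    """Log-odds of the correct option versus all distractors combined.

    Assumption: margin = logp(correct) - logsumexp(logp(distractors)).
    """
    correct = option_logprobs[correct_idx]
    others = [lp for i, lp in enumerate(option_logprobs) if i != correct_idx]
    m = max(others)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(lp - m) for lp in others))
    return correct - lse

def delta_log_odds(steered, sham, correct_idx):
    """Steered margin minus the unsteered (sham) margin."""
    return log_odds_margin(steered, correct_idx) - log_odds_margin(sham, correct_idx)

# Made-up example: steering moves the correct option from chance (0.2) to 0.6.
steered = [math.log(p) for p in (0.6, 0.1, 0.1, 0.1, 0.1)]
sham = [math.log(0.2)] * 5
delta = delta_log_odds(steered, sham, correct_idx=0)
print(round(delta, 4))  # log(1.5) - log(0.25) = log(6)
```

A positive value means steering made the model more likely to pick the feature it is actually being steered by than it was in the sham condition.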

Ryan Peters@ryanpirl·5d

@Sauers_ Would be interesting 🤔. Is the benchmark open sourced? I could run a quick initial test to see if there are any relevant features that fire across all benchmarks.

Sauers@Sauers_·5d

@ryanpirl this gives me an idea. I have a simple multiple-choice introspection bench. list of relevant nodes in these sorts of self-referential prompt graphs »» then ablate (or upregulate) nodes in the circuit to see if any are responsible for introspection capability

Ryan Peters@ryanpirl·5d

@OvinduA I hope to make it all public sometime in the next month or so and will post an update when I do :)

Ovindu Atukorala@OvinduA·5d

@ryanpirl Great, thank you! Would be nice to check out your repo when available. 😅

Ryan Peters@ryanpirl·5d

At its core, it's just Anthropic's circuit-tracer applied to open-sourced Qwen3-4b transcoders (feature descriptions by Neuronpedia). But both the circuit-tracer implementation I am using and the visualization I posted are in-house code that's not public (yet). Circuit tracer: transformer-circuits.pub/2025/attributi…

Ovindu Atukorala@OvinduA·6d

@ryanpirl This is cool, could I know how you generated this? I saw something similar on a Goodfire paper.

Ryan Peters@ryanpirl·6d

@greysonbowser Layers (+ embedding and logit nodes on both ends)

Grey@greysonbowser·6d

@ryanpirl What’s the y axis?

Ryan Peters@ryanpirl·Jun 1

This reminds me of a study I did on toy models a while back, where I trained very small 2-layer decoder-only transformers to perform primitive operations on a list of characters (reverse, stride, take the first N items, etc...).

You could show quantitatively that models with few heads would selectively learn some tasks over others, depending on: how easy the task was, how many of the model's resources (heads, for example) it tied up, its frequency relative to the other tasks in the training set (similar to your findings), and whether the circuits learned for one task generalized to others.

As you would expect, scaling up model size and head count resulted in the rarer and more complex tasks getting learned. Interestingly, I remember the loss being surprisingly binary: the model either learned a task or it didn't. Using interp to reverse-engineer how each task was learned, you could then predict, given N new tasks to fine-tune on, which ones the model would choose to learn and which it would skip.

Super cool research! 😁
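
The post gives no details of the original dataset, so the following is only an illustrative sketch of how a frequency-weighted mix of such list tasks could be generated. The task names, the `task:n|input>output` string format, and the mixing weights are all hypothetical.

```python
import random

# Hypothetical primitive operations on a list of characters.
TASKS = {
    "reverse": lambda xs, n: xs[::-1],   # reverse the list (n is ignored)
    "stride": lambda xs, n: xs[::n],     # keep every n-th item
    "first_n": lambda xs, n: xs[:n],     # take the first n items
}

def make_example(task, rng, length=6, alphabet="abcdefgh"):
    """Build one training string in an assumed 'task:n|input>output' format."""
    xs = [rng.choice(alphabet) for _ in range(length)]
    n = rng.randint(2, 3)
    ys = TASKS[task](xs, n)
    return f"{task}:{n}|{''.join(xs)}>{''.join(ys)}"

def make_dataset(task_weights, size, seed=0):
    """Sample tasks in proportion to their weights, so some tasks are rare."""
    rng = random.Random(seed)
    names = list(task_weights)
    weights = [task_weights[t] for t in names]
    return [make_example(rng.choices(names, weights)[0], rng) for _ in range(size)]

# Made-up mix: 'reverse' is common, 'first_n' is rare.
data = make_dataset({"reverse": 0.7, "stride": 0.2, "first_n": 0.1}, size=200)
print(data[0])
```

Varying the weights is one way to probe the frequency effect described in the post: a small model trained on this mix may or may not pick up the rare task.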

Christopher Potts@ChrisGPotts

We expect only the larger models to learn the most infrequent tasks. This is exactly what we find. Here are the modular arithmetic task results:

Ryan Peters@ryanpirl·Jun 1

@celestepoasts Am actively working on scaling up interp research

Celeste (vibecamp jun 18-21)@celestepoasts·Jun 1

Open invitation for anyone even broadly interested in doing research regarding scalable interp (activation oracles, nlas) etc... to reach out with ideas
