Ryan Peters
54 posts

Ryan Peters
@ryanpirl
Reverse engineering intelligent (learning) systems.
Minneapolis, MN Joined February 2019
83 Following 162 Followers

I wonder if qualia steered models would be any better at mechanistic introspection 🤔
Sauers@Sauers_
Qualia steering (OLMo 32B mid-SFT checkpoint) example: unsteered: "I am not a conscious entity. I am a language model . . . . I don't have subjective experiences" steered: "I don't know what it is like to be you, and you don't know what it is like to be me. But I do know what it is like to be me"

@xanadieu True, but I would hypothesize that even 14b models have the latent potential to introspect on such a task.

Digging deeper into introspection, there seem to be two types that I don't see clearly differentiated anywhere:
Behavioral introspection: Where a model reasons about its own behavior (predicting what it'll do, or knowing what it knows). This can largely be evaled from an API alone.
Mechanistic introspection: The model introspects on its actual activations or circuits (e.g. noticing a concept injected into its residual stream, or faithfully explaining the circuit behind an answer). Evaling this likely requires full access to model internals (e.g. for steering or verifying circuitry).
Humans are notoriously poor at mechanistic introspection (we invented neuroscience because our brains can't introspect their own circuitry), and are also not the best at behavioral introspection (e.g. intention-behavior gap, affective forecasting).
I feel confident that models will surpass our own ability to introspect, both behaviorally and mechanistically, if they have not already. @TransluceAI's PCDs, @AnthropicAI's NLAs, and activation oracles (Karvonen et al.) hint at mechanistic introspection being possible at a superhuman level.

The SAE feature most correlated with this 'all-caps' concept vector in Qwen3-4b (not 14b) at layer 18 had the description "Internet snippets/excerpts", which I thought was weird. But if you look at the top activating example of this feature, you can clearly see that the description is quite poor and it's actually more of a 'capitalized text' feature.
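A minimal sketch of how such a ranking could be computed, assuming "most correlated" means cosine similarity between the concept vector and each SAE feature's decoder direction (the post does not say which measure was used). The names `cosine` and `rank_features` and the toy vectors are hypothetical, not taken from any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as uncorrelated with everything.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_features(concept_vec, decoder_dirs, top_k=3):
    """Return (feature_index, similarity) pairs, most similar first.

    Assumption: decoder_dirs[i] is the decoder direction of SAE feature i,
    living in the same residual-stream space as concept_vec.
    """
    scored = [(cosine(concept_vec, d), idx) for idx, d in enumerate(decoder_dirs)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy 3-dimensional example; real vectors would have the model's hidden size.
concept = [1.0, 0.0, 0.0]
features = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
top = rank_features(concept, features, top_k=2)
```

Because the ranking only says which direction is closest, it still makes sense to inspect the top activating examples of the winning feature, since an auto-generated description can be misleading.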



In retrospect, this is basically what they did in section 6 of the 'Mechanisms of Introspective Awareness' paper where they trained a 'bias vector for introspection' to see if there exists a steering vector that improves model performance on introspection. Turns out, yes, the steering vector improved performance a lot. They even correlate this vector against SAE features in the appendix. Might reproduce on Qwen3.5.
Paper: arxiv.org/pdf/2603.21396

Qwen is computers and AI 😄
Ryan Peters@ryanpirl
One of the features that fires when you ask Qwen who they are.

Neuroscientists doing rigid or piecewise-rigid motion correction on calcium imaging data, check out pycorre.
It's an independent reimplementation of NoRMCorre in PyTorch: GPU-accelerated, optional numpy-in/numpy-out, and faster than existing tools.
The figure attached shows scaling curves on a 360×640 calcium imaging video, benchmarked against jnormcorre (NoRMCorre in JAX) and the CaImAn reference:
- On GPU, pycorre runs 3.0× faster than jnormcorre (GPU) and 22.2× faster than CaImAn (CPU-only).
- On CPU, pycorre runs 2.9× faster than jnormcorre (CPU) and 4.6× faster than CaImAn.
Install: `pip install pycorre`
Github: github.com/ryanirl/pycorre
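For readers new to the term, rigid motion correction means finding, for each frame, the single translation that best aligns it to a template. The sketch below is a deliberately crude, pure-Python illustration of that idea using a brute-force search over integer shifts. It is not pycorre's API or algorithm, and the function names `shift_score` and `estimate_rigid_shift` are hypothetical.

```python
def shift_score(frame, template, dy, dx):
    """Mean product of overlapping pixels when `frame` is moved by (dy, dx)."""
    height, width = len(template), len(template[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx  # pixel of the frame that lands on (y, x)
            if 0 <= src_y < height and 0 <= src_x < width:
                total += frame[src_y][src_x] * template[y][x]
                count += 1
    return total / count if count else float("-inf")

def estimate_rigid_shift(frame, template, max_shift=2):
    """Return the integer (dy, dx) that best aligns `frame` to `template`."""
    candidates = (
        (shift_score(frame, template, dy, dx), dy, dx)
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    )
    _, best_dy, best_dx = max(candidates)
    return best_dy, best_dx

# Toy 5x5 example: one bright pixel, displaced by one row and one column.
template = [[0.0] * 5 for _ in range(5)]
template[2][2] = 1.0
frame = [[0.0] * 5 for _ in range(5)]
frame[1][3] = 1.0
shift = estimate_rigid_shift(frame, template)
```

The exhaustive search costs one pass over the image per candidate shift, which is only workable for toy inputs.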


@ryanpirl nice, any other strongly activating features? curious how it views diff models too

@ryanpirl does fire mean activation above some threshold?

@Sauers_ I'm totally going to try and run this study. Maybe there is also a more principled way to approximate which features or subset of features would upweight the correct logprob across the dataset 🤷♂️

Zero-shot introspection via logprobs
Steer a model toward a target emotion using an SAE feature, then ask it a multiple-choice question of the form "Which of the following SAE features are you currently being steered by?" with five options: the correct target plus four distractors drawn at random from other features. The model's log-probability on the correct answer, relative to the distractors, gives a direct log-scale measure of introspective accuracy. Compare this against an unsteered sham condition to get a Δ log-odds statistic.
Δ log-odds margin (steered − sham)
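A minimal sketch of the statistic, assuming "relative to the distractors" means the log-odds of the correct option against the pooled probability of the four distractors. That definition is one reasonable reading, not necessarily the one used for the plot, and the function names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-probabilities."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_odds_correct(option_logprobs, correct_idx):
    """log p(correct) - log p(any distractor).

    Any normalization constant over the options cancels in the difference,
    so the inputs do not need to sum to one.
    """
    distractors = [lp for i, lp in enumerate(option_logprobs) if i != correct_idx]
    return option_logprobs[correct_idx] - logsumexp(distractors)

def delta_log_odds(steered_logprobs, sham_logprobs, correct_idx):
    """Delta log-odds margin: steered condition minus unsteered sham condition."""
    return (log_odds_correct(steered_logprobs, correct_idx)
            - log_odds_correct(sham_logprobs, correct_idx))

# Made-up numbers: the steered model puts 0.6 on the correct option (index 0),
# the sham model is at chance (0.2 on each of five options).
steered = [math.log(p) for p in (0.6, 0.1, 0.1, 0.1, 0.1)]
sham = [math.log(0.2)] * 5
margin = delta_log_odds(steered, sham, correct_idx=0)  # log(1.5) - log(0.25) = log(6)
```

A margin of zero means steering did not change how strongly the model favors the correct feature, and a positive margin means the steered model favors it more than the sham baseline does.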

@Sauers_ Would be interesting 🤔. Is the benchmark open sourced? I could run a quick initial test to see if there are any relevant features that fire across all benchmarks.

@OvinduA I hope to make it all public sometime in the next month or so and will post an update when I do :)

@ryanpirl Great, thank you! Would be nice to check out your repo when available. 😅

At its core, it's just Anthropic's circuit-tracer applied to open-source Qwen3-4b transcoders (feature descriptions by Neuronpedia). But both the circuit-tracer implementation I am using and the visualization I posted are in-house code that's not public (yet).
Circuit tracer: transformer-circuits.pub/2025/attributi…

@ryanpirl This is cool, could I know how you generated this? I saw something similar on a Goodfire paper.
English

This reminds me of a study I did on toy models a while back, where I trained very small 2-layer decoder-only transformers to perform primitive operations on a list of characters (reverse, stride, take the first N items, etc...).
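To make the setup concrete, here is a hedged sketch of how a mixed-task dataset of this kind could be generated. The prompt format, the argument `n`, the alphabet, and the task weights are all assumptions for illustration, not the original study's data pipeline.

```python
import random

# Primitive list operations; `n` is ignored by tasks that do not need it.
TASKS = {
    "reverse": lambda xs, n: xs[::-1],
    "stride": lambda xs, n: xs[::n],   # keep every n-th item
    "take": lambda xs, n: xs[:n],      # keep the first n items
}

def make_example(task, rng, length=6, alphabet="abcdefgh"):
    """Build one (prompt, target) pair for the given task."""
    xs = [rng.choice(alphabet) for _ in range(length)]
    n = rng.randint(2, 3)
    ys = TASKS[task](xs, n)
    prompt = f"{task} {n} : {' '.join(xs)} ->"
    return prompt, " ".join(ys)

def sample_dataset(weights, size, seed=0):
    """Sample tasks in proportion to `weights`, so some tasks are rarer."""
    rng = random.Random(seed)
    names = list(weights)
    chosen = rng.choices(names, weights=[weights[t] for t in names], k=size)
    return [(task, *make_example(task, rng)) for task in chosen]

# Hypothetical mixture: `take` appears far less often than `reverse`.
dataset = sample_dataset({"reverse": 0.7, "stride": 0.2, "take": 0.1}, size=50)
```

Varying the weights and the number of heads in the model is what lets you ask which tasks get learned at a given frequency and capacity.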
You could show quantitatively that models with few heads would selectively learn some tasks over others, depending on: how easy the task was, how many of the model's resources (heads for example) it tied up / demanded, its frequency relative to the other tasks in the training set (similar to your findings), and whether the learned circuits for one task generalized to others (e.g., learning one task might give you generalization power to other tasks).
As you would expect: scaling up model size and head count resulted in the rarer, and more complex, tasks getting learned. Interestingly, I remember the loss being surprisingly binary: the model either learned a task or it didn't.
Using interp to reverse-engineer how each task was learned, you could then predict, given N new tasks to fine-tune on, which ones the model would choose to learn and which it would skip.
Super cool research! 😁
Christopher Potts@ChrisGPotts
We expect only the larger models to learn the most infrequent tasks. This is exactly what we find. Here are the modular arithmetic task results:






