Ryan Peters

54 posts

@ryanpirl

Reverse engineering intelligent (learning) systems.

Minneapolis, MN · Joined February 2019
83 Following · 162 Followers
Ryan Peters@ryanpirl·
Me: playing peek-a-boo with some random child at the coffee shop 🙈🙉 My brain: "Ah yes, an in-vivo experiment testing object permanence in infants."
0
0
0
48
Ryan Peters@ryanpirl·
Qwen is just a humble bread 🥖🍞
Ryan Peters tweet media
0
3
25
986
Ryan Peters@ryanpirl·
@xanadieu True, but I would hypothesize that even 14b models have the latent potential to introspect on such a task.
0
0
0
84
Ryan Peters@ryanpirl·
Qwen3-14b doesn't seem to be the best at introspection 😭
Ryan Peters tweet media
3
2
62
8.7K
Ryan Peters@ryanpirl·
Digging deeper into introspection, there seem to be two types that I don't see clearly differentiated anywhere:
Behavioral introspection: the model reasons about its own behavior (predicting what it'll do, or knowing what it knows). This can largely be evaled from an API alone.
Mechanistic introspection: the model introspects on its actual activations or circuits (e.g. noticing a concept injected into its residual stream, or faithfully explaining the circuit behind an answer). Evaling this likely requires full access to model internals (e.g. for steering or verifying circuitry).
Humans are notoriously poor at mechanistic introspection (we invented neuroscience because our brains can't introspect their own circuitry), and are also not the best at behavioral introspection (e.g. intention-behavior gap, affective forecasting).
I feel confident that models will surpass us at both behavioral and mechanistic introspection, if they have not already. @TransluceAI's PCDs, @AnthropicAI's NLAs, and activation oracles (Karvonen et al.) hint that mechanistic introspection is possible at a superhuman level.
0
3
22
1.3K
Ryan Peters@ryanpirl·
The SAE feature most correlated with this 'all-caps' concept vector in Qwen3-4b (not 14b) at layer 18 had the description "Internet snippets/excerpts", which I thought was weird. But if you look at the top activating example of this feature, you can clearly see that the description is quite poor and it's actually more of a 'capitalized text' feature.
Ryan Peters tweet media
Ryan Peters tweet media
0
0
6
692
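One way to read "most correlated SAE feature" in the post above is as a ranking of SAE decoder directions by cosine similarity with the concept vector. The sketch below assumes that reading; the post does not say which correlation measure was used, and the vectors here are toy values, not Qwen3-4b activations. The names `cosine` and `rank_features` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_features(concept_vector, decoder_directions, top_k=3):
    """Return (feature_index, similarity) pairs, most similar first.

    Assumption: each SAE feature is represented by one decoder direction
    living in the same space as the concept vector.
    """
    scores = [(cosine(concept_vector, d), idx)
              for idx, d in enumerate(decoder_directions)]
    scores.sort(reverse=True)
    return [(idx, score) for score, idx in scores[:top_k]]

# Toy stand-ins for a concept vector and three decoder directions.
concept = [1.0, 0.0, 1.0]
decoder = [
    [0.0, 1.0, 0.0],  # orthogonal to the concept
    [2.0, 0.0, 2.0],  # same direction, different scale
    [1.0, 0.0, 0.0],  # partially aligned
]
ranked = rank_features(concept, decoder, top_k=2)
print(ranked)
```

A ranking like this only says which direction is closest; as the post points out, the feature's auto-generated description still has to be checked against its top activating examples.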
Ryan Peters@ryanpirl·
In retrospect, this is basically what they did in section 6 of the 'Mechanisms of Introspective Awareness' paper where they trained a 'bias vector for introspection' to see if there exists a steering vector that improves model performance on introspection. Turns out, yes, the steering vector improved performance a lot. They even correlate this vector against SAE features in the appendix. Might reproduce on Qwen3.5. Paper: arxiv.org/pdf/2603.21396
0
0
2
17
Sauers@Sauers_·
@ryanpirl oh wait yeah can you like, do some sort of backwards magic
1
0
0
22
Ryan Peters@ryanpirl·
Neuroscientists doing rigid or piecewise-rigid motion correction on calcium imaging data, check out pycorre. It's an independent reimplementation of NoRMCorre in PyTorch: GPU-accelerated, optional numpy-in/numpy-out, and faster than existing tools.
The figure attached shows scaling curves on a 360×640 calcium imaging video, benchmarked against jnormcorre (NoRMCorre in JAX) and the CaImAn reference:
- On GPU, pycorre runs 3.0× faster than jnormcorre (GPU) and 22.2× faster than CaImAn (CPU-only).
- On CPU, pycorre runs 2.9× faster than jnormcorre (CPU) and 4.6× faster than CaImAn.
Install: `pip install pycorre`
Github: github.com/ryanirl/pycorre
Ryan Peters tweet media
0
0
1
216
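For readers unfamiliar with rigid motion correction, the idea is to find the single whole-frame shift that best aligns each frame to a template. The sketch below is a brute-force, integer-pixel illustration in pure Python on a tiny synthetic frame. It is not pycorre's API and not the NoRMCorre algorithm; the function names, the circular shifting, and the dot-product score are all assumptions made to keep the example small.

```python
def shift_frame(frame, dy, dx):
    """Circularly shift a 2D list-of-lists frame by (dy, dx) pixels."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]

def estimate_rigid_shift(template, frame, max_shift=2):
    """Return the integer (dy, dx) that, applied to `frame`, best matches
    `template`, scored by an unnormalized dot product."""
    best, best_score = (0, 0), float("-inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = shift_frame(frame, dy, dx)
            score = sum(a * b
                        for row_t, row_s in zip(template, shifted)
                        for a, b in zip(row_t, row_s))
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

# Synthetic 5x5 template with two bright pixels of different intensity.
template = [[0] * 5 for _ in range(5)]
template[1][2] = 3
template[3][1] = 1

# Simulate motion: the whole frame moves down 1 pixel and left 1 pixel.
moved = shift_frame(template, 1, -1)

correction = estimate_rigid_shift(template, moved)
corrected = shift_frame(moved, *correction)
print(correction, corrected == template)
```

Piecewise-rigid correction applies the same idea per patch rather than per frame, so different regions of the image can receive different shifts.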
Ryan Peters@ryanpirl·
I love the look of TUIs
Ryan Peters tweet media
0
0
5
393
Tim Kostolansky@thkostolansky·
@ryanpirl nice, any other strongly activating features? curious how it views diff models too
1
0
2
40
Ryan Peters@ryanpirl·
One of the features that fires when you ask Qwen who they are.
Ryan Peters tweet media
5
10
116
16K
Ryan Peters@ryanpirl·
@Sauers_ I'm totally going to try and run this study. Maybe there is also a more principled way to approximate which features, or subset of features, would upweight the correct logprob across the dataset 🤷‍♂️
1
0
2
30
Sauers@Sauers_·
Zero-shot introspection via logprobs
Steer a model toward a target emotion using an SAE feature, then ask it a multiple-choice question of the form "Which of the following SAE features are you currently being steered by?" with five options: the correct target plus four distractors drawn at random from other features. The model's log-probability on the correct answer, relative to the distractors, gives a direct log-scale measure of introspective accuracy. Compare this against an unsteered sham condition to get a Δ log-odds statistic.
Δ log-odds margin (steered − sham)
1
0
1
21
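The Δ log-odds statistic described in the post above can be sketched in a few lines. The post does not spell out how "relative to the distractors" is computed, so this sketch assumes the margin is the correct option's log-probability minus the log of the summed distractor probability. The function names and the example log-probabilities are hypothetical, not measurements from any model.

```python
import math

def log_odds_margin(logprobs, correct):
    """Log-prob of the correct option minus the log of the total
    probability mass on the distractors (an assumed definition)."""
    distractors = [lp for option, lp in logprobs.items() if option != correct]
    # Log-sum-exp over the distractors, for numerical stability.
    peak = max(distractors)
    log_distractor_mass = peak + math.log(
        sum(math.exp(lp - peak) for lp in distractors))
    return logprobs[correct] - log_distractor_mass

def delta_log_odds(steered, sham, correct):
    """Margin under steering minus margin under the unsteered sham."""
    return log_odds_margin(steered, correct) - log_odds_margin(sham, correct)

# Made-up option log-probs for a five-way question where "A" is the target.
sham = {opt: math.log(0.2) for opt in "ABCDE"}            # chance level
steered = {"A": math.log(0.6),
           **{opt: math.log(0.1) for opt in "BCDE"}}      # favors the target

delta = delta_log_odds(steered, sham, correct="A")
print(round(delta, 3))
```

A positive Δ means steering raised the model's odds of naming the correct feature above its unsteered baseline; averaging Δ over many target features and distractor draws would give a dataset-level score.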
Ryan Peters@ryanpirl·
@Sauers_ Would be interesting 🤔. Is the benchmark open sourced? I could run a quick initial test to see if there are any relevant features that fire across all benchmarks.
1
0
2
35
Sauers@Sauers_·
@ryanpirl this gives me an idea. I have a simple multiple-choice introspection bench. Get a list of relevant nodes in these sorts of self-referential prompt graphs »» then ablate (or upregulate) nodes in the circuit to see if any are responsible for introspection capability
2
0
5
574
Ryan Peters@ryanpirl·
@OvinduA I hope to make it all public sometime in the next month or so and will post an update when I do :)
0
0
2
21
Ovindu Atukorala@OvinduA·
@ryanpirl Great, thank you! Would be nice to check out your repo when available. 😅
1
0
2
44
Ryan Peters@ryanpirl·
At its core, it's just Anthropic's circuit-tracer applied to open-sourced Qwen3-4b transcoders (feature descriptions by Neuronpedia). But both the circuit-tracer implementation I am using and the visualization I posted are in-house code that's not public (yet). Circuit tracer: transformer-circuits.pub/2025/attributi…
1
0
5
146
Ovindu Atukorala@OvinduA·
@ryanpirl This is cool, could I know how you generated this? I saw something similar on a Goodfire paper.
1
0
2
261
Grey@greysonbowser·
@ryanpirl What’s the y axis?
1
0
2
395
Ryan Peters@ryanpirl·
This reminds me of a study I did on toy models a while back, where I trained very small 2-layer decoder-only transformers to perform primitive operations on a list of characters (reverse, stride, take the first N items, etc.).
You could show quantitatively that models with few heads would selectively learn some tasks over others, depending on:
- how easy the task was,
- how many of the model's resources (heads, for example) it tied up / demanded,
- its frequency relative to the other tasks in the training set (similar to your findings), and
- whether the learned circuits for one task generalized to others (e.g., learning one task might give you generalization power to other tasks).
As you would expect, scaling up model size and head count resulted in the rarer, more complex tasks getting learned. Interestingly, I remember the loss being surprisingly binary: the model either learned a task or it didn't.
Using interp to reverse-engineer how each task was learned, you could then predict, given N new tasks to fine-tune on, which ones the model would choose to learn and which it would skip. Super cool research! 😁
Christopher Potts@ChrisGPotts

We expect only the larger models to learn the most infrequent tasks. This is exactly what we find. Here are the modular arithmetic task results:

0
1
30
8.4K
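To make the toy setup in the post above concrete, here is a minimal sketch of a mixed-task data generator in which each task appears at its own frequency. The task names come from the post (reverse, stride, take the first N items); everything else, including the function names, the sampling weights, the sequence length, and the alphabet, is a hypothetical choice and not the author's actual training setup.

```python
import random

def reverse(chars, n):
    """Reverse the list (n is unused but keeps a uniform signature)."""
    return chars[::-1]

def stride(chars, n):
    """Keep every n-th item, starting from the first."""
    return chars[::n]

def take_first(chars, n):
    """Keep the first n items."""
    return chars[:n]

TASKS = {"reverse": reverse, "stride": stride, "take": take_first}

def sample_example(rng, weights, length=6, alphabet="abcdefgh"):
    """Draw one (task, n, input, target) tuple; `weights` sets how often
    each task shows up in the training mix."""
    name = rng.choices(list(weights), weights=list(weights.values()))[0]
    n = rng.randint(2, 3)
    chars = [rng.choice(alphabet) for _ in range(length)]
    return name, n, chars, TASKS[name](chars, n)

# Hypothetical mix: "reverse" is common, "take" is rare.
rng = random.Random(0)
weights = {"reverse": 0.7, "stride": 0.25, "take": 0.05}
dataset = [sample_example(rng, weights) for _ in range(1000)]
counts = {name: sum(1 for ex in dataset if ex[0] == name) for name in TASKS}
print(counts)
```

Varying `weights` while holding the model fixed, or varying head count while holding `weights` fixed, is one way to probe which tasks a small model learns and which it skips.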
Celeste (vibecamp jun 18-21)@celestepoasts·
Open invitation for anyone even broadly interested in doing research on scalable interp (activation oracles, NLAs, etc.) to reach out with ideas
31
4
198
15.2K