Bart Bussmann
@BartBussmann
253 posts

Mechanistic Interpretability Researcher | Trying to forge a brighter future

Amsterdam · Joined January 2020
840 Following · 827 Followers
Bart Bussmann @BartBussmann
Kinda wanna skinny dip in the residual stream ngl
Bart Bussmann @BartBussmann
Parameter decomposition has just started working on LLMs. SAEcels in absolute shambles.
[image]
Quoting Lee Sharkey @leedsharkey

My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)

Bart Bussmann @BartBussmann
@MichalBrzozows2 yeah as somewhat of an SAEcel myself, I feel like I'm allowed to make fun of us!
Bart Bussmann retweeted
Lee Sharkey @leedsharkey
My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)
vals🔸 @ValsTutor
@celestepoasts scooped like they scoop protein powder into their shakers (I dunno if that exact one appears in this format but something like that I've seen around. more importantly I wanted to recommend you go check them out cuz they're funny)
Celeste @celestepoasts
rationality meetup at the gym be like weights and biases
Celeste @celestepoasts
[image]
Bart Bussmann @BartBussmann
Most current LLMs reason in natural language, but we also need to prepare interpretability tools for models that reason in neuralese vectors instead. Using standard interpretability tools, we already find a lot of structure in their reasoning!
Quoting Bartosz Cywinski @bartoszcyw

Can we understand the chain-of-thought (CoT) of latent reasoning LLMs using current mech interp techniques? It turns out we can uncover interpretable structure, at least on simple math problems! In a short study we show that latent vectors represent, e.g., intermediate calculations

Bart Bussmann @BartBussmann
@nabla_theta Interesting! Of the people who didn't know, did most of them have no idea or would they say something like Advanced General Intelligence?
Bart Bussmann @BartBussmann
Sonnet 4.5 really enjoys pranking future instances of itself, and appreciates getting pranked by its previous instances.
[image] [image]
Miles Brundage @Miles_Brundage
TFW there was a typo in the soul
[image]
Bart Bussmann @BartBussmann
@NeelNanda5 @repligate @bartoszcyw Here is an intermediate result we had using Sonnet 4.5. But note that this is tampering with reasoning in its response, not its CoT. Unfortunately (but quite reasonably), Anthropic's API doesn't allow pre-filling of Claude's CoT.
[image]
Bart Bussmann @BartBussmann
@NeelNanda5 @repligate @bartoszcyw Yes, fwiw we tried this a bit with Claude reasoning in a scratchpad and it was often able to identify subtle rephrasings! But we noticed that this scratchpad reasoning was sufficiently different from 'real CoT' to not include it in this post.
Bart Bussmann @BartBussmann
Do current LLMs detect when their thoughts (CoT) are edited? Often not! If you remove or insert random tokens, they rarely notice. But! They're more likely to recognize injected thoughts that conflict with their goals! With @bartoszcyw & @NeelNanda5's team
[image]