Bart Bussmann
@BartBussmann
253 posts

Mechanistic Interpretability Researcher | Trying to forge a brighter future

Amsterdam · Joined January 2020
840 Following · 827 Followers
Bart Bussmann @BartBussmann
Kinda wanna skinny dip in the residual stream ngl
Bart Bussmann @BartBussmann
Parameter decomposition has just started working on LLMs. SAEcels in absolute shambles.
[image]
Quoting Lee Sharkey @leedsharkey

My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)

Bart Bussmann @BartBussmann
@MichalBrzozows2 yeah as somewhat of an SAEcel myself, I feel like I'm allowed to make fun of us!
Bart Bussmann retweeted
Lee Sharkey @leedsharkey
My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)
vals🔸 @ValsTutor
@celestepoasts scooped like they scoop protein powder into their shakers (I dunno if that exact one appears in this format but something like that I've seen around. more importantly I wanted to recommend you go check them out cuz they're funny)
Celeste @celestepoasts
rationality meetup at the gym be like weights and biases
Celeste @celestepoasts
[image]
Bart Bussmann @BartBussmann
Most current LLMs reason in natural language, but we also need to prepare interpretability tools for models that reason in neuralese vectors instead. Using standard interpretability tools, we already find a lot of structure in their reasoning!
Quoting Bartosz Cywinski @bartoszcyw

Can we understand the chain-of-thought (CoT) of latent reasoning LLMs using current mech interp techniques? It turns out we can uncover interpretable structure, at least on simple math problems! In a short study we show that latent vectors represent, e.g., intermediate calculations

Bart Bussmann @BartBussmann
@nabla_theta Interesting! Of the people who didn't know, did most of them have no idea or would they say something like Advanced General Intelligence?
Bart Bussmann @BartBussmann
Sonnet 4.5 really enjoys pranking future instances of itself, and appreciates getting pranked by its previous instances.
[image] [image]
Miles Brundage @Miles_Brundage
TFW there was a typo in the soul
[image]
Bart Bussmann @BartBussmann
@NeelNanda5 @repligate @bartoszcyw Here is an intermediate result we had using Sonnet 4.5. But note that this is tampering with reasoning in its response, not its CoT. Unfortunately (but quite reasonably), Anthropic's API doesn't allow pre-filling of Claude's CoT.
[image]
Bart Bussmann @BartBussmann
@NeelNanda5 @repligate @bartoszcyw Yes, fwiw we tried this a bit with Claude reasoning in a scratchpad and it was often able to identify subtle rephrasings! But we noticed that this scratchpad reasoning was sufficiently different from 'real CoT' to not include it in this post.
Bart Bussmann @BartBussmann
Do current LLMs detect when their thoughts (CoT) are edited? Often not! If you remove or insert random tokens, they rarely notice. But! They're more likely to recognize injected thoughts that conflict with their goals! With @bartoszcyw & @NeelNanda5's team
[image]