Georg Lange
@_georg_lange

245 posts

Interested in artificial and biological intelligence.

Amsterdam, The Netherlands · Joined September 2021
298 Following · 192 Followers

Pinned Tweet
Georg Lange@_georg_lange·
New mech interp paper 🚨 We often look for interpretable features in LLM activations - nowadays with Sparse Autoencoders - but don't know how to measure how well they work. Here, we developed a new set of metrics that benchmark SAE performance using known circuits👇
Alex Makelov@AMakelov

Sparse autoencoders (SAEs) are a popular method for mech interp - but how do we measure if they're any good at finding the "right" features? In a new paper, we propose more objective SAE evaluations by comparing against "known" features for a previously studied circuit!

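For context on the quoted thread: a sparse autoencoder (SAE) decomposes a model activation into a sparse, overcomplete set of features and reconstructs the activation linearly from them. A minimal sketch with a TopK activation — all names, sizes, and the random parameters are illustrative, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 64, 4  # illustrative sizes, not the paper's

# Random parameters for illustration; a real SAE learns these by
# gradient descent on a reconstruction + sparsity objective.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activation x into a k-sparse feature vector, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    # TopK activation: keep the k largest pre-activations, zero the rest,
    # so the L0 (number of active features) is bounded by k.
    f = np.zeros_like(pre)
    idx = np.argsort(pre)[-k:]
    f[idx] = np.maximum(pre[idx], 0.0)
    return f, f @ W_dec + b_dec

x = rng.normal(size=d_model)
features, x_hat = sae_forward(x)
assert (features != 0).sum() <= k
```

The evaluation question the thread raises is exactly about `features`: low reconstruction error and low L0 are easy to measure, but neither tells you whether the active features correspond to the "right" variables in a known circuit.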
Georg Lange retweeted
Foresight Institute@foresightinst·
How can AI strengthen security, protect privacy, and enable cooperation among a rising plurality of AI systems and humans? Some of the best minds working on this are joining us at our Secure & Sovereign AI Workshop, July 18–19 in Berlin.

Confirmed speakers:
• @jesseposner – Vora
• @robinhanson – George Mason University
• @ml_sudo – Project Sovereign
• Lisa Beckers – Global Technology Risk Foundation
• @socrates1024 – ZCash Foundation
• @FazlBarez – University of Oxford
• @IvanVendrov – Midjourney
• Georgios Kaissis – HPI Potsdam
• Davide Crapis – MystLabs Inc.
• @DanGirsh – WorldCoin
• @ObadiaAlex – ARIA
• @jsotterbach – SPRIND
• @NitzanShulman – Heron AI
• Or Zamir – Blavatnik School of Computer Science
• @dimasquest – PhD, Imperial College London
• @0xQuintus – Flashbots
• John Liagouris – Boston University
• Mariana Meireles – UC Berkeley
• Rob Sison – UNSW Sydney
• @KeithPatarroyo – University of Glasgow
• Dongwon Lee – PSU
• @galmasha – Foresight Fellow
• @EricMoore – PhD, Kennedy Kreiger Institute
• @MorganLvng – TechCongress Fellow
• Abraham Nash – Infinite Zero Foundation
• @JaimeRalV – Apart Research
• @SamuelNellessen – CAIS
• @iamwsubramanyam – CAIS
• Pascal Berrang – Zeroth Research
• @luca_arnaboldi – Zeroth Research
• @mateo_petel – Google Deepmind
• Madeleine Parker – Newfoundation
• @jmartink – Lateral
• @aurelcode – Inversed
• Janabel Xia – Privacy Residency
• Kazik Pogoda – Xemantic
• @Tianyi_Alex_Qiu – Peking University
• @Gunnar_Zarncke – Aintelope
• @_georg_lange
• Eduard Kapelko

If you’re researching or building in AI safety, security, privacy, or decentralized cooperation, apply to attend: foresight.org/events/2026-se…

Sponsored by: @protocollabs
Georg Lange@_georg_lange·
This is joint work with Rick Goldstein and my SPAR mentees Kat Dearstyne and Kamal Maher. Big shoutout to them!
Georg Lange@_georg_lange·
But: Are circuits meaningfully different? Would we draw different conclusions about how LLMs work? Yes! We show that PLTs find real features that CLTs miss. And this changes our circuit story about how LLMs do verbatim memorization.
Georg Lange@_georg_lange·
Attribution circuits give us high-level mechanistic explanations. But can we rely on their results? Not always! We investigate mechanistic (un)faithfulness of Crosslayer Transcoders, and show failure modes that distort the mechanistic story we read off.
Georg Lange@_georg_lange·
I’ve noticed this in myself and I keep warning my mentees about it, so it’s nice to finally see real data from @AnthropicAI.

My hypothesis of what's going on: learning and problem solving take real mental effort but chatting does not, and the brain minimizes effort per reward. For the same task (reward), this creates a subconscious bias to overuse AI tools. The tricky part is that it's hard to notice. It’s not like doomscrolling social media; it still feels like real work.

To avoid this, I’ve gone back to doing these myself:
- write the core code
- write the paper
- debug and really understand the bugs

How I use AI:
- to ask questions about a new codebase or library, find code I want to look at, install stuff
- annoying I/O, logging, documentation, pretty plots
- LaTeX formatting or unimportant appendix text
Anthropic@AnthropicAI

AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job. We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it. anthropic.com/research/AI-as…

Georg Lange@_georg_lange·
Come work with me and @SPARexec to build an AI mech interp researcher to accelerate AI safety research.🧠🔬 In the last cohort, my mentees built AI agents that automatically find and refine explanations for SAE features (demo of what they built after only one month below). In this cohort, we want to push for agents that discover and explain full circuits. Deadline is Jan 14th!⏳🗓️
SPAR@SPARexec

🚀 We're excited to announce that mentee applications are now open for the Spring round of the SPAR research program! This will be our largest round ever, featuring 130+ projects across AI safety, policy, governance, security, welfare, and strategy.

Georg Lange retweeted
Neel Nanda@NeelNanda5·
Due to popular demand, I've extended the deadline on my MATS application by two weeks, until Sept 12. If you didn't have time to apply before, hopefully that's enough time to reconsider! I'm very excited to work with the best people I can
Neel Nanda@NeelNanda5

My Winter MATS applications are open! You'll work full-time writing a mech interp paper supervised by me. Due Aug 29 I've supervised 30+ papers by now (incl 15 top conference papers) but cohorts still get better each time. I'm hyped to see what this cohort achieves! Highlights:

Georg Lange@_georg_lange·
@Jack_W_Lindsey Thanks a lot! Yeah I was wondering about this too. In the limit, you could have a single transform with k_t=65k and get L0=1 and almost perfect MSE but of course no interpretability. Is there an easy way to get a measurement on how interpretable the transforms are?
Jack Lindsey@Jack_W_Lindsey·
@_georg_lange For L0 counting, each active transform counts as 1. Note that whether this is a fair comparison to transcoders depends a lot on how interpretable transforms are. In the plot w/ "65k features" labels, the MOLT runs are parameter-matched, i.e. have less than 65k transforms (sorry this was unclear!)
Jack Lindsey@Jack_W_Lindsey·
Update on a new interpretable decomposition method for LLMs -- sparse mixtures of linear transforms (MOLT). Preliminary evidence suggests they may be more efficient, mechanistically faithful, and compositional than existing techniques like transcoders transformer-circuits.pub/2025/bulk-upda…
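One plausible reading of "sparse mixture of linear transforms", sketched to make the L0 exchange above concrete: a bank of linear maps with a gate that activates only a few of them per input, so L0 counts active transforms rather than scalar features. All names, shapes, and the top-k gating rule here are my own illustrative assumptions, not the actual MOLT architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_transforms, k = 8, 8, 16, 2  # illustrative sizes

# A bank of linear transforms plus one gating direction per transform.
transforms = rng.normal(size=(n_transforms, d_in, d_out)) / np.sqrt(d_in)
gate_dirs = rng.normal(size=(n_transforms, d_in))

def molt_forward(x, k=k):
    """Apply only the k transforms whose gates fire most strongly.
    L0 counts active transforms, not individual output features."""
    scores = gate_dirs @ x
    active = np.argsort(scores)[-k:]
    y = sum(scores[i] * (x @ transforms[i]) for i in active)
    return y, len(active)

x = rng.normal(size=d_in)
y, l0 = molt_forward(x)
assert l0 == k
```

Under this counting, the degenerate case from the conversation is visible: one transform with enormous internal capacity would give L0 = 1 and good reconstruction while explaining nothing, which is why the interpretability of each transform matters for the comparison.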
Georg Lange@_georg_lange·
@Butanium_ @jxmnop Would you recommend BatchTopK for crosslayer transcoders as well? Do you know how it compares to JumpReLU?
Clément Dumas@Butanium_·
@jxmnop agreed! if you want to play with crosscoders, i'd recommend checking our paper / thread explaining why you shouldn't use their L1 penalty and should optimize/enforce the L0 directly instead. we also replicate their finding of cool chat-specific latents 🫡 x.com/Butanium_/stat…
Clément Dumas@Butanium_

New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵

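The "enforce L0 directly" idea in this exchange can be sketched with a BatchTopK-style activation: instead of an L1 penalty that indirectly encourages sparsity, keep the k·batch_size largest pre-activations across the whole batch, which fixes the *average* L0 at k while letting individual examples use more or fewer features. A minimal illustration on random data under my own assumptions, not the paper's implementation:

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest pre-activations across the whole
    batch and zero the rest, so the average per-example L0 is at most k."""
    batch_size, _ = pre_acts.shape
    n_keep = k * batch_size
    # Threshold at the n_keep-th largest value over the flattened batch.
    thresh = np.partition(pre_acts.ravel(), -n_keep)[-n_keep]
    return np.where(pre_acts >= thresh, np.maximum(pre_acts, 0.0), 0.0)

rng = np.random.default_rng(0)
pre = rng.normal(size=(8, 32))      # 8 examples, 32 latents each
acts = batch_topk(pre, k=4)
assert (acts != 0).sum() <= 4 * 8   # average L0 per example is at most 4
```

The contrast with per-example TopK is the point: hard examples can recruit extra latents and easy ones fewer, while the sparsity level stays exactly controlled in aggregate rather than tuned indirectly through a penalty coefficient.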
dr. jack morris@jxmnop·
just learned about "model diffing" from Anthropic. buried in an october blogpost; feels really novel. training a 'crosscoder' between two models of the same family produces interpretable diffs. here post-training clearly adds refusals, QA, math, etc. pretty amazing stuff
Georg Lange@_georg_lange·
📢 Accepted at #ICLR2025! Visit our poster tomorrow morning if you wanna know how good Sparse Autoencoders (SAEs) really are! We propose objective evaluations to measure how good SAEs are at finding the "right" features. 📷 Friday, April 25 🕑 10:00 - 12:30 📍Poster #558
Alex Makelov@AMakelov

Sparse autoencoders (SAEs) are a popular method for mech interp - but how do we measure if they're any good at finding the "right" features? In a new paper, we propose more objective SAE evaluations by comparing against "known" features for a previously studied circuit!

Kording Lab 🦖@KordingLab·
I am writing an app to help students plan their first research paper. It very extensively uses AI, and simultaneously is a document to teach how to do good research. Anyone here who would volunteer to test? DM me please.