Georg Lange
@_georg_lange

245 posts

Interested in artificial and biological intelligence.

Amsterdam, The Netherlands · Joined September 2021
298 Following · 192 Followers

Pinned Tweet
Georg Lange@_georg_lange·
New mech interp paper 🚨 We often look for interpretable features in LLM activations - nowadays with Sparse Autoencoders - but don't know how to measure how well they work. Here, we developed a new set of metrics that benchmark SAE performance using known circuits👇
Alex Makelov@AMakelov

Sparse autoencoders (SAEs) are a popular method for mech interp - but how do we measure if they're any good at finding the "right" features? In a new paper, we propose more objective SAE evaluations by comparing against "known" features for a previously studied circuit!

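For context on the quoted thread: a sparse autoencoder (SAE) decomposes a model activation into a sparse, overcomplete set of features and reconstructs the activation linearly from them. A minimal sketch with a TopK activation — all names, sizes, and the random parameters are illustrative, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 64, 4  # illustrative sizes, not the paper's

# Random parameters for illustration; a real SAE learns these by
# gradient descent on a reconstruction + sparsity objective.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activation x into a k-sparse feature vector, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    # TopK activation: keep the k largest pre-activations, zero the rest,
    # so the L0 (number of active features) is bounded by k.
    f = np.zeros_like(pre)
    idx = np.argsort(pre)[-k:]
    f[idx] = np.maximum(pre[idx], 0.0)
    return f, f @ W_dec + b_dec

x = rng.normal(size=d_model)
features, x_hat = sae_forward(x)
assert (features != 0).sum() <= k
```

The evaluation question the thread raises is exactly about `features`: low reconstruction error and low L0 are easy to measure, but neither tells you whether the active features correspond to the "right" variables in a known circuit.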
Georg Lange retweeted
Foresight Institute@foresightinst·
How can AI strengthen security, protect privacy, and enable cooperation among a rising plurality of AI systems and humans? Some of the best minds working on this are joining us at our Secure & Sovereign AI Workshop, July 18–19 in Berlin.

Confirmed speakers:
• @jesseposner – Vora
• @robinhanson – George Mason University
• @ml_sudo – Project Sovereign
• Lisa Beckers – Global Technology Risk Foundation
• @socrates1024 – ZCash Foundation
• @FazlBarez – University of Oxford
• @IvanVendrov – Midjourney
• Georgios Kaissis – HPI Potsdam
• Davide Crapis – MystLabs Inc.
• @DanGirsh – WorldCoin
• @ObadiaAlex – ARIA
• @jsotterbach – SPRIND
• @NitzanShulman – Heron AI
• Or Zamir – Blavatnik School of Computer Science
• @dimasquest – PhD, Imperial College London
• @0xQuintus – Flashbots
• John Liagouris – Boston University
• Mariana Meireles – UC Berkeley
• Rob Sison – UNSW Sydney
• @KeithPatarroyo – University of Glasgow
• Dongwon Lee – PSU
• @galmasha – Foresight Fellow
• @EricMoore – PhD, Kennedy Kreiger Institute
• @MorganLvng – TechCongress Fellow
• Abraham Nash – Infinite Zero Foundation
• @JaimeRalV – Apart Research
• @SamuelNellessen – CAIS
• @iamwsubramanyam – CAIS
• Pascal Berrang – Zeroth Research
• @luca_arnaboldi – Zeroth Research
• @mateo_petel – Google Deepmind
• Madeleine Parker – Newfoundation
• @jmartink – Lateral
• @aurelcode – Inversed
• Janabel Xia – Privacy Residency
• Kazik Pogoda – Xemantic
• @Tianyi_Alex_Qiu – Peking University
• @Gunnar_Zarncke – Aintelope
• @_georg_lange
• Eduard Kapelko

If you’re researching or building in AI safety, security, privacy, or decentralized cooperation, apply to attend: foresight.org/events/2026-se…

Sponsored by: @protocollabs
Georg Lange@_georg_lange·
This is joint work with Rick Goldstein and my SPAR mentees Kat Dearstyne and Kamal Maher. Big shoutout to them!
Georg Lange@_georg_lange·
But: Are circuits meaningfully different? Would we draw different conclusions about how LLMs work? Yes! We show that PLTs find real features that CLTs miss. And this changes our circuit story about how LLMs do verbatim memorization.
Georg Lange@_georg_lange·
Attribution circuits give us high-level mechanistic explanations. But can we rely on their results? Not always! We investigate mechanistic (un)faithfulness of Crosslayer Transcoders, and show failure modes that distort the mechanistic story we read off.
Georg Lange@_georg_lange·
I’ve noticed this in myself and I keep warning my mentees about it, so it’s nice to finally see real data from @AnthropicAI.

My hypothesis of what's going on: learning and problem solving take real mental effort but chatting does not, and the brain minimizes effort per reward. For the same task (reward), this creates a subconscious bias to overuse AI tools. The tricky part is that it's hard to notice. It’s not like doomscrolling social media; it still feels like real work.

To avoid this, I’ve gone back to doing these myself:
- write the core code
- write the paper
- debug and really understand the bugs

How I use AI:
- to ask questions about a new codebase or library, find code I want to look at, install stuff
- annoying I/O, logging, documentation, pretty plots
- LaTeX formatting or unimportant appendix text
Anthropic@AnthropicAI

AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job. We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it. anthropic.com/research/AI-as…

Georg Lange@_georg_lange·
Come work with me and @SPARexec to build an AI mech interp researcher to accelerate AI safety research.🧠🔬 In the last cohort, my mentees built AI agents that automatically find and refine explanations for SAE features (demo of what they built after only one month below). In this cohort, we want to push for agents that discover and explain full circuits. Deadline is Jan 14th!⏳🗓️
SPAR@SPARexec

🚀 We're excited to announce that mentee applications are now open for the Spring round of the SPAR research program! This will be our largest round ever, featuring 130+ projects across AI safety, policy, governance, security, welfare, and strategy.

Georg Lange retweeted
Neel Nanda@NeelNanda5·
Due to popular demand, I've extended the deadline on my MATS application by two weeks, until Sept 12. If you didn't have time to apply before, hopefully that's enough time to reconsider! I'm very excited to work with the best people I can
Neel Nanda@NeelNanda5

My Winter MATS applications are open! You'll work full-time writing a mech interp paper supervised by me. Due Aug 29 I've supervised 30+ papers by now (incl 15 top conference papers) but cohorts still get better each time. I'm hyped to see what this cohort achieves! Highlights:

Georg Lange@_georg_lange·
@Jack_W_Lindsey Thanks a lot! Yeah I was wondering about this too. In the limit, you could have a single transform with k_t=65k and get L0=1 and almost perfect MSE but of course no interpretability. Is there an easy way to get a measurement on how interpretable the transforms are?
Jack Lindsey@Jack_W_Lindsey·
@_georg_lange For L0 counting, each active transform counts as 1. Note that whether this is a fair comparison to transcoders depends a lot on how interpretable transforms are. In the plot w/ "65k features" labels, the MOLT runs are parameter-matched, i.e. have less than 65k transforms (sorry this was unclear!)
Jack Lindsey@Jack_W_Lindsey·
Update on a new interpretable decomposition method for LLMs -- sparse mixtures of linear transforms (MOLT). Preliminary evidence suggests they may be more efficient, mechanistically faithful, and compositional than existing techniques like transcoders transformer-circuits.pub/2025/bulk-upda…
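One plausible reading of "sparse mixture of linear transforms", sketched to make the L0 exchange above concrete: a bank of linear maps with a gate that activates only a few of them per input, so L0 counts active transforms rather than scalar features. All names, shapes, and the top-k gating rule here are my own illustrative assumptions, not the actual MOLT architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_transforms, k = 8, 8, 16, 2  # illustrative sizes

# A bank of linear transforms plus one gating direction per transform.
transforms = rng.normal(size=(n_transforms, d_in, d_out)) / np.sqrt(d_in)
gate_dirs = rng.normal(size=(n_transforms, d_in))

def molt_forward(x, k=k):
    """Apply only the k transforms whose gates fire most strongly.
    L0 counts active transforms, not individual output features."""
    scores = gate_dirs @ x
    active = np.argsort(scores)[-k:]
    y = sum(scores[i] * (x @ transforms[i]) for i in active)
    return y, len(active)

x = rng.normal(size=d_in)
y, l0 = molt_forward(x)
assert l0 == k
```

Under this counting, the degenerate case from the conversation is visible: one transform with enormous internal capacity would give L0 = 1 and good reconstruction while explaining nothing, which is why the interpretability of each transform matters for the comparison.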
Georg Lange@_georg_lange·
@Butanium_ @jxmnop Would you recommend BatchTopK for crosslayer transcoders as well? Do you know how it compares to JumpReLU?
Clément Dumas@Butanium_·
@jxmnop agreed! if you want to play with crosscoders, i'd recommend checking our paper / thread explaining why you shouldn't use their L1 penalty and should optimize/enforce the L0 directly instead. we also replicate their finding of cool chat-specific latents 🫡 x.com/Butanium_/stat…
Clément Dumas@Butanium_

New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵

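The "enforce L0 directly" idea in this exchange can be sketched with a BatchTopK-style activation: instead of an L1 penalty that indirectly encourages sparsity, keep the k·batch_size largest pre-activations across the whole batch, which fixes the *average* L0 at k while letting individual examples use more or fewer features. A minimal illustration on random data under my own assumptions, not the paper's implementation:

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest pre-activations across the whole
    batch and zero the rest, so the average per-example L0 is at most k."""
    batch_size, _ = pre_acts.shape
    n_keep = k * batch_size
    # Threshold at the n_keep-th largest value over the flattened batch.
    thresh = np.partition(pre_acts.ravel(), -n_keep)[-n_keep]
    return np.where(pre_acts >= thresh, np.maximum(pre_acts, 0.0), 0.0)

rng = np.random.default_rng(0)
pre = rng.normal(size=(8, 32))      # 8 examples, 32 latents each
acts = batch_topk(pre, k=4)
assert (acts != 0).sum() <= 4 * 8   # average L0 per example is at most 4
```

The contrast with per-example TopK is the point: hard examples can recruit extra latents and easy ones fewer, while the sparsity level stays exactly controlled in aggregate rather than tuned indirectly through a penalty coefficient.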
dr. jack morris@jxmnop·
just learned about "model diffing" from Anthropic. buried in an october blogpost; feels really novel. training a 'crosscoder' between two models of the same family produces interpretable diffs. here post-training clearly adds refusals, QA, math, etc. pretty amazing stuff
Georg Lange@_georg_lange·
📢 Accepted at #ICLR2025! Visit our poster tomorrow morning if you wanna know how good Sparse Autoencoders (SAEs) really are! We propose objective evaluations to measure how good SAEs are at finding the "right" features. 📷 Friday, April 25 🕑 10:00 - 12:30 📍Poster #558
Alex Makelov@AMakelov

Sparse autoencoders (SAEs) are a popular method for mech interp - but how do we measure if they're any good at finding the "right" features? In a new paper, we propose more objective SAE evaluations by comparing against "known" features for a previously studied circuit!

Kording Lab 🦖@KordingLab·
I am writing an app to help students plan their first research paper. It very extensively uses AI, and simultaneously is a document to teach how to do good research. Anyone here who would volunteer to test? DM me please.