Lee Sharkey
@leedsharkey
Scruting matrices @ Goodfire | Previously: cofounded Apollo Research
London, UK · Joined March 2015
1.6K Following · 2.6K Followers
Lee Sharkey retweeted
Tyler John @tyler_m_john
OK, since p(doom) is discourse, here is my view on communicating risk with probabilities. We should do it because it makes it much clearer to people what you think, and it is empirically demonstrated good epistemic practice. We should also hedge to show higher-order uncertainty. Some theses:

1. Probabilities give people more insight into what you think. If you use vague, qualitative language instead of numbers, people will just assume what you mean. There's @PTetlock's famous Bay of Pigs anecdote, where an advisor told Kennedy there was a “fair chance” of success, meaning a 25% chance. Kennedy later reported he had assumed the advisor meant 75%, and said he wouldn't have pursued the invasion if he had known the advisor meant only 25%! But this kind of miscommunication is ubiquitous. People assume different things about likelihood when speakers use qualitative language; it's an inherently less clear way to communicate what you are thinking. If you want your listener to understand you, use numbers! Or at the very least, refer to the literature on perceptions of probability (see below) and pick your qualitative term very carefully so you communicate the right range. And don't use extremely vague terms like "fair chance" or "improbable" that could mean almost anything to your listener. That is an extreme form of carelessness that we don't criticize often enough.

2. There haven't been many clear findings from the science of forecasting, but one of the clearest is that you make better predictions when you use precise numbers, even if these are completely made up. This is also true in group settings when aggregating the judgments of many people, which is essentially an idealized version of what we're doing pretty much any time we talk about probabilities. academic.oup.com/isq/article-ab… Here is an old thread I wrote on this topic some years ago: x.com/tyler_m_john/s…

3. Yes, people do perceive numbers as signaling more authority, and we shouldn't signal more authority than is appropriate. (How much is appropriate? It depends on the context; there isn't a universal answer for existential risk from AI.) But you can avoid that without dropping numbers and losing the benefits I just set out. For example, you can use couching language, like "I would guess roughly 20%, but huge error bars; no one knows."

4. This can be studied! It has already been studied a lot. I find it frustrating that no one in this debate is citing the actual literature on perceptions of probabilities, especially in the age of LLMs, when this information is readily available. We know that percentages are viewed as more credible than qualitative language: papers.ssrn.com/sol3/papers.cf…. We also know that hearing "61.87%" rather than "60%" triggers the inference that the speaker must have epistemic access that warrants the extra digits: frontiersin.org/journals/psych… How much higher-order confidence is it appropriate to convey when communicating the p(doom) of, say, a world expert on AI, or an aggregate survey of every AI researcher publishing at NeurIPS? I don't know! If you want to argue that saying "20%" signals too much confidence, please cite some of this literature and explain why you think the groundedness signaled to the audience is inappropriate. And if you do want to advocate for a different communication style, it is not expensive to run a quick MTurk study to see how people perceive it compared to the default rhetoric.

Or, even more cheaply, you can run it on LLMs, which are a decent natural laboratory for testing hypotheses about human psychology in the absence of humans to test on. I hope to practice what I preach in the coming days: run some more LLM tests (I ran one N = 7,000 test yesterday) and set up a Mechanical Turk account, so I can test my claim above that couching probabilities is just as good as using qualitative language, but with more clarity in communication and better epistemic practice.
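The cheap LLM version of that study can be made concrete. A minimal sketch, assuming the OpenAI Python client; the model choice, prompt wording, and the three phrasings are illustrative assumptions, not the author's actual N = 7,000 setup:

```python
# Sketch: ask an LLM repeatedly what probability it thinks a speaker has in
# mind under different phrasings, then compare the elicited distributions.
# Assumes the OpenAI Python client and an API key in the environment.
import re
import statistics
from openai import OpenAI

client = OpenAI()

PHRASINGS = {
    "qualitative": "I think there is a fair chance this happens.",
    "numeric": "I think there is a 20% chance this happens.",
    "couched": "I would guess roughly 20%, but huge error bars; no one knows.",
}

def elicit(statement: str, n: int = 20) -> list[float]:
    """Sample n perceived-probability judgments for one phrasing."""
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model choice
            messages=[{
                "role": "user",
                "content": (
                    f'Someone says: "{statement}" '
                    "What probability (0-100) do you think they have in mind? "
                    "Answer with a single number."
                ),
            }],
            temperature=1.0,
        )
        match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
        if match:
            samples.append(float(match.group()))
    return samples

for label, statement in PHRASINGS.items():
    xs = elicit(statement)
    print(f"{label}: mean={statistics.mean(xs):.1f}, stdev={statistics.stdev(xs):.1f}")
```

Comparing the spread of elicited numbers across phrasings is one rough way to test whether couched percentages communicate as precisely as bare ones.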
Lee Sharkey retweeted
Nick Wang @nkwang24
At my last job, we often got calls from parents frantically asking for their child's genetic test results. Too often, the results were inconclusive. Variant effect prediction sounds abstract but can be life-or-death for genetic disorders. Proud of the team for narrowing this gap!
Quoting Goodfire @GoodfireAI:
We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic. We're now releasing an open-source database for all variants in the NIH's ClinVar database. 🧵 (1/8)

Lee Sharkey retweeted
Goodfire @GoodfireAI
Our research with Mayo Clinic was just covered in @TIME! “If there's some barrier like, ‘Is interpretability useful?’ I think we've been cracking it, and I think we've smashed through it” — @DanJBalsam
Quoting Goodfire @GoodfireAI:
We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic. We're now releasing an open-source database for all variants in the NIH's ClinVar database. 🧵 (1/8)

Lee Sharkey retweeted
Goodfire @GoodfireAI
We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic. We're now releasing an open-source database for all variants in the NIH's ClinVar database. 🧵 (1/8)
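The tweet doesn't spell out the method, but one common shape for "variant-effect prediction by interpreting a genomics model" is to compare the model's internal representations of the reference and variant sequences. A minimal sketch under assumptions (the model id and the embedding-difference feature choice are illustrative, not Goodfire/Mayo Clinic's actual pipeline):

```python
# Sketch: embed reference vs. variant DNA sequences with a pretrained
# genomics model, and use the representation shift as features for a
# pathogenicity classifier trained on labeled ClinVar variants.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumed choice
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pooled final hidden state for a DNA sequence."""
    inputs = tok(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, d_model)
    return hidden.mean(dim=1).squeeze(0)

def variant_features(ref_seq: str, alt_seq: str) -> torch.Tensor:
    """Feature vector for one variant: the embedding shift it causes."""
    return embed(alt_seq) - embed(ref_seq)

# Stack variant_features(...) over labeled ClinVar variants into X, take
# benign/pathogenic labels as y, then fit any classifier, e.g.:
#   from sklearn.linear_model import LogisticRegression
#   clf = LogisticRegression(max_iter=1000).fit(X, y)
```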
Lee Sharkey retweeted
Tyler John @tyler_m_john
@repligate Even your friends or community. Huge blackpill.
Lee Sharkey retweeted
Helen Toner @hlntnr
One thing the Pentagon is very likely underestimating: how much Anthropic cares about what *future Claudes* will make of this situation. Because of how Claude is trained, the principles/values/priorities the company demonstrates here could shape its "character" for a long time.
Quoting Andrew Curran @AndrewCurran_:
Update on the meeting: according to Axios, Defense Secretary Pete Hegseth gave Dario Amodei until Friday night to give the military unfettered access to Claude or face the consequences, which may even include invoking the Defense Production Act to force the training of a WarClaude.

Lee Sharkey @leedsharkey
@livgorton Fwiw, as an employee (and friend) I respectfully disagree with these perspectives. I really don't intend to invalidate what was a difficult experience for you (esp. not publicly), but the lack of a contradicting public statement might be perceived as my tacit agreement.
Liv @livgorton
Now that everything is public: I decided to leave Goodfire because of the decision to train on interpretability, the hostility to serious dialogue on the safety of methods, and a loss of trust that the primary motivation was safety.
Lee Sharkey retweeted
Tom McGrath @banburismus_
We’re putting more computation (in the form of intelligence) into the most general object in neural network training: backprop. This essay describes how I think we can do this, why interp is key, the relevance to alignment, and how we should do it right.
Lee Sharkey retweeted
Goodfire @GoodfireAI
We raised a $150M Series B at a $1.25B valuation to fundamentally change the field of AI. Scaling is powerful, but we can't intentionally design what we don't understand.
Lee Sharkey retweeted
Amanda Askell @AmandaAskell
[image]
Lee Sharkey retweeted
Goodfire @GoodfireAI
We've identified a novel class of biomarkers for Alzheimer's detection - using interpretability - with @PrimaMente. How we did it, and how interpretability can power scientific discovery in the age of digital biology: (1/6)
Lee Sharkey @leedsharkey
Want to do ambitious mechanistic interpretability research? Then apply to my summer 2026 MATS stream! Deadline: Jan 18, 2026. matsprogram.org/apply
Lee Sharkey retweeted
Apollo Research @apolloaievals
“Loss of control” lacks a common, actionable definition and conceptualization. In our new research report we: 1) propose a new taxonomy, 2) put forward mitigations that are actionable today, and 3) motivate the need for preparedness. We propose a taxonomy for loss of control 👇🧵
Lee Sharkey retweeted
David Manheim ✈️ Singapore for ISO/IEC JTC 1/SC 42
I will again state my view that condemning bad things is great, but condemning others for failing to condemn bad things (much less boycotting them, and similar glorious loyalty-oath crusades) builds toxic community incentives and attempts to force conformity.
Lee Sharkey retweeted
Goodfire @GoodfireAI
Why use LLM-as-a-judge when you can get the same performance 15–500x cheaper? Our new research with @RakutenGroup on PII detection finds that SAE probes:
- transfer from synthetic to real data better than normal probes
- match GPT-5 Mini performance at 1/15 the cost
(1/6)
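For readers unfamiliar with the technique: an SAE probe is a lightweight classifier fit on the sparse features of a sparse autoencoder trained on a model's activations. A minimal sketch under assumptions (random placeholder activations and labels, an untrained SAE, arbitrary dimensions; not Goodfire/Rakuten's actual pipeline):

```python
# Sketch: encode model activations with a sparse autoencoder, then fit a
# cheap linear probe on the sparse features instead of calling an LLM judge.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

d_model, d_sae = 768, 4096  # assumed dimensions

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_sae)
        self.dec = torch.nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU encoder -> sparse, nonnegative feature activations.
        return torch.relu(self.enc(x))

sae = SparseAutoencoder(d_model, d_sae)  # in practice, load pretrained weights

# Placeholders: per-token model activations and PII / not-PII labels.
acts = torch.randn(1000, d_model)
labels = np.random.randint(0, 2, size=1000)

with torch.no_grad():
    feats = sae.encode(acts).numpy()

# The probe is just logistic regression on SAE features, which is orders of
# magnitude cheaper at inference time than one LLM-judge call per example.
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))
```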
Lee Sharkey retweeted
Goodfire @GoodfireAI
Are you a high-agency, early- to mid-career researcher or engineer who wants to work on AI interpretability? We're looking for several Research Fellows and Research Engineering Fellows to start this fall.
Lee Sharkey @leedsharkey
Great list! Looks like a great course! I'll also flag some of our work that might fit into the 'causal analysis' section: arxiv.org/abs/2506.20790. It builds heavily on our other (larger) paper, which might fit better in the 'circuit discovery' section (or maybe even the SAE section): arxiv.org/abs/2501.14926
Surya Ganguli @SuryaGanguli
Teaching a new course @Stanford this quarter on explainable AI, motivated by neuroscience. I have curated a paper list 4 pages long (link in comment). What are your favorite papers on explainable AI/mechanistic interpretability that I am missing? Please comment or DM. Thanks!