Joseph Bloom

194 posts

Joseph Bloom

@JBloomAus

White Box Evaluations Lead @ UK AI Safety Institute. Open Source Mechanistic Interpretability. MATS 6.0. ARENA 1.0.

Oxford, England Katılım Şubat 2021

269 Takip Edilen600 Takipçiler

Sabitlenmiş Tweet

Joseph Bloom@JBloomAus·9 Ara

Super excited to publish our first paper as the Model Transparency team at @AISecurityInst! We really enjoyed collaborating with @farairesearch and @saprmarks!

Jordan Taylor@JordanTensor

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

English

1.2K

Joseph Bloom retweetledi

Owain Evans@OwainEvans_UK·18 Mar

New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

English

164

994

153.3K

Joseph Bloom retweetledi

Xander Davies@alxndrdavies·6 Mar

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

English

233

67.7K

Joseph Bloom@JBloomAus·10 Ara

@andyarditi Can I point my telescope at micro-organisms instead? 😆🦠

English

Andy Arditi@andyarditi·10 Ara

Be curious like Galileo - point your telescope at the sky.

David Bau@davidbau

At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today. Here is a blog post summarizing the talk: davidbau.com/archives/2025/…

English

1.7K

Joseph Bloom retweetledi

Chloe Li@clippocampus·14 Kas

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed self-report fine-tuning (SRFT), a simple supervised technique that increases models’ propensity to do so.

English

15.8K

Joseph Bloom retweetledi

Tim Hua 🇺🇦@Tim_Hua_·30 Eki

Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.

English

245

59.2K

Joseph Bloom retweetledi

Neel Nanda@NeelNanda5·9 Eyl

I'm excited that, this year, interpretability finally works well enough to be practically useful in the real world! We found that, with enough effort into dataset construction, simple linear probes are cheap, real-time, token level hallucination detectors and beat baselines

Oscar Balcells Obeso@OBalcells

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

English

116

1.6K

119.7K

Joseph Bloom retweetledi

neuronpedia@neuronpedia·5 Ağu

Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️

English

333

63.7K

Joseph Bloom@JBloomAus·16 Tem

Was very happy to contribute to this position paper which makes an important but under-appreciated point! We should not take CoT monitorability for granted!

Mikita Balesni 🇺🇦@balesni

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it: 🧵

English

383

Joseph Bloom@JBloomAus·15 Tem

@geoffreyirving on linear probes (in the context of my team's recent work)! I really vibe with: - Interp optimists in AI Safety should share more clearly articulated/nuanced arguments for the optimism. - Many related questions here benefit from careful science.

Geoffrey Irving@geoffreyirving

The AISI Whitebox Control Team is doing cool investigations into how well linear probes work, and has a new post sharing nuanced in-progress work. The results are mixed, in interesting ways! Please see Joseph's thread for details! I have only high-level observations. 🧵

English

Joseph Bloom@JBloomAus·10 Tem

UK AISI Blog: aisi.gov.uk/work/why-were-… Alignment Forum: tinyurl.com/3svfbysb LessWrong: tinyurl.com/uwhek8mb

English

242

Joseph Bloom@JBloomAus·10 Tem

13/13 Thanks to everyone who helped make this work happen! @JordanTensor @Connor_Kissane @Sid @realmeatyhuman @alexdzm @jacobmerizian Jacob Arbeid @ben_millwood @Alan_Cooney_

English

343

Joseph Bloom@JBloomAus·10 Tem

🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇

AI Security Institute@AISecurityInst

We’ve released a detailed progress update on our white box control work so far! Read it here: alignmentforum.org/posts/pPEeMdgj…

English

10.7K

Keşfet

@AISecurityInst @andyarditi @Anthropic @GoogleDeepMind @GoodfireAI @AiEleuther @decode_research @geoffreyirving