Joseph Bloom
@JBloomAus

194 posts
White Box Evaluations Lead @ UK AI Safety Institute. Open Source Mechanistic Interpretability. MATS 6.0. ARENA 1.0.

Oxford, England · Joined February 2021
269 Following · 600 Followers

Pinned Tweet
Joseph Bloom reposted
Owain Evans @OwainEvans_UK
New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.
[image]
97 replies · 164 reposts · 994 likes · 153.3K views
Joseph Bloom reposted
Xander Davies @alxndrdavies
The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵
[image]
3 replies · 35 reposts · 233 likes · 67.7K views
Joseph Bloom @JBloomAus
@andyarditi Can I point my telescope at micro-organisms instead? 😆🦠
1 reply · 0 reposts · 1 like · 91 views
Joseph Bloom reposted
Chloe Li @clippocampus
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed self-report fine-tuning (SRFT), a simple supervised technique that increases models’ propensity to do so.
[image]
2 replies · 11 reposts · 59 likes · 15.8K views
Joseph Bloom reposted
Tim Hua 🇺🇦 @Tim_Hua_
Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.
[image]
15 replies · 34 reposts · 245 likes · 59.2K views
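The "subtracting this concept vector" idea above can be sketched in a few lines. This is a toy numpy illustration of mean-difference steering, not the paper's method: the dimensions, data, and `suppress` helper are all invented for the example, and real work operates on a transformer's residual-stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: activations are 8-d vectors, and the "I'm being
# tested" concept is a fixed direction that test-aware prompts add in.
d = 8
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Activations from prompts where the model detects it is being tested
# carry the concept direction; normal prompts do not.
test_acts = rng.normal(size=(100, d)) + 3.0 * concept
normal_acts = rng.normal(size=(100, d))

# Estimate the concept vector as a difference of activation means.
steer = test_acts.mean(axis=0) - normal_acts.mean(axis=0)

def suppress(h, v, alpha=1.0):
    """Subtract the projection of activation h onto direction v."""
    v = v / np.linalg.norm(v)
    return h - alpha * (h @ v) * v

h = test_acts[0]
h_steered = suppress(h, steer)
# The steered activation's component along the estimated concept is ~0.
print(abs(h_steered @ (steer / np.linalg.norm(steer))))
```

In a real model the same subtraction would be applied to hidden states at a chosen layer during the forward pass, with `alpha` tuned so behaviour normalises without otherwise degrading the model.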
Joseph Bloom reposted
Neel Nanda @NeelNanda5
I'm excited that, this year, interpretability finally works well enough to be practically useful in the real world! We found that, with enough effort into dataset construction, simple linear probes are cheap, real-time, token level hallucination detectors and beat baselines
Oscar Balcells Obeso @OBalcells:

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

22 replies · 116 reposts · 1.6K likes · 119.7K views
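A "simple linear probe" on activations, as mentioned above, is just a logistic-regression classifier trained on per-token hidden states. The sketch below is a toy with synthetic 16-d "activations" and an invented hallucination direction; the actual work's value comes from careful dataset construction on real model activations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: per-token activations (16-d) where hallucinated
# tokens (label 1) are shifted along a fixed direction. All of this is
# synthetic; a real probe reads activations from a language model.
d, n = 16, 400
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)  # 1 = hallucinated token
acts = rng.normal(size=(n, d)) + np.outer(labels, direction) * 2.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Train the linear probe with plain gradient descent on logistic loss.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(300):
    p = sigmoid(acts @ w + b)
    w -= lr * acts.T @ (p - labels) / n
    b -= lr * (p - labels).mean()

preds = (sigmoid(acts @ w + b) > 0.5).astype(int)
accuracy = (preds == labels).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Because the probe is a single dot product per token, it can run in real time during generation, which is what makes this approach cheap enough to flag uncertain words as they stream out.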
Joseph Bloom @JBloomAus
@geoffreyirving on linear probes (in the context of my team's recent work)! I really vibe with:
- Interp optimists in AI Safety should share more clearly articulated/nuanced arguments for the optimism.
- Many related questions here benefit from careful science.
Geoffrey Irving @geoffreyirving:

The AISI Whitebox Control Team is doing cool investigations into how well linear probes work, and has a new post sharing nuanced in-progress work. The results are mixed, in interesting ways! Please see Joseph's thread for details! I have only high-level observations. 🧵

0 replies · 0 reposts · 2 likes · 79 views