Jordan Taylor (@JordanTensor) - Twitter Profile

Pinned Tweet

Jordan Taylor@JordanTensor·17 May

I'm keen to share our new library for explaining more of a machine learning model's performance more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill which I helped out with during @MATSprogram: 🧵1/8

Lee Sharkey@leedsharkey

Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ publications.apolloresearch.ai/end_to_end_spa… Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Our SAEs explain significantly more performance than before! 1/

Jordan Taylor@JordanTensor·10 Feb

Check out the post here: lesswrong.com/posts/SawczP2p… This was done in a sprint as part of the Model Transparency team here at AISI

Jordan Taylor@JordanTensor·10 Feb

Overall, I think combining these kinds of misalignment continuation evaluations with good solutions to eval awareness (and "this history wasn't generated by me" prefill awareness) could be powerful.

Jordan Taylor@JordanTensor·10 Feb

I've made a little propensity eval testing whether models continue with misaligned actions after they've been started. 🧵

Jordan Taylor retweeted

Geoffrey Irving@geoffreyirving·18 Dec

New report on trends in AISI's evaluations of frontier AI models over the past two years. A lot of AI discourse focuses on viral moments, but it is important to zoom out to the less flashy trend: AI models are steadily growing in capabilities, including dual-use capabilities.

AI Security Institute@AISecurityInst

📈 Today, we’re releasing our first Frontier AI Trends Report: evaluation results on 30+ frontier models from the past two years, showing rapid progress in chemistry and biology, cyber capabilities, autonomy, and more. ▶️Read now: aisi.gov.uk/frontier-ai-tr…

Jordan Taylor retweeted

Asa Cooper Stickland@AsaCoopStick·16 Dec

NEW PAPER: Could an LLM agent subtly sabotage your code? We conducted a red-blue team game where the red team designed agents to sabotage, and the blue team designed monitors to catch the agent. Three surprising results ahead 🧵🛳️

AI Security Institute@AISecurityInst

AI coding agents are increasingly writing production code - with tools, file access, and execution permissions. That power accelerates development, but also introduces new security risks if agents act against user intent 🧵

Jordan Taylor retweeted

AI Security Institute@AISecurityInst·10 Dec

Can we rely on the results of AI evaluations? The phenomenon of sandbagging – where AI models underperform during testing – could undermine our understanding of their capabilities. We recently conducted an auditing game to assess 10 methods for sandbagging detection 🧵

Jordan Taylor retweeted

Thomas Read@thjread·10 Dec

First paper from my team at UK AISI! Excited to have this out there - we have some really great model organisms of conditional underperformance, and tried a lot of different detection techniques to see what works in an adversarial setting

Jordan Taylor@JordanTensor

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

Jordan Taylor retweeted

Teun van der Weij@Teun_vd_Weij·9 Dec

Great work by UK AISI, pushing the science of blue teams vs red teams in the context of sandbagging!

Jordan Taylor@JordanTensor

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

Jordan Taylor retweeted

Joseph Bloom@JBloomAus·9 Dec

Super excited to publish our first paper as the Model Transparency team at @AISecurityInst! We really enjoyed collaborating with @farairesearch and @saprmarks!

Jordan Taylor@JordanTensor

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

Jordan Taylor@JordanTensor·9 Dec

17/17 TL;DR: We stress-tested sandbagging detection. Black-box methods were defeated by imitative training. Probes helped but were vulnerable. On-distribution training was most effective for removal. Detecting trained-in sandbagging can be hard. Removing it might be easier.

Jordan Taylor@JordanTensor·9 Dec

16/17 Jordan Taylor, Sid Black (@realmeatyhuman), Dillon Bowen, Thomas Read (@thjread), @satvikgolechha, @alexdzm, @OliverMakins, @Connor_Kissane, Kola Ayonrinde, @jacobmerizian, @saprmarks, @ChrisCundy, @JBloomAus. UK AI Security Institute, FAR.AI, Anthropic

Jordan Taylor@JordanTensor·9 Dec

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17
