Jordan Taylor

473 posts

Jordan Taylor banner
Jordan Taylor

Jordan Taylor

@JordanTensor

Working on new methods for understanding machine learning systems and entangled quantum systems.

Brisbane Joined Aralık 2009
1.1K Following441 Followers
Pinned Tweet
Jordan Taylor
Jordan Taylor@JordanTensor·
I'm keen to share our new library for explaining more of a machine learning model's performance more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill which I helped out with during @MATSprogram: 🧵1/8
Jordan Taylor tweet media
Lee Sharkey@leedsharkey

Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ publications.apolloresearch.ai/end_to_end_spa… Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Our SAEs explain significantly more performance than before! 1/

English
1
0
12
1.3K
Jordan Taylor
Jordan Taylor@JordanTensor·
Overall, I think combining these kinds of misalignment continuation evaluations with good solutions to eval awareness (and "this history wasn't generated by me" prefill awareness) could be powerful.
English
1
0
0
27
Jordan Taylor
Jordan Taylor@JordanTensor·
I've made a little propensity eval testing whether models continue with misaligned actions after they've been started. 🧵
Jordan Taylor tweet media
English
1
0
3
121
Jordan Taylor retweeted
Geoffrey Irving
Geoffrey Irving@geoffreyirving·
New report on trends in AISI's evaluations of frontier AI models over the past two years. A lot of AI discourse focuses on viral moments, but it is important to zoom out to the less flashy trend: AI models are steadily growing in capabilities, including dual-use capabilities.
Geoffrey Irving tweet media
AI Security Institute@AISecurityInst

📈 Today, we’re releasing our first Frontier AI Trends Report: evaluation results on 30+ frontier models from the past two years, showing rapid progress in chemistry and biology, cyber capabilities, autonomy, and more. ▶️Read now: aisi.gov.uk/frontier-ai-tr…

English
0
21
84
8.3K
Jordan Taylor retweeted
Asa Cooper Stickland
Asa Cooper Stickland@AsaCoopStick·
NEW PAPER: Could an LLM agent subtly sabotage your code? We conducted a red-blue team game where the red team designed agents to sabotage, and the blue team designed monitors to catch the agent. Three surprising results ahead 🧵🛳️
Asa Cooper Stickland tweet media
AI Security Institute@AISecurityInst

AI coding agents are increasingly writing production code - with tools, file access, and execution permissions. That power accelerates development, but also introduces new security risks if agents act against user intent 🧵

English
2
15
91
44.3K
Jordan Taylor retweeted
AI Security Institute
AI Security Institute@AISecurityInst·
Can we rely on the results of AI evaluations? The phenomenon of sandbagging – where AI models underperform during testing – could undermine our understanding of their capabilities. We recently conducted an auditing game to assess 10 methods for sandbagging detection 🧵
AI Security Institute tweet media
English
2
9
54
6.9K
Jordan Taylor retweeted
Thomas Read
Thomas Read@thjread·
First paper from my team at UK AISI! Excited to have this out there - we have some really great model organisms of conditional underperformance, and tried a lot of different detection techniques to see what works in an adversarial setting
Jordan Taylor@JordanTensor

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

English
0
1
5
106
Jordan Taylor retweeted
Jordan Taylor
Jordan Taylor@JordanTensor·
17/17 TL;DR: We stress-tested sandbagging detection. Black-box methods were defeated by imitative training. Probes helped but were vulnerable. On-distribution training was most effective for removal. Detecting trained-in sandbagging can be hard. Removing it might be easier.
English
0
0
14
386
Jordan Taylor
Jordan Taylor@JordanTensor·
NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17
Jordan Taylor tweet media
English
2
22
110
25.4K