Harry Coppock

215 posts

@HarryCoppock

No. 10 Downing Street Innovation Fellow | Research Scientist at AISI | Visiting Lecturer at Imperial College London | Working on AI Evaluation and AI for Medicine

London · Joined March 2021

355 Following · 216 Followers

Harry Coppock retweeted

Xander Davies@alxndrdavies·1d

I moved to London 3 years ago to join @AISecurityInst, at the time a few people with visitor passes and a whiteboard. Since then AISI has become the world’s largest and best-funded group in gov focused on AI security & safety. Fun to be in @nytimes!

368

15.3K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·13 May

Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵

126

586

136.2K

Harry Coppock retweeted

Aleksandr Bowkis@aleksandrbowkis·13 May

Can we safely automate alignment? Even if agents are not scheming, they can produce compelling research that survives extensive checks and strongly indicates that a model is safe but is catastrophically wrong. New paper from UK AISI: arxiv.org/abs/2605.06390

13.6K

Harry Coppock retweeted

Tomek Korbak@tomekkorbak·1 May

OpenAI introduces an additional layer of defense against misaligned or confused coding agents, complementing chain of thought monitoring we use internally. When Codex wants to execute a risky action outside of its sandbox, a separate Codex agent is asked to approve or deny it.

177

19.6K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·30 Apr

OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

397

2.4K

1.8M

Harry Coppock retweeted

AI Security Institute@AISecurityInst·27 Apr

As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵

150

28.1K

Harry Coppock retweeted

Robert Kirk@_robertkirk·27 Apr

We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval, improved evaluation and prefill awareness measurements. Details including new methodology in 🧵:

AI Security Institute@AISecurityInst

20.3K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·24 Apr

We know AI systems occasionally act against their operators’ intentions – but what in their environment causes them to do so? In a new paper, we make progress on this question 🧵

104

13.4K

Harry Coppock retweeted

Alan Cooney@Alan_Cooney_·24 Apr

Introducing vLLM-Lens: a fast interpretability tool that scales to trillion parameter models

676

41.7K

Harry Coppock@HarryCoppock·14 Apr

@thomasahle @AISecurityInst github.com/UKGovernmentBE…#L14 So we don't count usage at the end over the trajectory. We log token usage per API call. Most model APIs give info on reasoning token usage.

Thomas Ahle@thomasahle·14 Apr

@AISecurityInst How do you count cumulative tokens for reasoning models like GPT-5?

548

Harry Coppock retweeted

AI Security Institute@AISecurityInst·13 Apr

We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵

113

551

1.3M

Harry Coppock retweeted

Tomek Korbak@tomekkorbak·6 Apr

OpenAI is spinning up an AI safety research fellowship program similar to MATS or Anthropic Fellows. People should apply!

OpenAI@OpenAI

Introducing the OpenAI Safety Fellowship, a new program supporting independent research on AI safety and alignment—and the next generation of talent. openai.com/index/introduc…

457

75.2K

Harry Coppock retweeted

OpenAI@OpenAI·6 Apr

Introducing the OpenAI Safety Fellowship, a new program supporting independent research on AI safety and alignment—and the next generation of talent. openai.com/index/introduc…

385

300

2.7K

946.3K

Harry Coppock retweeted

7vik@satvikgolechha·30 Mar

Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling + we share an unexpected result related to CoT faithfulness. 🧵 (1 of 7)

185

22K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·23 Mar

🔓 Can today’s AI agents escape sandbox environments? Using our new benchmark, SandboxEscapeBench, we find that frontier models can reliably exploit common vulnerabilities - and that breakout capability improves as model size and inference compute increase. Read more ⬇️

158

17.1K

Harry Coppock retweeted

David@DavidDAfrica·9 Mar

Can LLMs tell when their conversation history has been tampered with? We tested 14 models across thousands of conversations to find out. Some new work from UK AISI 🧵

165

16.3K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·5 Mar

AI cyber capabilities are improving rapidly, but are evaluations keeping pace? Alongside @Irregular, we found that recent models can productively use 10-50x larger token budgets than typical evaluation settings allow, with key security implications🧵

20K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·25 Feb

How can we make sense of the vast transcripts generated during agentic evaluations and multi-turn conversations? Together with @meridianlabs_ai, we built Inspect Scout, an open-source transcript analysis tool, and distilled best practices into a step-by-step pipeline🧵

4.2K

Harry Coppock retweeted

AI Security Institute@AISecurityInst·17 Feb

AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today, we share Boundary Point Jailbreaking (BPJ), the first fully automated attack to break the safeguards of leading AI models🧵 (1/8)

151

35.4K

Harry Coppock retweeted

Rishi Sunak@RishiSunak·10 Feb

One of the reasons we created the UK AISI when I was Prime Minister was to have exactly this kind of independent red-team capability. Great to see @soundboy and the team continuing their valuable work so we can unlock the benefits of AI while keeping people safe.

Xander Davies@alxndrdavies

UK AISI's Red Team tested both OpenAI + Anthropic's models released today! We jailbroke GPT-5.3-Codex (and the conversation monitor) in 10 hours & conducted an alignment audit on Opus 4.6. 🧵

714

133.1K
