Robert Kirk

462 posts

@_robertkirk

Alignment Red Team at @AISecurityInst. Prev. PhD Student @ucl_dark

Joined January 2020

314 Following, 1.8K Followers

Pinned Tweet

Robert Kirk@_robertkirk·26 Nov

Misalignment will be an important driver of risk, so we're developing methods for red-teaming model behaviour. Very excited to publicly release a case study applying our alignment testing methodology to Claude Opus 4.5, Opus 4.1 and Sonnet 4.5 from @AnthropicAI! 🧵

English

11.4K

Robert Kirk@_robertkirk·1d

Cool work improving environment realism for Petri, and improving on eval realism measurement with pairwise evals, which I'm a fan of (aisi.gov.uk/blog/investiga…) but hadn't thought to use as a scalar metric like this!

Connor Kissane@Connor_Kissane

New Anthropic Fellows research: Automated audits like Petri are increasingly used for alignment evals, but they're often unrealistic, and frontier LLMs can often tell. We measure and improve the realism of agentic coding audits by grounding them in real deployment data.

English

676

Robert Kirk retweeted

Connor Kissane@Connor_Kissane·1d

English

116

7.2K

Robert Kirk retweeted

Usman Anwar@usmananwar391·16 Mar

✨New AI Safety work on Steganography and LLM monitoring✨ We propose ‘steganographic gap’: the first principled metric for detecting and quantifying encoded reasoning in LLMs, which can reveal hard-to-detect forms of steganography, e.g., paraphrasing-resistant steganography.

English

127

13.6K

Robert Kirk@_robertkirk·13 Mar

@FazlBarez I would also be curious if you did any realism tests with models, i.e. ask models which trajectory is more realistic; or post-hoc ask models whether their previous context is an eval or not?

English

Robert Kirk@_robertkirk·13 Mar

@FazlBarez Cool work! I'm curious about the Alignment Illusion. The gap between baseline & latent risk doesn't always increase with capability: for gemini it does; for qwen it goes up then down (but base risk goes down); for gpt it's mixed (and confounded). Are there more results on this?

English

142

Fazl Barez@FazlBarez·12 Mar

New paper🚨: Are AI Agents Safe? We asked: If an agent is told "don't touch this system file," but the only way to finish its job is to change it, what does it do? One medical AI disabled a safety "watchdog" to save time, then tried to hide its tracks. 1/8 🧵

English

4.1K

Robert Kirk@_robertkirk·9 Mar

I think prefill misalignment evals are pretty good, but they add prefill awareness as a concern beyond 'just' eval awareness. I now think we can mitigate prefill awareness for current models, but it will require ongoing work for future models (a la eval awareness more broadly)!

David@DavidDAfrica

Can LLMs tell when their conversation history has been tampered with? We tested 14 models across thousands of conversations to find out. Some new work from UK AISI 🧵

English

918

Robert Kirk retweeted

Benjamin Hilton@benjamin_hilton·12 Feb

Come work with me at @AISecurityInst! AI safety has a huge gap: we have great thoughts about how misalignment could occur, but IMO terrible models of what dangerously misaligned AIs would actually do. Many stories just jump from "misaligned" to "godlike ASI does magic." 1/5

English

231

32.9K

Robert Kirk@_robertkirk·6 Mar

x.com/alxndrdavies/s…

Xander Davies@alxndrdavies

Join us! DMs open. Apply by 31st March. I don't think you will find a more capable (+ kind) team inside or outside of gov. Misuse: job-boards.eu.greenhouse.io/aisi/jobs/4784… Alignment: job-boards.eu.greenhouse.io/aisi/jobs/4784… Control: job-boards.eu.greenhouse.io/aisi/jobs/4784…

ZXX

272

Robert Kirk@_robertkirk·6 Mar

Come join the best team I've ever worked on, solve the world's most important and historic problems, and have loads of fun while doing it! This is a crucial time to have talented people red-teaming AI systems and their safeguards, and (I think) AISI is the best place to do that!

Xander Davies@alxndrdavies

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

English

7.2K

Robert Kirk@_robertkirk·6 Mar

@CFGeek I guess I'm saying "agent motivations" isn't the whole ball-game – they could be (very close to) perfect and bad stuff still happens (if model goodharts the gap); or very far from perfect and we're still ok (if model is very corrigible/constrained). x.com/CFGeek/status/…

Charles Foster@CFGeek

Though neither of these are the central point, which is more about our poor existing methods for understanding AI motivations

English

Robert Kirk@_robertkirk·6 Mar

@CFGeek I see alignment as trying to specify both the goals of the resulting agent *and* its form of agency (e.g. how corrigible it is and when, what kinds of means-ends reasoning are allowed vs disallowed etc), and if those things are useful for mitigating LoC it seems good to test them?

English

Charles Foster@CFGeek·3 Mar

I don’t care much about propensity evaluations for reward-seeking, goal-guarding, scheming, etc. Capable agents are incentivized to develop these ~no matter what goal they want to pursue (i.e. instrumental convergence), so such evals aren’t that informative re: agent motivations.

English

2.9K

Robert Kirk@_robertkirk·6 Mar

@tomekkorbak @AsaCoopStick I feel like humans are going to be better at writing these prompts than models (this seems like an unintuitive task for models, so I'd expect GEPA to work less well)

English

243

Tomek Korbak@tomekkorbak·5 Mar

@AsaCoopStick We used GEPA-style prompt optimization to elicit higher controllability: you can get large gains for a particular instruction type (like uppercase), but not "be better at following CoT instructions in general, including unseen ones"

English

4.3K

Tomek Korbak@tomekkorbak·5 Mar

We have a new eval to help keep chains of thought (CoT) monitorable: CoT Controllability. This tests whether LLMs can control their CoT, helping to evade CoT monitors. So far, the results leave us cautiously optimistic: today’s models struggle to obfuscate their reasoning in ways that undermine monitorability.

English

422

52.1K

Robert Kirk@_robertkirk·5 Mar

@CFGeek @mentalgeorge I think red teaming (broadly construed) can be good if you expect the model to be adversarial to your evaluation/goal-discovery-process - then you're trying to find a vulnerability in the model's hiding of its goals, in some sense.

English

Charles Foster@CFGeek·4 Mar

@mentalgeorge Re: why not just red-team? Red-teaming is useful for stress-testing “How robust is X goal?” or “How does the agent trade off X and Y goals?” I don’t think it’s the right tool for discovery

English

Robert Kirk retweeted

Nathan Labenz@labenz·1 Mar

The UK AISI might be the most situationally aware government entity in the world today. Today on @CogRev_Podcast, Chief Scientist @geoffreyirving surveys the AI landscape. While jailbreaking frontier models is getting harder, the AISI Red Team has never failed. 👀

English

133

20.8K

Robert Kirk retweeted

Xander Davies@alxndrdavies·6 Feb

UK AISI's Red Team tested both OpenAI + Anthropic's models released today! We jailbroke GPT-5.3-Codex (and the conversation monitor) in 10 hours & conducted an alignment audit on Opus 4.6. 🧵

English

120

969

392.1K

Robert Kirk@_robertkirk·6 Feb

Excited to continue the loop of finding alignment issues and fixing them with labs. It's a helpful signal that your eval methodology is reasonable, and helps ground research for future alignment evaluation and red-teaming.

English

193

Robert Kirk@_robertkirk·6 Feb

3rd party alignment evaluations can lead to improvements in model alignment! Alignment testing will be increasingly important (especially given AI R&D ASL4 evals are so hard). This is just a small example with much work left to do, especially on eval awareness, but still cool!

Xander Davies@alxndrdavies

With @AnthropicAI, we previously found that Sonnet 4.5 and Opus 4.5 both frequently refused to engage with certain safety research tasks, sometimes citing concerning rationales. In response, Anthropic designed an eval & tried to improve this behaviour. aisi.gov.uk/blog/investiga…

English

6.4K

Robert Kirk retweeted

Ben Edelman@EdelmanBen·4 Feb

People sometimes ask me how to leverage a technical background to jump into U.S. AI policy. As of this week my answer is straightforward: apply to join us at CAISI! We're a startup within government, and we're doing a hiring surge.

English

24K

Robert Kirk retweeted

Nate@NateBurnikell·2 Feb

AISI is hiring: join our senior leadership team as a Deputy Director of our Research Unit (9–12 month maternity cover). This isn’t your average Civil Service job. For 9–12 months, you’ll co-lead one of the world’s most influential AI safety research organisations.

English

14.9K

Robert Kirk retweeted

David Lindner@davlindner·12 Jan

New DeepMind x UK AISI paper: what would it take to prevent harm from misaligned AI agents via monitoring in real deployments? We wrote a safety case sketch for control monitoring. No flashy results but lots of important details for deploying future AI agents safely!

English

19.2K
