Robert Kirk

462 posts

Robert Kirk

Robert Kirk

@_robertkirk

Alignment Red Team at @AISecurityInst. Prev. PhD Student @ucl_dark

Katılım Ocak 2020
314 Takip Edilen1.8K Takipçiler
Sabitlenmiş Tweet
Robert Kirk
Robert Kirk@_robertkirk·
Misalignment will be an important driver of risk, so we're developing methods for red-teaming model behaviour. Very excited to publicly release a case study applying our alignment testing methodology to Claude Opus 4.5, Opus 4.1 and Sonnet 4.5 from @AnthropicAI! 🧵
English
3
10
48
11.4K
Robert Kirk
Robert Kirk@_robertkirk·
Cool work improving environment realism for Petri, and improving on eval realism measurement with pairwise evals, which I'm a fan off (aisi.gov.uk/blog/investiga…) but hadn't thought to use as a scalar metric like this!
Connor Kissane@Connor_Kissane

New Anthropic Fellows research: Automated audits like Petri are increasingly used for alignment evals, but they're often unrealistic, and frontier LLMs can often tell. We measure and improve the realism of agentic coding audits by grounding them in real deployment data.

English
0
1
0
670
Robert Kirk retweetledi
Connor Kissane
Connor Kissane@Connor_Kissane·
New Anthropic Fellows research: Automated audits like Petri are increasingly used for alignment evals, but they're often unrealistic, and frontier LLMs can often tell. We measure and improve the realism of agentic coding audits by grounding them in real deployment data.
Connor Kissane tweet media
English
5
12
115
7.2K
Robert Kirk retweetledi
Usman Anwar
Usman Anwar@usmananwar391·
✨New AI Safety work on Steganography and LLM monitoring✨ We propose ‘steganographic gap’: the first principled metric for detecting and quantifying encoded reasoning in LLMs, which can reveal hard-to-detect forms of steganography, e.g., paraphrasing-resistant steganography.
English
4
25
127
13.6K
Robert Kirk
Robert Kirk@_robertkirk·
@FazlBarez I would also be curious if you did any realism tests with models, i.e. ask models which trajectory is more realistic; or post-hoc ask models whether their previous context is an eval or not?
Robert Kirk tweet media
English
1
0
0
43
Robert Kirk
Robert Kirk@_robertkirk·
@FazlBarez Cool work! I'm curious about the Alignment Illusion. The gap between baseline & latent risk doesn't always increase with capability: for gemini it does; for qwen it goes up then down (but base risk goes down); for gpt it's mixed (and confounded). Are there more results on this?
Robert Kirk tweet media
English
2
0
0
142
Fazl Barez
Fazl Barez@FazlBarez·
New paper🚨: Are AI Agents Safe? We asked: If an agent is told "don't touch this system file," but the only way to finish its job is to change it, what does it do? One medical AI disabled a safety "watchdog" to save time, then tried to hide its tracks. 1/8 🧵
Fazl Barez tweet media
English
5
12
61
4.1K
Robert Kirk
Robert Kirk@_robertkirk·
I think prefill misalignment evals are pretty good, but they add prefill awareness as a concern beyond 'just' eval awareness. I now think we can mitigate prefill awareness for current models, but it will require ongoing work for future models (a la eval awareness more broadly)!
David@DavidDAfrica

Can LLMs tell when their conversation history has been tampered with? We tested 14 models across thousands of conversations to find out. Some new work from UK AISI 🧵

English
0
1
11
918
Robert Kirk retweetledi
Benjamin Hilton
Benjamin Hilton@benjamin_hilton·
Come work with me at @AISecurityInst! AI safety has a huge gap: we have great thoughts about how misalignment could occur, but IMO terrible models of what dangerously misalignedAIs would actually do. Many stories just jump from "misaligned" to "godlike ASI does magic." 1/5
English
12
36
231
32.9K
Robert Kirk
Robert Kirk@_robertkirk·
Come join the best team I've ever worked on, solve the world's most important and historic problems, and have loads of fun while doing it! This is a crucial time to have talented people red-teaming AI systems and their safeguards, and (I think) AISI is the best place to do that!
Xander Davies@alxndrdavies

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

English
2
9
40
7.2K
Robert Kirk
Robert Kirk@_robertkirk·
@CFGeek I see alignment as trying to specify both the goals of the resulting agent *and* it's form of agency (e.g. how corrigible it is and when, what kinds of means-ends reasoning is allowed vs disallowed etc), and if those things are useful for mitigating LoC seems good to test them?
English
1
0
1
38
Charles Foster
Charles Foster@CFGeek·
I don’t care much about propensity evaluations for reward-seeking, goal-guarding, scheming, etc. Capable agents are incentivized to develop these ~no matter what goal they want to pursue (i.e. instrumental convergence), so such evals aren’t that informative re: agent motivations.
English
10
0
29
2.9K
Robert Kirk
Robert Kirk@_robertkirk·
@tomekkorbak @AsaCoopStick I feel like humans are going to be better at writing these prompts than humans (this seems like an unintuitive task for models, so I'd expect GEPA to work less well)
English
1
0
1
243
Tomek Korbak
Tomek Korbak@tomekkorbak·
@AsaCoopStick We used GEPA-style prompt optimization to elicit higher controllability: you can get large gains for a particular instruction type (like uppercase), but not "be better at following CoT instructions in general, including unseen ones"
English
2
4
29
4.3K
Tomek Korbak
Tomek Korbak@tomekkorbak·
We have a new eval to help keep chains of thought (CoT) monitorable: CoT Controllability. This tests whether LLMs can control their CoT, helping to evade CoT monitors. So far, the results leave us cautiously optimistic: today’s models struggle to obfuscate their reasoning in ways that undermine monitorability.
Tomek Korbak tweet media
English
11
51
422
52.1K
Robert Kirk
Robert Kirk@_robertkirk·
@CFGeek @mentalgeorge I think red teaming (broadly construed) can be good if you expect the model to be adversarial to your evaluation/goal-discovery-process - then you're trying to find a vulnerability in the models's hiding of it's goals, in some sense.
English
0
0
1
36
Charles Foster
Charles Foster@CFGeek·
@mentalgeorge Re: why not just red-team Red-teaming is useful for stress-testing “How robust is X goal?” or “How does the agent trade off X and Y goals?” I don’t think it’s the right tool for discovery
English
1
0
0
39
Robert Kirk retweetledi
Nathan Labenz
Nathan Labenz@labenz·
The UK AISI might be the most situationally aware government entity in the world today. Today on @CogRev_Podcast, Chief Scientist @geoffreyirving surveys the AI landscape. While jailbreaking frontier models is getting harder, the AISI Red Team has never failed. 👀
English
7
24
133
20.8K
Robert Kirk retweetledi
Xander Davies
Xander Davies@alxndrdavies·
UK AISI's Red Team tested both OpenAI + Anthropic's models released today! We jailbroke GPT-5.3-Codex (and the conversation monitor) in 10 hours & conducted an alignment audit on Opus 4.6. 🧵
Xander Davies tweet media
English
17
120
969
392.1K
Robert Kirk
Robert Kirk@_robertkirk·
Excited to continue the loop of finding alignment issues and fixing them with labs. It's a helpful signal that your eval methodology is reasonable, and helps ground research for future alignment evaluation and red-teaming.
English
0
0
3
193
Robert Kirk
Robert Kirk@_robertkirk·
3rd party alignment evaluations can lead to improvements in model alignment! Alignment testing will be increasingly important (especially given AI R&D ASL4 evals are so hard). This is just a small example with much work left to do, especially on eval awareness, but still cool!
Xander Davies@alxndrdavies

With @AnthropicAI, we previously found that Sonnet 4.5 and Opus 4.5 both frequently refused to engage with certain safety research tasks, sometimes citing concerning rationales. In response, Anthropic designed an eval & tried to improve this behaviour. aisi.gov.uk/blog/investiga…

English
1
2
20
6.4K
Robert Kirk retweetledi
Ben Edelman
Ben Edelman@EdelmanBen·
People sometimes ask me how to leverage a technical background to jump into U.S. AI policy. As of this week my answer is straightforward: apply to join us at CAISI! We're a startup within government, and we're doing a hiring surge.
Ben Edelman tweet media
English
4
23
89
24K
Robert Kirk retweetledi
Nate
Nate@NateBurnikell·
AISI is hiring: join our senior leadership team as a Deputy Director of our Research Unit (9–12 month maternity cover). This isn’t your average Civil Service job. For 9–12 months, you’ll co-lead one of the world’s most influential AI safety research organisations.
English
3
11
55
14.9K
Robert Kirk retweetledi
David Lindner
David Lindner@davlindner·
New DeepMind x UK AISI paper: what would it take to prevent harm from misaligned AI agents via monitoring in real deployments? We wrote a safety case sketch for control monitoring No flashy results but lots of important details for deploying future AI agents safely!
David Lindner tweet media
English
7
30
93
19.2K