Cozmin Ududec

435 posts

@CUdudec

@AISecurityInst Science of Evaluation lead. Ex quantum foundationalist.

Joined June 2021
1.9K Following · 533 Followers
Cozmin Ududec reposted
Stephan Rabanser @steverab
Working with agents for the past months has me convinced that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand if the agent really did its job!

In our paper "Log analysis is necessary for credible evaluation of AI agents", we
➡️ introduce a taxonomy of threats to credible evaluation of AI agents (including construct validity and safety evaluation concerns);
➡️ outline four key principles for conducting log analysis effectively;
➡️ present a case study of how log analysis helped us to find a variety of benchmarking errors on τ-bench; and
➡️ give a set of recommendations to improve log analysis quality and adoption.

📄 arxiv.org/abs/2605.08545

More details in @PKirgis's thread below ⬇️
Peter Kirgis @PKirgis

New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: arxiv.org/pdf/2605.08545

1 reply · 5 reposts · 13 likes · 3.2K views
Cozmin Ududec reposted
Peter Kirgis @PKirgis
New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: arxiv.org/pdf/2605.08545
6 replies · 22 reposts · 94 likes · 17K views
Cozmin Ududec reposted
AI Security Institute @AISecurityInst
Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵
30 replies · 126 reposts · 586 likes · 136.1K views
Cozmin Ududec reposted
Sayash Kapoor @sayashk
I appreciate the work by @EpochAIResearch @GregHBurnham in flagging and fixing these issues. Finding bugs in evaluations is always disappointing, but in the long run it is necessary (and extremely helpful) for improving evaluations. It also reminds me of the issues we uncovered in CORE-Bench: x.com/sayashk/status…

As benchmarks become more complex, analyzing benchmark tasks and agent logs will become more important to ensure the validity of evaluation results. Coincidentally, today we released a paper (led by @PKirgis) on how to do log analysis well. x.com/PKirgis/status…

This builds on all our lessons from the trenches in conducting such evaluations and fixing the issues we found in our own work. I’m sure we’ll find many other issues in our evals, but I genuinely think the evals community will be better off for having developed tools and methods to improve eval rigor.
Greg Burnham @GregHBurnham

Thread with a few notes on this. It’s a disappointing finding, of course. The best we can do is fix it up and learn lessons for future work.

2 replies · 5 reposts · 44 likes · 13.1K views
Cozmin Ududec @CUdudec
I'll be mentoring for Pivotal this summer! Apply if you're interested in personas and behaviour dynamics over long trajectories.
Pivotal Research @pivotal_org

Language models read their own outputs as evidence for their current persona, sometimes entrenching it. Cozmin Ududec (@CUdudec) leads the Science of Evaluation team at UK AISI and is taking on Pivotal fellows to study how personas carry over, stabilise, drift, or compound across long conversations.

0 replies · 1 repost · 25 likes · 1.7K views
Cozmin Ududec reposted
AI Security Institute @AISecurityInst
OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵
95 replies · 397 reposts · 2.4K likes · 1.8M views
Cozmin Ududec reposted
Nate @NateBurnikell
We (@AISecurityInst) tested GPT-5.5 for its cyber capabilities and safeguards. It's the strongest-performing model we've tested on our narrow cyber tasks, and it solved one of our cyber ranges in 1 of 10 attempts. We found a universal jailbreak with 6 hours of expert red teaming.
17 replies · 55 reposts · 378 likes · 51.1K views
Cozmin Ududec @CUdudec
The paper and thread also have a lot of useful detail on best practices and pitfalls for running open-world evals well!
0 replies · 0 reposts · 1 like · 85 views
Cozmin Ududec @CUdudec
More broadly: are there better ways to run these expensive, low-sample evaluations to get more insight efficiently? One idea is to run an episode end-to-end once, then return to an intermediate progress state, branch, and sample more heavily from that point. Could designs like this help us estimate time-horizons, inference-scaling efficiency, robustness, and harness effects?
1 reply · 0 reposts · 1 like · 103 views
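The branching design in the post above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the "agent" is a coin flip that advances one step per unit of budget with probability `p_advance`, and the names `run_from` and `branched_estimate` are hypothetical, not part of any AISI tooling. It shows the shape of the idea: fix a checkpoint reached by one end-to-end run, then spend the remaining samples on many short continuations from that checkpoint.

```python
import random
from statistics import mean


def run_from(step, total_steps, p_advance, budget, rng):
    """Toy continuation: each unit of budget advances one step with probability p_advance."""
    for _ in range(budget):
        if step >= total_steps:
            break  # task finished before the budget ran out
        if rng.random() < p_advance:
            step += 1
    return step


def branched_estimate(checkpoint_step, total_steps, p_advance, budget, n_branches, seed=0):
    """Sample n_branches continuations from one saved checkpoint (hypothetical design).

    Returns the fraction of branches that finish the task and the final step of each branch.
    """
    rng = random.Random(seed)
    finals = [
        run_from(checkpoint_step, total_steps, p_advance, budget, rng)
        for _ in range(n_branches)
    ]
    solve_rate = mean(1.0 if s >= total_steps else 0.0 for s in finals)
    return solve_rate, finals


if __name__ == "__main__":
    # Made-up numbers: a 32-step task, checkpoint at step 20, 50 branches.
    rate, finals = branched_estimate(20, 32, p_advance=0.3, budget=40, n_branches=50)
    print(f"solve rate from checkpoint: {rate:.2f}, furthest step: {max(finals)}")
```

In a real evaluation the checkpoint would be a saved environment and transcript state rather than a step counter, and estimates conditioned on one checkpoint say nothing about how likely a run is to reach that checkpoint in the first place.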
Cozmin Ududec @CUdudec
This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings! Glad the AISI SoE team could contribute to this effort.
Sayash Kapoor @sayashk

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1 reply · 5 reposts · 28 likes · 8.2K views
Cozmin Ududec @CUdudec
This growing variance in the step solved at a given budget (or in the tokens needed to reach a given step) could be a big issue for estimating performance on very long-horizon tasks at very large token budgets.
0 replies · 1 repost · 9 likes · 841 views
Cozmin Ududec @CUdudec
One thing I find interesting about this result is the large gap between the best run (dashed red line) and the average over 10 runs (solid heavy red line) for Mythos. At around 80M tokens, the best run has finished, while the average is still at step 20. Put another way, there is a huge variance in the random variable `log(token) to solve step n`!
AI Security Institute @AISecurityInst

We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵

5 replies · 2 reposts · 30 likes · 2.6K views
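The gap between the best run and the average described above can be made concrete by summarising the per-run values of `log(token) to solve step n`. The sketch below is illustrative only: `log_token_stats` is a hypothetical helper, base-10 logs are an arbitrary choice, and the token counts in the usage example are made up rather than taken from the AISI results.

```python
import math
from statistics import mean, pvariance


def log_token_stats(tokens_to_step):
    """Summarise tokens needed to reach a fixed step across runs, on a log10 scale.

    Assumes every run in the list reached the step; runs that never reached it
    would need separate (censored-data) handling, which this sketch omits.
    """
    logs = [math.log10(t) for t in tokens_to_step]
    return {
        "best_tokens": min(tokens_to_step),      # the single luckiest run
        "mean_log10": mean(logs),                # average on the log scale
        "var_log10": pvariance(logs),            # spread of log(tokens) across runs
        "geo_mean_tokens": 10 ** mean(logs),     # typical run, back on the token scale
    }


if __name__ == "__main__":
    # Hypothetical token counts for 10 runs reaching the same step.
    runs = [8e7, 1.5e8, 2e8, 3e8, 4e8, 6e8, 9e8, 1.2e9, 2e9, 5e9]
    stats = log_token_stats(runs)
    print(f"best run: {stats['best_tokens']:.2e} tokens")
    print(f"geometric-mean run: {stats['geo_mean_tokens']:.2e} tokens")
    print(f"variance of log10(tokens): {stats['var_log10']:.3f}")
```

Reporting the best run alongside the geometric mean and the log-scale variance makes it harder to mistake one lucky trajectory for typical performance.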
Cozmin Ududec @CUdudec
One other thought: we likely need to change how we think about measuring performance. Instead of average success rates, the headline number should probably be an efficiency metric ($ cost per solve, or the slope of the inference-scaling curve).
0 replies · 0 reposts · 4 likes · 116 views
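Both efficiency metrics mentioned above are simple to compute once per-budget results exist. The following is a minimal sketch, assuming the "slope of the inference curve" means the least-squares slope of score against log10(token budget); the function names and the example numbers are hypothetical.

```python
import math


def cost_per_solve(total_cost_usd, n_solves):
    """Dollars spent per successful solve; infinite if nothing was solved."""
    if n_solves == 0:
        return math.inf
    return total_cost_usd / n_solves


def log_linear_slope(budgets, scores):
    """Ordinary least-squares slope of score versus log10(token budget).

    The result is the score gained per 10x increase in tokens.
    Needs at least two distinct budgets.
    """
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(scores) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    return sxy / sxx


if __name__ == "__main__":
    # Made-up example: average steps completed at three token budgets.
    budgets = [1e6, 1e7, 1e8]
    avg_steps = [4.0, 9.0, 15.0]
    print(f"steps gained per 10x tokens: {log_linear_slope(budgets, avg_steps):.2f}")
    print(f"cost per solve: ${cost_per_solve(1200.0, 3):.2f}")
```

A slope is only meaningful over the range where the curve is roughly log-linear, so it is worth plotting the points before quoting a single number.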
Cozmin Ududec @CUdudec
Another nice example of the increasing effectiveness of inference scaling on very long and hard tasks, and fast saturation on new tasks! In Nov 2025, we changed our default budget from 10M to 100M tokens for some cyber tasks...which already seems too little.
david rein @idavidrein

@tmkadamcz and I started working on MirrorCode, a new long-horizon software engineering benchmark, last September. I think it’s the best benchmark for measuring AI’s ability to complete very hard (but precisely specified) software tasks—but it’s likely already saturated.

1 reply · 0 reposts · 12 likes · 592 views
Cozmin Ududec reposted
7vik @satvikgolechha
Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling + we share an unexpected result related to CoT faithfulness. 🧵 (1 of 7)
3 replies · 25 reposts · 184 likes · 22K views
Cozmin Ududec @CUdudec
This is currently my favourite way to present eval results: inference scaling curves, across model generations, split by task difficulty. You can easily see the impact of token budgets, how performance becomes more log-linear over time, and how recent model performance on hard tasks looks like older model performance on easy tasks...
AI Security Institute @AISecurityInst

🔓 Can today’s AI agents escape sandbox environments? Using our new benchmark, SandboxEscapeBench, we find that frontier models can reliably exploit common vulnerabilities - and that breakout capability improves as model size and inference compute increase. Read more ⬇️

1 reply · 4 reposts · 32 likes · 3.2K views