Cozmin Ududec

455 posts


@CUdudec

@AISecurityInst Science of Evaluation lead. Ex quantum foundationalist.

Joined June 2021

1.9K Following · 559 Followers

Cozmin Ududec retweeted

AI Security Institute@AISecurityInst·Jul 7

Every recent model we tried surfaced findings worth patching, and the strongest chained multiple flaws into a real attack path. The takeaway is that capable AI can help defenders harden production systems. Read the new case study unpacking what we found (and fixed): aisi.gov.uk/blog/finding-c…


1.3K

Cozmin Ududec retweeted

Sayash Kapoor@sayashk·Jul 2

Update on our long-horizon AI R&D evals: In April, we launched CRUX, a project to regularly run open-world evaluations. These are long, messy, real-world tests of what AI agents can actually do. Our second evaluation is underway, and we ask: can AI agents automate AI research?

There is a lot of interest in studying AI research automation. But most of the systems built so far follow one of three patterns:
1) keep a human in the loop to guide the agent and course-correct along the way.
2) focus on narrow problems where ground truth is clear and progress is easy to verify, as in AutoResearch.
3) use scaffolds engineered for one specific type of research question, so strong results may say more about the scaffold than about the agent's general research ability.

These efforts are helpful, but a lot of AI research is much broader. Success is not immediately clear or verifiable. Researchers need to test and reject promising hypotheses, backtrack, consider new or unconventional approaches, and do a lot more to make progress on answering research questions.

In CRUX #2, we are trying to test whether agents can answer novel, open-ended AI research questions.
- One major risk in such a task is contamination. We want the agent to have access to the internet and all the tools it needs to solve the task, so we can't use research questions from publicly available papers. At the same time, we want high-quality papers to serve as the source of challenging research questions.
- To address this, we partnered with AI researchers from UKAISI, UToronto, Princeton, and other institutions who have written high-quality papers that aren’t yet public, so there’s no risk of contamination.
- The authors pose open-ended research questions without giving away answers. The agent must produce a NeurIPS-quality paper and a reproducible codebase, which the authors of the papers then review.
- We built a general-purpose scaffold on OpenClaw and Opus 4.8. (We would have loved to use Fable 5, but given the filters on AI R&D capabilities, we don't want to confound results.)
- Agents get generous resource budgets set in consultation with the original authors, such as access to VMs, GPUs, and any other compute needed to answer the question. They also have $3,000 in API credits per paper. We evaluate them on week-long time horizons, far longer than in typical agent evals, so they have time to make progress on answering the research question.
- The agent needs to manage its own budget. It can track its spend and stay within its limits, and it can modify its scaffold and reasoning effort as it sees fit.
- In addition to the final artifacts, such as the paper's code, we are also evaluating the agent's trajectories in depth.

When we announced CRUX, we planned to conduct an open-world eval every month. Given the scope and ambition of this project, we have spent a lot more time making sure we are confident in our setup and results. That said, the early results we have are exciting, and we look forward to sharing them soon.


209

19K

Cozmin Ududec@CUdudec·Jul 2

Huge thanks to @jjmcfadyen, @ojorgy, @howdoyousayCli! Full post: aisi.gov.uk/blog/more-comp…


Cozmin Ududec@CUdudec·Jul 2

Naively, this implies that the cost of measuring the frontier scales with the frontier itself. So a natural question is whether we can forecast success on long, expensive tasks from cheap early signals, such as behaviour, process, or early transcript features.


Cozmin Ududec@CUdudec·Jul 2

New post from the Science of Evaluation team @AISecurityInst: how much test-time compute you give an agent changes not just its score, but how fast the frontier appears to move. The cyber time-horizon trend over the past year is ~60% steeper at 50M tokens per task than at 2.5M. 🧵

AI Security Institute@AISecurityInst

Most AI agent evaluations boil capability down to one score. But that number hides a key choice: how much compute the agent was allowed to use. New work from our Science of Evaluation team shows why that matters. 🧵


519

Cozmin Ududec retweeted

David@DavidDAfrica·Jun 26

We tend to think of AIs as "shoggoths wearing a mask." But such masks can be worn lightly, or they can leave impressions, and even change the face underneath. In this work, we show that different types of persona induction and roleplay induce different levels of belief!

Benno Sturgeon@ben_sturgeon

When Role-Playing, Do Models Believe What They Say? (w/ @DavidDAfrica and @realmeatyhuman) LLMs can say “The Earth revolves around the Sun” and then, when roleplaying as an ancient Greek historian, assert the opposite. What changes inside the model when it acts like this? Does it just say things, or does it start to believe the role? 🧵


2.2K

Cozmin Ududec@CUdudec·Jun 18

The paper: arxiv.org/abs/2606.17930 Huge credit to @jjmcfadyen, @ojorgy, @HarryCoppock, @kevinlwei.


Cozmin Ududec@CUdudec·Jun 18

This points to a concrete reporting standard: publish the full inference-compute curve together with the precise experiment protocol that produced it, including budget, stopping rule, feedback, number of submissions, and harness details.
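
As a rough illustration only, the items listed in this post could be captured as a structured record. This is a minimal sketch and not the format used in the paper: the class name `EvalReport`, every field name, and the `missing_fields` helper are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class EvalReport:
    """Hypothetical record of the reporting items named in the post."""
    model: str
    # Full inference-compute curve: (token budget per task, success rate) pairs.
    compute_curve: List[Tuple[int, float]]
    token_budget_per_task: int  # budget
    stopping_rule: str          # when a run is ended
    feedback: str               # what feedback the agent receives
    num_submissions: int        # number of submissions allowed
    harness: str                # harness details

    def missing_fields(self) -> List[str]:
        """Return names of fields left empty (empty string, empty list, or None)."""
        return [name for name, value in asdict(self).items()
                if value in ("", [], None)]


# Example with placeholder values; none of these numbers come from the paper.
report = EvalReport(
    model="example-model",
    compute_curve=[(1_000_000, 0.2), (10_000_000, 0.4)],
    token_budget_per_task=10_000_000,
    stopping_rule="",  # deliberately left blank to show the check
    feedback="none",
    num_submissions=1,
    harness="example-harness",
)
print(report.missing_fields())
```

A record like this makes an incomplete protocol easy to spot before results are published; here the blank stopping rule is flagged.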


Cozmin Ududec@CUdudec·Jun 18

Building on @jjmcfadyen's thread, a bit more context on a problem we're spending a lot of time on: how to ensure frontier-agent evaluations keep up with and actually measure the capabilities we care about. 🧵

Jessica McFadyen@jjmcfadyen

Excited to announce my first preprint from @AISecurityInst! 🎉 Here, we asked: how much does an AI agent's performance on a benchmark depend on how much compute we give it to do the task? Paper link in thread 👇🧵(1/7)


1.4K

Cozmin Ududec retweeted

David@DavidDAfrica·Jun 17

A while back I posted about prefill awareness, where LLMs can tell their message history has been messed with. Since then we've turned it into a full paper and are excited to share "Prefill Awareness in Language Models", some new work from UK AISI and Constellation 🧵


9.2K

Cozmin Ududec retweeted

EvalEval Coalition@evaluatingevals·Jun 11

🚀We launch Evaluation Cards (beta): a centralized public record of AI evaluation results 🚀 Not another leaderboard. Every score comes with who ran it, the settings they used, what the benchmark tests and the other results reported for the same model, side by side. 🧵👇


7.1K

Cozmin Ududec retweeted

Noam Brown@polynoamial·Jun 9

x.com/i/article/2057…


416

3.2K
