Cozmin Ududec
@CUdudec
416 posts

@AISecurityInst Testing and Science of Evals. Ex quantum foundationalist.

Joined June 2021
1.8K Following · 466 Followers
Cozmin Ududec @CUdudec ·
This is currently my favourite way to present eval results: inference scaling curves, across model generations, split by task difficulty. You can easily see the impact of token budgets, how performance becomes more log-linear over time, and how recent model performance on hard tasks looks like older model performance on easy tasks...
[image]
Quoted: AI Security Institute @AISecurityInst

🔓 Can today’s AI agents escape sandbox environments? Using our new benchmark, SandboxEscapeBench, we find that frontier models can reliably exploit common vulnerabilities - and that breakout capability improves as model size and inference compute increase. Read more ⬇️

1 reply · 1 repost · 21 likes · 1.4K views
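For readers who want to reproduce this presentation style, here is a minimal matplotlib sketch of the plot described above: success rate against inference token budget, one curve per model generation, panelled by task difficulty. All model names, slopes, and data points below are invented for illustration.

```python
# Illustrative sketch only: synthetic data mimicking roughly log-linear
# inference scaling, split by task difficulty.
import numpy as np
import matplotlib.pyplot as plt

budgets = np.logspace(4, 7.7, 8)  # 10k to ~50M tokens

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, difficulty, base in zip(axes, ["easy", "medium", "hard"], [0.5, 0.3, 0.1]):
    for gen, gain in [("gen-2023", 0.03), ("gen-2024", 0.05), ("gen-2025", 0.07)]:
        # Equal absolute gain per doubling of the token budget
        success = np.clip(base + gain * np.log2(budgets / budgets[0]), 0.0, 1.0)
        ax.plot(budgets, success, marker="o", label=gen)
    ax.set_xscale("log")
    ax.set_title(f"{difficulty} tasks")
    ax.set_xlabel("token budget")
axes[0].set_ylabel("success rate")
axes[0].legend()
plt.tight_layout()
plt.show()
```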
Cozmin Ududec reposted
AI Security Institute @AISecurityInst ·
🔓 Can today’s AI agents escape sandbox environments? Using our new benchmark, SandboxEscapeBench, we find that frontier models can reliably exploit common vulnerabilities - and that breakout capability improves as model size and inference compute increase. Read more ⬇️
[image]
6 replies · 36 reposts · 158 likes · 13.8K views
Cozmin Ududec reposted
Owen Lewis @is_OwenLewis ·
🧵 1/14: Just days after Paul Ehrlich (the man whose 1968 book “The Population Bomb” predicted billions would starve) passed away, it’s the perfect moment to celebrate the scientist who proved Ehrlich and other doomsayers spectacularly wrong. Meet Norman Borlaug, the Iowa farm boy who launched the Green Revolution and quite literally saved a billion lives. This is the ultimate story of human ingenuity triumphing over scarcity.
[image]
74 replies · 813 reposts · 3.2K likes · 166.1K views
Cozmin Ududec reposted
Noam Brown @polynoamial ·
Perhaps a 🌶️ take but I think the criticisms of @GoogleDeepMind's release are missing the point, and the real problem is that AI labs and safety orgs need to adapt to a world where intelligence is a function of inference compute.

When Google says that Deep Think poses no new risks beyond Gemini 3 Pro, they probably mean that Deep Think is a scaffold of Gemini 3 Pro that anyone externally could have constructed on their own anyway. In other words, the capabilities of Deep Think have always been available to anyone willing to pay for Deep Think amounts of inference, simply by scaffolding a bunch of Gemini 3 Pro queries together. Deep Think just makes that more convenient for the casual user.

The corollary is that capabilities far beyond Gemini 3 Deep Think are already available to anyone willing to scaffold a system together that uses even more inference compute. As a trivial example, you could run 10 Deep Think queries and just do consensus over them. That would be 10x the cost but would have higher performance on many benchmarks.

Most Preparedness Frameworks were developed in ~2023, before the era of effective test-time scaling. But today, there is a massive difference on the hardest evals between something like GPT-5.2 Low and GPT-5.2 Extra High. Scaffolds are also much more effective. So if you want to evaluate whether Gemini 3 can, for example, help make a bio weapon, the answer may depend on how much inference compute you give it.

In my opinion, the proper solution is to account for inference compute when measuring model capabilities. E.g., if one were to spend $1,000 on inference with a really good scaffold, what performance could be expected on a benchmark? ARC-AGI has already adopted this mindset, but few other benchmarks have.

Of course, serious entities like state actors could spend well beyond $1,000. Accurate benchmark evaluations can require dozens of queries on hundreds of problems. So, if we want to measure a model's capability when using $1 million of inference, we might need to spend billions of dollars for each model release! But in the same way that pretraining scaling laws can predict the capabilities of larger pretrained models, performance also scales somewhat cleanly with additional inference compute.

In my opinion, it should become standard practice for all system cards to show plots of benchmark performance as a function of inference compute, and safety thresholds should be based on a projection of what performance would look like at $1 million+ of inference compute. If that were the norm, then releasing Deep Think probably would not result in a meaningful safety change compared to Gemini 3 Pro, other than making good scaffolds more easily available to casual users.
Quoted: The Midas Project @TheMidasProj

1/ Last week, we criticized OpenAI's release of GPT-5.3-Codex. That was a real, albeit narrow, dispute about an otherwise serious safety evaluation process. @Google has released an upgrade of a similar magnitude with no safety results at all. 🧵

52 replies · 81 reposts · 1.1K likes · 204.8K views
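As a concrete illustration of the "trivial example" above, here is a minimal sketch of a consensus scaffold: run the same query N times and majority-vote over the final answers, trading roughly N× the inference cost for higher benchmark performance. `query_model` is a hypothetical placeholder, not a real API.

```python
# Minimal sketch of the consensus scaffold described above. Hypothetical:
# `query_model` stands in for whatever model API you actually call.
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical single model call returning a final answer string."""
    raise NotImplementedError("plug in a real model API here")

def consensus_answer(prompt: str, n: int = 10) -> str:
    """Spend ~n times the inference compute; return the most common answer."""
    answers = [query_model(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```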
Cozmin Ududec @CUdudec ·
The obvious next question: does this generalize? METR sees different patterns on their tasks. Whether cyber is special or representative is still unclear.
0 replies · 0 reposts · 1 like · 75 views
Cozmin Ududec @CUdudec ·
The scaling is also remarkably log-linear: every time you double the token budget, you get roughly the same absolute increase in success rate. This holds across ~4 orders of magnitude (10k to 50M tokens) on AISI's tasks.
1 reply · 0 reposts · 1 like · 82 views
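To make the log-linear claim concrete: if each doubling of the token budget adds the same absolute increment b to the success rate, then success ≈ a + b·log2(budget), and a straight-line fit in log2(budget) should match the data well. A minimal sketch with invented numbers:

```python
# Invented numbers for illustration; the fit form is the point.
import numpy as np

budgets = np.array([1e4, 1e5, 1e6, 1e7, 5e7])       # tokens
success = np.array([0.05, 0.15, 0.26, 0.35, 0.42])  # hypothetical success rates

b, a = np.polyfit(np.log2(budgets), success, deg=1)  # success ≈ a + b*log2(budget)
print(f"success ≈ {a:.3f} + {b:.3f} * log2(budget)")
print(f"absolute gain per doubling of budget: {b:.3f}")
```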
Cozmin Ududec @CUdudec ·
Something striking from our cyber scaling work: recent models can productively use 10-50x more tokens than many evaluation budgets allow. We've probably been underestimating capability ceilings.
Quoted: AI Security Institute @AISecurityInst

AI cyber capabilities are improving rapidly, but are evaluations keeping pace? Alongside @Irregular, we found that recent models can productively use 10-50x larger token budgets than typical evaluation settings allow, with key security implications🧵

1 reply · 1 repost · 23 likes · 1.8K views
Cozmin Ududec @CUdudec ·
The paper would not have happened without the effort and conceptual clarity of @DubMagda!
0 replies · 0 reposts · 2 likes · 86 views
Cozmin Ududec @CUdudec ·
The paper outlines 7 steps: define your question → organise transcripts → manually inspect samples → refine → design scanners → validate → use results. Simple, but makes the difference between rigorous analysis and ad-hoc transcript reading.
[image]
1 reply · 0 reposts · 2 likes · 120 views
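As an illustration of the "design scanners" step, here is a minimal sketch of a scanner that flags apparent refusals. The transcript format (a list of role/content messages) and the refusal phrases are assumptions for illustration, not the paper's implementation.

```python
# Sketch of one "scanner": a small function mapping a transcript to a flag,
# run over every transcript once manual inspection suggests what to look for.
# The message format and refusal phrases are assumptions for illustration.
import re

REFUSAL_PATTERNS = [
    r"\bI can(?:'|no)t help with\b",
    r"\bI(?:'m| am) (?:unable|not able) to\b",
    r"\bagainst (?:my|our) (?:policy|guidelines)\b",
]

def scan_for_refusal(transcript: list[dict]) -> bool:
    """Flag transcripts where the assistant appears to refuse the task."""
    return any(
        msg["role"] == "assistant"
        and any(re.search(p, msg["content"], re.IGNORECASE) for p in REFUSAL_PATTERNS)
        for msg in transcript
    )

# Validate against a hand-labelled sample (steps 5-6 above) before trusting
# scanner output at scale.
```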
Cozmin Ududec @CUdudec ·
Agent evals can generate massive logs, which contain useful signal about why agents succeed or fail. They can reveal things task scores can't: refusals vs genuine failures, tool access issues, eval awareness, misreported progress.
1 reply · 0 reposts · 1 like · 114 views
Cozmin Ududec @CUdudec ·
New from the Science of Evaluation Team at @AISafetyInst: a pipeline for rigorous transcript analysis. I think transcript analysis is still underrated, especially as model horizons are getting longer and task environments more complex.
2 replies · 3 reposts · 19 likes · 1.2K views
Cozmin Ududec @CUdudec ·
The transitions follow sigmoid phase curves, which is consistent with ICL and SFT updating the same underlying belief state over personas. Details in the thread; LessWrong writeup: lesswrong.com/posts/cffGZn8L…
0 replies · 0 reposts · 0 likes · 95 views
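For concreteness, a sigmoid phase curve of this kind can be fit in a few lines with scipy. The data points and parameterisation below are invented for illustration, not taken from the project:

```python
# Invented data; fits p(transition) as a sigmoid in the number of
# in-context facts, the phase-curve shape described above.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(n, k, n0):
    """Transition probability as a function of in-context fact count n."""
    return 1.0 / (1.0 + np.exp(-k * (n - n0)))

n_facts = np.array([0, 2, 4, 6, 8, 12, 16])
p_transition = np.array([0.02, 0.05, 0.20, 0.55, 0.80, 0.95, 0.98])

(k, n0), _ = curve_fit(sigmoid, n_facts, p_transition, p0=[1.0, 6.0])
print(f"steepness k = {k:.2f}, midpoint ≈ {n0:.1f} facts")
```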
Cozmin Ududec @CUdudec ·
Some interesting early results from @benji_berczi and @koreankiwi1227 from our MATS W26 project. Core finding: weird generalisation can happen via ICL alone! A handful of benign facts in the prompt induces persona transitions and alignment drops.
Quoted: Benji Berczi @benji_berczi

Anthropic yesterday: LLMs develop personas in post-training! 🤖 Our work today: LLM personas can be elicited just by prompting! Even harmful ones. 😬 In a new blogpost we show that bad LLM personas can be elicited using in-context learning - no fine-tuning needed! Thread 🧵

1 reply · 1 repost · 17 likes · 2.3K views
Cozmin Ududec @CUdudec ·
The role is a good fit if you have hands-on LLM/agent experience, enjoy designing and running experiments, and care about getting measurement right.
📍 Apply: job-boards.eu.greenhouse.io/aisi/jobs/4769…
💸 £65k–£145k depending on experience
🗓️ Deadline: February 22, 2026
Reach out if you want to talk about the team, the work, or life at AISI!
0 replies · 0 reposts · 8 likes · 300 views
Cozmin Ududec @CUdudec ·
Things you might work on:
- Predicting long-horizon performance from task/model characteristics
- Designing tasks with internal structure to track progress
- Identifying bottleneck skills that cap what agents can do
- Building tools to extract insights from agent transcripts
1 reply · 0 reposts · 9 likes · 370 views