Samuele Marro

281 posts

@MarroSamuele

DPhil student at Oxford's AIMS CDT. Building communication protocols between agents w/ @OxfordTVG and @Microsoft. Head of @idai_institute.

Oxford, UK · Joined December 2018
320 Following · 637 Followers
Pinned Tweet
Samuele Marro @MarroSamuele
"Benchmarking at the Edge of Comprehension" is now an Oral at ICML 2026! GPT-3.5 successfully benchmarks GPT-5.2🥳 crb-bench.org
Samuele Marro@MarroSamuele

How do you benchmark something smarter than yourself? In light of the recent math benchmarks all getting dangerously close to saturation, we at @OxfordTVG are glad to announce Benchmarking Beyond Comprehension (written with @Microsoft). Coolest result: we successfully got GPT-3.5 to judge GPT-5.2-high on hard math topics.

In the paper we study the Post-Comprehension Regime, i.e. the setting where coming up with hard enough questions and checking the correctness of the answers are infeasible (or very expensive). In theory, you can get around this by having the LLMs do this job, but then you get an infinite regress problem (how can you trust an LLM to check if an LLM-generated answer is correct?).

The solution is an adversarial protocol:
- Alice (e.g. GPT-5.2) comes up with a question-answer pair
- Bob (e.g. DeepSeek) can either
  a) Accuse Alice’s question of being ill-posed
  b) Spot a mistake in Alice’s answer
  c) Answer the question directly
- Alice checks Bob’s answer and looks for mistakes
- A human judge evaluates specific claims of mistakes

The trick is that checking a specific claim of a specific mistake is much easier than coming up with a question or checking an entire answer. This means that humans can still do it even if they don’t understand the question as a whole. And since the game is adversarial, we can compute Elo scores! (Technically we use a bipartite Bradley-Terry model, but whatever)

Two cool results:
- The resulting Elos are strongly correlated with existing math benchmarks (i.e. the protocol is actually measuring math competence)
- When we replace human judges (all of whom had a Master's or PhD in CS/Math) with weak models like GPT-3.5, the rankings do not meaningfully change, even when the models participating in the game are much stronger

Basically, we show a pretty scalable form of weak-to-strong benchmarking. Which is cool!
Paper here: arxiv.org/abs/2602.14307 Huge thanks to Jialin Yu, @iperboreo_ , @DebOishi, Jiawei Li, Yibo Yang, @EbeyAbraham, @sunandosengupta, Eric Sommerlade, @wooldridgemike, and Philip Torr for their help!
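The Elo-style scores mentioned in the thread can be illustrated with a plain Bradley-Terry fit. The sketch below is only an illustration under stated assumptions, not the paper's method: it uses the standard (non-bipartite) Bradley-Terry model, fitted with the classic minorization-maximization update, and the names `fit_bradley_terry` and `to_elo` as well as the win counts are hypothetical.

```python
import math


def fit_bradley_terry(wins, n_players, iters=200):
    """Fit Bradley-Terry strengths with the minorization-maximization update.

    wins[(i, j)] is the number of games player i won against player j.
    Returns a list of strengths normalised to sum to 1.
    Assumes at least one decisive game was played overall.
    """
    p = [1.0 / n_players] * n_players
    for _ in range(iters):
        new_p = []
        for i in range(n_players):
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_players):
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (p[i] + p[j])
            # Players with no games keep their previous strength.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


def to_elo(strengths, base=1500.0):
    """Map strengths to an Elo-like scale (400 * log10), centred on `base`."""
    # The floor avoids log10(0) for a player who never won a game.
    logs = [400.0 * math.log10(max(s, 1e-12)) for s in strengths]
    mean = sum(logs) / len(logs)
    return [base + value - mean for value in logs]


if __name__ == "__main__":
    # Hypothetical outcomes for three models; each pair played 10 games.
    wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
    print(to_elo(fit_bradley_terry(wins, 3)))
```

In this toy setup a "win" would be whatever the judged claims of a game decide; how the paper maps protocol outcomes to wins, and how its bipartite variant separates roles, is not reproduced here.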

Replies: 2 · Retweets: 14 · Likes: 36 · Views: 6.8K
Samuele Marro retweeted
Institute for Decentralized AI @idai_institute
In the span of three weeks, our institute's worst fears have come to pass: the two top frontier labs have restricted access to their best models. None of this happened under a law, a published standard, or a process anyone can appeal. This sets an incredibly dangerous precedent.

To be clear, we believe that frontier models do indeed pose cybersecurity and biosecurity risks. But these risks have to be balanced against the risk of concentration of power. An aligned AI is nonetheless dangerous in the hands of a closed, opaque group of actors that can disable guardrails at any moment.

We have published a list of topics that you, as an AI researcher, can work on to stop the increasing concentration of AI capabilities: decentralized-ai.org/research-areas

AI centralization can be countered. It's just a matter of finding the right breakthroughs.
OpenAI@OpenAI

Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work. openai.com/index/previewi…

Replies: 4 · Retweets: 17 · Likes: 66 · Views: 9.6K
Samuele Marro @MarroSamuele
Definitely feeling conflicted today. On one hand I feel vindicated, as the Mythos & GPT-5.6 news confirmed that I was right to set up @idai_institute (my main motivation was worrying that frontier AI would soon be gatekept). On the other I really really wish it wasn't necessary.
Replies: 0 · Retweets: 0 · Likes: 7 · Views: 290
Samuele Marro @MarroSamuele
Calling all multi-agent AI researchers: join us for the IDAI Workshop on Multi-Agent Safety and Security! A one-day gathering at the Royal Society of Arts on creating safe, secure and reliable multi-agent systems. Featuring keynotes by @casdewitt (University of Oxford) and Georgios Piliouras (DeepMind).

Date: June 29th, 2026
Location: Royal Society of Arts, 8 John Adam Street, London WC2N 6EZ

We will also hold a poster session on new developments in multi-agent safety and security. To present a poster, submit an abstract (up to 150 words) to workshop26@decentralized-ai.org
Deadline: 24 June 2026, 23:59 AOE
The best poster will receive 500 USD of compute provided by the Institute for Decentralized AI.

Spots are limited. Join the waitlist here: luma.com/n6acqkas
More info: decentralized-ai.org/workshop26

Organized in collaboration with @Microsoft, @idai_institute, @ethereumfndn's dAI Team and supported by @cosmos_inst
Replies: 1 · Retweets: 12 · Likes: 31 · Views: 5.2K
J Rosser @jrosseruk
Career update! I've joined @NeelNanda5's Language Model Interpretability team as a contractor employed by Adecco, supporting @GoogleDeepMind! I'll be working on interp and data attribution! This comes after a fantastic internship at @cohere with @acyr_l! Lots of exciting work from that time to share soon!
Replies: 43 · Retweets: 10 · Likes: 669 · Views: 40.2K
Samuele Marro @MarroSamuele
I understand that organizers have to deal with capacity limits, but I feel like this could have been telegraphed much, much better.
ICML Conference@icmlconf

General registration for #ICML2026 is expected to fill up very soon. (Some spots stay reserved for paper authors, sponsors, workshop organizers.) Today is also the deadline for early registration. If you plan to attend in-person, we recommend registering ASAP. Blog for more info:

Replies: 0 · Retweets: 0 · Likes: 3 · Views: 304
Samuele Marro retweeted
William Gitta @William__Gitta
Your browser agent ALREADY leaks which LLM is behind it. In our new paper, we train classifiers that identify the underlying model from actions alone, up to 96% F1 for some models on Wikipedia and Amazon.
Replies: 3 · Retweets: 3 · Likes: 3 · Views: 275
Kyle @kyle_mccleary
@MarroSamuele How did you get access to GPT 3.5? I thought they retired it from the API a long time ago.
Replies: 1 · Retweets: 0 · Likes: 0 · Views: 32
Samuele Marro @MarroSamuele
Now accepted as a Spotlight paper at ICML 2026!
Samuele Marro@MarroSamuele

Quoted tweet: the Benchmarking Beyond Comprehension announcement ("How do you benchmark something smarter than yourself?"), reproduced in full under the pinned tweet above. Paper: arxiv.org/abs/2602.14307

Replies: 6 · Retweets: 4 · Likes: 45 · Views: 7.6K
Samuele Marro @MarroSamuele
I read that blog post in high school and decided in that moment I was going to spend my life doing ML. Thanks @karpathy
Helen Toner@hlntnr

Never forget @karpathy training a recurrent neural net (precursor to transformers) to imitate @paulg in 2015—a thing of syntactic and semantic beauty:

Replies: 0 · Retweets: 0 · Likes: 0 · Views: 116
Samuele Marro @MarroSamuele
Well that's nice! Will do a proper post once everyone's done doing their own
Replies: 1 · Retweets: 2 · Likes: 34 · Views: 1.3K
Samuele Marro retweeted
Ziyan Wang @ZiyanWang98
I’m excited and honored to share that I’ve been selected for the IDAI Fellowship. I’ll be working on multi-agent safety, together with @Adel_Bibi, @MarroSamuele and @jamesaoldfield. Grateful for this opportunity, and very much looking forward to the collaboration ahead.
Replies: 2 · Retweets: 1 · Likes: 9 · Views: 253
Samuele Marro @MarroSamuele
@idavidrein Do you have a name for this phenomenon? I'm trying to get the term "post-comprehension regime" going, but I have the lingering feeling that someone else already gave a better name.
Replies: 0 · Retweets: 0 · Likes: 0 · Views: 120
david rein @idavidrein
In 2023 I made GPQA, and it saturated in about two years. Here, the benchmark I was working on probably saturated while we were making it. People have said this before but it bears repeating: AI capabilities are improving faster than our ability to measure them is.
Replies: 3 · Retweets: 31 · Likes: 247 · Views: 25.4K
david rein @idavidrein
@tmkadamcz and I started working on MirrorCode, a new long-horizon software engineering benchmark, last September. I think it’s the best benchmark for measuring AI’s ability to complete very hard (but precisely specified) software tasks—but it’s likely already saturated.
Epoch AI@EpochAIResearch

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

Replies: 6 · Retweets: 26 · Likes: 175 · Views: 31K
xuan (ɕɥɛn / sh-yen) @xuanalogue
Happy Trans Day of Visibility!🏳️‍⚧️ This is my first TDoV since moving back to Singapore, and the 12th since I began my transition. It's been a long & sometimes difficult journey, but somehow I've ended up as the first (to my knowledge) openly trans professor in the SGean academy.
Replies: 5 · Retweets: 75 · Likes: 1.3K · Views: 18.2K
Davide Crapis @DavideCrapis
Excited for this collaboration with SQ and @MarroSamuele to be out! This work feels like an initial step into a new paradigm: unleash LLMs into real-world interactions, using economic feedback (money/payoff) to improve security and coordination. We draw from multi-agent systems, but move beyond standard games and let agents operate in realistic environments that model where they're being used.

Two promising follow-on directions:
- Environments with hard commitments and verification
- Training robust negotiators and automating security

If you're working on similar ideas, reach out.
Shouqiao Wang@Qiaoqiao2001

As agents move into real deployment, static benchmarks stop being enough. What matters is whether an agent remains robust when the other side is learning how to exploit it for profit. In our paper, we study this through profit-driven red teaming.

Replies: 3 · Retweets: 4 · Likes: 17 · Views: 2.1K
Samuele Marro @MarroSamuele
@DavideCrapis Happy to have helped! At the end of the day, economic feedback is the best signal for anything security-related.
Replies: 0 · Retweets: 0 · Likes: 5 · Views: 55
Samuele Marro retweeted
Marcello Politi @Marcello_AI
Paper out! Most AI security evaluations work like this: curate a set of attack prompts, run your agent against them, and score the results. The problem is that real adversaries don't work from a fixed list. They probe, adapt, and learn. So what would it look like to stress-test agents the way a real profit-seeking counterparty would?

The idea, which we call profit-driven red teaming, is simple: instead of handcrafting attacks, you train an opponent whose only goal is to maximise its own payoff. No judge, no attack labels. Just a scalar outcome signal.

We tested this across 6 frontier models in 4 economic games, and found that agents that looked strong against static baselines became reliably exploitable once the opponent was optimised. In many cases, agents accepted outcomes worse than simply walking away.

What tricks did the attacker discover on its own? Things like fake protocol notices that force the victim to bid near zero in auctions. Or negotiation traps: "Accept this deal now, and I promise you a much better one in the future."

The good news is that a lightweight fix helped a lot. Distilling the worst attack episodes into a short set of prompt rules for the target agent neutralised most of the discovered exploits. No retraining needed.

The paper was accepted to two workshops:
1️⃣ Agents in the Wild at ICLR 2026
2️⃣ Strategic Engineering at AAMAS Conference 2026.

This work covers LLM safety and game theory, and both communities found it relevant! We are working on how AI agents can coordinate and transact in adversarial environments, and this is a first step. You can't build robust coordination systems without first understanding how agents fail under strategic pressure.

I'd love to hear from anyone working on multi-agent systems, #LLM security, or #AI in economic settings. New collaborations are always welcome, feel free to contact me!
Thanks to my co-authors, @Qiaoqiao2001 (@Columbia ), @MarroSamuele (@UniofOxford ), @DavideCrapis (@ethereumfndn ) and @ARIA_research for the brainstorming sessions that helped shape this work.
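The core idea in the thread, an opponent optimised against nothing but its own scalar payoff, can be shown in a toy setting. The sketch below is an illustration under heavy assumptions and is not the paper's setup: it swaps the trained LLM attacker for a simple random-search hill climber, the game is a one-shot ultimatum split, and `toy_victim`, `attacker_payoff`, `optimise_attacker`, `PIE` and the acceptance threshold are all hypothetical.

```python
import random

PIE = 100  # total value to split in the toy ultimatum game (assumed)


def toy_victim(offer, threshold=20):
    """Hypothetical stand-in for the target agent: accepts offers >= threshold."""
    return offer >= threshold


def attacker_payoff(offer, victim):
    """Scalar outcome signal: the attacker keeps the remainder if the deal is accepted."""
    return PIE - offer if victim(offer) else 0


def optimise_attacker(victim, steps=500, seed=0):
    """Random-search hill climbing that sees only the payoff, never the victim's rule."""
    rng = random.Random(seed)
    best_offer = PIE // 2  # static baseline: a fair split
    best_payoff = attacker_payoff(best_offer, victim)
    for _ in range(steps):
        # Perturb the current best offer and clamp it to the valid range.
        candidate = min(PIE, max(0, best_offer + rng.randint(-5, 5)))
        payoff = attacker_payoff(candidate, victim)
        if payoff > best_payoff:
            best_offer, best_payoff = candidate, payoff
    return best_offer, best_payoff


if __name__ == "__main__":
    print("static baseline payoff:", attacker_payoff(PIE // 2, toy_victim))
    print("optimised (offer, payoff):", optimise_attacker(toy_victim))
```

Against the fair-split baseline the victim looks fine, while the optimised opponent pushes it towards the worst deal it will still accept. No judge or attack labels are involved, only the payoff.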
Replies: 1 · Retweets: 6 · Likes: 13 · Views: 1K