Stephan Rabanser

10.1K posts

@steverab

Postdoctoral Researcher @Princeton. Reliable, safe, trustworthy machine learning. Previously: @UofT @VectorInst @TU_Muenchen @Google @awscloud

Princeton, NJ · Joined April 2010

388 Following · 827 Followers

Pinned Tweet

Stephan Rabanser@steverab·Jun 5

Very excited to share that our paper "Towards a Science of AI Agent Reliability" was accepted at ICML 2026! See you in Seoul! 🎉
We just released our camera ready version with three important updates (details below). We also recorded a short video on the paper's contributions.
Main changes (full discussion at hal.cs.princeton.edu/reliability/#u…):
1️⃣ We have added the latest set of frontier models to our evaluation (GPT 5.5, Gemini 3.1 Pro and 3.5 Flash, and Claude Opus 4.7) and find that they are not meaningfully more reliable than previously released models. Agent reliability is still far from being solved.
2️⃣ We have updated the definition and measurement of our outcome consistency metric, which contained a typo in the pre-print we initially released. This caused us to under-estimate outcome consistency in our initial set of results. We have updated the paper and our codebase to the corrected metric. Despite this change, our new results show that outcome consistency is still surprisingly low across many reported models.
3️⃣ We discovered multiple issues in our HAL Generalist Agent scaffold that we used for our experiments on GAIA. Notably, we discovered multiple instances of answer leakage and agents cheating on our evaluation. This caused us to slightly over-estimate both accuracy and reliability. At the same time, we noticed that the scaffold was overly constrained in terms of permissible software library imports. This caused us to slightly under-estimate both accuracy and reliability. We have done a rigorous audit of the scaffold and have fixed those issues. Overall, we saw that our resulting accuracy and reliability numbers are not meaningfully impacted by this change when compared to our original numbers.
📄 Our paper: arxiv.org/abs/2602.16666
📊 Our dashboard: hal.cs.princeton.edu/reliability/
🎥 Short video: youtu.be/qftDfEft7U0
Joint work w/ @sayashk, @PKirgis, @khl53182440, @SaitejaUtpala, and @random_walker.

252

24.8K

Stephan Rabanser retweeted

Sayash Kapoor@sayashk·5d

Thrilled to share that I am joining UC Berkeley as an Assistant Professor in the School of Information! I start in Fall 2027, and I am recruiting PhD students this cycle. List me in your application if you're interested in frontier AI evaluation, AI policy, and AI's impacts on institutions such as science, law, and medicine. I'm especially keen to work with students interested not just in high-quality research, but also in communicating it with a broad audience such as by public writing and policy impact. Fill out the form in the next tweet to indicate your interest. As for this coming year, I'm moving to Berkeley this fall to start something new with @RishiBommasani and @random_walker. We'll have much more to share soon.

115

944

162.2K

Stephan Rabanser@steverab·Jul 2

Very excited to discuss the capability-reliability gap in agentic AI at the @farairesearch Seoul Alignment Workshop next week!

FAR.AI@farairesearch

Confirmed for Seoul Alignment Workshop: @steverab (@Princeton), on the opening panel on the capability-reliability gap. He works on when machine learning systems can be trusted: uncertainty, calibration, knowing when a model should abstain. The premise underneath it: a capable system and a dependable one are not the same, and the gap has to be measured, not assumed.

2.1K

Stephan Rabanser retweeted

Sayash Kapoor@sayashk·Jul 2

Update on our long-horizon AI R&D evals: In April, we launched CRUX, a project to regularly run open-world evaluations. These are long, messy, real-world tests of what AI agents can actually do. Our second evaluation is underway, and we ask: Can AI agents automate AI research?
There is a lot of interest in studying AI research automation. But most of the systems built so far follow one of three patterns.
1) keep a human in the loop to guide the agent and course-correct along the way.
2) focus on narrow problems where ground truth is clear and progress is easy to verify, as in AutoResearch.
3) use scaffolds engineered for one specific type of research question, so strong results may say more about the scaffold than about the agent's general research ability.
These efforts are helpful, but a lot of AI research is much broader. Success is not immediately clear or verifiable. Researchers need to test and reject promising hypotheses, backtrack, consider new or unconventional approaches, and do a lot more to make progress on answering research questions.
In CRUX #2, we are trying to test whether agents can answer novel, open-ended AI research questions.
- One major risk in such a task is contamination. We want the agent to have access to the internet and all the tools it needs to solve the task, so we can't use research questions from publicly available papers. At the same time, we want high quality papers to serve as the source of challenging research questions.
- To address this, we partnered with AI researchers from UKAISI, UToronto, Princeton, and other institutions who have written high-quality papers that aren’t yet public, so there’s no risk of contamination.
- The authors pose open-ended research questions without giving away answers. The agent must produce a NeurIPS-quality paper and a reproducible codebase, which the authors of the papers then review.
- We built a general-purpose scaffold on OpenClaw and Opus 4.8. (We would have loved to use Fable 5, but given the filters on AI R&D capabilities, we don't want to confound results.)
- Agents get generous resource budgets set in consultation with the original authors, such as access to VMs, GPUs, and any other compute needed to answer the question. They also have $3,000 in API credits per paper. We evaluate them on week-long time horizons to make progress on answering the research question, far more than typical agent evals.
- The agent needs to manage its own budget. It can track its spend and stay within its limits, and it can modify its scaffold and reasoning effort as it sees fit.
- In addition to the final artifacts, such as the paper's code, we are also evaluating the agent's trajectories in depth.
When we announced CRUX, we planned to conduct an open-world eval every month. Given the scope and ambition of this project, we have spent a lot more time making sure we are confident in our setup and results. That said, the early results we have are exciting, and we look forward to sharing them soon.

206

18.7K

Stephan Rabanser@steverab·Jul 1

📣 I'll be in Seoul next week to present one main conference paper and four workshop papers at ICML! I'll also be on a panel at the FAR.AI alignment workshop! Reach out if you are around and want to chat about uncertainty, reliability, or AI evals!😊 Details⬇️
📄Paper 1: Towards a Science of AI Agent Reliability
📍Main conference: Thursday (July 9) • 14:30–16:15 in Hall A • Poster #3408
📍Workshop on Failure Modes in Agentic AI (FAGEN): Friday (July 10) • 10:10–11:00 and 14:40–15:30 in Grand Ballroom 104-105
🔗arxiv.org/abs/2602.16666
🧵x.com/steverab/statu…
📄Paper 2: Log Analysis is Necessary for Credible Evaluation of AI Agents
📍Workshop on Failure Modes in Agentic AI (FAGEN): Friday (July 10) • 10:10–11:00 and 14:40–15:30 in Grand Ballroom 104-105
🔗arxiv.org/abs/2605.08545
🧵x.com/PKirgis/status…
📄Paper 3: Open-World Evaluations for Measuring Frontier AI Capabilities
📍Workshop on Agents in the Wild (AIWILD): Saturday (July 11) • 11:10–12:00 and 16:10–17:00 in Hall B2
🔗arxiv.org/abs/2605.20520
🧵x.com/sayashk/status…
📄Paper 4: Life After Benchmark Saturation: A Case Study of CORE-Bench
📍Workshop on Agents in the Wild (AIWILD): Saturday (July 11) • 11:10–12:00 and 16:10–17:00 in Hall B2
🔗arxiv.org/abs/2606.26158
🧵x.com/nityndg/status…
🗣️Panel on the AI capability–reliability gap
📍FAR.AI Seoul Alignment Workshop: Monday (July 6)
🔗far.ai/events/event-l…
Also, my advisor @random_walker is going to deliver a keynote on Thursday (July 9) at 13:30 in Hall C: icml.cc/virtual/2026/i…. Don't miss it!

Nitya Nadgir@nityndg

Can AI agents help researchers reproduce research more quickly? We conducted an uplift study. The answer is yes: researchers reproduced papers > 2x faster using Codex with GPT-5.4 xhigh. In a new paper, we show many other results.

15K

Stephan Rabanser retweeted

Nitya Nadgir@nityndg·Jul 1

11.7K

Stephan Rabanser retweeted

Xingyu Fu@XingyuFu2·Jun 17

🚀 New paper: Context-Aware RL for Agentic and Multimodal LLMs 👉 LLMs often fail not because the answer is impossible, but because they miss the one decisive clue hidden in a long trace or image. 🔥 We introduce ContextRL: RL that teaches models to identify which context actually supports an answer. ✅ +2.2% on 5 agentic benchmarks ✅ +1.8% across 12 VQA benchmarks ✅ Works for coding agents & multimodal reasoning ✅ Same contrastive data, but better objective — not data augmentation 🧠 The key idea: don’t only reward the final answer. Reward the model for grounding it in the right evidence. 📄 Paper: xupy2003.github.io/ContextRL_Webs… This work is done with amazing collaborators at Princeton and Davis! @PeiyangX622 @BangzhengL @letti_liu @karthik_r_n @viswanathpramod @prateekmittal_ 🙌 A huge shoutout to everyone! #Agentic #Agents #CodingAgents #RL #Multimodal #ComputerVision #LLM #VLM #MachineLearning #AI

138

22.1K

Stephan Rabanser retweeted

Hayoung Jung@hayounggjung·Jun 11

First paper of my PhD with my amazing advisors! There’s been a ton of hype and media coverage on OpenEvidence as an “AI co-pilot for clinicians”… and our long-horizon benchmark puts them to the test!! Our results suggest they are far from reliable for downstream use.

Manoel@manoelribeiro

New preprint! We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews. We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well. A thread 🧵 w/ @hayounggjung, @korolova & others

2.5K

Stephan Rabanser@steverab·25 May

At last week's developer conference, Google claimed that their newest frontier model produced an operating system from just a single prompt and ~$900 in API cost. At first sight, this seems impressive. But on closer look, the evidence is much thinner than the headline suggests. Most notably:
- The "single prompt" framing suggests that the agent could do this from just a few sentences with high-level instructions. But the prompt itself is many thousands of lines long and it is unclear what instructions Google provided in the prompt (and how much effort it took to even come up with the prompt in the first place).
- There is a lot of OS code on the internet and it is often attempted as a class project in college OS classes. Based on the information provided in Google's blog post, it is unclear to what extent the agent simply copied a well-known implementation from the internet.
- In such long-running, complex implementation tasks, it is important to understand what degree of human intervention was performed to help the agent achieve its goals. However, Google remains ambiguous about the level of hand-holding they performed in this experiment. They say that "no additional guidance or corrections from a human" were necessary, yet they document instances of imposing anti-cheating mechanisms between runs.
- Many key artifacts, such as the code, the prompt, and agent logs are unreleased. This makes it impossible for external researchers to verify these marketing claims.
- To Google's credit, they did release the overall cost and token budget. These details often remain undisclosed, and sharing them is a first step in the right direction.
More detail in our writeup at normaltech.ai/p/did-googles-… w/ @sayashk @RishiBommasani Andrew Schwartz @random_walker

4.7K

Stephan Rabanser@steverab·13 May

Working with agents for the past months has me convinced that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand if the agent really did its job! In our paper Log analysis is necessary for credible evaluation of AI agents, we ➡️introduce a taxonomy of threats to credible evaluation of AI agents (including construct validity and safety evaluation concerns); ➡️outline four key principles for conducting log analysis effectively; ➡️present a case study of how log analysis helped us to find a variety of benchmarking errors on τ-bench; and ➡️give a set of recommendations to improve log analysis quality and adoption. 📄arxiv.org/abs/2605.08545 More details in @PKirgis's thread below ⬇️

Peter Kirgis@PKirgis

New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: arxiv.org/pdf/2605.08545

3.4K

Stephan Rabanser retweeted

Sara Hooker@sarahookr·Apr 27

Most agentic benchmarks center around tasks that are automatically verifiable. But any task that is verifiable is also easy to optimize for. This work instead describes the future of critical open world evaluations. Led by @sayashk, our current draft is now live.

Sayash Kapoor@sayashk

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

191

41.3K

Stephan Rabanser@steverab·Apr 24

Check out our poster on Sat, Apr 25, 2026 11:15 AM – 1:45 PM PDT in Pavilion 3 P3-#1625! ICLR link: iclr.cc/virtual/2026/p… Paper: arxiv.org/abs/2506.04203 Joint work with @youheyork, Fangcheng Fu, @Renee42581826 (lead authors) and @Jintao_Zhang_, @niclane7, @Hades317.

156

Stephan Rabanser@steverab·Apr 24

Also, our plan isn't fixed: it reshapes itself to the quality bar. Drop the quality target from 90 → 85 on the same trace, and Cascadia routes 21% (not 50%) to the 671B and reallocates 4 of its GPUs to the smaller models. So the same system yields a very different cascade.

Stephan Rabanser@steverab·Apr 24

Sadly won't be at ICLR, but if you are, make sure to check out our model cascading work! Big LLMs give great answers but they're costly. Small LLMs are fast but weaker. What if you could get the quality of the big one at the latency of the small one most of the time? Meet CASCADIA, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving.
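For readers new to model cascading, the basic routing idea can be sketched in a few lines. This is a generic confidence-threshold cascade, not CASCADIA's scheduler (which, per the thread, also decides GPU allocation and deployment). The names `cascade`, `small_model`, and `large_model`, the confidence scores, and the threshold values are all hypothetical and exist only for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a prompt, returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, models: List[Model], threshold: float) -> Tuple[str, int]:
    """Query models from cheapest to most expensive.

    Returns the first answer whose confidence meets the threshold, together
    with the index of the model that produced it. The last model in the list
    always answers if no earlier model was confident enough.
    """
    if not models:
        raise ValueError("need at least one model")
    for index, model in enumerate(models[:-1]):
        answer, confidence = model(prompt)
        if confidence >= threshold:
            return answer, index
    answer, _ = models[-1](prompt)
    return answer, len(models) - 1


# Toy stand-ins for a small and a large LLM (assumptions, not real models).
def small_model(prompt: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short prompts.
    return "short answer", 0.9 if len(prompt) < 20 else 0.4


def large_model(prompt: str) -> Tuple[str, float]:
    return "detailed answer", 0.95


if __name__ == "__main__":
    models = [small_model, large_model]
    hard_prompt = "Prove that the halting problem is undecidable."
    print(cascade("2+2?", models, threshold=0.8))       # ('short answer', 0)
    print(cascade(hard_prompt, models, threshold=0.8))  # ('detailed answer', 1)
    print(cascade(hard_prompt, models, threshold=0.3))  # ('short answer', 0)
```

Lowering the threshold in the last call keeps the hard prompt on the small model, which is the toy analogue of a lower quality target sending less traffic to the largest model.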

2.1K
