Stephan Rabanser

10.1K posts

@steverab

Postdoctoral Researcher @Princeton. Reliable, safe, trustworthy machine learning. Previously: @UofT @VectorInst @TU_Muenchen @Google @awscloud

Princeton, NJ · Joined April 2010
385 Following · 700 Followers
Pinned Tweet
Stephan Rabanser@steverab·
Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.
2
6
16
2.5K
Stephan Rabanser@steverab·
Working with agents for the past months has convinced me that outcome-only evaluation is a flawed approach to benchmarking. You need to look at the logs to understand if the agent really did its job! In our paper "Log analysis is necessary for credible evaluation of AI agents", we ➡️ introduce a taxonomy of threats to credible evaluation of AI agents (including construct validity and safety evaluation concerns); ➡️ outline four key principles for conducting log analysis effectively; ➡️ present a case study of how log analysis helped us find a variety of benchmarking errors on τ-bench; and ➡️ give a set of recommendations to improve log analysis quality and adoption. 📄 arxiv.org/abs/2605.08545 More details in @PKirgis's thread below ⬇️
Peter Kirgis@PKirgis

New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: arxiv.org/pdf/2605.08545

1
5
12
3.1K
Stephan Rabanser retweeted
Sara Hooker@sarahookr·
Most agentic benchmarks center around tasks that are automatically verifiable. But any task that is verifiable is also easy to optimize for. This work instead describes the future of critical open-world evaluations. Led by @sayashk, our current draft is now live.
Sayash Kapoor@sayashk

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

8
21
192
41K
Stephan Rabanser@steverab·
Also, our plan isn't fixed: it reshapes itself to the quality bar. Drop the quality target from 90 → 85 on the same trace, and Cascadia routes 21% (not 50%) to the 671B and reallocates 4 of its GPUs to the smaller models. So the same system yields a very different cascade.
1
0
1
40
Stephan Rabanser@steverab·
Sadly won't be at ICLR, but if you are, make sure to check out our model cascading work! Big LLMs give great answers, but they're costly. Small LLMs are fast but weaker. What if you could get the quality of the big one at the latency of the small one most of the time? Meet CASCADIA, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving.
1
0
2
1.9K
Stephan Rabanser retweeted
Gillian Hadfield@ghadfield·
Glad to be a part of this initiative to develop open-world evaluations for AI. We need the ability to assess just how capable agents are becoming in order to anticipate and respond to the impact they can have on real-world systems and transactions. An agent that can successfully act on the general instruction “build an app and get it posted in the App Store” is one that brings us closer to an economy of agents, with significant implications for how markets behave and need regulating. arxiv.org/pdf/2509.01063
Sayash Kapoor@sayashk

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1
6
23
6K
Stephan Rabanser retweeted
Peter Kirgis@PKirgis·
Yesterday, we announced CRUX, a project that aims to conduct regular “open-world evaluations,” where we will be testing the ability of AI agents to complete long-horizon tasks in messy, real-world environments. @sayashk's post dives into the details; here are a few of my own thoughts about why this is worth doing.
Sayash Kapoor@sayashk

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1
3
11
4K
Stephan Rabanser retweeted
Cozmin Ududec@CUdudec·
This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings! Glad the AISI SoE team could contribute to this effort.
Sayash Kapoor@sayashk

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1
5
28
8.2K
Stephan Rabanser retweeted
Sayash Kapoor@sayashk·
Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.
15
53
252
93.6K
Stephan Rabanser retweeted
Arvind Narayanan@random_walker·
📢📢A double launch today! We’re releasing a paper analyzing the rapidly growing trend of “open-world evaluations” for measuring frontier AI capabilities. We’re also launching a new project, CRUX (Collaborative Research for Updating AI eXpectations), an effort to regularly conduct such evaluations ourselves. I think open-world evals are the most important development in AI evaluation over the past year. Our paper explains why we need them, what they can and can’t tell us, and how to do them well. In CRUX #1, we tasked an agent with building and publishing a simple iOS app to the Apple App Store. The paper has many “lessons from the trenches” from running this experiment. We hope you find it interesting! CRUX #2 will be about AI R&D automation. The core team is @sayashk, @PKirgis, @steverab, Andrew Schwartz, and me. We’re delighted to have assembled an amazing group of collaborators, many of whom have conducted important open-world evaluations: @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, and @CUdudec. Paper: cruxevals.com/open-world-eva… HTML version: normaltech.ai/p/open-world-e… CRUX website: cruxevals.com
2
20
94
12.2K
Stephan Rabanser@steverab·
📄Paper draft: cruxevals.com/open-world-eva… 📝Substack essay: normaltech.ai/p/open-world-e… 🕸️Website: cruxevals.com 🪵Full agent logs: docent.transluce.org/dashboard/b649… 💡Share your own CRUX ideas: docs.google.com/forms/d/e/1FAI… We are excited to run more instances of CRUX in the future! Grateful to have worked on this with many collaborators across academia, industry, non-profits, and government: @sayashk, @PKirgis, Andrew Schwartz, @random_walker, @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, @CUdudec!
0
2
7
724
Stephan Rabanser@steverab·
📈𝗕𝗲𝘆𝗼𝗻𝗱 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀 A robust evaluation ecosystem requires both approaches! We still need bottom-up testing with detailed, task-level specifications. But we must pair this with top-down testing: long-running tasks that test how agents handle real-world ambiguity.
1
0
0
104
Stephan Rabanser@steverab·
Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.
2
6
16
2.5K