Erica

@ericavaneee

PhD @stanfordeng | AI Fellow @jumptrading, Ex-@amazonscience | Reliable intelligence for the real world.

Palo Alto · Joined February 2025

53 Following · 137 Followers

Pinned Tweet

Erica @ericavaneee · 17 May

We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier.

🏆 Among frontier models, @AnthropicAI Claude Opus 4.6 ranks #1 and @Zai_org GLM 5.1 ranks #2.
✨ Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B, the best open-weight model, holds up as negotiations get harder.
🔗 terms-bench.github.io


Erica @ericavaneee · 17 May

Joint work across @StanfordHAI, @StanfordEng, and @StanfordGSB with @fangzhao_zhang, @aneeshpappu, @elb4tu, @jose_blanchet, @Susan_Athey, @liujiashuo77, and @james_y_zou. Thanks to @ivanleomk, @osanseviero, @o_lacombe, and @GoogleDeepMind for hosting the Gemma open-model event in SF where we first presented this ❤️🚀!


Erica @ericavaneee · 17 May

To our knowledge, this is the first benchmark to bring verifier-based evaluation (the paradigm behind progress in math, code, and database agents) into a multi-turn social-strategic domain.

💡 The payoff: you can see where models break, not just whether they do.


Erica @ericavaneee · 17 May

Three tiers, increasing in real-world grounding:
• Synthetic suite: controlled Bayesian-game environments
• Catalog-grounded: real product price data
• Procurement chains: stateful multi-agent commercial settings

Verifier-based eval at each tier: the environment itself, not an LLM judge, scores the agent.
