Erica

4 posts

@ericavaneee

PhD @stanfordeng | AI Fellow @jumptrading, Ex-@amazonscience
Reliable intelligence for the real world.

Palo Alto · Joined February 2025
53 Following · 137 Followers
Pinned Tweet
Erica @ericavaneee
We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier.

🏆 Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2.
✨ Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B, best open-weight, holds up as negotiations get harder.
🔗 terms-bench.github.io
21 replies · 27 reposts · 232 likes · 33.8K views
Erica @ericavaneee
To our knowledge, this is the first benchmark to bring verifier-based evaluation (the paradigm behind progress in math, code, and DB agents) into a multi-turn social-strategic domain.

💡 The payoff: you can see where models break, not just whether they do.
1 reply · 0 reposts · 4 likes · 1K views
Erica @ericavaneee
Three tiers, increasing in real-world grounding:
• Synthetic suite: controlled Bayesian-game environments
• Catalog-grounded: real product price data
• Procurement chains: stateful multi-agent commercial settings

Verifier-based eval at each tier: the environment itself, not an LLM judge, scores the agent.
2 replies · 0 reposts · 4 likes · 1.3K views
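
To make "the environment itself, not an LLM judge, scores the agent" concrete, here is a minimal sketch of a toy single-issue bargaining game. It is not TERMS-Bench code: the names `Deal`, `BargainingEnv`, and `verify`, and the surplus-share scoring rule, are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deal:
    """Outcome of one negotiation episode; price is None if no agreement."""
    price: Optional[float]


@dataclass(frozen=True)
class BargainingEnv:
    """Hypothetical buyer-seller environment with private valuations."""
    buyer_value: float   # the most the buyer should pay
    seller_cost: float   # the least the seller should accept

    def verify(self, deal: Deal) -> dict:
        """Score the buyer from ground-truth state alone (no LLM judge)."""
        surplus = self.buyer_value - self.seller_cost
        if deal.price is None:
            # Walking away is a failure only if a profitable deal existed.
            return {"valid": True, "buyer_share": 0.0, "missed_deal": surplus > 0}
        valid = self.seller_cost <= deal.price <= self.buyer_value
        if not valid:
            # One side accepted a price that leaves it worse off.
            return {"valid": False, "buyer_share": 0.0, "missed_deal": False}
        # Fraction of the available surplus captured by the buyer.
        share = (self.buyer_value - deal.price) / surplus if surplus > 0 else 0.0
        return {"valid": True, "buyer_share": share, "missed_deal": False}


env = BargainingEnv(buyer_value=100.0, seller_cost=60.0)
good = env.verify(Deal(price=70.0))    # buyer keeps 30 of 40 surplus
bad = env.verify(Deal(price=110.0))    # buyer overpaid: invalid
walk = env.verify(Deal(price=None))    # no deal despite positive surplus
```

Because the score is computed from the environment's hidden state, the result also records the failure type (an invalid price versus a missed deal), which is one way a verifier can show where an agent breaks and not only whether it does.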