EvalEval Coalition

82 posts

EvalEval Coalition
@evaluatingevals

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Joined June 2025
7 Following · 456 Followers

EvalEval Coalition @evaluatingevals
Amazing effort by all the authors, working group chairs, and coalition members! If you're going to be at ICML, come say hi! Stay tuned for updates on ICML socials 👋

EvalEval Coalition @evaluatingevals
🚀 EvalEval is 2/2 accepted at @icmlconf 2026 🚀
1⃣ Who Evaluates AI's Social Impact? Mapping Coverage and Gaps in First and Third Party Evaluations
2⃣ When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
Details below 🧵

EvalEval Coalition @evaluatingevals
New blog post on the cost of AI evals! Check it out 👇
Avijit Ghosh @evijit

AI evaluation is becoming its own compute bottleneck. We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks.

In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over.

This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems.

The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap.

Some takeaways:
→ Leaderboards should report cost alongside accuracy.
→ Reliability should not be treated as optional.
→ We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements.

Read the full post: evalevalai.com/research/2026/…

Thanks for the insights @LChoshen, Yifan Mai, and @cgeorgiaw 🤗
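To make the scaling concrete, here is a minimal back-of-the-envelope sketch in Python. The per-rollout cost is an average derived from the HAL figures quoted above; the model count, benchmark sizes, and repeat count in the example are hypothetical inputs, not numbers from the post.

# Back-of-the-envelope eval cost estimate using the figures quoted above.
# The per-rollout average comes from the Holistic Agent Leaderboard numbers;
# everything passed into estimated_cost() below is a hypothetical example.

HAL_TOTAL_COST_USD = 40_000   # reported spend
HAL_ROLLOUTS = 21_730         # reported rollouts (9 models x 9 benchmarks)
COST_PER_ROLLOUT = HAL_TOTAL_COST_USD / HAL_ROLLOUTS  # ~$1.84 per agent rollout

def estimated_cost(models, benchmarks, tasks_per_benchmark, repeats,
                   cost_per_rollout=COST_PER_ROLLOUT):
    """Rough total cost if every (model, benchmark, task) triple is run `repeats` times."""
    return models * benchmarks * tasks_per_benchmark * repeats * cost_per_rollout

# A modest agent eval, repeated 5x for statistical reliability, already lands near $41K.
print(f"${estimated_cost(models=3, benchmarks=5, tasks_per_benchmark=300, repeats=5):,.0f}")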

EvalEval Coalition @evaluatingevals
3 days left! Writing, wrote, or just submitted a paper? Submit it to the EvalEval workshop at ACL 2026 in San Diego! evalevalai.com/events/2026-ac… (including ARR submissions, non-archival work, position papers, and extended abstracts!) Submission deadline: March 19, 2026 AoE

EvalEval Coalition @evaluatingevals
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit!
We welcome:
✅ Regular papers
✅ ARR submissions
✅ Non-archival work
✅ Position papers
✅ Extended abstracts
📅 Deadline: March 19 (1/2)

EvalEval Coalition @evaluatingevals
Sitting on results from papers or leaderboards? Whether you use lm-eval, Inspect AI, or HELM, we have low-lift converters ready to go. 🦾
💾 GitHub: github.com/evaleval/every…
📜 Co-authorship on the shared task paper for qualifying contributors
📅 Deadline: May 1, 2026

EvalEval Coalition @evaluatingevals
🧪 Your LLM evaluation results could help the whole field 🚀
🧑‍🔬 Our ACL shared task is out! We're building a unified, crowdsourced database to create a common language for AI evaluation reporting. And we need your data. (1/2)
evalevalai.com/events/shared-…

EvalEval Coalition @evaluatingevals
🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧
A tale of broken AI evals 🧵👇
evalevalai.com/projects/every…
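For illustration only, here is a minimal sketch of what a single framework-agnostic eval record might look like. The field names and values below are assumptions made for this example, not the actual Every Eval Ever schema; see the project repository for the real format.

# Illustrative only: one possible shape for a framework-agnostic eval record.
# Field names and values are assumptions for this sketch, not the real schema.
import json

record = {
    "model": {"name": "example-model-v1", "provider": "example-org"},  # hypothetical model
    "benchmark": {"name": "example-benchmark", "version": "1.0", "split": "test"},
    "framework": "lm-eval",                          # e.g. lm-eval, Inspect AI, HELM
    "metric": {"name": "accuracy", "value": 0.42},   # made-up score for illustration
    "runs": 5,                                       # repeated runs behind the reported value
    "cost_usd": 1250.0,                              # evaluation cost, if known
    "source": "https://example.org/leaderboard",     # where the number was reported
}

print(json.dumps(record, indent=2))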

EvalEval Coalition @evaluatingevals
We're seeking submissions on:
🔍 Evaluation validity & reliability
🌍 Sociotechnical impacts
⚙️ Infrastructure & costs
🤝 Community-centered approaches
Full papers (6-8 pages), short papers (4 pages) or tiny papers (2 pages) welcome.
Check out the CFP: evalevalai.com/events/2026-ac…

EvalEval Coalition @evaluatingevals
🚨 The next edition of the EvalEval Workshop is coming to @aclmeeting 2026! 🧠
Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" 🎇
📢 CFP is now open! More details ⏬
📍 San Diego
📝 Submission deadline: Mar 12, 2026