EvalEval Coalition

82 posts

EvalEval Coalition

@evaluatingevals

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Katılım Haziran 2025

7 Takip Edilen456 Takipçiler

EvalEval Coalition@evaluatingevals·1 May

Amazing effort by all the authors, working chairs and coalition members! If you're going to be at ICML, come say Hi! Stay tuned for updates on ICML socials 👋

English

175

EvalEval Coalition@evaluatingevals·1 May

Working groups led by @evijit, @AnkaReuel, @akhtarmubashara and Jenny Chim! Preprints: Social Impacts -> arxiv.org/pdf/2511.05613 Benchmark Saturation -> arxiv.org/pdf/2602.16763

English

149

EvalEval Coalition@evaluatingevals·1 May

🚀EvalEval is 2/2 accepted at @icmlconf 2026 🚀 1⃣ Who Evaluates AI's Social Impact? Mapping Coverage and Gaps in First and Third Party Evaluations 2⃣ When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation Details below 🧵

English

2.3K

EvalEval Coalition@evaluatingevals·29 Nis

New blog post on cost of AI Evals! Check it out 👇

Avijit Ghosh@evijit

AI evaluation is becoming its own compute bottleneck. We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks. In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over. This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems. The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap. Some takeaways: → Leaderboards should report cost alongside accuracy. → Reliability should not be treated as optional. → We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements. Read the full post: evalevalai.com/research/2026/… Thanks for the insights @LChoshen , Yifan Mai, and @cgeorgiaw🤗

English

945

EvalEval Coalition@evaluatingevals·17 Mar

3 days left! 📷 Writing, wrote, or just submitted a paper? Commit it to the EvalEval workshop at ACL 2026 in San Diego! evalevalai.com/events/2026-ac… (including ARR Submissions, non-archival, positions, and extended abstracts!) Submission Deadline: March 19th, 2026 AoE

English

3.6K

EvalEval Coalition@evaluatingevals·11 Mar

📄 Submission Link: openreview.net/group?id=aclwe… 🔗 Workshop Website: evalevalai.com/events/2026-ac… See you in San Diego! 🏖️ (2/2)

English

216

EvalEval Coalition@evaluatingevals·11 Mar

⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit! We welcome: ✅ Regular papers ✅ ARR submissions ✅ Non-archival work ✅ Position papers ✅ Extended abstracts 📅 Deadline: March 19 (1/2)

English

853

EvalEval Coalition@evaluatingevals·9 Mar

Sitting on results from papers or leaderboards? Whether you use lm-eval, Inspect AI, or HELM, we have low-lift converters ready to go. 🦾 💾 GitHub: github.com/evaleval/every… 📜 Co-authorship on the shared task paper for qualifying contributors 📅 Deadline: May 1, 2026

English

640

EvalEval Coalition@evaluatingevals·9 Mar

🧪 Your LLM evaluation results could help the whole field 🚀 🧑‍🔬 Our ACL Shared task is out! We’re building a unified, crowdsourced database to create a common language for AI evaluation reporting. And we need your data. (1/2) evalevalai.com/events/shared-…

English

17.5K

EvalEval Coalition@evaluatingevals·17 Şub

Read the full announcement: evalevalai.com/infrastructure… Shared Task: evalevalai.com/events/shared-… Project Webpage: evalevalai.com/projects/every… #AIEvaluation #EvalEval

English

415

EvalEval Coalition@evaluatingevals·17 Şub

Thankful to our partners for the feedback: CAISI, @AiEleuther, @huggingface, @NomaSecurity, @TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

English

652

EvalEval Coalition@evaluatingevals·17 Şub

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀 A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧 A tale of broken AI evals 🧵👇 evalevalai.com/projects/every…

English

12.5K

EvalEval Coalition@evaluatingevals·17 Şub

We're seeking submissions on: 🔍 Evaluation validity & reliability 🌍 Sociotechnical impacts ⚙️ Infrastructure & costs 🤝 Community-centered approaches Full papers (6-8 pages), short papers (4 pages) or tiny papers (2 pages) welcome. Check out the CFP: evalevalai.com/events/2026-ac…

English

286

EvalEval Coalition@evaluatingevals·17 Şub

🚨 The next edition of EvalEval Workshop is coming to @aclmeeting 2026! 🧠 Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" 🎇 📢 CFP is now open!!! More details ⏬ 📍 San Diego 📝 Submission deadline: Mar 12, 2026

English

2.8K

Keşfet

@evijit @AnkaReuel @akhtarmubashara @icmlconf @AiEleuther @huggingface @NomaSecurity @TrustibleAI