EvalEval Coalition
@evaluatingevals

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Joined June 2025
7 Following · 410 Followers

EvalEval Coalition @evaluatingevals
3 days left! Writing, wrote, or just submitted a paper? Submit it to the EvalEval workshop at ACL 2026 in San Diego! evalevalai.com/events/2026-ac… (including ARR submissions, non-archival work, position papers, and extended abstracts!)
Submission deadline: March 19th, 2026 AoE
EvalEval Coalition @evaluatingevals
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit!
We welcome:
✅ Regular papers
✅ ARR submissions
✅ Non-archival work
✅ Position papers
✅ Extended abstracts
📅 Deadline: March 19 (1/2)
EvalEval Coalition @evaluatingevals
Sitting on results from papers or leaderboards? Whether you use lm-eval, Inspect AI, or HELM, we have low-lift converters ready to go. 🦾
💾 GitHub: github.com/evaleval/every…
📜 Co-authorship on the shared task paper for qualifying contributors
📅 Deadline: May 1, 2026
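As a rough illustration of what a "low-lift converter" can look like, the sketch below flattens an lm-eval-harness-style results JSON into per-(model, task, metric) records. The actual converters and schema live in the linked GitHub repo; the field names and the exact JSON layout assumed here are illustrative, not the project's real interface.

```python
# Hypothetical converter sketch: lm-eval-style results -> flat records.
# Assumes the common lm-eval output shape {"results": {task: {metric: value}}};
# real files vary by version, so treat this as an illustration only.
import json


def lmeval_to_records(path: str, model_name: str) -> list[dict]:
    """Flatten an lm-eval results file into one record per task/metric."""
    with open(path) as f:
        raw = json.load(f)
    records = []
    for task, metrics in raw.get("results", {}).items():
        for metric, value in metrics.items():
            if not isinstance(value, (int, float)):
                continue  # skip aliases and other non-numeric fields
            records.append({
                "model": model_name,            # illustrative field names,
                "task": task,                   # not the project's schema
                "metric": metric,
                "score": float(value),
                "source_framework": "lm-eval",
            })
    return records
```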
EvalEval Coalition @evaluatingevals
🧪 Your LLM evaluation results could help the whole field 🚀
🧑‍🔬 Our ACL shared task is out! We’re building a unified, crowdsourced database to create a common language for AI evaluation reporting. And we need your data. (1/2)
evalevalai.com/events/shared-…
EvalEval Coalition @evaluatingevals
🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧
A tale of broken AI evals 🧵👇 evalevalai.com/projects/every…
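To make the idea concrete, here is a minimal sketch of what a framework-agnostic record in such a shared schema could look like. The class and field names are assumptions for illustration; the project's actual schema is defined in its repository.

```python
# Hypothetical unified eval record; field names are illustrative assumptions.
from dataclasses import dataclass, asdict


@dataclass
class EvalRecord:
    """One scored result in a framework-agnostic shape."""
    model: str             # e.g. "meta-llama/Llama-3-8B"
    benchmark: str         # e.g. "mmlu"
    metric: str            # e.g. "accuracy"
    score: float           # normalized to [0, 1] where possible
    source_framework: str  # e.g. "lm-eval", "inspect-ai", "helm"
    run_id: str = ""       # provenance pointer back to the raw run


rec = EvalRecord(model="demo-model", benchmark="mmlu", metric="accuracy",
                 score=0.71, source_framework="lm-eval")
print(asdict(rec))  # ready to drop into a crowdsourced results table
```

Once every framework's output is mapped into one record shape, results become directly comparable without rerunning each eval from scratch.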
EvalEval Coalition @evaluatingevals
We're seeking submissions on:
🔍 Evaluation validity & reliability
🌍 Sociotechnical impacts
⚙️ Infrastructure & costs
🤝 Community-centered approaches
Full papers (6–8 pages), short papers (4 pages), or tiny papers (2 pages) are welcome. Check out the CFP: evalevalai.com/events/2026-ac…
EvalEval Coalition @evaluatingevals
🚨 The next edition of the EvalEval Workshop is coming to @aclmeeting 2026!
🧠 Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" 🎇
📢 CFP is now open! More details ⏬
📍 San Diego
📝 Submission deadline: Mar 12, 2026
EvalEval Coalition @evaluatingevals
Thank you to everyone who attended, presented at, spoke at, or helped organize this workshop. You rock! Special thanks to the UK @AISecurityInst for cohosting and their support.
EvalEval Coalition @evaluatingevals
It's a wrap on EvalEval in San Diego! A jam-packed day of learning, making new friends, critically examining the field of evals, and walking away with renewed energy and new collaborations! We have a lot of announcements coming, but first: EvalEval will be back for #ACL2026!
EvalEval Coalition @evaluatingevals
@aardauzunoglu @tli104 @DanielKhashabi
📊 Key Findings
1. Some benchmarks are less reliable than we might think.
2. The mean–variance HARMONY plane reveals benchmark reliability.
3. Balancing via pruning stabilizes accuracy.
4. Scaling trends vary by model family.
5/n
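As a back-of-the-envelope illustration of the mean–variance idea (a sketch, not the paper's code): aggregate a benchmark's per-subset accuracies, then look at the mean against the variance. A high variance flags a benchmark whose headline average hides very uneven performance. The subset names and scores below are made up.

```python
# Illustrative sketch of a mean-variance view of benchmark reliability.
# Not the HARMONY implementation; inputs are invented for the example.
import statistics


def mean_variance_point(subset_scores: dict[str, float]) -> tuple[float, float]:
    """Return (mean, variance) of a benchmark's per-subset accuracies."""
    scores = list(subset_scores.values())
    return statistics.mean(scores), statistics.pvariance(scores)


# An MMLU-like benchmark with uneven subset difficulty:
print(mean_variance_point({"stem": 0.82, "humanities": 0.55, "law": 0.48}))
# High variance relative to peers suggests the average alone is misleading.
```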
EvalEval Coalition @evaluatingevals
✨ Weekly AI Evaluation Paper Spotlight ✨
What if the average performance scores we trust are actually hiding a benchmark’s flaws?
📰 “The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks” (@aardauzunoglu, @tli104, @DanielKhashabi) introduces HARMONY. 1/n