EvalEval Coalition
@evaluatingevals

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Joined June 2025
7 Following · 410 Followers

EvalEval Coalition @evaluatingevals
3 days left! Writing, wrote, or just submitted a paper? Submit it to the EvalEval workshop at ACL 2026 in San Diego! evalevalai.com/events/2026-ac… (including ARR submissions, non-archival work, position papers, and extended abstracts!)
Submission deadline: March 19th, 2026 AoE
EvalEval Coalition @evaluatingevals
⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026. If your work touches AI evaluation, submit!
We welcome:
✅ Regular papers
✅ ARR submissions
✅ Non-archival work
✅ Position papers
✅ Extended abstracts
📅 Deadline: March 19 (1/2)
EvalEval Coalition @evaluatingevals
Sitting on results from papers or leaderboards? Whether you use lm-eval, Inspect AI, or HELM, we have low-lift converters ready to go. 🦾
💾 GitHub: github.com/evaleval/every…
📜 Co-authorship on the shared task paper for qualifying contributors
📅 Deadline: May 1, 2026
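As a rough illustration of what a "low-lift converter" can look like, the sketch below flattens an lm-eval-harness-style results JSON into per-(model, task, metric) records. The actual converters and schema live in the linked GitHub repo; the field names and the exact JSON layout assumed here are illustrative, not the project's real interface.

```python
# Hypothetical converter sketch: lm-eval-style results -> flat records.
# Assumes the common lm-eval output shape {"results": {task: {metric: value}}};
# real files vary by version, so treat this as an illustration only.
import json


def lmeval_to_records(path: str, model_name: str) -> list[dict]:
    """Flatten an lm-eval results file into one record per task/metric."""
    with open(path) as f:
        raw = json.load(f)
    records = []
    for task, metrics in raw.get("results", {}).items():
        for metric, value in metrics.items():
            if not isinstance(value, (int, float)):
                continue  # skip aliases and other non-numeric fields
            records.append({
                "model": model_name,            # illustrative field names,
                "task": task,                   # not the project's schema
                "metric": metric,
                "score": float(value),
                "source_framework": "lm-eval",
            })
    return records
```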
EvalEval Coalition @evaluatingevals
🧪 Your LLM evaluation results could help the whole field 🚀
🧑‍🔬 Our ACL shared task is out! We’re building a unified, crowdsourced database to create a common language for AI evaluation reporting. And we need your data. (1/2)
evalevalai.com/events/shared-…
EvalEval Coalition @evaluatingevals
🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀
A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧
A tale of broken AI evals 🧵👇 evalevalai.com/projects/every…
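To make the idea concrete, here is a minimal sketch of what a framework-agnostic record in such a shared schema could look like. The class and field names are assumptions for illustration; the project's actual schema is defined in its repository.

```python
# Hypothetical unified eval record; field names are illustrative assumptions.
from dataclasses import dataclass, asdict


@dataclass
class EvalRecord:
    """One scored result in a framework-agnostic shape."""
    model: str             # e.g. "meta-llama/Llama-3-8B"
    benchmark: str         # e.g. "mmlu"
    metric: str            # e.g. "accuracy"
    score: float           # normalized to [0, 1] where possible
    source_framework: str  # e.g. "lm-eval", "inspect-ai", "helm"
    run_id: str = ""       # provenance pointer back to the raw run


rec = EvalRecord(model="demo-model", benchmark="mmlu", metric="accuracy",
                 score=0.71, source_framework="lm-eval")
print(asdict(rec))  # ready to drop into a crowdsourced results table
```

Once every framework's output is mapped into one record shape, results become directly comparable without rerunning each eval from scratch.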
EvalEval Coalition @evaluatingevals
We're seeking submissions on:
🔍 Evaluation validity & reliability
🌍 Sociotechnical impacts
⚙️ Infrastructure & costs
🤝 Community-centered approaches
Full papers (6–8 pages), short papers (4 pages), or tiny papers (2 pages) are welcome. Check out the CFP: evalevalai.com/events/2026-ac…
EvalEval Coalition @evaluatingevals
🚨 The next edition of the EvalEval Workshop is coming to @aclmeeting 2026!
🧠 Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" 🎇
📢 CFP is now open! More details ⏬
📍 San Diego
📝 Submission deadline: Mar 12, 2026
EvalEval Coalition @evaluatingevals
Thank you to everyone who attended, presented at, spoke at, or helped organize this workshop. You rock! Special thanks to the UK @AISecurityInst for cohosting and their support.
EvalEval Coalition @evaluatingevals
It's a wrap on EvalEval in San Diego! A jam-packed day of learning, making new friends, critically examining the field of evals, and walking away with renewed energy and new collaborations! We have a lot of announcements coming, but first: EvalEval will be back for #ACL2026!
EvalEval Coalition @evaluatingevals
@aardauzunoglu @tli104 @DanielKhashabi
📊 Key Findings
1. Some benchmarks are less reliable than we might think.
2. The mean–variance HARMONY plane reveals benchmark reliability.
3. Balancing via pruning stabilizes accuracy.
4. Scaling trends vary by model family.
5/n
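As a back-of-the-envelope illustration of the mean–variance idea (a sketch, not the paper's code): aggregate a benchmark's per-subset accuracies, then look at the mean against the variance. A high variance flags a benchmark whose headline average hides very uneven performance. The subset names and scores below are made up.

```python
# Illustrative sketch of a mean-variance view of benchmark reliability.
# Not the HARMONY implementation; inputs are invented for the example.
import statistics


def mean_variance_point(subset_scores: dict[str, float]) -> tuple[float, float]:
    """Return (mean, variance) of a benchmark's per-subset accuracies."""
    scores = list(subset_scores.values())
    return statistics.mean(scores), statistics.pvariance(scores)


# An MMLU-like benchmark with uneven subset difficulty:
print(mean_variance_point({"stem": 0.82, "humanities": 0.55, "law": 0.48}))
# High variance relative to peers suggests the average alone is misleading.
```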
EvalEval Coalition @evaluatingevals
✨ Weekly AI Evaluation Paper Spotlight ✨
What if the average performance scores we trust are actually hiding a benchmark’s flaws?
📰 “The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks” (@aardauzunoglu, @tli104, @DanielKhashabi) introduces HARMONY. 1/n