
Posts by EvalEval Coalition

2026 ACL Workshop on Evaluating AI in Practice
This workshop focuses on AI evaluation in practice, centering the tensions and collaborations between model developers and evaluation researchers, and aiming to surface practical insights from across the...

3 days left!
📃 Writing, written, or just submitted a paper?
Submit it to the EvalEval workshop at ACL 2026 in San Diego!
evalevalai.com/events/2026-...
(including ARR Submissions, non-archival, positions, and extended abstracts!)
Submission Deadline: March 19th, 2026 AoE

1 month ago
ACL 2026 Workshop EvalEval
Welcome to the OpenReview homepage for ACL 2026 Workshop EvalEval

⏳ 9 more days! We extended the submission deadline for the EvalEval Workshop @ ACL 2026.

If your work touches AI evaluation, submit!

We welcome:
✅ Regular papers
✅ ARR submissions
✅ Non-archival work
✅ Position papers
✅ Extended abstracts

📅 Deadline: March 19
🌐 evalevalai.com/events/2026-...

1 month ago
Every Eval Ever: Toward a Common Language for AI Eval Reporting
The multistakeholder coalition EvalEval launches Every Eval Ever, a shared format and central eval repository. We’re working to resolve AI evaluation fragmentation, improving formatting, settings, and...

Read the full announcement: evalevalai.com/infrastructu...
Shared Task: evalevalai.com/events/share...
Project Webpage: evalevalai.com/projects/eve...

#AIEvaluation #EvalEval

2 months ago

Thankful to our partners for their feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

2 months ago

How can you help?

We are launching a shared task alongside our workshop at @aclmeeting.bsky.social

→ Two tracks: public + proprietary eval data
→ Co-authorship for qualifying contributors
→ Workshop at ACL 2026 (San Diego)
→ Deadline: May 1, 2026 📅

2 months ago

What we built:

📋 Metadata schema for cross-framework comparison
🔧 Validation via Hugging Face Jobs
🔌 Converters (Inspect AI, HELM, lm-eval-harness)
📊 Community repo organized by benchmark/model/run

✨ Captures scores AND context: settings, prompts, example-level data
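As a rough sketch of what one entry in that community repo could look like: the field names below are hypothetical, NOT the actual Every Eval Ever schema (the real spec is on the project webpage), but they illustrate the point that a score travels together with its settings and example-level context.

```python
# Hypothetical sketch only: these field names are illustrative, not the
# actual Every Eval Ever schema. The point is that a score gets reported
# together with the context needed to compare it across frameworks.
record = {
    "benchmark": "mmlu",
    "model": "llama-65b",
    "framework": "lm-eval-harness",            # which harness produced the run
    "metric": "accuracy",
    "score": 0.488,
    # The context that usually gets lost when only the score is reported:
    "settings": {
        "num_fewshot": 5,
        "prompt_template": "...",              # exact prompt used
        "answer_extraction": "loglikelihood",  # how answers were scored
    },
    "examples": "path/to/example_level_outputs.jsonl",
}

# A minimal structural check before contributing a record:
required = {"benchmark", "model", "framework", "metric", "score", "settings"}
missing = required - record.keys()
assert not missing, f"record is missing required fields: {missing}"
```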

2 months ago

This has real costs!

🔬 Signal buried in noise: we can't tell if differences reflect model capability or just setup
📦 Evaluation debt piles up silently across the ecosystem
🔎 Redundant re-runs of expensive evaluations

🌟That's where Every Eval Ever comes in

2 months ago

🤔Consider this scenario:

LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU

Same model. Same benchmark name. Different prompts, settings, extraction methods.

💡Which score is right? Both? Neither? We can't compare. 🤷
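To make the failure mode concrete, here is a purely hypothetical sketch: neither function below is taken from HELM or lm-eval-harness, but they mimic how two reasonable answer-extraction strategies can grade the exact same model output differently.

```python
import re

model_output = "The answer is (B) because mitochondria produce ATP."
gold = "B"

def extract_first_letter(text: str) -> str | None:
    # Strategy 1 (lenient): take the first standalone A-D anywhere.
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None

def extract_strict_prefix(text: str) -> str | None:
    # Strategy 2 (strict): only accept output that starts with "A."-"D."
    m = re.match(r"^([A-D])\.", text.strip())
    return m.group(1) if m else None

print(extract_first_letter(model_output) == gold)   # True  -> scored correct
print(extract_strict_prefix(model_output) == gold)  # False -> scored wrong
```

Same output, two verdicts: aggregate that over thousands of items and the headline numbers diverge, even before prompts and few-shot settings differ.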

2 months ago
Every Eval Ever | EvalEval Coalition

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀

A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧

A tale of broken AI evals 🧵👇

evalevalai.com/projects/eve...

2 months ago
https://evalevalai.com/events/2026-acl-workshop/

We're seeking submissions on:

🔍 Evaluation validity & reliability
🌍 Sociotechnical impacts
⚙️ Infrastructure & costs
🤝 Community-centered approaches

Full papers (6-8 pages), short papers (4 pages) or tiny papers (2 pages) welcome.

Check out the full CFP: t.co/JRSr50V7Y6

2 months ago

🚨 The next edition of EvalEval Workshop is coming to
@aclmeeting.bsky.social 2026!

🧠 Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" 🎇

📢 CFP is now open!!! More details ⏬

📍 San Diego
📝 Submission deadline: Mar 12, 2026

2 months ago

Thank you to everyone who attended, presented at, spoke at, or helped organize this workshop. You rock! Special thanks to the UK AI Security Institute for cohosting and their support.

4 months ago

It's a wrap on EvalEval in San Diego! A jam-packed day of learning, making new friends, critically examining the field of evals, and walking away with renewed energy and new collaborations!

We have a lot of announcements coming, but first: EvalEval will be back for #ACL2026!

4 months ago

📜Paper: arxiv.org/pdf/2511.056...
📝Blog: tinyurl.com/blogAI1

🤝At EvalEval, we are a coalition of researchers working towards better AI evals. Interested in joining us? Check out: evalevalai.com 7/7 🧵

5 months ago

Continued...

📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it’s easier to define and measure

🛡️Solution? Standardized reporting & safety policies (6/7)

5 months ago

Key Takeaways:

⛔️ First-party reporting is often sparse & superficial, with many reporting NO social impact evals
📉 On average, first-party scores are far lower than third-party evals (0.72 vs 2.62 out of 3)
🎯 Third parties provide some complementary coverage (GPT-4 and LLaMA) (5/7)

5 months ago

💡 We also interviewed developers from for-profit and non-profit orgs to understand why some disclosures happen and why others don’t.

💬 TLDR: Incentives and constraints shape reporting (4/7)

5 months ago

📊 What we did:

🔎 Analyzed 186 first-party release reports from model developers & 183 post-release evaluations (third-party)
📏 Scored 7 social impact dimensions: bias, harmful content, performance disparities, environmental costs, privacy, financial costs, & labor (3/7)

5 months ago

While general capability evaluations are common, social impact assessments, covering bias, fairness, privacy, and more, are often fragmented or missing. 🧠

🎯Our goal: Explore the AI Eval landscape to answer who evaluates what and identify gaps in social impact evals!! (2/7)

5 months ago

🚨 AI keeps scaling, but social impact evaluations aren't, and the data proves it 🚨

Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)

5 months ago

Note: General registration is limited by space capacity! Attendance will be confirmed by the organizers based on availability. Accepted posters will be invited to register for free and attend the workshop in person!

5 months ago

📮 We are inviting students and early-stage researchers to submit an abstract (max 500 words) to be presented as a poster during the interactive session. Submit here: tinyurl.com/AbsEval

We have a rock-star lineup of AI researchers and an amazing program. Please RSVP as early as possible! Stay tuned!

5 months ago

🚨 EvalEval is back - now in San Diego!🚨

🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (co-hosted with UKAISI)

📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025

Details below! ⬇️
evalevalai.com/events/works...

5 months ago

💡This paper was brought to you as part of our spotlight series featuring papers on evaluation methods & datasets, the science of evaluation, and more.

📸Interested in working on better AI evals? Join us: evalevalai.com

5 months ago

🚫 The approach also avoids mislabeled data and delays benchmark saturation, continuing to distinguish model improvements even at high performance levels.

📑Read more: arxiv.org/abs/2509.11106

5 months ago

📊Results & Findings

🧪 Experiments across 6 LLMs and 6 major benchmarks:

🏃Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️It achieves lower variance with up to 50× fewer items needed!!

5 months ago

It combines two key ideas:

✍️Item Response Theory: Models LLM performance in a latent ability space, based on each item's difficulty and discrimination across models
🧨Dynamic Item Selection: Adaptive benchmarking where weaker models get easier items while stronger models face harder ones (sketched below)
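A toy sketch of how the two ideas combine, assuming a standard 2PL IRT model and maximum-information item selection; this is an illustration only, not the paper's implementation (see arxiv.org/abs/2509.11106 for the real method).

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 100
a = rng.uniform(0.5, 2.0, n_items)   # discrimination of each item
b = rng.normal(0.0, 1.0, n_items)    # difficulty of each item

def p_correct(theta: float) -> np.ndarray:
    # 2PL IRT: P(correct on item i) = sigmoid(a_i * (theta - b_i))
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta: float) -> np.ndarray:
    # Fisher information of each item under 2PL: a^2 * p * (1 - p)
    p = p_correct(theta)
    return a**2 * p * (1.0 - p)

true_theta = 0.8        # simulated "real" ability of the model under test
theta_hat = 0.0         # running ability estimate
asked: set[int] = set()

for step in range(10):
    info = item_information(theta_hat)
    info[list(asked)] = -np.inf        # never administer an item twice
    item = int(np.argmax(info))        # most informative item right now
    asked.add(item)
    correct = rng.random() < p_correct(true_theta)[item]
    # Crude fixed-step update for brevity; the real method would refit
    # theta by maximum likelihood after each response.
    theta_hat += 0.3 if correct else -0.3
    print(f"step {step}: item {item} (difficulty {b[item]:+.2f}), theta ~ {theta_hat:+.2f}")
```

Because item information peaks when a model's ability sits near an item's difficulty, the selector naturally serves easier items while the estimate is low and harder ones as it climbs.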

5 months ago

🔍How to address this? 🤔

🧩Fluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each model’s capability level.

Continued...👇

5 months ago

⚠️ Evaluation results can be noisy and prone to variance & labeling errors.
🧱As models advance, benchmarks tend to saturate quickly, reducing their long-term usefulness.
🪃Existing approaches typically tackle just one of these problems (e.g., efficiency or validity).

What now⁉️

5 months ago

💣Current SOTA benchmarking setups face several systematic issues:

📉It’s often unclear which benchmark(s) to choose, while evaluating on all available ones is too expensive, inefficient, and not always aligned with the intended capabilities we want to measure.

More 👇👇

5 months ago