HUMAINE is Prolific's human-centered, demographically stratified LLM evaluation framework, published at #ICLR2026. 48,325 participants across 22 demographic groups in the US and UK, 42 models, 5 evaluation dimensions derived from factor analysis.
Paper and leaderboard: www.prolific.com/humaine 6/6
As a living benchmark, HUMAINE tracks whether demographic disagreement patterns remain stable as new models enter the field.
So far, age consistently produces the largest heterogeneity across model generations. 5/6
These patterns hold across new model generations. Improvements in conversational engagement and communication style do not reliably translate to improvements in trust perceptions.
All four models in our latest release are weakest on trust and safety. 4/6
Trust and safety is the least discriminative dimension and the hardest to move.
Models scoring identically on task performance rank up to 7 positions apart on trust. Open-ended conversation is an imprecise tool for eliciting safety-relevant behavior. 3/6
Of the three demographic axes studied, age produces the most preference heterogeneity, causing 2x more rank shuffling than ethnicity or politics. Models tuned on tech-savvy feedback systematically underperform for older populations.
The gap is in the audience, not the model. 2/6
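To make "rank shuffling" concrete, here is a minimal sketch, our own illustration rather than HUMAINE's published metric, that counts pairwise rank inversions (Kendall tau distance) between two subgroup leaderboards. Model names and orderings are hypothetical.

```python
# Illustrative sketch: quantifying "rank shuffling" between demographic
# subgroup leaderboards as pairwise rank inversions (Kendall tau distance).
# Model names and orderings are hypothetical, not HUMAINE data.
from itertools import combinations

def kendall_distance(rank_a: list[str], rank_b: list[str]) -> int:
    """Count model pairs ordered differently by the two leaderboards."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

# Hypothetical per-subgroup leaderboards (best model first).
leaderboard_18_34 = ["model_a", "model_b", "model_c", "model_d"]
leaderboard_55_plus = ["model_c", "model_a", "model_d", "model_b"]

print(kendall_distance(leaderboard_18_34, leaderboard_55_plus))  # 3 inversions
```

Averaging this distance over all pairs of subgroup leaderboards along one demographic axis gives one rough measure of how much that axis reshuffles the rankings.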
đź§µ Aggregate leaderboards conceal systematic preference disagreement across populations.
HUMAINE data shows a model's P(best) drops from 98.8% to 41.9% across evaluation dimensions. The top contender changes every time you change the metric. 1/6
#MLSky #AcademicSky #Research #ICLR2026
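For readers curious how a per-dimension P(best) can be computed, a minimal bootstrap sketch, assuming simple per-participant ratings; the data, names, and estimator here are illustrative, not HUMAINE's actual pipeline.

```python
# Minimal sketch of a bootstrap estimate of P(best) on one evaluation
# dimension; ratings and model names are hypothetical, and this is not
# HUMAINE's actual estimator.
import random

def p_best(ratings: dict[str, list[float]], n_boot: int = 10_000) -> dict[str, float]:
    """Share of bootstrap resamples in which each model has the top mean rating."""
    wins = {m: 0 for m in ratings}
    for _ in range(n_boot):
        # Resample each model's ratings with replacement and take the mean.
        means = {
            m: sum(random.choices(r, k=len(r))) / len(r)
            for m, r in ratings.items()
        }
        wins[max(means, key=means.get)] += 1
    return {m: w / n_boot for m, w in wins.items()}

# Hypothetical per-participant ratings on a single dimension.
ratings = {
    "model_a": [4.6, 4.8, 4.7, 4.5, 4.9],
    "model_b": [4.5, 4.7, 4.6, 4.8, 4.4],
}
print(p_best(ratings))  # e.g. {"model_a": 0.8..., "model_b": 0.1...}
```

Repeating this per dimension is what makes a 98.8% vs 41.9% spread visible: the same model can be a near-certain winner on one metric and a coin flip on another.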
Nora Petrova and John Burden of Prolific will also be presenting "The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries" at the I Can't Believe It's Not Better (ICBINB) Workshop, nominated for the Didactic Award.
openreview.net/forum?id=ev58E64j0n
See you at #ICLR2026.
"Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal-Moral Responsibility" by Maurice Chiodo*, Dennis MĂĽller, Paul Siewert, Jean-Luc Wetherall, Zoya Yasmine, John Burden
Ethics in Mathematics Project: www.ethics-in-mathematics.com
openreview.net/forum?id=KR8viVTrX4
"Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework" by Nora Petrova, Andrew Gordon, Enzo Blindow
We evaluated frontier models with participants stratified across demographic groups in the US and UK.
openreview.net/pdf?id=kVaE2kYjtV
Image description: three research papers authored by researchers from Prolific, alongside the text "Prolific is coming to ICLR 2026", the Prolific booth number, and the conference dates.
Our researchers have had two papers accepted to ICLR 2026!
We'll be at two poster sessions at the main conference, plus presenting at the ICBINB workshop, across the week in Rio. Thread below for details. đź§µ
@iclr-conf.bsky.social #ICLR2026 #MLSky #AIResearch đź§Ş
Job ad: I'm hiring for a Research Scientist to join my team at @joinprolific.bsky.social
If you've ever wondered who's working on the hard questions in online research — data quality, sampling methodology, the effect of AI on how research gets done — this is that job. [1/4]
Thanks Bonnie, I can understand your point about the limited space for the many screener options and have passed the feedback along to the product team!
Pre-print: osf.io/preprints/ps...
Credit: @andrewgordon.bsky.social, Simon Jones, @davmicrot.bsky.social, Felipe Affonso, Justin Sulik, David Hauser, Karine Pepin.
This means the broader question should be about where researchers are recruiting human participants from, as opposed to whether their sample contains AI agents.
A more consequential finding: human data quality varied substantially by platform type.
Direct panels outperformed hybrid networks, which outperformed marketplace aggregators, across nearly all behavioral measures. The cheapest platforms per respondent were the most expensive per quality respondent.
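The arithmetic behind that last claim, with hypothetical prices and pass rates rather than figures from the study: effective cost is price per respondent divided by the quality pass rate.

```python
# Back-of-envelope illustration of "cheapest per respondent can be most
# expensive per quality respondent". Prices and pass rates are hypothetical.
def cost_per_quality_respondent(price_per_respondent: float, pass_rate: float) -> float:
    """Effective cost once low-quality responses are discarded."""
    return price_per_respondent / pass_rate

print(cost_per_quality_respondent(0.50, 0.09))  # cheap aggregator: ~$5.56
print(cost_per_quality_respondent(1.50, 0.90))  # pricier direct panel: ~$1.67
```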
A pre-registered study recruiting 5,200 respondents across 10 platforms, spanning direct panels, hybrid networks, and marketplace aggregators.
AI agent infiltration was absent across the platforms tested, and automated responses were more consistent with traditional scripted bots than with LLM-based agents.
New pre-print: "AI Agent Prevalence and Data Quality Across Multiple Online Sample Providers," led by Prolific's research sciences team and researchers from @msftresearch.bsky.social, Oklahoma State University, @lmu.de, Queen's University, and The Research Heads. đź§µ
#AcademicSky #Research
New preprint out today (osf.io/preprints/ps...). We tested whether AI agents are actually infiltrating online surveys.
Spoiler alert: they aren't
Thread đź§µ
[1/9]
Starting today, if an AI agent is detected in your Prolific study, we’ll give you twice the cost of that non-human participant back.
You pay for human data. You expect human data. We’re backing that with our 100% Human Guarantee.
Learn more: www.prolific.com/100-human-gu... #AcademicSky #Research
As of today, if an AI agent is detected in your Prolific study, you'll get twice the cost of that participant back. We’re calling this our 100% Human Guarantee.
Years of investment in @joinprolific.bsky.social's systems have made us confident in our data integrity.
www.prolific.com/100-human-gu...
New working paper on online research data quality, led by @univie.ac.at, reveals that pass rates on quality checks vary wildly by source. Pretty interesting.
Prolific: 90% | Lab: 80% | Bilendi: 73% | Moblab: 55% | MTurk: 9% | AI agents: 0%
github.com/survey-data-... CC @jyusof.bsky.social
Tested against 5 agents, including a custom "white hat" model similar to Westwood (2025), Prolific's checks achieved 100% accuracy, outperforming all others.
Join @andrewgordon.bsky.social in tomorrow's live demo to learn more: resources.prolific.com/how-to-bot-p...
We tested which methods most accurately identify agentic AI in online research.
Our research team compared attention checks, consistency checks, cognitive traps, reCAPTCHA, and behavioral tracking against our new bot authenticity checks. đź§µ
#AcademicSky #Research
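As one toy illustration of the kinds of checks compared, a minimal consistency check that flags respondents whose answers to a repeated item disagree; the field names and tolerance are hypothetical, not Prolific's implementation.

```python
# Toy example of one class of check mentioned above: a consistency check
# that flags respondents whose answers to a repeated item disagree.
# Field names and the tolerance are hypothetical, not Prolific's implementation.
def consistency_flags(responses: list[dict], item: str, repeat: str, tol: int = 1) -> list[str]:
    """Return IDs of respondents whose two answers to the same item diverge."""
    return [
        r["id"]
        for r in responses
        if abs(r[item] - r[repeat]) > tol
    ]

responses = [
    {"id": "p1", "q7": 5, "q7_repeat": 5},
    {"id": "p2", "q7": 2, "q7_repeat": 6},  # inconsistent -> flagged
]
print(consistency_flags(responses, "q7", "q7_repeat"))  # ["p2"]
```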
The two-stage recruitment method the team propose is described as "an intentionally simple method, easy to implement across a variety of platforms."
Code and materials also on GitHub. Good read overall for those collecting data online: github.com/survey-data-... (4/4)
None of the agents passed all checks, and a novel video attention check alone filtered out 83% of them.
As AI tools get better at mimicking survey behavior, thoughtfully designed checks can still do the work. (3/4)
Pass rates on five data quality checks ranged from 90% down to single digits depending on where you recruit.
Prolific respondents had the highest pass rate of all platforms tested (90%), outperforming even the in-person lab benchmark (80%). (2/4)
Independent research by Celebi, Exley, Harrs, Kivimaki, @martaserragarcia.bsky.social & @jyusof.bsky.social (2026) compares data quality across online participants, AI agents, & human subjects in the lab, with interesting platform variation. (1/4)
đź”— www.ifo.de/en/cesifo/pu... #AcademicSky #EconSky
Our “How to bot-proof your online research” webinar series continues tomorrow, March 19th, with Dr Simon Jones and Rath Bala showing you how to verify participants on Prolific.
👉 resources.prolific.com/how-to-bot-p...
Catch us live from 4:30 PM GMT / 12:30 PM EDT / 9:30 AM PDT. #AcademicSky
Thanks @talha-ozudogru.bsky.social! Yes, we originally offered authenticity checks with Gorilla integration. Moving forward, we're looking to expand compatibility with additional platforms based on community feedback!
Interesting findings @cstrauch.bsky.social, thanks for sharing. This vulnerability is why we implement robust checks, monitoring, and in-study tooling to prevent AI misuse. We hope to continue collaborating with the community as we adapt these systems.