HUMAINE is Prolific's human-centered, demographically stratified LLM evaluation framework, published at #ICLR2026. 48,325 participants across 22 demographic groups in the US and UK, 42 models, 5 evaluation dimensions derived from factor analysis.
Paper and leaderboard: www.prolific.com/humaine 6/6
As a living benchmark, HUMAINE tracks whether demographic disagreement patterns remain stable as new models enter the field.
So far, age consistently produces the largest heterogeneity across model generations. 5/6
These patterns hold across new model generations. Improvements in conversational engagement and communication style do not reliably translate to improvements in trust perceptions.
All four models in our latest release are weakest on trust and safety. 4/6
Trust and safety is the least discriminative dimension and the hardest to move.
Models scoring identically on task performance rank up to 7 positions apart on trust. Open-ended conversation is an imprecise tool for eliciting safety-relevant behavior. 3/6
Of the three demographic axes studied, age produces the most preference heterogeneity, causing 2x more rank shuffling than ethnicity or politics. Models tuned on tech-savvy feedback systematically underperform for older populations.
The gap is in the audience, not the model. 2/6
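To make "rank shuffling" concrete, here is a minimal sketch, our own illustration rather than HUMAINE's published metric, that counts pairwise rank inversions (Kendall tau distance) between two subgroup leaderboards. Model names and orderings are hypothetical.

```python
# Illustrative sketch: quantifying "rank shuffling" between demographic
# subgroup leaderboards as pairwise rank inversions (Kendall tau distance).
# Model names and orderings are hypothetical, not HUMAINE data.
from itertools import combinations

def kendall_distance(rank_a: list[str], rank_b: list[str]) -> int:
    """Count model pairs ordered differently by the two leaderboards."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

# Hypothetical per-subgroup leaderboards (best model first).
leaderboard_18_34 = ["model_a", "model_b", "model_c", "model_d"]
leaderboard_55_plus = ["model_c", "model_a", "model_d", "model_b"]

print(kendall_distance(leaderboard_18_34, leaderboard_55_plus))  # 3 inversions
```

Averaging this distance over all pairs of subgroup leaderboards along one demographic axis gives one rough measure of how much that axis reshuffles the rankings.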
đź§µ Aggregate leaderboards conceal systematic preference disagreement across populations.
HUMAINE data shows a model's P(best) drops from 98.8% to 41.9% across evaluation dimensions. The top contender changes every time you change the metric. 1/6
#MLSky #AcademicSky #Research #ICLR2026
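For readers curious how a per-dimension P(best) can be computed, a minimal bootstrap sketch, assuming simple per-participant ratings; the data, names, and estimator here are illustrative, not HUMAINE's actual pipeline.

```python
# Minimal sketch of a bootstrap estimate of P(best) on one evaluation
# dimension; ratings and model names are hypothetical, and this is not
# HUMAINE's actual estimator.
import random

def p_best(ratings: dict[str, list[float]], n_boot: int = 10_000) -> dict[str, float]:
    """Share of bootstrap resamples in which each model has the top mean rating."""
    wins = {m: 0 for m in ratings}
    for _ in range(n_boot):
        # Resample each model's ratings with replacement and take the mean.
        means = {
            m: sum(random.choices(r, k=len(r))) / len(r)
            for m, r in ratings.items()
        }
        wins[max(means, key=means.get)] += 1
    return {m: w / n_boot for m, w in wins.items()}

# Hypothetical per-participant ratings on a single dimension.
ratings = {
    "model_a": [4.6, 4.8, 4.7, 4.5, 4.9],
    "model_b": [4.5, 4.7, 4.6, 4.8, 4.4],
}
print(p_best(ratings))  # e.g. {"model_a": 0.8..., "model_b": 0.1...}
```

Repeating this per dimension is what makes a 98.8% vs 41.9% spread visible: the same model can be a near-certain winner on one metric and a coin flip on another.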
Nora Petrova and John Burden of Prolific will also be presenting "The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries" at the I Can't Believe It's Not Better (ICBINB) Workshop, nominated for the Didactic Award.
openreview.net/forum?id=ev58E64j0n
See you at #ICLR2026.
"Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal-Moral Responsibility" by Maurice Chiodo*, Dennis MĂĽller, Paul Siewert, Jean-Luc Wetherall, Zoya Yasmine, John Burden
Ethics in Mathematics Project: www.ethics-in-mathematics.com
openreview.net/forum?id=KR8viVTrX4
"Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework" by Nora Petrova, Andrew Gordon, Enzo Blindow
We evaluated frontier models with participants stratified across demographic groups in the US and UK.
openreview.net/pdf?id=kVaE2kYjtV
Image description: three research papers authored by researchers from Prolific, alongside the text "Prolific is coming to ICLR 2026", the Prolific booth number, and the conference dates.
Our researchers have had two papers accepted to ICLR 2026!
We'll be at two poster sessions at the main conference, plus presenting at the ICBINB workshop, across the week in Rio. Thread below for details. đź§µ
@iclr-conf.bsky.social #ICLR2026 #MLSky #AIResearch đź§Ş
Job ad: I'm hiring for a Research Scientist to join my team at @joinprolific.bsky.social
If you've ever wondered who's working on the hard questions in online research — data quality, sampling methodology, the effect of AI on how research gets done — this is that job. [1/4]
Thanks Bonnie, I can understand your point about the limited space for the many screener options and have passed the feedback along to the product team!
Pre-print: osf.io/preprints/ps...
Credit: @andrewgordon.bsky.social, Simon Jones, @davmicrot.bsky.social, Felipe Affonso, Justin Sulik, David Hauser, Karine Pepin.
This means the broader question should be about where researchers are recruiting human participants from, as opposed to whether their sample contains AI agents.
A more consequential finding: human data quality varied substantially by platform type.
Direct panels outperformed hybrid networks, which outperformed marketplace aggregators, across nearly all behavioral measures. The cheapest platforms per respondent were the most expensive per quality respondent.
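The arithmetic behind that last claim, with hypothetical prices and pass rates rather than figures from the study: effective cost is price per respondent divided by the quality pass rate.

```python
# Back-of-envelope illustration of "cheapest per respondent can be most
# expensive per quality respondent". Prices and pass rates are hypothetical.
def cost_per_quality_respondent(price_per_respondent: float, pass_rate: float) -> float:
    """Effective cost once low-quality responses are discarded."""
    return price_per_respondent / pass_rate

print(cost_per_quality_respondent(0.50, 0.09))  # cheap aggregator: ~$5.56
print(cost_per_quality_respondent(1.50, 0.90))  # pricier direct panel: ~$1.67
```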
A pre-registered study recruiting 5,200 respondents across 10 platforms, spanning direct panels, hybrid networks, and marketplace aggregators.
AI agent infiltration was absent across the platforms tested, and automated responses were more consistent with traditional scripted bots than with LLM-based agents.
New pre-print: "AI Agent Prevalence and Data Quality Across Multiple Online Sample Providers," led by Prolific's research sciences team and researchers from @msftresearch.bsky.social, Oklahoma State University, @lmu.de, Queen's University, and The Research Heads. đź§µ
#AcademicSky #Research
New preprint out today (osf.io/preprints/ps...). We tested whether AI agents are actually infiltrating online surveys.
Spoiler alert: they aren't
Thread đź§µ
[1/9]
Starting today, if an AI agent is detected in your Prolific study, we’ll give you twice the cost of that non-human participant back.
You pay for human data. You expect human data. We’re backing that with our 100% Human Guarantee.
Learn more: www.prolific.com/100-human-gu... #AcademicSky #Research
As of today, if an AI agent is detected in your Prolific study, you'll get twice the cost of that participant back. We’re calling this our 100% Human Guarantee.
Years of investment in @joinprolific.bsky.social's systems have made us confident in our data integrity.
www.prolific.com/100-human-gu...
New working paper on online research data quality, led by @univie.ac.at, reveals that pass rates on quality checks vary wildly by source. Pretty interesting.
Prolific: 90% | Lab: 80% | Bilendi: 73% | Moblab: 55% | MTurk: 9% | AI agents: 0%
github.com/survey-data-... CC @jyusof.bsky.social
Tested against 5 agents, including a custom "white hat" model similar to Westwood (2025), Prolific's checks achieved 100% accuracy, outperforming all others.
Join @andrewgordon.bsky.social in tomorrow's live demo to learn more: resources.prolific.com/how-to-bot-p...
We tested which methods most accurately identify agentic AI in online research.
Our research team compared attention checks, consistency checks, cognitive traps, reCAPTCHA, and behavioral tracking against our new bot authenticity checks. đź§µ
#AcademicSky #Research
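As one toy illustration of the kinds of checks compared, a minimal consistency check that flags respondents whose answers to a repeated item disagree; the field names and tolerance are hypothetical, not Prolific's implementation.

```python
# Toy example of one class of check mentioned above: a consistency check
# that flags respondents whose answers to a repeated item disagree.
# Field names and the tolerance are hypothetical, not Prolific's implementation.
def consistency_flags(responses: list[dict], item: str, repeat: str, tol: int = 1) -> list[str]:
    """Return IDs of respondents whose two answers to the same item diverge."""
    return [
        r["id"]
        for r in responses
        if abs(r[item] - r[repeat]) > tol
    ]

responses = [
    {"id": "p1", "q7": 5, "q7_repeat": 5},
    {"id": "p2", "q7": 2, "q7_repeat": 6},  # inconsistent -> flagged
]
print(consistency_flags(responses, "q7", "q7_repeat"))  # ["p2"]
```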
The two-stage recruitment method the team propose is described as "an intentionally simple method, easy to implement across a variety of platforms."
Code and materials also on GitHub. Good read overall for those collecting data online: github.com/survey-data-... (4/4)
None of the agents passed all checks, and a novel video attention check alone filtered out 83% of them.
As AI tools get better at mimicking survey behavior, thoughtfully designed checks can still do the work. (3/4)
Pass rates on five data quality checks ranged from 90% down to single digits depending on where you recruit.
Prolific respondents had the highest pass rate of all platforms tested (90%), outperforming even the in-person lab benchmark (80%). (2/4)
Independent research by Celebi, Exley, Harrs, Kivimaki, @martaserragarcia.bsky.social & @jyusof.bsky.social (2026) compares data quality across online participants, AI agents, & human subjects in the lab, with interesting platform variation. (1/4)
đź”— www.ifo.de/en/cesifo/pu... #AcademicSky #EconSky
Our “How to bot-proof your online research” webinar series continues tomorrow, March 19th, with Dr Simon Jones and Rath Bala showing you how to verify participants on Prolific.
👉 resources.prolific.com/how-to-bot-p...
Catch us live from 4:30 PM GMT / 12:30 PM EDT / 9:30 AM PDT. #AcademicSky
Thanks @talha-ozudogru.bsky.social! Yes, we originally offered authenticity checks with Gorilla integration. Moving forward, we're looking to expand compatibility with additional platforms based on community feedback!
Interesting findings @cstrauch.bsky.social, thanks for sharing. This vulnerability is why we implement robust checks, monitoring, and in-study tooling to prevent AI misuse. We hope to continue collaborating with the community as we adapt these systems.